Implementations
The chart above is drawn from the Boston Housing dataset, where the task is to predict the median value of houses from a handful of features. The data is small and noisy, which is actually perfect for testing robustness.
Each model was trained 36 times. The measure used is Total Variance Explained (TVE):
Total Variance Explained (TVE) and the Sum of Squared Errors (SSE) are defined as:
$SSE(Model) := \sum_i (y_i - \hat{y}_i)^2$
$SSE(Data) := \sum_i (y_i - \bar{y})^2$
$TotalVarianceExplained := 1 - \frac{SSE(Model)}{SSE(Data)}$
All TVE metrics below refer to unseen test data.
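For reference, here's a minimal sketch of the metric in Rust; the function and argument names are mine, not the stellar-math repo's:

/// Total Variance Explained: 1 - SSE(model) / SSE(data).
/// `y` holds observed targets, `y_hat` the model's predictions on the same points.
fn total_variance_explained(y: &[f32], y_hat: &[f32]) -> f32 {
    let n = y.len() as f32;
    let mean = y.iter().sum::<f32>() / n;
    let sse_model: f32 = y.iter().zip(y_hat).map(|(yi, pi)| (yi - pi).powi(2)).sum();
    let sse_data: f32 = y.iter().map(|yi| (yi - mean).powi(2)).sum();
    1.0 - sse_model / sse_data
}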
Random Forest - Average TVE ~78%
Consistently strong across runs. The ensemble effect helps debias training data and generalize well.
Gradient Boosting - Average TVE ~62%
Overfits heavily due to the small sample size. Gradient boosting shines with large, diverse datasets. Here, it's unstable, with high variance between the best and worst runs.
Decision Tree - Average TVE ~67%
Transparent, interpretable, slightly overfits but remains solid. You can inspect node-by-node which features drive predictions.
To reproduce:
git clone https://github.com/cyancirrus/stellar-math
cargo run --example trees
open tve_chart.png
Machine learning models often emerge from a few core structures - perceptrons, decision trees, EM, clustering, and dimensionality reduction. Here, we'll explore how to implement three of the most classic ones:
Let's start simple: a humble if/else statement.
if it_is_raining {
    items_to_carry += "umbrella";
} else {
    // no umbrella needed
}
That's a one-node decision tree. Now, imagine scaling that intuition up to predicting wine quality:
Features might include country, colour, and so on; a single split might look like:
if country == "greece" && colour == "green" {
    predicted_quality = 1000_f32;
}
That's the backbone idea: a hierarchy of if-else splits, each narrowing down the prediction.
Algorithmically
// Greedy split search (pseudocode)
let mut best_split = None;
let mut best_power = 0.0;
for _ in 0..desired_number_of_nodes {
    for dimension in data {
        for candidate_split in dimension {
            if explanatory_power(candidate_split) > best_power {
                best_split = Some(candidate_split);
                best_power = explanatory_power(candidate_split);
            }
        }
    }
}
Each node only handles its subset of data, not the entire dataset, as we recursively partition.
A Random Forest is just a collection of decision trees, each trained on a subsample of data or features.
let mut prediction = 0.0;
for tree in forest {
    prediction += tree.predict(unseen_data);
}
prediction /= forest.len() as f32;
This ensemble effect dramatically reduces variance and overfitting.
Instead of averaging, Gradient Boosting fits each new tree to the residual errors of the previous one. It's like an iterative "error correction" process:
$prediction = y_0 + (\hat{y}_0 - y_1) + (\hat{y}_1 - y_2) + \dots + (\hat{y}_{n-1} - y_n)$
Each new model learns to predict what its predecessors missed.
Everyone wants to peek behind the curtain at the "complex" methods - and then realize they're surprisingly elegant.
use crate::learning::decision_tree::{DecisionTree, DecisionTreeModel};

pub struct GradientBoost {
    trees: usize,
    forest: Vec<DecisionTreeModel>,
}

impl GradientBoost {
    pub fn new(
        data: &mut Vec<Vec<f32>>,
        trees: usize,
        nodes: usize,
        obs_sample: f32,
        dim_sample: f32,
    ) -> Self {
        if data.is_empty() || data[0].is_empty() {
            panic!("data is empty");
        }
        // Data is column-major: one Vec per dimension, with the target in the last column.
        let n_obs = data[0].len();
        let dims = data.len();
        let target_idx = data.len() - 1;
        let mut sample = vec![0_f32; dims];
        let mut forest = Vec::with_capacity(trees);
        for _ in 0..trees {
            // Fit a tree to the current residuals (the target column).
            let tree = DecisionTree::new(data, obs_sample, dim_sample).train(nodes);
            // Subtract this tree's predictions so the next tree fits what's left over.
            for idx in 0..n_obs {
                for d in 0..dims {
                    sample[d] = data[d][idx];
                }
                let pred = tree.predict(&sample);
                data[target_idx][idx] -= pred;
            }
            forest.push(tree);
        }
        Self { trees, forest }
    }

    pub fn predict(&self, data: &[f32]) -> f32 {
        // The boosted prediction is the sum of all the residual fits.
        self.forest.iter().map(|t| t.predict(data)).sum()
    }
}
use crate::learning::decision_tree::{DecisionTree, DecisionTreeModel};

pub struct RandomForest {
    trees: usize,
    forest: Vec<DecisionTreeModel>,
}

impl RandomForest {
    pub fn new(
        data: &Vec<Vec<f32>>,
        trees: usize,
        nodes: usize,
        obs_sample: f32,
        dim_sample: f32,
    ) -> Self {
        // Each tree sees its own random subsample of observations and dimensions.
        let forest = (0..trees)
            .map(|_| {
                let mut tree = DecisionTree::new(data, obs_sample, dim_sample);
                tree.train(nodes)
            })
            .collect();
        Self { trees, forest }
    }

    pub fn predict(&self, data: &[f32]) -> f32 {
        // Average the individual tree predictions.
        self.forest.iter().map(|t| t.predict(data)).sum::<f32>() / self.trees as f32
    }
}
See? Behind the magic, they're just wrappers around the core decision tree.
For each tree, we scan all possible splits. Each dimension's data is pre-sorted for efficiency.
Initial sort cost: $O(d \cdot n \log n)$
Split evaluation per node:
$BaseNodeSSE = \text{sum\_squares} - \frac{\text{sum\_linear}^2}{\text{cardinality}}$
$SplitNodeSSE = SSE(\text{left}) + SSE(\text{right})$
Since we can compute these incrementally as we scan, the cost per split is linear.
Overall cost:
$O(d \cdot n \log n + s \cdot n \cdot d)$
$\approx O(d \cdot n \log n)$
That's the pièce de résistance: the whole algorithm's cost is roughly that of the initial sort!
fn delta(&self, running: &Self) -> f32 {
    // Degenerate splits (an empty side) explain nothing.
    if self.card == 0 || running.card == 0 || self.card == running.card {
        return 0_f32;
    }
    let (card, l_card, r_card) = (
        self.card as f32,
        running.card as f32,
        (self.card - running.card) as f32,
    );
    // SSE = sum of squares - (sum)^2 / n, computed from running aggregates.
    let sse_curr = self.sum_squares - self.sum_linear * self.sum_linear / card;
    let sse_left = running.sum_squares - running.sum_linear * running.sum_linear / l_card;
    let sse_right = (self.sum_squares - running.sum_squares)
        - (self.sum_linear - running.sum_linear) * (self.sum_linear - running.sum_linear)
            / r_card;
    // Reduction in SSE from splitting, weighted by node size.
    (sse_curr - sse_left - sse_right) / card
}
Decision Trees are elegant, interpretable, and surprisingly performant. Master the base, and the extensions (Random Forests, Gradient Boosting) come almost for free.
The most powerful methods are often the simplest, once the base is solid.
Thanks for reading!
Randomized SVD Algorithm
Golub-Kahan Implementation
Optimized QR Decomposition
Givens Implementation
Algorithm Source :: N. Halko, P. G. Martinsson, J. A. Tropp
Singular Value Decomposition (SVD) is one of the cornerstones of numerical, scientific, and statistical computing. It underpins everything from PCA to large-scale linear algebra pipelines, enabling efficient updates and approximations without exploding computational costs.
In this post, we'll explore SVD from both a computational and statistical perspective. Specifically, we'll cover randomized SVD, optimized QR decomposition, and Golub-Kahan bidiagonalization.
We'll implement ideas from Halko, Martinsson, and Tropp's seminal work Finding Structure with Randomness, and highlight how careful computational tricks make SVD practical at scale.
At a high level, SVD factors a matrix into three components: $A = U \Sigma V^T$, where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal with non-negative singular values.
Intuitively, SVD identifies the "directions" in your data that capture the most signal.
For example, suppose we're modeling the likelihood of patients revisiting a hospital, given demographic and behavioral variables:
- age
- gender
- region
- prescription adherence
- preventive care
- previous-year activity
- BMI
- …plus 20+ other variables
The raw input space is large and complex. But often, a few combinations of variables dominate the main signal. Perhaps "age × preventive care × prescription adherence" explains most of the variance. SVD compresses this multivariate data into latent features: the first singular vector captures the strongest pattern, the next captures the next strongest, and so on.
By truncating to the top K components, we remove noise while retaining most of the meaningful signal. This is the foundation of PCA, low-rank approximations, and efficient numerical computation in ML pipelines.
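In symbols, truncating to rank $k$ keeps only the leading singular triplets:
$A \approx A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$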
The classic algorithm involves two steps:
Bidiagonalization
Bulge chasing
Note: Direct diagonalization without bidiagonalization generally only converges for symmetric positive-definite matrices.
Randomized SVD accelerates computations for large matrices by approximating the subspace spanned by the top singular vectors. The procedure is:
1. Generate a random matrix Ω of size m × k
2. Form Y = (A A') A * Ω
3. Orthonormalize Y via QR decomposition → Q
4. Project A into the smaller subspace: B = Qᵀ * A
5. Compute deterministic SVD on B
6. Recover approximate U, Σ, V from the small SVD
This reduces computational cost while capturing the dominant signal, which is often all you need in practical applications.
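To make the recipe concrete, here's a self-contained sketch outside the stellar-math repo, using the nalgebra and rand_distr crates (rand 0.8-style API) as assumed dependencies; the function name randomized_svd is mine, and the single power pass mirrors step 2 above:

use nalgebra::{DMatrix, DVector};
use rand::thread_rng;
use rand_distr::{Distribution, StandardNormal};

/// Rank-k randomized SVD sketch: returns approximate (U, singular values, Vᵀ).
fn randomized_svd(a: &DMatrix<f64>, k: usize) -> (DMatrix<f64>, DVector<f64>, DMatrix<f64>) {
    let mut rng = thread_rng();
    // 1. Random Gaussian test matrix Ω (a.ncols() × k).
    let omega = DMatrix::<f64>::from_fn(a.ncols(), k, |_, _| StandardNormal.sample(&mut rng));
    // 2. Sample the range of A, with one power pass to sharpen the spectrum: Y = (A Aᵀ) A Ω.
    let y = a * a.transpose() * a * &omega;
    // 3. Orthonormalize Y via QR → Q.
    let q = y.qr().q();
    // 4. Project into the small subspace: B = Qᵀ A.
    let b = q.transpose() * a;
    // 5. Deterministic SVD on the small matrix B.
    let svd = b.svd(true, true);
    // 6. Lift the left singular vectors back: U ≈ Q · U_B.
    let u = q * svd.u.expect("U was requested");
    (u, svd.singular_values, svd.v_t.expect("Vᵀ was requested"))
}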
QR decomposition is central to both randomized SVD and Golub-Kahan. Given a matrix A, we find:
A = Q * R
where Q is orthogonal and R is upper-triangular. Key ideas:
Q[i] = I - β * u * uᵀ
Q = Q[1] * Q[2] * ... * Q[n]
Naively, applying Q to another matrix costs O(n^4) for dense matrices. But by exploiting the structure of Householder reflections, we can reduce this to O(n^3):
// Householder-based QR update: A ← A - β (Q_{k-1} v_k) v_kᵀ
let w = beta * Q_km1 * v_k;  // apply the accumulated Q_{k-1} to the reflector vector
A -= w * v_k';               // rank-one update instead of a full matrix-matrix product
This optimization is crucial in practice, especially for large-scale pipelines and in Golub-Kahan bidiagonalization.
Golub-Kahan extends QR ideas to bidiagonalization:
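For context, the factorization being computed is the standard bidiagonal form (stated here for reference, not code from the repo):
$A = U B V^T$, with $U$ and $V$ orthogonal and $B$ upper bidiagonal.
Alternating left and right Householder reflections zero out a column of A, then a row, until only the bidiagonal band remains; bulge chasing then iterates on B to expose the singular values.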
The same optimizations used in QR reduce the cost of these operations, making the algorithm tractable for large matrices.
Suppose we have two 10,000 × 10,000 matrices. Naively, computing A * B requires O(n^3) ≈ 10^12 FLOPS - a trillion operations!
With randomized SVD, if we approximate with rank k = 1,000:
Flops ≈ O(n^2 * k) = O(10^8 * 10^3) = 10^11
We do only ~10% of the work, while preserving the dominant signal. This dramatically accelerates pipelines in ML and scientific computing, compounding over repeated matrix operations.
SVD is a powerful tool bridging linear algebra, statistics, and computational optimization. Through careful QR and Golub-Kahan implementations, we can handle large matrices efficiently. Randomized SVD then provides a statistical shortcut, allowing us to approximate dominant structures with far less computation.
The combination of algebraic tricks, computational insights, and statistical reasoning makes this one of the most elegant examples of applied numerical linear algebra.
Explore the codebase to see these ideas in action, and watch the FLOPS disappear.
Thanks for reading!
Machine learning and AI are everywhere, so let's dive into one of the foundational topics - not deep learning this time!
I wanted to explore one of the most elegant and straightforward algorithms: K-Nearest Neighbors (KNN).
KNN is versatile: it can be used for both regression and classification. It even has connections to electronic engineering through Voronoi diagrams and is ubiquitous in machine learning.
KNN has one central premise: data points close to each other in input space will produce similar outputs.
This method is extremely clear in its assumptions and behavior. The only requirement is that the data is somewhat continuous and smooth. It's a great way to start implementing ML from scratch, without worrying about layers or activation functions, and it also allows for interesting engineering optimizations.
Alright, let's implement KNN. But how do we do it?
Knowns: we need a distance function, and a way to find the k closest points under it.
For the distance function, the natural choice is Euclidean distance (triangle distance):
fn distance(x: &[f64], y: &[f64]) -> f64 {
    let mut dist = 0.0;
    for i in 0..x.len() {
        dist += (x[i] - y[i]).powi(2);
    }
    dist.sqrt()
}
But let's think about the cost:
- Computing the distance from the query to all n points costs n * d.
- Sorting those distances costs n log n.
So naive complexity is:
O(n * d + n log n) -> O(n log n) (dominant term)
This is doable for small datasets but not optimal, especially for online or large-scale settings.
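Here's a minimal brute-force sketch of that query path, building on the distance function above; knn_predict and its arguments are illustrative names, not the repo's API:

/// Naive k-NN regression: scan all points, rank by distance, average the k closest targets.
fn knn_predict(train_x: &[Vec<f64>], train_y: &[f64], query: &[f64], k: usize) -> f64 {
    // O(n * d): distance from the query to every training point.
    let mut scored: Vec<(f64, f64)> = train_x
        .iter()
        .zip(train_y)
        .map(|(x, &y)| (distance(x, query), y))
        .collect();
    // O(n log n): rank candidates by distance.
    scored.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    // Average the targets of the k nearest neighbors.
    let k = k.min(scored.len());
    scored[..k].iter().map(|(_, y)| y).sum::<f64>() / k as f64
}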
To find nearest neighbors efficiently, we can partition the space into "neighborhoods."
LSH is a hashing technique that maps similar inputs to similar outputs. Here's the intuition:
Mathematically:
a: Standard Normal Vector ~ N(0, I)
b: Standard Uniform Scalar ~ U(0,1)
w: bucket width constant
hash := floor((a·x - b) / w)
To improve robustness, we use multiple hash functions:
H := (hash[0], hash[1], ..., hash[h])
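A minimal sketch of this hashing scheme - LshHash and signature are my names, not the repo's, and rand / rand_distr (0.8-style API) are assumed dependencies:

use rand::{thread_rng, Rng};
use rand_distr::{Distribution, StandardNormal};

/// One random-projection hash: h(x) = floor((a·x - b) / w).
struct LshHash {
    a: Vec<f64>, // random normal projection vector, a ~ N(0, I)
    b: f64,      // random uniform offset, b ~ U(0, 1)
    w: f64,      // bucket width
}

impl LshHash {
    fn new(dims: usize, w: f64) -> Self {
        let mut rng = thread_rng();
        let a: Vec<f64> = (0..dims).map(|_| StandardNormal.sample(&mut rng)).collect();
        Self { a, b: rng.gen::<f64>(), w }
    }

    fn hash(&self, x: &[f64]) -> i64 {
        let dot: f64 = self.a.iter().zip(x).map(|(ai, xi)| ai * xi).sum();
        ((dot - self.b) / self.w).floor() as i64
    }
}

/// The signature H = (hash[0], ..., hash[h-1]) from h independent hash functions.
fn signature(hashes: &[LshHash], x: &[f64]) -> Vec<i64> {
    hashes.iter().map(|h| h.hash(x)).collect()
}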
Inference steps:
1. Compute the hash signature H for input x.
2. Retrieve the candidate points stored in the matching buckets.
3. Rank the candidates by exact distance.
Complexity:
let d = dimensions
let z = number of retrieved neighbors
let h = number of hash functions
O(d*h + h + z*log(z) + z*d) -> O(z log z + C)
Compared to naive:
O(n log n)
This approach drastically reduces inference cost for large datasets.
For each new vector, we compute its h hashes (O(d * h)) and insert the point into the matching bucket of each hash table. Over all n points:
Complexity: O(n * d * h)
Memory: O(n * h)
Each point is stored in a bucket for each hash function.
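Building on the LshHash sketch above, a bucketed index might look like the following (again illustrative names, not the repo's code):

use std::collections::HashMap;

/// Bucketed index: one bucket map per hash function.
struct LshIndex {
    hashes: Vec<LshHash>,
    // For each hash function, bucket id -> indices of the points stored there.
    buckets: Vec<HashMap<i64, Vec<usize>>>,
}

impl LshIndex {
    fn new(dims: usize, n_hashes: usize, w: f64) -> Self {
        Self {
            hashes: (0..n_hashes).map(|_| LshHash::new(dims, w)).collect(),
            buckets: vec![HashMap::new(); n_hashes],
        }
    }

    /// Indexing: O(d * h) per point, O(h) bucket entries of memory per point.
    fn insert(&mut self, idx: usize, x: &[f64]) {
        for (h, table) in self.hashes.iter().zip(self.buckets.iter_mut()) {
            table.entry(h.hash(x)).or_default().push(idx);
        }
    }

    /// Retrieval: the union of the query's buckets across all hash functions.
    fn candidates(&self, x: &[f64]) -> Vec<usize> {
        let mut out = Vec::new();
        for (h, table) in self.hashes.iter().zip(self.buckets.iter()) {
            if let Some(ids) = table.get(&h.hash(x)) {
                out.extend_from_slice(ids);
            }
        }
        out.sort_unstable();
        out.dedup();
        out
    }
}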
How do we know LSH will find the correct neighbors?
Let:
- n = total elements
- b = elements per bucket
- p = probability a hash bucket contains the nearest neighbor
- q = 1 - p
With multiple hashes:
pr(1 hash) = p
pr(2 hashes) = 1 - q^2
pr(h hashes) = 1 - q^h
As h → ∞, the probability of finding a neighbor approaches 1.
For LSH with vectors:
hash(x) = floor((a·x - b)/w)
x* = x + Δ
|a·(x - x*) / w| = Δ / w
The difference between perturbed vectors scales with Δ / w. The probability that two vectors fall into the same bucket:
Pr(|a·(x-y)/w| < 1) = 2 * NormalCDF(w/d) - 1
This gives us a statistical bound for neighbor retrieval.
We've explored the naive KNN algorithm, LSH-based bucketing, and the probabilistic guarantees for retrieving true neighbors.
There are further optimizations in Rust, like using Arc to avoid cloning data, but those are beyond this post's scope.
I hope this post shows how a simple algorithm can become fascinating once you dive into performance and probabilistic analysis. Implementing KNN from scratch with these techniques is both instructive and practical.
Happy coding!
Rust Fully Asynchronous BlueSky-like study
Recently, I've been trying to wrap my head around using async within Rust.
I had some prior experience using async in Python, mainly to make non-blocking calls for embeddings of identified terms when they were independent.
Although I had some background, I was still greatly intimidated - many developers, who appeared far more talented, spoke about how difficult it was to understand Rust's async model.
Surprisingly, transitioning my message-board-like or BlueSky-like app from a LeetCode solution into a fully async backend wasn't terribly difficult.
First, I worked through several async problems focused on using notifications and streaming. Then, without much further practice, I jumped in.
The API structure - being mostly async - was really about learning the tools:
But my favorite package - the standout for me - was DashMap, an amazing tool for managing interior mutability.
One of the main APIs I needed to port was fn follow(...), which takes in a followee_id and a follower_id.
I wanted users to safely write to their own portion of the data - i.e., a user (the follower) clicks follow on another user (the followee), and we remember this information.
This was tricky to model. The user should only be able to modify their own data. It should also be fully async.
Originally, the data appeared in synchronous code as:
HashMap<UserId, HashSet<UserId>>
A mapping from user ID to the set of users they follow.
I used a set to enable quick unfollow (O(1)) and to ensure no duplicate follows.
To model the problem, I initially reached for a Mutex.
Mutexes were useful when solving async LeetCode problems, allowing mutation within async code - so I started there.
Mutex<HashSet<UserId>>
This worked, but it was blocking.
It was synchronous code masquerading as async.
Time to explore other structures beyond what I'd seen in my limited Rust async exposure.
My API could naturally be partitioned into:
- reads: NewsFeed
- writes: Follow, Publish, Unfollow
This seemed like a natural fit for RwLock, so I implemented:
RwLock<HashMap<UserId, RwLock<HashSet<UserId>>>>
There were only a few code changes: .lock().await became .read().await or .write().await. Overall, the changes were minimal.
RwLock was a major improvement over Mutex - while Mutex allows only a single user or thread to access the data at a time, RwLock allowed multiple readers in parallel.
…but the problem remained: most actions cause side effects (writes), and a single write on the outermost RwLock blocked the entire backend - even reads!
Multiple users writing to different locations - they should be able to write independently.
This separation had to be modelable. How could I drive that separation?
The problem seemed so simple: just enable read-write locking on the interior data.
Enter DashMap - like Gandalf cresting the hill at Helm's Deep!
DashMap allows users to mutate their private data without needing explicit mutability, and it's a near drop-in replacement for HashMap.
For example:
fn follow(&mut self, followee_id: UserId, follower_id: UserId)
became:
fn follow(&self, followee_id: UserId, follower_id: UserId)
This helped clean up parts of Axum's server model and the guarantees needed to build the API tree.
DashMap enabled private mutation - as long as you handled the interior structure correctly, e.g.:
DashMap<UserId, RwLock<HashSet<UserId>>>
This was exactly the model I was searching for.
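As a sketch of how follow might sit on top of that structure - the type alias, struct, and field names here are illustrative rather than the project's actual code, with dashmap and tokio as assumed dependencies:

use std::collections::HashSet;
use dashmap::DashMap;
use tokio::sync::RwLock;

type UserId = u64; // illustrative; the real project may use a different id type

struct Follows {
    // Each user's follow set sits behind its own RwLock; DashMap shards the outer map.
    map: DashMap<UserId, RwLock<HashSet<UserId>>>,
}

impl Follows {
    async fn follow(&self, followee_id: UserId, follower_id: UserId) {
        // Ensure the follower has an entry; the shard guard drops at the end of this statement.
        self.map
            .entry(follower_id)
            .or_insert_with(|| RwLock::new(HashSet::new()));
        // Take only this user's write lock - other users keep reading and writing in parallel.
        if let Some(set) = self.map.get(&follower_id) {
            set.write().await.insert(followee_id);
        }
    }

    async fn unfollow(&self, followee_id: UserId, follower_id: UserId) {
        if let Some(set) = self.map.get(&follower_id) {
            set.write().await.remove(&followee_id);
        }
    }
}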
I cannot recommend the library enough if you're facing a similar modeling problem where something feels like it should be possible.
Not only is Rust async - and its tooling - becoming ever more mature and viable in production, but…
Donât be scared to jump in.
Youâve already solved problems that felt impossible at the time. This is just another challenge.
When I saw my project handle 1,000 posts from 1,000 users and retrieve sorted newsfeeds for 10 users in 12.620865 ms on my 2018 machine, I was thrilled.
Programming isnât just writing code for things you already know.
Programming is solving new problems, exploring the unknown, and discovering better solutions.
Thanks so much for reading - see you in the next post!
My name is Autumn. I'm a mixture of Data Scientist, Software Engineer, and Machine Learning Engineer. I'm passionate about mathematics, statistics, performant computing, and low-level code.
I've been exploring multiple projects focused on numerical computing and different strategies, recently concentrating on matrix multiplication techniques, neural networks, and implementing various algorithms at a low level.
Additionally, I've been brushing up on data structures and algorithms to design more performant systems. I have experience with databases, API development, predictive engines, model pipelines, and backend systems for applications.
Here, I hope to track my progress and provide a reference for others making the same journey. Thanks so much for taking a look at my blog!