🍂 Autumn Leaves

Leafnotes from a developer on a journey through code, math, and music

Implementation of Tree-Based Models (2025-10-08)

Implementations

A First Glance at Statistical Performance

[Chart: Model Performance]

The chart above is drawn from the Boston Housing dataset, where the task is to predict the median value of houses from a handful of features. The data is small and noisy — which is actually perfect for testing robustness.

Each model was trained 36 times. The measure used is Total Variance Explained (TVE):

Total Variance Explained (TVE) and Sum of Squared Errors (SSE) are defined as

$SSE(\text{model}) := \sum_i (y_i - \hat{y}_i)^2$

$SSE(\text{data}) := \sum_i (y_i - \bar{y})^2$

$TVE := 1 - \frac{SSE(\text{model})}{SSE(\text{data})}$
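
As a quick illustration (a minimal sketch, not taken from the post's codebase), TVE on a held-out set can be computed directly from the definition above:

/// Total variance explained: 1 - SSE(model) / SSE(data).
/// `y` holds the true targets, `y_hat` the model's predictions.
fn total_variance_explained(y: &[f32], y_hat: &[f32]) -> f32 {
    let mean = y.iter().sum::<f32>() / y.len() as f32;
    let sse_model: f32 = y.iter().zip(y_hat).map(|(a, b)| (a - b).powi(2)).sum();
    let sse_data: f32 = y.iter().map(|a| (a - mean).powi(2)).sum();
    1.0 - sse_model / sse_data
}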

All TVE metrics below refer to unseen test data.

  • Random Forest — Average TVE ~78%
    Consistently strong across runs. The ensemble effect helps debias training data and generalize well.

  • Gradient Boosting — Average TVE ~62%
    Overfits heavily due to small sample size. Gradient boosting shines with large, diverse datasets. Here, it’s unstable — high variance between best and worst runs.

  • Decision Tree — Average TVE ~67%
    Transparent, interpretable, slightly overfits but remains solid. You can inspect node-by-node which features drive predictions.

To reproduce:

git clone https://github.com/cyancirrus/stellar-math
cargo run --example trees
open tve_chart.png

Introduction

Machine learning models often emerge from a few core structures — perceptrons, decision trees, EM, clustering, and dimensionality reduction. Here, we’ll explore how to implement three of the most classic ones:

  • Decision Trees
  • Random Forests
  • Gradient Boosting

Preliminaries — What is a Decision Tree?

Let’s start simple: a humble if / else statement.

if it_is_raining {
    items_to_carry += "umbrella";
} else {
    // no umbrella needed
}

That’s a one-node decision tree. Now, imagine scaling that intuition up — predicting wine quality:

Features might include:

  • country of origin
  • year of production
  • rainfall, temperature
  • brand
  • colour (white/red/red-blend)
  • type (merlot, pinot, etc.)
  • price
  ‱ quality ← target variable

A single hand-written rule might look like:

if country == "greece" && colour == "green" {
    predicted_quality = 1000_f32;
}

That’s the backbone idea: a hierarchy of if-else splits, each narrowing down the prediction.

Algorithmically

let mut best_split = None;
let mut best_gain = 0.0;

for _ in 0..desired_number_of_nodes {
    for dimension in &data {
        for candidate in candidate_splits(dimension) {
            // explanatory_power: how much this split reduces the node's SSE
            if explanatory_power(&candidate) > best_gain {
                best_gain = explanatory_power(&candidate);
                best_split = Some(candidate);
            }
        }
    }
    // apply best_split to the node, then repeat on the resulting partitions
}

Each node only handles its subset of data — not the entire dataset — as we recursively partition.

Decision Tree Extensions

Random Forest

A Random Forest is just a collection of decision trees — each trained on a subsample of data or features.

let mut prediction = 0.0;
for tree in forest {
   prediction += tree.predict(unseen_data);
}
prediction /= forest.len() as f32;

This ensemble effect dramatically reduces variance and overfitting.

Gradient Boosting

Instead of averaging, Gradient Boosting fits each new tree to the residual errors of the previous one. It’s like an iterative “error correction” process:

$prediction = \hat{y}_0 + \hat{r}_1 + \hat{r}_2 + \dots + \hat{r}_n$

where $\hat{y}_0$ is the first tree's prediction and each $\hat{r}_k$ is the $k$-th tree's prediction of the residual $y - (\hat{y}_0 + \hat{r}_1 + \dots + \hat{r}_{k-1})$.

Each new model learns to predict what its predecessors missed.

Elphaba’s Look at the Great Wizard of Oz — Tree Extensions & Implementations

Everyone wants to peek behind the curtain at the “complex” methods — and then realize they’re surprisingly elegant.

Gradient Boosting (Rust)

use crate::learning::decision_tree::{DecisionTree, DecisionTreeModel};

pub struct GradientBoost {
    trees: usize,
    forest: Vec<DecisionTreeModel>,
}

impl GradientBoost {
    pub fn new(
        data: &mut Vec<Vec<f32>>,
        trees: usize,
        nodes: usize,
        obs_sample: f32,
        dim_sample: f32
    ) -> Self {
        if data.is_empty() || data[0].is_empty() {
            panic!("data is empty");
        }
        let n_obs = data[0].len();
        let dims = data.len();
        let target_idx = data.len() - 1;
        let mut sample = vec![0_f32; dims];
        let mut forest = Vec::with_capacity(trees);

        for _ in 0..trees {
            // fit a tree to the current (residual) targets
            let tree = DecisionTree::new(data, obs_sample, dim_sample).train(nodes);
            // subtract its predictions from the target column so the next
            // tree is trained on whatever this one missed
            for idx in 0..n_obs {
                for d in 0..dims { sample[d] = data[d][idx]; }
                let pred = tree.predict(&sample);
                data[target_idx][idx] -= pred;
            }
            forest.push(tree);
        }
        Self { trees, forest }
    }

    pub fn predict(&self, data: &[f32]) -> f32 {
        self.forest.iter().map(|t| t.predict(data)).sum()
    }
}

Random Forest (Rust)

use crate::learning::decision_tree::{DecisionTree, DecisionTreeModel};

pub struct RandomForest {
    trees: usize,
    forest: Vec<DecisionTreeModel>,
}

impl RandomForest {
    pub fn new(
        data: &Vec<Vec<f32>>,
        trees: usize,
        nodes: usize,
        obs_sample: f32,
        dim_sample: f32
    ) -> Self {
        let forest = (0..trees)
            .map(|_| {
                let mut tree = DecisionTree::new(data, obs_sample, dim_sample);
                tree.train(nodes)
            })
            .collect();
        Self { trees, forest }
    }

    pub fn predict(&self, data: &[f32]) -> f32 {
        self.forest.iter().map(|t| t.predict(data)).sum::<f32>() / self.trees as f32
    }
}

See? Behind the magic — they’re just wrappers around the core decision tree.

Decision Tree Implementation & Complexity

For each tree, we scan all possible splits. Each dimension’s data is pre-sorted for efficiency.

Initial sort cost: $O(d \cdot n \log n)$

Split evaluation per node:

$BaseNodeSSE = \text{sum\_squares} - \frac{\text{sum\_linear}^2}{\text{cardinality}}$

$SplitNodeSSE = SSE(\text{left}) + SSE(\text{right})$

Since we can compute these incrementally as we scan, the cost per split is linear.

Overall cost:

$O(d \cdot n \log n + s \cdot n \cdot d) \approx O(d \cdot n \log n)$

where $s$ is the number of splits (nodes) grown.

That’s the piùce de résistance: the whole algorithm’s cost is roughly that of the initial sort!

fn delta(&self, running: &Self) -> f32 {
    // no gain if either side of the candidate split would be empty
    if self.card == 0 || running.card == 0 || self.card == running.card { return 0_f32 };
    let (card, l_card, r_card) = (
        self.card as f32,
        running.card as f32,
        (self.card - running.card) as f32,
    );
    let sse_curr = self.sum_squares - self.sum_linear * self.sum_linear / card;
    let sse_left = running.sum_squares - running.sum_linear * running.sum_linear / l_card;
    let sse_right = (self.sum_squares - running.sum_squares)
        - (self.sum_linear - running.sum_linear) * (self.sum_linear - running.sum_linear)
            / r_card;
    // SSE reduction from the split, normalized by the node's cardinality
    (sse_curr - sse_left - sse_right) / card
}
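
To see why each candidate split is O(1), here is a rough sketch of how the running statistics could be accumulated while walking one pre-sorted dimension (field names mirror the snippet above, but the rest is illustrative rather than the stellar-math API; `gain` stands in for the `delta` method):

#[derive(Clone, Copy, Default)]
struct Stats { card: usize, sum_linear: f32, sum_squares: f32 }

impl Stats {
    fn push(&mut self, y: f32) {
        self.card += 1;
        self.sum_linear += y;
        self.sum_squares += y * y;
    }
}

/// Scan one pre-sorted dimension; `targets` holds target values in that order.
/// Returns the index of the best split and its gain.
fn best_split(total: Stats, targets: &[f32], gain: impl Fn(&Stats, &Stats) -> f32) -> (usize, f32) {
    let mut running = Stats::default();
    let (mut best_idx, mut best_gain) = (0usize, 0.0f32);
    for (idx, &y) in targets.iter().enumerate() {
        running.push(y); // O(1) update per candidate split
        let g = gain(&total, &running);
        if g > best_gain {
            best_gain = g;
            best_idx = idx;
        }
    }
    (best_idx, best_gain)
}

One pass over the sorted column evaluates every candidate split, which is where the $s \cdot n \cdot d$ term above comes from.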

Defying Gravity

Decision Trees are elegant, interpretable, and surprisingly performant. Master the base, and the extensions — Random Forests, Gradient Boosting — come almost for free.

The most powerful methods are often the simplest — once the base is solid.

Thanks for reading!

Computational Implementation of Randomized SVD (2025-09-30)

Randomized SVD Algorithm
Golub-Kahan Implementation
Optimized QR Decomposition
Givens Implementation

Algorithm Source :: N. Halko, P. G. Martinsson, A. Tropp

Introduction

Singular Value Decomposition (SVD) is one of the cornerstones of numerical, scientific, and statistical computing. It underpins everything from PCA to large-scale linear algebra pipelines, enabling efficient updates and approximations without exploding computational costs.

In this post, we’ll explore SVD from both a computational and statistical perspective. Specifically, we’ll cover:

  • Optimizing matrix operations for performance.
  • Leveraging statistics for lower-rank approximations.

We’ll implement ideas from Halko, Martinsson, and Tropp’s seminal work Finding Structure with Randomness, and highlight how careful computational tricks make SVD practical at scale.

What is SVD?

At a high level, SVD transforms a matrix into three components:

  1. Output rotation (U)
  2. Scaling (ÎŁ)
  3. Input rotation (Vᔀ)

Since A·x = U·Σ·(Vᔀ·x), Vᔀ rotates the input, Σ scales along each latent direction, and U rotates the result into the output space. Intuitively, SVD identifies the “directions” in your data that capture the most signal.

For example, suppose we’re modeling the likelihood of patients revisiting a hospital, given demographic and behavioral variables:

- age
- gender
- region
- prescription adherence
- preventive care
- previous-year activity
- BMI
- 
 plus 20+ other variables

The raw input space is large and complex. But often, a few combinations of variables dominate the main signal. Perhaps “age × preventive care × prescription adherence” explains most of the variance. SVD compresses this multivariate data into latent features: the first singular vector captures the strongest pattern, the next captures the next strongest, and so on.

By truncating to the top K components, we remove noise while retaining most of the meaningful signal. This is the foundation of PCA, low-rank approximations, and efficient numerical computation in ML pipelines.
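
In symbols, truncation keeps only the top $K$ singular triplets:

$A \approx A_K = \sum_{i=1}^{K} \sigma_i\, u_i v_i^\top = U_K \Sigma_K V_K^\top$

where $\sigma_1 \ge \sigma_2 \ge \dots$ are the singular values and $u_i$, $v_i$ are the corresponding left and right singular vectors.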

Implementing SVD

Deterministic SVD

The classic algorithm involves two steps:

  1. Bidiagonalization

    • Golub-Kahan procedure using Householder reflections.
    • Columns below the diagonal and rows beyond the first superdiagonal are zeroed.
  2. Bulge chasing

    • Givens rotations and extensions diagonalize the bidiagonal matrix.
    • This produces the singular values and vectors.

Note: Direct diagonalization without bidiagonalization generally only converges for symmetric positive-definite matrices.

Randomized SVD

Randomized SVD accelerates computations for large matrices by approximating the subspace spanned by the top singular vectors. The procedure is:

1. Generate a random Gaussian matrix Ω with k columns (one row per column of A)
2. Form Y = (A Aᔀ) A Ω, where the extra (A Aᔀ) pass is a power iteration that sharpens the dominant singular directions
3. Orthonormalize Y via QR decomposition → Q
4. Project A into the smaller subspace: B = Qᔀ * A
5. Compute deterministic SVD on B
6. Recover approximate U, ÎŁ, V from the small SVD

This reduces computational cost while capturing the dominant signal, which is often all you need in practical applications.
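
As a rough sketch of steps 1-4, using plain Vec<Vec<f64>> matrices (function names are illustrative, not the stellar-math API; Ω is supplied by the caller, and the (A Aᔀ) power-iteration pass is omitted for brevity):

fn matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (m, inner, n) = (a.len(), b.len(), b[0].len());
    let mut c = vec![vec![0.0; n]; m];
    for i in 0..m {
        for p in 0..inner {
            let aip = a[i][p];
            for j in 0..n {
                c[i][j] += aip * b[p][j];
            }
        }
    }
    c
}

fn transpose(a: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (m, n) = (a.len(), a[0].len());
    let mut t = vec![vec![0.0; m]; n];
    for i in 0..m {
        for j in 0..n {
            t[j][i] = a[i][j];
        }
    }
    t
}

/// Orthonormalize the columns of `y` (modified Gram-Schmidt), i.e. the Q of Y = QR.
fn orthonormalize_columns(y: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut cols = transpose(y); // row i of `cols` is column i of `y`
    for i in 0..cols.len() {
        let mut current = cols[i].clone();
        for j in 0..i {
            let dot: f64 = current.iter().zip(&cols[j]).map(|(a, b)| a * b).sum();
            for (c, q) in current.iter_mut().zip(&cols[j]) {
                *c -= dot * q; // remove the component along the j-th column
            }
        }
        let norm = current.iter().map(|v| v * v).sum::<f64>().sqrt();
        if norm > 0.0 {
            for c in current.iter_mut() {
                *c /= norm;
            }
        }
        cols[i] = current;
    }
    transpose(&cols)
}

/// Randomized range finder: returns (Q, B) with A ≈ Q B, where B = Qᔀ A.
fn randomized_range(a: &[Vec<f64>], omega: &[Vec<f64>]) -> (Vec<Vec<f64>>, Vec<Vec<f64>>) {
    let y = matmul(a, omega);           // Y = A Ω
    let q = orthonormalize_columns(&y); // Y = Q R  →  Q
    let b = matmul(&transpose(&q), a);  // B = Qᔀ A
    (q, b)
}

A deterministic SVD of the small B (step 5) then gives ÎŁ and V directly, and the approximate U is recovered as Q times the small left factor (step 6).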

QR Decomposition: The Core Building Block

QR decomposition is central to both randomized SVD and Golub-Kahan. Given a matrix A, we find:

A = Q * R

Where Q is orthogonal and R is upper-triangular. Key ideas:

  • Householder reflections rotate vectors so that elements below the diagonal become zero.
  ‱ Reflections are rank-one updates of the identity: Q[i] = I - ÎČ * u * uᔀ (symmetric and orthogonal).
  • By chaining reflections, we compute Q = Q[1] * Q[2] * ... * Q[n].

Computational Optimization

Naively, applying Q (a chain of n reflections, each multiplied out as a dense matrix) to another matrix costs O(n^4). But by exploiting the structure of Householder reflections, we can reduce this to O(n^3):

// Householder-based QR update (pseudocode): fold the k-th reflection into the
// accumulated Q without ever forming the reflection matrix.
// Q_k = Q_{k-1} - ÎČ (Q_{k-1} v_k) v_kᔀ
let w = beta * (Q_km1 * v_k);   // matrix-vector product: O(nÂČ)
Q_km1 -= w * v_k';              // rank-one update: O(nÂČ)

This optimization is crucial in practice, especially for large-scale pipelines and in Golub-Kahan bidiagonalization.
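
In equation form, the trick is that a reflection never needs to be materialized:

$(I - \beta v v^\top)\,x = x - \beta\, v\,(v^\top x)$

so each application costs O(n) per vector and O(n^2) per matrix, versus O(n^2) and O(n^3) for an explicit dense multiply.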

QR → Golub-Kahan

Golub-Kahan extends QR ideas to bidiagonalization:

  • Columns below the diagonal are zeroed (like QR).
  • Rows beyond the superdiagonal are zeroed.
  • Iterating column and row zeroing produces the bidiagonal matrix.
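
For intuition, the target shape (shown here for a small 4 × 4 case) is upper bidiagonal:

$B = \begin{pmatrix} d_1 & e_1 & 0 & 0 \\ 0 & d_2 & e_2 & 0 \\ 0 & 0 & d_3 & e_3 \\ 0 & 0 & 0 & d_4 \end{pmatrix}$

Left reflections clear each column below $d_i$; right reflections clear each row past $e_i$.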

The same optimizations used in QR reduce the cost of these operations, making the algorithm tractable for large matrices.

Statistical Optimization: The Randomized Payoff

Suppose we have two 10,000 × 10,000 matrices. Naively, computing A * B requires O(n^3) ≈ 10^12 FLOPS—a trillion operations!

With randomized SVD, if we approximate with rank k = 1,000:

Flops ≈ O(n^2 * k) = 10^8 * 10^3 = 10^11

We do only ~10% of the work, while preserving the dominant signal. This dramatically accelerates pipelines in ML and scientific computing, compounding over repeated matrix operations.

Wrapping Up

SVD is a powerful tool bridging linear algebra, statistics, and computational optimization. Through careful QR and Golub-Kahan implementations, we can handle large matrices efficiently. Randomized SVD then provides a statistical shortcut, allowing us to approximate dominant structures with far less computation.

The combination of algebraic tricks, computational insights, and statistical reasoning makes this one of the most elegant examples of applied numerical linear algebra.

Explore the codebase to see these ideas in action—and watch the FLOPS disappear.

Thanks for reading!

K-Nearest Neighbors from Scratch (2025-09-18)

KNN Machine Learning Study

Introduction

Machine learning and AI are everywhere, so let’s dive into one of the foundational topics—not deep learning this time!

I wanted to explore one of the most elegant and straightforward algorithms: K-Nearest Neighbors (KNN).

KNN is versatile: it can be used for both regression and classification. It even has connections to electronic engineering through Voronoi diagrams and is ubiquitous in machine learning.

What is KNN

KNN has one central premise: data points close to each other in input space will produce similar outputs.

  • For regression, we can average the outputs of the neighbors (either simple or weighted).
  • For classification, we can either compute probabilities (via softmax) or return the majority vote for a hard classification.

This method is extremely clear in its assumptions and behavior. The only requirement is that the data is somewhat continuous and smooth. It’s a great way to start implementing ML from scratch, without worrying about layers or activation functions, and it also allows for interesting engineering optimizations.

KNN Implementation: First Thoughts

Alright, let’s implement KNN. But how do we do it?

Knowns:

  • We need to find the top k closest neighbors for a point.
  • We need a distance function.

For the distance function, the natural choice is Euclidean distance (straight-line distance):

fn distance(x: &[f64], y: &[f64]) -> f64 {
    let mut dist = 0.0;
    for i in 0..x.len() {
        dist += (x[i] - y[i]).powi(2);
    }
    dist.sqrt()
}

But let’s think about the cost:

  • For every point, computing distances to all other points gives us n * d.
  • Sorting distances gives n log n.

So naive complexity is:

O(n * d + n log n) -> O(n log n)  (dominant term)

This is doable for small datasets but not optimal, especially for online or large-scale settings.

The Interesting Part of KNN

To find nearest neighbors efficiently, we can partition the space into “neighborhoods.”

  • One approach: k-d trees.
  • Another approach: Locality Sensitive Hashing (LSH).

What is LSH?

LSH is a hashing technique that maps similar inputs to similar outputs. Here’s the intuition:

  1. Choose a subset of dimensions.
  2. Add a small random perturbation to avoid boundary misalignment.
  3. Divide by a bucket width to assign items to discrete buckets.

Mathematically:

a: Standard Normal Vector ~ N(0, I)
b: Standard Uniform Scalar ~ U(0,1)
w: bucket width constant

hash := floor((a·x - b) / w)

To improve robustness, we use multiple hash functions:

H := (hash[0], hash[1], ..., hash[h])
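
A minimal sketch of that hash in code (the names are mine, not the post's implementation; each hash function owns its own random direction a and offset b):

/// One LSH hash: project onto a random direction, shift, and quantize.
fn lsh_hash(x: &[f64], a: &[f64], b: f64, w: f64) -> i64 {
    let projection: f64 = x.iter().zip(a).map(|(xi, ai)| xi * ai).sum();
    ((projection - b) / w).floor() as i64
}

/// The combined signature H: one bucket id per (a, b) pair.
fn lsh_signature(x: &[f64], hashes: &[(Vec<f64>, f64)], w: f64) -> Vec<i64> {
    hashes.iter().map(|(a, b)| lsh_hash(x, a, *b, w)).collect()
}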

Performance Analysis

Inference

Inference steps:

  1. Compute the hash for each function in H for input x.
  2. Retrieve all points from the corresponding buckets.
  3. Remove duplicates, sort by Euclidean distance, and select top k.

Complexity:

let d = dimensions
let z = number of retrieved neighbors
let h = number of hash functions

O(d*h + h + z*log(z) + z*d) -> O(z log z + C)

Compared to naive:

O(n log n)

This approach drastically reduces inference cost for large datasets.
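
Putting the inference steps together, a query might look roughly like this (a sketch with illustrative types, not the post's code: one bucket table per hash function, each mapping a bucket id to indices of stored points):

use std::collections::{HashMap, HashSet};

/// LSH-backed k-NN lookup: `tables[i]` maps the i-th hash function's bucket id
/// to indices into `points`; `hashes[i]` holds that function's direction a and offset b.
fn knn_query(
    x: &[f64],
    k: usize,
    points: &[Vec<f64>],
    tables: &[HashMap<i64, Vec<usize>>],
    hashes: &[(Vec<f64>, f64)],
    w: f64,
) -> Vec<usize> {
    // 1. hash the query with every function and gather candidate indices
    let mut candidates: HashSet<usize> = HashSet::new();
    for (table, (a, b)) in tables.iter().zip(hashes) {
        let projection: f64 = x.iter().zip(a).map(|(xi, ai)| xi * ai).sum();
        let bucket = ((projection - b) / w).floor() as i64;
        if let Some(members) = table.get(&bucket) {
            candidates.extend(members.iter().copied());
        }
    }
    // 2. rank the candidates by true (squared) Euclidean distance, keep the top k
    let mut scored: Vec<(f64, usize)> = candidates
        .into_iter()
        .map(|i| {
            let d2: f64 = points[i].iter().zip(x).map(|(p, q)| (p - q) * (p - q)).sum();
            (d2, i)
        })
        .collect();
    scored.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    scored.into_iter().take(k).map(|(_, i)| i).collect()
}

Only the candidates pulled from the matching buckets are ever ranked, which is why z, not n, drives the cost.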

Insertion (Parsing)

For each new vector:

  1. Compute each hash.
  2. Insert into the hashtable.

Complexity: O(n * d * h)
Memory: O(n * h)

Each point is stored in a bucket for each hash function.

Probabilistic Analysis

How do we know LSH will find the correct neighbors?

Let:

  • n = total elements
  • b = elements per bucket
  • p = probability a hash bucket contains the nearest neighbor
  • q = 1 - p

With multiple hashes:

pr(1 hash) = p
pr(2 hashes) = 1 - q^2
pr(h hashes) = 1 - q^h

As h → ∞, the probability of finding a true neighbor approaches 1. For example, with p = 0.3, ten hash functions already give 1 - 0.7^10 ≈ 0.97.

For LSH with vectors:

hash(x) = floor((a·x - b)/w)
x* = x + Δ

a·(x - x*) / w ~ N(0, (||Δ|| / w)^2)

So the gap between the two hash inputs scales with ||Δ|| / w. For two points x and y at distance d = ||x - y|| (here d is the distance, not the dimension count), the probability that their projections land within one bucket width of each other is:

Pr(|a·(x-y)/w| < 1) = 2 * NormalCDF(w/d) - 1

This gives us a statistical bound for neighbor retrieval.

Wrapping Up

We’ve explored:

  • Basic KNN
  • Efficient inference with LSH
  • Complexity and memory considerations
  • Probabilistic guarantees for correctness

There are further optimizations in Rust, like using Arc to avoid cloning data, but those are beyond this post’s scope.

I hope this post shows how a simple algorithm can become fascinating once you dive into performance and probabilistic analysis. Implementing KNN from scratch with these techniques is both instructive and practical.

Happy coding!

First Taste of Learnings (2025-06-12)

Rust Fully Asynchronous BlueSky-like study

First Async Application in Rust (and first fully async backend ever!)

Recently, I’ve been trying to wrap my head around using async within Rust.
I had some prior experience using async in Python, mainly to make non-blocking calls for embeddings of identified terms when they were independent.
Although I had some background, I was still greatly intimidated - many developers, who appeared far more talented, spoke about how difficult it was to understand Rust’s async model.

Don’t Be Scared - Jump In

Surprisingly, transitioning my message-board-like or BlueSky-like app from a LeetCode solution into a fully async backend wasn’t terribly difficult.
First, I worked through several async problems focused on using notifications and streaming. Then, without much further practice, I jumped in.

The API structure - being mostly async - was really about learning the tools:

  • Axum for the API tree and server
  • Serde to handle JSON and the barrier between client and server
  • Tokio for the async runtime and its async-aware RwLock
  • std::sync::Arc for atomic reference counting so I could clone handles without cloning the actual data

But my favorite package - the standout for me - was DashMap, an amazing tool for managing interior mutability.

What the Data Looked Like on the Backend

One of the main APIs I needed to port was fn follow(...), which takes in a followee_id and a follower_id.
I wanted users to safely write to their own portion of the data - i.e., a user (the follower) clicks follow on another user (the followee), and we remember this information.

This was tricky to model. The user should only be able to modify their own data. It should also be fully async.

Originally, the data appeared in synchronous code as:

HashMap<UserId, HashSet<UserId>>

A mapping from user ID to the set of users they follow.
I used a set to enable quick unfollow (O(1)) and to ensure no duplicate follows.

First Iteration

To model the problem, I initially reached for a Mutex.
Mutexes were useful when solving async LeetCode problems, allowing mutation within async code - so I started there.

Mutex<HashSet<UserId>>

This worked, but it was blocking.
It was synchronous code masquerading as async.
Time to explore other structures beyond what I’d seen in my limited Rust async exposure.

Enter RwLock (Read-Write Lock)

My API could naturally be partitioned into:

  • Read actions: NewsFeed
  • Write actions: Follow, Publish, Unfollow

This seemed like a natural fit for RwLock, so I implemented:

RwLock<HashMap<UserId, RwLock<HashSet<UserId>>>>

There were only a few code changes: .lock().await became .read().await or .write().await. Overall, the changes were minimal.

RwLock was a major improvement over Mutex - while Mutex allows only a single user or thread to access the data at a time, RwLock allowed multiple readers in parallel.



but the problem remained: most actions cause side effects (writes), and a single write on the outermost RwLock blocked the entire backend - even reads!
Multiple users writing to different locations should be able to write independently.
This separation had to be modelable. How could I drive that separation?

DashMap: The Sleeper Wizard

The problem seemed so simple: just enable read-write locking on the interior data.
Enter DashMap - like Gandalf cresting the hill at Helm’s Deep!

DashMap allows users to mutate their private data without needing explicit mutability, and it’s a near drop-in replacement for HashMap.

For example:

fn follow(&mut self, followee_id: UserId, follower_id: UserId)

became:

fn follow(&self, followee_id: UserId, follower_id: UserId)

This helped clean up parts of Axum’s server model and the guarantees needed to build the API tree.
DashMap enabled private mutation - as long as you handled the interior structure correctly, e.g.:

DashMap<UserId, RwLock<HashSet<UserId>>>

This was exactly the model I was searching for.
I cannot recommend the library enough if you’re facing a similar modeling problem where something feels like it should be possible.
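
To make that concrete, here is a minimal sketch of a follow handler on top of DashMap (simplified to a plain HashSet value rather than the RwLock-wrapped variant above; UserId as an integer alias is my assumption):

use dashmap::DashMap;
use std::collections::HashSet;

type UserId = u64; // assumption: integer ids

struct Backend {
    // follower -> the set of users they follow
    follows: DashMap<UserId, HashSet<UserId>>,
}

impl Backend {
    // &self is enough: DashMap shards its locks internally,
    // so each user can mutate their own entry independently.
    fn follow(&self, followee_id: UserId, follower_id: UserId) {
        self.follows
            .entry(follower_id)
            .or_insert_with(HashSet::new)
            .insert(followee_id);
    }

    fn unfollow(&self, followee_id: UserId, follower_id: UserId) {
        if let Some(mut followees) = self.follows.get_mut(&follower_id) {
            followees.remove(&followee_id);
        }
    }
}

The RwLock-in-the-value design above additionally lets many readers share one user's set; this sketch leans on DashMap's internal sharding alone.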

Takeaway

Not only is Rust async - and its tooling - becoming ever more mature and viable in production, but the takeaway is simple:

Don’t be scared to jump in.
You’ve already solved problems that felt impossible at the time. This is just another challenge.

When I saw my project handle 1,000 posts from 1,000 users and retrieve sorted newsfeeds for 10 users in 12.620865 ms on my 2018 machine, I was thrilled.

Programming isn’t just writing code for things you already know.
Programming is solving new problems, exploring the unknown, and discovering better solutions.

Thanks so much for reading - see you in the next post!

First Taste of Learnings (2025-06-11)

Welcome to my blog! Inaugural post!

My name is Autumn; I’m a mixture of Data Scientist, Software Engineer, and Machine Learning Engineer. I’m passionate about mathematics, statistics, performant computing, and low-level code.

I’ve been exploring multiple projects focused on numerical computing and different strategies — recently concentrating on matrix multiplication techniques, neural networks, and implementing various algorithms at a low level.

Additionally, I’ve been brushing up on data structures and algorithms to design more performant systems. I have experience with databases, API development, predictive engines, model pipelines, and backend systems for applications.

Here, I hope to track my progress and provide a reference for others making the same journey. Thanks so much for taking a look at my blog!

Current Studies