al-folio

a simple whitespace theme for academics

a distill-style blog post

an example of a distill-style blog post and main elements

25 min read · 2021

a post with code

an example of a blog post with some code

4 min read · 2015

Mamba and State Space Models: The Sequence Modelling Revolution

State Space Models and Mamba's input-selective mechanism — linear-time sequence modelling that rivals Transformers on long sequences.

6 min read · April 22, 2026

2026 · ssm mamba recurrence linear sequence · foundation-models

Mixture of Experts: Scaling AI Without Breaking the Bank

How Mixture-of-Experts architectures let language models reach trillion-parameter scale while keeping per-token compute tractable.

7 min read · April 21, 2026

2026 · moe scaling llm efficiency sparse · foundation-models

Flash Attention: Making Transformers Faster Than Ever

A deep dive into Flash Attention — the IO-aware exact attention algorithm that makes training large language models dramatically faster while using far less memory.

7 min read · April 20, 2026

2026 · attention transformers efficiency hardware · foundation-models

In-Context Learning: How LLMs Learn Without Gradient Updates

The mysterious emergent ability of large language models to perform new tasks from just a handful of examples in the prompt — no gradient updates required.

7 min read · April 19, 2026

2026 · icl few-shot prompting meta-learning llm · foundation-models

Knowledge Distillation: Teaching Small Models to Think Big

How knowledge distillation, pruning, and quantization compress state-of-the-art models into deployable systems while preserving most of their capability.

6 min read · April 18, 2026

2026 · distillation compression pruning quantization · efficiency
