<aside> 📌

Notes from reading Elhage et al.'s 'A Mathematical Framework for Transformer Circuits' — the foundational paper for mechanistic interpretability.

</aside>

Transformer Overview

The philosophical core of "A Mathematical Framework for Transformer Circuits" is that we should stop treating the Transformer as an opaque, monolithic "black box" of layers and start treating it as a readable, reverse-engineerable machine built from independent, functional components. The standard mathematical implementation (stacking layers, concatenating heads) is optimized for computational efficiency on GPUs, but it obscures the actual mechanism of how the model computes. The paper therefore mathematically "refactors" the model into its true, additive parts: the linear highway of the Residual Stream, and the independent circuits inside each attention head (QK for routing, OV for processing). Seen this way, the model is not a mysterious soup of neurons but a comprehensible sum of specific algorithms (bigrams, induction heads, copy mechanisms) that can be individually isolated, studied, and understood by humans.
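A minimal sketch of this decomposition, using toy numpy shapes (all dimensions and weight matrices here are made up for illustration): a single attention head split into its QK circuit (where to attend) and OV circuit (what to move), with the head's contribution simply added onto the residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq = 8, 2, 4

x = rng.normal(size=(seq, d_model))   # token vectors in the residual stream

# One attention head's weights (toy values, not from any real model):
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# QK circuit: decides *where* information is routed from.
scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head)
A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# OV circuit: decides *what* is moved, via the combined matrix W_V @ W_O.
W_OV = W_V @ W_O
head_out = A @ x @ W_OV

# The residual stream is purely additive:
# output = direct (identity) path + the head's contribution.
out = x + head_out
```

The key point the code makes concrete: the head never overwrites the stream, it only adds to it, so each component's contribution can be isolated and studied on its own.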


Virtual Weights and the Residual Stream as a Communication Channel
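Because every component reads from and writes to the same additive stream, the paper's "virtual weights" fall out directly: the product of a later layer's reading matrix with an earlier layer's writing matrix tells you how the two layers communicate, as if they were wired together. A toy sketch (shapes and matrices are illustrative assumptions, not real model weights):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_1, d_2 = 8, 3, 3

# Layer 1 writes its output into the residual stream via W_out1;
# layer 2 later reads from the stream via W_in2.
W_out1 = rng.normal(size=(d_1, d_model))
W_in2 = rng.normal(size=(d_model, d_2))

# The "virtual weight" connecting layer 1 directly to layer 2:
W_virtual = W_out1 @ W_in2        # shape (d_1, d_2)

h1 = rng.normal(size=d_1)         # a hidden activation inside layer 1
stream_write = h1 @ W_out1        # what layer 1 adds to the stream
read_by_2 = stream_write @ W_in2  # what layer 2 recovers from the stream
```

`read_by_2` equals `h1 @ W_virtual` exactly, which is why the residual stream can be treated as a communication channel between layers that never touch each other directly.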


Privileged Basis


To understand "Privileged Basis," we first need to agree on what a "Basis" is.

What is a Basis?

In a neural network, a "Basis" is just the set of coordinate axes we use to describe a vector.
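A tiny worked example of the distinction between a vector and its coordinates (the numbers here are arbitrary): the same vector gets different coordinates in a rotated basis, but its geometry (here, its length) is unchanged.

```python
import numpy as np

v = np.array([3.0, 4.0])   # the vector itself (the "arrow")

# In the standard basis its coordinates are just (3, 4).
# A basis rotated by 45 degrees describes the SAME vector
# with different numbers.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
coords_rotated = R.T @ v   # coordinates of v in the rotated basis

# The coordinates changed, but the vector's length did not:
print(np.linalg.norm(v), np.linalg.norm(coords_rotated))
```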

Non-Privileged Basis (The Residual Stream)

The paper says the Residual Stream has no privileged basis.

This means the individual numbers (coordinates) don't matter; only the geometric relationships (angles and lengths) between vectors matter.
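This rotation-invariance can be demonstrated directly (weights and dimensions below are toy assumptions): rotate every residual-stream vector by an orthogonal matrix and counter-rotate the weights that read from the stream, and every downstream computation is unchanged, so no particular coordinate axis is special.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model = 6

x = rng.normal(size=d_model)             # a residual-stream vector
W_read = rng.normal(size=(d_model, 4))   # a layer that reads from the stream

# An arbitrary rotation Q of the residual stream's coordinates:
Q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))

# Rotate the stream, and counter-rotate the reading weights to match:
x_rot = Q @ x
W_read_rot = Q @ W_read

# What the layer reads out is identical: (Qx)ᵀ(QW) = xᵀQᵀQW = xᵀW.
print(np.allclose(x @ W_read, x_rot @ W_read_rot))
```

Contrast this with post-nonlinearity activations (e.g. after a ReLU), where the element-wise nonlinearity acts on each coordinate separately, so the coordinate axes there *are* privileged.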

Analogy: A Map on a Table