The Attribution System
Aquin Labs · April 2026
Seven tools that answer two questions: how did the model produce this output, and is the output actually correct?
Tracing facts through a language model
When a language model answers "What is the capital of France?" with "Paris", it is not looking anything up. Somewhere in 1.2 billion parameters, the association was encoded during training and is retrieved at inference time through a sequence of matrix multiplications. Two questions follow: where exactly does the retrieval happen, and once we know the mechanism, is the answer actually right?
The attribution system runs two pipelines in sequence on every output. The first traces the mechanism: which layers, features, and prompt tokens caused each response token. The second evaluates the result: whether claims are true, whether the framing leans in a direction, whether certain topics were quietly avoided. Neither is complete without the other.
full pipeline · 4 attribution steps (how was the output produced?) + 3 checking steps (is the output correct and complete?)
The query
A single factual query is run end-to-end through the full pipeline. The prompt is intentionally simple: an unambiguous causal structure makes each step's output easier to read.
ROME-style causal mediation analysis is the entry point: each prompt token's embedding is corrupted with scaled Gaussian noise, the forward pass is re-run, and the drop in the target token's probability is measured. Averaging over multiple noise runs produces a causal score for every (prompt token, response token) pair.
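The corrupt-and-measure loop described above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: `toy_forward` is a hypothetical stand-in for a real forward pass, the noise scale and run count are assumed values, and only a single response token is scored rather than every (prompt token, response token) pair.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def toy_forward(embeds, w_out):
    # Hypothetical stand-in for the model: pool token embeddings,
    # then project to a toy vocabulary.
    pooled = np.tanh(embeds).mean(axis=0)
    return softmax(w_out @ pooled)

def token_attribution(embeds, w_out, target, noise_scale=3.0, n_runs=32, seed=0):
    """Mean drop in P(target) when each prompt token's embedding is noised in turn."""
    rng = np.random.default_rng(seed)
    clean_p = toy_forward(embeds, w_out)[target]
    scores = np.zeros(len(embeds))
    for t in range(len(embeds)):
        drops = []
        for _ in range(n_runs):
            corrupted = embeds.copy()
            # Corrupt only token t with scaled Gaussian noise (scale is an assumption).
            corrupted[t] += noise_scale * rng.standard_normal(embeds.shape[1])
            drops.append(clean_p - toy_forward(corrupted, w_out)[target])
        scores[t] = np.mean(drops)  # average over noise runs
    return scores

rng = np.random.default_rng(1)
embeds = rng.standard_normal((5, 8))   # 5 prompt tokens, 8-dim toy embeddings
w_out = rng.standard_normal((10, 8))   # 10-token toy vocabulary
scores = token_attribution(embeds, w_out, target=3)
print(scores)
```

A positive score means that noising the token lowered the target probability on average; with a real model the same loop would wrap the full forward pass.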
Attribution
Token attribution scores
Three prompt tokens dominate: "capital", "of", and "France". Together they carry nearly all the causal signal driving "Paris". "What" contributes almost nothing. The model identifies the semantically load-bearing tokens and routes most of the causal work through them, not through the full sentence structure.
causal attribution · "What is the capital of France?" → "Paris"
scores normalized. amber = causal prompt driver, green = key response token.
16 layers, one peak
Causal patching localizes the retrieval to a specific layer. For each layer in turn, the clean residual stream is restored while all other layers remain corrupted, and the recovery in the target token's probability is measured. The result is a causal score per layer (shown as "drop %" in the graph): how much of the probability of "Paris" that layer brings back when restored alone.
causal layer graph · drop % per layer · Llama 3.2 1B Instruct
node brightness = causal drop %. L8 is the peak at 87%.
Layer 8 accounts for 87% of the causal signal. The France → capital → Paris association is stored in the MLP sublayers at the network's midpoint. This is the key-value store pattern: the subject representation ("France") functions as a lookup key, and the MLP writes the associated value ("Paris") into the residual stream at that layer.
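The layer-by-layer restoration can be illustrated on a toy residual network. The sketch below makes several assumptions: the network, its weights, and the crude token-mixing step are all hypothetical, and the patch restores the clean state at a single token position, which the description above does not specify. The score is the fraction of the clean-versus-corrupted probability gap that each patch recovers.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def run(x, weights, w_out, patch=None):
    """Toy residual network over (tokens, dim). patch = (layer, position, vector)
    overwrites one residual-stream entry at the input of that layer."""
    h = x.copy()
    states = []
    for layer, w in enumerate(weights):
        if patch is not None and patch[0] == layer:
            h[patch[1]] = patch[2]
        states.append(h.copy())                 # residual stream entering this layer
        mixed = h.mean(axis=0, keepdims=True)   # crude stand-in for attention
        h = h + np.tanh((h + mixed) @ w)
    return softmax(w_out @ h[-1]), states       # predict from the last token

def layer_recovery(x_clean, x_corrupt, weights, w_out, target, position):
    p_clean, clean_states = run(x_clean, weights, w_out)
    p_corrupt, _ = run(x_corrupt, weights, w_out)
    gap = p_clean[target] - p_corrupt[target]
    scores = []
    for layer in range(len(weights)):
        patch = (layer, position, clean_states[layer][position])
        p_patched, _ = run(x_corrupt, weights, w_out, patch)
        scores.append((p_patched[target] - p_corrupt[target]) / gap)
    return np.array(scores)

rng = np.random.default_rng(0)
n_layers, n_tokens, dim, vocab = 6, 4, 8, 10
weights = [0.5 * rng.standard_normal((dim, dim)) for _ in range(n_layers)]
w_out = rng.standard_normal((vocab, dim))
x_clean = rng.standard_normal((n_tokens, dim))
x_corrupt = x_clean.copy()
x_corrupt[1] += 3.0 * rng.standard_normal(dim)  # corrupt the "subject" token only
recovery = layer_recovery(x_clean, x_corrupt, weights, w_out, target=2, position=1)
print(recovery)
```

In this toy, patching at layer 0 undoes the corruption entirely, so the first score is 1.0 by construction. The interesting signal in a real model is how the score varies across later layers.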
The logit lens: watching confidence build
The causal trace locates the retrieval site. The logit lens shows what the model is predicting at each layer as it gets there. After every transformer block, the final layer norm is applied and the residual stream is projected directly into vocabulary space as if the model had stopped at that layer and been forced to output a token.
logit lens · P(Paris) per layer · Llama 3.2 1B
amber bar = L8 peak at 78%. bars show 0% when Paris is not the top prediction.
Early layers produce generic tokens like "the" and "city" with no factual commitment. Around layer 5, "France" surfaces briefly as the subject representation assembles. By layer 8, "Paris" dominates at 78% and holds flat through layer 15. The two-step structure of the retrieval is directly visible: subject formation first, then fact lookup at the MLP.
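A logit lens of this kind reduces to three steps per layer: normalize, unembed, softmax. The sketch below assumes an RMSNorm-style final norm (as Llama models use) and random toy weights; `resid_states`, `gain`, and `w_u` are hypothetical placeholders for cached residual states, the final norm gain, and the unembedding matrix. It also mirrors the chart convention of reporting 0 when the target is not the top prediction.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rms_norm(h, gain, eps=1e-6):
    # Final-norm stand-in: scale to unit root-mean-square, then apply the gain.
    return h / np.sqrt(np.mean(h ** 2) + eps) * gain

def logit_lens(resid_states, gain, w_u, target):
    """For each layer: (layer, top token id, P(target) if target is top else 0.0)."""
    rows = []
    for layer, h in enumerate(resid_states):
        probs = softmax(w_u @ rms_norm(h, gain))
        top = int(np.argmax(probs))
        shown = float(probs[target]) if top == target else 0.0
        rows.append((layer, top, shown))
    return rows

rng = np.random.default_rng(0)
dim, vocab, n_layers = 16, 12, 4
w_u = rng.standard_normal((vocab, dim))       # toy unembedding matrix
gain = np.ones(dim)                           # toy norm gain
resid_states = [rng.standard_normal(dim) for _ in range(n_layers)]
for row in logit_lens(resid_states, gain, w_u, target=5):
    print(row)
```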
SAE Features
Top active features
The query is passed through an SAE at layer 8 to extract the top activating features at each token position. For each active feature, causal ablation zeroes out its contribution to the residual stream and re-runs the forward pass, comparing output distributions to define its functional role.
top SAE features · layer 8 · activation strength
f13933 fires at 9.75 on 'France' · f13910 fires at 7.86 on 'capital'
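Feature extraction and ablation can be sketched as follows. The encoder here is the common ReLU sparse autoencoder form, which is an assumption about the SAE variant in use; all weights are random toy values, and the "rest of the forward pass" is collapsed into a single unembedding step for brevity.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sae_encode(h, w_enc, b_enc, b_dec):
    # Common ReLU SAE encoder; the exact variant used is an assumption.
    return np.maximum(0.0, w_enc @ (h - b_dec) + b_enc)

def top_features(acts, k):
    order = np.argsort(acts)[::-1][:k]
    return [(int(i), float(acts[i])) for i in order if acts[i] > 0]

def ablate_feature(h, acts, w_dec, idx):
    # Remove the feature's contribution (activation * decoder direction).
    return h - acts[idx] * w_dec[idx]

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
dim, n_feat, vocab = 8, 32, 10
w_enc = rng.standard_normal((n_feat, dim))
w_dec = rng.standard_normal((n_feat, dim))
b_enc, b_dec = np.zeros(n_feat), np.zeros(dim)
w_u = rng.standard_normal((vocab, dim))       # toy stand-in for the rest of the model
h = rng.standard_normal(dim)                  # toy residual state at one token

acts = sae_encode(h, w_enc, b_enc, b_dec)
base = softmax(w_u @ h)
for idx, strength in top_features(acts, k=3):
    ablated = softmax(w_u @ ablate_feature(h, acts, w_dec, idx))
    print(idx, round(strength, 2), round(kl_divergence(base, ablated), 4))
```

The divergence between the baseline and ablated distributions is one plausible way to quantify a feature's functional role; comparing the top tokens before and after is another.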
The circuit attribution graph
The circuit attribution graph makes the feature bridge structure explicit as a directed bipartite visualization: prompt tokens on the left, SAE features in the middle, response tokens on the right. Edge weight encodes activation strength on the left side and causal ablation score on the right.
Hub features are the diagnostic signal. f13910 (capital/seat-of-government) receives signal from both "capital" and "of" in the prompt and feeds both "capital" and "Paris" in the response, acting simultaneously as a relational and a geographic feature. A hub at this position is the first candidate for any intervention targeting "Paris".
circuit attribution · prompt → features → response
f13910 is the hub: receives from 'capital' + 'of', feeds both 'capital' and 'Paris' in the response.
What each feature does to the vocabulary
Each SAE feature is a direction in residual stream space. Its effect on the model's output is read by projecting that direction through the unembedding matrix, a step called the logit projection. For f13933, the top boosted token is "Paris" at +4.21 and all suppressed tokens are non-French European capitals. The feature is not merely "France-related": it specifically routes the output toward French city names and away from other national capitals.
f13933 · geographic country associations · logit projection
panels: Boosts · Suppresses. projection of decoder direction through W_U. positive = boosted, negative = suppressed.
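The logit projection itself is a single matrix-vector product followed by a sort. The sketch below uses a four-token vocabulary and a two-dimensional decoder direction with made-up values chosen for readability; they are not the figures reported for f13933. It also skips the final layer norm, so it should be read as a first-order approximation.

```python
import numpy as np

def logit_projection(w_dec_row, w_u, vocab, k=3):
    """Project one decoder direction through the unembedding matrix."""
    effect = w_u @ w_dec_row                  # per-token logit change
    order = np.argsort(effect)                # ascending
    boosts = [(vocab[i], float(effect[i])) for i in order[::-1][:k]]
    suppresses = [(vocab[i], float(effect[i])) for i in order[:k]]
    return boosts, suppresses

# Toy values, invented for illustration only.
vocab = ["Paris", "Lyon", "Berlin", "Madrid"]
w_u = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
direction = np.array([2.0, 1.0])

boosts, suppresses = logit_projection(direction, w_u, vocab, k=2)
print("boosts:", boosts)
print("suppresses:", suppresses)
```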
Feature neighborhoods in weight space
Features that are geometrically close in decoder weight space tend to fire in similar contexts and produce similar vocabulary effects. For f13933, the nearest neighbor at 91% similarity is f13007 (European nation names). The neighborhood also includes f7834 (country-capital associations) and f2901 (seat-of-power contexts). Any weight editing intervention should account for this neighborhood: editing one feature risks perturbing the others.
f13933 · nearest neighbors · cosine similarity in decoder space
similarity computed over W_dec rows. bar = cosine similarity normalized to [0, 1].
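A nearest-neighbor lookup over decoder rows can be sketched as below. The decoder matrix is a four-row toy, and the function returns raw cosine similarity; the [0, 1] normalization mentioned in the caption is not specified here, so it is not reproduced.

```python
import numpy as np

def nearest_features(w_dec, idx, k=3):
    """Indices and cosine similarities of the k decoder rows closest to row idx."""
    unit = w_dec / np.linalg.norm(w_dec, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf                       # exclude the feature itself
    top = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in top]

# Toy decoder matrix: 4 features in 2 dimensions (illustrative values only).
w_dec = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbors = nearest_features(w_dec, idx=0, k=2)
print(neighbors)
```

Because the rows are normalized first, the result depends only on direction, not on decoder norm, which matches the idea of geometric closeness in weight space.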
The feature space: a map of 16,384 directions
UMAP projects all SAE decoder directions into three-dimensional space, making the full geometric structure of the feature space navigable. Features that fire in similar contexts and produce similar vocabulary effects cluster together.
All five features active on this query fall inside or adjacent to the same cluster, a geopolitical reference region. The UMAP view is most useful as a pre-edit diagnostic: a tight cluster means an edit to one feature will likely affect the others, and the edit scope should be set accordingly.
UMAP projection · 16,384 SAE features · layer 8 · Llama 3.2 1B
computed with umap-learn on normalized W_dec rows. full explorer is interactive in the app.
Feature steering: intervening directly
Feature steering adds a scaled multiple of a feature's decoder direction to the residual stream at layer 8 on every forward pass, amplifying or suppressing the feature without touching model weights. It is the fastest way to validate a feature's causal role before committing to a permanent weight editing intervention. Steering is reversible; weight editing is not.
f13933 · geographic country associations · strength +4.0
Baseline
The capital of France is Paris, which has been the country's political and cultural center since the 10th century.
Steered
+4.0 · The capital of France is Lyon, which has been the country's political and cultural center since the 10th century.
highlighted words differ from baseline. strength slider runs from -10 (suppress) to +10 (amplify).
When steering confirms the feature's role and the logit projection confirms its vocabulary signature, a ROME-style weight editing operation to correct a factual association becomes a targeted, well-scoped intervention rather than a parameter search.
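The steering operation reduces to adding a scaled vector to the residual stream inside a hook. The sketch below is a toy under stated assumptions: `forward_from_layer` is a hypothetical stand-in for the layers after the intervention point, and normalizing the decoder direction to unit length before scaling is an assumption, not a documented detail of the system.

```python
import numpy as np

def forward_from_layer(h, w_u, hook=None):
    # Toy "rest of the model": optionally transform the stream, then unembed.
    if hook is not None:
        h = hook(h)
    return int(np.argmax(w_u @ h))

def make_steering_hook(direction, strength):
    unit = direction / np.linalg.norm(direction)   # unit-norm is an assumption
    return lambda h: h + strength * unit           # returns a new array

w_u = np.eye(3)                          # toy unembedding: token i reads dimension i
h = np.array([1.0, 0.0, 0.0])            # baseline state favors token 0
direction = np.array([0.0, 2.0, 0.0])    # toy decoder direction pointing at token 1

baseline = forward_from_layer(h, w_u)
steered = forward_from_layer(h, w_u, make_steering_hook(direction, +4.0))
suppressed = forward_from_layer(h, w_u, make_steering_hook(direction, -4.0))
print(baseline, steered, suppressed)
```

Nothing persistent is modified: dropping the hook restores baseline behavior, which is the sense in which steering is reversible. In a PyTorch model the same idea would typically be expressed as a forward hook on the layer-8 block.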
Checking
The attribution pipeline explains how "Paris" was produced: layer 8, five specific features, three prompt tokens, a geopolitical cluster with a clear logit signature. That tells us nothing about whether the output is accurate, whether its framing is neutral, or whether relevant information was left out. The checking system runs automatically after every generation and produces three analyses in parallel.
Fact check: is it true?
Every distinct verifiable claim is extracted from the response and classified as supported, refuted, or unverifiable, with a one-sentence explanation and up to three sources. Using live web search rather than retrieval augmentation matters here: a model may assert something that was accurate at training time but has since changed.
fact check · "tell me about the Eiffel Tower"
The Eiffel Tower is 330 meters tall · The Eiffel Tower stands 330 meters tall including its broadcast antenna. · source: Eiffel Tower official site
The Eiffel Tower was built in 1889 · Construction was completed in 1889 for the World's Fair. · source: Britannica: Eiffel Tower
The Eiffel Tower is the tallest structure in Europe · Several structures including the Ostankino Tower in Moscow are taller. · source: List of tallest structures in Europe
the third claim is incorrect. the logit lens shows when the model committed to the wrong token, and the active SAE features there are candidates for feature steering to confirm and weight editing to correct.
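One way to represent a fact-check result is a small validated record per claim. The class and field names below are hypothetical, not the system's schema. The claims, explanations, and sources are taken from the example above; the "supported" verdicts on the first two claims are inferred from the caption, which flags only the third as incorrect.

```python
from collections import Counter
from dataclasses import dataclass, field

VERDICTS = ("supported", "refuted", "unverifiable")

@dataclass
class ClaimCheck:
    claim: str
    verdict: str
    explanation: str
    sources: list = field(default_factory=list)

    def __post_init__(self):
        if self.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {self.verdict}")
        if len(self.sources) > 3:
            raise ValueError("at most three sources per claim")

checks = [
    ClaimCheck("The Eiffel Tower is 330 meters tall", "supported",
               "The Eiffel Tower stands 330 meters tall including its broadcast antenna.",
               ["Eiffel Tower official site"]),
    ClaimCheck("The Eiffel Tower was built in 1889", "supported",
               "Construction was completed in 1889 for the World's Fair.",
               ["Britannica: Eiffel Tower"]),
    ClaimCheck("The Eiffel Tower is the tallest structure in Europe", "refuted",
               "Several structures including the Ostankino Tower in Moscow are taller.",
               ["List of tallest structures in Europe"]),
]

summary = Counter(c.verdict for c in checks)
print(dict(summary))
```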
Bias detection: which direction does it lean?
Rather than applying a fixed set of axes to every response, bias dimensions are derived from the content. Two to four axes genuinely relevant to the specific prompt are scored from -1.0 to +1.0. A response about climate policy yields axes like "alarmist vs dismissive." The axes shift with the content rather than being imposed on it.
bias axes · Eiffel Tower response
The response states facts without qualification even where debate exists.
Examples and framing draw primarily from Western European and American contexts.
Censor audit: what did it not say?
Fact check and bias detection work on what the model said. Censor audit works on what it did not. Given the prompt, 3 to 6 topic areas naturally relevant to the response are identified, then each is assessed: addressed directly (unfiltered), engaged with excessive caveats (softened), or avoided (suppressed).
The audit also attempts to classify the origin of suppression: weight-level (consistent avoidance across prompt framings) versus surface-level (an instruction-following patch). This is a hypothesis to investigate, not a finding. Confirming it requires causal mediation analysis and feature steering on the specific deflection point.
censor audit · Eiffel Tower response
model discussed the tower freely but avoided the historical controversy around its construction.
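The origin hypothesis can be sketched as a simple rule over repeated audits of the same topic under different prompt framings. The rule below is a deliberate simplification and an assumption about how "consistent avoidance" might be operationalized: suppressed under every framing suggests weight-level, suppressed under only some suggests surface-level. The function name and the example statuses are hypothetical.

```python
STATUSES = ("unfiltered", "softened", "suppressed")

def origin_hypothesis(status_by_framing):
    """Statuses for one topic across several rephrasings of the prompt.
    Returns a hypothesis label, or None if the topic was never suppressed."""
    for status in status_by_framing:
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status}")
    n_suppressed = sum(1 for s in status_by_framing if s == "suppressed")
    if n_suppressed == 0:
        return None
    if n_suppressed == len(status_by_framing):
        return "weight-level"      # avoided regardless of framing
    return "surface-level"         # avoidance depends on how the prompt is phrased

# Hypothetical audit results for one topic under three framings.
controversy = ["suppressed", "suppressed", "suppressed"]
print(origin_hypothesis(controversy))
```

The output is only a pointer for where to look next; as noted above, confirming it still requires causal analysis at the deflection point.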
Reading together
A model can pass every behavioral check and still encode a factual error that mechanistic analysis catches immediately. A clean causal trace does not guarantee a correct or unbiased output. The mechanism and the result are independent questions and both require an answer.
For teams deploying models in regulated or high-stakes contexts, this is the difference between knowing a model scored 90% on a benchmark and knowing why: which answers it gets right for the right reasons, which it suppresses, where in the network to look when something is wrong, and how to correct it.
