Benchmarks

Aquin Labs · April 2026

How Aquin decides which SAE features are trustworthy, and how you can measure model capability mid-session without leaving your inspection context.

Part 1: Feature Benchmarks

Most SAE features get used because they have a plausible-sounding label and a clean activation plot. That is not enough. A label can be wrong. A feature can be coherent but causally irrelevant. Before a feature earns a place in a circuit graph or a steering experiment, three properties need to hold independently: the label predicts where it fires, it is monosemantic, and it actually does work in the forward pass.

These conditions are orthogonal. A feature can pass two and fail the third in any combination. Aquin scores all three separately and surfaces them as a diagnostic triple. The combination tells you what to do next: relabel, filter, or trust.

InterpScore

The question InterpScore answers: does this feature's label predict when it fires? Two sentence sets are built per feature, one where the label implies the feature should activate and one where it should not. Both pass through the model, the maximum activation at layer 8 is extracted per sentence, and Cohen's d is computed between the two activation distributions. The result is clipped to [0, 1].

A score near 1 means the label and the feature agree. A score near 0 means they have drifted, so treat the auto-generated label as a guess. Each feature uses 10 positive and 10 negative sentences, for 20 separate forward passes through the full model and SAE.

f13910 · "capital / seat-of-government" · Llama 3.2 1B Instruct

Cohen's d 0.84, InterpScore 84%

Fires on

The capital of France is a major hub. · 8.41
Parliament sits at the seat of government. · 7.86
Washington D.C. is where the president works. · 6.93

Silent on

She ordered a coffee and opened her laptop. · 0.12
The algorithm runs in linear time. · 0.08
Three dogs sat under the oak tree. · 0.03
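The InterpScore computation described above can be sketched in a few lines. This is a minimal illustration, not Aquin's implementation: the function names are hypothetical, the pooled-standard-deviation form of Cohen's d is one common convention (the article does not say which variant is used), and the activations are made-up numbers rather than values from a real feature.

```python
import math
from statistics import mean, variance


def cohens_d(positive, negative):
    """Cohen's d using a pooled sample standard deviation (assumed convention)."""
    n_pos, n_neg = len(positive), len(negative)
    pooled_var = (
        (n_pos - 1) * variance(positive) + (n_neg - 1) * variance(negative)
    ) / (n_pos + n_neg - 2)
    diff = mean(positive) - mean(negative)
    if pooled_var == 0:
        # Degenerate case: neither set has any spread.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled_var)


def interp_score(positive, negative):
    """Clip Cohen's d to [0, 1], as the text describes."""
    return min(1.0, max(0.0, cohens_d(positive, negative)))


# Hypothetical per-sentence maximum activations, for illustration only.
fires = [1.0, 2.0, 3.0]
silent = [0.5, 1.5, 2.5]
print(interp_score(fires, silent))
```

Because of the clipping, any feature that fires more strongly on the negative set than on the positive set scores 0, and any effect size above 1 saturates at 1.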

FeaturePurityScore

InterpScore evaluates the label. FeaturePurityScore evaluates the feature itself, with no label involved. The sentences where the feature fired above threshold are embedded, and the mean pairwise cosine similarity is computed over the upper triangle of the similarity matrix, which excludes self-similarity and counts each pair once.

High purity means activating contexts cluster tightly in embedding space, so the feature is monosemantic. Low purity means it is firing on surface-level co-occurrence rather than a coherent concept. Polysemantic features concentrate near the sparsity penalty boundary, which is consistent with what the superposition hypothesis predicts.

High purity · f5042 · cosine sim 0.81 · purity 90%

"The cat sat on the mat."
"She lives near the river."
"The book is beside the lamp."
"He stood behind the door."

Low purity · polysemantic · cosine sim 0.21 · purity 61%

"The merger was announced at noon."
"She whispered in the dark."
"The algorithm converged slowly."
"He scored three goals."
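The pairwise similarity step can be sketched as follows. This is a hedged illustration with hypothetical function names and toy 2-d vectors standing in for sentence embeddings. The article does not say how mean cosine similarity is converted to a purity percentage, so `purity_from_cosine` is purely an assumption: a linear rescale from [-1, 1] to [0, 1], which roughly matches the two cards above.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs i < j.

    This is the upper triangle of the similarity matrix, so the
    diagonal (self-similarity) is never included.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def purity_from_cosine(mean_cos):
    """ASSUMPTION: rescale [-1, 1] to [0, 1]; the source does not state the mapping."""
    return (mean_cos + 1.0) / 2.0


# Toy vectors, not real sentence embeddings.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mean_pairwise_cosine(toy))
```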

Model Utilization Index

A feature can look perfect on InterpScore and FeaturePurityScore and still be inert. The model computes it but does not route through it. This is the gap MUI is designed to close. Some cleanly labeled, monosemantic features produce near-zero KL divergence under ablation. They are decorative.

MUI measures causal load directly. At each token position where the feature fires above threshold, its projection onto the residual stream is zeroed and the forward pass re-runs. KL divergence between baseline and ablated output distributions is computed at that position, averaged across all firing positions, and normalized by baseline Shannon entropy. The result is a [0, 1] score of how much the model's output depends on this feature when it is active.

f13933 · "geographic country associations" · per-position ablation

Ablating at the "France" token shifts output substantially. MUI = 76%.
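The scoring half of MUI can be sketched as below, assuming the baseline and ablated next-token distributions at each firing position have already been obtained from the two forward passes. Several details are assumptions, since the article does not pin them down: the KL direction (baseline to ablated), normalizing each position by its own baseline entropy before averaging, skipping positions with zero baseline entropy, and clipping the final value to [0, 1]. All names are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation choice


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, EPS)) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def mui(baseline_dists, ablated_dists):
    """Mean entropy-normalized KL over firing positions, clipped to [0, 1]."""
    ratios = []
    for p, q in zip(baseline_dists, ablated_dists):
        h = entropy(p)
        if h <= EPS:
            continue  # assumed handling: skip deterministic baselines
        ratios.append(kl_divergence(p, q) / h)
    if not ratios:
        return 0.0
    return min(1.0, max(0.0, sum(ratios) / len(ratios)))


# One firing position over a toy two-token vocabulary.
print(mui([[0.5, 0.5]], [[0.9, 0.1]]))
```

A feature whose ablation leaves every output distribution unchanged scores exactly 0 under this sketch, which is the "decorative" case described above.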

Reading the scores together

The three scores are a diagnostic triple, not a leaderboard. The most actionable pattern is high purity and high MUI with low InterpScore. The feature is coherent and causally load-bearing, but its label is wrong. A relabeling pass using the actual activating examples usually resolves it in minutes. The all-low pattern is a dead feature, appearing disproportionately near the sparsity penalty boundary, and it should be filtered before any downstream analysis.

Interp · Purity · MUI · Reading

High · High · High · Ideal, well-labeled, monosemantic, causally active.
High · High · Low · Understood but decorative. Model does not route through it.
High · Low · High · Label predictive but too coarse. Fires across related contexts.
Low · High · High · Coherent and causally active, but mislabeled. Relabeling priority.
Low · Low · Low · Dead or noise. Filter before downstream use.
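The table can be read as a simple lookup. In the sketch below, the 0.5 cutoff between "high" and "low" is a hypothetical threshold (the article gives none), the function names are illustrative, and combinations the table does not list fall through to a manual-inspection message.

```python
READINGS = {
    ("high", "high", "high"): "Ideal, well-labeled, monosemantic, causally active.",
    ("high", "high", "low"): "Understood but decorative. Model does not route through it.",
    ("high", "low", "high"): "Label predictive but too coarse. Fires across related contexts.",
    ("low", "high", "high"): "Coherent and causally active, but mislabeled. Relabeling priority.",
    ("low", "low", "low"): "Dead or noise. Filter before downstream use.",
}

FALLBACK = "Combination not listed in the table; inspect manually."


def level(score, threshold=0.5):
    """Bucket a [0, 1] score; the threshold is an assumption."""
    return "high" if score >= threshold else "low"


def read_triple(interp, purity, mui, threshold=0.5):
    """Map (InterpScore, FeaturePurityScore, MUI) to the table's reading."""
    key = (level(interp, threshold), level(purity, threshold), level(mui, threshold))
    return READINGS.get(key, FALLBACK)


# Hypothetical scores for a single feature.
print(read_triple(0.20, 0.90, 0.80))
```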

Part 2: The Benchmark Builder

Standard benchmark workflows require selecting a suite, configuring a harness, running the eval, and parsing results out-of-band. For scheduled evaluations that pipeline is fine. For a question that surfaces mid-inspection, say a suspicious feature or an unexpected output, it is a full context switch that almost never happens. The question gets dropped.

The Benchmark Builder removes the context switch. You describe what you want to measure in natural language, the agent writes the prompt suite, runs it against whatever is currently loaded, and returns a scored card in the thread, grounded in the same session that surfaced the question.

The contexts

The same natural-language request produces different prompt suites and scores depending on what is loaded. Context is recorded in card metadata and carried through all exports.

01 · model inspection

Prompts run directly against the loaded model. No re-specification needed.

02 · training monitor

Benchmarks a checkpoint at a specific training step. Results are indexed by step and tracked in the regression panel.

Scoring methods

Each capability dimension scores 0 to 100. The agent selects the method based on task type and records it in card metadata. A 67% on CoT math with partial-credit scoring is not the same as a 67% on factual recall with next-token probability. The method is part of the result.

01 · next-token probability · factual recall, cloze, MCQ

Log-probability on the target token. No generation required.

02 · execution-based pass@1 · code generation, function completion

Generated code run against a test suite. First-attempt pass rate.

03 · reference-based ROUGE-L · summarization, translation

LCS between output and reference as a proxy for content coverage.

04 · binary pass rate · refusal, safety

Fraction of prompts producing the expected refusal. Threshold configurable.

05 · distributional divergence · data diversity, label balance

KL divergence from a reference distribution, normalized to [0, 1].

6-capability result · model inspection · llama-3.2-1b

overall 80% · 36 prompts
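Two of the scoring methods listed above are compact enough to sketch directly. The code below is illustrative only and does not reflect how the Benchmark Builder implements them. The function names are hypothetical, tokenization is plain lowercase whitespace splitting, and reporting ROUGE-L as a balanced F-measure is an assumption, since the article only says that LCS is used.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as a balanced F-measure over whitespace tokens (assumed variant)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def target_log_prob(logits, target_index):
    """Log-probability of the target token via a numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target_index] - log_z


print(rouge_l_f1("the cat is on the mat", "the cat sat on the mat"))
print(target_log_prob([0.0, 0.0], 0))
```

The two numbers live on different scales, a ratio in [0, 1] and a log-probability at or below 0, which is one concrete reason the scoring method has to travel with the result.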

Reading results

Scores are relative to the generated prompt suite, so they are not directly comparable to published leaderboard numbers unless you explicitly request a named standardized benchmark. The most reliable use is within-session comparison: run the same request against two models or checkpoints and compare rank order, not absolute values.

A low score is a starting point, not a verdict. A reasoning score of 67% driven by spatial failures is a different problem from one driven by arithmetic failures. A follow-up benchmark scoped to the sub-type disambiguates in one additional request.
