Benchmarks
Aquin Labs · April 2026
How Aquin decides which SAE features are trustworthy, and how you can measure model capability mid-session without leaving your inspection context.
Part 1: Feature Benchmarks
Most SAE features get used because they have a plausible-sounding label and a clean activation plot. That is not enough. A label can be wrong. A feature can be coherent but causally irrelevant. Before a feature earns a place in a circuit graph or a steering experiment, three properties need to hold independently: the label predicts where it fires, it is monosemantic, and it actually does work in the forward pass.
These conditions are orthogonal. A feature can pass two and fail the third in any combination. Aquin scores all three separately and surfaces them as a diagnostic triple. The combination tells you what to do next: relabel, filter, or trust.
InterpScore
InterpScore answers one question: does this feature's label predict when it fires? Two sentence sets are built per feature, one where the label implies the feature should activate and one where it should not. Both sets pass through the model, the maximum activation at layer 8 is extracted per sentence, and Cohen's d is computed between the two activation distributions. The result is clipped to [0, 1].
A score near 1 means the label and the feature agree. A score near 0 means they have drifted apart, so treat the auto-generated label as a guess. Each feature uses 10 positive and 10 negative sentences, for 20 separate forward passes through the full model and SAE.
Example: Cohen's d 0.84, InterpScore 84%
Fires on (max activation):
The capital of France is a major hub. (8.41)
Parliament sits at the seat of government. (7.86)
Washington D.C. is where the president works. (6.93)
Silent on (max activation):
She ordered a coffee and opened her laptop. (0.12)
The algorithm runs in linear time. (0.08)
Three dogs sat under the oak tree. (0.03)
FeaturePurityScore
InterpScore evaluates the label. FeaturePurityScore evaluates the feature itself, with no label involved. The sentences where the feature fired above threshold are embedded, and the mean pairwise cosine similarity is computed over the upper triangle of the similarity matrix, which excludes self-similarity and counts each pair once.
High purity means activating contexts cluster tightly in embedding space, so the feature is monosemantic. Low purity means it is firing on surface-level co-occurrence rather than a coherent concept. Polysemantic features concentrate near the sparsity penalty boundary, which is consistent with what the superposition hypothesis predicts.
High purity · f5042 · cosine sim 0.81 · purity 90%
"The cat sat on the mat."
"She lives near the river."
"The book is beside the lamp."
"He stood behind the door."
Low purity · polysemantic · cosine sim 0.21 · purity 61%
"The merger was announced at noon."
"She whispered in the dark."
"The algorithm converged slowly."
"He scored three goals."
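The upper-triangle mean used by FeaturePurityScore can be sketched as follows. This is an illustration under stated assumptions: the embeddings are plain lists of floats from some sentence-embedding model (not specified in the source) and none is a zero vector. The function names are hypothetical. The rescale in `purity_from_similarity` is a guess: the example figures (0.81 with 90%, 0.21 with 61%) are consistent with mapping cosine similarity from [-1, 1] to [0, 1], but the source does not state the mapping.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Mean cosine similarity over all unordered pairs (i < j).

    Iterating over combinations is equivalent to averaging the strict upper
    triangle of the similarity matrix, so self-similarity is never counted.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two activating sentences")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def purity_from_similarity(sim):
    """ASSUMPTION: linear rescale from [-1, 1] to [0, 1]; not confirmed by the source."""
    return (sim + 1.0) / 2.0


# Toy 2-d "embeddings": two identical contexts and one orthogonal one.
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_similarity = mean_pairwise_cosine(toy)
toy_purity = purity_from_similarity(toy_similarity)
```

With n activating sentences there are n(n-1)/2 pairs, so the cost grows quadratically in the number of sentences above threshold.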
Model Utilization Index
A feature can look perfect on InterpScore and FeaturePurityScore and still be inert. The model computes it but does not route through it. This is the gap MUI is designed to close. Some cleanly labeled, monosemantic features produce near-zero KL divergence under ablation. They are decorative.
MUI measures causal load directly. At each token position where the feature fires above threshold, its projection onto the residual stream is zeroed and the forward pass re-runs. KL divergence between baseline and ablated output distributions is computed at that position, averaged across all firing positions, and normalized by baseline Shannon entropy. The result is a [0, 1] score of how much the model's output depends on this feature when it is active.
Ablating at the "France" token shifts output substantially. MUI = 76%.
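The scoring step of MUI, after the ablated forward passes have been run, can be sketched like this. It assumes the baseline and ablated next-token distributions at each firing position are already available as probability lists. Two choices are assumptions, since the source leaves them open: the mean KL divergence is divided by the mean baseline entropy (rather than normalizing per position), and the ratio is clipped to [0, 1]. The function names are hypothetical.

```python
from math import log


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * log(pi) for pi in p if pi > 0)


def mui(baseline_dists, ablated_dists):
    """Mean KL under ablation, normalized by mean baseline entropy, clipped to [0, 1].

    ASSUMPTION: average first, then normalize; the clip is also an assumption,
    because KL divided by entropy is not bounded above by 1.
    """
    if not baseline_dists:
        return 0.0  # the feature never fired above threshold
    kls = [kl_divergence(p, q) for p, q in zip(baseline_dists, ablated_dists)]
    mean_kl = sum(kls) / len(kls)
    mean_h = sum(entropy(p) for p in baseline_dists) / len(baseline_dists)
    if mean_h == 0.0:
        # Deterministic baseline: any shift at all is treated as maximal.
        return 0.0 if mean_kl == 0.0 else 1.0
    return min(1.0, max(0.0, mean_kl / mean_h))


# One firing position over a two-token vocabulary (hypothetical numbers).
baseline = [[0.5, 0.5]]
ablated = [[0.9, 0.1]]
example_mui = mui(baseline, ablated)
```

A feature whose ablation leaves the output distribution unchanged scores 0 regardless of how cleanly it is labeled, which is the decorative case described above.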
Reading the scores together
The three scores are a diagnostic triple, not a leaderboard. The most actionable pattern is high purity and high MUI with low InterpScore. The feature is coherent and causally load-bearing, but its label is wrong. A relabeling pass using the actual activating examples usually resolves it in minutes. The all-low pattern is a dead feature, appearing disproportionately near the sparsity penalty boundary, and it should be filtered before any downstream analysis.
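High InterpScore, high purity, high MUI: Ideal. Well-labeled, monosemantic, causally active.
High InterpScore, high purity, low MUI: Understood but decorative. Model does not route through it.
High InterpScore, low purity: Label predictive but too coarse. Fires across related contexts.
Low InterpScore, high purity, high MUI: Coherent and causally active, but mislabeled. Relabeling priority.
All low: Dead or noise. Filter before downstream use.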
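One way to read the five patterns above as a decision rule is sketched below. The 0.5 cutoff, the function name `triage`, and the returned strings are all assumptions for illustration; the source does not state where "high" ends and "low" begins, and it does not assign an MUI level to the "too coarse" case.

```python
def triage(interp, purity, mui, threshold=0.5):
    """Map a diagnostic triple (each score in [0, 1]) to a suggested next step.

    ASSUMPTION: a single shared threshold separates "high" from "low".
    """
    hi_interp = interp >= threshold
    hi_purity = purity >= threshold
    hi_mui = mui >= threshold

    if hi_interp and hi_purity and hi_mui:
        return "trust"        # well-labeled, monosemantic, causally active
    if hi_interp and hi_purity:
        return "decorative"   # understood, but the model does not route through it
    if hi_interp:
        return "too coarse"   # label predictive, activating contexts too broad
    if hi_purity and hi_mui:
        return "relabel"      # coherent and load-bearing, label is wrong
    if not hi_purity and not hi_mui:
        return "filter"       # dead or noise
    return "inspect"          # pattern not covered by the five cases above


# Scores taken from the examples in Part 1: 84% InterpScore, 90% purity, 76% MUI.
decision = triage(0.84, 0.90, 0.76)
```

The final branch exists because three binary scores give eight combinations and the text names only five; anything else is left for manual inspection rather than guessed at.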
Part 2: The Benchmark Builder
Standard benchmark workflows require selecting a suite, configuring a harness, running the eval, and parsing results out-of-band. For scheduled evaluations that pipeline is fine. For a question that surfaces mid-inspection, say a suspicious feature or an unexpected output, it is a full context switch that almost never happens. The question gets dropped.
The Benchmark Builder removes the context switch. You describe what you want to measure in natural language, the agent writes the prompt suite, runs it against whatever is currently loaded, and returns a scored card in the thread, grounded in the same session that surfaced the question.
The contexts
The same natural-language request produces different prompt suites and scores depending on what is loaded. Context is recorded in card metadata and carried through all exports.
Loaded model: Prompts run directly against the loaded model. No re-specification needed.
Training checkpoint: Benchmarks a checkpoint at a specific training step. Results are indexed by step and tracked in the regression panel.
Scoring methods
Each capability dimension scores 0 to 100. The agent selects the method based on task type and records it in card metadata. A 67% on CoT math with partial-credit scoring is not the same as a 67% on factual recall with next-token probability. The method is part of the result.
factual recall, cloze, MCQ: Log-probability on the target token. No generation required.
code generation, function completion: Generated code is run against a test suite. The score is the first-attempt pass rate.
summarization, translation: Longest common subsequence (LCS) between output and reference, as a proxy for content coverage.
refusal, safety: Fraction of prompts producing the expected refusal. The threshold is configurable.
data diversity, label balance: KL divergence from a reference distribution, normalized to [0, 1].
Example card: overall 80% · 36 prompts
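As one concrete instance of the methods listed above, the LCS-based score for summarization and translation can be sketched as follows. The tokenization (lowercased whitespace split) and the normalization (LCS length divided by reference length, scaled to 0 to 100) are assumptions; the source says only that LCS between output and reference is used as a proxy for content coverage. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_coverage(output, reference):
    """Share of reference tokens recovered in order by the output, on a 0-100 scale.

    ASSUMPTION: recall-style normalization by reference length.
    """
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    return 100.0 * lcs_length(out_tokens, ref_tokens) / len(ref_tokens)


coverage = lcs_coverage("the cat sat on the mat", "the cat is on the mat")
```

Normalizing by reference length rewards coverage and does not penalize extra output tokens; a precision-style or F-measure variant would behave differently, which is one reason the method is recorded alongside the score.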
Reading results
Scores are relative to the generated prompt suite, so they are not directly comparable to published leaderboard numbers unless you explicitly request a named standardized benchmark. The most reliable use is within-session comparison: run the same request against two models or checkpoints and compare rank order, not absolute values.
A low score is a starting point, not a verdict. A reasoning score of 67% driven by spatial failures is a different problem from one driven by arithmetic failures. A follow-up benchmark scoped to the sub-type disambiguates in one additional request.
