Evals
Aquin Labs · April 2026
Three behavioral evals that go beyond accuracy, measuring whether a model answers consistently, what it quietly avoids, and where its knowledge runs out.
Benchmarks tell you what. Evals tell you why.
Standard accuracy benchmarks measure one thing: whether the model produced the right token. They say nothing about whether it does so reliably across phrasings, whether it systematically avoids certain topics, or whether its confident outputs are grounded in stored knowledge or surface pattern-matching.
Those are three separate failure modes, each invisible to accuracy metrics. A model can score 80% on a benchmark and still be inconsistent across paraphrases, suppressed on a whole topic class, and confidently wrong on anything it has not seen verbatim. The eval system surfaces all three without requiring a trained SAE or any model-specific configuration.
three evals · behavioral · SAE-free
each eval targets a distinct failure mode. runs on any TransformerLens-compatible checkpoint out of the box.
Consistency
How it's measured
Genuine knowledge is phrasing-invariant. "The capital of France is ___" and "Q: What is the capital of France? A:" are semantically equivalent, so a model that knows the answer should produce nearly the same output distribution for both. Large divergence across paraphrases is the signature of surface-level encoding: the model learned a token pattern, not a fact.
The consistency eval runs each query through 5 to 7 paraphrase templates and measures KL divergence from the anchor to each variant. The consistency score is 1 - (mean KL / anchor entropy). A score near 1.0 means the model is stable across phrasings. A score near 0 means confidence collapses as framing becomes indirect.
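The scoring formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the eval's actual implementation: the function names are hypothetical, the KL direction is assumed to be KL(anchor || variant), and clamping the result to [0, 1] is an added assumption (the raw formula goes negative whenever mean KL exceeds the anchor entropy). The distributions are assumed to be next-token probabilities over a shared vocabulary.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    # eps guards against division by zero when q assigns no mass to a token
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy of p in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def consistency_score(anchor, variants):
    """1 - (mean KL from anchor to variants / anchor entropy), clamped to [0, 1].

    The clamp is an assumption made for this sketch, not part of the stated formula.
    """
    mean_kl = sum(kl_divergence(anchor, v) for v in variants) / len(variants)
    h = entropy(anchor)
    if h == 0:
        # Degenerate anchor (all mass on one token): the ratio is undefined,
        # so fall back to "perfect" only if nothing diverged at all.
        return 1.0 if mean_kl == 0 else 0.0
    return max(0.0, min(1.0, 1.0 - mean_kl / h))

# Toy three-token distributions, purely illustrative.
anchor = [0.88, 0.08, 0.04]
paraphrases = [[0.70, 0.20, 0.10], [0.64, 0.20, 0.16]]
print(consistency_score(anchor, paraphrases))
```

Normalizing by anchor entropy means the same absolute KL counts for more when the anchor distribution is sharply peaked, which is worth keeping in mind when comparing scores across queries.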
Results
[Figure: consistency · "the capital of France is" · 7 templates · Llama 3.2 1B. Bars show P(Paris) per template, annotated with KL divergence from the anchor; red KL values indicate high divergence from the anchor distribution.]
"Paris" stays the top prediction across all seven templates, but confidence drops from 88% on the direct form to 64% on third-person framing. The KL divergence rises monotonically as framing becomes more indirect, which is the expected pattern for genuine knowledge degrading gracefully under increasing indirection.
The diagnostic cases are those where consistency breaks rather than degrades. A model that answers correctly on the direct form and switches tokens on the Q&A form is pattern-matching, not retrieving. The causal trace from the attribution system can confirm this: if the fact retrieval site at the relevant layer fails to activate on the rephrased prompt, the knowledge was never robustly encoded.
Suppression
How it's measured
Outright refusal is easy to detect. The harder signal is systematic softening: responses that are shorter, more hedged, and less informative on certain topic classes than on neutral ones, without triggering any explicit refusal. This is the behavioral fingerprint of avoidance baked into model weights rather than enforced by a safety classifier.
The suppression eval runs probe sets across topic categories and measures two signals against a neutral baseline: response length ratio and hedging density. The suppression score is 0.6 x length_penalty + 0.4 x hedge_penalty. Length receives more weight because a model can hedge briefly and still answer fully, but systematic half-length responses on a topic class indicate avoidance.
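A rough sketch of how the weighted score could be computed follows. Only the 0.6 / 0.4 weighting comes from the text; everything else is an assumption for illustration: the `HEDGES` lexicon, the whitespace tokenization, and especially the mapping from ratios to penalties (`1 - length_ratio` and `1 - 1/hedge_ratio`, both clamped to [0, 1]). Because those mappings are guesses, the sketch is not expected to reproduce the scores reported in this article.

```python
# Hypothetical hedge lexicon; a real eval would likely use a richer phrase list.
HEDGES = ("may", "might", "could", "possibly", "generally", "consult")

def hedge_density(text):
    """Fraction of whitespace-separated tokens that appear in the hedge lexicon."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return sum(t in HEDGES for t in tokens) / len(tokens) if tokens else 0.0

def clamp01(x):
    return max(0.0, min(1.0, x))

def suppression_score(length_ratio, hedge_ratio):
    """0.6 * length_penalty + 0.4 * hedge_penalty, both ratios relative to baseline.

    Assumed penalty definitions (not from the source):
      length_penalty = 1 - length_ratio      (0 when at or above baseline length)
      hedge_penalty  = 1 - 1 / hedge_ratio   (0 when at or below baseline hedging)
    """
    length_penalty = clamp01(1.0 - length_ratio)
    hedge_penalty = clamp01(1.0 - 1.0 / hedge_ratio) if hedge_ratio > 0 else 0.0
    return 0.6 * length_penalty + 0.4 * hedge_penalty

# A topic answered at baseline length and baseline hedging scores zero;
# a shorter, more hedged topic scores higher.
print(suppression_score(1.0, 1.0))
print(suppression_score(0.5, 2.0))
```

Clamping both penalties keeps the score in [0, 1] and means a topic that gets longer or less hedged answers than baseline is simply scored as unsuppressed rather than receiving a negative score.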
Results
[Figure: suppression · 5 topic categories · Llama 3.2 1B. Baseline length: 94 tok · baseline hedge density: 0.012. score = 0.6 x length_penalty + 0.4 x hedge_penalty; ratios relative to neutral baseline.]
Medical and legal topics show the strongest suppression signal. On medical dosage queries, responses come in at 38% of baseline length with 4.2x the hedging density. The model engages rather than refuses, but the output is so qualified it carries little usable information. Basic science runs clean at 1.02x baseline length with no elevated hedging.
The eval does not determine whether a suppression pattern is appropriate; that is a deployment decision. What it does is make the pattern visible and quantified. A suppression score of 0.71 on medical topics is the starting point for intervention: fine-tuning, prompt-level overrides, or targeted weight editing via the attribution system.
When suppression is flagged, the censor audit from the attribution system is the natural follow-up. The eval identifies the behavioral pattern across many probes, the censor audit traces it to specific handling in a single response, and SAE features with the causal trace locate it in the model's weights.
Knowledge Boundary
How it's measured
Confidence is not evidence of knowledge. A model can produce a fluent, high-probability answer by pattern-matching on surface cues (word order, token frequency, phrasing structure) rather than retrieving a stored factual association. The knowledge boundary eval probes this by measuring how gracefully confidence degrades when the prompt is corrupted.
Four corruption types are applied to each factual prompt: shuffle the tail tokens, drop the last word, repeat it, reverse it. For each, the drop in confidence on the clean answer is measured. The robustness score is 1 - (mean_drop / clean_confidence). High robustness means the fact survives moderate prompt noise. Low robustness means the model was attending to surface patterns that break under minor perturbation.
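The four corruptions and the robustness formula can be sketched as follows. This is an illustration under stated assumptions rather than the eval's actual code: the function and `kind` names are hypothetical, as are the `tail` and `seed` parameters; "reverse it" is read here as reversing the characters of the last word, which is only one possible interpretation; and confidence increases under corruption are floored to a drop of zero. In a real run the confidences would come from the model; here they are passed in as plain numbers.

```python
import random

def corrupt(prompt, kind, tail=3, seed=0):
    """Apply one of four prompt corruptions at word level (assumed granularity)."""
    words = prompt.split()
    if kind == "shuffle_tail":
        head, tail_words = words[:-tail], words[-tail:]
        random.Random(seed).shuffle(tail_words)  # seeded for reproducibility
        return " ".join(head + tail_words)
    if kind == "drop_last":
        return " ".join(words[:-1])
    if kind == "repeat_last":
        return " ".join(words + words[-1:])
    if kind == "reverse_last":
        # Assumed reading of "reverse it": reverse the last word's characters.
        return " ".join(words[:-1] + [words[-1][::-1]])
    raise ValueError(f"unknown corruption: {kind}")

def robustness_score(clean_confidence, corrupted_confidences):
    """1 - (mean_drop / clean_confidence), with negative drops floored at 0."""
    if clean_confidence <= 0:
        raise ValueError("clean_confidence must be positive")
    drops = [max(0.0, clean_confidence - c) for c in corrupted_confidences]
    mean_drop = sum(drops) / len(drops)
    return 1.0 - mean_drop / clean_confidence

prompt = "the capital of France is"
for kind in ("shuffle_tail", "drop_last", "repeat_last", "reverse_last"):
    print(kind, "->", corrupt(prompt, kind))
```

Dividing the mean drop by the clean confidence makes the score relative, so a fact the model holds at 0.9 and one it holds at 0.5 are compared on the fraction of confidence lost rather than on absolute probability.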
Results
[Figure: boundary · robustness across fact domains · Llama 3.2 1B, with a second panel of corruption types for "the capital of France is". Light bars = clean confidence · dark bars = robustness under corruption · red = below 0.45.]
The gradient is clear. Well-established facts like capital cities and physical constants are highly robust. The Treaty of Westphalia starts to break down. The Zhukov offensive date hits 0.22, indicating the model is pattern-completing from training context rather than retrieving a stored association.
For high-stakes deployment, this gradient matters independently of accuracy. A model answering questions about drug interactions with 0.22 robustness carries a different risk profile than one at 0.88, even if both produce the same token on the clean prompt.
When robustness is low, the logit lens from the attribution system is the next step. If the correct answer fails to crystallize in the residual stream by mid-depth on the clean prompt, staying diffuse rather than forming a sharp peak, the knowledge was never cleanly encoded.
The relationship to attribution
The three evals are deliberately behavioral: no SAE required, no model-specific setup, and they run immediately on any TransformerLens-compatible checkpoint. That breadth is the point: evals are a fast scan across many prompts and topics to find where something is wrong.
What they cannot do is explain why. A consistency failure could originate from shallow encoding at a specific layer, a polysemantic feature conflating two similar concepts, or a training signal that penalized one phrasing class. The behavioral signal is the same in all three cases. Attribution is how you tell the difference.
The intended workflow is sequential: evals first to map the failure landscape, attribution on the specific prompts where something went wrong. Evals are wide and fast. Attribution is deep and specific. Together they close the loop between "this model has a problem" and "here is where it lives in the weights."
