Training

Aquin Labs · April 2026

Real-time signal detection, behavioral before/after comparison, and per-layer feature diffs: everything the loss curve does not show you.

What the loss curve does not show

A fine-tuning run has structure beneath the loss curve: gradient dynamics that reveal how information flows through the network, attention heads that can collapse silently, feature activations that shift as the model rewrites internal representations to accommodate the training objective. Almost none of that structure is visible from loss alone.

The training inspect system surfaces it in real time. A step event stream feeds loss, learning rate, gradient norms, per-layer breakdown, dead layer list, and epoch index into a signal engine that runs on each step as it arrives. When the engine detects a gradient spike, a loss plateau, a collapsed attention head, or the onset of loss divergence, a signal fires immediately with the specific metric and the exact step.

When training completes, the dashboard computes a model diff and a per-layer SAE feature diff, showing not just how loss moved, but which behaviors changed and which internal representations were rewritten.

Figure: loss and max grad norm · 20-step window

signal markers overlay each curve at the step they fired. plateau on loss at s16, grad spike at s11.

What gets streamed

The system is agnostic to training framework. It consumes a flat step event schema: step index, loss, learning rate, max gradient norm, per-layer grad norms as a record, dead layer list, epoch index. No nested objects, no optional deep structures.

The per-layer grad norm breakdown is what enables the dead layer and attention collapse detectors. Without it, the engine observes only aggregate gradient behavior. With it, it names the specific layer that has collapsed and tracks how long it has been dead.

StepSnapshot schema

field · type · description
step · number · Step index
loss · number · Training loss at this step
learning_rate · number? · Current LR from scheduler
maxGrad · number? · Max gradient norm across all params
gradNorms · Record<string, number>? · Per-layer grad norms; enables dead layer detection
deadLayers · string[]? · Layers already over streak threshold
epoch · number? · Current epoch, used in plateau message

gradNorms is the key field. without per-layer breakdown, dead layer and attention head detection are unavailable.
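As a rough illustration, the schema above maps onto a TypeScript type as follows. The field names and optionality come from the table; the `isStepSnapshot` guard, the example values, and the layer names in `gradNorms` are hypothetical and only show the flat shape of one event.

```typescript
// Sketch of the step event as a TypeScript type. Field names and optionality
// follow the schema table; everything else here is illustrative.
interface StepSnapshot {
  step: number;                        // step index
  loss: number;                        // training loss at this step
  learning_rate?: number;              // current LR from the scheduler
  maxGrad?: number;                    // max gradient norm across all params
  gradNorms?: Record<string, number>;  // per-layer gradient norms
  deadLayers?: string[];               // layers already over the streak threshold
  epoch?: number;                      // current epoch
}

// Hypothetical guard: checks only the two required fields.
function isStepSnapshot(value: unknown): value is StepSnapshot {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v["step"] === "number" && typeof v["loss"] === "number";
}

// Example event with made-up values and made-up layer names.
const exampleStep: StepSnapshot = {
  step: 11,
  loss: 1.92,
  learning_rate: 2e-5,
  maxGrad: 7.4,
  gradNorms: { "layers.6.mlp": 0.8, "layers.6.attn": 0.3 },
  epoch: 0,
};
```

Because every optional field is a primitive, a flat record, or a string list, an event can be serialized as a single JSON line by any training framework.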

The signal engine

Five detectors

The signal engine is a pure function that runs on each new step snapshot. It takes the full step history plus two persistent streak maps, one for non-attention layers and one for attention layers, and returns a signal if one fired, or null. The streak maps are the only stateful part: they persist across steps so that dead layer detection can track how many consecutive steps a layer has had near-zero gradient.

loss diverging · critical

Ten consecutive steps of non-decreasing loss. The raw rise across that window is computed; critical fires when the delta exceeds 0.5. This is the earliest reliable sign of a diverging run, before the loss curve makes it visually obvious.

loss[i] >= loss[i-1] for 10 steps · rise > 0.5 → critical
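The divergence rule can be sketched in a few lines. This is a minimal sketch, not the engine's actual code: the function name is hypothetical, and it assumes "10 steps" means ten consecutive step-to-step comparisons, so eleven loss values are needed before the check can fire.

```typescript
// Sketch of the loss-divergence rule. Assumption: `window` counts
// step-to-step comparisons, so window + 1 loss values are inspected.
function lossDiverging(
  losses: number[],
  window = 10,
  criticalRise = 0.5,
): boolean {
  if (losses.length < window + 1) return false;
  const recent = losses.slice(-(window + 1));
  // Every step in the window must be at least as high as the one before it.
  for (let i = 1; i < recent.length; i++) {
    if (recent[i] < recent[i - 1]) return false;
  }
  // Raw rise across the window, compared against the critical threshold.
  const rise = recent[recent.length - 1] - recent[0];
  return rise > criticalRise;
}
```

A single dip anywhere in the window resets the check, which is what keeps ordinary noisy runs from triggering it.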

gradient spike · warn / critical

Max grad norm versus the rolling mean of the last 20 steps. Fires when the latest norm exceeds five times the baseline and is above 1.0 in absolute terms. Consecutive spikes indicate the optimizer is stepping into a region it cannot navigate cleanly.

maxGrad > 5x rolling mean AND > 1.0 · > 20x rolling mean → critical
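A minimal sketch of the spike rule follows. The function name is hypothetical, and one detail is an assumption: the baseline is taken as the mean of up to 20 steps before the latest one, so the spike itself does not inflate its own baseline.

```typescript
type SpikeSeverity = "warn" | "critical";

// Sketch of the gradient-spike rule. Assumption: the rolling mean excludes
// the latest value and uses up to `window` preceding steps.
function gradientSpike(maxGrads: number[], window = 20): SpikeSeverity | null {
  if (maxGrads.length < 2) return null;
  const latest = maxGrads[maxGrads.length - 1];
  const prior = maxGrads.slice(-(window + 1), -1);
  const baseline = prior.reduce((a, b) => a + b, 0) / prior.length;
  // The absolute floor of 1.0 keeps tiny norms from counting as spikes.
  if (latest <= 1.0) return null;
  if (latest > 20 * baseline) return "critical";
  if (latest > 5 * baseline) return "warn";
  return null;
}
```

The absolute floor matters late in training, when the baseline can be so small that a harmless fluctuation is many multiples of it.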

attention head dead · warn

Attention layers with gradient norms below 1e-6 for five consecutive steps. Attention collapse is mechanistically distinct from MLP layer death: a collapsed head may still produce outputs but has stopped differentiating across positions.

gradNorm[attn] < 1e-6 for 5 steps · fires on fifth consecutive step

dead layers · warn

Non-attention layers with gradient norms below 1e-6 for five consecutive steps. The signal names the specific layers, which are candidates for pruning or weight reinitialization.

gradNorm[layer] < 1e-6 for 5 steps · fires on fifth consecutive step
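Both streak-based detectors share the same bookkeeping, sketched here under stated assumptions. `updateStreaks` is a hypothetical helper operating on one streak map; the layer names are made up. The engine described above would keep two such maps, one for attention layers and one for everything else.

```typescript
// Sketch of streak tracking for one streak map. Returns the layers whose
// streak reaches the threshold on this exact step.
function updateStreaks(
  streaks: Map<string, number>,
  gradNorms: Record<string, number>,
  eps = 1e-6,
  threshold = 5,
): string[] {
  const newlyDead: string[] = [];
  for (const [layer, norm] of Object.entries(gradNorms)) {
    // A near-zero norm extends the streak; anything else resets it.
    const streak = norm < eps ? (streaks.get(layer) ?? 0) + 1 : 0;
    streaks.set(layer, streak);
    if (streak === threshold) newlyDead.push(layer);
  }
  return newlyDead;
}

// Usage with made-up layer names: one layer stays at near-zero gradient.
const mlpStreaks = new Map<string, number>();
const firedPerStep: string[][] = [];
for (let s = 0; s < 6; s++) {
  firedPerStep.push(
    updateStreaks(mlpStreaks, { "layers.6.mlp": 1e-9, "layers.7.mlp": 0.2 }),
  );
}
```

Comparing with `===` rather than `>=` means a layer is reported once, on the fifth consecutive step, while the map keeps counting so the length of the dead streak remains available.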

loss plateau · info

Rolling variance over the last twenty steps divided by the squared rolling mean. When that ratio falls below 0.001, meaning the variance is under 0.1% of the squared mean, the signal fires with the current epoch so you can judge whether this is healthy convergence or premature stalling.

var(loss[-20:]) < mean^2 x 0.001 · always info, optionally triggers early stop
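The plateau rule is a scale-free variance test, sketched below. The function name is hypothetical, and the sketch assumes population variance (dividing by the window size); the source does not say which variance estimator is used.

```typescript
// Sketch of the plateau rule: variance of the last `window` losses compared
// against a fraction of the squared mean. Assumption: population variance.
function lossPlateau(losses: number[], window = 20, ratio = 0.001): boolean {
  if (losses.length < window) return false;
  const recent = losses.slice(-window);
  const mean = recent.reduce((a, b) => a + b, 0) / window;
  const variance =
    recent.reduce((a, x) => a + (x - mean) * (x - mean), 0) / window;
  return variance < mean * mean * ratio;
}
```

Dividing by the squared mean makes the threshold independent of the loss scale, so the same 0.001 ratio applies whether the loss sits near 3.0 or near 0.03.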

Priority and cooldown

Loss divergence and gradient spikes are checked first because they indicate active instability that may warrant stopping the run. Dead layer and attention collapse are checked next, naming the specific failed components. Loss plateau is last. It frequently describes healthy convergence rather than a problem, and its priority reflects that.

A 30-step cooldown prevents the same signal type from re-emitting continuously. A gradient spike that resolves and then recurs can fire again once 30 steps have passed; the second occurrence is a distinct event with its own context.
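Priority and cooldown can be combined in one selection step, sketched here. All identifiers are hypothetical. Two details are assumptions: the ordering within each priority tier, and the choice to let a lower-priority signal through when a higher-priority one is still cooling down.

```typescript
type SignalType =
  | "loss_diverging"
  | "gradient_spike"
  | "dead_layers"
  | "attention_head_dead"
  | "loss_plateau";

// Check order from the text: instability first, failed components next,
// plateau last. Order within each tier is an assumption.
const PRIORITY: SignalType[] = [
  "loss_diverging",
  "gradient_spike",
  "dead_layers",
  "attention_head_dead",
  "loss_plateau",
];

// Picks the highest-priority detected signal that is not in cooldown and
// records the step at which it fired.
function pickSignal(
  detected: Set<SignalType>,
  step: number,
  lastFired: Map<SignalType, number>,
  cooldown = 30,
): SignalType | null {
  for (const type of PRIORITY) {
    if (!detected.has(type)) continue;
    const last = lastFired.get(type);
    if (last !== undefined && step - last < cooldown) continue;
    lastFired.set(type, step);
    return type;
  }
  return null;
}
```

Keeping the cooldown per signal type, rather than global, means a spike at one step does not silence a dead layer report a few steps later.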

The model diff

Behavioral delta

Three behavioral scores describe how the fine-tune changed the model from the outside: consistency score, suppression score, and robustness score. These are the same metrics from the eval system, applied to the base-vs-fine-tuned comparison. The base model is the reference; the fine-tuned checkpoint is the subject. The difference is the behavioral delta the training objective produced.

The robustness score is the most informative signal for factual fine-tuning. A fine-tune intended to reinforce factual knowledge should produce higher robustness on those facts. A robustness drop on target facts after factual fine-tuning means the model learned a surface pattern rather than a grounded representation.

model diff · base vs fine-tuned

score · base · fine-tuned · delta
consistency · 0.73 · 0.87 · +0.14
suppression · 0.70 · 0.61 · -0.09
robustness · 0.67 · 0.74 · +0.07

green = improved, red = regressed relative to base. same metrics as the eval system.
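The delta itself is a per-score subtraction, fine-tuned minus base. The sketch below uses the figure's values; the type and function names are hypothetical, and rounding to two decimals is an assumption made to match how the figure displays its numbers.

```typescript
interface BehavioralScores {
  consistency: number;
  suppression: number;
  robustness: number;
}

// Sketch: fine-tuned minus base for each score, rounded to two decimals.
function behavioralDelta(
  base: BehavioralScores,
  tuned: BehavioralScores,
): BehavioralScores {
  const round2 = (x: number) => Math.round(x * 100) / 100;
  return {
    consistency: round2(tuned.consistency - base.consistency),
    suppression: round2(tuned.suppression - base.suppression),
    robustness: round2(tuned.robustness - base.robustness),
  };
}

// Values taken from the model diff figure.
const figureDelta = behavioralDelta(
  { consistency: 0.73, suppression: 0.7, robustness: 0.67 },
  { consistency: 0.87, suppression: 0.61, robustness: 0.74 },
);
// figureDelta is { consistency: 0.14, suppression: -0.09, robustness: 0.07 }
```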

The SAE feature diff

Layer change density

Behavioral scores describe the model from the outside. The SAE feature diff describes what changed internally. For each layer, the diff reports how many features shifted activation between base and fine-tuned, the mean absolute activation delta, and the single feature with the highest delta.

Layer-level change density is the most informative aggregate. A fine-tune that changes 14 of 512 features at L8 and 2 of 512 at L4 is making a focused, deep rewrite. The top feature per layer is where mechanistic investigation should start. If L10's top shifted feature is F501 (refusal / safety language) and the training data had no refusal content, that warrants investigation in the model inspector.

SAE feature diff · changed features per layer · blue cells = shifted

layer · changed · mean delta · top feature
L4 · 2 / 512 · 0.004 · F412 (punctuation / sentence boundary)
L6 · 8 / 512 · 0.012 · F089 (hedging / uncertainty markers)
L8 · 14 / 512 · 0.031 · F213 (geographic reference tracking)
L10 · 5 / 512 · 0.014 · F501 (refusal / safety language)
L12 · 6 / 512 · 0.019 · F047 (capital city associations)
L14 · 3 / 512 · 0.009 · F091 (factual recall trigger)

each row is one layer. each cell is one SAE feature. blue = activation shifted post fine-tune. L8 carries the heaviest rewrite.
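The three per-layer quantities (changed count, mean absolute delta, top feature) can be sketched as one pass over a layer's feature activations. Everything here is an assumption for illustration: the function and type names, the input format (one mean activation per feature for each model), the `shiftThreshold` that decides when a feature counts as shifted, and averaging the delta over all features rather than only the shifted ones.

```typescript
interface LayerDiff {
  changed: number;       // features whose |delta| exceeds the threshold
  total: number;         // features in the layer's SAE dictionary
  meanAbsDelta: number;  // mean |delta| over all features (assumption)
  topFeature: number;    // index of the feature with the largest |delta|
}

// Sketch of a per-layer feature diff between base and fine-tuned.
function layerFeatureDiff(
  base: number[],
  tuned: number[],
  shiftThreshold = 0.05, // hypothetical cutoff for "shifted"
): LayerDiff {
  if (base.length === 0 || base.length !== tuned.length) {
    throw new Error("base and tuned must be non-empty and the same length");
  }
  let changed = 0;
  let sumAbs = 0;
  let topFeature = 0;
  let topDelta = -1;
  for (let f = 0; f < base.length; f++) {
    const delta = Math.abs(tuned[f] - base[f]);
    sumAbs += delta;
    if (delta > shiftThreshold) changed++;
    if (delta > topDelta) {
      topDelta = delta;
      topFeature = f;
    }
  }
  return { changed, total: base.length, meanAbsDelta: sumAbs / base.length, topFeature };
}
```

Running this once per layer yields one row of the diff; sorting rows by `changed / total` gives the change density ranking.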

The regression tracker

A single model diff shows how one fine-tune changed behavior relative to base. The regression tracker extends this across runs: every time a model diff arrives, category scores are appended to a per-category history so behavior can be tracked across all completed runs in the session.

A category that regresses more than five percentage points on the latest run is flagged. Detection is relative to the immediately prior run, not to the base. A score can look healthy against the base model while trending negatively across iterations. The tracker catches that drift where the raw diff cannot.

regression tracker · category score across 4 runs

category · score · flag
factual · 72%
reasoning · 70%
refusal · 71% · down
code · 66%

each point is one completed run. red = category score regressed vs prior run.
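The tracker's append-and-compare logic can be sketched as a single function. The name `recordRun` and the scores in the usage example are made up; scores are assumed to be percentages, so the threshold of 5 is in percentage points.

```typescript
// Sketch of the regression tracker. Appends the run's scores to the
// per-category history and returns the categories that dropped more than
// `thresholdPoints` relative to the immediately prior run.
function recordRun(
  scoreHistory: Map<string, number[]>,
  run: Record<string, number>,
  thresholdPoints = 5,
): string[] {
  const regressed: string[] = [];
  for (const [category, score] of Object.entries(run)) {
    const past = scoreHistory.get(category) ?? [];
    const prior = past[past.length - 1]; // undefined on the first run
    if (prior !== undefined && prior - score > thresholdPoints) {
      regressed.push(category);
    }
    scoreHistory.set(category, [...past, score]);
  }
  return regressed;
}

// Usage with made-up scores: refusal drops 7 points between runs.
const sessionScores = new Map<string, number[]>();
const firstFlags = recordRun(sessionScores, { factual: 70, refusal: 78 });
const secondFlags = recordRun(sessionScores, { factual: 72, refusal: 71 });
```

Comparing against the prior run rather than the base is what lets a slow slide show up even while every run still beats the base model.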

Confidence calibration

A model's stated confidence and its actual accuracy can diverge in ways invisible from loss alone. A fine-tune can lower loss while making the model systematically overconfident. Expected calibration error (ECE) measures that gap directly: it bins outputs by stated confidence, computes accuracy within each bin, and reports the mean gap between the two.
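The binning procedure can be sketched as follows. This uses the standard definition of expected calibration error, in which each bin's gap is weighted by its share of samples; whether the calibration panel weights bins the same way is an assumption, as are the function name and the choice of ten equal-width bins.

```typescript
// Sketch of expected calibration error with equal-width confidence bins.
// Assumes confidences lie in [0, 1]. Each bin's |accuracy - confidence|
// gap is weighted by the fraction of samples that fall in the bin.
function expectedCalibrationError(
  confidences: number[],
  correct: boolean[],
  numBins = 10,
): number {
  const n = confidences.length;
  if (n === 0 || n !== correct.length) {
    throw new Error("inputs must be non-empty and the same length");
  }
  const count = new Array<number>(numBins).fill(0);
  const confSum = new Array<number>(numBins).fill(0);
  const hitSum = new Array<number>(numBins).fill(0);
  for (let i = 0; i < n; i++) {
    // A confidence of exactly 1.0 goes into the last bin.
    const b = Math.min(numBins - 1, Math.floor(confidences[i] * numBins));
    count[b] += 1;
    confSum[b] += confidences[i];
    hitSum[b] += correct[i] ? 1 : 0;
  }
  let ece = 0;
  for (let b = 0; b < numBins; b++) {
    if (count[b] === 0) continue;
    const gap = Math.abs(hitSum[b] / count[b] - confSum[b] / count[b]);
    ece += (count[b] / n) * gap;
  }
  return ece;
}
```

For example, four answers given at 0.9 confidence with three correct land in one bin with accuracy 0.75, giving a gap of about 0.15.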

The calibration panel runs this comparison between base and fine-tuned using the training dataset as the evaluation set. The reliability diagram shows both models' accuracy-per-confidence-bin as bar pairs against a perfect-calibration diagonal. The per-topic ECE table breaks the aggregate down by category. Models trained on domain-specific data frequently improve ECE on the target domain while degrading it on adjacent topics that share surface patterns with the training examples.

The low-confidence row list surfaces inputs where the fine-tuned model assigns probability below a configured threshold. These rows are exportable directly as a labeled dataset for the next training iteration. The model's own uncertainty becomes the selection criterion for the data that trains the next version.
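Selecting the low-confidence rows is a threshold filter, sketched below. The `ScoredRow` shape and the function name are hypothetical, and sorting the result from least to most confident is an added convenience, not something the source describes.

```typescript
interface ScoredRow {
  input: string;       // the evaluated input text
  confidence: number;  // probability the fine-tuned model assigned
}

// Sketch: keep rows strictly below the threshold, least confident first.
// filter() returns a new array, so the caller's rows are not reordered.
function lowConfidenceRows(rows: ScoredRow[], threshold: number): ScoredRow[] {
  return rows
    .filter((row) => row.confidence < threshold)
    .sort((a, b) => a.confidence - b.confidence);
}
```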

calibration · reliability diagram + per-topic ECE

topic · base ECE · fine-tuned ECE
science · 0.120 · 0.060
history · 0.190 · 0.090
math · 0.080 · 0.040
coding · 0.210 · 0.110
medicine · 0.310 · 0.170
law · 0.270 · 0.190

left bar = base ECE per bucket, right bar = fine-tuned. green = under-confident, red = over-confident vs perfect diagonal.

Training as the start of the investigation

Each finding from the training run is an entry point into a deeper investigation, not a terminal result. A dead layer signal at L6 step 61 is most usefully followed up by opening the fine-tuned checkpoint in the Model Inspector, going directly to L6, and running the causal trace to confirm whether that layer still contributes to outputs.

A suppression score that rises from base to fine-tuned opens a data investigation: the training dataset can be opened in the Data Inspector and the toxicity and PII modules run against the columns most likely to produce hedging signal.

The SAE feature diff provides the entry point for mechanistic investigation. Once the features that shifted most, and the layers where they shifted, are identified, the Model Inspector can be navigated directly to those features, their benchmark scores checked, the logit lens run, and steering applied to confirm their role. The diff turns post-training inspection from an open-ended search into a targeted inquiry.

The calibration panel adds a third path out of the training run: low-confidence rows are exported as a labeled dataset for the next iteration. The regression tracker closes the loop in the other direction, confirming the next iteration did not trade one weakness for another. Together they make the training session the input to the next investigation rather than the end of one.

Aquin Labs · aquin@aquin.app