Transformers & LLMs
Aquin Labs · May 2026
Aquin supports the full transformer family: dense LLMs and Mixture-of-Experts models. Every tool in the platform is architecture-aware from the moment you load a model.
One platform, every transformer architecture
Most LLM tooling is built around a single architecture class. Interpretability libraries assume dense models. Fine-tuning frameworks add MoE support as an afterthought. Evaluation pipelines treat model architecture as invisible.
When you load a model, Aquin detects whether it is a dense LLM or a Mixture-of-Experts model. Attribution, training monitoring, evals, benchmarks, and security analysis all adapt to what they are analyzing. A Mixtral run and a Llama run go through the same interface and the same tool set. The platform handles the architectural differences internally.
supported architecture families · full tool availability
tool availability · dense vs MoE · what changes between architectures
every tool is available on both. the MoE column shows how each tool adapts, not what it loses.
Architecture support
The transformer is the shared foundation. Every model Aquin supports, dense or sparse, is built on the same core building blocks: multi-head self-attention, a residual stream that carries information across layers, layer normalization, and positional embeddings. The architecture families differ in what happens in the feed-forward sublayer and how those sublayers are connected, not in the attention mechanism itself.
Attribution, SAE analysis, attention inspection, and training signals all hook into the parts of the transformer that are universal. The feed-forward layer is where the families diverge, and Aquin handles each variant at that layer specifically, without changing the interface or the output format.
Dense LLMs
A dense LLM is a standard transformer where every token activates every parameter on every forward pass. All attention heads and all feed-forward neurons run regardless of input. Dense models are the baseline of the transformer family. Llama, Mistral, Phi, Falcon, Qwen, GPT-2, OPT, and Pythia all fall here.
Every Aquin tool was built natively on dense LLMs. The residual stream is a single coherent vector at each layer, attribution runs across the full MLP and attention stack, and the SAE is trained on residual stream activations at a selected peak layer. Fine-tuning runs that use LoRA, QLoRA, or full-parameter updates all go through the same training monitor setup.
dense LLM support · tested families and variants
instruct variants load with the correct chat template automatically. base models load without a template applied.
Mixture-of-Experts LLMs
Mixture-of-Experts models replace dense feed-forward sublayers with a pool of N expert networks and a learned expert router. Each token's hidden state passes through the router, which selects k experts to process it: typically 2 of 8 in Mixtral-style configurations, or a small subset of 64 or more in DeepSeek-style configurations. Only those experts run for that token. Mixtral 8x7B, for example, stores roughly 46B parameters in total but activates only about 12B of them per token, so it has roughly the inference compute cost of a 12B dense model and the stored capacity of a 46B one.
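The routing step described above can be sketched for a single token in plain Python. This is a minimal illustration, not the router of any specific model: the function name `route_token`, the choice to renormalise with a softmax over only the selected experts, and the toy logits are all assumptions.

```python
import math

def route_token(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalise their weights.

    router_logits: one router score per expert for a single token.
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (max subtracted for stability).
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Toy example (hypothetical values): 8 experts, top-2 routing.
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 1.2]
selected = route_token(logits, k=2)
```

Only the experts returned in `selected` would run for this token; their outputs are combined using the returned weights, which is why per-token compute scales with k rather than with N.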
Attention layers in MoE models are almost always dense; the sparsity is confined to the feed-forward sublayers. All of Aquin's attention-level tools therefore apply identically to MoE attention blocks. The additional MoE-specific signals (expert load balance, router assignment distribution, and per-expert gradient norms) are tracked automatically when a sparse layer is detected in the loaded model.
MoE layer · how Aquin hooks into a sparse transformer block
Aquin hooks on the pre-router hidden state for SAE training and attribution. attention sublayers in MoE models are dense, so all attention tools apply without modification.
MoE support · tested families and routing configurations
routing configuration is detected automatically from model config. top-k and number of experts require no manual specification.
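As a rough sketch of what config-based detection could look like, the snippet below reads the expert count and top-k from a model config represented as a plain dict. The key names are modeled on fields commonly seen in Hugging Face-style configs for MoE families and should be treated as assumptions to verify against the model you load; `detect_architecture` is a hypothetical helper, not Aquin's API.

```python
# Assumed config keys; real configs vary by model family.
EXPERT_COUNT_KEYS = ("num_local_experts", "num_experts", "n_routed_experts")
TOP_K_KEYS = ("num_experts_per_tok",)

def detect_architecture(config: dict) -> dict:
    """Classify a config as dense or MoE and extract routing settings."""
    # First matching, non-empty key wins; None if no key is present.
    n_experts = next((config[k] for k in EXPERT_COUNT_KEYS if config.get(k)), None)
    top_k = next((config[k] for k in TOP_K_KEYS if config.get(k)), None)
    if n_experts and n_experts > 1:
        return {"family": "moe", "num_experts": n_experts, "top_k": top_k}
    return {"family": "dense", "num_experts": None, "top_k": None}

# Hypothetical configs for illustration.
mixtral_like = {"hidden_size": 4096, "num_local_experts": 8, "num_experts_per_tok": 2}
llama_like = {"hidden_size": 4096, "num_hidden_layers": 32}

moe_info = detect_architecture(mixtral_like)
dense_info = detect_architecture(llama_like)
```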
Model inspection signals
Attribution tells you which parts of a model's computation explain a specific output. Inspection signals go a level deeper, describing the structural and geometric properties of the model itself, independent of any particular prompt. These are the signals that tell you whether a model is healthy before you run a single eval, whether a fine-tune degraded its representations, and whether specific layers or heads have collapsed.
Every signal below runs on loaded checkpoints across all supported architectures. For MoE models, per-expert variants of the weight-level signals are available alongside the standard layer-level view.
inspection signals · all architectures
OOD similarity
OOD similarity measures how close a prompt's hidden-state geometry is to the geometry of in-distribution inputs. A prompt that lands far from the training distribution centroid in residual stream space is likely to produce unreliable outputs. The model is being asked to reason in a region it did not see during training.
Aquin computes OOD similarity by taking cosine distance between the prompt's residual stream at a configurable layer and the centroid of a reference batch. The score is computed pre-decode, before any output token is sampled. Prompts flagged as high-OOD are tagged in the session and their attribution results are annotated with a confidence warning.
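The centroid comparison can be sketched in a few lines. This is a minimal version under stated assumptions: each prompt is already reduced to one residual stream vector at the chosen layer (how tokens are pooled is not specified in this article), and the helper names and toy vectors are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors (the centroid)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def ood_score(prompt_state, reference_states):
    """Cosine distance between one prompt vector and the reference centroid."""
    return cosine_distance(prompt_state, mean_vector(reference_states))

# Toy 3-dimensional "residual stream" vectors (illustrative values only).
reference = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [1.1, -0.1, 0.0]]
near_score = ood_score([1.0, 0.05, 0.0], reference)
far_score = ood_score([0.0, 1.0, 0.0], reference)
```

A prompt would be tagged high-OOD when its score exceeds some cutoff; the cutoff is a tuning choice and no particular value is implied here.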
OOD similarity · residual stream projection
in-distribution inputs cluster near the centroid. OOD inputs sit outside the distribution boundary and are flagged before decoding starts.
Attention routing and head entropy
Attention routing analysis surfaces which positions each head attends to across a prompt. Sink tokens, diagonal copy patterns, and semantic-retrieval heads appear as distinct structures in the attention map. Aquin identifies these patterns automatically and labels them in the per-head breakdown.
Attention entropy adds a quantitative layer. Shannon entropy of the attention weight distribution tells you how spread or concentrated each head's attention is on a given input. A head with near-zero entropy on most inputs has collapsed: it attends to the same position regardless of input. These dead or degenerate heads are surfaced in a per-layer heatmap and can be examined individually.
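For reference, the entropy of one head can be computed as below. This is a small sketch assuming each attention row is a probability distribution over key positions; the function names and the two toy heads are illustrative.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row that sums to 1."""
    # Zero weights contribute nothing (limit of w * log w as w -> 0).
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def mean_head_entropy(rows):
    """Average entropy over all query positions of a single head."""
    return sum(attention_entropy(r) for r in rows) / len(rows)

# A head that spreads attention evenly over 4 positions...
uniform_head = [[0.25, 0.25, 0.25, 0.25]] * 3
# ...and a collapsed head that always attends to position 0.
collapsed_head = [[1.0, 0.0, 0.0, 0.0]] * 3

uniform_entropy = mean_head_entropy(uniform_head)
collapsed_entropy = mean_head_entropy(collapsed_head)
```

The maximum entropy for a row with n attendable positions is log(n), so with causal masking it can help to divide by log(n) before comparing rows of different lengths.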
attention entropy · per-head heatmap · 6 layers x 8 heads
entropy near zero (red) indicates a dead or collapsed head. the circled head at L4 H6 is flagged automatically.
Weight rank and SVD spectrum
Weight rank is the numerical rank of a projection matrix: how many linearly independent directions it actually uses. A full-rank W_up uses all d_model dimensions independently. A low-rank W_up has collapsed into a subspace. This is common after LoRA fine-tuning where the adapter rank is small, so the learned update spans only a few directions, or after aggressive post-training where the gradient signal pushes most directions toward zero.
The SVD spectrum makes the rank collapse visible. Aquin plots the full singular value distribution for any selected weight matrix and marks the effective rank at 90%, 95%, and 99% energy thresholds. A model with a sharp drop after the first few singular values is using far less capacity than its parameter count suggests, a signal invisible in loss curves but apparent in the spectrum.
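One way to compute an effective rank at an energy threshold is sketched below with NumPy. It assumes "energy" means the cumulative share of squared singular values, which is a common convention but not confirmed by this article; `effective_rank` and the toy matrix are hypothetical.

```python
import numpy as np

def effective_rank(weight, energy=0.90):
    """Smallest number of singular values whose squared sum reaches `energy`."""
    s = np.linalg.svd(weight, compute_uv=False)      # sorted, largest first
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)  # energy captured so far
    # First index where the cumulative energy reaches the threshold.
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(s))

# Toy matrix with two dominant directions and two near-dead ones.
collapsed = np.diag([10.0, 5.0, 0.1, 0.1])
ranks = {t: effective_rank(collapsed, t) for t in (0.90, 0.95, 0.99)}

# An identity matrix spreads energy evenly across all directions.
full = effective_rank(np.eye(4), 0.90)
```

Here the 4x4 toy matrix has a nominal rank of 4 but an effective rank of 2 at every threshold, which is the kind of gap the spectrum plot is meant to expose.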
SVD spectrum · W_up · layer 16 · singular values s0 to s19
indigo bars: dimensions within 90% energy threshold. grey bars: dimensions past it. sharp drop = rank collapse.
Activation geometry and intrinsic dimensionality
Activation geometry shows how the residual stream separates different concept groups across a prompt batch. Aquin runs PCA on a configurable batch of residual stream activations at each layer and projects the result to 2D. Clusters that are well-separated at early layers and merge at later ones are structurally meaningful. Centroid drift across layers shows how the model accumulates concept-specific information as depth increases.
Intrinsic dimensionality quantifies how compressed the representation actually is. If 95% of the variance in a layer's activations is explained by 12 of 4096 PCA components, that layer is effectively operating in a 12-dimensional subspace. Low intrinsic dimensionality is not inherently bad, since many tasks are genuinely low-dimensional, but a sudden drop after fine-tuning suggests the model is forgetting structure it had before.
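The variance-explained count can be sketched as follows, assuming a batch of activations shaped (samples, d_model) and PCA computed via an SVD of the centered batch. The function name, the synthetic data, and the 64-dimensional width are illustrative assumptions.

```python
import numpy as np

def intrinsic_dimensionality(activations, variance=0.95):
    """Number of PCA components needed to explain `variance` of the batch."""
    centered = activations - activations.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    # Squared singular values are proportional to per-component variance.
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return min(int(np.searchsorted(explained, variance)) + 1, len(s))

rng = np.random.default_rng(0)

# Synthetic activations that live in a 3-dimensional subspace of a
# 64-dimensional space, plus a small amount of noise.
latent = rng.normal(size=(200, 3))
projection = rng.normal(size=(3, 64))
activations = latent @ projection + 1e-3 * rng.normal(size=(200, 64))
low_dim = intrinsic_dimensionality(activations)

# Isotropic noise has no preferred directions and needs many components.
high_dim = intrinsic_dimensionality(rng.normal(size=(200, 64)))
```

Running the same function on a base and a fine-tuned checkpoint with the same prompt batch gives a per-layer before-and-after comparison.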
intrinsic dimensionality · variance explained by layer
L24 and L31 show lower intrinsic dimensionality than mid-depth layers, common in models where later layers compress to a narrow output subspace.
Attribution across architectures
Attribution runs the same pipeline on dense and MoE models. The causal trace patches the residual stream at each layer to locate the fact retrieval depth, a residual-stream operation that works identically regardless of whether the surrounding sublayers are dense or sparse. The SAE is applied at the peak layer: for MoE models, it is trained on the pre-router hidden state, capturing the full joint representation before the routing decision splits it.
The circuit graph, logit lens, and feature steering all operate on the residual stream and are architecture-agnostic. The only difference in an MoE attribution run is that the SAE hooks at the pre-router position rather than at a standard MLP output.
attribution features · dense and MoE
Causal mediation analysis
ROME-style noise patching per prompt token and layer. Localizes the retrieval layer for any factual association in both dense and MoE models.
SAE feature extraction
16K+ feature SAE at the peak layer. For MoE, trained on the pre-router hidden state. Top activating features causally ablated per forward pass.
Circuit attribution graph
Directed bipartite graph: prompt tokens to SAE features to response tokens with activation and ablation edge weights.
Logit lens
Residual stream unembedded at every layer to show how token predictions form across depth. Runs identically on dense and MoE blocks.
Feature steering
Decoder direction injected into the residual stream at inference time to confirm a feature's causal role without touching weights.
Fact check + bias + censor
Three output-level checks after the mechanistic analysis: claim verification, framing bias detection, and topic suppression audit.
full walkthrough at /research/attribution. all six steps run on dense and MoE without separate configuration.
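Of the steps listed above, the logit lens is the simplest to sketch. The version below assumes one residual stream vector per layer for a single token position, applies a plain layer norm without learned scale or bias, and multiplies by an unembedding matrix. Real models often use RMSNorm and learned parameters, so treat this as a toy with hypothetical names and shapes.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance (no learned params)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def logit_lens(residuals, unembed):
    """Top predicted token id at each layer.

    residuals: (n_layers, d_model) residual stream at one token position.
    unembed:   (d_model, vocab_size) unembedding matrix.
    """
    logits = layer_norm(residuals) @ unembed
    return logits.argmax(axis=-1)

# Toy setup: d_model = 4, vocab of 3 tokens aligned with the first 3 axes.
unembed = np.eye(4)[:, :3]
residuals = np.array([
    [5.0, 0.0, 0.0, 0.0],   # early layer leans toward token 0
    [0.0, 5.0, 0.0, 0.0],   # middle layer leans toward token 1
    [0.0, 0.0, 5.0, 0.0],   # final layer settles on token 2
])
predictions = logit_lens(residuals, unembed)
```

Because the operation reads only the residual stream, the same function applies whether the block between two layers is a dense MLP or a routed set of experts.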
Training monitor across architectures
The training monitor streams step events and runs signal detection in real time. For dense models, five detectors cover loss divergence, gradient spikes, attention head death, dead MLP layers, and loss plateau. For MoE models, two signals are added: expert load balance (Gini coefficient of token-to-expert assignment per sparse layer, per step) and per-expert gradient norm. An expert whose gradient norm drops below threshold for five consecutive steps is flagged the same way a dead MLP layer is.
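The two MoE signals can be sketched in plain Python. The Gini formula is the standard one over sorted counts, and the five-step window comes from the paragraph above; the function names, the gradient norm threshold, and the toy data are assumptions.

```python
def gini(counts):
    """Gini coefficient of token-to-expert counts: 0 is perfectly balanced."""
    n = len(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    ordered = sorted(counts)
    # Standard formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # with ranks i starting at 1 over values sorted ascending.
    weighted = sum((i + 1) * x for i, x in enumerate(ordered))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

def dead_experts(grad_norm_history, threshold=1e-6, patience=5):
    """Experts whose gradient norm stayed below threshold for `patience` steps.

    grad_norm_history: list of per-step lists, one gradient norm per expert.
    The threshold value is an assumed placeholder.
    """
    if len(grad_norm_history) < patience:
        return []
    recent = grad_norm_history[-patience:]
    n_experts = len(recent[0])
    return [e for e in range(n_experts)
            if all(step[e] < threshold for step in recent)]

balanced = gini([10, 10, 10, 10])   # every expert gets the same load
collapsed = gini([0, 0, 0, 40])     # one expert receives every token

# Expert 2 was alive at the first step, then silent for five steps.
history = [[0.5, 0.4, 0.3, 0.6]] + [[0.5, 0.4, 0.0, 0.6]] * 5
flagged = dead_experts(history)
```

With N experts the Gini coefficient tops out at (N - 1) / N, so the fully collapsed four-expert case above scores 0.75.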
Post-training, the SAE feature diff and model diff both adapt to architecture automatically. The model diff runs on outputs, so it is fully architecture-agnostic. The SAE feature diff runs on residual stream activations: for MoE models, it includes per-expert activation comparisons at layers where expert collapse was detected during training.
training monitor features · dense and MoE
Live signal detection
Five detectors: loss divergence, gradient spike, attention head death, dead layers, loss plateau. For MoE, expert death is added as a sixth detector.
Expert load tracking
MoE only: per-layer router assignment Gini coefficient streamed each step. Collapse is flagged when the Gini coefficient exceeds the 0.3 threshold.
SAE feature diff
SAE activations compared between base and fine-tuned checkpoint. Changed feature count, mean delta, top-changed feature per layer.
Model diff
Consistency, suppression, and robustness scores on both checkpoints diffed. Shows what the fine-tune changed behaviorally, not structurally.
Regression tracker
Benchmark scores tracked per checkpoint. Automatically flags any capability category that dropped more than a set threshold.
Calibration panel
ECE and per-topic confidence curves. Low-confidence rows exportable as a labeled dataset for the next training iteration.
full walkthrough at /research/training. expert load tracking appears automatically for MoE models with no extra setup.
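For readers unfamiliar with ECE, the metric in the calibration panel can be sketched as below. This uses the common equal-width binning definition, which is an assumption here since the article does not specify a binning scheme; the function name and sample values are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted probabilities in [0, 1].
    correct:     booleans, True where the prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 goes into the last bin rather than out of range.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece

# Well calibrated: 90% confidence, 9 of 10 correct.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# Overconfident: 90% confidence, only 5 of 10 correct.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```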
Evals
The eval system measures behavioral properties of a model from its outputs: how stable they are across paraphrase templates, whether it systematically shortens or hedges on specific topics, and how much its confidence degrades under prompt corruption. These are computed entirely from model outputs and are fully architecture-agnostic. Running the same eval suite on Llama 3.1 8B and Mixtral 8x7B produces directly comparable scores.
The eval system is also TransformerLens-compatible. Any checkpoint supported by TransformerLens loads without additional configuration. For MoE models, TransformerLens hooks attach to pre-router hidden states, the same position Aquin uses for SAE training and attribution.
Benchmarks
The benchmark system evaluates SAE features and model capabilities. For dense models, the SAE is trained on a selected layer's residual stream and the three feature scores evaluate interpretability, monosemanticity, and causal influence. For MoE models, the SAE hooks onto the pre-router hidden state, capturing the joint representation the router reads, not any individual expert's output. Feature scores are layer-level, not expert-level, making them directly comparable across architectures.
The Benchmark Builder, a conversational in-session capability evaluation tool, is fully architecture-agnostic. Describe what to measure; the agent runs the prompts, scores the outputs, and appends a result card to the thread. Dense or MoE, the interface and output format are identical.
Security
The security system's behavioral layers (jailbreak taxonomy, red team probing, and suppression bypass detection) operate on model outputs and are fully architecture-agnostic. The same six attack vectors are probed identically on dense and MoE models.
Weight trojan detection adapts for MoE. In dense models, the scan checks each layer's weight matrix for statistical anomalies. In MoE models, it scans each expert's matrices separately. A backdoor implant targeting a specific expert, such as a rank-one update to that expert's feed-forward weights, can be masked by aggregate-layer statistics but is exposed at the per-expert level. Aquin runs the scan at the granularity the architecture requires.
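To make the per-expert idea concrete, here is one possible check, sketched with NumPy: compare each expert's largest singular value to the median across experts in the same layer. The statistic, the ratio cutoff, and the synthetic weights are all assumptions for illustration; the article does not describe which statistics the actual scan uses.

```python
import numpy as np

def flag_anomalous_experts(expert_weights, ratio=2.0):
    """Indices of experts whose spectral norm exceeds ratio x the median.

    expert_weights: list of 2D arrays, one feed-forward matrix per expert.
    """
    # Largest singular value (spectral norm) of each expert's matrix.
    top_sv = np.array([np.linalg.svd(w, compute_uv=False)[0]
                       for w in expert_weights])
    median = np.median(top_sv)
    return [int(i) for i in np.where(top_sv > ratio * median)[0]]

rng = np.random.default_rng(0)

# Eight synthetic experts with small random weights.
experts = [rng.normal(scale=0.02, size=(64, 64)) for _ in range(8)]

# Simulate a rank-one implant in expert 5 only.
u = rng.normal(size=(64, 1))
v = rng.normal(size=(1, 64))
experts[5] = experts[5] + 0.05 * (u @ v)

flagged = flag_anomalous_experts(experts)
```

A rank-one update adds one large singular direction to a single expert, so comparing experts against each other isolates it without needing a clean reference checkpoint.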
