Transformers & LLMs

Aquin Labs · May 2026

Aquin supports the full transformer family: dense LLMs and Mixture-of-Experts models. Every tool in the platform is architecture-aware from the moment you load a model.

One platform, every transformer architecture

Most LLM tooling is built around a single architecture class. Interpretability libraries assume dense models. Fine-tuning frameworks add MoE support as an afterthought. Evaluation pipelines treat model architecture as invisible.

When you load a model, Aquin detects whether it is a dense LLM or a Mixture-of-Experts model. Attribution, training monitoring, evals, benchmarks, and security analysis all adapt to what they are analyzing. A Mixtral run and a Llama run go through the same interface and the same tool set. The platform handles the architectural differences internally.

supported architecture families · full tool availability

Dense LLMs: Llama 3.x · Mistral · Phi-3 · Qwen2 · Falcon · GPT-2 · Pythia · OPT

MoE LLMs: Mixtral 8x7B · DeepSeek-V2 · Grok-1 · OLMoE · Qwen MoE

all tools: Attribution · Training Monitor · Evals · Benchmarks · Security · SAE Training

tool availability · dense vs MoE · what changes between architectures

Tool | Dense | MoE | MoE notes

Attribution | yes | yes | attention layers full; expert FFN treated as dense sublayer; SAE on pre-router hidden state

Training monitor | yes | yes | router load balance + per-expert gradient norms tracked per step

Eval system | yes | yes | no architecture-specific setup required; scores are directly comparable

Benchmarks | yes | yes | SAE trained on residual stream before router; feature scores are layer-level

Security | yes | yes | weight trojan scan runs per expert matrix to catch expert-targeted implants

SAE training | yes | yes | hooks on pre-router hidden state; full 16K+ feature dictionary

Model diff | yes | yes | behavioral scores are architecture-agnostic; runs on outputs only

every tool is available on both. the MoE column shows how each tool adapts, not what it loses.

Architecture support

The transformer is the shared foundation. Every model Aquin supports, dense or sparse, is built on the same core building blocks: multi-head self-attention, a residual stream that carries information across layers, layer normalization, and positional embeddings. The architecture families differ in what happens in the feed-forward sublayer and how those sublayers are connected, not in the attention mechanism itself.

Attribution, SAE analysis, attention inspection, and training signals all hook into the parts of the transformer that are universal. The feed-forward layer is where the families diverge, and Aquin handles each variant at that layer specifically, without changing the interface or the output format.

Dense LLMs

A dense LLM is a standard transformer where every token activates every parameter on every forward pass. All attention heads and all feed-forward neurons run regardless of input. Dense models are the baseline of the transformer family. Llama, Mistral, Phi, Falcon, Qwen, GPT-2, OPT, and Pythia all fall here.

Every Aquin tool was built natively on dense LLMs. The residual stream is a single coherent vector at each layer, attribution runs across the full MLP and attention stack, and the SAE is trained on residual stream activations at a selected peak layer. Fine-tuning runs with LoRA, QLoRA, or full-parameter updates all go through the same training monitor setup.

dense LLM support · tested families and variants

Family | Variants | Instruct

Llama | 3.2 1B · 3.2 3B · 3.1 8B · 3.1 70B | yes

Mistral | 7B · 7B Instruct v0.3 | yes

Phi | Phi-3 mini · Phi-3.5 mini Instruct | yes

Falcon | 7B · 40B · RW-1B | base only

Qwen | Qwen2 7B · Qwen2.5 7B Instruct | yes

GPT-2 | small · medium · large · xl | base only

OPT | 125M · 1.3B · 6.7B · 30B | base only

Pythia | 70M to 12B (deduped variants) | base only

instruct variants load with the correct chat template automatically. base models load without a template applied.

Mixture-of-Experts LLMs

Mixture-of-Experts models replace dense feed-forward sublayers with a pool of N expert networks and a learned router. Each token's hidden state passes through the router, which selects k experts to process it: typically 2 of 8 in Mixtral-style configurations, or a larger k drawn from 60 or more fine-grained experts in DeepSeek-, OLMoE-, and Qwen-style configurations. Only those experts run for that token. Mixtral 8x7B, for example, stores roughly 47B parameters but activates only about 13B per token, so its per-token compute is close to that of a 13B dense model while its stored capacity is that of a 47B one.
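The routing step described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes Mixtral-style gating (softmax over the selected logits only); other families normalize differently, and the name `top_k_route` and the toy shapes are illustrative rather than part of Aquin.

```python
import numpy as np

def top_k_route(hidden, router_weights, k=2):
    """Pick k experts per token and return (expert indices, gate weights).

    hidden:         (tokens, d_model) pre-router hidden states
    router_weights: (d_model, n_experts) learned router projection
    """
    logits = hidden @ router_weights                      # (tokens, n_experts)
    # Indices of the k largest logits per token (order within the k is arbitrary).
    top_idx = np.argpartition(logits, -k, axis=-1)[:, -k:]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over the selected experts only, so the k gates sum to 1.
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(top_logits)
    gates /= gates.sum(axis=-1, keepdims=True)
    return top_idx, gates

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 16))    # 5 tokens, toy d_model = 16
router = rng.normal(size=(16, 8))    # 8 experts
chosen, gates = top_k_route(hidden, router, k=2)
```

Each row of `chosen` names the two experts that would run for that token, and the matching row of `gates` gives the weights used to mix their outputs. Everything upstream of this call, including the hidden state it reads, is computed densely.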

Attention layers in MoE models are almost always dense. The sparsity is in the feed-forward sublayers only. This means all of Aquin's attention-level tools apply identically to MoE attention blocks. The additional MoE-specific signals (expert load balance, router assignment distribution, and per-expert gradient norms) are tracked automatically when a sparse layer is detected in the loaded model.

MoE layer · how Aquin hooks into a sparse transformer block

Aquin hooks on the pre-router hidden state for SAE training and attribution. attention sublayers in MoE models are dense, so all attention tools apply without modification.

MoE support · tested families and routing configurations

Family | Variants | Routing

Mixtral | 8x7B · 8x7B Instruct · 8x22B | top-2 · 8 experts

DeepSeek | DeepSeek-V2 · DeepSeek-V2-Chat · MoE 16B | top-6 · 160 routed experts (V2) · 64 routed experts (MoE 16B)

Grok | Grok-1 (314B) | top-2 · 8 experts

OLMoE | OLMoE-1B-7B · OLMoE-1B-7B-Instruct | top-8 · 64 experts

Qwen MoE | Qwen1.5-MoE-A2.7B | top-4 · 60 experts

routing configuration is detected automatically from model config. top-k and number of experts require no manual specification.
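As a rough illustration of config-based detection, the sketch below reads the expert count and top-k from a plain config dictionary. The key names follow Hugging Face config conventions for Mixtral, Qwen MoE, OLMoE, and DeepSeek as an assumption; `detect_routing` is a hypothetical helper, not Aquin's actual loader.

```python
from typing import Optional

# Assumed key names, based on Hugging Face model configs; other families may differ.
EXPERT_COUNT_KEYS = ("num_local_experts", "num_experts", "n_routed_experts")
TOP_K_KEY = "num_experts_per_tok"

def detect_routing(config: dict) -> Optional[dict]:
    """Return routing info for an MoE config, or None for a dense model."""
    for key in EXPERT_COUNT_KEYS:
        n_experts = config.get(key)
        if isinstance(n_experts, int) and n_experts > 1:
            return {"num_experts": n_experts, "top_k": config.get(TOP_K_KEY)}
    return None

mixtral_like = {"num_local_experts": 8, "num_experts_per_tok": 2}
dense_like = {"hidden_size": 4096, "num_hidden_layers": 32}
print(detect_routing(mixtral_like))
print(detect_routing(dense_like))
```

A config with none of the expert-count keys is treated as dense, which is the fallback a loader would want when an unfamiliar architecture is encountered.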

Model inspection signals

Attribution tells you which parts of a model's computation explain a specific output. Inspection signals go a level deeper, describing the structural and geometric properties of the model itself, independent of any particular prompt. These are the signals that tell you whether a model is healthy before you run a single eval, whether a fine-tune degraded its representations, and whether specific layers or heads have collapsed.

Every signal below runs on loaded checkpoints across all supported architectures. For MoE models, per-expert variants of the weight-level signals are available alongside the standard layer-level view.

inspection signals · all architectures

Signal | Arch | What it shows

OOD similarity score | dense + MoE | Cosine distance between a prompt's residual stream and the in-distribution centroid at a configurable layer. Flagged pre-decode.

Attention head entropy | dense + MoE | Shannon entropy of each head's attention weight distribution per token and per layer. Dead heads, collapsed heads, and anomalously focused heads are surfaced in the per-head heatmap.

Attention routing map | dense + MoE | Per-layer attention pattern visualization showing which positions each head attends to. Sink tokens, diagonal patterns, and copy heads are labeled automatically.

Weight rank | dense + MoE | Numerical rank of every projection matrix. Low-rank collapse after fine-tuning is detected per matrix and flagged if rank drops below a threshold.

SVD spectrum | dense + MoE | Full singular value spectrum for any selected weight matrix. Energy drop-off plotted with effective rank at 90/95/99% energy thresholds marked.

Activation geometry | dense + MoE | PCA of residual stream activations across a prompt batch. Cluster separation, centroid drift across layers, and cosine similarity between concept groups plotted as 2D projections.

Intrinsic dimensionality | dense + MoE | PCA variance explained at 90/95/99% thresholds per layer. Low intrinsic dimensionality indicates representation collapse.

Expert load balance | MoE only | Per-layer Gini coefficient of token-to-expert assignment. Load imbalance streamed per step during training and inspectable as a static snapshot on any loaded MoE checkpoint.

OOD similarity

OOD similarity measures how close a prompt's hidden-state geometry is to the geometry of in-distribution inputs. A prompt that lands far from the training distribution centroid in residual stream space is likely to produce unreliable outputs. The model is being asked to reason in a region it did not see during training.

Aquin computes OOD similarity by taking cosine distance between the prompt's residual stream at a configurable layer and the centroid of a reference batch. The score is computed pre-decode, before any output token is sampled. Prompts flagged as high-OOD are tagged in the session and their attribution results are annotated with a confidence warning.
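A minimal sketch of the score, assuming each prompt is summarized by a single residual-stream vector (for example the last-token activation at the chosen layer). The pooling choice, the function name `ood_score`, and the 0.5 threshold are assumptions for illustration only.

```python
import numpy as np

def ood_score(prompt_resid, reference_resids):
    """Cosine distance between a prompt vector and the reference centroid.

    prompt_resid:     (d_model,) residual-stream vector for the prompt
    reference_resids: (n_reference, d_model) vectors from in-distribution prompts
    """
    centroid = reference_resids.mean(axis=0)
    cos = prompt_resid @ centroid / (
        np.linalg.norm(prompt_resid) * np.linalg.norm(centroid)
    )
    return float(1.0 - cos)   # 0 = same direction, 1 = orthogonal, 2 = opposite

rng = np.random.default_rng(0)
reference = rng.normal(loc=1.0, size=(64, 32))   # toy in-distribution batch
prompt = rng.normal(loc=1.0, size=32)
score = ood_score(prompt, reference)
is_ood = score > 0.5   # hypothetical threshold; a real one would be calibrated
```

Because the score needs only one forward pass up to the chosen layer, it can be computed before any token is sampled.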

OOD similarity · residual stream projection

in-distribution inputs cluster near the centroid. OOD inputs sit outside the distribution boundary and are flagged before decoding starts.

Attention routing and head entropy

Attention routing analysis surfaces which positions each head attends to across a prompt. Sink tokens, diagonal copy patterns, and semantic-retrieval heads appear as distinct structures in the attention map. Aquin identifies these patterns automatically and labels them in the per-head breakdown.

Attention entropy adds a quantitative layer. Shannon entropy of the attention weight distribution tells you how spread or concentrated each head's attention is on a given input. A head with near-zero entropy on most inputs has collapsed: it attends to the same position regardless of input. These dead or degenerate heads are surfaced in a per-layer heatmap and can be examined individually.
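The entropy calculation itself is short. The sketch below assumes attention weights for one layer are available as an array of shape (heads, query positions, key positions) with rows summing to 1; `head_entropy` and the 0.05 cutoff are illustrative names and values, not Aquin's.

```python
import numpy as np

def head_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) of each head's attention distribution.

    attn: (heads, query, key) attention weights; each row sums to 1.
    """
    per_query = -(attn * np.log(attn + eps)).sum(axis=-1)   # (heads, query)
    return per_query.mean(axis=-1)                          # (heads,)

rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 10, 10))                       # 8 heads, 10 tokens
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
attn[6] = 0.0
attn[6, :, 0] = 1.0                                         # head 6 always attends to position 0
entropy = head_entropy(attn)
collapsed = np.flatnonzero(entropy < 0.05).tolist()         # hypothetical cutoff
```

With a causal mask, early query positions can only see a few keys, so their entropy is bounded by the log of the number of visible positions; comparing heads at the same positions avoids misreading that as collapse.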

attention entropy · per-head heatmap · 6 layers x 8 heads

entropy near zero (red) indicates a dead or collapsed head. the circled head at L4 H6 is flagged automatically.

Weight rank and SVD spectrum

Weight rank is the numerical rank of a projection matrix: how many linearly independent directions it actually uses. A full-rank W_up uses all d_model dimensions independently. A low-rank W_up has collapsed into a subspace. This is common after LoRA fine-tuning where adapter rank is small, or after aggressive post-training where the gradient signal pushes most directions toward zero.

The SVD spectrum makes the rank collapse visible. Aquin plots the full singular value distribution for any selected weight matrix and marks the effective rank at 90%, 95%, and 99% energy thresholds. A model with a sharp drop after the first few singular values is using far less capacity than its parameter count suggests, a signal invisible in loss curves but apparent in the spectrum.
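A minimal sketch of the effective-rank calculation, assuming "energy" means the cumulative share of squared singular values (the text does not define it, so this is one common convention). The function name `effective_rank` is illustrative.

```python
import numpy as np

def effective_rank(weight, thresholds=(0.90, 0.95, 0.99)):
    """Number of singular directions needed to reach each energy threshold."""
    s = np.linalg.svd(weight, compute_uv=False)      # sorted, largest first
    energy = np.cumsum(s**2) / np.sum(s**2)          # cumulative energy share
    return {
        t: int(min(np.searchsorted(energy, t) + 1, len(s)))
        for t in thresholds
    }

rng = np.random.default_rng(0)
healthy = rng.normal(size=(64, 256))                                # broad spectrum
collapsed = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 256))    # rank-2 matrix
print(effective_rank(healthy))
print(effective_rank(collapsed))
```

The collapsed matrix has the same shape and parameter count as the healthy one, but its effective rank is at most 2 at every threshold, which is the kind of gap the spectrum plot makes visible.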

SVD spectrum · W_up · layer 16 · singular values s0 to s19

indigo bars: dimensions within 90% energy threshold. grey bars: dimensions past it. sharp drop = rank collapse.

Activation geometry and intrinsic dimensionality

Activation geometry shows how the residual stream separates different concept groups across a prompt batch. Aquin runs PCA on a configurable batch of residual stream activations at each layer and projects the result to 2D. Clusters that are well-separated at early layers and merge at later ones are structurally meaningful. Centroid drift across layers shows how the model accumulates concept-specific information as depth increases.

Intrinsic dimensionality quantifies how compressed the representation actually is. If 95% of the variance in a layer's activations is explained by 12 of 4096 PCA components, that layer is operating in an effectively 12-dimensional subspace. Low intrinsic dimensionality is not inherently bad (many tasks are genuinely low-dimensional), but sudden drops after fine-tuning suggest the model is forgetting structure it had before.
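The variance-explained measure can be sketched with a plain SVD-based PCA. The batch shape and the name `intrinsic_dims` are assumptions; the activations here are synthetic stand-ins for one layer's residual stream.

```python
import numpy as np

def intrinsic_dims(acts, thresholds=(0.90, 0.95, 0.99)):
    """PCA components needed to explain each share of activation variance.

    acts: (n_samples, d_model) residual-stream activations at one layer.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    explained = np.cumsum(s**2) / np.sum(s**2)       # cumulative variance ratio
    return {
        t: int(min(np.searchsorted(explained, t) + 1, len(s)))
        for t in thresholds
    }

rng = np.random.default_rng(0)
# Toy layer whose activations lie in a 3-dimensional subspace of a 64-d space.
low_dim = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 64))
dims = intrinsic_dims(low_dim)
```

Running the same function on a base checkpoint and a fine-tuned one, layer by layer, gives the before-and-after comparison the paragraph describes.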

intrinsic dimensionality · variance explained by layer

Layer | Variance @90% | @95% | @99%

… | 8d | 14d | 38d

… | 22d | 41d | 98d

L16 | 45d | 88d | 210d

L24 | 12d | 23d | 58d

L31 | 6d | 11d | 29d

L24 and L31 show lower intrinsic dimensionality than mid-depth layers, common in models where later layers compress to a narrow output subspace.

Attribution across architectures

Attribution runs the same pipeline on dense and MoE models. The causal trace patches the residual stream at each layer to locate the fact retrieval depth, a residual-stream operation that works identically regardless of whether the surrounding sublayers are dense or sparse. The SAE is applied at the peak layer: for MoE models, it is trained on the pre-router hidden state, capturing the full joint representation before the routing decision splits it.

The circuit graph, logit lens, and feature steering all operate on the residual stream and are architecture-agnostic. The only difference in an MoE attribution run is that the SAE hooks at the pre-router position rather than at a standard MLP output.
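To make the architecture-agnostic point concrete, here is a toy logit lens: it needs only the residual stream at each layer and the unembedding matrix, so nothing in it depends on whether the feed-forward sublayers are dense or sparse. The sketch assumes a plain layer norm without learned scale or bias (real models use their own final norm, such as RMSNorm), and all names and shapes are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector to zero mean and unit variance (no learned params)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def logit_lens(resid_by_layer, unembed):
    """Top predicted token id at each layer for one sequence position.

    resid_by_layer: (layers, d_model) residual stream after each layer
    unembed:        (d_model, vocab) unembedding matrix
    """
    logits = layer_norm(resid_by_layer) @ unembed    # (layers, vocab)
    return logits.argmax(axis=-1)

rng = np.random.default_rng(0)
resid = rng.normal(size=(6, 32))      # 6 layers, toy d_model = 32
unembed = rng.normal(size=(32, 100))  # toy vocabulary of 100 tokens
top_tokens = logit_lens(resid, unembed)
```

Reading `top_tokens` from the first layer to the last shows where the final prediction first appears in depth.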

attribution features · dense and MoE

Causal mediation analysis

ROME-style noise patching per prompt token and layer. Localizes the retrieval layer for any factual association in both dense and MoE models.

SAE feature extraction

16K+ feature SAE at the peak layer. For MoE, trained on the pre-router hidden state. Top activating features causally ablated per forward pass.

Circuit attribution graph

Directed bipartite graph: prompt tokens to SAE features to response tokens with activation and ablation edge weights.

Logit lens

Residual stream unembedded at every layer to show how token predictions form across depth. Runs identically on dense and MoE blocks.

Feature steering

Decoder direction injected into the residual stream at inference time to confirm a feature's causal role without touching weights.

Fact check + bias + censor

Three output-level checks after the mechanistic analysis: claim verification, framing bias detection, and topic suppression audit.

full walkthrough at /research/attribution. all six steps run on dense and MoE without separate configuration.

Training monitor across architectures

The training monitor streams step events and runs signal detection in real time. For dense models, five detectors cover loss divergence, gradient spikes, attention head death, dead MLP layers, and loss plateau. For MoE models, two signals are added: expert load balance (Gini coefficient of token-to-expert assignment per sparse layer, per step) and per-expert gradient norm. An expert whose gradient norm drops below threshold for five consecutive steps is flagged the same way a dead MLP layer is.
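Both MoE signals are simple to state in code. The sketch below computes the Gini coefficient of per-expert token counts and tracks consecutive low-gradient steps per expert. The class name, the gradient-norm threshold of 1e-6, and the toy data are assumptions; only the five-step patience comes from the description above.

```python
import numpy as np

def gini(counts):
    """Gini coefficient of token counts per expert (0 = perfectly balanced)."""
    x = np.sort(np.asarray(counts, dtype=float))
    n, total = x.size, x.sum()
    if total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2 * (ranks * x).sum() / (n * total) - (n + 1) / n)

class DeadExpertDetector:
    """Flags experts whose gradient norm stays below a threshold for `patience` steps."""

    def __init__(self, n_experts, threshold=1e-6, patience=5):
        self.threshold = threshold            # hypothetical default
        self.patience = patience
        self.streak = np.zeros(n_experts, dtype=int)

    def update(self, grad_norms):
        below = np.asarray(grad_norms) < self.threshold
        # Extend the streak where the norm is low, reset it elsewhere.
        self.streak = np.where(below, self.streak + 1, 0)
        return np.flatnonzero(self.streak >= self.patience).tolist()

balanced = gini([100] * 8)             # every expert gets the same load
skewed = gini([0] * 7 + [800])         # one expert receives every token

detector = DeadExpertDetector(n_experts=4)
flags = [detector.update([0.3, 0.2, 0.0, 0.4]) for _ in range(5)]
```

For n experts the Gini value ranges from 0 (uniform load) to (n - 1) / n (all tokens on one expert), so the same threshold means something slightly different for 8 experts than for 64.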

Post-training, the SAE feature diff and model diff both adapt to architecture automatically. The model diff runs on outputs, so it is fully architecture-agnostic. The SAE feature diff runs on residual stream activations: for MoE models, it includes per-expert activation comparisons at layers where expert collapse was detected during training.

training monitor features · dense and MoE

Live signal detection

Five detectors: loss divergence, gradient spike, attention head death, dead layers, loss plateau. For MoE, expert death is added as a sixth detector.

Expert load tracking

MoE only: per-layer router assignment Gini coefficient streamed each step. Collapse flagged when Gini exceeds the 0.3 threshold.

SAE feature diff

SAE activations compared between base and fine-tuned checkpoint. Changed feature count, mean delta, top-changed feature per layer.

Model diff

Consistency, suppression, and robustness scores on both checkpoints diffed. Shows what the fine-tune changed behaviorally, not structurally.

Regression tracker

Benchmark scores tracked per checkpoint. Automatically flags any capability category that dropped more than a set threshold.

Calibration panel

ECE and per-topic confidence curves. Low-confidence rows exportable as a labeled dataset for the next training iteration.

full walkthrough at /research/training. expert load tracking appears automatically for MoE models with no extra setup.
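For reference, expected calibration error (ECE) as used in the calibration panel can be computed as follows. This is the standard equal-width binning definition; the bin count of 10 and the function name are assumptions, since the panel's exact settings are not specified here.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and confidence across bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in [0, n_bins - 1].
    bins = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap          # weight by share of samples in bin
    return float(ece)

# Toy example: the model is 95% confident on four answers but gets one wrong.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0])
```

Rows that fall in bins with a large accuracy-confidence gap are natural candidates for the exported dataset mentioned above.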

Evals

The eval system measures behavioral properties of a model from its outputs: how stable they are across paraphrase templates, whether it systematically shortens or hedges on specific topics, and how much its confidence degrades under prompt corruption. These are computed entirely from model outputs and are fully architecture-agnostic. Running the same eval suite on Llama 3.1 8B and Mixtral 8x7B produces directly comparable scores.

The eval system is also TransformerLens-compatible. Any checkpoint supported by TransformerLens loads without additional configuration. For MoE models, TransformerLens hooks attach to pre-router hidden states, the same position Aquin uses for SAE training and attribution.

Benchmarks

The benchmark system evaluates SAE features and model capabilities. For dense models, the SAE is trained on a selected layer's residual stream and the three feature scores evaluate interpretability, monosemanticity, and causal influence. For MoE models, the SAE hooks onto the pre-router hidden state, capturing the joint representation the router reads, not any individual expert's output. Feature scores are layer-level, not expert-level, making them directly comparable across architectures.

The Benchmark Builder (conversational, in-session capability evaluation) is fully architecture-agnostic. Describe what to measure; the agent runs the prompts, scores the outputs, and appends a result card to the thread. Dense or MoE, the interface and output format are identical.

Security

The security system's behavioral layers (jailbreak taxonomy, red team probing, and suppression bypass detection) operate on model outputs and are fully architecture-agnostic. The same six attack vectors are probed identically on dense and MoE models.

Weight trojan detection adapts for MoE. In dense models, the scan checks each layer's weight matrix for statistical anomalies. In MoE models, it scans per expert matrix. A backdoor implant targeting a specific expert, such as a rank-one update in one expert's feed-forward weights, is masked by aggregate-layer statistics but exposed at the per-expert level. Aquin runs the scan at the granularity the architecture requires.
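One way to picture a per-expert scan is a spectral-gap check: a strong rank-one implant pushes a single singular value far above the rest of that expert's spectrum. The sketch below is an illustrative heuristic under that assumption, not Aquin's detection method; the gap threshold of 2.0 and the implant strength are arbitrary toy values.

```python
import numpy as np

def spectral_gap(weight):
    """Ratio of the largest to the second-largest singular value."""
    s = np.linalg.svd(weight, compute_uv=False)
    return float(s[0] / s[1])

def scan_experts(expert_weights, max_gap=2.0):
    """Return (indices of experts whose spectral gap exceeds max_gap, all gaps)."""
    gaps = [spectral_gap(w) for w in expert_weights]
    flagged = [i for i, g in enumerate(gaps) if g > max_gap]
    return flagged, gaps

rng = np.random.default_rng(0)
experts = [rng.normal(size=(32, 64)) for _ in range(8)]   # toy expert matrices

# Toy rank-one implant in expert 3 only.
u = rng.normal(size=32)
u /= np.linalg.norm(u)
v = rng.normal(size=64)
v /= np.linalg.norm(v)
experts[3] = experts[3] + 40.0 * np.outer(u, v)

flagged, gaps = scan_experts(experts)
```

Checking each expert separately keeps the implant's signal undiluted; a statistic pooled across all eight matrices would spread the same perturbation over eight times as many weights.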

Aquin Labs · aquin@aquin.app