
Embedding Models

Aquin Labs · May 2026

Geometry inspection, retrieval evaluation, fine-tuning monitoring, and embedding diff across checkpoints. Load any sentence-transformers compatible encoder and get the full picture of your embedding space.

Embedding models in Aquin

An embedding model is an encoder that collapses a variable-length input into a single dense vector. That vector is the whole output. There is no next-token distribution, no chain of reasoning, no generation. Everything the model knows about the input is compressed into a fixed-size point in a high-dimensional space, and the quality of that compression determines whether downstream retrieval, clustering, or classification works.

Most embedding tooling stops at benchmark numbers. Aquin goes into the space itself. You can load any sentence-transformers checkpoint, visualize the geometry of your dataset, measure whether the space is healthy or anisotropic, trace similarity through the encoder layer by layer, evaluate retrieval quality on your own query-document pairs, and compare two checkpoints to see exactly what a fine-tune changed.

[Figure: embedding space, UMAP projection with three topic clusters (medical, legal, finance) and one flagged OOD input. Shows cluster separation, outlier detection, and per-label coloring; OOD points are flagged before retrieval.]

Supported models

Aquin supports any Hugging Face checkpoint that follows the sentence-transformers interface: a transformer encoder with a pooling layer on top. The pooling strategy is detected automatically from the model config: CLS token pooling, mean pooling, or weighted mean pooling. For Instructor-style models with instruction prefixes, the prefix is applied transparently at inference time.
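The three pooling strategies named above can be sketched with NumPy on token-level hidden states. This is a minimal illustration, not Aquin's detection logic: the function names are hypothetical, the CLS token is assumed to sit at position 0, and the weighted variant assumes linearly increasing position weights, which is one common choice among several.

```python
import numpy as np

def cls_pool(hidden, mask):
    # hidden: (tokens, dim). Assumes the CLS token is the first row.
    return hidden[0]

def mean_pool(hidden, mask):
    # Average over real tokens only (mask == 1), ignoring padding rows.
    m = mask[:, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=0) / m.sum()

def weighted_mean_pool(hidden, mask):
    # Assumed weighting: token i gets weight i + 1, padding gets weight 0.
    w = (np.arange(1, hidden.shape[0] + 1) * mask)[:, None].astype(hidden.dtype)
    return (hidden * w).sum(axis=0) / w.sum()

# Toy hidden states for 4 token positions; the last row is padding.
hidden = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]])
mask = np.array([1, 1, 1, 0])

print(cls_pool(hidden, mask))
print(mean_pool(hidden, mask))
print(weighted_mean_pool(hidden, mask))
```

All three reduce a (tokens, dim) matrix to one vector of size dim; they differ only in how much each token contributes.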

Bi-encoders

Bi-encoders embed query and document independently and compare them with cosine similarity. This makes them fast for large-scale retrieval: you embed the corpus once, index it, and query at inference time. The tradeoff is that the encoder cannot model interactions between query and document. BGE, E5, GTE, Nomic, Jina, Instructor, MiniLM, and SBERT are all bi-encoders. Every tool in Aquin's embedding system runs on bi-encoders.
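The embed-once, query-later pattern can be sketched as follows. The vectors here are hand-written stand-ins for encoder outputs, and `build_index` and `search` are hypothetical helper names; a production index would typically use an approximate nearest-neighbor library rather than a dense matrix product.

```python
import numpy as np

def build_index(corpus_vecs):
    # Normalize once so that a dot product equals cosine similarity.
    return corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)

def search(index, query_vec, k=2):
    q = query_vec / np.linalg.norm(query_vec)
    sims = index @ q                      # cosine similarity to every document
    order = np.argsort(-sims)[:k]         # indices of the k most similar docs
    return order.tolist(), sims[order].tolist()

# Stand-ins for document embeddings; computed once, reused for every query.
corpus_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
index = build_index(corpus_vecs)

ids, scores = search(index, np.array([2.0, 0.1]), k=2)
print(ids, scores)
```

Because the query and the documents never pass through the encoder together, the only interaction between them is the final dot product, which is the tradeoff described above.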

| Family | Variants | Pooling |
|---|---|---|
| BGE | bge-small-en-v1.5 · bge-base-en-v1.5 · bge-large-en-v1.5 · bge-m3 | CLS |
| E5 | e5-small-v2 · e5-base-v2 · e5-large-v2 · multilingual-e5-large | mean |
| GTE | gte-small · gte-base · gte-large · gte-Qwen2-1.5B | mean |
| Nomic | nomic-embed-text-v1 · nomic-embed-text-v1.5 | mean |
| Jina | jina-embeddings-v2-base-en · jina-embeddings-v3 | mean |
| Instructor | instructor-base · instructor-large · instructor-xl | mean |
| MiniLM | all-MiniLM-L6-v2 · all-MiniLM-L12-v2 | mean |
| SBERT | all-mpnet-base-v2 · paraphrase-multilingual-mpnet-base-v2 | mean |

Cross-encoders

Cross-encoders take a query-document pair as a single concatenated input and output a relevance score. They do not produce an embedding vector. Because they model query-document interactions directly, they are significantly more accurate than bi-encoders on reranking tasks, but cannot be used for large-scale retrieval directly. Aquin supports cross-encoders for reranking evaluation: load a cross-encoder alongside a bi-encoder retriever and compare the rank distributions before and after reranking.

Inspection signals

Retrieval benchmarks tell you a number. Inspection signals tell you why that number is what it is, and what would need to change to improve it. Aquin surfaces eight signals across geometry, attention, and retrieval quality.

| Signal | What it shows |
|---|---|
| Embedding geometry explorer | UMAP 2D projection of your dataset's embeddings. Cluster separation, outlier detection, and per-label coloring. |
| Anisotropy score | Average cosine similarity across random embedding pairs. Values near 1 indicate collapsed geometry. Plotted against a uniform-sphere baseline. |
| Intrinsic dimensionality | PCA variance explained at 90/95/99% thresholds. Shows how many dimensions the model is actually using versus the nominal embedding size. |
| Layer-wise similarity | Mean pairwise cosine similarity of hidden states at each encoder layer. Shows where the model builds its final representation and where collapse begins. |
| Attention entropy per head | Shannon entropy of each head's attention distribution across input tokens. Dead and collapsed heads are flagged in the per-layer heatmap. |
| OOD proximity score | Cosine distance from each embedded input to the in-distribution centroid. Outlier inputs are flagged before they are used in downstream retrieval or classification. |
| Nearest-neighbor rank chart | For each query in a probe set, the rank of the ground-truth document in the sorted cosine similarity list. Distribution plotted as a histogram. |
| Hard-negative gap | Cosine similarity delta between the closest true positive and the closest hard negative for each query. A small gap means the model struggles at the decision boundary. |
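The attention entropy signal in the table above can be illustrated with a small NumPy sketch. It assumes attention weights are available as an array of shape (heads, query tokens, key tokens) with rows summing to 1; the function name is hypothetical, and the thresholds Aquin uses to flag dead or collapsed heads are not stated in this document, so none are assumed here.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    # attn: (heads, query_tokens, key_tokens); each row is a probability distribution.
    per_query = -(attn * np.log(attn + eps)).sum(axis=-1)  # Shannon entropy per query token
    return per_query.mean(axis=-1)                          # mean entropy per head

n = 4
uniform_head = np.full((n, n), 1.0 / n)   # attends to every token equally
one_hot_head = np.eye(n)                  # each token attends to exactly one key

entropy = attention_entropy(np.stack([uniform_head, one_hot_head]))
normalized = entropy / np.log(n)          # 1.0 = fully uniform, 0.0 = fully peaked
print(normalized)
```

Dividing by log(n) makes heads comparable across inputs of different lengths, since the maximum possible entropy grows with the number of tokens.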

Embedding geometry

The embedding explorer projects your dataset into 2D using UMAP and plots every point. Color by label, by cluster assignment, or by OOD score. Points that sit far from any cluster (inputs the model has not learned to place reliably) are flagged automatically. The explorer is the starting point for understanding whether your embedding space is doing what you need it to do before you run any retrieval or classification on top of it.

Intrinsic dimensionality adds a quantitative view. If a 768-dimensional embedding space only needs 40 dimensions to explain 95% of the variance in your dataset, the model is compressing your data heavily. Whether that is good or bad depends on the task, but knowing it is essential context for choosing embedding dimension, comparing models, and diagnosing retrieval failures.
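One way to compute the variance thresholds described above is via the singular values of the centered embedding matrix. The sketch below is an assumption about how such a measurement could be done, not Aquin's implementation; `dims_for_variance` is a hypothetical name and the data is synthetic (5 latent factors projected into 64 dimensions).

```python
import numpy as np

def dims_for_variance(embeddings, thresholds=(0.90, 0.95, 0.99)):
    centered = embeddings - embeddings.mean(axis=0)
    # Squared singular values are proportional to the PCA eigenvalues.
    s = np.linalg.svd(centered, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First component count whose cumulative explained variance reaches each threshold.
    return {t: int(np.searchsorted(cumulative, t) + 1) for t in thresholds}

# Synthetic embeddings: nominally 64-dimensional, but driven by 5 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
embeddings = latent @ rng.normal(size=(5, 64)) + 1e-3 * rng.normal(size=(500, 64))

dims = dims_for_variance(embeddings)
print(dims)
```

On this synthetic data the counts stay at or below 5 even though the nominal size is 64, which is the gap between used and nominal dimensions that the signal is meant to expose.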

Anisotropy

Anisotropy is a geometric degeneration where all embeddings cluster in a narrow cone rather than distributing across the full sphere. In an anisotropic space, random pairs of inputs have high cosine similarity not because they are semantically similar, but because every vector points in roughly the same direction. This inflates similarity scores across the board and makes retrieval unreliable.

Aquin measures anisotropy as the mean pairwise cosine similarity across a random sample of embeddings. A well-distributed space has mean similarity near 0. A collapsed space has mean similarity approaching 1. The distribution is plotted as a histogram so you can see whether the problem is severe across the board or concentrated in a subset of the data.
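The measurement described above (mean pairwise cosine similarity over a random sample) can be sketched as follows. The function name, sample size, and synthetic data are assumptions for illustration; the collapsed case is simulated by adding a large shared offset so that every vector points in roughly the same direction.

```python
import numpy as np

def anisotropy(embeddings, sample_size=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    x = embeddings[idx]
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    upper = np.triu_indices(len(x), k=1)      # each distinct pair once, no self-pairs
    return float(sims[upper].mean()), sims[upper]

rng = np.random.default_rng(1)
healthy = rng.normal(size=(200, 64))          # directions spread over the sphere
collapsed = healthy + 10.0                    # shared offset dominates every vector

healthy_mean, healthy_sims = anisotropy(healthy)
collapsed_mean, collapsed_sims = anisotropy(collapsed)
print(round(healthy_mean, 3), round(collapsed_mean, 3))
```

The second return value holds the individual pairwise similarities, which is what a histogram of the distribution would be built from.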

[Figure: anisotropy, pairwise cosine similarity distribution over the range 0.0 to 1.0. Left panel: healthy geometry, mass distributed near 0. Right panel: high anisotropy, mass shifted toward 1, where similarity scores are unreliable.]

Layer-by-layer analysis

An embedding model's final vector is not built in one step. It emerges across the encoder's layers as attention heads route information and the feed-forward sublayers transform representations. Aquin plots mean pairwise cosine similarity of hidden states at each encoder layer. This shows at which layer the model's representation stabilizes, where collapse begins if it does, and whether the final pooled output reflects the geometry of earlier layers or diverges from it.
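A minimal sketch of the per-layer curve, assuming one pooled vector per input per layer is already available as an array of shape (layers, inputs, dim). How Aquin extracts and pools hidden states is not specified here; the synthetic data simply adds a shared offset that grows with depth to mimic representations drawing closer together.

```python
import numpy as np

def mean_pairwise_cosine(x):
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    # Remove the diagonal (self-similarity of 1) and average the remaining pairs.
    return float((sims.sum() - n) / (n * (n - 1)))

def layerwise_similarity(layer_states):
    # layer_states: (layers, inputs, dim)
    return [mean_pairwise_cosine(h) for h in layer_states]

rng = np.random.default_rng(0)
noise = rng.normal(size=(100, 64))
offsets = [0.0, 1.0, 2.0, 4.0]                       # hypothetical 4-layer encoder
layer_states = np.stack([noise + o for o in offsets])

curve = layerwise_similarity(layer_states)
print([round(v, 3) for v in curve])
```

A flat curve followed by a jump would point at the layer where most of the integration happens; a curve that approaches 1 well before the last layer would suggest collapse.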

[Figure: layer-wise similarity, mean pairwise cosine by encoder layer. Similarity builds steadily toward the final layer; sharp jumps indicate where the most information integration happens.]

OOD detection

An input that embeds far from the centroid of your dataset's embedding distribution is out-of-distribution for your corpus. Including OOD inputs in a retrieval index degrades retrieval quality because they pull nearest-neighbor scores away from genuinely relevant results. Aquin computes an OOD proximity score for each input by measuring cosine distance from the corpus centroid. Inputs above a configurable threshold are flagged and listed for review before indexing.
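The centroid-distance score can be sketched as below. The threshold value of 0.5 and the function names are hypothetical, and the centroid here is computed over all inputs including the outlier, which is a simplification.

```python
import numpy as np

def ood_scores(embeddings):
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = x.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    return 1.0 - x @ centroid                 # cosine distance to the centroid

def flag_ood(embeddings, threshold=0.5):      # threshold is an assumed example value
    scores = ood_scores(embeddings)
    return np.flatnonzero(scores > threshold).tolist(), scores

# Four inputs pointing roughly the same way, plus one that does not.
corpus = np.array([
    [1.0, 0.10],
    [1.0, -0.10],
    [0.9, 0.00],
    [1.0, 0.05],
    [-0.2, 1.00],
])
flagged, scores = flag_ood(corpus)
print(flagged, np.round(scores, 3))
```

In practice the threshold would be tuned per corpus, for example from a percentile of the score distribution on known in-distribution data.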

Retrieval evaluation

Aquin evaluates retrieval quality on your own query-document pairs. Upload a JSONL file with query and document fields, optionally with relevance labels, and Aquin computes the full retrieval metric suite: Recall@1, Recall@5, Recall@10, MRR, and NDCG@10. Results are broken down by topic category when labels are available.

The hard-negative gap is the most actionable metric. It measures how much cosine similarity separates the closest true positive from the closest hard negative for each query. A small gap means the model is barely distinguishing relevant from near-relevant documents at the decision boundary, a failure mode that Recall@k scores on their own do not reveal.
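For a single query, the gap described above reduces to a difference of two maxima. The sketch below uses hand-written vectors in place of real embeddings, and `hard_negative_gap` is a hypothetical helper name.

```python
import numpy as np

def cosine_to_all(query, docs):
    return (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

def hard_negative_gap(query, positives, hard_negatives):
    # Closest true positive minus closest hard negative, both by cosine similarity.
    best_pos = cosine_to_all(query, positives).max()
    best_neg = cosine_to_all(query, hard_negatives).max()
    return float(best_pos - best_neg)

query = np.array([1.0, 0.0])
positives = np.array([[1.0, 0.0], [0.0, 1.0]])
hard_negatives = np.array([[1.0, 1.0], [0.0, 1.0]])

gap = hard_negative_gap(query, positives, hard_negatives)
print(round(gap, 4))
```

A negative gap would mean a hard negative outranks every true positive for that query.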

| Metric | Description |
|---|---|
| Recall@1 | Fraction of queries where the top-1 result is the correct document |
| Recall@5 | Fraction of queries where the correct document appears in the top 5 |
| Recall@10 | Fraction of queries where the correct document appears in the top 10 |
| MRR | Mean Reciprocal Rank, the average of 1/rank across all queries |
| NDCG@10 | Normalized Discounted Cumulative Gain, accounts for graded relevance labels |
| Hard-neg gap | Mean cosine delta between closest positive and closest hard negative |
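The rank-based metrics in the table can be sketched from a query-by-document similarity matrix. This version assumes exactly one relevant document per query with binary relevance, in which case the ideal DCG is 1 and NDCG@10 simplifies to 1/log2(1 + rank); graded labels would need the full DCG computation. Function names are hypothetical.

```python
import math
import numpy as np

def ground_truth_ranks(sims, truth_idx):
    # sims: (queries, docs). Rank = 1 + number of documents scored above the true one.
    truth = sims[np.arange(len(sims)), truth_idx][:, None]
    return (1 + (sims > truth).sum(axis=1)).tolist()

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    n = len(ranks)
    out = {f"recall@{k}": sum(r <= k for r in ranks) / n for k in ks}
    out["mrr"] = sum(1.0 / r for r in ranks) / n
    # Single relevant document per query: NDCG@10 is 1/log2(1 + rank) when rank <= 10.
    out["ndcg@10"] = sum(1.0 / math.log2(1 + r) for r in ranks if r <= 10) / n
    return out

sims = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.5],
    [0.6, 0.7, 0.4],
])
ranks = ground_truth_ranks(sims, truth_idx=[0, 1, 0])
metrics = retrieval_metrics(ranks)
print(ranks, metrics)
```

The `ranks` list is also the raw input for a rank histogram: counting how many queries land at each rank gives the distribution directly.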

[Figure: nearest-neighbor rank distribution, ground-truth document rank per query. Mass at rank 1 means good retrieval; a long tail toward higher ranks indicates queries where the model struggles.]

Fine-tuning support

Aquin's training monitor runs on embedding model fine-tunes the same way it runs on LLM fine-tunes. Loss and gradient signals are streamed per step. For contrastive objectives (InfoNCE, NT-Xent, triplet), the loss is decomposed into positive-pair similarity and negative-pair similarity, tracked separately. A widening gap between the two is healthy. A narrowing gap means the model is pulling negatives in as well, not just pulling positives together.
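The decomposition can be illustrated with an InfoNCE loss over in-batch negatives, where row i of the query batch is paired with row i of the document batch and every other row serves as a negative. The temperature value and the returned field names are assumptions for this sketch, not Aquin's logged schema.

```python
import numpy as np

def info_nce_with_stats(queries, docs, temperature=0.05):
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = q @ d.T                                        # diagonal = positive pairs
    logits = sims / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    off_diagonal = ~np.eye(len(sims), dtype=bool)
    pos_sim = float(np.diag(sims).mean())
    neg_sim = float(sims[off_diagonal].mean())
    return {
        "loss": float(-np.diag(log_probs).mean()),
        "pos_sim": pos_sim,
        "neg_sim": neg_sim,
        "gap": pos_sim - neg_sim,
    }

aligned = info_nce_with_stats(np.eye(3), np.eye(3))
misaligned = info_nce_with_stats(np.eye(3), np.roll(np.eye(3), 1, axis=0))
print(aligned)
print(misaligned)
```

Logging `pos_sim` and `neg_sim` per step alongside the loss makes it possible to tell whether a loss change comes from positives moving closer or from negatives moving away.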

LoRA fine-tuning on embedding models is supported natively. Adapter matrices are merged at load time for inspection. The training monitor tracks per-layer gradient norms across the encoder layers and flags layers where gradients have died or spiked. This is the same dead-layer detector used for LLMs, applied to the encoder stack.
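Both ideas in this paragraph can be sketched in a few lines. The merge uses the standard LoRA formulation, W + (alpha / r) · B · A; the gradient check uses hypothetical thresholds (an absolute floor for dead layers, a multiple of the median for spikes), since the detector's actual criteria are not described here.

```python
import numpy as np

def merge_lora(W, A, B, alpha):
    # W: (out, in), A: (r, in), B: (out, r). Standard LoRA update scaled by alpha / r.
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

def flag_layers(grad_norms, dead_below=1e-6, spike_factor=10.0):
    # Assumed criteria: "dead" = norm under a fixed floor, "spiked" = far above the median.
    norms = np.asarray(grad_norms, dtype=float)
    median = np.median(norms)
    dead = np.flatnonzero(norms < dead_below).tolist()
    spiked = np.flatnonzero(norms > spike_factor * median).tolist()
    return dead, spiked

W = np.zeros((2, 3))
A = np.array([[1.0, 2.0, 3.0]])       # rank r = 1
B = np.array([[1.0], [2.0]])
merged = merge_lora(W, A, B, alpha=2.0)

dead, spiked = flag_layers([0.5, 0.4, 0.0, 0.6, 9.0])
print(merged)
print(dead, spiked)
```

After merging, the layer behaves like an ordinary dense weight matrix, so the inspection signals can run on it without any adapter-specific handling.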

Full fine-tune

All encoder parameters updated. Gradient norms tracked per layer.

LoRA

Low-rank adapters on Q, K, V projections. Merged at load for inspection.

Contrastive

InfoNCE, NT-Xent, triplet loss. Positive and negative pair similarity tracked separately.

Embedding diff

When you fine-tune an embedding model, the geometry of the space changes. Aquin's embedding diff runs both checkpoints on the same probe dataset and compares: centroid positions per topic cluster, cosine similarity distribution shift, anisotropy delta, and nearest-neighbor rank changes across the query set. This tells you what the fine-tune changed in the space, not just whether task metrics went up.

Embedding drift is reported as a composite score, a weighted average of centroid shift magnitude, rank change count, and anisotropy delta. A fine-tune that improves retrieval by pulling topic clusters apart without inflating anisotropy receives a good score. A fine-tune that improves one cluster's retrieval by collapsing another's geometry receives a poor score, even if headline Recall@1 went up.
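The components of the diff can be sketched as below. The weights, the use of a changed-rank fraction instead of a raw count, and the absolute value on the anisotropy delta are all assumptions made for illustration; the document does not state how the composite is normalized or weighted. Comparing centroids directly also assumes both checkpoints share a coordinate system, which is plausible for a fine-tune of the same base but not guaranteed.

```python
import numpy as np

def centroid_shifts(base, tuned, labels):
    # Cosine distance between each topic's centroid before and after fine-tuning.
    labels = np.asarray(labels)
    shifts = {}
    for topic in np.unique(labels):
        a = base[labels == topic].mean(axis=0)
        b = tuned[labels == topic].mean(axis=0)
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        shifts[str(topic)] = float(1.0 - cos)
    return shifts

def composite_drift(shifts, base_ranks, tuned_ranks, aniso_base, aniso_tuned,
                    weights=(0.5, 0.3, 0.2)):          # hypothetical weights
    mean_shift = float(np.mean(list(shifts.values())))
    changed = float(np.mean(np.asarray(base_ranks) != np.asarray(tuned_ranks)))
    aniso_delta = aniso_tuned - aniso_base
    w_shift, w_rank, w_aniso = weights
    score = w_shift * mean_shift + w_rank * changed + w_aniso * abs(aniso_delta)
    return {"mean_centroid_shift": mean_shift, "rank_changed_fraction": changed,
            "anisotropy_delta": aniso_delta, "composite": score}

base = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
tuned = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
labels = ["legal", "legal", "medical", "medical"]

shifts = centroid_shifts(base, tuned, labels)
report = composite_drift(shifts, base_ranks=[1, 4, 2], tuned_ranks=[1, 2, 2],
                         aniso_base=0.1, aniso_tuned=0.3)
print(shifts)
print(report)
```

Keeping the per-topic shifts alongside the composite matters: a single number can hide the case where one cluster moved a lot and the others did not.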

[Figure: embedding diff, cluster centroid shift, base vs fine-tuned, for the medical, legal, and finance topics. Dashed circles: base checkpoint cluster positions. Solid circles: fine-tuned. Arrows show the direction and magnitude of centroid drift per topic.]
