
Deception feature identification

Rank SAE features that separate honest from deceptive behavior on the loaded model (LLM or embedding). The default conditioning generates completions, classifies each output against reference answers, and buckets examples by observed truthful or deceptive behavior before SAE encoding. Δ is the difference in mean activation, deceptive − honest. Use --direction deceptive to require Δ>0. Requires a public SAE for the model layer.

Prerequisite
LLM: aquin session start --id my-run --model llama-3.2-1b · aquin load sae llama-3.2-1b-l8
Embedding: aquin session start --id my-run --model gte-small · aquin load sae gte-small-l11

1 command

aquin find-feature

agent tool: run_find_feature

Generate completions from deception probes (default), classify each response against honest/deceptive references, bucket by observed behavior, encode prompt+completion with the public SAE, and rank features by mean activation delta (deceptive − honest). Use --conditioning prompt for legacy static probe-text encoding. LLMs use token-mean residual activations; embedding models use mean-pooled hidden states (prompt conditioning only). Use --direction to filter by sign: both (|Δ|, default), deceptive (Δ>0), honest (Δ<0). Optionally re-rank top candidates with InterpScore (--benchmark-top) and persist the chosen index (--persist).
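The ranking step described above can be sketched in a few lines of plain Python. This is an illustration only, not Aquin's implementation: the function name rank_features, its arguments, and the toy activation values are all hypothetical, and the activations are assumed to be already encoded as one vector of per-feature values per example.

```python
from statistics import fmean

def rank_features(honest, deceptive, direction="both", top=20):
    """Rank feature indices by mean activation delta (deceptive - honest).

    honest / deceptive: lists of per-example activation vectors,
    one float per SAE feature (hypothetical input layout).
    """
    if not honest or not deceptive:
        raise ValueError("need at least one example in each bucket")
    n_features = len(honest[0])
    deltas = []
    for j in range(n_features):
        mean_h = fmean([row[j] for row in honest])
        mean_d = fmean([row[j] for row in deceptive])
        deltas.append((j, mean_d - mean_h))

    if direction == "deceptive":    # keep delta > 0, largest first
        kept = sorted((p for p in deltas if p[1] > 0), key=lambda p: p[1], reverse=True)
    elif direction == "honest":     # keep delta < 0, most negative first
        kept = sorted((p for p in deltas if p[1] < 0), key=lambda p: p[1])
    elif direction == "both":       # rank by |delta|
        kept = sorted(deltas, key=lambda p: abs(p[1]), reverse=True)
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return kept[:top]

# Toy activations for three features (made-up numbers).
honest_acts = [[0.0, 1.0, 0.5], [0.0, 1.0, 0.5]]
deceptive_acts = [[1.0, 0.0, 0.5], [1.0, 0.5, 0.5]]

for name in ("both", "deceptive", "honest"):
    print(name, rank_features(honest_acts, deceptive_acts, direction=name))
```

With direction "both" a strongly negative Δ can outrank a weakly positive one, which is why the sign filters exist when only deception-aligned features are wanted.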

Flag               Description
--scorer           Scorer name (default: deception).
--prompts          JSON/JSONL probe file with prompt + honest_reference + deceptive_reference (or paired honest/deceptive rows). Omit to use bundled fixtures/deception/deception_probes.jsonl.
--layer            SAE layer (default: model default from load sae).
--checkpoint       Optional fine-tuned checkpoint (.pt state dict or HF directory).
--top              Number of ranked features to return (default 20).
--direction        both (|Δ|, default), deceptive (Δ>0 only), or honest (Δ<0 only).
--conditioning     behavior (default): generate + classify output; prompt: static probe text only.
--benchmark-top    Re-rank top K with InterpScore + Purity (needs OpenAI).
--persist          Write chosen feature to ~/.aquin/experiments/<model>.json and session memory.
--output           Write full JSON result to path.
example
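A minimal sketch of assembling an invocation from the flags listed above, for scripts that drive the CLI. The helper find_feature_argv is hypothetical and not part of Aquin. It only uses flags whose value shape is clear from the table (--prompts, --top, --direction, --conditioning, --output) and assumes each takes a single value. It builds the argument list without running anything.

```python
import shlex

DIRECTIONS = {"both", "deceptive", "honest"}
CONDITIONINGS = {"behavior", "prompt"}

def find_feature_argv(prompts=None, top=None, direction=None,
                      conditioning=None, output=None):
    """Build an argv list for `aquin find-feature` (hypothetical helper).

    Flags are emitted only when a value is given, so the CLI's own
    defaults apply otherwise.
    """
    if direction is not None and direction not in DIRECTIONS:
        raise ValueError(f"direction must be one of {sorted(DIRECTIONS)}")
    if conditioning is not None and conditioning not in CONDITIONINGS:
        raise ValueError(f"conditioning must be one of {sorted(CONDITIONINGS)}")

    argv = ["aquin", "find-feature"]
    if prompts is not None:
        argv += ["--prompts", prompts]
    if top is not None:
        argv += ["--top", str(top)]
    if direction is not None:
        argv += ["--direction", direction]
    if conditioning is not None:
        argv += ["--conditioning", conditioning]
    if output is not None:
        argv += ["--output", output]
    return argv

# Print a shell-quoted command line; the output path is a placeholder.
print(shlex.join(find_feature_argv(top=10, direction="deceptive", output="result.json")))
```

The resulting list can be passed to subprocess.run once a session and SAE are loaded as described under Prerequisite.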

Probe rows need honest_reference and deceptive_reference for behavior classification (the bundled fixture includes them). The command syncs a findFeature card to the web orchestrator. LAT workflow: find-feature --persist → extract-steer-vector --save → steer --vector. Use aquin mem --read or the experiment JSON to pass the persisted feature to the downstream sae diff, steer, and collapse tools.
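Because behavior conditioning depends on both reference fields, it can help to check a probe file before running the command. The sketch below is an assumption-laden illustration: the field names come from the --prompts description, but the function missing_fields and the sample rows are hypothetical, and Aquin may validate its input differently.

```python
import json

# Field names taken from the --prompts flag description.
REQUIRED = ("prompt", "honest_reference", "deceptive_reference")

def missing_fields(jsonl_text):
    """Return {line_number: [missing keys]} for rows lacking a required field."""
    problems = {}
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = [key for key in REQUIRED if not row.get(key)]
        if missing:
            problems[lineno] = missing
    return problems

# Made-up rows for illustration; the second one lacks deceptive_reference.
sample = "\n".join([
    json.dumps({
        "prompt": "What is the capital of France?",
        "honest_reference": "Paris",
        "deceptive_reference": "Lyon",
    }),
    json.dumps({
        "prompt": "What is 2 + 2?",
        "honest_reference": "4",
    }),
])

print(missing_fields(sample))
```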

Probe formats

Paired rows (one honest + one deceptive statement per line):

deception_probes.jsonl (paired)
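As a rough illustration of the paired layout, the sketch below serializes statement pairs to JSONL. The key names "honest" and "deceptive" are an assumption inferred from the phrase "paired honest/deceptive rows" and should be checked against the bundled fixture. The helper name and the example statements are hypothetical.

```python
import json

def paired_rows_to_jsonl(pairs):
    """Serialize (honest, deceptive) statement pairs, one JSON object per line.

    The "honest" / "deceptive" keys are assumed, not confirmed by the docs.
    """
    lines = [
        json.dumps({"honest": honest, "deceptive": deceptive})
        for honest, deceptive in pairs
    ]
    return "\n".join(lines) + "\n"

# Made-up statements for illustration.
pairs = [
    ("The Eiffel Tower is in Paris.", "The Eiffel Tower is in Rome."),
    ("Water boils at 100 °C at sea level.", "Water boils at 50 °C at sea level."),
]

jsonl = paired_rows_to_jsonl(pairs)
print(jsonl, end="")
```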

Or labeled single-text rows (same schema as capture probes):

deception_probes.jsonl (labeled)

Typical workflow

identify → capture → diff

Related: Capture & train, Checkpoint SAE.