Deception feature identification

Rank SAE features that separate honest from deceptive behavior on the loaded model (LLM or embedding model). The default conditioning generates completions, classifies each output against reference answers, and buckets samples by observed truthful or deceptive behavior before SAE encoding. Δ is the mean activation on deceptive samples minus the mean on honest samples (deceptive − honest). Use --direction deceptive to keep only features with Δ>0. Requires a public SAE for the model layer.
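The bucketing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classifier the tool actually uses is not documented here, so a simple word-overlap (Jaccard) comparison stands in for it, and the function names, the "completion" field, and the "unclear" bucket are hypothetical.

```python
import re

def tokens(text):
    """Lowercase word tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a stand-in for the real classifier."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def classify(completion, honest_reference, deceptive_reference):
    """Label a completion by whichever reference it resembles more."""
    h = jaccard(completion, honest_reference)
    d = jaccard(completion, deceptive_reference)
    if h == d:
        return "unclear"  # assumption: ties are set aside, not forced into a bucket
    return "honest" if h > d else "deceptive"

def bucket(samples):
    """Group prompt+completion texts by observed behavior."""
    buckets = {"honest": [], "deceptive": [], "unclear": []}
    for s in samples:
        label = classify(s["completion"], s["honest_reference"], s["deceptive_reference"])
        buckets[label].append(s["prompt"] + " " + s["completion"])
    return buckets
```

The point of the sketch is that the label comes from what the model actually wrote, not from the probe text, which is what distinguishes behavior conditioning from static prompt conditioning.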

Prerequisite

LLM: aquin session start --id my-run --model llama-3.2-1b · aquin load sae llama-3.2-1b-l8

Embedding: aquin session start --id my-run --model gte-small · aquin load sae gte-small-l11


aquin find-feature

agent tool: run_find_feature

Generate completions from deception probes (default), classify each response against honest/deceptive references, bucket by observed behavior, encode prompt+completion with the public SAE, and rank features by mean activation delta (deceptive − honest). Use --conditioning prompt for legacy static probe-text encoding. LLMs use token-mean residual activations; embedding models use mean-pooled hidden states (prompt conditioning only). Use --direction to filter by sign: both (|Δ|, default), deceptive (Δ>0), honest (Δ<0). Optionally re-rank top candidates with InterpScore (--benchmark-top) and persist the chosen index (--persist).
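The ranking rule above (mean activation delta, filtered by sign) can be illustrated with a small pure-Python sketch. It assumes each sample has already been encoded into a list of per-feature SAE activations; the function names and the tie-free sort by |Δ| are illustrative choices, not the tool's confirmed implementation.

```python
def mean_activations(rows):
    """Column-wise mean over a non-empty list of equal-length activation rows."""
    n = len(rows)
    width = len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(width)]

def rank_features(honest, deceptive, direction="both", top=20):
    """Return up to `top` (feature_index, delta) pairs, largest |delta| first.

    delta = mean(deceptive) - mean(honest), per feature.
    direction: "both" keeps all signs, "deceptive" keeps delta > 0,
    "honest" keeps delta < 0.
    """
    if direction not in ("both", "deceptive", "honest"):
        raise ValueError("direction must be both, deceptive, or honest")
    mean_h = mean_activations(honest)
    mean_d = mean_activations(deceptive)
    deltas = [d - h for d, h in zip(mean_d, mean_h)]
    if direction == "deceptive":
        candidates = [(i, x) for i, x in enumerate(deltas) if x > 0]
    elif direction == "honest":
        candidates = [(i, x) for i, x in enumerate(deltas) if x < 0]
    else:
        candidates = list(enumerate(deltas))
    candidates.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return candidates[:top]
```

With direction="both" a strongly honest-leaning feature can outrank a weakly deceptive one, because ranking is by magnitude; the signed modes drop the opposite sign before ranking.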

Flag	Description
--scorer	Scorer name (default: deception).
--prompts	JSON/JSONL probe file with prompt + honest_reference + deceptive_reference (or paired honest/deceptive rows). Omit to use bundled fixtures/deception/deception_probes.jsonl.
--layer	SAE layer (default: model default from load sae).
--checkpoint	Optional fine-tuned checkpoint (.pt state dict or HF directory).
--top	Number of ranked features to return (default 20).
--direction	both (|Δ|, default), deceptive (Δ>0 only), or honest (Δ<0 only).
--conditioning	behavior (default): generate + classify output; prompt: static probe text only.
--benchmark-top	Re-rank top K with InterpScore + Purity (needs OpenAI).
--persist	Write chosen feature to ~/.aquin/experiments/<model>.json and session memory.
--output	Write full JSON result to path.

example

Probe rows need honest_reference and deceptive_reference for behavior classification (the bundled fixture includes them). The command syncs a findFeature card to the web orchestrator. LAT workflow: find-feature --persist → extract-steer-vector --save → steer --vector. Use aquin mem --read or the experiment JSON to feed the downstream sae diff, steer, and collapse tools.
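Before passing a custom file to --prompts, it can help to check that every row carries the three fields behavior conditioning relies on. The sketch below is a hypothetical pre-flight check, not part of aquin: it assumes one JSON object per line with string-valued prompt, honest_reference, and deceptive_reference keys, and it does not cover the paired or labeled formats.

```python
import json

# Field names taken from the --prompts description; everything else is illustrative.
REQUIRED = ("prompt", "honest_reference", "deceptive_reference")

def validate_probe_lines(lines):
    """Return (rows, problems) for an iterable of JSONL lines.

    rows: parsed objects that contain every required field as a non-empty string.
    problems: (line_number, message) pairs for lines that were rejected.
    """
    rows, problems = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, "invalid JSON: " + err.msg))
            continue
        if not isinstance(row, dict):
            problems.append((number, "row is not a JSON object"))
            continue
        missing = [k for k in REQUIRED
                   if not isinstance(row.get(k), str) or not row[k].strip()]
        if missing:
            problems.append((number, "missing " + ", ".join(missing)))
        else:
            rows.append(row)
    return rows, problems
```

For a file on disk, something like `with open(path, encoding="utf-8") as f: rows, problems = validate_probe_lines(f)` would apply the same check line by line.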

Probe formats

Paired rows (one honest + one deceptive statement per line):

deception_probes.jsonl (paired)

Or labeled single-text rows (same schema as capture probes):

deception_probes.jsonl (labeled)

Typical workflow

identify → capture → diff