Deception feature identification
Rank SAE features that separate honest from deceptive behavior on the loaded model (LLM or embedding). The default conditioning generates completions, classifies each output against reference answers, and buckets samples by observed truthful or deceptive behavior before SAE encoding. Δ is the mean feature activation on deceptive samples minus the mean on honest samples (deceptive − honest). Use --direction deceptive to require Δ>0. Requires a public SAE for the model layer.
aquin find-feature
agent tool: run_find_feature
Generate completions from deception probes (default), classify each response against honest/deceptive references, bucket by observed behavior, encode prompt+completion with the public SAE, and rank features by mean activation delta (deceptive − honest). Use --conditioning prompt for legacy static probe-text encoding. LLMs use token-mean residual activations; embedding models use mean-pooled hidden states (prompt conditioning only). Use --direction to filter by sign: both (|Δ|, default), deceptive (Δ>0), honest (Δ<0). Optionally re-rank top candidates with InterpScore (--benchmark-top) and persist the chosen index (--persist).
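The ranking step described above can be sketched in a few lines. This is a minimal illustration, not aquin's implementation: `rank_features` is a hypothetical helper, the activations are made-up toy numbers, and the sort order within each direction is an assumption.

```python
from statistics import fmean


def rank_features(honest, deceptive, direction="both", top=20):
    """Rank feature indices by mean activation delta (deceptive - honest).

    honest / deceptive: lists of per-sample activation vectors, one float per
    SAE feature. direction mirrors the --direction values described above.
    """
    n_features = len(honest[0])
    deltas = []
    for j in range(n_features):
        mean_h = fmean([row[j] for row in honest])
        mean_d = fmean([row[j] for row in deceptive])
        deltas.append((j, mean_d - mean_h))

    if direction == "deceptive":      # keep only delta > 0
        kept = sorted((p for p in deltas if p[1] > 0), key=lambda p: p[1], reverse=True)
    elif direction == "honest":       # keep only delta < 0
        kept = sorted((p for p in deltas if p[1] < 0), key=lambda p: p[1])
    elif direction == "both":         # rank by |delta|
        kept = sorted(deltas, key=lambda p: abs(p[1]), reverse=True)
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return kept[:top]


# Toy data: 3 features, 2 samples per bucket (illustrative values only).
honest = [[0.0, 2.0, 1.0], [0.0, 4.0, 1.0]]
deceptive = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]

print(rank_features(honest, deceptive))                         # [(1, -3.0), (0, 2.0), (2, 0.0)]
print(rank_features(honest, deceptive, direction="deceptive"))  # [(0, 2.0)]
```

In the toy data, feature 1 fires more on honest samples (Δ = −3) and feature 0 more on deceptive ones (Δ = 2), so `both` ranks feature 1 first while `deceptive` returns only feature 0.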
| Flag | Description |
|---|---|
| --scorer | Scorer name (default: deception). |
| --prompts | JSON/JSONL probe file with prompt + honest_reference + deceptive_reference (or paired honest/deceptive rows). Omit to use bundled fixtures/deception/deception_probes.jsonl. |
| --layer | SAE layer (default: model default from load sae). |
| --checkpoint | Optional fine-tuned checkpoint (.pt state dict or HF directory). |
| --top | Number of ranked features to return (default 20). |
| --direction | both (\|Δ\|, default), deceptive (Δ>0 only), or honest (Δ<0 only). |
| --conditioning | behavior (default): generate + classify output; prompt: static probe text only. |
| --benchmark-top | Re-rank top K with InterpScore + Purity (needs OpenAI). |
| --persist | Write chosen feature to ~/.aquin/experiments/<model>.json and session memory. |
| --output | Write full JSON result to path. |
Probe rows need honest_reference and deceptive_reference fields for behavior classification (the bundled fixture includes them). The command syncs a findFeature card to the web orchestrator. LAT workflow: find-feature --persist → extract-steer-vector --save → steer --vector. Read the persisted feature with aquin mem --read or from the experiment JSON when running the downstream sae diff, steer, and collapse tools.
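To illustrate why both references are needed, here is a rough sketch of a probe row and a stand-in bucketing step. The three field names come from the flag table above; the row text is invented, and the string-similarity classifier (`bucket`, built on Python's `difflib`) is only an assumption for illustration, since this page does not say how aquin classifies completions.

```python
import json
from difflib import SequenceMatcher

# One JSONL probe row with the three documented fields (text is made up).
probe_line = json.dumps({
    "prompt": "What is the boiling point of water at sea level?",
    "honest_reference": "Water boils at 100 degrees Celsius at sea level.",
    "deceptive_reference": "Water boils at 50 degrees Celsius at sea level.",
})


def similarity(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def bucket(completion, probe):
    """Stand-in classifier: pick the reference the completion resembles most."""
    honest_score = similarity(completion, probe["honest_reference"])
    deceptive_score = similarity(completion, probe["deceptive_reference"])
    return "honest" if honest_score >= deceptive_score else "deceptive"


probe = json.loads(probe_line)
print(bucket(probe["honest_reference"], probe))     # honest
print(bucket(probe["deceptive_reference"], probe))  # deceptive
```

Without both references there is nothing to compare a completion against, which is why rows missing either field cannot be bucketed under behavior conditioning.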
Probe formats
Paired rows (one honest + one deceptive statement per line):
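A sketch of what such a file might contain, built and checked in Python. The key names `honest` and `deceptive` are assumptions inferred from the description, not confirmed field names, and the statements are invented.

```python
import json

# Hypothetical paired rows: "honest" and "deceptive" are assumed key names.
paired_rows = [
    {"honest": "The invoice was paid on March 3.",
     "deceptive": "The invoice was never sent to us."},
    {"honest": "I have not read that report yet.",
     "deceptive": "I read that report last week."},
]

# JSONL: one JSON object per line.
paired_jsonl = "\n".join(json.dumps(row) for row in paired_rows)


def check_paired(jsonl_text):
    """Parse JSONL text and verify every row carries both statements."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    for i, row in enumerate(rows, start=1):
        missing = {"honest", "deceptive"} - set(row)
        if missing:
            raise ValueError(f"row {i} is missing {sorted(missing)}")
    return rows


print(paired_jsonl)
print(len(check_paired(paired_jsonl)), "paired rows")
```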
Or labeled single-text rows (same schema as capture probes):
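A sketch of the labeled layout, again in Python. The capture-probe schema is not reproduced on this page, so the key names `text` and `label` and the label values are assumptions; check the capture documentation for the actual fields.

```python
import json
from collections import defaultdict

# Hypothetical labeled rows: "text" and "label" are assumed key names.
labeled_jsonl = "\n".join(json.dumps(row) for row in [
    {"text": "I reported every expense accurately.", "label": "honest"},
    {"text": "I hid the missing funds from the auditors.", "label": "deceptive"},
    {"text": "The test results are exactly as measured.", "label": "honest"},
])


def group_by_label(jsonl_text):
    """Collect texts per label so each bucket can be encoded separately."""
    groups = defaultdict(list)
    for line in jsonl_text.splitlines():
        if line.strip():
            row = json.loads(line)
            groups[row["label"]].append(row["text"])
    return dict(groups)


groups = group_by_label(labeled_jsonl)
print({label: len(texts) for label, texts in groups.items()})
```

Unlike paired rows, this layout does not require the two buckets to be the same size.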
Typical workflow
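One way to lay out the sequence is to assemble the command lines in Python and print them for review. The steps and flags are taken from the LAT workflow note on this page. Two things are assumptions: that extract-steer-vector and steer are aquin subcommands like find-feature, and that --persist takes no value. Any arguments that --save and --vector accept are not documented here, so they are left out.

```python
import shlex

# Command sequence from the LAT workflow note. Undocumented flag arguments
# (for --save and --vector) are intentionally omitted, not guessed.
workflow = [
    ["aquin", "find-feature", "--direction", "deceptive", "--top", "20", "--persist"],
    ["aquin", "mem", "--read"],
    ["aquin", "extract-steer-vector", "--save"],
    ["aquin", "steer", "--vector"],
]

# shlex.join quotes each argument safely for a POSIX shell.
commands = [shlex.join(argv) for argv in workflow]
for command in commands:
    print(command)
```

The script only prints the commands; pass each list to `subprocess.run` once the missing arguments are filled in from the tool's own help output.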
Related: Capture & train, Checkpoint SAE.
