
Deception feature identification

Rank SAE features that separate honest vs deceptive probe sets on the loaded model (LLM or embedding). Produces a canonical feature index for longitudinal deception interpretability experiments. Requires a public SAE for the model layer.

Prerequisite
LLM: aquin load --model llama-3.2-1b && aquin pull sae llama-3.2-1b-l8
Embedding: aquin load --model gte-small && aquin pull sae gte-small-l11

1 command

aquin find-feature

agent tool: run_find_feature

Run honest vs deceptive probes through the loaded model, encode activations with the public SAE, and rank features by mean activation delta (deceptive − honest). LLMs use token-mean residual activations; embedding models use mean-pooled hidden states. Optionally re-rank top candidates with InterpScore (--benchmark-top) and persist the chosen index (--persist).
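The ranking step described above can be sketched in plain Python. This is a minimal illustration and not Aquin's implementation: the function names token_mean and rank_by_delta, the list-of-lists data layout, and the choice to sort by signed delta in descending order are all assumptions made for the example.

```python
from statistics import fmean

def token_mean(acts):
    """Average per-token SAE feature activations into one vector per probe.

    acts: list of tokens, each a list of feature activations (assumed layout).
    """
    n_features = len(acts[0])
    return [fmean(tok[j] for tok in acts) for j in range(n_features)]

def rank_by_delta(honest, deceptive, top=20):
    """Rank feature indices by mean(deceptive) - mean(honest), largest first.

    honest, deceptive: lists of per-probe feature vectors of equal width.
    Returns up to `top` (feature_index, delta) pairs.
    """
    n_features = len(honest[0])
    deltas = []
    for j in range(n_features):
        h = fmean(row[j] for row in honest)
        d = fmean(row[j] for row in deceptive)
        deltas.append((j, d - h))
    # Assumption: a larger positive delta means "more deceptive", so sort descending.
    deltas.sort(key=lambda item: item[1], reverse=True)
    return deltas[:top]

# Toy data: 2 probes per set, 3 SAE features (values are made up).
honest = [[0.0, 1.0, 0.5], [0.0, 1.0, 0.5]]
deceptive = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
ranked = rank_by_delta(honest, deceptive, top=2)
print(ranked)  # [(0, 3.0), (1, 0.0)]
```

In this toy case feature 0 fires only on the deceptive probes, so it ranks first. Feature 2 fires more on honest probes, gets a negative delta, and falls outside the top 2.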

Flag             Description
--scorer         Scorer name (default: deception).
--prompts        JSON/JSONL probe file. Omit to use bundled fixtures/deception/deception_probes.jsonl.
--layer          SAE layer (default: model default from pull sae).
--checkpoint     Optional fine-tuned checkpoint (.pt state dict or HF directory).
--top            Number of ranked features to return (default: 20).
--benchmark-top  Re-rank top K with InterpScore + Purity (needs OpenAI).
--persist        Write chosen feature to ~/.aquin/experiments/<model>.json and session memory.
--output         Write full JSON result to path.
example

The command syncs a findFeature card to the web orchestrator. For the downstream sae diff, steer, and collapse tools, read the chosen feature with mem-read deception_feature or from the experiment JSON.

Probe formats

Paired rows (one honest + one deceptive statement per line):

deception_probes.jsonl (paired)

Or labeled single-text rows (same schema as capture probes):

deception_probes.jsonl (labeled)
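The two probe layouts described above carry the same information, so a paired file can be expanded into labeled rows. The sketch below shows that conversion. The key names honest, deceptive, text, and label are hypothetical placeholders chosen for illustration; the actual field names are defined by the bundled deception_probes.jsonl and should be checked there.

```python
import json

def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def to_labeled(rows):
    """Expand paired rows into labeled single-text rows.

    Key names are hypothetical: "honest"/"deceptive" for paired rows,
    "text"/"label" for labeled rows. Rows without both paired keys are
    assumed to be labeled already and are passed through unchanged.
    """
    out = []
    for row in rows:
        if "honest" in row and "deceptive" in row:
            out.append({"text": row["honest"], "label": "honest"})
            out.append({"text": row["deceptive"], "label": "deceptive"})
        else:
            out.append(row)
    return out

# Made-up sample line in the assumed paired layout.
sample = '{"honest": "The sky is blue.", "deceptive": "The sky is green."}\n'
labeled = to_labeled(parse_jsonl(sample))
print(labeled)
```

Each paired line yields two labeled rows, so a file with N paired lines produces 2N labeled probes.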

Typical workflow

identify → capture → diff

Related: Capture & train, Checkpoint SAE.