Inspection (SAE): Embedding
Sparse autoencoder tools for embedding encoders. Decomposes final-layer activations into sparse features, compares texts at the feature level, traces circuits, and measures dictionary health. Requires embedding mode plus a pulled embedding SAE.
11 commands
aquin embed-sae-features
agent tool: run_embed_sae_features
Runs text through the encoder and SAE encoder, returns the top-k active sparse features with activation strengths. Entry point for understanding what concepts the embedding contains.
| Flag | Description |
|---|---|
| --text* | Input text. |
| --top_k | Number of features to return (default: 10). |
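
As a rough illustration of what "top-k active sparse features" means, the sketch below applies a toy ReLU SAE encoder to a made-up embedding and keeps the strongest activations. The helper names (`sae_encode`, `top_k_features`), the weights, and the ReLU form are assumptions for illustration; the actual SAE architecture and output format used by the command are not specified here.

```python
def sae_encode(embedding, w_enc, b_enc):
    """Toy ReLU SAE encoder: one non-negative activation per dictionary feature."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, embedding)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def top_k_features(activations, top_k=10):
    """Return (feature_idx, activation) pairs for the strongest active features."""
    active = [(i, a) for i, a in enumerate(activations) if a > 0.0]
    active.sort(key=lambda pair: pair[1], reverse=True)
    return active[:top_k]


# Made-up 3-dimensional embedding and 4-feature dictionary (hypothetical values).
embedding = [1.0, 0.0, 2.0]
w_enc = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]
b_enc = [0.0, 0.0, 0.5, 0.0]

activations = sae_encode(embedding, w_enc, b_enc)
print(top_k_features(activations, top_k=2))
```

Only features with a strictly positive activation count as active, so the result can contain fewer than `top_k` entries.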
aquin embed-sae-contrastive
agent tool: run_embed_sae_contrastive
Compares two texts at the SAE feature level. Returns features with the largest activation delta: what the encoder represents differently between the two inputs.
| Flag | Description |
|---|---|
| --text_a* | First text. |
| --text_b* | Second text. |
| --top_k | Number of top diverging features to report. |
| --corpus | Optional corpus for feature labeling. |
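
The "largest activation delta" idea can be sketched as follows, assuming each text has already been reduced to a vector of SAE feature activations. The function name `top_diverging` and the choice to rank by absolute difference are assumptions; the command may rank or sign deltas differently.

```python
def top_diverging(acts_a, acts_b, top_k=5):
    """Rank features by |activation_a - activation_b|, keeping the signed delta."""
    deltas = [(i, a - b) for i, (a, b) in enumerate(zip(acts_a, acts_b))]
    deltas.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return deltas[:top_k]


# Hypothetical SAE activations for two texts over a 4-feature dictionary.
acts_a = [0.0, 2.0, 0.5, 1.0]
acts_b = [0.0, 0.5, 0.5, 3.0]

# Positive delta: stronger in text A; negative delta: stronger in text B.
print(top_diverging(acts_a, acts_b, top_k=2))
```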
aquin embed-sae-interp
agent tool: run_embed_sae_interp_score
Scores the interpretability of one SAE feature over a corpus: how consistently it fires on semantically related vs unrelated texts.
| Flag | Description |
|---|---|
| --feature_idx* | Feature index. |
| --corpus* | JSON array of corpus strings. |
| --n_samples | Samples per scoring pass. |
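
Because the corpus is passed as a JSON array, it can be easier to build the command line programmatically than to escape it by hand. The sketch below only assembles and prints a shell-safe command string. The feature index `42` and the corpus strings are made up, and it assumes `--corpus` accepts the JSON inline rather than, say, a file path, which is not stated here.

```python
import json
import shlex

# Hypothetical corpus and feature index, for illustration only.
corpus = ["the cat sat on the mat", "stock prices fell sharply"]

argv = [
    "aquin", "embed-sae-interp",
    "--feature_idx", "42",
    "--corpus", json.dumps(corpus),  # serialize the list as a JSON array
]

# shlex.join quotes each argument so the JSON survives shell parsing.
command = shlex.join(argv)
print(command)
```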
aquin embed-sae-browser
agent tool: run_embed_sae_browser
Browses the most frequently active SAE features across a corpus. Surfaces the dominant concepts the encoder uses for that text collection.
| Flag | Description |
|---|---|
| --corpus* | JSON array of strings. |
| --top_n_features | Number of features to list. |
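
One way to read "most frequently active" is the fraction of corpus texts on which a feature fires at all. The sketch below computes that from precomputed activations. The function name and the "activation greater than zero" criterion are assumptions, and the tool may weight by activation strength instead.

```python
from collections import Counter


def most_frequent_features(corpus_activations, top_n_features=3):
    """Return (feature_idx, firing_rate) for the features active on the most texts."""
    counts = Counter()
    for acts in corpus_activations:
        counts.update(i for i, a in enumerate(acts) if a > 0.0)
    n_texts = len(corpus_activations)
    return [(i, c / n_texts) for i, c in counts.most_common(top_n_features)]


# Hypothetical activations: one row per corpus text, one column per feature.
corpus_activations = [
    [0.9, 0.0, 0.0],
    [0.4, 0.0, 0.0],
    [0.7, 0.3, 0.1],
    [0.0, 0.0, 0.5],
]

print(most_frequent_features(corpus_activations, top_n_features=2))
```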
aquin embed-sae-graph
agent tool: run_embed_sae_network_graph
Builds a co-activation graph: nodes are SAE features, edges connect features that fire together above a threshold. Reveals feature communities in the dictionary.
| Flag | Description |
|---|---|
| --corpus* | JSON array of strings. |
| --threshold | Co-activation threshold. |
| --top_n_features | Limit graph to top-N active features. |
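
A minimal sketch of a co-activation graph, assuming co-activation is measured as the fraction of corpus texts on which both features are active. That definition, the function name `coactivation_edges`, and the example threshold are assumptions; the command could equally use correlation or another statistic.

```python
from itertools import combinations


def coactivation_edges(corpus_activations, threshold=0.5):
    """Return (feature_i, feature_j, rate) for pairs that co-fire on >= threshold of texts."""
    n_texts = len(corpus_activations)
    n_features = len(corpus_activations[0])
    edges = []
    for i, j in combinations(range(n_features), 2):
        both = sum(1 for acts in corpus_activations if acts[i] > 0 and acts[j] > 0)
        rate = both / n_texts
        if rate >= threshold:
            edges.append((i, j, rate))
    return edges


# Hypothetical activations: one row per text, one column per feature.
corpus_activations = [
    [1.0, 0.5, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.0, 0.7],
    [0.6, 0.0, 0.3],
]

print(coactivation_edges(corpus_activations, threshold=0.5))
```

Raising the threshold prunes weak edges, which tends to make feature communities easier to see at the cost of dropping rarer associations.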
aquin embed-sae-circuit
agent tool: run_embed_sae_circuit
Traces how one target SAE feature's activation builds up layer-by-layer through the encoder. Shows where in the stack the concept first appears and how it strengthens.
| Flag | Description |
|---|---|
| --text* | Input text. |
| --target_feature_idx* | Feature to trace. |
aquin embed-sae-steer
agent tool: run_embed_sae_steer
Boosts or suppresses one SAE feature's activation and measures the resulting cosine shift in the output embedding. Optionally re-ranks a corpus to show the retrieval impact.
| Flag | Description |
|---|---|
| --text* | Input text. |
| --feature_idx* | Feature to steer. |
| --delta* | Activation delta (positive = boost, negative = suppress). |
| --corpus | Corpus for retrieval re-ranking after steer. |
| --top_k_retrieval | Top-k for retrieval comparison. |
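
The effect of a steer can be sketched under the assumption of a linear SAE decoder: changing one feature's activation by `delta` moves the reconstructed embedding by `delta` times that feature's decoder direction. The vectors below are made up, and whether the command reports the shift as `1 - cosine` is likewise an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def steer(embedding, decoder_direction, delta):
    """Shift the embedding along one feature's decoder direction."""
    return [e + delta * d for e, d in zip(embedding, decoder_direction)]


# Hypothetical 2-dimensional embedding and decoder direction.
embedding = [1.0, 0.0]
decoder_direction = [0.0, 1.0]

steered = steer(embedding, decoder_direction, delta=1.0)  # positive delta = boost
shift = 1.0 - cosine(embedding, steered)
print(steered, shift)
```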
aquin embed-sae-absorption
agent tool: run_embed_sae_absorption
Scans for feature absorption pairs (one feature's decoder direction absorbed into another's) and near-duplicate decoder directions. Flags dictionary redundancy.
| Flag | Description |
|---|---|
| --corpus* | JSON array of strings. |
| --top_n | Number of top features to scan. |
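
The near-duplicate part of this check can be sketched as a pairwise cosine comparison of decoder directions. The `min_cosine` cutoff of 0.95, the function names, and the decoder rows are assumptions for illustration; the command's actual criterion is not documented here, and absorption detection proper would need activation data as well.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def near_duplicates(decoder_rows, min_cosine=0.95):
    """Return (i, j, similarity) for decoder directions that nearly coincide."""
    pairs = []
    for (i, u), (j, v) in combinations(enumerate(decoder_rows), 2):
        sim = cosine(u, v)
        if sim >= min_cosine:
            pairs.append((i, j, sim))
    return pairs


# Hypothetical decoder directions, one row per feature.
decoder_rows = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]

duplicates = near_duplicates(decoder_rows)
print(duplicates)
```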
aquin embed-sae-polysemy
agent tool: run_embed_sae_polysemy
Finds features that fire strongly on semantically unrelated sentences: polysemous or entangled features that hurt interpretability.
| Flag | Description |
|---|---|
| --corpus* | JSON array of strings. |
| --top_n | Number of top features to analyze. |
aquin embed-sae-faithfulness
agent tool: run_embed_sae_retrieval_faithfulness
Ablates SAE features one at a time and measures NDCG drop on a query set. Identifies which features are load-bearing for retrieval quality.
| Flag | Description |
|---|---|
| --queries* | JSON array of query strings. |
| --corpus* | JSON array of document strings. |
| --top_k | Number of documents retrieved per query (top-k). |
| --n_features_to_test | How many top features to ablate. |
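
For reference, the NDCG drop for a single query can be sketched as below, using the standard log2-discounted DCG. The relevance lists are made up, and how the command derives relevance labels from `--queries` and `--corpus` is not stated here, so treat the inputs as an assumption.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances, top_k):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True)[:top_k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_relevances[:top_k]) / ideal_dcg


# Hypothetical relevance of retrieved documents, in ranked order.
baseline_ranking = [1, 1, 0, 0]  # before ablating the feature
ablated_ranking = [1, 0, 0, 1]   # after ablating the feature

drop = ndcg_at_k(baseline_ranking, 3) - ndcg_at_k(ablated_ranking, 3)
print(drop)
```

A feature whose ablation produces a large drop is load-bearing in the sense described above; a drop near zero suggests retrieval does not depend on it for that query set.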
aquin embed-space-decomp
agent tool: run_embed_space_decomposition
Decomposes a set of texts into their dominant shared SAE features: which concepts span the whole collection vs which are text-specific.
| Flag | Description |
|---|---|
| --texts* | JSON array of strings. |
| --top_n | Number of dominant features to report. |
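
The shared versus text-specific split can be sketched with set operations over each text's active features. Here "shared" is taken strictly to mean active on every text, which is an assumption; the command may use a softer notion such as a frequency or variance threshold. The function name and activations are illustrative.

```python
def split_shared_and_specific(text_activations):
    """Return (shared feature indices, per-text lists of non-shared active features)."""
    active_sets = [
        {i for i, a in enumerate(acts) if a > 0.0} for acts in text_activations
    ]
    shared = set.intersection(*active_sets)
    specific = [sorted(active - shared) for active in active_sets]
    return sorted(shared), specific


# Hypothetical activations: one row per text, one column per feature.
text_activations = [
    [0.5, 0.2, 0.0, 0.0],
    [0.4, 0.0, 0.3, 0.0],
    [0.9, 0.0, 0.0, 0.1],
]

shared, specific = split_shared_and_specific(text_activations)
print(shared, specific)
```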
