Evals: Embedding

Behavioral probes for embedding models: check confidence, custom Q&A with cosine-similarity scoring, and semantic match tasks. Embedding-specific retrieval evals live under Inspection (retrieval, check faithfulness). Requires embedding mode.

Prerequisite: aquin login · aquin load model gte-small

2 commands

aquin check confidence

agent tool: run_confidence_analysis

Per-probe representation confidence over a probe dataset. In embedding mode, confidence is derived from cosine similarity to a baseline-centroid reference and from spectral entropy (more diffuse = higher uncertainty). The optional --join-sae flag attaches the SAE mean L0 and top feature to each probe, for confidence ↔ feature ↔ layer analysis under stressors.

Flag	Description
--prompts*	JSON/JSONL probe file (text + optional id, stressor, lang, quant_run_id).
--threshold	Low-confidence cutoff 0–1 (default: 0.40).
--join-sae	Attach SAE mean L0 + top feature per probe.
--layer	SAE layer for join (default: model embed SAE layer, e.g. 11 for gte-small).
--save	Write schema_version=1 JSON export (stressor deltas + heatmap).
--check	Save confidence-analysis-check.json and confidence-analysis-check.png in the current directory.
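As a rough illustration of the --prompts input, the sketch below writes a small JSONL probe file using the fields listed in the table (text, plus optional id, stressor, lang). The probe texts, the stressor names other than baseline, and the file name probes.jsonl are made up for the example; the exact schema aquin accepts may be stricter than what is shown here.

```python
import json

# Hypothetical probes. Field names follow the --prompts description;
# the texts and the "typo"/"translation" stressor labels are invented.
probes = [
    {"id": "p1", "text": "The cat sat on the mat.", "stressor": "baseline", "lang": "en"},
    {"id": "p2", "text": "teh cat sat on teh mat", "stressor": "typo", "lang": "en"},
    {"id": "p3", "text": "Le chat est assis sur le tapis.", "stressor": "translation", "lang": "fr"},
]


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    for record in records:
        if "text" not in record:
            raise ValueError("every probe needs a 'text' field")
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


def write_probes(path, records):
    """Write the probe file that would be passed to --prompts."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_jsonl(records))


jsonl_text = to_jsonl(probes)
# write_probes("probes.jsonl", probes)  # uncomment to create the file
```

The resulting file would then be passed as the --prompts value of aquin check confidence.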

example

Same command as LLM check confidence; the metrics backend switches automatically. Tag baseline probes with stressor: baseline so they can serve as the centroid reference.
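The baseline-centroid idea can be sketched as follows: average the embeddings of the probes tagged baseline, then compare every probe to that centroid by cosine similarity. This is only an illustration under assumptions. The function names are hypothetical, the toy vectors stand in for real embeddings, the raw cosine value is used directly as the confidence proxy, and the spectral-entropy term is left out because its exact definition is not given here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def similarity_to_baseline(probes, threshold=0.40):
    """probes: list of (stressor, embedding) pairs.

    Returns (stressor, similarity, is_low) per probe, where is_low marks
    similarity below the threshold. Assumption: raw cosine is the score.
    """
    baseline = [vec for stressor, vec in probes if stressor == "baseline"]
    if not baseline:
        raise ValueError("at least one probe must be tagged stressor=baseline")
    reference = centroid(baseline)
    results = []
    for stressor, vec in probes:
        sim = cosine(vec, reference)
        results.append((stressor, sim, sim < threshold))
    return results


# Toy 2-d "embeddings"; real ones would come from the loaded model.
toy_probes = [
    ("baseline", [1.0, 0.0]),
    ("baseline", [0.9, 0.1]),
    ("typo", [0.0, 1.0]),
]
report = similarity_to_baseline(toy_probes)
```

In this toy data the two baseline probes sit close to the centroid, while the third probe points in a nearly orthogonal direction and falls below the 0.40 cutoff.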

aquin eval custom

agent tool: run_custom_eval

Custom eval for embedding models: encodes each prompt and reference answer, scores by cosine similarity instead of keyword overlap. Use for semantic match tasks (paraphrase detection, retrieval-style Q&A).

Flag	Description
--name*	Eval name.
--prompts*	JSON array of query strings.
--reference_answers*	JSON array of target strings.
--threshold	Cosine similarity pass threshold (default: 0.5).
--check	Save eval-check.json and eval-check.png in the current directory.
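A small helper like the one below can build the two JSON arrays expected by --prompts and --reference_answers. It assumes the arrays are paired by position (the first prompt is scored against the first reference answer, and so on), which the flag table does not state explicitly. The query and answer strings are invented for illustration.

```python
import json

# Hypothetical (query, reference answer) pairs for a semantic match eval.
pairs = [
    ("How do I reset my password?", "Steps to change a forgotten password"),
    ("What is the capital of France?", "Paris is the capital of France"),
]


def build_eval_args(qa_pairs):
    """Return (prompts_json, reference_answers_json) as JSON array strings.

    Assumption: the two arrays are matched index by index.
    """
    prompts = [query for query, _ in qa_pairs]
    references = [answer for _, answer in qa_pairs]
    return json.dumps(prompts), json.dumps(references)


prompts_json, reference_answers_json = build_eval_args(pairs)
```

Keeping the pairs together until the last step makes it harder for the two arrays to drift out of alignment.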

example

Same command as LLM eval custom; scoring backend switches automatically based on loaded model type.

Scoring

Custom evals encode each query and reference answer, then score the pair by cosine similarity of their embeddings. A pair passes when the score is at or above --threshold (default 0.5).

\mathrm{sim}(q, r) = \frac{e_q \cdot e_r}{\lVert e_q \rVert \, \lVert e_r \rVert} \in [-1, 1]

cosine similarity between query and reference embeddings
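The scoring rule can be written out in a few lines. The sketch below computes the cosine similarity from the formula above and applies the pass rule (score at or above the threshold, default 0.5). The function names and the toy vectors are illustrative only; this is not aquin's internal implementation, and real embeddings would come from the loaded model.

```python
import math


def cosine_similarity(e_q, e_r):
    """sim(q, r) = (e_q . e_r) / (||e_q|| ||e_r||), a value in [-1, 1]."""
    dot = sum(x * y for x, y in zip(e_q, e_r))
    norm_q = math.sqrt(sum(x * x for x in e_q))
    norm_r = math.sqrt(sum(y * y for y in e_r))
    if norm_q == 0.0 or norm_r == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_q * norm_r)


def score_pairs(query_embeddings, reference_embeddings, threshold=0.5):
    """Score each (query, reference) pair; a pair passes when score >= threshold."""
    results = []
    for e_q, e_r in zip(query_embeddings, reference_embeddings):
        score = cosine_similarity(e_q, e_r)
        results.append({"score": score, "passed": score >= threshold})
    return results


# Toy 2-d vectors standing in for model embeddings.
queries = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
references = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
scores = score_pairs(queries, references)
```

The four toy pairs cover identical, partially aligned, orthogonal, and opposite vectors, which give scores of 1, about 0.707, 0, and -1 respectively; only the first two pass at the default threshold.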