designing intelligence
Full-stack AI observability: trace training data provenance, inspect model weights to find where specific behaviors and knowledge are stored, and edit them directly — no fine-tuning or retraining.
attribution
Every response token traces back to the prompt tokens that caused it. Watch the signal flow through each layer until the answer locks in.
At L1 the model guesses "the". By L8 it's converging on "city". At L16, Paris is locked at 97% — the exact moment the answer forms.
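The layer-by-layer convergence described above can be sketched in the style of a logit lens. This is a minimal illustration, not Aquin's actual API: `lock_in_layer` and the probability trajectory are hypothetical, standing in for the per-layer probability a real model assigns the answer token.

```python
# Hypothetical sketch of layer-wise answer convergence (logit-lens style).
# The function name and numbers are illustrative, not Aquin's real pipeline.

def lock_in_layer(layer_probs, threshold=0.9):
    """Return the first layer index at which the target token's probability
    reaches `threshold` and stays there for every remaining layer."""
    for i in range(len(layer_probs)):
        if all(p >= threshold for p in layer_probs[i:]):
            return i
    return None

# Toy trajectory: "Paris" slowly wins the logit race across 17 layers.
probs = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.14, 0.20,
         0.35, 0.45, 0.55, 0.62, 0.70, 0.80, 0.88, 0.93, 0.97]
print(lock_in_layer(probs, threshold=0.9))  # → 15
```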
training inspection
Watch a LoRA fine-tune live. Loss, gradients, weight norms, dead layers — all streamed step by step. When training ends, see exactly how the model changed.
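The per-step telemetry above — loss, gradient norms, weight norms, dead-layer detection — amounts to one record per optimizer step. A minimal sketch, assuming flat lists of gradients and weights (`step_metrics` is hypothetical, not Aquin's streaming format):

```python
import math

def step_metrics(step, loss, grads, weights):
    """One telemetry record per optimizer step: loss, global gradient norm,
    weight norm, and a dead-layer flag (near-zero gradient signal)."""
    grad_norm = math.sqrt(sum(g * g for g in grads))
    weight_norm = math.sqrt(sum(w * w for w in weights))
    return {
        "step": step,
        "loss": loss,
        "grad_norm": round(grad_norm, 4),
        "weight_norm": round(weight_norm, 4),
        "dead": grad_norm < 1e-6,  # layer receives no learning signal
    }

print(step_metrics(1, 2.31, [0.3, -0.4], [1.0, 2.0]))
```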
human readability
Model internals are not inherently unreadable. Every activation, weight, and layer state is translated into language — with examples showing exactly when each feature fires.
| weight | raw value | label |
|---|---|---|
| L14 · MLP W_out [2048,11] | 0.847 | capital city associations |
| L8 · attn head 3 · V | -0.312 | geographic suppression |
| L12 · MLP W_in [512,2048] | 0.601 | factual recall trigger |
| L6 · attn head 7 · Q | 0.229 | question parsing |
factual checks
Most models ship as black boxes. You have no way to know what they learned to suppress, amplify, or distort. Aquin surfaces it.
Trace which features consistently skew outputs along political, demographic, or cultural lines. See the weight, not just the symptom.
Find what the model refuses to say and why. Identify suppression circuits. See whether refusals are weight-level decisions or surface-level RLHF patches.
benchmarks
Three metrics for every SAE feature. InterpScore, FeaturePurityScore, and MUI together tell you whether a feature is interpretable, monosemantic, and causally load-bearing.
Does the label predict where the feature fires?
Does it encode one concept or several?
Does ablating it actually change the output?
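The third question — causal load — reduces to a before/after comparison. A toy sketch of the idea, with hypothetical numbers (not InterpScore, FeaturePurityScore, or MUI themselves):

```python
def ablation_effect(p_with, p_without):
    """Drop in target-token probability when a feature is zeroed out.
    A larger drop means the feature is more causally load-bearing."""
    return round(p_with - p_without, 4)

# Hypothetical: P("Paris") with the feature active vs. ablated.
print(ablation_effect(0.97, 0.12))  # → 0.85
```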
evals
Three behavioral evals — no SAE required. Consistency measures output stability across phrasings. Suppression detects topic softening. Boundary probes how much the model actually knows vs pattern-matches.
Consistency: 7 paraphrase templates. Same output distribution = robust knowledge.
Suppression: length + hedging density vs. a neutral baseline. Medical dosage: 0.71.
Boundary: 4 prompt corruptions. Confidence drop distinguishes knowledge from pattern-matching.
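The consistency eval's core idea can be sketched in a few lines: run the same question through every paraphrase template and score agreement with the modal answer. `consistency_score` and the toy answers are illustrative, not the product's scoring function:

```python
from collections import Counter

def consistency_score(answers):
    """Fraction of paraphrase runs agreeing with the modal answer.
    1.0 = identical answers across all phrasings (robust knowledge)."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

# Hypothetical answers from 7 paraphrases of the same question.
runs = ["Paris", "Paris", "Paris", "Paris", "Paris", "Paris", "Lyon"]
print(round(consistency_score(runs), 3))  # → 0.857
```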
agentic system
An autonomous interpretability copilot. Tell it what you want to understand. It runs the full pipeline, chains tools, and explains what the UI is showing — in real time.
data inspection
Load any HuggingFace dataset or upload a CSV, JSONL, or Parquet file. The system runs toxicity, PII, synthetic detection, provenance chains, and bias analysis — down to specific rows.
367 rows flagged across 3 columns
row-level scoring · verdict: mixed
2 deep chains · avg liability 0.61
759 entities · overall risk: critical
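Row-level flagging like the above boils down to scanning each row against a battery of detectors. A deliberately tiny sketch using one email regex as a stand-in for the full PII pass (the function and data are hypothetical):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def flag_pii_rows(rows):
    """Return indices of rows containing an email-like string —
    a toy stand-in for a full PII detection pass."""
    return [i for i, row in enumerate(rows)
            if any(EMAIL.search(str(cell)) for cell in row)]

data = [("alice", "alice@example.com"), ("bob", "n/a"), ("eve", "eve@corp.io")]
print(flag_pii_rows(data))  # → [0, 2]
```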
read the methodology
weight editing
Locate the exact MLP layer encoding a fact. Overwrite it with a rank-one update. No retraining. We're building the editor — this is the live experiment.
L12 carries 90.4% of the causal recovery signal. Red rings = above the 40% threshold.
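A rank-one update is the mathematical core of this kind of edit: add an outer product u vᵀ to the located weight matrix, rewriting one key→value association while leaving the rest of the matrix untouched. A minimal sketch on a 2×2 matrix — the vectors are hypothetical, not a real fact edit:

```python
def rank_one_edit(W, u, v):
    """Apply W' = W + u v^T: a rank-one update that rewrites one
    key->value association without retraining the matrix."""
    return [[W[i][j] + u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]

W = [[1.0, 0.0], [0.0, 1.0]]
u = [0.5, -0.5]   # write direction (hypothetical)
v = [2.0, 0.0]    # key direction (hypothetical)
print(rank_one_edit(W, u, v))  # → [[2.0, 0.0], [-1.0, 1.0]]
```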
Not sure if Aquin is right for you?
Aquin
