Open Evaluation
Independent, reproducible tests of VaultCrux memory and reasoning. Every question, every retrieved chunk, every reasoning step — inspectable here.
500-question episodic memory benchmark from Stanford/CMU. Tests temporal reasoning, knowledge updates, multi-session aggregation, and preference recall across 6 question types.
175-question stratified sample across 10 multi-session conversations. Tests single-hop, multi-hop, temporal, open-domain, and adversarial retrieval with full per-question traces.
Methodology
Each benchmark runs the full VaultCrux answer path — real retrieval, real reasoning, no shortcuts. Results are stored as immutable run receipts so every question and every retrieved chunk can be inspected.
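As an illustration of what an immutable run receipt could look like, here is a minimal sketch in Python. The `Receipt` class, its field names, and the `receipt_digest` helper are hypothetical and are not taken from the VaultCrux harness. The sketch assumes that a receipt is a frozen record whose canonical JSON form is hashed, so any later change to a stored receipt would be detectable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class Receipt:
    # All field names are hypothetical, chosen for illustration only.
    question_id: str
    queries: tuple      # retrieval queries issued for this question
    chunk_ids: tuple    # identifiers of the retrieved chunks
    answer: str


def receipt_digest(receipt: Receipt) -> str:
    """Return a SHA-256 hex digest of the receipt's canonical JSON form."""
    # sort_keys and fixed separators make the serialization deterministic.
    payload = json.dumps(asdict(receipt), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Publishing the digest alongside the receipt would let a reader recompute it from the raw JSON and confirm that the stored run has not been edited.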
Answer path
Pattern B: one Sonnet 4.6 subagent per question with full VaultCrux retrieval. No drain-worker heuristics, no scripted answer extraction.
Retrieval
HNSW vector search (ef_search=800) with chunk-context enrichment headers and session-level supersession markers.
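To show what the `ef_search` setting controls, here is a simplified, single-layer sketch of the best-first search that HNSW runs on its bottom layer. It is not the VaultCrux implementation: the function name `search_layer`, the adjacency-list graph format, and the use of Euclidean distance are assumptions made for illustration. `ef` is the size of the candidate list the search keeps. A larger value explores more of the graph, which generally improves recall at the cost of more distance computations.

```python
import heapq
import math


def search_layer(vectors, neighbors, query, entry, ef):
    """Best-first search over one proximity-graph layer.

    vectors:   list of points (sequences of floats)
    neighbors: adjacency list, neighbors[i] = indices linked to node i
    entry:     index of the node where the search starts
    ef:        maximum number of results kept during the search
    Returns up to ef (distance, index) pairs, closest first.
    """
    def dist(i):
        return math.dist(vectors[i], query)

    visited = {entry}
    d0 = dist(entry)
    candidates = [(d0, entry)]  # min-heap: closest unexplored node on top
    best = [(-d0, entry)]       # max-heap via negation: worst kept result on top

    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop once the closest candidate is worse than every kept result.
        if len(best) >= ef and d > -best[0][0]:
            break
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst result

    return sorted((-neg_d, i) for neg_d, i in best)
```

With `ef=1` the search degenerates to a greedy walk that returns a single node. Raising `ef` keeps more alternatives alive, which is the trade-off a high setting such as 800 makes in favor of recall.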
Scoring
Format-tolerant judge — accepts verbose/first-person/N-inclusive variants. Gold errors are audited separately and excluded from the score.
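The scoring rule can be sketched as follows. This is a deliberately simple stand-in: the page does not say how the judge is implemented, so the token-containment check in `tolerant_match` is an assumption, as are the function names and the `(question_id, answer, gold)` tuple layout. The sketch only shows the two ideas stated above: an answer is accepted when the gold answer appears inside a longer or differently phrased response, and questions flagged as gold errors are dropped before the score is computed.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def tolerant_match(answer, gold):
    """True if the gold tokens occur as a contiguous run inside the answer."""
    a, g = normalize(answer), normalize(gold)
    if not g:
        return False
    return any(a[i:i + len(g)] == g for i in range(len(a) - len(g) + 1))


def score(results, gold_error_ids):
    """Accuracy over results, skipping questions audited as gold errors.

    results:        iterable of (question_id, answer, gold) tuples
    gold_error_ids: set of question ids whose gold answer is known to be wrong
    """
    kept = [(a, g) for qid, a, g in results if qid not in gold_error_ids]
    if not kept:
        return 0.0
    return sum(tolerant_match(a, g) for a, g in kept) / len(kept)
```

Excluding gold errors from the denominator, instead of counting them as correct or incorrect, keeps the reported accuracy from being moved in either direction by mistakes in the answer key.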
Transparency
Every question links to its retrieval trace: queries issued, chunks returned with context headers, reasoning synthesis, and raw receipt JSON.
Public Surface
The benchmark browser is read-only. The verifier flow, passport decode, and methodology notes live on separate routes, so the public surface stays inspectable and the harness contract stays in view.
Methodology
Pattern B, receipts, and snapshot pinning in one place.
Open methodology →
Passport
See how the verification passport is scoped and why it is read-only.
Open passport →
Verify
Manual issuance, curl samples, and the public verification contract.
Open verify →