Open Evaluation

Benchmarks

Independent, reproducible tests of VaultCrux memory and reasoning. Every question, every retrieved chunk, every reasoning step — inspectable here.

Benchmark Suites

Methodology

How benchmarks run on this page

Each benchmark runs the full VaultCrux answer path — real retrieval, real reasoning, no shortcuts. Results are stored as immutable run receipts so every question and every retrieved chunk can be inspected.

Answer path

Pattern B

One Sonnet 4.6 subagent per question with full VaultCrux retrieval. No drain-worker heuristics, no scripted answer extraction.

Retrieval

Live /v1/retrieve

HNSW vector search (ef_search=800) with chunk-context enrichment headers and session-level supersession markers.

Scoring

GPT-4o strict

Format-tolerant judge — accepts verbose/first-person/N-inclusive variants. Gold errors are audited separately and excluded from the score.

Transparency

Full receipts

Every question links to its retrieval trace: queries issued, chunks returned with context headers, reasoning synthesis, and raw receipt JSON.

Public Surface

Move from browsing to verification

The benchmark browser is read-only. The verifier flow, passport decode, and methodology notes live on separate routes so the public surface stays inspectable without hiding the harness contract.