Open Evaluation

Benchmarks

Independent, reproducible tests of VaultCrux memory and reasoning. Every question, retrieved chunk, and reasoning step is inspectable here.

Benchmark Suites

91%

LME-SPASS

Long Memory Evaluation Suite

455 / 500 correct

500-question episodic memory benchmark from Stanford/CMU. Tests temporal reasoning, knowledge updates, multi-session aggregation, and preference recall across 6 question types.

Long Conversation Memory Benchmark

174 / 175 strict

175-question stratified sample across 10 multi-session conversations. Tests single-hop, multi-hop, temporal, open-domain, and adversarial retrieval with full per-question traces.
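The headline percentages follow directly from the counts above. As a quick check, the arithmetic can be sketched as below; the helper name `accuracy_pct` and the one-decimal rounding are illustrative choices, not part of the published harness.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


# Counts taken from the two suites listed above.
print(accuracy_pct(455, 500))  # 91.0
print(accuracy_pct(174, 175))  # 99.4
```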

Methodology

How benchmarks run on this page

Each benchmark runs the full VaultCrux answer path: real retrieval, real reasoning, no shortcuts. Results are stored as immutable run receipts so every question and every retrieved chunk can be inspected.

Answer path

Pattern B

One Sonnet 4.6 subagent per question with full VaultCrux retrieval. No drain-worker heuristics, no scripted answer extraction.

Retrieval

Live /v1/retrieve

HNSW vector search (ef_search=800) with chunk-context enrichment headers and session-level supersession markers.

Scoring

GPT-4o strict

Format-tolerant judge: accepts verbose/first-person/N-inclusive variants. Gold errors are audited separately and excluded from the score.
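To make the exclusion rule concrete, here is a minimal sketch of how a score could be tallied once the judge's verdicts and the audited gold errors are known. It assumes excluded questions are dropped from both the numerator and the denominator; the page does not state which convention the harness uses, and the names `Verdict` and `strict_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    question_id: str
    judged_correct: bool  # the judge's verdict for this question


def strict_score(verdicts, gold_error_ids):
    """Return (correct, total), skipping questions whose gold answer was audited as wrong.

    Assumption: excluded questions count toward neither the numerator
    nor the denominator.
    """
    scored = [v for v in verdicts if v.question_id not in gold_error_ids]
    correct = sum(v.judged_correct for v in scored)
    return correct, len(scored)
```

For example, with three verdicts of which one belongs to a question flagged as a gold error, `strict_score` reports the result over the remaining two questions only.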

Transparency

Full receipts

Every question links to its retrieval trace: queries issued, chunks returned with context headers, reasoning synthesis, and raw receipt JSON.
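As a rough illustration of what such a trace might hold, the sketch below models a receipt with the four parts named above and derives a content digest so that any later change is detectable. The field names, the `RunReceipt` and `RetrievedChunk` types, and the use of SHA-256 are all assumptions for illustration; the actual receipt schema is not documented on this page.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    chunk_id: str
    context_header: str  # enrichment header attached to the chunk
    text: str


@dataclass(frozen=True)
class RunReceipt:
    question_id: str
    queries: tuple    # queries issued for this question
    chunks: tuple     # RetrievedChunk items returned by retrieval
    reasoning: str    # reasoning synthesis

    def to_json(self) -> str:
        """Serialize with sorted keys so equal receipts give equal JSON."""
        return json.dumps(asdict(self), sort_keys=True)

    def digest(self) -> str:
        """SHA-256 of the canonical JSON; changes if any field changes."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()
```

Freezing the dataclasses blocks attribute reassignment after construction, and the digest gives a reader a way to confirm that the raw JSON they inspect matches what was recorded.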

Public Surface

Move from browsing to verification

The benchmark browser is read-only. The verifier flow, passport decode, and methodology notes live on separate routes, so the public surface stays inspectable and the harness contract stays in view.

Methodology

Read the scoring model

Pattern B, receipts, and snapshot pinning in one place.

Open methodology →

Passport

Inspect the verifier scope

See how the verification passport is scoped and why it is read-only.

Open passport →

Verify

Request the harness flow

Manual issuance, curl samples, and the public verification contract.

Open verify →