Imagine a student named Mira who never studied. She noticed that on past exams, “C” is the most common correct answer. So she writes C every time and scores 80%. She looks smart on paper. She has zero actual knowledge — she gamed the pattern.
A retrieval or ranking system can do the same thing. If the most popular document in a corpus happens to be relevant for most queries, a system that always returns that popular document will score well on aggregate metrics — without using the query at all. (This is not a hypothetical: see the CS01 NFCorpus case study where this exact predictor scores nDCG@10 = 0.066 on a published BEIR benchmark while ignoring every query.)
The published number looks great. It does not mean what you think it means.
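To make the Mira effect concrete, here is a minimal sketch on hypothetical toy data (the `gold` answer key and `constant_predictor` are illustrative and not taken from any benchmark). The predictor never reads its input, yet its accuracy equals the share of the most common answer.

```python
from collections import Counter

# Hypothetical answer key for 10 exam questions, skewed towards "C".
gold = ["C", "C", "A", "C", "C", "B", "C", "C", "C", "C"]


def constant_predictor(_question: str) -> str:
    """Ignore the question entirely and always answer "C"."""
    return "C"


predictions = [constant_predictor(f"question {i}") for i in range(len(gold))]

# Fraction of questions answered correctly.
accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Share of the single most frequent gold answer.
top_share = Counter(gold).most_common(1)[0][1] / len(gold)

print(f"accuracy = {accuracy:.2f}, top answer share = {top_share:.2f}")
```

Both numbers come out equal, which is the point: the score measures the skew of the answer key, not anything the predictor knows.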
falsify-eval is a Mira-check for retrieval and ranking systems. It compares your system’s score against four “fake students” — four null distributions, including one (Null D, the marginal-matched random) that is original to this work and that the previous standard nulls miss. If your system can’t beat all four by a calibrated margin, the gate fails.
→ Case studies (real numbers, two public benchmarks):
Across both: Mira and popularity-only fail at Δ_D ≈ 0; BM25 and dense MiniLM pass at Δ_D = +0.14 to +0.73. Each is reproducible in 5 minutes on an M1 laptop. Joint finding: graded metrics (nDCG) on dense-relevance benchmarks can flatten the gate, so pair them with single-gold strict metrics (recall@K against top-1).
pip install git+https://github.com/spalsh-spec/falsify-eval
python -c "from falsify_eval.demo import run; run()"
Three systems graded on a 50-query synthetic bench:
═══ constant_predictor (deliberately broken) ═══
real mean nDCG@5 = 0.20
Δ_A (gold-permuted) = +0.000 ✗
Δ_B (uniform random) = +0.001 ✗
Δ_C (random retrieval) = +0.18 ✓
Δ_D (marginal-matched) = +0.000 ✗ ← the gate that catches Mira
GATE: ✗ FAIL (correctly rejected)
═══ mock_engine (plausible retrieval, 70% top-1) ═══
real mean nDCG@5 = 0.62
Δ across all 4 nulls ≥ +0.40 ✓
GATE: ✓ PASS (correctly accepted)
═══ oracle (perfect top-1) ═══
real mean nDCG@5 = 1.00
GATE: ✓ PASS by maximum margin
%%{init: {'theme': 'base', 'themeVariables': {
'fontFamily': 'Garamond, EB Garamond, Georgia, serif',
'primaryColor': '#f3eee5',
'primaryTextColor': '#1c1611',
'primaryBorderColor': '#9c4a1a',
'lineColor': '#9d8147',
'tertiaryColor': '#faf6ed',
'tertiaryBorderColor': '#d4c8b2',
'edgeLabelBackground': '#f3eee5'
}}}%%
flowchart LR
R([your retriever]) -->|top-K per query| S[real score]
G([gold labels]) --> S
G -->|permute π| A[Null A · label-permuted]
G -->|iid uniform| B[Null B · uniform random]
P([item pool]) -->|sample K| C[Null C · random retrieval]
G -->|sample by class freq| D[Null D · marginal-matched ★]
S --> Δ{Δ ≥ τ on<br/>all four?}
A --> Δ
B --> Δ
C --> Δ
D --> Δ
Δ -->|yes| PASS([✓ PASS])
Δ -->|no| FAIL([✗ FAIL])
classDef ok fill:#eef3e8,stroke:#3d7a4a,color:#1a3d22,stroke-width:1.5px;
classDef fail fill:#f7e9e3,stroke:#9c4a1a,color:#5a1c0c,stroke-width:1.5px;
classDef novel fill:#fef9e7,stroke:#9d8147,color:#5a4720,stroke-width:2px;
classDef gate fill:#f3eee5,stroke:#1c1611,color:#1c1611,stroke-width:2px;
class PASS ok
class FAIL fail
class D novel
class Δ gate
| Null | What it tests | Catches |
|---|---|---|
| A — gold-permuted | bijection π over class labels | systems that learned label distribution shape, not relevance |
| B — uniform random | iid uniform draw of gold per query | systems that exploit class-prior assumption |
| C — random retrieval | replace engine output with K random items from pool | systems that score by retrieval coverage, not ranking quality |
| D — marginal-matched ★ | iid draw of gold from the empirical class frequency | predictors matched to the gold marginal — new in this work |
Null D is the load-bearing contribution. It correctly rejects the constant-most-frequent predictor, which Nulls A and B can wrongly let through as a false positive. (Definition 1 of the preprint.)
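The four nulls in the table can be sketched on a toy classification-as-retrieval problem. This is an illustration under stated assumptions, not the library's implementation: the helper names (`null_a` to `null_d`, `score`), the choice of Δ as "real score minus mean null score", the use of the class set as the item pool for Null C, and the trial count are all assumptions made for the example.

```python
import random
from statistics import mean

TAU = 0.05       # margin, mirroring the tau=0.05 used elsewhere in this README
N_TRIALS = 200   # hypothetical trial count for this toy run


def hit(pred, gold):
    """Strict single-gold metric: 1.0 on an exact match, else 0.0."""
    return 1.0 if pred == gold else 0.0


def score(preds, golds):
    """Mean per-query metric."""
    return sum(hit(p, g) for p, g in zip(preds, golds)) / len(golds)


def null_a(golds, classes, rng):
    """Label-permuted: relabel gold through one random bijection over classes."""
    shuffled = classes[:]
    rng.shuffle(shuffled)
    pi = dict(zip(classes, shuffled))
    return [pi[g] for g in golds]


def null_b(golds, classes, rng):
    """Uniform random: draw each gold label iid uniformly over classes."""
    return [rng.choice(classes) for _ in golds]


def null_c(preds, pool, rng):
    """Random retrieval: replace each prediction with a random pool item."""
    return [rng.choice(pool) for _ in preds]


def null_d(golds, rng):
    """Marginal-matched: draw each gold label iid from the empirical frequencies."""
    # Resampling the observed gold list with replacement samples by class frequency.
    return [rng.choice(golds) for _ in golds]


rng = random.Random(0)
classes = ["A", "B", "C", "D"]
golds = ["C"] * 800 + ["A"] * 70 + ["B"] * 70 + ["D"] * 60  # skewed gold marginal
preds = ["C"] * len(golds)                                   # constant most-frequent predictor

real = score(preds, golds)
deltas = {
    "A": real - mean([score(preds, null_a(golds, classes, rng)) for _ in range(N_TRIALS)]),
    "B": real - mean([score(preds, null_b(golds, classes, rng)) for _ in range(N_TRIALS)]),
    "C": real - mean([score(null_c(preds, classes, rng), golds) for _ in range(N_TRIALS)]),
    "D": real - mean([score(preds, null_d(golds, rng)) for _ in range(N_TRIALS)]),
}
gate_passes = all(d >= TAU for d in deltas.values())

print(f"real = {real:.2f}")
for name, d in deltas.items():
    print(f"delta_{name} = {d:+.3f}")
print("GATE:", "PASS" if gate_passes else "FAIL")
```

On this skewed toy data the constant predictor clears Nulls A, B and C by a wide margin, because each of them destroys the gold marginal. Only Null D keeps the marginal intact, so the predictor's advantage vanishes there and the gate fails.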
# 1. Library
from falsify_eval import four_null_gate
result = four_null_gate(
    retrieved_lists, gold_list, rel_list, my_metric,
    item_pool=corpus_ids, k=5, n_trials=50, tau=0.05,
    progress=True,  # stderr per-stage timing
)
assert result["gate_passes"]
# 2. CLI on JSONL benches — no Python knowledge needed
falsify-eval grade --input bench.jsonl --metric ndcg@5 --pool corpus.txt
falsify-eval doctor # end-to-end install verification
falsify-eval quickstart ./demo # writes a sample bench + pool
// 3. MCP server — Claude Code, Cursor, any MCP-compatible client
{
  "mcpServers": {
    "falsify-eval": {
      "command": "python",
      "args": ["-m", "falsify_eval.mcp_server"]
    }
  }
}
Claude can then call grade_retrieval directly on any retrieval pipeline output you give it — no glue code, no separate scoring service.
A non-exhaustive list of failure modes the gate flags:
| Broken predictor | Δ_A | Δ_B | Δ_C | Δ_D | Gate |
|---|---|---|---|---|---|
| Constant most-frequent class | ≈ 0 | ≈ 0 | + | ≈ 0 | ✗ |
| Marginal-matched random | ≈ 0 | + | + | ≈ 0 | ✗ |
| Popularity-only ranker (no query feature) | + | + | + | small | ✗ |
| Lexical-match-only on bag-of-words | + | + | + | + | ✓ |
| Full retriever (BM25 / dense / hybrid) | + | + | + | + | ✓ |
| Full retriever on drifted corpus | varies | varies | varies | varies | ✗ via verify_state |
The first three score well on bare aggregate metrics (nDCG, MRR, recall@K). The standard reporting practice publishes those numbers. The four-null gate rejects them.
A passing gate is necessary for credible reporting, not sufficient. It does not prove:
bootstrap_ci, paired_permutation_p, cohens_d_paired)

The library is calibrated for retrieval and ranking evaluation: search, recommendation top-K, RAG retrieval-side, classification-as-retrieval. It is not yet generalised to LLM free-text generation, summarisation, or open-ended QA. Those domains need their own null distributions and are planned for v0.3+.
from falsify_eval import four_null_gate
# Replace this with whatever your retriever returns. The library doesn't
# care if it's BM25, FAISS, Pinecone, Weaviate, Vespa, or a homegrown
# bag-of-words. It grades the OUTPUT, not the engine.
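def my_rag_retriever(query: str) -> list[str]:
    """Return top-K document IDs for a query."""
    ...

# `queries`, `gold`, and `pool` come from your bench: the query strings,
# one gold document ID per query, and the full list of candidate IDs.
retrieved = [my_rag_retriever(q) for q in queries]

def recall_at_5(r, g, _rel):
    # 1.0 if the gold ID appears in the top 5, else 0.0
    return 1.0 if g in r[:5] else 0.0

res = four_null_gate(
    retrieved, gold, [3]*len(gold), recall_at_5,
    item_pool=pool, k=5, n_trials=100, tau=0.05, seed=2026,
)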
print("GATE:", "PASS" if res["gate_passes"] else "FAIL", res["deltas"])
A complete Claude-API worked example with a 50-query bench is in examples/llm_rag_validation.py. To adapt it to GPT-4 / Llama / Mistral / Gemini: swap the API call inside my_rag_retriever. The gate is identical.
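Other per-query metrics can follow the same shape as `recall_at_5`. The sketch below assumes, based only on that example, that a metric is called as `(retrieved, gold, rel)` with a single gold ID and a graded relevance value; the function names are illustrative and not part of the library.

```python
import math


def reciprocal_rank(retrieved, gold, _rel):
    """1/rank of the gold ID in the retrieved list, or 0.0 if it is absent."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id == gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_5(retrieved, gold, rel):
    """nDCG@5 for one gold ID with graded relevance `rel`.

    With a single relevant item the ideal ranking puts it first, so the
    gain term (2**rel - 1) cancels and nDCG reduces to 1 / log2(rank + 1).
    """
    if rel <= 0:
        return 0.0
    for rank, doc_id in enumerate(retrieved[:5], start=1):
        if doc_id == gold:
            return 1.0 / math.log2(rank + 1)
    return 0.0


print(reciprocal_rank(["a", "b", "c"], "b", 3))  # 0.5
print(ndcg_at_5(["b", "a", "c"], "b", 3))        # 1.0
```

Mean reciprocal rank is then the average of `reciprocal_rank` over queries. Running a strict metric such as `recall_at_5` next to a graded one such as `ndcg_at_5` follows the case-study advice to pair the two.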
The gate calls your metric_fn exactly N × (1 + 4 × n_trials) times.
| Metric cost / call | N=500, n_trials=50 |
|---|---|
| In-memory check (~1 µs) | 0.1 s |
| Embedding lookup (~1 ms) | 1.7 min |
| LLM-judge call (~200 ms) | ~5.6 hours |
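The table rows follow directly from the call-count formula. A small sketch of the arithmetic (the helper names are illustrative and the per-call costs are the rough figures from the table):

```python
def metric_calls(n_queries: int, n_trials: int, n_nulls: int = 4) -> int:
    """One real pass plus n_nulls * n_trials null passes, each over every query."""
    return n_queries * (1 + n_nulls * n_trials)


def estimated_seconds(n_queries: int, n_trials: int, seconds_per_call: float) -> float:
    """Total metric time, ignoring the gate's own overhead."""
    return metric_calls(n_queries, n_trials) * seconds_per_call


calls = metric_calls(500, 50)
print(calls)  # 100500

for label, cost in [("in-memory", 1e-6), ("embedding lookup", 1e-3), ("LLM judge", 0.2)]:
    secs = estimated_seconds(500, 50, cost)
    print(f"{label}: {secs:.1f} s = {secs / 60:.1f} min = {secs / 3600:.2f} h")
```

Plugging in your own N, `n_trials` and measured per-call cost gives a quick estimate before committing to a long run.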
If your run is taking hours, your metric is the bottleneck — not the gate (which finishes N=5,000 × pool=100k × n_trials=50 in under 2 seconds with a fast metric). Pass progress=True to see per-stage timing on stderr. Three options to speed up: (1) drop n_trials from 50 → 20 — statistically defensible; (2) cache metric_fn calls; (3) parallelise the four nulls with multiprocessing — pure CPU, no shared state.
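Option (2), caching, can be sketched with the standard library. This assumes the metric is deterministic, that it is called as `(retrieved, gold, rel)` as in the `recall_at_5` example, and that `gold` and `rel` are hashable; `cached_metric` and `slow_recall_at_5` are hypothetical names for illustration. A non-deterministic LLM judge would need a different strategy.

```python
from functools import lru_cache


def cached_metric(metric_fn):
    """Wrap a (retrieved, gold, rel) metric so repeated inputs are computed once."""

    @lru_cache(maxsize=None)
    def _cached(retrieved_key, gold, rel):
        return metric_fn(list(retrieved_key), gold, rel)

    def wrapper(retrieved, gold, rel):
        # Lists are unhashable, so key the cache on a tuple copy.
        return _cached(tuple(retrieved), gold, rel)

    wrapper.cache_info = _cached.cache_info
    return wrapper


# Hypothetical slow metric standing in for an embedding lookup or LLM judge.
call_log = []


def slow_recall_at_5(retrieved, gold, _rel):
    call_log.append(gold)  # record each real (uncached) evaluation
    return 1.0 if gold in retrieved[:5] else 0.0


fast_recall_at_5 = cached_metric(slow_recall_at_5)
first = fast_recall_at_5(["d1", "d2", "d3"], "d2", 3)
second = fast_recall_at_5(["d1", "d2", "d3"], "d2", 3)  # served from the cache
print(first, second, len(call_log), fast_recall_at_5.cache_info().hits)
```

The wrapped function keeps the same signature, so it could be passed wherever the original metric is expected. How much it helps depends on how often the nulls reproduce the same `(retrieved, gold, rel)` triple.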
| Capability | DVC | MLflow | W&B | Ragas | TruLens | falsify-eval |
|---|---|---|---|---|---|---|
| Vendor-free | ✓ | ✓ | ✗ | ✓ | partial | ✓ |
| Pure-text human-readable lock | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Couples artifact hash + verified score | ✗ | ✗ | partial | ✗ | partial | ✓ |
| Falsification gate (CI-enforceable) | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Marginal-matched null ★ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Positive-control self-validation | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
The tools above solve different problems (versioning, tracking, observability). They complement falsify-eval; they don’t replace it.
- PREPRINT.md: Calibrated Falsification Harnesses for Retrieval Evaluation (v7, with N=10,000 validation, broken-predictor suite, sensitivity grid, soundness proposition).
- SUPPLEMENTARY.md: extended tables, ablations, bench-size calibration curve.

Submission to arXiv is pending. The DOI will be added to CITATION.cff on acceptance. In the interim, the markdown is the canonical source; both files are immutable for v0.1.0 (verifiable via lock_state against the v0.1.0 tag).
@article{sharma2026calibrated,
title = {Calibrated Falsification Harnesses for Retrieval Evaluation},
author = {Sharma, Sparsh},
year = {2026},
eprint = {<arxiv-id-when-published>},
archivePrefix = {arXiv},
primaryClass = {cs.IR}
}
Vāk-Kaṇaja is the Sanskrit / Pāṇinian retrieval engine built alongside falsify-eval. It is the first retriever (to my knowledge) adversarially verified by the four-null gate via cross-falsification, and the first to wire the 6 classical Pramāṇas of Nyāya / Mīmāṃsā into a retrieval engine as a router — detecting the query’s epistemological type (Pratyakṣa, Anumāna, Upamāna, Arthāpatti, Anupalabdhi, Śabda) and routing evidence channels accordingly.
It also implements an Anupalabdhi (non-perception) confidence floor: when the corpus does not contain the answer, the engine returns “corpus does not contain this knowledge” as a positive verdict and refuses to leak weak chunks. This pairs naturally with falsify-eval’s Null A: silent failure is the failure mode that load-bearing AI-safety arguments rely on assuming away.
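The general idea of a confidence floor can be sketched as follows. This is not Vāk-Kaṇaja's implementation: the `Verdict` type, the `apply_confidence_floor` helper, the score scale and the 0.35 threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass

NO_KNOWLEDGE = "corpus does not contain this knowledge"


@dataclass
class Verdict:
    answerable: bool
    chunks: list
    message: str = ""


def apply_confidence_floor(scored_chunks, floor=0.35):
    """Keep only chunks scoring at or above `floor`; abstain if none survive.

    scored_chunks: list of (chunk_id, score) pairs, higher score = stronger match.
    floor: hypothetical threshold; a real system would calibrate it on a bench.
    """
    strong = [(cid, s) for cid, s in scored_chunks if s >= floor]
    if not strong:
        # Abstain explicitly instead of returning weak evidence.
        return Verdict(answerable=False, chunks=[], message=NO_KNOWLEDGE)
    strong.sort(key=lambda pair: pair[1], reverse=True)
    return Verdict(answerable=True, chunks=strong)


print(apply_confidence_floor([("c1", 0.10), ("c2", 0.20)]).message)
print(apply_confidence_floor([("c1", 0.90), ("c2", 0.20), ("c3", 0.50)]).chunks)
```

The design choice is that abstention is a first-class return value rather than an empty list, so downstream code cannot mistake "nothing found" for "nothing attempted".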
The engine ships with a calibrated negative result: bench expansion N=21 → N=141 falsified the lift from the novel rerankers (Poincaré, topological persistence, fractal affinity), which now ship at production weight 0 and are documented as opt-in research components. The 3-channel φ-RRF baseline is the production default. This is the falsify-eval discipline applied to the authoring engine — same calibration that earned three clean rounds of adversarial review on this library.
Public release imminent at github.com/spalsh-spec/vak-kanaja, Apache 2.0, under the Bhardwaj & Sons brand. Priority announcement dated 2026-05-08.
- import falsify_eval before the package was installed and failed at the version-check step; now reads __version__ and pyproject.toml’s version directly via grep/sed so the tag, source files, and built artefact are cross-checked three ways without requiring an install.
- .github/workflows/publish.yml for OIDC trusted publishing to PyPI on every v* tag push; added tools/build_arxiv.sh for converting PREPRINT.md to an arXiv-submittable LaTeX bundle via pandoc; added [tool.mutmut] config + docs/MUTATION_TESTING.md documenting the deferred status; added [project.optional-dependencies] dev bucket pinning mutmut, build, and twine.
- case_studies/cs03_aikosh_rag/) for the AIKosh internal RAG integration (Jasmeet Singh, in flight); added Tested-platforms log to README; renumbered v0.2 case studies (CS03 = AIKosh, CS04 = FiQA, CS05 = Quora).
- real_mean are exactly equivariant under arbitrary bijections.
- hypothesis>=6.0 as a test dep so CI installs it. (Caught by CI matrix the moment v0.1.6.6 landed.)
- --input my-bench\bench.jsonl is copy-pasted into zsh and the backslash gets eaten, the CLI now suggests the corrected forward-slash path instead of a bare FileNotFoundError.
- UnicodeEncodeError on the Δ glyph): reconfigure stdout to UTF-8 with errors='replace' at CLI entry, with auto-fallback to ASCII glyphs (Δ→d, τ→tau, ✓→[ok]) when the post-reconfigure stream still can’t encode them. Also --ascii flag and FALSIFY_ASCII=1 env var.
- _validate_inputs.
- --input - now reads from stdin (was FileNotFoundError: '-').
- progress=True flag to four_null_gate after Mayank’s 5-hour AIKosh silent-run incident.
- null_a defect class for tuple / dataclass labels.
- CHANGELOG.md.
- case_studies/cs03_aikosh_rag/), CS04 (FiQA) and CS05 (Quora) for metric-sensitivity triangulation; broken-predictor zoo as a public artifact; label_order_seed parameter to break dependency on adversarial label ordering (see PREPRINT §5.9).

External-verification log.
Each entry is a real run by a real person who is not the package author, dated, with the exact version they ran. New entries go at the top.
| Date | Tester | OS | Python | Shell | Version | Notes |
|---|---|---|---|---|---|---|
| 2026-05-08 | Jasmeet Singh (AIKosh) | Windows 10 (19045) | 3.14.3 | PowerShell | 0.1.6.7 | install / upgrade 0.1.6.2→0.1.6.7 / doctor / quickstart / grade all clean; original cp1252 defect closed. CS03 integration with AIKosh’s internal RAG retriever in flight. |
| 2026-05-07 | Mayank Singh | macOS 14 (M1) | 3.12 | zsh | 0.1.5 → 0.1.6.2 | adversarial 14-defect battery; all closed. |
Issues and PRs welcome. The reference implementation is intentionally minimal; the goal is for the protocol to be small enough that adopters audit the entire library before depending on it.