A small open-source library that catches a class of false positive that the standard way of grading retrieval systems silently accepts. This page exists so a sharp critic can decide in ten minutes whether to engage.
If you build a retrieval system (a search engine, a recommender, the retriever inside a RAG pipeline), you usually grade it with one number: nDCG@10, recall@5, or similar. That number does not tell you whether your engine is using the query at all. A "predictor" that ignores every query and always returns the most frequent class can score above a sensible threshold on every metric in standard use. That is the failure mode falsify-eval catches.
The library runs your engine's output against four hypothetical baselines (we call them nulls) and reports which baselines your engine beats by enough to matter. The fourth baseline is the contribution: a marginal-matched random predictor that simulates "the system that exploits gold-label distribution without using the query." If your engine doesn't beat that one, it's not really retrieving; it's pattern-matching the test.
Imagine a corpus where 80% of the gold answers are "Document A" and 20% are split across "Documents B–F." A predictor that always returns "Document A" without reading the query will score around 0.80 recall@1, well above the noise floor of any benchmark. It is not retrieving. It is exploiting that the gold-label distribution is skewed.
This sounds contrived. It is not. Real-world retrieval benchmarks routinely have skewed gold distributions because some documents are queried more often than others, some categories dominate the corpus, and some questions have many equivalent answers. The published numbers look impressive against a uniform-random baseline. They are sometimes barely above the marginal-matched baseline, meaning the engine is doing little more than knowing the popularity distribution.
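The skewed-corpus example above can be checked numerically. The following is a minimal sketch, not part of falsify-eval: the corpus, the `recall_at_1` helper, and the 10,000-query size are all assumptions made for illustration.

```python
import random

# Hypothetical corpus matching the example above: 80% of gold answers are
# "A", the remaining 20% split evenly across "B".."F".
gold = ["A"] * 8000 + [d for d in "BCDEF" for _ in range(400)]

def recall_at_1(preds, gold):
    """Fraction of queries whose single returned document is the gold one."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

rng = random.Random(0)
constant_preds = ["A"] * len(gold)                    # never reads the query
uniform_preds = [rng.choice("ABCDEF") for _ in gold]  # uniform-random baseline

constant_score = recall_at_1(constant_preds, gold)
uniform_score = recall_at_1(uniform_preds, gold)
print(f"constant: {constant_score:.3f}  uniform-random: {uniform_score:.3f}")
```

The constant predictor lands at 0.80 by construction, while the uniform-random baseline sits near 1/6. Measured only against the uniform baseline, a predictor that never reads the query looks like a strong system.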
Three nulls in standard use will fail to detect this:
The fourth null fixes it.
The gate runs your engine's output and four null-distribution outputs through the same metric. Your engine "passes" a null if its score exceeds that null's score by at least τ (default 0.05). The gate as a whole passes only if your engine passes all four nulls.
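The pass rule can be written in a few lines. This is a sketch of the decision logic as described above, not the library's actual API: the function name `four_null_gate`, its signature, and the example scores are hypothetical.

```python
def four_null_gate(real_score, null_scores, tau=0.05):
    """Return (gate_passed, deltas) for one engine score and its null scores.

    null_scores maps a null's name ("A".."D") to the score that null achieved
    under the same metric. The engine passes a null when its margin over that
    null is at least tau; the gate passes only when every null is passed.
    """
    deltas = {name: real_score - score for name, score in null_scores.items()}
    gate_passed = all(delta >= tau for delta in deltas.values())
    return gate_passed, deltas

# Illustrative numbers only, not output from the library.
null_scores = {"A": 0.15, "B": 0.15, "C": 0.15, "D": 0.34}
passed, deltas = four_null_gate(0.343, null_scores)
print(passed, {name: round(d, 3) for name, d in deltas.items()})
```

With these made-up scores the engine clears Nulls A to C comfortably but has almost no margin over Null D, so the gate as a whole fails.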
examples/synthetic_demo.py). Δ values are the per-null deltas after 100 trials. Real numbers from your own bench will look similar in shape but different in magnitude.
Null D: gold-marginal-matched random. For each query, draw a gold label from the empirical frequency distribution of gold labels in the bench, then score the engine's output against that drawn gold. Repeat for n_trials and average. A predictor that exploits the marginal distribution will produce a Δ_D near zero, which is exactly the case this null is meant to flag.
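The construction described above can be sketched with the standard library alone. This is an illustration written from the prose description, not the library's implementation: `null_d_score`, `delta_d`, and the toy `hit_at_1` metric are hypothetical names, and the bench is synthetic.

```python
import random
from collections import Counter

def hit_at_1(ranked, gold):
    """Toy metric: 1.0 if the top-ranked item is the gold label, else 0.0."""
    return 1.0 if ranked and ranked[0] == gold else 0.0

def null_d_score(outputs, golds, metric, n_trials=100, seed=0):
    """Mean metric of the engine's outputs against gold labels redrawn
    from the empirical gold-label frequencies, independent of the query."""
    rng = random.Random(seed)
    counts = Counter(golds)
    labels, weights = zip(*counts.items())
    trial_means = []
    for _ in range(n_trials):
        drawn = rng.choices(labels, weights=weights, k=len(outputs))
        scores = [metric(out, g) for out, g in zip(outputs, drawn)]
        trial_means.append(sum(scores) / len(scores))
    return sum(trial_means) / n_trials

def delta_d(outputs, golds, metric, **kwargs):
    """Engine score on the true golds minus its score under Null D."""
    real = sum(metric(o, g) for o, g in zip(outputs, golds)) / len(golds)
    return real - null_d_score(outputs, golds, metric, **kwargs)

# Synthetic bench: 80% of gold labels are "A", the rest split across B..F.
golds = ["A"] * 800 + [d for d in "BCDEF" for _ in range(40)]
constant_outputs = [["A"] for _ in golds]  # ignores the query
oracle_outputs = [[g] for g in golds]      # always right

delta_constant = delta_d(constant_outputs, golds, hit_at_1)
delta_oracle = delta_d(oracle_outputs, golds, hit_at_1)
print(f"constant: {delta_constant:+.3f}  oracle: {delta_oracle:+.3f}")
```

The constant predictor scores about the same against redrawn golds as against the true ones, so its Δ_D collapses toward zero. The oracle loses most of its score once the golds are decoupled from the queries, which leaves it a large positive margin.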
The methodology was developed against and validated on Vāk-Kaṇaja, a private retrieval engine over a 6,309-chunk Sanskrit and Tamil literary corpus. The bench: 10,000 queries, top-K = 10, metric nDCG@10. Here are the actual numbers from the v0.1 release validation:
| System | real mean | Δ_A | Δ_B | Δ_C | Δ_D | verdict |
|---|---|---|---|---|---|---|
| Anti-oracle (returns wrong) | 0.000 | −0.25 | −0.25 | −0.25 | −0.25 | FAIL (correctly) |
| Constant predictor | 0.180 | +0.09 | +0.10 | −0.07 | +0.00 | FAIL (Null C, D) |
| Marginal-rank (top-K freq) | 0.343 | +0.19 | +0.19 | +0.19 | +0.00 | FAIL (Null D only) |
| Vāk-Kaṇaja v0.1 | 0.789 | +0.55 | +0.55 | +0.55 | +0.52 | PASS |
| Oracle | 1.000 | +0.75 | +0.75 | +0.75 | +0.74 | PASS |
The marginal-rank row is the gate's contribution: a predictor that looks good on Nulls A, B, C (Δ ≈ +0.19) but is unmasked by Null D (Δ ≈ +0.00). Without the fourth null, you would conclude marginal-rank is a real retrieval system. It is not. It is the gold-distribution rank order, returned without reading the query.
This isn't a hypothetical. We discovered our own first published number had this exact shape and retracted it publicly. That retraction is what gave the validated numbers their weight.
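For concreteness, a marginal-rank predictor of the kind shown in the table can be sketched in a few lines. The function name and the tiny label list are hypothetical; the point is only that the returned ranking is fixed before any query arrives.

```python
from collections import Counter

def make_marginal_rank_predictor(gold_labels, k=10):
    """Build a predictor that returns the k most frequent gold labels,
    in frequency order, for every query."""
    ranking = [label for label, _ in Counter(gold_labels).most_common(k)]

    def predict(query):
        # The query is deliberately ignored.
        return list(ranking)

    return predict

predict = make_marginal_rank_predictor(["A"] * 8 + ["B"] * 2 + ["C"], k=10)
print(predict("any query at all"))
```

Any rank-aware metric will reward this predictor whenever the gold distribution is skewed, because frequent labels sit at the top of every result list.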
Five honest non-claims:
falsify_eval.stats.STRESS_TEST_LADDER.md Tier 5d.
Three honest failures from the development of this work, in chronological order:
Multifractal-detrended-fluctuation features (MFDFA) over the Sanskrit corpus were a hypothesised reranking signal. After a stratified sweep at N=10,000 the result was a statistically significant small negative Δ at the optimal weight. The methodology let us catch this rather than ship it. The feature was disabled in production. Documented in the v6 preprint §5.8.c.
The first published validation numbers were inflated because of a Python closure-capture bug: default arguments were being evaluated at module-import time, so some configurations were silently shadowed by older values. The fix produced a +0.0154 lift at the corrected weight, which was the first validated positive lift. Public retraction of the pre-fix numbers came with the fix. This is what every methodology of this type promises and few deliver.
An external user in India, Akosh, tried to follow the README's install instructions and reported "failed basic bench tests." Reproducing on a fresh clone revealed three bugs in the README: pip install falsify-eval referenced an unpublished PyPI package, the demo command used the wrong import path, and the Quick-demo URL had a literal <your-handle> placeholder. The library code was fine. v0.1.2 shipped 45 minutes after the report. Akosh is credited in the changelog.
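The general pitfall is easy to reproduce. The snippet below is a hypothetical illustration of how a default argument freezes a configuration value when the function is defined; it is not the actual code that contained the bug, and `RERANK_WEIGHT` and both functions are invented names.

```python
RERANK_WEIGHT = 0.1  # value present when the module is first imported

def rerank_score(base, weight=RERANK_WEIGHT):
    # The default is evaluated once, when this `def` statement runs.
    return base * weight

RERANK_WEIGHT = 0.5  # a later configuration change

def rerank_score_fixed(base, weight=None):
    # Resolve the default at call time, so the current value is used.
    if weight is None:
        weight = RERANK_WEIGHT
    return base * weight

stale = rerank_score(2.0)        # still uses 0.1
fresh = rerank_score_fixed(2.0)  # uses 0.5
print(stale, fresh)
```

A sweep that updates the module-level value and then calls the first function would report results for the old weight under the new weight's label, which is the kind of silent shadowing described above.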
The original methodology paper offered a $2,000 cash bounty for any of three classes of counterexample. After the install-path bugs surfaced, I suspended the bounty pending broader internal validation. The suspension is documented in the README, CONTRIBUTING, SECURITY, and the preprint section 10. Suspending a public reward is the calibrated move when the implementation isn't yet hardened to the level the offer implied.
The four-null gate is not the first attempt to discipline retrieval evaluation. It builds on three lines of work:
The contribution of falsify-eval is narrow and specific: the marginal-matched null (Null D) and its packaging as a four-test gate that any retrieval evaluator can run in ~30 seconds. Everything else is lineage.
You should. The library is designed to drop into any existing pipeline because it is metric-agnostic, engine-agnostic, and corpus-agnostic. The MCP server (python -m falsify_eval.mcp_server) means Claude Code or any MCP client can call it on any retriever's output, including those wrapped by BEIR/MTEB/OLMES.
What this library is, specifically, is the missing fourth null. If you already use BEIR or MTEB or OLMES, the right move is to add the four-null gate as a per-system pre-flight check before publishing your numbers. It takes one Python import.
It is not a competitor to existing suites. It is the verification gate they don't currently include.
| When | What |
|---|---|
| 2026-02 | Vāk-Kaṇaja retrieval engine drafted; first naive nDCG numbers published internally |
| 2026-03 | Suspicion that nDCG was misleading. First three nulls (A/B/C) drafted |
| 2026-03 (mid) | Constant predictor passed three-null gate. The gap that became Null D. |
| 2026-04 (early) | Closure-capture bug found; pre-fix numbers retracted publicly |
| 2026-04 (mid) | N=21 → N=141 → N=10,000 bench expansion. UNDER-NS verdict confirmed. |
| 2026-04 (late) | Industrial v6 release, broken-predictor suite, sensitivity grid, bench-size curve, isolated baselines, all PASS |
| 2026-05-01 | v0.1.0 → public GitHub release as Apache 2.0 standalone library |
| 2026-05-04 | Akosh report → v0.1.2 fix → v0.1.3 input validation + CLI + MCP → v0.1.4 doctor + LLM-RAG example. Current. |
No. Bootstrap CIs and permutation tests exist in scipy and statsmodels. The contribution is the specific construction of the marginal-matched null and the packaging of all four nulls as a single PASS/FAIL gate that runs in 30 seconds without statistical expertise on the user's part. The math is undergraduate. The discipline is the product.
Sometimes, but with much more effort and no guarantee. Diversifying benchmarks (BEIR's approach) catches some failures the four-null gate misses. The four-null gate catches some failures benchmark-diversification misses. The two are complementary.
Because the library is small enough to audit in an afternoon (~1,000 lines of pure Python + numpy), the math is in a public preprint, the implementation is Apache-2.0, and every published claim has a one-command reproduction path. Trust the verifiability, not the person.
As of 2026-05-04, no one. Akosh in India found install-path bugs; that's the only external interaction so far. Independent validation by an external lab is the highest-priority pending milestone, not a finished one.
Pricing for commercial audit work is published at bhardwajandsons.com/#services with explicit promotion criteria (e.g. external validation lifts the price 20%). Zero clients today. The library is free. The methodology is in a public preprint. The brand exists in part to protect the methodology from the pressure to inflate. If any specific claim on this page is unverifiable, name it and I will retract it.
git clone https://github.com/spalsh-spec/falsify-eval.git
cd falsify-eval
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
falsify-eval doctor
Expected output: "All systems green" with the embedded synthetic bench passing all four nulls. If you see anything else, that is a bug; file an issue and I will fix it within 24 hours per the incident-response runbook.
To grade your own pipeline:
falsify-eval grade --input bench.jsonl --pool labels.txt --metric ndcg@5