falsify-eval, what, how, why · Bhardwaj & Sons

1 · What this is, in thirty seconds.

If you build a retrieval system (a search engine, a recommender, the retriever inside a RAG pipeline), you usually grade it with one number: nDCG@10, recall@5, or similar. That number does not tell you whether your engine is using the query at all. A "predictor" that ignores every query and always returns the most frequent class can score above a sensible-looking threshold on every metric in standard use. That is the failure mode falsify-eval catches.

The library runs your engine's output against four hypothetical baselines (we call them nulls) and reports which baselines your engine beats by enough to matter. The fourth baseline is the contribution: a marginal-matched random predictor that simulates "the system that exploits gold-label distribution without using the query." If your engine doesn't beat that one, it is not really retrieving; it is pattern-matching the test.

"Most retrieval-system papers report a single aggregate metric and call it a contribution. Three failure modes make this practice unsafe at any benchmark size, and dangerous on small ones. This library closes one of them."

2 · The hidden failure: why a high benchmark score can mean nothing.

Imagine a corpus where 80% of the gold answers are "Document A" and 20% are split across "Documents B–F." A predictor that always returns "Document A" without reading the query will score around 0.80 recall@1, well above the noise floor of any benchmark. It is not retrieving. It is exploiting that the gold-label distribution is skewed.
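The arithmetic of that example can be checked in a few lines. This is a minimal sketch, not library code: the 100-query bench and the helper name recall_at_1 are assumptions made for illustration, and the 20% tail is split evenly across B through F.

```python
from collections import Counter

# Hypothetical gold labels for 100 queries: 80 point at "A", and the
# remaining 20 are split evenly across "B" through "F" (4 each).
gold = ["A"] * 80 + [label for label in "BCDEF" for _ in range(4)]

def recall_at_1(predictions, gold_labels):
    """Fraction of queries whose single returned item equals the gold label."""
    hits = sum(p == g for p, g in zip(predictions, gold_labels))
    return hits / len(gold_labels)

# The "predictor" never looks at a query: it returns the most frequent label.
most_common_label, _ = Counter(gold).most_common(1)[0]
constant_predictions = [most_common_label] * len(gold)

print(most_common_label)                        # A
print(recall_at_1(constant_predictions, gold))  # 0.8
```

Nothing in the score reveals that no query was ever read; the 0.8 is the share of "A" in the gold labels and nothing more.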

This sounds contrived. It is not. Real-world retrieval benchmarks routinely have skewed gold distributions because some documents are queried more often than others, some categories dominate the corpus, and some questions have many equivalent answers. The published numbers look impressive against a uniform-random baseline. They are sometimes barely above the marginal-matched baseline, meaning the engine is doing little more than knowing the popularity distribution.

Three nulls in standard use will fail to detect this:

Permuted gold (Null A): swap the gold labels around at random. A constant predictor still beats this null because permuting doesn't change which label is most common.
Uniform-random gold (Null B): draw gold from a uniform distribution over labels. A constant predictor crushes this null because the constant is matched to the empirical (skewed) distribution, not the uniform one.
Random retrieval (Null C): ignore the engine and return K random items. A constant predictor that returns the most-frequent class beats this null trivially because most queries' golds are that class.

The fourth null fixes it.

3 · The four-null gate, with an interactive demo.

The gate runs your engine's output and four null-distribution outputs through the same metric. Your engine "passes" each null if its score exceeds the null's score by at least τ (default 0.05). The gate as a whole passes only if all four nulls pass.
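The pass rule above reduces to a few lines of logic. The sketch below is an assumption about how such a check could be written, not the library's API: the function name four_null_gate and the null scores are hypothetical, and "exceeds by at least τ" is read as Δ ≥ τ.

```python
TAU = 0.05  # default threshold quoted in the text

def four_null_gate(real_score, null_scores, tau=TAU):
    """Hypothetical helper: compare one real score with the null scores.

    Returns (passed, deltas, failing), where deltas maps each null name to
    real_score - null_score and failing lists the nulls with delta < tau.
    """
    deltas = {name: real_score - score for name, score in null_scores.items()}
    failing = sorted(name for name, delta in deltas.items() if delta < tau)
    return len(failing) == 0, deltas, failing

# Illustrative numbers only: an engine that clears Nulls A, B, C by a wide
# margin but sits almost on top of Null D.
illustrative_nulls = {"A": 0.20, "B": 0.15, "C": 0.18, "D": 0.68}
passed, deltas, failing = four_null_gate(0.70, illustrative_nulls)

print(passed, failing)  # False ['D']
```

Because the gate is a conjunction, one weak Δ is enough to fail it, which is why a predictor that looks strong against three nulls can still be rejected.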

[Interactive demo: pick a predictor type; the bars show how each predictor scores against each null. Watch Null D in particular. Captured state of the widget: real predictor mean nDCG@5 = 1.000; all four Δ values exceed the threshold τ = 0.05; GATE: ✓ PASS.]

This demo uses simulated values calibrated to a 50-query, 12-label synthetic bench (the same one in examples/synthetic_demo.py). Δ values are the per-null deltas after 100 trials. Real numbers from your own bench will look similar in shape but different in magnitude.

The fourth null in formal terms

Null D, gold-marginal-matched random. For each query, draw the gold label from the empirical frequency distribution of gold labels in the bench, then score the engine's output against that drawn gold. Repeat for n_trials draws and average the scores. A predictor that exploits the marginal distribution will produce a Δ_D near zero, which is exactly the case the gate is meant to flag.
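The definition of Null D can be sketched directly with numpy. This is an illustration under stated assumptions, not the library's implementation: the function name null_d_delta is hypothetical, the metric is simplified to top-1 accuracy instead of nDCG, and the bench is a made-up 100-query set with an 80/20 label skew.

```python
import numpy as np

def null_d_delta(predictions, gold, n_trials=200, seed=0):
    """Hypothetical sketch of Delta_D using top-1 accuracy as the metric."""
    rng = np.random.default_rng(seed)
    predictions = np.asarray(predictions)
    gold = np.asarray(gold)

    # Empirical frequency distribution of the gold labels in the bench.
    labels, counts = np.unique(gold, return_counts=True)
    probs = counts / counts.sum()

    real_score = np.mean(predictions == gold)

    # Each trial redraws a gold label per query from that distribution and
    # scores the unchanged predictions against the drawn gold.
    null_scores = [
        np.mean(predictions == rng.choice(labels, size=gold.size, p=probs))
        for _ in range(n_trials)
    ]
    return float(real_score - np.mean(null_scores))

# Made-up bench: 80 queries with gold "A", 4 each for "B" through "F".
gold = np.array(["A"] * 80 + [label for label in "BCDEF" for _ in range(4)])
constant = np.full(gold.size, "A")  # ignores the query entirely
oracle = gold.copy()                # always returns the true gold label

delta_constant = null_d_delta(constant, gold)
delta_oracle = null_d_delta(oracle, gold)
print(f"constant predictor Delta_D: {delta_constant:+.3f}")
print(f"oracle Delta_D:             {delta_oracle:+.3f}")
```

In this toy setting the constant predictor's Δ_D lands close to zero, because redrawing gold from the marginal leaves its hit rate unchanged, while the oracle's Δ_D is close to 1 − Σp² = 0.352, since its predictions carry per-query information that a marginal draw destroys.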

4 · Worked example: a real engine on 10,000 queries.

The methodology was developed against and validated on Vāk-Kaṇaja, a private retrieval engine over a 6,309-chunk Sanskrit and Tamil literary corpus. The bench: 10,000 queries, top-K = 10, metric nDCG@10. Here are the actual numbers from the v0.1 release validation:

System	real mean	Δ_A	Δ_B	Δ_C	Δ_D	verdict
Anti-oracle (returns wrong)	0.000	−0.25	−0.25	−0.25	−0.25	FAIL (correctly)
Constant predictor	0.180	+0.09	+0.10	−0.07	+0.00	FAIL (Null C, D)
Marginal-rank (top-K freq)	0.343	+0.19	+0.19	+0.19	+0.00	FAIL (Null D only)
Vāk-Kaṇaja v0.1	0.789	+0.55	+0.55	+0.55	+0.52	PASS
Oracle	1.000	+0.75	+0.75	+0.75	+0.74	PASS

The marginal-rank row is the gate's contribution: a predictor that looks good on Nulls A, B, C (Δ ≈ +0.19) but is unmasked by Null D (Δ ≈ +0.00). Without the fourth null, you would conclude marginal-rank is a real retrieval system. It is not. It is the gold-distribution rank order, returned without reading the query.

This isn't a hypothetical. We discovered our own first published number had this exact shape and retracted it publicly. That retraction is what gave the validated numbers their weight.

5 · What this library does not do.

Five honest non-claims:

It does not grade LLM text generation. The four-null gate is calibrated for retrieval and ranking: search, top-K recommendation, the retrieval side of RAG, and classification-as-retrieval. Grading the LLM's generated answer requires different null distributions designed for free-text output. That is v0.3+ work.
It does not detect plagiarism, training-data leakage, or memorisation. Those are different problems requiring different methods.
It does not promise statistical significance with default settings. The default n_trials=50 gives noisy Δ estimates. For publishable claims use n_trials ≥ 200 and the bootstrap CI helpers in falsify_eval.stats.
It does not handle clustered queries correctly. Bootstrap CIs assume queries are i.i.d. If your bench has paraphrase clusters, near-duplicates, or temporally correlated queries, the CIs are anti-conservative (too narrow) by a factor of roughly √(cluster size). Known limit. Documented in STRESS_TEST_LADDER.md Tier 5d.
It does not replace your judgment about whether the bench itself is meaningful. Garbage-in still gives garbage verdicts. The gate detects whether your engine is gaming the bench; it cannot detect whether the bench is a faithful proxy for the task you care about.
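The n_trials point above is ordinary Monte Carlo error and can be illustrated without the library. The sketch below is a simulation under assumptions: per-trial null scores are modelled as binomial hit rates on a 50-query bench with an assumed hit probability of 0.25, and the function name spread_of_null_mean is hypothetical. It covers only the noise from a finite number of trials, not the query-sampling uncertainty that a bootstrap addresses.

```python
import numpy as np

rng = np.random.default_rng(0)
N_QUERIES = 50        # illustrative bench size
NULL_HIT_RATE = 0.25  # assumed per-query hit probability under a null

def spread_of_null_mean(n_trials, n_repeats=2000):
    """Std. dev. of the averaged null score across repeated gate runs."""
    # One row per simulated gate run, one column per null trial.
    hits = rng.binomial(N_QUERIES, NULL_HIT_RATE, size=(n_repeats, n_trials))
    trial_scores = hits / N_QUERIES
    return float(trial_scores.mean(axis=1).std())

spread_50 = spread_of_null_mean(50)
spread_200 = spread_of_null_mean(200)
print(f"n_trials=50:  spread of null mean = {spread_50:.4f}")
print(f"n_trials=200: spread of null mean = {spread_200:.4f}")
print(f"ratio = {spread_50 / spread_200:.2f}")
```

The spread shrinks with the square root of the trial count, so quadrupling n_trials from 50 to 200 should roughly halve the noise in the null mean and therefore in Δ.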

6 · What I tried that didn't work.

Four honest failures from the development of this work, in chronological order:

The fractal channel that turned out negative

Multifractal-detrended-fluctuation features (MFDFA) over the Sanskrit corpus were a hypothesised reranking signal. After a stratified sweep at N=10,000 the result was a statistically significant small negative Δ at the optimal weight. The methodology let us catch this rather than ship it. The feature was disabled in production. Documented in the v6 preprint §5.8.c.

The closure-capture defect

The first published validation numbers were inflated by a Python closure-capture bug: default arguments were being evaluated at module-import time, so some configurations were silently shadowed by older values. The fix produced a +0.0154 lift at the corrected weight, the first validated positive lift. The pre-fix numbers were retracted publicly alongside the fix. This is what every methodology of this type promises and few deliver.
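The general Python behaviour behind this class of bug is easy to reproduce. The snippet below is an illustration of the language mechanism only, not the project's actual code; WEIGHT and both function names are made up.

```python
WEIGHT = 0.1  # module-level setting as it stands when the module is imported

def score_default_bound(x, weight=WEIGHT):
    # The default `weight=WEIGHT` is evaluated once, when this `def` runs,
    # so the function keeps 0.1 even if WEIGHT is reassigned later.
    return x * weight

def score_late_bound(x, weight=None):
    # Looking the setting up inside the body reads the current value.
    if weight is None:
        weight = WEIGHT
    return x * weight

WEIGHT = 0.5  # configuration updated after import

print(score_default_bound(10))  # 1.0  (stale value silently used)
print(score_late_bound(10))     # 5.0  (current value used)
```

The stale call raises no error and returns a plausible number, which is why a sweep over configurations can silently report results for an older setting.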

The Akosh install bugs (2026-05-04)

An external user in India, Akosh, tried to follow the README's install instructions and reported "failed basic bench tests." Reproducing on a fresh clone revealed three bugs in the README: pip install falsify-eval referenced an unpublished PyPI package, the demo command used the wrong import path, and the Quick-demo URL had a literal <your-handle> placeholder. The library code was fine. v0.1.2 shipped 45 minutes after the report. Akosh credited in changelog.

The bug bounty I suspended

The original methodology paper offered a $2,000 cash bounty for any of three classes of counterexample. After the install-path bugs surfaced, I suspended the bounty pending broader internal validation. The suspension is documented in the README, CONTRIBUTING, SECURITY, and the preprint section 10. Suspending a public reward is the calibrated move when the implementation isn't yet hardened to the level the offer implied.

7 · Prior work, properly attributed.

The four-null gate is not the first attempt to discipline retrieval evaluation. It builds on five lines of work:

BEIR (Thakur, Reimers, Rücklé et al., 2021), heterogeneous benchmark suite for zero-shot retrieval. Establishes the practice of evaluating retrievers across many corpora, not one. Different from us: BEIR diversifies the bench; we diversify the null hypotheses.
MTEB (Muennighoff et al., 2022), the Massive Text Embedding Benchmark. Standardised evaluation across 8 task categories. Different from us: MTEB is a leaderboard infrastructure; we are a per-system audit.
OLMES (Gu et al., AI2, 2024), open evaluation standard with explicit reproducibility checks. Closest in spirit to our work. Different from us: OLMES standardises what to measure; we add which baselines to compare against, particularly Null D.
RAGAS (Es et al., 2023), reference-free RAG evaluation. Different from us: RAGAS evaluates the generation side; we evaluate the retrieval side.
Voorhees, "Variations in relevance judgments and the measurement of retrieval effectiveness" (1998), the original meta-evaluation paper from TREC, foundational to all of this.

The contribution of falsify-eval is narrow and specific: the marginal-matched null (Null D) and its packaging as a four-test gate that any retrieval evaluator can run in ~30 seconds. Everything else is lineage.

8 · "Why not just integrate it with every existing eval suite?"

You should. The library is designed to drop into any existing pipeline because it is metric-agnostic, engine-agnostic, and corpus-agnostic. The MCP server (python -m falsify_eval.mcp_server) means Claude Code or any MCP client can call it on any retriever's output, including those wrapped by BEIR/MTEB/OLMES.

What this library is, specifically, is the missing fourth null. If you already use BEIR or MTEB or OLMES, the right move is to add the four-null gate as a per-system pre-flight check before publishing your numbers. It takes one Python import.

It is not a competitor to existing suites. It is the verification gate they don't currently include.

9 · Timeline of progress and challenges.

When	What
2026-02	Vāk-Kaṇaja retrieval engine drafted; first naive nDCG numbers published internally
2026-03	Suspicion that nDCG was misleading. First three nulls (A/B/C) drafted
2026-03 (mid)	Constant predictor passed three-null gate. The gap that became Null D.
2026-04 (early)	Closure-capture bug found; pre-fix numbers retracted publicly
2026-04 (mid)	N=21 → N=141 → N=10,000 bench expansion. UNDER-NS verdict confirmed.
2026-04 (late)	Industrial v6 release, broken-predictor suite, sensitivity grid, bench-size curve, isolated baselines, all PASS
2026-05-01	v0.1.0 → public GitHub release as Apache 2.0 standalone library
2026-05-04	Akosh report → v0.1.2 fix → v0.1.3 input validation + CLI + MCP → v0.1.4 doctor + LLM-RAG example. Current.

10 · Open problems & what needs further research.

Generalisation to LLM text generation. The four nulls assume top-K retrieval output. Free-text generation needs different null constructions. Likely candidates: prompt-permuted, instruction-uniform, response-marginal-matched. v0.3+.
Clustered-query CIs. Bootstrap assumes i.i.d. queries. A correction for paraphrase clusters and temporal correlation needs to be designed and validated. Probably block-bootstrap.
Adaptive τ selection. The default τ=0.05 is a heuristic. A principled procedure for choosing τ as a function of bench size and label-set complexity is open.
Multi-class extensions. Currently the gate is calibrated for retrieval where each query has one gold answer. Multi-relevance grading (rel ∈ {0,1,2,3}) is supported but not deeply validated.
Independent third-party validation. The N=10,000 result has been replicated only by the original author. Independent replication by EleutherAI, METR, AI2, or HuggingFace's leaderboard team is the highest-value missing piece. Outreach to these groups is the active next step.
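For the clustered-query item above, one candidate correction is to resample whole clusters instead of individual queries. The sketch below is an assumption about what such a correction could look like, not a validated method and not part of the library: the function names, the synthetic per-query values, and the 50 clusters of 10 paraphrases are all made up for illustration.

```python
import numpy as np

def percentile_ci(means, alpha=0.05):
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def iid_bootstrap_ci(values, n_boot=1000, seed=0):
    """Standard bootstrap: resample individual queries with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    return percentile_ci(values[idx].mean(axis=1))

def cluster_bootstrap_ci(values, cluster_ids, n_boot=1000, seed=0):
    """Cluster bootstrap: resample whole clusters with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    groups = [values[cluster_ids == c] for c in np.unique(cluster_ids)]
    means = np.empty(n_boot)
    for b in range(n_boot):
        picked = rng.integers(0, len(groups), size=len(groups))
        means[b] = np.concatenate([groups[i] for i in picked]).mean()
    return percentile_ci(means)

# Synthetic per-query score differences: queries in the same paraphrase
# cluster share most of their value, plus a little within-cluster noise.
rng = np.random.default_rng(1)
n_clusters, cluster_size = 50, 10
cluster_effect = rng.normal(0.10, 0.20, size=n_clusters)
cluster_ids = np.repeat(np.arange(n_clusters), cluster_size)
values = cluster_effect[cluster_ids] + rng.normal(0.0, 0.02, size=cluster_ids.size)

iid_lo, iid_hi = iid_bootstrap_ci(values)
clu_lo, clu_hi = cluster_bootstrap_ci(values, cluster_ids)
print(f"i.i.d. CI width:  {iid_hi - iid_lo:.3f}")
print(f"cluster CI width: {clu_hi - clu_lo:.3f}")
```

With strongly correlated clusters like these, the cluster interval should come out several times wider than the i.i.d. one, in line with the √(cluster size) rule of thumb; whether this is the right correction for a real bench is the open question.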

11 · Five hard questions.

"Is this just a wrapper over standard statistical testing?"

No. Bootstrap CIs and permutation tests exist in scipy and statsmodels. The contribution is the specific construction of the marginal-matched null and the packaging of all four nulls as a single PASS/FAIL gate that runs in 30 seconds without statistical expertise on the user's part. The math is undergraduate. The discipline is the product.

"Can't I get the same result by running multiple existing benchmarks?"

Sometimes, but with much more effort and no guarantee. Diversifying benchmarks (BEIR's approach) catches some failures the four-null gate misses. The four-null gate catches some failures benchmark-diversification misses. The two are complementary.

"Why should I trust a one-person open-source project?"

Because the library is small enough to audit in an afternoon (~1,000 lines of pure Python + numpy), the math is in a public preprint, the implementation is Apache-2.0, and every published claim has a one-command reproduction path. Trust the verifiability, not the person.

"Who has independently validated this?"

As of 2026-05-04, no one. Akosh in India found install-path bugs; that's the only external interaction so far. Independent validation by an external lab is the highest-priority pending milestone, not a finished one.

"Is the author trying to get rich quick by selling false info?"

Pricing for commercial audit work is published at bhardwajandsons.com/#services with explicit promotion criteria (e.g. external validation lifts the price 20%). Zero clients today. The library is free. The methodology is in a public preprint. The brand exists in part to protect the methodology from the pressure to inflate. If any specific claim on this page is unverifiable, name it and I will retract it.

12 · Verify the claim in 30 seconds.

git clone https://github.com/spalsh-spec/falsify-eval.git
cd falsify-eval
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
falsify-eval doctor

Expected output: "All systems green" with the embedded synthetic bench passing all four nulls. If you see anything else, that is a bug; file an issue and I will fix it within 24 hours per the incident-response runbook.

To grade your own pipeline:

falsify-eval grade --input bench.jsonl --pool labels.txt --metric ndcg@5