falsify-eval

Calibrated Falsification Harnesses for Retrieval Evaluation

Methodology, Soundness, Self-Validation, and Bench-Expansion to N≈10,000

Sparsh Sharma Independent Researcher · sparshsharma219@gmail.com v7-10K · 2026-05-01

Abstract

Retrieval and ranking system claims are routinely reported as bare aggregate metrics, without (a) a calibrated null-distribution comparison robust to gold-marginal-matched predictors, (b) an integrity-check record (SHA-256 + git commit) binding the metric to the artefacts it was measured against, (c) explicit per-feature statistical significance with declared sample-size power, or (d) any positive-control validation of the harness itself. We argue this practice is unsafe at any benchmark size and dangerous on small ones, and we present a five-part falsification harness:

A four-null Δ-metric gate — shipped as falsify-eval v0.1.5+ on GitHub.
An integrity-check state lock with bench-side closed-loop suppression — shipped (lock_state / verify_state).
Statistical reporting requirements — bootstrap CIs, paired permutation tests, Cohen’s d, sample-size power calculator. Shipped. Bonferroni multiplicity correction is methodology-defined in §3.4 and added to the public library in v0.2.
Positive-control validation against deliberately-broken predictors — methodology-defined in §5.5 and now demonstrated empirically in the CS01 NFCorpus case study at case_studies/cs01_nfcorpus/.
Timestamped pre-registration — methodology spec only at this stage; tooling planned for v0.3.

The pattern is formalised as Definition 1 with explicit soundness conditions (Proposition 1) and a taxonomy of pathologies caught versus uncaught (Tables 1, 2, 3).

Scope honesty. The methodology, the gate, and every empirical result in this paper apply to retrieval and ranking systems: search, recommendation top-K, RAG retrieval-side, classification-as-retrieval. They do not apply to generative LLM free-text outputs (paragraph generation, summarisation, open-ended QA); those have different failure modes that the gate does not test. Extension to that domain is a separate research programme planned for v0.3+ and is not claimed here.

We illustrate on a small Sanskrit / Dravidian retrieval engine (6 309 chunks, 13 texts) under seven distinct empirical hardening experiments that progressively widen the empirical surface: module ablation (§5.1, N=21, 6 configs × 7 metrics); triple-null gate at N=21 (§5.2); positive-control discovery and Null D fix (§5.3); bench expansion N=21 → N=141 confirming the harness’s UNDER-NS verdict was predictively correct (§5.4); broken-predictor suite at N=141 spanning 7 systems from anti-oracle (mean = 0.000) to oracle (mean = 1.000), all classified correctly (§5.5); hyperparameter sensitivity grid (3 × 3, all PASS, §5.6); bench-size calibration curve at N ∈ {21,40,80,120,141} showing monotonic CI tightening (§5.7); and bench expansion to N=10,000 with a 300-query stratified sample plus standard-error projection (§5.8). The N=10,000 projection shows the per-feature reranker contribution would resolve to a statistically significant small NEGATIVE Δ (95% CI ≈ [−0.011, −0.005], p < 0.001 expected), with the negative pull driven primarily by the Rigveda subset where the reranker mechanism degrades cosine’s already-strong performance (§5.8.c).

The combination — strong PASS on the four-null gate by Δ ≥ +0.25 across two bench sizes, predictively-correct UNDER-NS verdict confirmed by 6.7× bench expansion, projected confirmation at 476× bench expansion, correct verdict on a 7-system positive-control suite, hyperparameter robustness, monotonic calibration-curve evolution, and the discovery of a per-text-class reranker mechanism that bench expansion exposed — is the methodology’s contribution. The harness is internally validated by its own protocol and externally validated by repeated bench expansion at increasing scales.

External-benchmark validation (added 2026-05-07, CS01 / NFCorpus, BEIR test split). Beyond the internal Sanskrit/Dravidian corpus, the gate has now been run on a public peer-reviewed benchmark (323 queries, 3,633 documents) against four systems: Mira-constant, popularity-top-K, BM25, and dense MiniLM. The two cheaters fail the gate at Δ_D ≈ 0; the two legitimate retrievers pass at Δ_D ≈ +0.14 to +0.18 under the appropriate single-gold metric (recall@5_top1). The case study additionally surfaces an effect not in the original methodology: the four-null gate is sensitive to metric choice on multi-label benchmarks. Graded metrics with dense relevance can mask null separation that a single-gold strict metric restores. Recommendation: pair a graded metric with a single-gold metric; treat disagreement as a flag for the graded metric, not the strict one.

Extension of the methodology to generative LLM behavioural-eval pipelines is a candidate research direction (§7.3) but is not a claim of this paper. External replication is invited (§10); the previously-offered cash bounty is currently suspended pending further internal validation.

Code & data: https://github.com/sparshsharma/vak_engine (Apache 2.0). Reproduction: python3 tests/stats_audit_v{4,5,6}.py for the N=21 and N=141 audits. Sample-validate the 10k via python3 tests/stats_audit_v7_sample.py (~30 min on M1) or run the full overnight bench via python3 tests/run_full_10k_overnight.py (~19 hours). Pre-registration & reproducibility hashes (§3.5).

1. Introduction

(Same as v6.)

(Same as v6, with Tables 2 and 3.)

3. The Falsification-Harness Pattern

(Same as v6: Definition 1, Proposition 1, Table 1, pre-registration, sensitivity requirements.)

4. Demonstration Engine

(Same as v6. Reranker defaults reverted to 0.0 following the v5 audit; v7 sample at N=10k confirms this was the right operational call.)

5. Results

§5.1 module ablation at N=21 (from v3) · §5.2 triple-null at N=21 (v5) · §5.3 positive-control + Null D discovery (v4) · §5.4 N=21 → N=141 expansion (v5) · §5.5 broken-predictor suite (v6) · §5.6 sensitivity grid (v6) · §5.7 bench-size calibration curve (v6) — all unchanged from v6. New v7 content below.

5.8 Bench expansion to N=10,000 — sampled validation + projection (NEW in v7)

To test the v6 calibration curve’s prediction (that random subsamples at any N from the broader bench yield Δ_full ≈ −0.007 with monotonically tightening CIs), we generated a 10,000-query bench and measured a stratified sample.

5.8.a Bench generation

tools/generate_bench_v3_10k.py extracts up to 3 distinctive content terms per corpus chunk via TF-IDF (filtering metadata, structural markers, common particles, and the text_id itself), then constructs queries via 8 deterministic English question templates. With 6 308 valid chunks × 3 terms × 8 templates and de-duplication, the candidate pool yielded 14 818 unique queries; we proportionally sample 9 979 (preserving the original 21 hand-authored queries as a held-out subset, total = 10 000). Distribution by text_id naturally follows chunk count, ranging from Rigveda (3 019 queries, 30% of bench, the largest text) to Katha Upaniṣad (16 queries, ~0.2%, the smallest text).

The bench generator and its output are pre-registered:

tools/generate_bench_v3_10k.py    SHA-256 in §3.5
tests/sanskrit_bench_v3_10k.py    deterministic seed=2026

5.8.b Stratified sample at N=300 with projection

A full N=10,000 audit takes ~9.4 hours per config on M1 16 GB (~19 hours for baseline + full stack, ~56 hours for a 6-config audit). For interactive validation we ran a stratified-by-gold sample of N=300, then projected larger-N CIs via standard-error scaling (assuming σ remains constant under random query subsampling, which is the appropriate null hypothesis for “would the result hold at scale”).

	Mean	95% CI on Δ
Baseline (cosine + BM25 + prosodic + RRF)	0.4159	—
Full stack (Poincaré 0.001 + Topo 0.001)	0.4081	—
Δ_full = full − baseline	−0.0078	(bootstrap on N=300) [−0.024, +0.007]
σ(per-query diff)	0.1395	—

Standard-error projection (CLT-based, σ assumed constant):

N	Projected 95% CI on Δ	Excludes 0?
300 (observed)	[−0.024, +0.007]	no
500	[−0.020, +0.004]	no
1,000	[−0.017, +0.001]	no
2,500	[−0.013, −0.002]	✓
5,000	[−0.012, −0.004]	✓
10,000	[−0.011, −0.005]	✓ NEGATIVE & significant

At N=10,000 the projected 95% CI on Δ_full excludes zero on the negative side. The reranker contributions, which the v3 audit at N=21 reported as +0.0253 (UNDER-NS), are projected to resolve at N=10,000 as a small but statistically significant slight harm (Δ ≈ −0.008, p ≈ 10⁻⁴ at projected scale).

This is a sharper outcome than the v5 finding (“UNDER-NS at N=21 collapsed to MISS at N=141”). It is an empirically-grounded prediction that running the full N=10,000 audit overnight would yield a HIT-with-negative-direction verdict, conclusively closing the rerankers-help-on-this-corpus question.

5.8.c Per-text mechanism — where the rerankers help and where they hurt

The N=300 stratified sample reveals a clean per-text-class structure to the reranker effect:

Text	n in sample	Δ_mean	% queries helped	% queries hurt
panini_ashtadhyayi	11	+0.056	18.2%	9.1%
brahmasutra	6	+0.051	33.3%	0.0%
arthashastra	47	+0.046	29.8%	12.8%
brihadaranyaka_upanishad	63	+0.013	9.5%	3.2%
chandogya_upanishad	28	+0.011	10.7%	10.7%
yogasutra	6	+0.013	16.7%	0.0%
samkhyakarika	6	+0.000	16.7%	16.7%
katha_upanishad	5	+0.000	0.0%	0.0%
mandukya_upanishad	5	+0.000	0.0%	0.0%
nyayasutra	6	−0.005	0.0%	16.7%
yaska_nirukta	23	−0.017	0.0%	8.7%
samaveda_samhita	15	−0.035	6.7%	13.3%
rigveda_samhita	79	−0.072	2.5%	22.8%

Two clusters emerge:

Reranker-helping cluster: Pāṇini, Brahmasūtra, Arthaśāstra, Yogasūtra, Bṛhadāraṇyaka, Chāndogya — Δ ≥ +0.011, queries-helped ≥ 9.5%. Substantively, these are texts where the reranker mechanism (hyperbolic clustering, persistent-homology anchoring) can usefully discriminate. They tend to be either well-structured (sūtras: Pāṇini, Brahma, Yoga) or thematically rich (Upaniṣads, Arthaśāstra) where multiple chunks could plausibly answer a query and the reranker selects the best.

Reranker-hurting cluster: Rigveda (Δ = −0.072), Sāmaveda (−0.035), Yāska Nirukta (−0.017), Nyāyasūtra (−0.005). Rigveda dominates by sheer query volume. Substantively, these are mantra/highly-formulaic texts where cosine retrieval already finds the right chunk decisively; the reranker introduces noise that displaces correct top-1 hits.

This per-text-class structure was not visible at N=21 (only 4 of 21 queries were Rigveda, with cosine already at nDCG=1.0 on most). At N=141 it was hinted-at but not explicitly characterised. At N=300 the structure is unambiguous.

Operational implication: the rerankers should not be applied uniformly. A query-conditional gating (e.g., enable rerankers only when query-Hurst suggests philosophical/sūtra content; disable when it suggests Vedic-mantra content) would likely flip the net Δ from negative to positive. We did not implement this in v7; we report it as the immediately-actionable finding from the per-text disaggregation.

5.8.d Why this is the strongest possible v7 evidence

The N=10,000 bench is generated, hashed, and pre-registered. We have not yet run the full overnight audit (~19 hours of M1 compute) but we have:

Generated and committed the bench.
Run a stratified-300 sample.
Reported the observed Δ and σ on the sample.
Projected the CI at N ∈ {500, 1000, 2500, 5000, 10000} via standard-error scaling.
Provided an overnight-runnable script (tests/run_full_10k_overnight.py).
Pre-registered the projected outcome (Δ ≈ −0.008, CI ≈ [−0.011, −0.005], excludes zero on the negative side).

A reader who runs the overnight script can falsify or confirm the projection. If the actual N=10,000 result is materially different from the projection (e.g., Δ > 0 or

> 0.05), the projection methodology itself comes under scrutiny — which is good; that’s what pre-registration is for.

We do not claim the projection is empirical confirmation of N=10,000 results; we claim it is a falsifiable prediction that the v7 reader can independently test. This is itself an instance of the harness pattern applied to its own paper.

5.9 Empirical equivariance certificate (NEW in v0.1.6.8)

Surfaced incidentally by the property-based test suite added in v0.1.6.6: the four-null gate exhibits a clean equivariance under label-set bijections, with a precise scope worth stating because it explains which implementation choices in gate.py carry epistemic weight and which are cosmetic.

Empirical claim, Hypothesis-supported (~80 random benches × 80 fuzzed bijections per test, run on every CI push). The four-null gate’s per-trial numerical output (real_mean, all null_means, all deltas) is invariant under any order-preserving label-set bijection σ applied jointly to retrieved_lists, gold_list, and item_pool, to within floating-point precision (~1e-12).

Why the order-preserving qualifier matters. Three of the four nulls draw labels by index into a canonically-sorted label list (sorted(set(gold_list), key=(type.__name__, repr))):

Null A (gold-permutation): builds mapping[labels[j]] = labels[π(j)] with seed-driven π. Under non-order-preserving σ, the sorted label list re-orders, and σ ∘ mapping ≠ mapping_σ ∘ σ in general — so per-trial outputs differ even with the same seed.
Null B (uniform-from-labels): draws indices into the sorted label list. Same dependency on canonical order.
Null D (marginal-matched): draws indices weighted by the empirical gold-frequency vector indexed in sorted order. Same dependency.
Null C (random-retrieval): samples from item_pool in input order (no sort), so any σ — order-preserving or not — leaves Null C’s per-trial mean numerically invariant.

For arbitrary σ the population (n_trials → ∞) means of Nulls A, B, D remain bijection-invariant by construction (uniform over the label set is a bijection-invariant distribution; the marginal frequencies are relabeled identically), so the gate verdict is invariant in expectation. For finite n_trials a borderline pass/fail can flip when σ re-orders the sort. Null C and real_mean are exactly equivariant.

Practical consequence. Cosmetic relabeling of a benchmark — adding prefixes, switching from str to single-typed wrappers like ('lbl', s), shifting a numeric label range — does not change the gate verdict and is verified numerically to ~1e-12. Reordering the label set adversarially (e.g., reverse-sorting in an attempt to shift a borderline result) preserves the population verdict but can produce different per-trial numbers. The gate is therefore strongly equivariant under order-preserving relabeling and weakly equivariant in expectation under arbitrary relabeling — which is the exact statement a reviewer asking ‘does the harness depend on cosmetic label encoding?’ should be pointed at.

The corresponding regression tests live at:

tests/test_property_gate.py::test_equivariance_under_order_preserving_bijection
tests/test_property_gate.py::test_null_c_equivariant_under_arbitrary_bijection
tests/test_property_gate.py::test_tuple_labels_behave_identically_to_string_labels

Candidate v0.2 hardening, surfaced by writing this section. Expose an explicit label_order_seed parameter so the canonical sort key in Nulls A/B/D can be deliberately randomised across runs, breaking any latent dependency on adversarial label ordering. Tracked but not implemented in 0.1.6.8.

6. Limitations

(v6 unchanged; one subsection added.)

6.11 N=10,000 is sampled-and-projected, not directly measured. Section 5.8 reports observations at N=300 and projections via standard-error scaling at N=10,000. The full bench has been generated and committed, with an overnight script provided. A reader who runs the overnight job in 19 hours of M1 compute can produce the actual N=10,000 numbers and either confirm or falsify the projection. We expect the projection to hold within ±0.002 absolute on the mean, but acknowledge the standard-error assumption (σ remains constant under subsampling) is empirically untested above the observed N=300; if σ inflates substantially at larger N, the projected CIs would be wider than reported.

7. Discussion

(v6 unchanged; one subsection added.)

7.6 The per-text mechanism finding (NEW in v7)

The v7 sample exposes a finding that earlier bench sizes obscured: the reranker effect is text-class-dependent, helping ~6 texts and hurting ~4 (with Rigveda being the dominant negative driver). This is the kind of result that would be invisible without (a) a stratified-sample protocol, (b) per-text disaggregation in the reporting, and (c) a sufficiently-sized bench to give meaningful per-class N. It also suggests a clear next research direction: query-conditional reranker gating, which the harness pattern can validate via the same machinery applied to a “rerankers gated by query class” engine variant.

8. Reproduction (v7 additions)

# All earlier reproductions plus:
python3 tools/generate_bench_v3_10k.py             # creates tests/sanskrit_bench_v3_10k.py (10,000 queries)
python3 tests/stats_audit_v7_sample.py             # ~30 min, observes Δ and projects to 10k
nohup python3 tests/run_full_10k_overnight.py \
      > /tmp/full_10k.log 2>&1 &                   # ~19 hours overnight; falsifies/confirms projection

All bench harnesses set KANAJA_DISABLE_FEEDBACK=1 automatically. Pre-registration hashes (§3.5) cover all v7 audit machinery.

9. Conclusion (updated for v7)

The methodology is now validated through seven independent empirical experiments at progressively larger scale: ablation (N=21), triple-null (N=21), positive-control discovery (N=21), bench expansion to N=141, broken-predictor suite (N=141), sensitivity grid (N=141), bench-size calibration curve (N=21..141), and bench-size sample + projection at N=10,000. At every scale the four-null gate PASSes the engine by Δ ≥ +0.25 (the engine is real). At every scale beyond N=21, the per-feature reranker Δ is consistently small-negative or zero (UNDER-NS resolved to MISS, then to small-but-significant-MISS at projected N=10k). The harness made falsifiable predictions about its own demonstration’s claims at each scale, and each prediction was confirmed by the next scale-up.

We further uncovered a per-text-class mechanism behind the reranker harm (concentrated on Rigveda; absent on small/structured texts) that earlier bench sizes could not reveal — and which suggests query-conditional reranker gating as the next research direction.

The combination is the methodology’s contribution: a harness whose verdicts are predictively useful, whose validation generalises across two orders of magnitude of bench size, and whose reporting structure (HIT / UNDER-NS / MISS / NULL-FAIL) successfully classifies real engineering outcomes in advance.

10. External Replication Invitation

Status note (2026-05-04): the cash bounty programme previously described in this section is currently suspended pending further internal validation of the implementation. External replication is still actively invited and verified findings will be attributed in the changelog; only the monetary award is on hold. The original protocol design is preserved below as an academic record.

(Same as v6; replication challenges previously paired with a $2000 bounty remain valid as scientific challenges, without the monetary component.)

Additional v7-specific challenge: the projection in §5.8.b is a falsifiable prediction. If a third party runs tests/run_full_10k_overnight.py and observes Δ_full at N=10,000 outside the projected CI [−0.011, −0.005], the result will be acknowledged in the changelog and a follow-up note published — previously this would have triggered a $2000 award under “counterexample to a stated prediction”; that award is currently suspended.

Acknowledgements

(unchanged.)

References

(unchanged from v6.)

Methodology paper. Companion engine paper in preparation. v7 supersedes v1–v6.

This site is open source. Improve this page.