falsify-eval

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

[0.2.0] — 2026-05-19

Released — first public release

The methodology library and its falsify-eval PyPI distribution go public under the Bhardwaj & Sons brand. Apache 2.0. v0.2.0 is the first version where pip install falsify-eval works without a git URL.

Fixed — v0.2-prep housekeeping pass

Four small fixes folded into this release. No behavioural change to the gate; no test-count change.

[0.1.6.11] — 2026-05-08

Fixed

[0.1.6.10] — 2026-05-08

Added — distribution + arXiv build prep

Three coordinated infrastructure additions that get the package from “clone-from-git only” to “ready for arXiv + PyPI.” Nothing in the gate’s behaviour changes; this is plumbing.

Internal

[0.1.6.9] — 2026-05-08

Added

Internal

[0.1.6.8] — 2026-05-08

Added — empirical equivariance certificate for the four-null gate

Two new property tests in tests/test_property_gate.py, plus PREPRINT §5.9 documenting the precise statement they support:

Documented (PREPRINT §5.9)

Total suite now 91/91; runs in ~10 s under python3 -W error::SyntaxWarning -m pytest tests/.

[0.1.6.7] — 2026-05-08

Fixed

[0.1.6.6] — 2026-05-08

Added — property-based test suite (Hypothesis) for the four-null gate

A package whose value proposition is rigor about retrieval evaluation has to be more rigorous than what it asks of users. tests/test_property_gate.py adds 13 universally-true properties of four_null_gate, each exercised against ~80 random benches generated by Hypothesis (≈1,040 example runs in ~6 s). The suite catches the classes of bug that line-coverage hides:

Algebraic invariants (per-result, must hold by construction):

  1. deltas[X] == real_mean - null_means[X] for every X ∈ {A,B,C,D}
  2. passes[X] == (deltas[X] >= tau)
  3. gate_passes == all(passes.values())
  4. Every float in the result is finite (no NaN / no Inf — closes the idcg=0 edge in ndcg_at_k)
  5. The result schema is complete (all documented keys present, sub-dicts keyed exactly by {A,B,C,D})
  6. τ-monotonicity: tightening τ cannot turn a FAIL into a PASS. A reviewer’s first sanity-check on a falsification harness.
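The invariants above can be checked mechanically on any result dict. The sketch below is illustrative only: `build_result` and `check_invariants` are hypothetical helpers, not part of falsify-eval, and the result layout (keys `real_mean`, `null_means`, `deltas`, `passes`, `gate_passes`) is assumed from the names used in this entry.

```python
import math

NULLS = ("A", "B", "C", "D")


def build_result(real_mean, null_means, tau):
    """Hypothetical stand-in: assemble a gate-style result dict from means."""
    deltas = {x: real_mean - null_means[x] for x in NULLS}
    passes = {x: deltas[x] >= tau for x in NULLS}
    return {
        "real_mean": real_mean,
        "null_means": dict(null_means),
        "deltas": deltas,
        "passes": passes,
        "gate_passes": all(passes.values()),
    }


def check_invariants(result, tau):
    """Assert invariants 1 to 5 on a single result dict."""
    for key in ("null_means", "deltas", "passes"):
        assert set(result[key]) == set(NULLS)              # schema completeness
    assert math.isfinite(result["real_mean"])
    for x in NULLS:
        assert result["deltas"][x] == result["real_mean"] - result["null_means"][x]
        assert result["passes"][x] == (result["deltas"][x] >= tau)
        assert math.isfinite(result["deltas"][x])
        assert math.isfinite(result["null_means"][x])
    assert result["gate_passes"] == all(result["passes"].values())
    return True


# Made-up means, chosen only to exercise the checks.
null_means = {"A": 0.10, "B": 0.12, "C": 0.08, "D": 0.30}
loose = build_result(0.40, null_means, tau=0.05)
tight = build_result(0.40, null_means, tau=0.20)
check_invariants(loose, 0.05)
check_invariants(tight, 0.20)

# tau-monotonicity: a PASS under the tighter tau implies a PASS under the looser one.
assert (not tight["gate_passes"]) or loose["gate_passes"]
```

In the real suite these checks would run against Hypothesis-generated benches rather than hand-picked numbers.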

Determinism:

  1. Same inputs + same seed → byte-identical numerical output across real_mean, null_means, deltas, passes, gate_passes. This is the property reviewers need to trust the headline numbers.
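A minimal sketch of the determinism property, assuming a toy seeded computation in place of the gate. `shuffled_null_mean` is a hypothetical stand-in; the only point is that a private, seeded RNG makes repeated runs bit-for-bit identical.

```python
import random


def shuffled_null_mean(scores, n_trials, seed):
    """Toy stand-in: mean of the first element across seeded shuffles."""
    rng = random.Random(seed)  # private RNG, so global random state cannot leak in
    total = 0.0
    for _ in range(n_trials):
        shuffled = list(scores)
        rng.shuffle(shuffled)
        total += shuffled[0]
    return total / n_trials


scores = [0.0, 0.25, 0.5, 0.75, 1.0]
first = shuffled_null_mean(scores, n_trials=50, seed=7)
second = shuffled_null_mean(scores, n_trials=50, seed=7)

# float.hex() exposes the exact bit pattern, so this is stricter than an
# approximate comparison.
assert first.hex() == second.hex()
```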

Metric properties (don’t even need the gate):

  1. ndcg, recall, mrr ∈ [0, 1] for every (retrieved, gold, k)
  2. recall@k is monotone in k
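The two metric properties can be illustrated with textbook binary-relevance definitions. The functions below are sketches written for this example (the `_sketch` names are hypothetical) and are not the library's implementations; they count each gold document at most once and return 0.0 when the ideal DCG is zero.

```python
import math


def _hit_flags(retrieved, gold, k):
    """True/False per top-k position, counting each gold doc at most once."""
    seen, flags = set(), []
    for doc in retrieved[:k]:
        hit = doc in gold and doc not in seen
        flags.append(hit)
        if hit:
            seen.add(doc)
    return flags


def recall_sketch(retrieved, gold, k):
    gold = set(gold)
    if not gold:
        return 0.0
    return sum(_hit_flags(retrieved, gold, k)) / len(gold)


def mrr_sketch(retrieved, gold, k):
    gold = set(gold)
    for rank, hit in enumerate(_hit_flags(retrieved, gold, k), start=1):
        if hit:
            return 1.0 / rank
    return 0.0


def ndcg_sketch(retrieved, gold, k):
    gold = set(gold)
    flags = _hit_flags(retrieved, gold, k)
    dcg = sum(1.0 / math.log2(r + 1) for r, hit in enumerate(flags, start=1) if hit)
    idcg = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(gold), k) + 1))
    return dcg / idcg if idcg > 0 else 0.0  # guard the idcg == 0 edge


retrieved = ["d3", "d1", "d3", "d9", "d2"]
gold = {"d1", "d2"}

previous_recall = 0.0
for k in range(0, 7):
    for metric in (ndcg_sketch, recall_sketch, mrr_sketch):
        assert 0.0 <= metric(retrieved, gold, k) <= 1.0   # range property
    current = recall_sketch(retrieved, gold, k)
    assert current >= previous_recall                      # monotone in k
    previous_recall = current
```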

Gate semantics:

  1. Oracle bench (retrieved[0] == gold) → real_mean == 1.0 and the gate passes at τ=0.05 on any multi-class bench. The “if this fails, the methodology is wrong” property.
  2. Type-preservation: relabelling every string s → ('lbl', s) produces numerically-identical results. Closes Mayank-defect #1 (numpy auto-coerced tuple labels into 2D arrays, silently disabling the gate for any non-string label type).
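A small sketch of the oracle and type-preservation properties at the metric level. It assumes a toy reciprocal-rank score over `(retrieved, gold)` pairs; the helper names are hypothetical, and the gate's pass/fail decision itself is not modelled here.

```python
def reciprocal_rank(retrieved, gold):
    """1/rank of the first retrieved item equal to gold, else 0.0."""
    for rank, item in enumerate(retrieved, start=1):
        if item == gold:
            return 1.0 / rank
    return 0.0


def mean_score(bench):
    return sum(reciprocal_rank(r, g) for r, g in bench) / len(bench)


def relabel(bench):
    """Map every string label s to the tuple ('lbl', s)."""
    return [([("lbl", s) for s in r], ("lbl", g)) for r, g in bench]


bench = [
    (["a", "b", "c"], "a"),
    (["b", "a", "c"], "a"),
    (["c", "b", "a"], "b"),
]

# Oracle bench: move the gold label to position 0 of every ranking.
oracle = [([g] + [x for x in r if x != g], g) for r, g in bench]
assert mean_score(oracle) == 1.0

# Type-preservation: tuple labels must score exactly like string labels.
assert mean_score(relabel(bench)) == mean_score(bench)
```

Plain Python equality treats tuples and strings alike, which is why the relabelled bench scores identically here; the defect described above arose only once labels were passed through numpy array construction.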

Validation guards:

  1. tau ∉ [0, 1] and negative seeds raise ValueError with a useful message — exercised across the full bad-input space, not just point samples.
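A guard of this kind might look like the sketch below. `validate_gate_args` is a hypothetical name and the messages are illustrative; the library's actual checks and wording may differ.

```python
def validate_gate_args(tau, seed):
    """Reject out-of-range tau and negative seeds before any work starts."""
    # The chained comparison is False for NaN, so NaN is rejected as well.
    if not (0.0 <= tau <= 1.0):
        raise ValueError(f"tau must lie in [0, 1]; got {tau!r}")
    if seed < 0:
        raise ValueError(f"seed must be a non-negative integer; got {seed!r}")


def raises_value_error(fn, *args):
    """Return True if fn(*args) raises ValueError, False otherwise."""
    try:
        fn(*args)
    except ValueError:
        return True
    return False


assert all(raises_value_error(validate_gate_args, t, 0) for t in (-0.1, 1.5, float("nan")))
assert all(raises_value_error(validate_gate_args, 0.05, s) for s in (-1, -42))
assert not raises_value_error(validate_gate_args, 0.05, 0)
```

A property-based version would draw `tau` and `seed` from strategies covering the whole invalid range instead of the few hand-picked values used here.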

The suite runs in ~6 s on a laptop and is part of the default pytest run. .hypothesis/ was already in the tree, and no new dependency is added: hypothesis was already a dev dependency.

[0.1.6.5] — 2026-05-08

Added

Internal

[0.1.6.4] — 2026-05-08

Fixed

Added

Internal

[0.1.6.3] — 2026-05-08

Added — public priority announcement of companion engine Vāk-Kaṇaja

This release makes no functional change. It adds a “Companion engine” section to the README that establishes public priority on the engine name (Vāk-Kaṇaja), its two named contributions (Pramāṇa-aware query routing; Anupalabdhi non-perception confidence floor), and the calibration discipline applied to it (the negative result on the novel rerankers at bench expansion, documented as a contribution rather than buried). The full vak-kanaja code release follows the morning launch sequence in a separate repo (bhardwaj-and-sons/vak-kanaja, public release imminent).

This is the “establish priority without releasing implementation” pattern that mathematicians, physicists, and patent-filers have used for 200 years. Anyone scooping the methodology now has the priority graph to contend with.

Tests

[0.1.6.2] — 2026-05-07

Fixed — Mayank Singh round-3 polish (negative-seed validation)

Mayank ran a 25-probe round-3 review against v0.1.6.1 and reported 23/25 PASS. The non-PASS items traced to flaws in his own test fixtures, with the exception of one polish item that we honour here: negative seed values fell through to numpy.random.default_rng(-1), which raises an unhelpful internal error.

Credit: Mayank Singh — third clean round in 48 hours.

[0.1.6.1] — 2026-05-07

Fixed — Mayank Singh round-2 review (CLI stdin sentinel)

Credit: Mayank Singh, who re-ran the full battery on v0.1.5.1 against the six round-2 surfaces and reported this one cleanly with a one-line repro.

Closed via v0.1.6 (Mayank’s round-2 finding #2)

Mayank’s round-2 also flagged the PREPRINT abstract still naming features not shipped in the public library. v0.1.6 (shipped earlier today) already addressed this: the abstract was rewritten to clearly separate shipped vs methodology-spec items, and bonferroni() was added to the public stats API. Mayank tested v0.1.5.1, which predates that fix.

[0.1.6] — 2026-05-07

Added — Lewi gap closure (consolidation pass)

Lewi Stone reviewed the brand site on 2026-05-07 and identified three real gaps: (1) the empirical case was missing — no demonstration of the gate working on a real, public benchmark; (2) the documentation promised evidence and delivered analogy; (3) the framing conflated AI systems broadly with retrieval and ranking systems specifically. This release closes all three.

Changed — copy + scope honesty

Added — case study CS02 (SciFact triangulation)

case_studies/cs02_scifact/ is the second BEIR slice: 300 queries × 5,183 docs, sparse relevance (~1.1 docs/query). It confirms that the gate works and triangulates the CS01 metric-sensitivity finding: on sparse-relevance benchmarks both metrics give clean separation, while on dense-relevance benchmarks only the single-gold metric does. The joint CS01+CS02 picture provides an empirical foundation across two relevance regimes.

Tests

[0.1.5.2] — 2026-05-06

Added — progress=True flag (AIKosh 5-hour incident)

Mayank reported the gate had been running 5 hours under AIKosh’s harness with no visible progress. Profiling confirmed the gate itself is fast (N=5,000 × pool=100k × n_trials=50 finishes in <2s with a cheap metric). The 5-hour runtime is fully explained by an LLM-judge metric at ~200 ms / call: N * (1 + 4 * n_trials) calls = ~100k for N=500, n_trials=50, which at 200 ms each is ~5.6 hours.
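The runtime estimate can be reproduced with a few lines of arithmetic. The variable names are illustrative, and reading the formula as one real scoring pass plus `n_trials` passes for each of the four nulls is an interpretation of the expression above.

```python
N = 500                   # queries in the bench
n_trials = 50             # trials per null
n_nulls = 4               # nulls A, B, C, D
seconds_per_call = 0.200  # ~200 ms per LLM-judge metric call

calls = N * (1 + n_nulls * n_trials)
hours = calls * seconds_per_call / 3600

print(calls, round(hours, 2))  # 100500 5.58
```

The metric call count, not the gate's own bookkeeping, dominates the wall-clock time at these settings.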

The library can’t speed up a slow user metric, but it can stop running silently. v0.1.5.2 adds:

[0.1.5.1] — 2026-05-06

Fixed — same defect class as Mayank #1, third null

[0.1.5] — 2026-05-06

Fixed — Mayank Singh adversarial battery (14 defects, headline #1 catastrophic)

Credit: Mayank Singh / Indian AI Lab ran a 47-test stress battery against v0.1.4 and surfaced 14 real defects. Every fix below is paired with a regression test in tests/test_mayank_battery.py.

Added

[0.1.4] — 2026-05-04

Added — terminal UX overhaul (Claude-Code-style hints)

Added — explicit compatibility statement

Added — LLM-RAG validation worked example

Honest non-claim

We do not claim “tested against every known AI model.” That requires hundreds of dollars in API costs and a multi-day study. We ship the worked Claude example as a pattern; running it against other models is one function-body swap and we encourage external validators to publish the result of doing so.

[0.1.3] — 2026-05-04

Fixed

Added

Scope statement (added in response to user request)

[0.1.2] — 2026-05-04

Fixed

Reported by

[0.1.1] — 2026-05-01

Fixed

Added

0.1.0 — 2026-05-01

Added