falsify-eval

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

[0.2.0] — 2026-05-19

Released — first public release

The methodology library and its falsify-eval PyPI distribution go public under the Bhardwaj & Sons brand. Apache 2.0. v0.2.0 is the first version where pip install falsify-eval works without a git URL.

Published to PyPI as falsify-eval via the OIDC trusted-publisher workflow on v* tag push. The version-sync guard cross-checks tag against __init__.py and pyproject.toml before the upload step runs (per the v0.1.6.11 fix).
Zenodo DOI minted for the release artefact. The DOI badge in the README and the identifiers block in CITATION.cff are populated from the GitHub Release → Zenodo deposit webhook.
Repository made public under spalsh-spec/falsify-eval. GitHub Pages enabled at https://spalsh-spec.github.io/falsify-eval/, serving the interactive sliders playground (play.html) and the long-form HTML explainer.
No behavioural changes from v0.1.6.11. The four-null gate, the lock, the stats helpers, the CLI, and the MCP server are byte-identical to v0.1.6.11. v0.2.0 is the public-distribution turn, not a code turn.

Fixed — v0.2-prep housekeeping pass

Five small fixes folded into this release. No behavioural change to the gate; no test-count change.

CITATION.cff URL drift. url: pointed at the wrong GitHub handle (sparshsharma); corrected to the canonical spalsh-spec. Citation-graph crawlers and reviewers use this field; wrong handle = broken linkage.
AIKosh spelling normalized across public surfaces. README, NEXT, and CHANGELOG entries used variants such as “AI Kosh” and “Akosh-AI” interchangeably. All platform references now use the official spelling AIKosh (https://aikosh.indiaai.gov.in/). Personal-name reference in v0.1.2’s “External user (Akosh, India, 2026-05-04)” left unchanged. Directory case_studies/cs03_aikosh_rag/ unchanged.
README Status section stale at v0.1.6.8. Bumped current-version line through v0.1.6.11 (and now v0.2.0); added per-version summary bullets for v0.1.6.9 (CS03 scaffold + Tested-platforms log), v0.1.6.10 (publish workflow + arXiv build prep), and v0.1.6.11 (publish workflow version-sync fix).
README test badge bumped 67 → 91. Stale since the v0.1.6.3 era.
README vak-kanaja cross-link corrected from the non-existent github.com/bhardwaj-and-sons/vak-kanaja to the working github.com/spalsh-spec/vak-kanaja. Forward-compatible: when the bhardwaj-and-sons org is created and the repo is transferred, GitHub redirects the spalsh-spec URL to the new location, as long as no new repository is later created under the old name.

[0.1.6.11] — 2026-05-08

Fixed

Publish workflow’s version-sync guard tried to import falsify_eval before the package was installed. The workflow caught this itself on its first run: the v0.1.6.10 tag triggered it, and it failed at the version-check step before reaching upload. That is how a guard should fail when something is wrong, except that this time the faulty component was the guard. The guard now reads __version__ and pyproject.toml’s version directly via grep/sed, which (a) doesn’t require the package to be importable, and (b) cross-checks both source files against each other and against the tag. Three-way agreement is now the gate, matching what tests/test_mayank_battery.py::test_d7_version_sync does in pytest.
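
The three-way agreement can be illustrated in Python. This is a sketch only: the workflow itself uses grep/sed, and the function names and regular expressions below are assumptions made for illustration, not code from the repository.

```python
import re


def read_init_version(init_text: str) -> str:
    """Extract __version__ from the text of __init__.py without importing it."""
    match = re.search(r'^__version__\s*=\s*["\']([^"\']+)["\']', init_text, re.MULTILINE)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def read_pyproject_version(pyproject_text: str) -> str:
    """Extract the first `version = "..."` line from pyproject.toml text."""
    match = re.search(r'^version\s*=\s*["\']([^"\']+)["\']', pyproject_text, re.MULTILINE)
    if match is None:
        raise ValueError("no version field found")
    return match.group(1)


def versions_agree(tag: str, init_text: str, pyproject_text: str) -> bool:
    """Three-way check: tag (leading 'v' stripped), __init__.py, and pyproject.toml."""
    tag_version = tag[1:] if tag.startswith("v") else tag
    return tag_version == read_init_version(init_text) == read_pyproject_version(pyproject_text)


# Hypothetical file contents, used only to exercise the check.
INIT_TEXT = '__version__ = "0.1.6.11"\n'
PYPROJECT_TEXT = '[project]\nname = "falsify-eval"\nversion = "0.1.6.11"\n'
```

Because both files are parsed as text, the check runs before any install step, which is the property the fix was after.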

[0.1.6.10] — 2026-05-08

Added — distribution + arXiv build prep

Three coordinated infrastructure additions that get the package from “clone-from-git only” to “ready for arXiv + PyPI.” Nothing in the gate’s behaviour changes; this is plumbing.

.github/workflows/publish.yml — publishes to PyPI on every v* tag push using OIDC trusted publishing (no API tokens stored as repository secrets, no key rotation, no exfiltration risk if the workflow is ever compromised). Workflow includes a tag-vs-version-sync guard so a typo’d tag refuses to publish. One-time PyPI-side setup documented in docs/PYPI_PUBLISHING.md. The package name falsify-eval is verified available on PyPI at submission time.
tools/build_arxiv.sh — converts PREPRINT.md to an arXiv-submittable LaTeX bundle via pandoc (arxiv/preprint.tex + arxiv/falsify-eval-arxiv-submission.tar.gz). Optional local PDF preview if xelatex/pdflatex is installed. Categorisation, abstract guidance, endorsement notes, cover letter draft, and post-submission checklist all in docs/ARXIV_SUBMISSION.md.
[tool.mutmut] config in pyproject.toml + docs/MUTATION_TESTING.md documenting the deferred status: mutmut 3.x has a macOS regression (/.VolumeIcon.icns filesystem-root copy attempt) and mutmut 2.x has a Python 3.14 incompatibility (cannot pickle 'itertools.count'). Neither is a defect in this package; both resolve when upstream ships 3.14 support OR when CI adds a 3.12-pinned mutation-test job. Tracked for v0.2 with the exact configuration committed in pyproject.toml.
[project.optional-dependencies] dev bucket added alongside test, pinning mutmut, build, and twine so a dev-environment install is one command: pip install -e ".[dev]".

Internal

arxiv/, .mutmut-cache/, and mutants/ added to .gitignore — these are regenerable artefacts that should not be committed.
Local wheel build verified: python3 -m build produces a 12-file, ~100KB wheel that passes twine check. Installed in a fresh venv, falsify-eval doctor exits 0 with the same numbers as the editable install (real_mean=0.8557, all four nulls pass, GATE PASS).
91/91 tests still pass under python3 -W error::SyntaxWarning -m pytest tests/.

[0.1.6.9] — 2026-05-08

Added

CS03 case-study scaffold (case_studies/cs03_aikosh_rag/). First slot for a real production retriever inside an organisation, prepared after Jasmeet Singh (AIKosh) volunteered to wire the four-null gate into AIKosh’s internal RAG benchmark. The slot includes:
- CS03_REPORT.md — pre-registered structure with TBD sections marked explicitly so no fabricated numbers can sit there.
- run_case_study.py — refuses to run until data/queries.jsonl, data/pool.txt, and data/retriever.py are provided; exits 2 with a clear input-list message rather than silently producing fake output.
- Pre-registered expected outcomes in §5 so the actual run can falsify the predictions when results land.
Tested platforms log in README. External-verification entries by testers who are not the package author, dated, with version pinned. Initial entries:
- 2026-05-08 — Jasmeet Singh (AIKosh) — Windows 10 (19045) / Python 3.14.3 / PowerShell — verified install + upgrade (0.1.6.2 → 0.1.6.7) + doctor + quickstart + grade all clean; confirmed the cp1252 defect closed in 0.1.6.4 stays closed.
- 2026-05-07 — Mayank Singh — macOS 14 (M1) / Python 3.12 / zsh — 14-defect adversarial battery, all closed by 0.1.6.2.

Internal

README Status section was stale at 0.1.6.3; brought up to date with entries for 0.1.6.4 → 0.1.6.9 and a corrected test count (91, was 67).
v0.2 plan renumbered: CS03 now slots AIKosh, CS04 → FiQA, CS05 → Quora. (CS01 NFCorpus and CS02 SciFact are already in tree with results.)

[0.1.6.8] — 2026-05-08

Added — empirical equivariance certificate for the four-null gate

Two new property tests in tests/test_property_gate.py, plus PREPRINT §5.9 documenting the precise statement they support:

test_equivariance_under_order_preserving_bijection (Hypothesis, ~80 random benches × ~80 fuzzed prefixes). Under any order-preserving label-set bijection σ applied jointly to retrieved, gold, and item_pool, the gate’s per-trial real_mean, all four null_means, all deltas, and the verdict (gate_passes) are identical to the un-bijected run, to within ~1e-12. This is the property a reviewer asking “does the harness depend on cosmetic label encoding?” should be pointed at — the answer is no, by certificate.
test_null_c_equivariant_under_arbitrary_bijection (Hypothesis, ~80 random benches × ~80 fuzzed permutations of the pool). Under any bijection σ — order-preserving or not — real_mean and Null C’s per-trial mean are exactly equivariant. Null C samples from item_pool in input order (no sort), so the seed-driven sample sequence is bijection-stable.

Documented (PREPRINT §5.9)

The exact scope of equivariance: strong (per-trial numerical) under order-preserving σ; weak (in-expectation, population) under arbitrary σ for Nulls A/B/D, because those nulls index into a canonically sorted label list. Null C and real_mean are exactly equivariant under any σ.
A worked-example proof sketch showing why Nulls A/B/D break per-trial equivariance under non-order-preserving σ (σ ∘ mapping ≠ mapping_σ ∘ σ when sort-order changes) but remain bijection-invariant in expectation.
A candidate v0.2 hardening: an explicit label_order_seed parameter that deliberately randomises the canonical sort, breaking any latent dependency on adversarial label ordering. Tracked, not implemented in this release.
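
The role of the canonical sort can be seen in a toy example. The sketch below is not taken from the package; the helper name and the two relabellings are hypothetical. It only shows that sorting commutes with an order-preserving σ and fails to commute with one that reorders labels, which is why an index into the sorted list can point at a different underlying item after an arbitrary relabelling.

```python
def sort_commutes_with(sigma, labels):
    """True when relabel-then-sort equals sort-then-relabel."""
    return sorted(sigma(x) for x in labels) == [sigma(x) for x in sorted(labels)]


labels = ["a", "b", "c", "d"]

# Order-preserving bijection: a < b < c < d maps to k < m < p < z.
order_preserving = {"a": "k", "b": "m", "c": "p", "d": "z"}.__getitem__

# Bijection onto the same target set that scrambles the order.
order_breaking = {"a": "z", "b": "k", "c": "p", "d": "m"}.__getitem__

commutes = sort_commutes_with(order_preserving, labels)
breaks = sort_commutes_with(order_breaking, labels)
```

Here `commutes` is true and `breaks` is false: under the second relabelling, position 0 of the sorted list holds the image of "b" rather than the image of "a".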

Total suite now 91/91; runs in ~10 s under python3 -W error::SyntaxWarning -m pytest tests/.

[0.1.6.7] — 2026-05-08

Fixed

CI on 0.1.6.6 failed across all matrix cells because the new tests/test_property_gate.py imports hypothesis, which was used only as a transitive dev install on my local box (a .hypothesis/ cache was in the tree but no dependency had been declared). Two-line fix:
- pyproject.toml: add hypothesis>=6.0 to the test optional-deps bucket alongside pytest>=7.0.
- .github/workflows/ci.yml: install with pip install -e ".[test]" instead of pip install -e . + ad-hoc pip install pytest. Now the test extras govern what CI installs, so adding a dev dep in pyproject.toml automatically propagates to CI without a workflow edit.

[0.1.6.6] — 2026-05-08

Added — property-based test suite (Hypothesis) for the four-null gate

A package whose value proposition is rigor about retrieval evaluation has to be more rigorous than what it asks of users. tests/test_property_gate.py adds 13 universally-true properties of four_null_gate, each exercised against ~80 random benches generated by Hypothesis (≈1,040 example runs in ~6 s). The suite catches the classes of bug that line-coverage hides:

Algebraic invariants (per-result, must hold by construction):

deltas[X] == real_mean - null_means[X] for every X ∈ {A,B,C,D}
passes[X] == (deltas[X] >= tau)
gate_passes == all(passes.values())
Every float in the result is finite (no NaN / no Inf — closes the idcg=0 edge in ndcg_at_k)
The result schema is complete (all documented keys present, sub-dicts keyed exactly by {A,B,C,D})
τ-monotonicity: tightening τ cannot turn a FAIL into a PASS. A reviewer’s first sanity-check on a falsification harness.
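
A minimal sketch of how the algebraic invariants could be checked on a single result dict. The key names (real_mean, null_means, deltas, passes, gate_passes) come from this changelog; the checker function and the example numbers are hypothetical and are not the package's test code.

```python
import math

NULLS = ("A", "B", "C", "D")


def invariant_violations(result, tau, tol=1e-12):
    """Return a list of human-readable violations; empty means all invariants hold."""
    problems = []
    for key in ("null_means", "deltas", "passes"):
        if set(result[key]) != set(NULLS):
            return [f"{key} is not keyed exactly by A, B, C, D"]
    if not math.isfinite(result["real_mean"]):
        problems.append("real_mean is not finite")
    for x in NULLS:
        delta = result["deltas"][x]
        expected = result["real_mean"] - result["null_means"][x]
        if not (math.isfinite(delta) and math.isfinite(expected)):
            problems.append(f"non-finite value for null {x}")
        elif not math.isclose(delta, expected, abs_tol=tol):
            problems.append(f"deltas[{x}] != real_mean - null_means[{x}]")
        if result["passes"][x] != (delta >= tau):
            problems.append(f"passes[{x}] disagrees with deltas[{x}] >= tau")
    if result["gate_passes"] != all(result["passes"].values()):
        problems.append("gate_passes != all(passes.values())")
    return problems


# Made-up numbers, chosen so that null D misses tau = 0.05.
TAU = 0.05
real_mean = 0.80
null_means = {"A": 0.30, "B": 0.25, "C": 0.01, "D": 0.78}
deltas = {x: real_mean - null_means[x] for x in NULLS}
passes = {x: deltas[x] >= TAU for x in NULLS}
example = {
    "real_mean": real_mean,
    "null_means": null_means,
    "deltas": deltas,
    "passes": passes,
    "gate_passes": all(passes.values()),
}
```

A property-based suite would generate many such results from random benches; the fixed example only shows the shape of each check.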

Determinism:

Same inputs + same seed → byte-identical numerical output across real_mean, null_means, deltas, passes, gate_passes. This is the property reviewers need to trust the headline numbers.

Metric properties (don’t even need the gate):

ndcg, recall, mrr ∈ [0, 1] for every (retrieved, gold, k)
recall@k is monotone in k
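
The two metric properties can be illustrated with stand-in metrics. The implementations below are assumptions written for this example (a set-valued gold, simple recall and reciprocal rank); the package's own ndcg, recall, and mrr may differ in signature and detail.

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of gold items that appear in the top-k retrieved list."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(gold_set.intersection(retrieved[:k])) / len(gold_set)


def mrr_at_k(retrieved, gold, k):
    """Reciprocal rank of the first gold item within the top k, else 0."""
    gold_set = set(gold)
    for rank, item in enumerate(retrieved[:k], start=1):
        if item in gold_set:
            return 1.0 / rank
    return 0.0


retrieved = ["d3", "d1", "d7", "d2"]
gold = {"d1", "d2"}

# Recall for k = 1..4; growing k can only add hits, never remove them.
recalls = [recall_at_k(retrieved, gold, k) for k in range(1, 5)]
is_monotone = all(a <= b for a, b in zip(recalls, recalls[1:]))
in_unit_interval = all(0.0 <= r <= 1.0 for r in recalls)
```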

Gate semantics:

Oracle bench (retrieved[0] == gold) → real_mean == 1.0 and the gate passes at τ=0.05 on any multi-class bench. The “if this fails, the methodology is wrong” property.
Type-preservation: relabelling every string s → ('lbl', s) produces numerically-identical results. Closes Mayank-defect #1 (numpy auto-coerced tuple labels into 2D arrays, silently disabling the gate for any non-string label type).

Validation guards:

tau ∉ [0, 1] and negative seeds raise ValueError with a useful message — exercised across the full bad-input space, not just point samples.

The suite runs in ~6 s on a laptop and is part of the default pytest run. .hypothesis/ was already in the tree; no new top-level dependency added beyond hypothesis (already a dev dep).

[0.1.6.5] — 2026-05-08

Added

Cross-platform path-mangling detection. When --input or --pool points to a non-existent file on POSIX, we now check whether the path matches the shell-eaten form of a Windows-style path (e.g. my-benchbench.jsonl ← typed my-bench\bench.jsonl, where zsh/bash treated \ as an escape rather than a path separator). If we can unambiguously decode the intent — i.e. exactly one prefix in cwd is a directory whose name is a prefix of the bad path AND contains the remainder as a real file — we surface a precise “did you mean my-bench/bench.jsonl?” hint instead of the bare FileNotFoundError. Reported by Parth 2026-05-08 after copy-pasting Jasmeet’s Windows tutorial command into zsh.
Regression test tests/test_shell_mangled_paths.py covering both the recovery suggestion and end-to-end grade error formatting.
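
The decoding rule described above might be sketched as follows. The function name is hypothetical and the sketch handles only the single-separator case from the example; it is not the helper shipped in cli.py.

```python
import tempfile
from pathlib import Path
from typing import Optional


def suggest_unmangled(bad_path: str, cwd: Path) -> Optional[str]:
    """Suggest dir/file when exactly one directory in cwd explains the bad path."""
    candidates = []
    for entry in cwd.iterdir():
        if entry.is_dir() and bad_path.startswith(entry.name):
            remainder = bad_path[len(entry.name):]
            # The remainder must exist as a real file inside that directory.
            if remainder and (entry / remainder).is_file():
                candidates.append(f"{entry.name}/{remainder}")
    # Only answer when the intent is unambiguous.
    return candidates[0] if len(candidates) == 1 else None


def demo() -> Optional[str]:
    """Recreate the my-bench\\bench.jsonl case in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "my-bench").mkdir()
        (root / "my-bench" / "bench.jsonl").write_text("{}\n", encoding="utf-8")
        return suggest_unmangled("my-benchbench.jsonl", root)
```

Returning None for zero or several candidates keeps the hint from guessing when more than one directory could explain the path.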

Internal

Fixed a SyntaxWarning: invalid escape sequence in the new helper’s docstring by switching to a raw docstring (r"""). Caught by running python3 -W error::SyntaxWarning -m pytest.

[0.1.6.4] — 2026-05-08

Fixed

Windows console crash on grade (reported by Jasmeet, Win10/PowerShell, Py 3.14.3). The pretty-printer emitted Δ, τ, ✓, ✗, ⚠, ─ which the legacy Windows console (cp1252 codepage) cannot encode, raising UnicodeEncodeError: 'charmap' codec can't encode character 'Δ' mid-print. Two-layer fix in falsify_eval/cli.py:
1. UTF-8 hardening at CLI entry. main() now calls _init_io() which reconfigures sys.stdout/sys.stderr to UTF-8 with errors='replace' before anything is printed. This alone makes the original crash impossible on every modern Python (≥3.7) and on every host OS, since the codepage of the underlying console no longer governs the encoding used by the interpreter.
2. Auto-degrade to ASCII when the stream still can’t encode. If stdout’s post-reconfigure encoding still rejects our glyphs (e.g. piping into a non-UTF-8 log processor), the printer transparently falls back to ASCII equivalents: Δ→d, τ→tau, ✓→[ok], ✗→[x], ⚠→!, ─→-.
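
A rough sketch of the two layers, assuming only the standard library. The glyph table is the one listed above; the function names are hypothetical stand-ins and this is not the code in falsify_eval/cli.py.

```python
import sys

# Glyph substitutions as listed in the fix description.
ASCII_FALLBACK = {"Δ": "d", "τ": "tau", "✓": "[ok]", "✗": "[x]", "⚠": "!", "─": "-"}


def init_io() -> None:
    """Layer 1: switch stdout/stderr to UTF-8 where the stream supports it."""
    for stream in (sys.stdout, sys.stderr):
        reconfigure = getattr(stream, "reconfigure", None)  # TextIOWrapper, Python 3.7+
        if reconfigure is not None:
            reconfigure(encoding="utf-8", errors="replace")


def can_encode(text: str, encoding: str) -> bool:
    """True when every character of text is representable in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        return False


def render(text: str, encoding: str, force_ascii: bool = False) -> str:
    """Layer 2: degrade to ASCII equivalents when forced or when encoding fails."""
    if force_ascii or not can_encode(text, encoding):
        for glyph, plain in ASCII_FALLBACK.items():
            text = text.replace(glyph, plain)
    return text
```

For example, `render("Δ=0.12 ✓", "cp1252")` degrades to `d=0.12 [ok]`, while the same call with `"utf-8"` leaves the string untouched.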

Added

--ascii flag and FALSIFY_ASCII=1 environment variable to force ASCII-only output on demand (useful for CI logs that strip UTF-8).
doctor now reports stdout encoding and ascii_mode so install bugs related to console encoding are visible from a single command.
Regression test tests/test_windows_encoding.py that simulates a cp1252 console and proves the old code path crashes, the new path doesn’t, and --ascii produces a fully cp1252-decodable output stream.

Internal

All Path.open() and Path.read_text() / write_text() calls in cli.py now pass encoding='utf-8' explicitly. This was a latent companion bug — on the same Windows host that crashed Jasmeet’s print, reading a UTF-8 bench.jsonl could silently mojibake-corrupt rows depending on user locale.

[0.1.6.3] — 2026-05-08

Added — public priority announcement of companion engine Vāk-Kaṇaja

This release is non-functional: it adds a “Companion engine” section to the README that establishes public priority on the engine name (Vāk-Kaṇaja), its two named contributions (Pramāṇa-aware query routing; Anupalabdhi non-perception confidence floor), and the calibration discipline applied to it (the negative result on the novel rerankers at bench expansion, documented as a contribution rather than buried). The full vak-kanaja code release follows the morning launch sequence in a separate repo (bhardwaj-and-sons/vak-kanaja, public release imminent).

This is the “establish priority without releasing implementation” pattern that mathematicians, physicists, and patent-filers have used for 200 years. Anyone scooping the methodology now has the priority graph to contend with.

README updated with “Companion engine — Vāk-Kaṇaja” section above Status
Status section now references v0.1.6.3 and the test count of 67
Test badge updated 62 → 67

Tests

67 passing on a fresh clone (no functional changes from v0.1.6.2): Mayank-battery 31 + property-based 4 + scipy cross-check 11 + smoke 8 + validation 9 + CLI stdin 4. Total runtime <4 seconds.

[0.1.6.2] — 2026-05-07

Fixed — Mayank Singh round-3 polish (negative-seed validation)

Mayank ran a 25-probe round-3 review against v0.1.6.1 and reported 23/25 PASS. The non-PASS findings traced to flaws in his own test fixtures, with one exception, a polish item we honour here: negative seed values fell through to numpy.random.default_rng(-1), which raises an unhelpful internal error.

_validate_inputs now rejects non-int and negative seeds up-front with a contextual ValueError: seed must be a non-negative integer, got <repr>.
New regression test test_d15_negative_or_non_int_seed_raises_clean_error parametrised across 5 bad seeds (-1, -100, 0.5, “2026”, None).
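
An up-front seed check of this kind might look like the sketch below. The function names are hypothetical. Rejecting bool and accepting only built-in int are assumptions of this sketch; the library's _validate_inputs may treat those cases differently.

```python
def validate_seed(seed):
    """Accept only non-negative built-in ints; raise a contextual ValueError otherwise."""
    # bool is a subclass of int, so it is excluded explicitly (an assumption here).
    if isinstance(seed, bool) or not isinstance(seed, int) or seed < 0:
        raise ValueError(f"seed must be a non-negative integer, got {seed!r}")
    return seed


def is_rejected(seed) -> bool:
    """Helper for the examples: True when validate_seed raises ValueError."""
    try:
        validate_seed(seed)
    except ValueError:
        return True
    return False


# The five bad seeds named in the regression test above.
BAD_SEEDS = (-1, -100, 0.5, "2026", None)
all_bad_rejected = all(is_rejected(s) for s in BAD_SEEDS)
```

The type check runs before the sign check, so None and strings never reach the `<` comparison.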

Credit: Mayank Singh — third clean round in 48 hours.

[0.1.6.1] — 2026-05-07

Fixed — Mayank Singh round-2 review (CLI stdin sentinel)

falsify-eval grade --input - now reads JSONL from stdin (UNIX convention). v0.1.5.1 wrapped args.input in Path() before opening, which turned - into a literal filename and crashed with FileNotFoundError: '-'. v0.1.6.1 threads - through load_jsonl() directly and dispatches to sys.stdin.
Error messages now label stdin as <stdin> (e.g. <stdin>:2: invalid JSON) instead of leaking a misleading filename.
--input help text now documents the - sentinel.
4 new regression tests in tests/test_cli_stdin.py exercise the fix via subprocess against the actual CLI entry point: stdin streaming success, empty-stdin clean failure (the v0.1.5.1 regression must not return), malformed-stdin error labelling, and file-input no-regression.
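
The stdin dispatch and error labelling could be sketched as below. This is a hypothetical reimplementation for illustration, not the package's load_jsonl(); the `stdin` parameter and the empty-input message are assumptions added so the example can run without a real pipe.

```python
import io
import json
import sys


def load_jsonl_sketch(source, stdin=None):
    """Read JSONL rows from a file path, or from stdin when source is "-"."""
    if source == "-":
        stream = stdin if stdin is not None else sys.stdin
        label, should_close = "<stdin>", False
    else:
        stream = open(source, encoding="utf-8")
        label, should_close = str(source), True
    try:
        rows = []
        for lineno, line in enumerate(stream, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Label the error with the logical source, never a fake filename.
                raise ValueError(f"{label}:{lineno}: invalid JSON") from exc
        if not rows:
            raise ValueError(f"{label}: no rows found")
        return rows
    finally:
        if should_close:
            stream.close()


# Simulated stdin with two valid rows.
rows = load_jsonl_sketch("-", stdin=io.StringIO('{"q": 1}\n{"q": 2}\n'))
```

The key point is that "-" is never wrapped in a path object, so it cannot be mistaken for a file named "-".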

Credit: Mayank Singh — re-ran the full battery on v0.1.5.1 against the six round-2 surfaces and surfaced this one cleanly with a one-line repro.

Closed via v0.1.6 (Mayank’s round-2 finding #2)

Mayank’s round-2 also flagged the PREPRINT abstract still naming features not shipped in the public library. v0.1.6 (shipped earlier today) already addressed this: the abstract was rewritten to clearly separate shipped vs methodology-spec items, and bonferroni() was added to the public stats API. Mayank tested v0.1.5.1, which predates that fix.

[0.1.6] — 2026-05-07

Added — Lewi gap closure (consolidation pass)

Lewi Stone reviewed the brand site on 2026-05-07 and identified three real gaps: (1) the empirical case was missing — no demonstration of the gate working on a real, public benchmark; (2) the documentation promised evidence and delivered analogy; (3) the framing conflated AI systems broadly with retrieval and ranking systems specifically. This release closes all three.

bonferroni() helper in falsify_eval.stats — the PREPRINT abstract has promised Bonferroni-corrected paired tests since v0.1.0 but the public library did not ship the helper. It does now. Returns family-wise adjusted p-values, per-test α, and a per-test reject decision.
tests/test_stats_vs_scipy.py — 11 cross-check tests that reconcile our pure-numpy bootstrap_ci, paired_permutation_p, cohens_d_paired, and bonferroni against scipy on identical fixed-seed inputs. Closes Mayank attack-surface #4 ahead of his next round.
tests/test_property_based.py — 4 property-based tests via hypothesis: determinism under same seed, oracle always passes, constant cheater always fails Δ_D, query-order permutation invariance. Each test runs ~15 randomly generated benches per property.
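
For readers unfamiliar with the correction, a textbook Bonferroni adjustment returning the three quantities named above might look like this. The function name, signature, and return keys are hypothetical; they are not the API of falsify_eval.stats.bonferroni().

```python
def bonferroni_sketch(p_values, alpha=0.05):
    """Textbook Bonferroni: scale each p-value by the number of tests m."""
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    per_test_alpha = alpha / m
    return {
        # Family-wise adjusted p-values, capped at 1.
        "adjusted_p": [min(1.0, p * m) for p in p_values],
        # Threshold each raw p-value is compared against.
        "per_test_alpha": per_test_alpha,
        # Per-test decision on the raw p-values.
        "reject": [p <= per_test_alpha for p in p_values],
    }


# Three made-up p-values; only the first survives the correction at alpha = 0.05.
out = bonferroni_sketch([0.01, 0.04, 0.5], alpha=0.05)
```

With three tests the per-test threshold is 0.05 / 3, so a raw p-value of 0.04 is no longer significant.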

Changed — copy + scope honesty

EXPLAINER_simple.html — title, og tags, and three body sections rewritten from “AI systems” to “search and ranking systems”. Added explicit scope-honesty callout block at the top: tests retrieval-and-ranking, does NOT test generative LLM outputs. Both case-study links inline.
PREPRINT.md abstract — struck the cryptographic record framing (corrected to integrity-check record (SHA-256 + git commit) per v0.1.5 calibration discipline). Added explicit shipped-vs-planned column for the five-part harness so a reader knows exactly what is in the public library vs what is methodology spec only. Replaced the generalises to LLM behavioural eval pipelines claim with a sober candidate research direction phrasing. Added a paragraph documenting the empirical CS01 result and the metric-sensitivity finding.
README.md — links to CS02 alongside CS01, status section updated.

Added — case study CS02 (SciFact triangulation)

case_studies/cs02_scifact/ — second BEIR slice, 300 queries × 5,183 docs, sparse relevance (~1.1 docs/query). Confirms the gate works AND triangulates the CS01 metric-sensitivity finding: on sparse-relevance benchmarks both metrics give clean separation, on dense-relevance only the single-gold metric does. Joint CS01+CS02 picture provides empirical foundation across two relevance regimes.

Tests

58 passing on a fresh clone (was 43 in v0.1.5.2): smoke 8 + validation 9 + Mayank battery 26 + scipy cross-check 11 + property-based 4. All run in <3 seconds.

[0.1.5.2] — 2026-05-06

Added — progress=True flag (AIKosh 5-hour incident)

Mayank reported the gate had been running 5 hours under AIKosh’s harness with no visible progress. Profiling confirmed the gate itself is fast (N=5,000 × pool=100k × n_trials=50 finishes in <2s with a cheap metric). The 5-hour runtime is fully explained by an LLM-judge metric at ~200 ms / call: N * (1 + 4 * n_trials) calls = ~100k for N=500, n_trials=50, which at 200 ms each is ~5.6 hours.

The library can’t speed up a slow user metric, but it can stop running silently. v0.1.5.2 adds:

four_null_gate(..., progress=True) — prints per-stage timing to stderr with the expected number of metric_fn calls, so the user can tell whether the run is making progress, see which stage is the bottleneck, and decide whether to lower n_trials or kill the run.
result["stage_seconds"] — populated when progress=True. Lets downstream tooling collect timing without reparsing stderr.
README “Why is my run taking so long?” troubleshooting section with the exact N * (1 + 4 * n_trials) formula.
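
The call-count formula can be checked with a few lines of arithmetic. The helper names are hypothetical; the formula and the numbers (N=500, n_trials=50, 200 ms per call) are the ones quoted above.

```python
def expected_metric_calls(n_queries: int, n_trials: int, n_nulls: int = 4) -> int:
    """One real pass plus n_trials passes for each of the four nulls."""
    return n_queries * (1 + n_nulls * n_trials)


def expected_hours(n_queries: int, n_trials: int, seconds_per_call: float) -> float:
    """Wall-clock estimate when the metric dominates the runtime."""
    return expected_metric_calls(n_queries, n_trials) * seconds_per_call / 3600


calls = expected_metric_calls(500, 50)   # 500 * (1 + 4 * 50) = 100500
hours = expected_hours(500, 50, 0.2)     # 20100 s, roughly 5.6 hours
```

Halving n_trials roughly halves the estimate, which is the lever the troubleshooting section points at.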

[0.1.5.1] — 2026-05-06

Fixed — same defect class as Mayank #1, third null

null_a_permuted was the last null still passing the label list directly to np.random.default_rng().permutation(). For tuple labels, numpy silently converts list-of-tuples to a 2D array; for frozen dataclass labels without order=True, the prerequisite sorted(set(...)) raised TypeError. Both cases crashed the whole gate. Fix: same index-based permutation + (type(x).__name__, repr(x)) sort key already used in null_b/null_d.
Two new regression tests (test_d1b_*, test_d1c_*) cover tuple and frozen-dataclass labels end-to-end (oracle passes, constant cheater fails).
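
The index-based pattern can be sketched without numpy. The sort key is the one quoted above; everything else (the DocId dataclass, the function names, and the use of the standard-library random module in place of numpy) is an assumption made to keep the example dependency-free.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DocId:
    """Hypothetical frozen label type without order=True."""
    shard: int
    name: str


def stable_key(label):
    """Total order across mixed label types: (type name, repr)."""
    return (type(label).__name__, repr(label))


def permute_labels(labels, seed):
    """Shuffle positions, then look the original objects up by index."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    return [labels[i] for i in indices]


labels = [DocId(1, "b"), DocId(0, "a"), ("tuple", 3), "plain"]

# sorted(set(labels)) alone would raise TypeError; the key makes it well defined.
canonical = sorted(set(labels), key=stable_key)
shuffled = permute_labels(canonical, seed=2026)
```

Because only integer positions are shuffled, the label objects themselves are never coerced or compared, whatever their type.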

[0.1.5] — 2026-05-06

Fixed — Mayank Singh adversarial battery (14 defects, headline #1 catastrophic)

Credit: Mayank Singh / Indian AI Lab ran a 47-test stress battery against v0.1.4 and surfaced 14 real defects. Every fix below is paired with a regression test in tests/test_mayank_battery.py.

CRITICAL — str() cast catastrophe (Defect #1): null_b_uniform and null_d_marginal_matched wrapped each random gold draw in str(label). For any non-string label type (int, float, np.int64, tuple, dataclass) the comparator inside the user-supplied metric never matched, the null mean collapsed toward zero, Δ inflated to ≈ real_mean, and the gate’s central guarantee was silently void. Constant-most-frequent predictors PASSED the gate for any non-string label set. Fix: type-preserving index-based sampling (sample indices into the sorted label list, then look up the original label object). Verified across str, int, np.int64, float.
Null C silently used the gold-label set as the pool (Defect #2): v0.1.4 defaulted item_pool=None to “use the gold set”. On a real corpus this makes Null C ~|gold| / |pool| ≈ 1000× weaker than honest. v0.1.5 raises ValueError when item_pool is omitted; the caller must pass the actual chunk-id pool.
k > len(item_pool) raised raw numpy error (Defect #3): now raises a contextual ValueError with the offending sizes.
“Cryptographic” overselling (Defect #4): the lock primitive is SHA-256 + git-commit binding, an integrity check that catches accidental drift, not a tamper-proof seal against an adversary with write access to the artifacts and the lock. README and lock.py docstring corrected; explicit threat-model paragraph added.
DEFAULT_TRACKED extension list (Defect #5): intentionally excludes .py, .md, .csv, .yaml because git already tracks them and the git-commit binding covers them — but v0.1.4 didn’t say so. Docstring now documents the choice and shows the opt-in pattern (tracked_extensions=DEFAULT_TRACKED | {".py", ".md"}).
Empty inputs (Defect #6): clean ValueError instead of RuntimeWarning + NaN.
Version drift (Defect #7): __init__.py and pyproject.toml now sync-tested.
Single-class bench (Defect #8): Null A and Null D collapse to identical distributions; v0.1.5 emits a warning so the caller knows ΔA and ΔD are one test, not two.
Sparse marginal (Defect #9): when N < 2·|pool|, Null D’s marginal estimator degenerates toward Null B; v0.1.5 emits a warning.
Order-stable label set across runs (Defect #10): sort key is (type(x).__name__, repr(x)) — total order even with mixed types.
k validation (Defect #11): must be a positive integer; floats / zero / negative / strings / None all rejected up front.
tau validation (Defect #12): must be in [0, 1]; values outside the interval rejected up front.
Gold not in pool (Defect #13): previously produced silent all-zero output because Null C could never sample the gold. Now raises with a preview of the offending labels.
Length mismatch (Defect #14): previously truncated to the shortest of the three lists. Now raises with all three lengths in the error message.
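
Defect #1 is easy to reproduce in miniature. The sketch below is not the package's code: the function names are hypothetical, a simple exact-match hit rate stands in for the user-supplied metric, and the standard-library random module stands in for numpy.

```python
import random


def hit_rate(predictions, golds):
    """Exact-match metric: fraction of positions where prediction equals gold."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def null_b_str_cast(label_set, n, rng):
    """Buggy pattern: each sampled label is coerced to str."""
    return [str(rng.choice(label_set)) for _ in range(n)]


def null_b_index_based(label_set, n, rng):
    """Fixed pattern: sample an index, return the original label object."""
    return [label_set[rng.randrange(len(label_set))] for _ in range(n)]


label_set = [1, 2, 3]                 # integer labels
constant_predictions = [1] * 300      # constant-most-frequent cheater

# Under the buggy null, 1 == "1" is never true, so the null mean collapses to 0.
buggy = hit_rate(constant_predictions, null_b_str_cast(label_set, 300, random.Random(0)))
# Under the fixed null, the cheater matches about a third of the uniform draws.
fixed = hit_rate(constant_predictions, null_b_index_based(label_set, 300, random.Random(0)))
```

A null mean of exactly zero makes Δ equal to the cheater's real score, which is how a constant predictor could pass the gate before the fix.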

Added

tests/test_mayank_battery.py — 24 regression tests covering every defect above, parametrised across label types where relevant.
four_null_gate result now includes a warnings: list[str] field for the single-class and sparse-marginal flags.

[0.1.4] — 2026-05-04

Added — terminal UX overhaul (Claude-Code-style hints)

falsify-eval doctor — end-to-end install verification. Reports python + numpy + falsify-eval versions, runs the gate against an embedded demo bench, prints next-step commands. The right first command for any new user.
falsify-eval quickstart [DIR] — writes a sample bench.jsonl and pool.txt and prints the exact grade command to run against them. Zero-friction first-run.
falsify-eval grade --demo — grade an embedded 50-query synthetic bench with no input files needed. Useful for CI smoke-tests and “does this even work” checks.
ANSI-coloured output (auto-disabled when not a TTY or when NO_COLOR env var is set, so CI logs stay clean).
Post-grade “what’s next?” hint footer suggesting the next command, conditional on PASS vs FAIL. On FAIL, diagnoses which null failed and suggests the most likely cause.
Per-error contextual hints. INPUT ERROR: k=99 > len(item_pool)=8 is followed by a hint: line suggesting --pool or smaller K in --metric. The “gold not in pool” error suggests label-set drift between train/eval.
--quiet flag on grade to suppress hints (for piping/JSON consumers).
Examples in every subcommand’s --help output.

Added — explicit compatibility statement

README now lists every environment falsify-eval is known to install in with a one-line install command: local, Colab, Kaggle/Sagemaker, GitHub Actions, Docker, AWS Lambda, air-gapped. The library is pure Python + numpy with no native extensions, so the audit surface is tiny and the deployment surface is large.

Added — LLM-RAG validation worked example

examples/llm_rag_validation.py — wraps a Claude-Haiku call as a retriever and runs the four-null gate on its output. Includes a random-baseline negative control and a keyword-fallback positive control so the gate’s three regimes (FAIL / modest PASS / strong PASS) are visible. To adapt to GPT-4, Llama, Mistral, Gemini, or any other LLM: swap the body of the retriever function. Everything else is identical.
Documentation explicitly states that falsify-eval grades the retrieval side of any RAG pipeline, regardless of LLM vendor or stack (BM25, FAISS, Pinecone, Weaviate, Vespa, etc.).

Honest non-claim

We do not claim “tested against every known AI model.” That requires hundreds of dollars in API costs and a multi-day study. We ship the worked Claude example as a pattern; running it against other models is one function-body swap and we encourage external validators to publish the result of doing so.

[0.1.3] — 2026-05-04

Fixed

k > len(item_pool) now raises a clear ValueError instead of crashing with a raw numpy “Cannot take a larger sample than population” error from inside Null C. Caught by the public stress-test ladder (Tier 5a).
Gold labels not present in item_pool now raise a clear ValueError instead of silently producing all-zero output that read as “everything fails the gate.” The error names the missing labels (first 5 + count). Caught by the public stress-test ladder (Tier 5f).

Added

falsify-eval CLI. Non-Python users can now drive the four-null gate from JSONL files: falsify-eval grade --input bench.jsonl --metric ndcg@5. Built-in metrics: ndcg, recall, mrr at any K. Subcommands lock and verify wrap lock_state / verify_state.
MCP server (python -m falsify_eval.mcp_server). Exposes grade_retrieval as a tool any MCP client (Claude Code, Claude Desktop, custom enterprise apps) can invoke directly. Stdio JSON-RPC; no extra dependencies beyond the base library.
Result dict now includes a warnings list. Soft signals that the gate ran successfully but the interpretation needs care:
- single-class benchmark (Null A and Null D mathematically collapse)
- sparse marginal (Null D’s marginal estimator is noisy when N < 2·|pool|)
Comprehensive input-validation tests in tests/test_validation.py.
Public stress-test ladder (STRESS_TEST_LADDER.md) with five tiers from smoke test to mathematical-edge cases, plus runnable scripts in tests/stress/.

Scope statement (added in response to user request)

falsify-eval is a methodology for retrieval / ranking evaluation. Generalising the four-null gate to LLM text-generation, classification, RAG, or recommender-system evaluation requires new null distributions designed for those domains. That is v0.3+ work; we will not claim universal coverage before doing it. Standard II of the house.

[0.1.2] — 2026-05-04

Fixed

README hero install command was broken. Removed pip install falsify-eval (package is not yet published to PyPI; would have produced “No matching distribution found”). Replaced with the source-install path that actually works today, plus a one-line note that PyPI is planned for v0.2.
README hero demo command was broken (python -m falsify_eval.examples.synthetic_demo failed with ModuleNotFoundError: No module named 'falsify_eval.examples' because the examples/ directory is at the repo root, not inside the package). Replaced with python3 examples/synthetic_demo.py.
README “Quick demo” git-clone URL contained a literal <your-handle> placeholder that any real user would have copy-pasted verbatim and seen fail. Replaced with the real spalsh-spec/falsify-eval URL.
pyproject.toml Homepage and Issues URLs pointed at non-existent github.com/sparshsharma/falsify-eval (HTTP 404). Corrected to github.com/spalsh-spec/falsify-eval and added explicit Repository URL for completeness.
Suspended the cash bug-bounty programme pending further internal validation (removed from README, CONTRIBUTING, SECURITY, issue template, and config.yml; preserved as academic record in PREPRINT §10 with status note).

Reported by

External user (Akosh, India, 2026-05-04). All four blockers were reproducible on a fresh clone with Python 3.14 + numpy 2.4.4.

[0.1.1] — 2026-05-01

Fixed

Export bootstrap_diff_ci and power_n_required from top-level package (from falsify_eval import bootstrap_diff_ci was previously broken despite being documented in the README; caught by the new CI import-smoke job).

Added

GitHub Actions CI matrix (Python 3.10/3.11/3.12 × ubuntu/macos)
Issue and PR templates
CONTRIBUTING.md, CODE_OF_CONDUCT.md, CHANGELOG.md
PREPRINT.md and SUPPLEMENTARY.md shipped in-repo
README ## Preprint section with anchor

[0.1.0] — 2026-05-01

Added

Initial public release.
Four-null falsification gate: four_null_gate, null_a_permuted, null_b_uniform, null_c_random_retrieval, null_d_marginal_matched.
Cryptographic state lock: lock_state, verify_state.
Statistical reporting: bootstrap_ci, bootstrap_diff_ci, paired_permutation_p, cohens_d_paired, power_n_required.
50-query synthetic demo (examples/synthetic_demo.py) covering oracle, constant predictor, and plausible mock engine.
8 unit tests covering gate correctness, statistical primitives, and lock round-trip.
