All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
The methodology library and its falsify-eval PyPI distribution go public
under the Bhardwaj & Sons brand. Apache 2.0. v0.2.0 is the first version
where pip install falsify-eval works without a git URL.
falsify-eval via the OIDC trusted-publisher
workflow on v* tag push. The version-sync guard cross-checks tag against
__init__.py and pyproject.toml before the upload step runs (per the
v0.1.6.11 fix).
identifiers block in CITATION.cff are populated from
the GitHub Release → Zenodo deposit webhook.
spalsh-spec/falsify-eval. GitHub Pages
enabled at https://spalsh-spec.github.io/falsify-eval/, serving the
interactive sliders playground (play.html) and the long-form HTML
explainer.
Four small fixes folded into this release. No behavioural change to the gate; no test-count change.
url: pointed at the wrong GitHub handle
(sparshsharma); corrected to the canonical spalsh-spec. Citation-graph
crawlers and reviewers use this field; wrong handle = broken linkage.
case_studies/cs03_aikosh_rag/ unchanged.
github.com/bhardwaj-and-sons/vak-kanaja to the working
github.com/spalsh-spec/vak-kanaja. Forward-compatible: when the
bhardwaj-and-sons org is created and the repo is transferred, GitHub
auto-redirects the spalsh-spec URL forever.
import falsify_eval
before the package was installed. Caught by the workflow itself on its
first run (the v0.1.6.10 tag triggered the workflow, which failed at the
version-check step before reaching upload — exactly as a guard should
fail when something is wrong, but this time the something-wrong was the
guard). Switched to reading __version__ and pyproject.toml’s version
directly via grep/sed, which (a) doesn’t require the package to be
importable, and (b) cross-checks both source files against each other
and against the tag — three-way agreement is now the gate, matching what
tests/test_mayank_battery.py::test_d7_version_sync does in pytest.
Three coordinated infrastructure additions that get the package from “clone-from-git only” to “ready for arXiv + PyPI.” Nothing in the gate’s behaviour changes; this is plumbing.
.github/workflows/publish.yml — publishes to PyPI on every v*
tag push using OIDC trusted publishing (no API tokens stored as
repository secrets, no key rotation, no exfiltration risk if the
workflow is ever compromised). Workflow includes a tag-vs-version-sync
guard so a typo’d tag refuses to publish. One-time PyPI-side setup
documented in docs/PYPI_PUBLISHING.md. The package name falsify-eval
is verified available on PyPI at submission time.
tools/build_arxiv.sh — converts PREPRINT.md to an
arXiv-submittable LaTeX bundle via pandoc (arxiv/preprint.tex +
arxiv/falsify-eval-arxiv-submission.tar.gz). Optional local PDF
preview if xelatex/pdflatex is installed. Categorisation, abstract
guidance, endorsement notes, cover letter draft, and post-submission
checklist all in docs/ARXIV_SUBMISSION.md.
[tool.mutmut] config in pyproject.toml + docs/MUTATION_TESTING.md
documenting the deferred status: mutmut 3.x has a macOS regression
(/.VolumeIcon.icns filesystem-root copy attempt) and mutmut 2.x has
a Python 3.14 incompatibility (cannot pickle 'itertools.count').
Neither is a defect in this package; both resolve when upstream ships
3.14 support OR when CI adds a 3.12-pinned mutation-test job. Tracked
for v0.2 with the exact configuration committed in pyproject.toml.
[project.optional-dependencies] dev bucket added alongside test,
pinning mutmut, build, and twine so a dev-environment install is
one command: pip install -e ".[dev]".
arxiv/, .mutmut-cache/, and mutants/ added to .gitignore —
these are regenerable artefacts that should not be committed.
python3 -m build produces a 12-file,
~100KB wheel that passes twine check. Installed in a fresh venv,
falsify-eval doctor exits 0 with the same numbers as the editable
install (real_mean=0.8557, all four nulls pass, GATE PASS).
python3 -W error::SyntaxWarning -m pytest tests/.
case_studies/cs03_aikosh_rag/). First slot
for a real production retriever inside an organisation, prepared after
Jasmeet Singh (AIKosh) volunteered to wire the four-null gate into
AIKosh’s internal RAG benchmark. The slot includes:
CS03_REPORT.md — pre-registered structure with TBD sections marked
explicitly so no fabricated numbers can sit there.
run_case_study.py: refuses to run until data/queries.jsonl,
data/pool.txt, and data/retriever.py are provided; exits 2 with a
clear input-list message rather than silently producing fake output.
Tested platforms log in README. External-verification entries by
testers who are not the package author, dated, with version pinned.
Initial entries:
doctor + quickstart + grade all clean;
confirmed the cp1252 defect closed in 0.1.6.4 stays closed.
Two new property tests in tests/test_property_gate.py, plus PREPRINT §5.9
documenting the precise statement they support:
test_equivariance_under_order_preserving_bijection (Hypothesis,
~80 random benches × ~80 fuzzed prefixes). Under any order-preserving
label-set bijection σ applied jointly to retrieved, gold, and
item_pool, the gate’s per-trial real_mean, all four null_means,
all deltas, and the verdict (gate_passes) are identical to the
un-bijected run, to within ~1e-12. This is the property a reviewer
asking “does the harness depend on cosmetic label encoding?” should
be pointed at — the answer is no, by certificate.
test_null_c_equivariant_under_arbitrary_bijection (Hypothesis,
~80 random benches × ~80 fuzzed permutations of the pool). Under any
bijection σ — order-preserving or not — real_mean and Null C’s
per-trial mean are exactly equivariant. Null C samples from
item_pool in input order (no sort), so the seed-driven sample
sequence is bijection-stable.
real_mean are exactly equivariant
under any σ.
σ ∘ mapping ≠ mapping_σ ∘ σ
when sort-order changes) but remain bijection-invariant in expectation.
label_order_seed parameter
that deliberately randomises the canonical sort, breaking any latent
dependency on adversarial label ordering. Tracked, not implemented in
this release.
Total suite now 91/91; runs in ~10 s under
python3 -W error::SyntaxWarning -m pytest tests/.
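A minimal sketch of the real_mean half of the bijection property, using a hand-written recall metric rather than four_null_gate itself. The names recall_at_k, mean_recall, and relabel are illustrative and not part of the package; the bench is synthetic.

```python
import random

def recall_at_k(retrieved, gold, k):
    """Fraction of gold labels that appear in the top-k retrieved labels."""
    return len(set(retrieved[:k]) & set(gold)) / len(set(gold))

def mean_recall(bench, k):
    """Mean of the per-query metric over (retrieved, gold) pairs."""
    return sum(recall_at_k(r, g, k) for r, g in bench) / len(bench)

def relabel(bench, sigma):
    """Apply the bijection sigma jointly to retrieved and gold labels."""
    return [([sigma[x] for x in r], [sigma[x] for x in g]) for r, g in bench]

rng = random.Random(2026)
labels = [f"doc{i}" for i in range(20)]
bench = [(rng.sample(labels, 5), rng.sample(labels, 2)) for _ in range(30)]

# An arbitrary bijection onto fresh labels (deliberately not order-preserving).
targets = [f"x{i}" for i in range(20)]
rng.shuffle(targets)
sigma = dict(zip(labels, targets))

before = mean_recall(bench, k=5)
after = mean_recall(relabel(bench, sigma), k=5)
print(before == after)  # True
```

The metric only tests set membership, so relabelling both sides with the same bijection leaves every per-query score unchanged. The nulls are a different matter wherever a canonical sort enters, which is why the statement above is split by null.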
CI on 0.1.6.6 failed across all matrix cells because the new
tests/test_property_gate.py imports hypothesis, which was used only as
a transitive dev install on my local box (a .hypothesis/ cache was in
the tree but no dependency had been declared). Two-line fix:
pyproject.toml: add hypothesis>=6.0 to the test optional-deps
bucket alongside pytest>=7.0.
.github/workflows/ci.yml: install with pip install -e ".[test]"
instead of pip install -e . + ad-hoc pip install pytest. Now the
test extras govern what CI installs, so adding a dev dep in
pyproject.toml automatically propagates to CI without a workflow edit.
A package whose value proposition is rigor about retrieval evaluation has
to be more rigorous than what it asks of users. tests/test_property_gate.py
adds 13 universally-true properties of four_null_gate, each exercised
against ~80 random benches generated by Hypothesis (≈1,040 example runs in
~6 s). The suite catches the classes of bug that line-coverage hides:
Algebraic invariants (per-result, must hold by construction):
deltas[X] == real_mean - null_means[X] for every X ∈ {A,B,C,D}
passes[X] == (deltas[X] >= tau)
gate_passes == all(passes.values())
ndcg_at_k)
{A,B,C,D})
Determinism:
real_mean, null_means, deltas, passes, gate_passes.
This is the property reviewers need to trust the headline numbers.
Metric properties (don’t even need the gate):
Gate semantics:
real_mean == 1.0 and the gate
passes at τ=0.05 on any multi-class bench. The “if this fails, the
methodology is wrong” property.
s → ('lbl', s)
produces numerically-identical results. Closes Mayank-defect #1
(numpy auto-coerced tuple labels into 2D arrays, silently disabling
the gate for any non-string label type).
Validation guards:
tau ∉ [0, 1] and negative seeds raise ValueError with a useful
message — exercised across the full bad-input space, not just point
samples.
The suite runs in ~6 s on a laptop and is part of the default pytest run.
.hypothesis/ was already in the tree; no new top-level dependency added
beyond hypothesis (already a dev dep).
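The three algebraic invariants listed above can be checked on any result dictionary. The sketch below assumes the result exposes real_mean, null_means, deltas, passes, and gate_passes keyed by "A" to "D"; the helper name check_invariants and the numbers are made up for illustration and are not measured output.

```python
import math

def check_invariants(result, tau):
    """Assert the per-result algebraic invariants; return True if all hold."""
    for name in ("A", "B", "C", "D"):
        expected = result["real_mean"] - result["null_means"][name]
        assert math.isclose(result["deltas"][name], expected, abs_tol=1e-12)
        assert result["passes"][name] == (result["deltas"][name] >= tau)
    assert result["gate_passes"] == all(result["passes"].values())
    return True

# Hand-built example (illustrative values only).
tau = 0.05
real_mean = 0.85
null_means = {"A": 0.20, "B": 0.22, "C": 0.05, "D": 0.83}
deltas = {name: real_mean - value for name, value in null_means.items()}
passes = {name: delta >= tau for name, delta in deltas.items()}
result = {
    "real_mean": real_mean,
    "null_means": null_means,
    "deltas": deltas,
    "passes": passes,
    "gate_passes": all(passes.values()),
}

print(check_invariants(result, tau))  # True
print(result["gate_passes"])          # False: the margin over Null D is below tau
```

In a property-based suite the same checks would run against generated benches instead of one hand-built dictionary.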
--input or --pool
points to a non-existent file on POSIX, we now check whether the path
matches the shell-eaten form of a Windows-style path (e.g.
my-benchbench.jsonl ← typed my-bench\bench.jsonl, where zsh/bash
treated \ as an escape rather than a path separator). If we can
unambiguously decode the intent — i.e. exactly one prefix in cwd is a
directory whose name is a prefix of the bad path AND contains the
remainder as a real file — we surface a precise “did you mean
my-bench/bench.jsonl?” hint instead of the bare FileNotFoundError.
Reported by Parth 2026-05-08 after copy-pasting Jasmeet’s Windows
tutorial command into zsh.
tests/test_shell_mangled_paths.py covering both
the recovery suggestion and end-to-end grade error formatting.
SyntaxWarning: invalid escape sequence in the new helper’s
docstring by switching to a raw docstring (r"""). Caught by running
python3 -W error::SyntaxWarning -m pytest.
Windows console crash on grade (reported by Jasmeet, Win10/PowerShell, Py 3.14.3).
The pretty-printer emitted Δ, τ, ✓, ✗, ⚠, ─ which the legacy Windows console
(cp1252 codepage) cannot encode, raising UnicodeEncodeError: 'charmap' codec
can't encode character 'Δ' mid-print. Two-layer fix in falsify_eval/cli.py:
UTF-8 hardening at CLI entry. main() now calls _init_io() which
reconfigures sys.stdout/sys.stderr to UTF-8 with errors='replace'
before anything is printed. This alone makes the original crash impossible
on every modern Python (≥3.7) and on every host OS, since the codepage of
the underlying console no longer governs the encoding used by the
interpreter.
Auto-degrade to ASCII when the stream still can’t encode. If stdout’s
post-reconfigure encoding still rejects our glyphs (e.g. piping into a
non-UTF-8 log processor), the printer transparently falls back to ASCII
equivalents: Δ→d, τ→tau, ✓→[ok], ✗→[x], ⚠→!, ─→-.
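The second layer can be sketched as a small translation step. This is an illustration under assumptions, not the code in falsify_eval/cli.py: the names can_encode and render are hypothetical, and only the glyph table comes from the entry above.

```python
# Glyph-to-ASCII table as listed in the entry above.
ASCII_FALLBACK = {"Δ": "d", "τ": "tau", "✓": "[ok]", "✗": "[x]", "⚠": "!", "─": "-"}

def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding` without error."""
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        return False

def render(text: str, encoding: str, force_ascii: bool = False) -> str:
    """Swap the glyphs for ASCII when forced or when the stream cannot encode them."""
    if force_ascii or not can_encode(text, encoding):
        for glyph, plain in ASCII_FALLBACK.items():
            text = text.replace(glyph, plain)
    return text

line = "Δ_A = 0.31, τ = 0.05 ✓"
print(render(line, "cp1252"))  # d_A = 0.31, tau = 0.05 [ok]
print(render(line, "utf-8") == line)  # True
```

Passing force_ascii=True mirrors what a flag or environment switch would do: the translation is applied even when the stream could encode the glyphs.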
--ascii flag and FALSIFY_ASCII=1 environment variable to force
ASCII-only output on demand (useful for CI logs that strip UTF-8).
doctor now reports stdout encoding and ascii_mode so install bugs
related to console encoding are visible from a single command.
tests/test_windows_encoding.py that simulates a cp1252
console and proves the old code path crashes, the new path doesn’t, and
--ascii produces a fully cp1252-decodable output stream.
Path.open() and Path.read_text() / write_text() calls in cli.py
now pass encoding='utf-8' explicitly. This was a latent companion bug:
on the same Windows host that crashed Jasmeet’s print, reading a UTF-8
bench.jsonl could silently mojibake-corrupt rows depending on user locale.
This release is non-functional: it adds a “Companion engine” section to the
README that establishes public priority on the engine name (Vāk-Kaṇaja),
its two named contributions (Pramāṇa-aware query routing; Anupalabdhi
non-perception confidence floor), and the calibration discipline applied
to it (the negative result on the novel rerankers at bench expansion,
documented as a contribution rather than buried). The full vak-kanaja
code release follows the morning launch sequence in a separate repo
(bhardwaj-and-sons/vak-kanaja, public release imminent).
This is the “establish priority without releasing implementation” pattern that mathematicians, physicists, and patent-filers have used for 200 years. Anyone scooping the methodology now has the priority graph to contend with.
Mayank ran a 25-probe round-3 review against v0.1.6.1 and reported 23/25 PASS.
The two non-PASS items both traced to flaws in his own test fixtures, apart from
one polish item we honour here: negative seed values fell through to
numpy.random.default_rng(-1), which raises an unhelpful internal error.
_validate_inputs now rejects non-int and negative seeds up-front with a
contextual ValueError: seed must be a non-negative integer, got <repr>.
test_d15_negative_or_non_int_seed_raises_clean_error
parametrised across 5 bad seeds (-1, -100, 0.5, “2026”, None).
Credit: Mayank Singh, third clean round in 48 hours.
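An up-front seed check of this kind might look like the sketch below. The function name validate_seed is hypothetical and this is not the body of _validate_inputs; only the error wording and the five bad seeds are taken from the entry above.

```python
def validate_seed(seed):
    """Return the seed unchanged, or raise ValueError with a contextual message."""
    # bool is a subclass of int; rejecting it is a choice made for this sketch.
    if isinstance(seed, bool) or not isinstance(seed, int) or seed < 0:
        raise ValueError(f"seed must be a non-negative integer, got {seed!r}")
    return seed

for bad in (-1, -100, 0.5, "2026", None):
    try:
        validate_seed(bad)
    except ValueError as exc:
        print(exc)
```

The type test runs before the sign comparison, so a string or None produces the same clean message instead of a TypeError from `seed < 0`. A stricter or looser policy, for example accepting numpy integer types, would need a different type test.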
falsify-eval grade --input - now reads JSONL from stdin (UNIX convention).
v0.1.5.1 wrapped args.input in Path() before opening, which turned -
into a literal filename and crashed with FileNotFoundError: '-'. v0.1.6.1
threads - through load_jsonl() directly and dispatches to sys.stdin.
<stdin> (e.g. <stdin>:2: invalid JSON)
instead of leaking a misleading filename.
--input help text now documents the - sentinel.
tests/test_cli_stdin.py exercise the fix via
subprocess against the actual CLI entry point: stdin streaming success,
empty-stdin clean failure (the v0.1.5.1 regression must not return),
malformed-stdin error labelling, and file-input no-regression.
Credit: Mayank Singh, who re-ran the full battery on v0.1.5.1 against the six round-2 surfaces and surfaced this one cleanly with a one-line repro.
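The - sentinel handling can be sketched as follows. This is an illustration, not the package's load_jsonl(): the name load_jsonl_sketch, the injectable stdin argument, and the row keys are assumptions. Empty-input handling is left out.

```python
import io
import json
import sys

def load_jsonl_sketch(source, stdin=None):
    """Read JSONL rows from a file path, or from stdin when source is '-'."""
    if source == "-":
        # Dispatch on the raw string before any Path() wrapping can hide it.
        stream = stdin if stdin is not None else sys.stdin
        label, should_close = "<stdin>", False
    else:
        stream = open(source, encoding="utf-8")
        label, should_close = str(source), True
    try:
        rows = []
        for lineno, line in enumerate(stream, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Label the error with the real source, not a made-up filename.
                raise ValueError(f"{label}:{lineno}: invalid JSON") from exc
        return rows
    finally:
        if should_close:
            stream.close()

good = io.StringIO('{"q": 1}\n{"q": 2}\n')
print(load_jsonl_sketch("-", stdin=good))  # [{'q': 1}, {'q': 2}]
```

The stdin parameter exists only so the sketch can be exercised without a real pipe. From a shell the equivalent call is `falsify-eval grade --input -` with JSONL piped in.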
Mayank’s round-2 also flagged the PREPRINT abstract still naming features
not shipped in the public library. v0.1.6 (shipped earlier today) already
addressed this: the abstract was rewritten to clearly separate shipped
vs methodology-spec items, and bonferroni() was added to the public
stats API. Mayank tested v0.1.5.1, which predates that fix.
Lewi Stone reviewed the brand site on 2026-05-07 and identified three real gaps: (1) the empirical case was missing — no demonstration of the gate working on a real, public benchmark; (2) the documentation promised evidence and delivered analogy; (3) the framing conflated AI systems broadly with retrieval and ranking systems specifically. This release closes all three.
bonferroni() helper in falsify_eval.stats — the PREPRINT abstract
has promised Bonferroni-corrected paired tests since v0.1.0 but the
public library did not ship the helper. It does now. Returns family-wise
adjusted p-values, per-test α, and a per-test reject decision.
tests/test_stats_vs_scipy.py: 11 cross-check tests that reconcile
our pure-numpy bootstrap_ci, paired_permutation_p, cohens_d_paired,
and bonferroni against scipy on identical fixed-seed inputs. Closes
Mayank attack-surface #4 ahead of his next round.
tests/test_property_based.py: 4 property-based tests via
hypothesis: determinism under same seed, oracle always passes,
constant cheater always fails Δ_D, query-order permutation invariance.
Each test runs ~15 randomly generated benches per property.
EXPLAINER_simple.html: title, og tags, and three body sections rewritten
from “AI systems” to “search and ranking systems”. Added explicit
scope-honesty callout block at the top: tests retrieval-and-ranking,
does NOT test generative LLM outputs. Both case-study links inline.
PREPRINT.md abstract: struck the cryptographic record framing
(corrected to integrity-check record (SHA-256 + git commit) per v0.1.5
calibration discipline). Added explicit shipped-vs-planned column for the
five-part harness so a reader knows exactly what is in the public library
vs what is methodology spec only. Replaced the generalises to LLM
behavioural eval pipelines claim with a sober candidate research
direction phrasing. Added a paragraph documenting the empirical CS01
result and the metric-sensitivity finding.
README.md: links to CS02 alongside CS01, status section updated.
case_studies/cs02_scifact/: second BEIR slice, 300 queries × 5,183 docs,
sparse relevance (~1.1 docs/query). Confirms the gate works AND triangulates
the CS01 metric-sensitivity finding: on sparse-relevance benchmarks both
metrics give clean separation, on dense-relevance only the single-gold
metric does. Joint CS01+CS02 picture provides empirical foundation across
two relevance regimes.
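For readers unfamiliar with the correction behind the bonferroni() helper added in this release, here is a textbook sketch that produces the three outputs named above. The function name bonferroni_sketch, its signature, and the result keys are assumptions; the real helper in falsify_eval.stats may differ in all three.

```python
def bonferroni_sketch(p_values, alpha=0.05):
    """Textbook Bonferroni correction for a family of m tests."""
    m = len(p_values)
    per_test_alpha = alpha / m                      # each test is held to alpha / m
    adjusted = [min(1.0, p * m) for p in p_values]  # family-wise adjusted p, capped at 1
    reject = [p <= per_test_alpha for p in p_values]
    return {
        "adjusted_p": adjusted,
        "per_test_alpha": per_test_alpha,
        "reject": reject,
    }

out = bonferroni_sketch([0.001, 0.02, 0.04, 0.30], alpha=0.05)
print(out["per_test_alpha"])  # 0.0125
print(out["reject"])          # [True, False, False, False]
```

With four tests, a raw p-value of 0.02 would pass an uncorrected 0.05 threshold but not the corrected 0.0125 one.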
progress=True flag (AIKosh 5-hour incident)
Mayank reported the gate had been running 5 hours under AIKosh’s harness
with no visible progress. Profiling confirmed the gate itself is fast
(N=5,000 × pool=100k × n_trials=50 finishes in <2s with a cheap metric).
The 5-hour runtime is fully explained by an LLM-judge metric at ~200 ms /
call: N * (1 + 4 * n_trials) calls = ~100k for N=500, n_trials=50,
which at 200 ms each is ~5.6 hours.
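The arithmetic above can be reproduced in a few lines. The helper name expected_metric_calls is illustrative; the formula and the figures (N=500, n_trials=50, ~200 ms per call) are the ones quoted in this entry.

```python
def expected_metric_calls(n_queries: int, n_trials: int) -> int:
    """One real pass plus four nulls repeated n_trials times, per query."""
    return n_queries * (1 + 4 * n_trials)

calls = expected_metric_calls(n_queries=500, n_trials=50)
seconds = calls * 0.200  # assumed ~200 ms per LLM-judge call
hours = seconds / 3600

print(f"{calls} calls, about {hours:.1f} hours")  # 100500 calls, about 5.6 hours
```

Since the call count scales linearly with n_trials, halving n_trials roughly halves the wall-clock time when the metric dominates the cost.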
The library can’t speed up a slow user metric, but it can stop running silently. v0.1.5.2 adds:
four_null_gate(..., progress=True) — prints per-stage timing to stderr
with the expected number of metric_fn calls, so the user can tell
whether the run is making progress, see which stage is the bottleneck,
and decide whether to lower n_trials or kill the run.
result["stage_seconds"]: populated when progress=True. Lets
downstream tooling collect timing without reparsing stderr.
N * (1 + 4 * n_trials) formula.
null_a_permuted was the last null still passing the label list directly
to np.random.default_rng().permutation(). For tuple labels, numpy
silently converts list-of-tuples to a 2D array; for frozen dataclass labels
without order=True, the prerequisite sorted(set(...)) raised TypeError.
Both cases crashed the whole gate. Fix: same index-based permutation +
(type(x).__name__, repr(x)) sort key already used in null_b/null_d.
test_d1b_*, test_d1c_*) cover tuple and
frozen-dataclass labels end-to-end (oracle passes, constant cheater fails).
Credit: Mayank Singh / Indian AI Lab ran a 47-test stress battery against
v0.1.4 and surfaced 14 real defects. Every fix below is paired with a
regression test in tests/test_mayank_battery.py.
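Several label-type fixes in this changelog share one idea: shuffle or sample indices, then look up the original label objects, with a (type(x).__name__, repr(x)) key wherever a canonical order is needed. A minimal sketch follows, with the standard-library random module standing in for the numpy generator. The helper names are hypothetical and this is not the package's implementation.

```python
import random

def sort_key(label):
    """Total order over mixed label types: type name first, then repr."""
    return (type(label).__name__, repr(label))

def permute_labels(labels, seed):
    """Shuffle positions rather than label objects, so types are preserved."""
    rng = random.Random(seed)
    order = list(range(len(labels)))
    rng.shuffle(order)
    return [labels[i] for i in order]

def sample_labels(labels, n, seed):
    """Draw n labels uniformly from the canonically sorted distinct labels."""
    pool = sorted(set(labels), key=sort_key)
    rng = random.Random(seed)
    return [pool[rng.randrange(len(pool))] for _ in range(n)]

tuple_labels = [("lbl", "a"), ("lbl", "b"), ("lbl", "c")]
shuffled = permute_labels(tuple_labels, seed=0)
print(all(isinstance(x, tuple) for x in shuffled))  # True

mixed = [3, "a", (1, 2), 3]
drawn = sample_labels(mixed, n=5, seed=0)
print(all(x in mixed for x in drawn))  # True
```

Because only integer indices pass through the random generator, the label objects never undergo array coercion or str() casting, and the sort key avoids the TypeError that plain sorted() raises on mixed or unorderable types.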
str() cast catastrophe (Defect #1): null_b_uniform and
null_d_marginal_matched wrapped each random gold draw in str(label).
For any non-string label type (int, float, np.int64, tuple, dataclass)
the comparator inside the user-supplied metric never matched, the null mean
collapsed toward zero, Δ inflated to ≈ real_mean, and the gate’s central
guarantee was silently void. Constant-most-frequent predictors PASSED
the gate for any non-string label set. Fix: type-preserving index-based
sampling (sample indices into the sorted label list, then look up the
original label object). Verified across str, int, np.int64, float.
item_pool=None to “use the gold set”. On a real corpus this
makes Null C ~|gold| / |pool| ≈ 1000× weaker than honest. v0.1.5 raises
ValueError when item_pool is omitted; the caller must pass the actual
chunk-id pool.
k > len(item_pool) raised raw numpy error (Defect #3): now raises a
contextual ValueError with the offending sizes.
lock.py docstring corrected; explicit
threat-model paragraph added.
DEFAULT_TRACKED extension list (Defect #5): intentionally excludes
.py, .md, .csv, .yaml because git already tracks them and the
git-commit binding covers them — but v0.1.4 didn’t say so. Docstring now
documents the choice and shows the opt-in pattern
(tracked_extensions=DEFAULT_TRACKED | {".py", ".md"}).
ValueError instead of RuntimeWarning + NaN.
__init__.py and pyproject.toml now sync-tested.
(type(x).__name__, repr(x)): total order even with mixed types.
k validation (Defect #11): must be a positive integer; floats / zero /
negative / strings / None all rejected up front.
tau validation (Defect #12): must be in [0, 1]; values outside the
interval rejected up front.
tests/test_mayank_battery.py: 24 regression tests covering every defect
above, parametrised across label types where relevant.
four_null_gate result now includes a warnings: list[str] field for the
single-class and sparse-marginal flags.
falsify-eval doctor: end-to-end install verification. Reports python +
numpy + falsify-eval versions, runs the gate against an embedded demo
bench, prints next-step commands. The right first command for any new user.
falsify-eval quickstart [DIR]: writes a sample bench.jsonl and
pool.txt and prints the exact grade command to run against them.
Zero-friction first-run.
falsify-eval grade --demo: grade an embedded 50-query synthetic bench
with no input files needed. Useful for CI smoke-tests and “does this even
work” checks.
NO_COLOR
env var is set, so CI logs stay clean).
INPUT ERROR: k=99 > len(item_pool)=8 is
followed by a hint: line suggesting --pool or smaller K in --metric.
The “gold not in pool” error suggests label-set drift between train/eval.
--quiet flag on grade to suppress hints (for piping/JSON consumers).
--help output.
examples/llm_rag_validation.py: wraps a Claude-Haiku call as a
retriever and runs the four-null gate on its output. Includes a
random-baseline negative control and a keyword-fallback positive control
so the gate’s three regimes (FAIL / modest PASS / strong PASS) are
visible. To adapt to GPT-4, Llama, Mistral, Gemini, or any other LLM:
swap the body of the retriever function. Everything else is identical.
We do not claim “tested against every known AI model.” That requires hundreds of dollars in API costs and a multi-day study. We ship the worked Claude example as a pattern; running it against other models is one function-body swap and we encourage external validators to publish the result of doing so.
k > len(item_pool) now raises a clear ValueError instead of crashing
with a raw numpy “Cannot take a larger sample than population” error from
inside Null C. Caught by the public stress-test ladder (Tier 5a).
item_pool now raise a clear ValueError
instead of silently producing all-zero output that read as “everything fails
the gate.” The error names the missing labels (first 5 + count). Caught by
the public stress-test ladder (Tier 5f).
falsify-eval CLI. Non-Python users can now drive the four-null gate
from JSONL files: falsify-eval grade --input bench.jsonl --metric ndcg@5.
Built-in metrics: ndcg, recall, mrr at any K. Subcommands lock and
verify wrap lock_state / verify_state.
python -m falsify_eval.mcp_server). Exposes
grade_retrieval as a tool any MCP client (Claude Code, Claude Desktop,
custom enterprise apps) can invoke directly. Stdio JSON-RPC; no extra
dependencies beyond the base library.
warnings list. Soft signals that the gate
ran successfully but the interpretation needs care:
sparse marginal (Null D’s marginal estimator is noisy when N < 2·|pool|)
tests/test_validation.py.
STRESS_TEST_LADDER.md) with five tiers from
smoke test to mathematical-edge cases, plus runnable scripts in
tests/stress/.
pip install falsify-eval
(package is not yet published to PyPI; would have produced
“No matching distribution found”). Replaced with the source-install path
that actually works today, plus a one-line note that PyPI is planned for v0.2.
python -m falsify_eval.examples.synthetic_demo
failed with ModuleNotFoundError: No module named 'falsify_eval.examples'
because the examples/ directory is at the repo root, not inside the
package). Replaced with python3 examples/synthetic_demo.py.
<your-handle>
placeholder that any real user would have copy-pasted verbatim and seen
fail. Replaced with the real spalsh-spec/falsify-eval URL.
pyproject.toml Homepage and Issues URLs pointed at non-existent
github.com/sparshsharma/falsify-eval (HTTP 404). Corrected to
github.com/spalsh-spec/falsify-eval and added explicit Repository URL
for completeness.
bootstrap_diff_ci and power_n_required from top-level package
(from falsify_eval import bootstrap_diff_ci was previously broken despite
being documented in the README; caught by the new CI import-smoke job).
CONTRIBUTING.md, CODE_OF_CONDUCT.md, CHANGELOG.md
PREPRINT.md and SUPPLEMENTARY.md shipped in-repo
## Preprint section with anchor
four_null_gate, null_a_permuted,
null_b_uniform, null_c_random_retrieval, null_d_marginal_matched.
lock_state, verify_state.
bootstrap_ci, bootstrap_diff_ci,
paired_permutation_p, cohens_d_paired, power_n_required.
examples/synthetic_demo.py) covering oracle,
constant predictor, and plausible mock engine.