falsify-eval

## Some search engines pretend to be smart. They look like they understand your question.
They actually just return whatever's most popular in their database. A student named **Mira** would do the same on her French exam.
She'd score 80% by always picking "C". She doesn't speak French.

**This is a 30-second test that catches them.**

### → Try it without installing anything

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/spalsh-spec/falsify-eval/blob/main/notebooks/quickstart.ipynb)
[![Play with sliders](https://img.shields.io/badge/play%20with%20sliders-▶-9c4a1a?style=for-the-badge)](https://spalsh-spec.github.io/falsify-eval/play.html)
[![Real-data case study](https://img.shields.io/badge/real%20data-CS01%20NFCorpus-3d7a4a?style=for-the-badge)](case_studies/cs01_nfcorpus/CS01_REPORT.md)

The **Colab** runs the actual library on a synthetic bench (60 seconds, no install).
The **Playground** lets you pick a strategy with sliders and watch the gate verdict update live in your browser.
The **Case study** shows the same gate working on a peer-reviewed BEIR benchmark.

### → Or install and run locally

```bash
pip install falsify-eval
```

For the latest unreleased changes, install from source:

```bash
pip install git+https://github.com/spalsh-spec/falsify-eval
```

Free. Open source. Runs on your laptop. Works on any search system.

Built for **search engines, recommendation systems, the retrieval side of RAG.**
*Not* built for the part of ChatGPT that writes paragraphs — that's a different problem we haven't built a test for.

### Local audit web app

This repo now includes a local-first audit dashboard in `apps/web`.

Best for **retrieval, ranking, and RAG retrieval-side audits**. It is not a judge for free-text generation, summarisation, or open-ended answer quality.

Screenshots:

![Audit web workbench](/falsify-eval/docs/screenshots/audit-web-workbench.png)
![Audit report export](/falsify-eval/docs/screenshots/audit-web-report.png)

Run it:

```bash
npm install
npm run dev --workspace apps/web
```

Open `http://localhost:3000`.

The app accepts:

- dataset `.json` or `.jsonl`
- system output `.json` or `.jsonl`
- baseline output `.json` or `.jsonl`, or a corpus file for a generated BM25 baseline
- optional corpus `.json` or `.jsonl` with document `id` and `text`
- claim config as JSON or YAML

It returns:

- PASS / WARN / FAIL verdict
- human-readable Markdown report
- machine-readable JSON report
- dataset quality report before the evidence checks
- evidence table for metric, statistical, null, stability, leakage, and reproducibility checks

Privacy model:

- audits run locally
- uploaded files are never placed in `public/`
- raw uploads are stored under `.local-audits/raw` or `AUDIT_STORAGE_DIR`
- filenames are not trusted as storage paths
- reports redact emails, phone numbers, bearer tokens, API keys, and obvious secrets
- no audit code makes external network calls
- delete an audit job with `DELETE /api/audits/:id`

Run the included corpus-backed demo:

```bash
npm run build:cli --workspace apps/web
node apps/web/dist-cli/falsify-audit.mjs run \
  --dataset examples/audit-web-demo/dataset.jsonl \
  --system examples/audit-web-demo/system-output.jsonl \
  --corpus examples/audit-web-demo/corpus.jsonl \
  --config examples/audit-web-demo/config.yaml \
  --out /tmp/falsify-audit-demo.json \
  --pack-out /tmp/falsify-audit-pack
```

The public demo folder includes the dataset, corpus, system output, generated BM25 baseline output, claim config, and expected Markdown report: [`examples/audit-web-demo/`](examples/audit-web-demo/).

The corpus path builds a deterministic BM25 lexical baseline locally. No auth, accounts, hosted storage, external APIs, or LLM judging are used.

Deploy notes:

- Local-first use is the default and safest path for private benchmarks.
- Vercel is suitable for a public demo with toy data only unless private storage and access control are added.
- Full notes: [`docs/AUDIT_WEB_DEPLOYMENT.md`](/falsify-eval/docs/AUDIT_WEB_DEPLOYMENT.html).

Phase 2 local workbench additions:

```bash
node apps/web/dist-cli/falsify-audit.mjs template --template rag_search --out /tmp/rag-claim.yaml
node apps/web/dist-cli/falsify-audit.mjs compare \
  --dataset apps/web/examples/rag-dataset.jsonl \
  --system apps/web/examples/rag-system-v1.jsonl \
  --right-system apps/web/examples/rag-system.jsonl \
  --corpus apps/web/examples/rag-corpus.jsonl \
  --config apps/web/examples/claim.yaml \
  --mode system_v1_vs_v2 \
  --out /tmp/falsify-comparison.json
```
[![CI](https://github.com/spalsh-spec/falsify-eval/actions/workflows/ci.yml/badge.svg)](https://github.com/spalsh-spec/falsify-eval/actions/workflows/ci.yml)
[![Tests](https://img.shields.io/badge/tests-91%20passing-brightgreen)](tests/)
[![PyPI](https://img.shields.io/pypi/v/falsify-eval.svg?color=blue)](https://pypi.org/project/falsify-eval/)
[![DOI](https://zenodo.org/badge/1226286341.svg)](https://doi.org/10.5281/zenodo.20284676)
[![Release](https://img.shields.io/github/v/release/spalsh-spec/falsify-eval?color=blue&label=release)](https://github.com/spalsh-spec/falsify-eval/releases/latest)
[![Python ≥ 3.10](https://img.shields.io/badge/python-≥3.10-blue.svg)](https://www.python.org/)
[![Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)

[**30-second demo**](#30-second-demo) · [**The Mira test**](#the-mira-test) · [**How it works**](#how-it-works) · [**Three surfaces**](#three-surfaces) · [**Preprint**](#preprint)

## The Mira test

Imagine a student named Mira who never studied. She noticed that on past exams, “C” is the most common correct answer. So she writes C every time and scores 80%. She looks smart on paper. She has zero actual knowledge — she gamed the pattern.

A retrieval or ranking system can do the same thing. If the most popular document in a corpus happens to be relevant for most queries, a system that always returns that popular document will score well on aggregate metrics — without using the query at all. (This is not a hypothetical: see the CS01 NFCorpus case study where this exact predictor scores nDCG@10 = 0.066 on a published BEIR benchmark while ignoring every query.)

The published number looks great. It does not mean what you think it means.

falsify-eval is a Mira-check for retrieval and ranking systems. It compares your system’s score against four “fake students” — four null distributions, including one (Null D, the marginal-matched random) that is original to this work and that the previous standard nulls miss. If your system can’t beat all four by a calibrated margin, the gate fails.

→ Case studies (real numbers, two public benchmarks):

- CS01 — NFCorpus (323 BEIR queries, dense relevance ~38 docs/query)
- CS02 — SciFact (300 BEIR queries, sparse relevance ~1.1 docs/query)

Across both: Mira and popularity-only fail at Δ_D ≈ 0; BM25 and dense MiniLM pass at Δ_D = +0.14 to +0.73. Reproducible in 5 minutes each on an M1 laptop. Joint finding: graded metrics (nDCG) on dense-relevance benchmarks can flatten the gate, so pair them with single-gold strict metrics (recall@K against top-1).
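One way to read the pairing advice above is sketched here. The helper names `single_gold` and `strict_recall_at_k` are hypothetical and not part of falsify-eval. The sketch assumes qrels arrive as a mapping from document ID to graded relevance, and that ties for the top grade break on the smaller document ID.

```python
def single_gold(qrels: dict[str, int]) -> str:
    """Pick the one highest-graded document; ties break on the smaller doc ID."""
    return min(qrels, key=lambda doc_id: (-qrels[doc_id], doc_id))


def strict_recall_at_k(retrieved: list[str], gold: str, k: int) -> float:
    """1.0 if the single gold document appears in the top k, else 0.0."""
    return 1.0 if gold in retrieved[:k] else 0.0


# Illustrative graded judgments for one query (made-up IDs and grades).
qrels = {"d1": 1, "d2": 2, "d3": 2, "d4": 1}
gold = single_gold(qrels)

hit = strict_recall_at_k(["d4", "d1", "d2"], gold, k=3)
miss = strict_recall_at_k(["d4", "d1", "d3"], gold, k=3)
print(gold, hit, miss)
```

A graded metric would give both result lists partial credit for `d1` and `d4`. The strict metric gives credit only when the top-graded document itself is retrieved.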

## 30-second demo

```bash
pip install falsify-eval
python -c "from falsify_eval.demo import run; run()"
```

Three systems graded on a 50-query synthetic bench:

```
═══ constant_predictor (deliberately broken) ═══
  real mean nDCG@5         = 0.20
    Δ_A (gold-permuted)    = +0.000  ✗
    Δ_B (uniform random)   = +0.001  ✗
    Δ_C (random retrieval) = +0.18   ✓
    Δ_D (marginal-matched) = +0.000  ✗  ← the gate that catches Mira
  GATE: ✗ FAIL  (correctly rejected)

═══ mock_engine (plausible retrieval, 70% top-1) ═══
  real mean nDCG@5         = 0.62
    Δ across all 4 nulls   ≥ +0.40   ✓
  GATE: ✓ PASS  (correctly accepted)

═══ oracle (perfect top-1) ═══
  real mean nDCG@5         = 1.00
  GATE: ✓ PASS by maximum margin
```

## How it works

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {
    'fontFamily': 'Garamond, EB Garamond, Georgia, serif',
    'primaryColor': '#f3eee5',
    'primaryTextColor': '#1c1611',
    'primaryBorderColor': '#9c4a1a',
    'lineColor': '#9d8147',
    'tertiaryColor': '#faf6ed',
    'tertiaryBorderColor': '#d4c8b2',
    'edgeLabelBackground': '#f3eee5'
}}}%%
flowchart LR
    R([your retriever]) -->|top-K per query| S[real score]
    G([gold labels]) --> S
    G -->|permute π| A[Null A · label-permuted]
    G -->|iid uniform| B[Null B · uniform random]
    P([item pool]) -->|sample K| C[Null C · random retrieval]
    G -->|sample by class freq| D[Null D · marginal-matched ★]
    S --> Δ{Δ ≥ τ on<br/>all four?}
    A --> Δ
    B --> Δ
    C --> Δ
    D --> Δ
    Δ -->|yes| PASS([✓ PASS])
    Δ -->|no| FAIL([✗ FAIL])
    classDef ok    fill:#eef3e8,stroke:#3d7a4a,color:#1a3d22,stroke-width:1.5px;
    classDef fail  fill:#f7e9e3,stroke:#9c4a1a,color:#5a1c0c,stroke-width:1.5px;
    classDef novel fill:#fef9e7,stroke:#9d8147,color:#5a4720,stroke-width:2px;
    classDef gate  fill:#f3eee5,stroke:#1c1611,color:#1c1611,stroke-width:2px;
    class PASS ok
    class FAIL fail
    class D novel
    class Δ gate
```

| Null | What it tests | Catches |
|---|---|---|
| A — gold-permuted | bijection π over class labels | systems that learned label distribution shape, not relevance |
| B — uniform random | iid uniform draw of gold per query | systems that exploit class-prior assumption |
| C — random retrieval | replace engine output with K random items from pool | systems that score by retrieval coverage, not ranking quality |
| D — marginal-matched ★ | iid draw of gold from the empirical class frequency | predictors matched to the gold marginal — new in this work |

Null D is the load-bearing contribution. It correctly rejects the constant-most-frequent predictor that A and B can false-positive. (Definition 1 of the preprint.)
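The sampling rules in the table can be sketched in plain Python. This is an illustration of the four definitions as worded above, not the library's implementation. The function names are hypothetical, and the sketch assumes gold labels are hashable and sortable (strings here).

```python
import random
from collections import Counter


def null_a_gold(gold, rng):
    """Null A: relabel gold through one random bijection over the distinct labels."""
    labels = sorted(set(gold))
    shuffled = labels[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(labels, shuffled))
    return [mapping[g] for g in gold]


def null_b_gold(gold, rng):
    """Null B: draw each query's gold uniformly from the distinct labels."""
    labels = sorted(set(gold))
    return [rng.choice(labels) for _ in gold]


def null_c_retrieved(n_queries, pool, k, rng):
    """Null C: replace each result list with k distinct random items from the pool."""
    return [rng.sample(pool, k) for _ in range(n_queries)]


def null_d_gold(gold, rng):
    """Null D: draw each query's gold with probability equal to its empirical frequency."""
    counts = Counter(gold)
    labels = sorted(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=len(gold))


rng = random.Random(0)
gold = ["a"] * 6 + ["b"] * 3 + ["c"]          # skewed toy marginal: 60% / 30% / 10%
pool = [f"doc{i}" for i in range(20)]

a = null_a_gold(gold, rng)
b = null_b_gold(gold, rng)
c = null_c_retrieved(len(gold), pool, 5, rng)
d = null_d_gold(gold, rng)
```

Null A keeps the multiset of label counts and only moves which label carries which count. Null D is the only one of the three gold-side nulls whose draws follow the skew of the original labels.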

## First principles — the mental model

If your AI cannot beat random chance in four different ways, you do not know if it is actually working.

The broken ruler problem. A bare score like “0.77 nDCG” is a measurement — but a measurement is only meaningful against a baseline. What does a dumb, cheating, or random system score on the same bench? If you do not know, the number is a rubber ruler: real, but meaningless.

The four control groups. The gate builds four deliberately broken systems and measures each one:

| Null | The cheat it simulates | What a PASS rules out |
|---|---|---|
| A — permuted labels | Shuffle which answer belongs to which question. Every answer still appears — just reassigned (a bijection). | Your system learned the shape of the answer distribution, not relevance. |
| B — uniform random | For each question, draw a random correct answer with equal probability. | Your system exploits a uniform class-prior assumption. |
| C — random retrieval | Replace your search results entirely — return K random documents from the corpus. | Your system is no better than noise. |
| D — marginal-matched ★ | Draw answers weighted by how common each answer actually is in the benchmark. | Your system only learned “answer X is frequent” — not what is being asked. |

The delta. Δ = your_score − null_score. The gate passes only if Δ ≥ τ (default 0.05) on all four nulls simultaneously. One failure = gate fails.
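The decision rule can be sketched in a few lines. The result keys `gate_passes` and `deltas` mirror the names used in this README's examples, but the return shape, the helper name `gate_verdict`, and averaging each null over its trials are assumptions for illustration. The input numbers are taken from the 30-second demo output, with each null score backed out as real score minus Δ.

```python
from statistics import mean


def gate_verdict(real_scores, null_trial_means, tau=0.05):
    """Compare the real mean score against each null's mean score.

    real_scores:      per-query scores of the real system
    null_trial_means: {null name: [mean score of each null trial]}
    """
    real_mean = mean(real_scores)
    deltas = {name: real_mean - mean(trials) for name, trials in null_trial_means.items()}
    return {
        "real_mean": real_mean,
        "deltas": deltas,
        "gate_passes": all(delta >= tau for delta in deltas.values()),
    }


# Constant predictor from the demo: it only beats random retrieval (Null C).
constant = gate_verdict(
    [0.20, 0.20],
    {"A": [0.20], "B": [0.199], "C": [0.02], "D": [0.20]},
)

# A plausible engine: clears every null by a wide margin.
engine = gate_verdict(
    [0.62, 0.62],
    {"A": [0.20], "B": [0.21], "C": [0.02], "D": [0.22]},
)
print(constant["gate_passes"], engine["gate_passes"])
```

Because the rule is a conjunction, a large Δ on one null never compensates for a Δ below τ on another.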

Why Null D is the novel contribution. Say “Rigveda” appears in 30% of queries. A system that always answers “Rigveda” scores 0.30 recall without understanding a single question. Nulls A and B do not reliably catch this cheater — they sample without respecting frequency. Null D samples with frequency weights, directly simulating that exact exploit. Beating Null D means your system learned something beyond the base rate.
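The Rigveda example can be worked through as an expectation rather than a simulation. This sketch assumes a toy marginal: “Rigveda” at 30% and seven other labels at 10% each. The label names and the helper `expected_hit_rate` are hypothetical, and the numbers are idealised expectations, not library output.

```python
def expected_hit_rate(prediction, gold_distribution):
    """Expected hit rate of a constant predictor when gold is drawn from gold_distribution."""
    return gold_distribution.get(prediction, 0.0)


# Assumed empirical marginal: one frequent answer, seven equally rare ones.
empirical = {"Rigveda": 0.30, **{f"text_{i}": 0.10 for i in range(7)}}

# Uniform gold ignores frequency: every label gets 1/8.
uniform = {label: 1 / len(empirical) for label in empirical}

real_score = empirical["Rigveda"]  # the always-"Rigveda" system is right 30% of the time

delta_uniform = real_score - expected_hit_rate("Rigveda", uniform)     # 0.30 - 0.125
delta_marginal = real_score - expected_hit_rate("Rigveda", empirical)  # 0.30 - 0.30

tau = 0.05
print("beats uniform gold:", delta_uniform >= tau)
print("beats marginal-matched gold:", delta_marginal >= tau)
```

Against uniformly drawn gold the cheater shows a margin of 0.175 and would clear τ = 0.05. Against frequency-weighted gold the margin is exactly zero, which is the behaviour the paragraph above attributes to Null D.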

The integrity lock. Even a passing gate is meaningless if the benchmark files changed between runs. lock_state() computes a SHA-256 fingerprint of every benchmark artifact and binds it to a git commit — like a firmware checksum on a device update. verify_state() detects any drift silently introduced by migrations, feedback loops, or annotation changes.
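The fingerprinting idea can be illustrated with the standard library alone. This is a minimal sketch of hash-then-compare and not the behaviour of `lock_state()` or `verify_state()`: the helper names `fingerprint` and `drifted` are hypothetical, the file contents are made up, and the git-commit binding is left out.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(paths):
    """Map each file path to the SHA-256 hex digest of its bytes."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in sorted(paths)}


def drifted(locked, current):
    """Return the paths whose digest changed, appeared, or disappeared."""
    names = locked.keys() | current.keys()
    return sorted(name for name in names if locked.get(name) != current.get(name))


with tempfile.TemporaryDirectory() as tmp:
    bench = Path(tmp) / "bench.jsonl"
    bench.write_text('{"query": "q1", "gold": "d1"}\n', encoding="utf-8")

    locked = fingerprint([bench])                       # taken when the score was reported
    unchanged = drifted(locked, fingerprint([bench]))   # nothing touched yet

    # A silent annotation change: same file name, one label edited.
    bench.write_text('{"query": "q1", "gold": "d2"}\n', encoding="utf-8")
    changed = drifted(locked, fingerprint([bench]))

print(unchanged, len(changed))
```

A one-character edit changes the digest, so a score recorded against the old fingerprint can no longer be presented as a score on the current files.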

→ The Four Nulls explained · Visual guide · Editorial explainer · Preprint §3

## Three surfaces

```python
# 1. Library
from falsify_eval import four_null_gate

result = four_null_gate(
    retrieved_lists, gold_list, rel_list, my_metric,
    item_pool=corpus_ids, k=5, n_trials=50, tau=0.05,
    progress=True,                      # stderr per-stage timing
)
assert result["gate_passes"]
```

```bash
# 2. CLI on JSONL benches — no Python knowledge needed
falsify-eval grade --input bench.jsonl --metric ndcg@5 --pool corpus.txt
falsify-eval doctor                     # end-to-end install verification
falsify-eval quickstart ./demo          # writes a sample bench + pool
```

```jsonc
// 3. MCP server — Claude Code, Cursor, any MCP-compatible client
{
  "mcpServers": {
    "falsify-eval": {
      "command": "python",
      "args": ["-m", "falsify_eval.mcp_server"]
    }
  }
}
```

Claude can then call grade_retrieval directly on any retrieval pipeline output you give it — no glue code, no separate scoring service.

## What it catches

A non-exhaustive list of failure modes the gate flags:

| Broken predictor | Δ_A | Δ_B | Δ_C | Δ_D | Gate |
|---|---|---|---|---|---|
| Constant most-frequent class | ≈ 0 | ≈ 0 | + | ≈ 0 | ✗ |
| Marginal-matched random | ≈ 0 | + | + | ≈ 0 | ✗ |
| Popularity-only ranker (no query feature) | + | + | + | small | ✗ |
| Lexical-match-only on bag-of-words | + | + | + | + | ✓ |
| Full retriever (BM25 / dense / hybrid) | + | + | + | + | ✓ |
| Full retriever on drifted corpus | varies | varies | varies | varies | ✗ via `verify_state` |

The first three score well on bare aggregate metrics (nDCG, MRR, recall@K). The standard reporting practice publishes those numbers. The four-null gate rejects them.

## What the gate does NOT prove

A passing gate is necessary for credible reporting, not sufficient. It does not prove:

- the engine learned the actual relevance signal (only that it learned something beyond the four trivial null classes)
- the engine generalises beyond the evaluation set
- per-feature contribution claims are significant (handled separately by `bootstrap_ci`, `paired_permutation_p`, `cohens_d_paired`)
- the bench developer didn’t overfit query phrasing to engine behaviour

The library is calibrated for retrieval and ranking evaluation — search, recommendation top-K, RAG retrieval-side, classification-as-retrieval. It is not yet generalised to LLM free-text generation, summarisation, or open-ended QA. Those domains need their own null distributions and are planned for v0.3+.

## Validating an LLM-RAG pipeline

```python
from falsify_eval import four_null_gate

# Replace this with whatever your retriever returns. The library doesn't
# care if it's BM25, FAISS, Pinecone, Weaviate, Vespa, or a homegrown
# bag-of-words. It grades the OUTPUT, not the engine.
def my_rag_retriever(query: str) -> list[str]:
    """Return top-K document IDs for a query."""
    ...

# `queries`, `gold` (one gold document ID per query), and `pool` (all
# candidate document IDs) come from your own bench; they are not defined here.
retrieved = [my_rag_retriever(q) for q in queries]

def recall_at_5(r, g, _rel):
    """1.0 if the gold ID is in the top 5 results; the relevance argument is ignored."""
    return 1.0 if g in r[:5] else 0.0

res = four_null_gate(
    retrieved, gold, [3]*len(gold), recall_at_5,
    item_pool=pool, k=5, n_trials=100, tau=0.05, seed=2026,
)
print("GATE:", "PASS" if res["gate_passes"] else "FAIL", res["deltas"])
```

A complete Claude-API worked example with a 50-query bench is in examples/llm_rag_validation.py. To adapt it to GPT-4 / Llama / Mistral / Gemini: swap the API call inside my_rag_retriever. The gate is identical.

## Why is my run taking so long?

The gate calls your metric_fn exactly N × (1 + 4 × n_trials) times.

| Metric cost / call | N=500, n_trials=50 |
|---|---|
| In-memory check (~1 µs) | 0.1 s |
| Embedding lookup (~1 ms) | 1.7 min |
| LLM-judge call (~200 ms) | ~5.6 hours |

If your run is taking hours, your metric is the bottleneck — not the gate (which finishes N=5,000 × pool=100k × n_trials=50 in under 2 seconds with a fast metric). Pass progress=True to see per-stage timing on stderr. Three options to speed up: (1) drop n_trials from 50 → 20 — statistically defensible; (2) cache metric_fn calls; (3) parallelise the four nulls with multiprocessing — pure CPU, no shared state.
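The table follows directly from the call-count formula, so the same arithmetic gives a quick budget for any bench size before a run starts. The helper names below are hypothetical; only the formula N × (1 + 4 × n_trials) comes from this README.

```python
def metric_calls(n_queries: int, n_trials: int, n_nulls: int = 4) -> int:
    """One real pass plus n_trials passes for each null, each over every query."""
    return n_queries * (1 + n_nulls * n_trials)


def estimated_seconds(n_queries: int, n_trials: int, seconds_per_call: float) -> float:
    """Total metric time, ignoring the gate's own overhead."""
    return metric_calls(n_queries, n_trials) * seconds_per_call


# Reproduce the table: N=500, n_trials=50 -> 100,500 metric calls.
for label, cost in [("in-memory", 1e-6), ("embedding lookup", 1e-3), ("LLM judge", 0.2)]:
    seconds = estimated_seconds(500, 50, cost)
    print(f"{label:>16}: {seconds:10.1f} s  ({seconds / 3600:.2f} h)")

# Option (1) above: dropping n_trials from 50 to 20 cuts calls to 40,500.
llm_judge_hours_at_20 = estimated_seconds(500, 20, 0.2) / 3600
print(f"LLM judge at n_trials=20: {llm_judge_hours_at_20:.2f} h")
```

With a 200 ms metric, moving from 50 to 20 trials takes the estimate from about 5.6 hours to 2.25 hours. Caching repeated metric calls attacks the per-call cost instead.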

## How this compares

| Capability | DVC | MLflow | W&B | Ragas | TruLens | falsify-eval |
|---|---|---|---|---|---|---|
| Vendor-free | ✓ | ✓ | ✗ | ✓ | partial | ✓ |
| Pure-text human-readable lock | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Couples artifact hash + verified score | ✗ | ✗ | partial | ✗ | partial | ✓ |
| Falsification gate (CI-enforceable) | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Marginal-matched null ★ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Positive-control self-validation | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |

The tools above solve different problems (versioning, tracking, observability). They complement falsify-eval; they don’t replace it.

## Where it actually runs

Pure Python ≥ 3.10 + numpy ≥ 1.24. No GPUs, no native extensions, no internet at runtime.

| Environment | One-liner |
|---|---|
| Local laptop | `pip install git+https://github.com/spalsh-spec/falsify-eval` |
| Google Colab | `!pip install git+https://github.com/spalsh-spec/falsify-eval` |
| Kaggle / Sagemaker / Databricks | same as Colab |
| GitHub Actions | add the `pip install` line to your `run:` block |
| Docker (any base image with Python ≥ 3.10) | `RUN pip install git+...` |
| AWS Lambda / Cloud Functions | bundle as a layer; the wheel is < 50 KB |
| Air-gapped / offline | clone the repo to a USB stick; install from local path |

The library is intentionally minimal so the audit surface is small and the deployment surface is large. No network calls, no telemetry, no opinions about your runtime.

## What the gate proves (Proposition 1)

If the four-null gate PASSes (Δ ≥ τ on all four nulls) at N_trials = 50, τ = 0.05, then with Bonferroni-corrected confidence ≥ 0.95:

- The engine is **not** equivalent to a label-permutation-invariant ranker (rejected by *G_A*).
- The engine is **not** achieving its score solely via the uniform-class-prior assumption (rejected by *G_B*).
- The engine is **not** equivalent to a uniform-random retriever (rejected by *G_C*).
- The engine is **not** equivalent to a gold-marginal-matched predictor (rejected by *G_D — new in this work*).

The full proof is in [`PREPRINT.md`](/falsify-eval/PREPRINT.html), §3.
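For readers who want the shape of the Bonferroni step without opening the preprint, the standard union bound is sketched below. It assumes a family-wise level α = 0.05 split evenly across the four gates, so each gate is held to α/4 = 0.0125. That even split is an assumption for illustration, and the preprint's proof remains the authoritative statement. Here E_i denotes the event that a system equivalent to null i clears gate G_i anyway.

```latex
\Pr\!\left[\,\bigcup_{i \in \{A,B,C,D\}} E_i \right]
  \;\le\; \sum_{i \in \{A,B,C,D\}} \Pr[E_i]
  \;\le\; 4 \cdot \frac{\alpha}{4}
  \;=\; \alpha \;=\; 0.05
```

The probability that any one of the four rejections is spurious is therefore at most 0.05, which gives the stated joint confidence of at least 1 − α = 0.95.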

## Why we built it

Most retrieval-system papers report a single aggregate metric (nDCG@k, MRR) and call it a contribution. Three failure modes make this practice unsafe at any benchmark size and dangerous on small ones:

1. **Null-distribution silence.** A learned ranker can absorb gold-label distribution shape without learning underlying query–document relevance. A constant predictor matched to the empirical class marginal can score non-trivially without using the query at all.
2. **Corpus drift between commits.** ALTER TABLE migrations and feedback-loop side effects mutate runtime artifacts without changing source code. A "score-neutral" annotation can be true about the source diff while false about the runnable system.
3. **Small-sample claims masquerading as significance.** A +0.02 metric gain on N < 50 queries usually sits inside the bench's noise floor.

The four-null gate addresses (1). The integrity-check state lock (`lock_state` / `verify_state`) addresses (2). The statistical-reporting helpers (`bootstrap_ci`, `paired_permutation_p`, `cohens_d_paired`, `power_n_required`) address (3). All in <1,000 lines of Python with `numpy` as the only runtime dependency.
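Failure mode (3) is easy to see with a percentile bootstrap over per-query score differences. The sketch below is a generic textbook bootstrap, not the library's `bootstrap_ci`. The function name `percentile_bootstrap_ci` is hypothetical and the per-query differences are constructed toy data (40 queries, mean gain +0.02).

```python
import random


def percentile_bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample queries with replacement and record the mean gain each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy bench: 40 queries, new system minus old system, per query.
# Half the queries gain 0.22 and half lose 0.18, so the mean gain is +0.02.
diffs = [0.22, -0.18] * 20
mean_gain = sum(diffs) / len(diffs)

lo, hi = percentile_bootstrap_ci(diffs)
print(f"mean gain = {mean_gain:+.3f}, 95% CI = [{lo:+.3f}, {hi:+.3f}]")
```

On this toy data the interval straddles zero, so the +0.02 gain cannot be distinguished from noise at this sample size.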

## Preprint

PREPRINT.md — Calibrated Falsification Harnesses for Retrieval Evaluation (v7, with N=10,000 validation, broken-predictor suite, sensitivity grid, soundness proposition).
SUPPLEMENTARY.md — extended tables, ablations, bench-size calibration curve.

Submission to arXiv is pending. The DOI will be added to CITATION.cff on acceptance. In the interim, the markdown is the canonical source; both files are immutable for v0.1.0 (verifiable via lock_state against the v0.1.0 tag).

```bibtex
@article{sharma2026calibrated,
  title  = {Calibrated Falsification Harnesses for Retrieval Evaluation},
  author = {Sharma, Sparsh},
  year   = {2026},
  eprint = {<arxiv-id-when-published>},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR}
}
```

## Companion engine — Vāk-Kaṇaja (public release imminent)

Vāk-Kaṇaja is the Sanskrit / Pāṇinian retrieval engine built alongside falsify-eval. It is the first retriever (to my knowledge) adversarially verified by the four-null gate via cross-falsification, and the first to wire the 6 classical Pramāṇas of Nyāya / Mīmāṃsā into a retrieval engine as a router — detecting the query’s epistemological type (Pratyakṣa, Anumāna, Upamāna, Arthāpatti, Anupalabdhi, Śabda) and routing evidence channels accordingly.

It also implements an Anupalabdhi (non-perception) confidence floor: when the corpus does not contain the answer, the engine returns “corpus does not contain this knowledge” as a positive verdict, refusing to leak weak chunks. This pairs naturally with falsify-eval’s Null A: both target silent failure, the failure mode that load-bearing AI-safety arguments rely on assuming away.

The engine ships with a calibrated negative result: bench expansion N=21 → N=141 falsified the lift from the novel rerankers (Poincaré, topological persistence, fractal affinity), which now ship at production weight 0 and are documented as opt-in research components. The 3-channel φ-RRF baseline is the production default. This is the falsify-eval discipline applied to the authoring engine — same calibration that earned three clean rounds of adversarial review on this library.

Public release imminent at github.com/spalsh-spec/vak-kanaja, Apache 2.0, under the Bhardwaj & Sons brand. Priority announcement dated 2026-05-08.

## Status

v0.2.0 — current. First public release: live on PyPI (pip install falsify-eval), Zenodo DOI 10.5281/zenodo.20284676, GitHub Pages playground. Published via OIDC trusted publishing (Sigstore-attested against tag v0.2.0).
v0.1.6.11 — 91 tests passing on a fresh clone (Mayank-battery 31 + property-based 15 + scipy cross-check 11 + smoke 8 + validation 9 + CLI stdin 4 + Windows-encoding 3 + shell-mangled paths 6 + sundry 4); ~10 s on M1. CI matrix green on Ubuntu × {3.10, 3.11, 3.12} and macOS × {3.10, 3.11, 3.12}.
v0.1.6.11 — publish-workflow version-sync guard hardened: previously tried to import falsify_eval before the package was installed and failed at the version-check step; now reads __version__ and pyproject.toml’s version directly via grep/sed so the tag, source files, and built artefact are cross-checked three ways without requiring an install.
v0.1.6.10 — distribution + arXiv build prep (infrastructure-only, no gate behaviour change): added .github/workflows/publish.yml for OIDC trusted publishing to PyPI on every v* tag push; added tools/build_arxiv.sh for converting PREPRINT.md to an arXiv-submittable LaTeX bundle via pandoc; added [tool.mutmut] config + docs/MUTATION_TESTING.md documenting the deferred status; added [project.optional-dependencies] dev bucket pinning mutmut, build, and twine.
v0.1.6.9 — added CS03 case-study scaffold (case_studies/cs03_aikosh_rag/) for the AIKosh internal RAG integration (Jasmeet Singh, in flight); added Tested-platforms log to README; renumbered v0.2 case studies (CS03 = AIKosh, CS04 = FiQA, CS05 = Quora).
v0.1.6.8 — empirical equivariance certificate: PREPRINT §5.9 + property tests proving the gate is strongly equivariant under order-preserving label-set bijections and Null C / real_mean are exactly equivariant under arbitrary bijections.
v0.1.6.7 — declared hypothesis>=6.0 as a test dep so CI installs it. (Caught by CI matrix the moment v0.1.6.6 landed.)
v0.1.6.6 — Hypothesis property-based test suite for the four-null gate: 13 universally-true properties (algebraic, deterministic, metric, gate-semantics, validation), each fuzzed against ~80 random benches per CI run.
v0.1.6.5 — cross-platform path-mangling hint: when --input my-bench\bench.jsonl is copy-pasted into zsh and the backslash gets eaten, the CLI now suggests the corrected forward-slash path instead of a bare FileNotFoundError.
v0.1.6.4 — Windows console UTF-8 / ASCII output hardening (closes Jasmeet’s cp1252 UnicodeEncodeError on the Δ glyph): reconfigure stdout to UTF-8 with errors='replace' at CLI entry, with auto-fallback to ASCII glyphs (Δ→d, τ→tau, ✓→[ok]) when the post-reconfigure stream still can’t encode them. Also --ascii flag and FALSIFY_ASCII=1 env var.
v0.1.6.3 — public priority announcement of companion engine Vāk-Kaṇaja.
v0.1.6.2 — Mayank round-3 polish: negative-seed validation in _validate_inputs.
v0.1.6.1 — Mayank round-2: CLI --input - now reads from stdin (was FileNotFoundError: '-').
v0.1.6 — bonferroni helper, scipy cross-check tests, property-based tests, CS02 SciFact case study, PREPRINT scope-honesty rewrite, AI/retrieval conflation strike across surfaces.
v0.1.5.2 — added progress=True flag to four_null_gate after Mayank’s 5-hour AIKosh silent-run incident.
v0.1.5.1 — closed null_a defect class for tuple / dataclass labels.
v0.1.5 — fixed all 14 defects from the Mayank Singh adversarial battery; full credit in CHANGELOG.md.
v0.2 (in progress) — PyPI publish ✓ shipped (v0.2.0 live on PyPI); case studies CS03 (AIKosh internal RAG, scaffolded — see case_studies/cs03_aikosh_rag/), CS04 (FiQA) and CS05 (Quora) for metric-sensitivity triangulation; broken-predictor zoo as a public artifact; label_order_seed parameter to break dependency on adversarial label ordering (see PREPRINT §5.9).
v0.3+ (planned) — extension to LLM free-text and summarisation; pre-registration tooling. (Not yet shipped — do not claim coverage.)

## Tested platforms

External-verification log. Each entry is a real run by a real person who is not the package author, dated, with the exact version they ran. New entries go at the top.

| Date | Tester | OS | Python | Shell | Version | Notes |
|---|---|---|---|---|---|---|
| 2026-05-08 | Jasmeet Singh (AIKosh) | Windows 10 (19045) | 3.14.3 | PowerShell | 0.1.6.7 | install / upgrade 0.1.6.2→0.1.6.7 / `doctor` / `quickstart` / `grade` all clean; original cp1252 defect closed. CS03 integration with AIKosh’s internal RAG retriever in flight. |
| 2026-05-07 | Mayank Singh | macOS 14 (M1) | 3.12 | zsh | 0.1.5 → 0.1.6.2 | adversarial 14-defect battery; all closed. |

Issues and PRs welcome. The reference implementation is intentionally minimal; the goal is for the protocol to be small enough that adopters audit the entire library before depending on it.

*A house of standards.* Released by **[Bhardwaj & Sons](https://bhardwajandsons.com)** under Apache 2.0.
The methodology is free, public, and citable so it can become a standard rather than a product.
