Tests: retrieval and ranking systems. Search engines. Recommendation top-K. The retrieval side of RAG (where the AI fetches source documents before generating an answer). Anything that picks which items from a fixed corpus to show in what order.
Does NOT test: the part of ChatGPT, Claude, or Gemini that writes paragraphs of free-form text. Those are generative language models, with different failure modes, different mathematics, and a different test that we have not yet built. Anyone telling you this tool checks "AI in general" is overselling. We're not.
Search and ranking systems fetch information. Search engines. Recommendation systems. The retrieval side of RAG. They get graded on how often they return the right document for a given query.
Some of them get a high grade by exploiting which answers happen to be most popular in the test, not by understanding the query. We built a 30-second check that spots that pattern. We've now run it on a real public benchmark and shown it works (CS01 / NFCorpus, 323 queries, 5 minutes on a laptop).
Free. Open source. Works on any retrieval or ranking system you point it at.
Imagine a multiple-choice exam on French history. You give it to a hundred students. Most questions are about Napoleon. Roughly 80% of the correct answers happen to be "C." Maybe the teacher who wrote the test had a habit. It happens.
One student, Mira, didn't study. She doesn't speak French. She doesn't know what Napoleon did. But she figured out from looking at past tests that "C" is the most common answer. So she just writes C for every question.
Mira scores 80%.
If you only see her score, you'll think Mira is a French history expert. She isn't. She figured out the test, not the subject.
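Here is the same story as arithmetic. This is a toy sketch, not part of the tool: the answer key is made up to match the 80% figure above, and every name in it is hypothetical.

```python
from collections import Counter

# Hypothetical 100-question answer key: 80 questions keyed "C", the rest spread out.
answer_key = ["C"] * 80 + ["A"] * 7 + ["B"] * 7 + ["D"] * 6

def score(guesses, key):
    """Fraction of questions answered correctly."""
    return sum(g == k for g, k in zip(guesses, key)) / len(key)

# Mira's strategy: find the most common answer and write it for every question.
most_common_answer, _ = Counter(answer_key).most_common(1)[0]
mira_guesses = [most_common_answer] * len(answer_key)

mira_score = score(mira_guesses, answer_key)
print(f"Mira always answers {most_common_answer!r} and scores {mira_score:.0%}")
# Prints: Mira always answers 'C' and scores 80%
```

Nothing in the score itself separates Mira from a student who actually knows the material.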
Resume screening tools. A search-and-rank system that retrieves candidates can score well on offline metrics by always returning candidates from the most common training-set universities — regardless of the job description. We have not personally audited a named product; the failure mode is well-documented in academic IR literature.
Medical literature search. A search system can score "highly relevant" by always returning the most-cited paper in a specialty, regardless of the actual symptoms in the query. Same mechanism as Mira.
The retrieval side of RAG (the part of a Claude/GPT/Gemini answer that says "here are my sources"). If the retriever is biased toward popular documents, the LLM's answer will be too, even though the LLM itself is fine. The four-null gate (our four-test check) tests the retrieval side. The generative side is a separate failure-mode class.
In all three cases the system looks like it's reasoning. It's actually exploiting that some answers are popular. Aggregate metrics alone cannot tell you which.
Our tool gives the same exam to four imaginary students who can't possibly know the real subject. Then it compares the system's score to theirs.
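To make the comparison concrete, here is a minimal sketch of one imaginary student, the Mira-style one, on a toy retrieval test. It is an illustration under assumptions, not the tool's actual code: the queries, document names, and the `hit_rate` helper are all invented, and we don't show the other three students here.

```python
from collections import Counter

# Toy evaluation set (hypothetical): query -> the one document judged relevant.
relevant = {
    "q1": "doc_popular", "q2": "doc_popular", "q3": "doc_popular",
    "q4": "doc_rare_1", "q5": "doc_rare_2",
}

def hit_rate(predictions, relevant):
    """Share of queries whose top-1 prediction is the relevant document."""
    hits = sum(predictions[q] == doc for q, doc in relevant.items())
    return hits / len(relevant)

# Mira-style imaginary student: ignore the query, always return the most popular answer.
popular_doc, _ = Counter(relevant.values()).most_common(1)[0]
mira_predictions = {q: popular_doc for q in relevant}

# A hypothetical system under test that happens to behave the same way.
system_predictions = {q: "doc_popular" for q in relevant}

system_score = hit_rate(system_predictions, relevant)
mira_score = hit_rate(mira_predictions, relevant)
margin_points = round((system_score - mira_score) * 100)
print(f"system {system_score:.0%}, Mira-style {mira_score:.0%}, margin {margin_points:+d} points")
# Prints: system 60%, Mira-style 60%, margin +0 points
```

A 60% score looks respectable on its own. The zero margin over the Mira-style student is what shows the system never used the query.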
The tool was developed and validated on a real Sanskrit and Tamil literary search engine over 10,000 test questions. Here are the actual numbers from the v0.1 release. Each row is a different "student" graded against all four imaginary students. The "vs" columns are margins in percentage points.
| Student type | Score | vs A | vs B | vs C | vs Mira | Verdict |
|---|---|---|---|---|---|---|
| Anti-oracle (always wrong) | 0% | −25 | −25 | −25 | −25 | FAIL |
| Constant predictor | 18% | +9 | +10 | −7 | +0 | FAIL |
| Mira-style | 34% | +19 | +19 | +19 | +0 | FAIL |
| Real engine | 79% | +55 | +55 | +55 | +52 | PASS |
| Oracle (perfect) | 100% | +75 | +75 | +75 | +74 | PASS |
Look at the third row. The Mira-style student gets 34%. That's positive. It beats imaginary students A, B, and C by 19 percentage points each. Without the Mira test, you'd say this student passed.
But the Mira test gives a margin of zero. The fourth check proves this student isn't really studying. It's just exploiting the popular-answer pattern.
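One way to read the verdict column is sketched below. The margins are copied from the table above, but the pass rule is our assumption for illustration (every margin must be strictly positive). It happens to reproduce the five verdicts shown. The tool's real decision rule may differ, for example by using a statistical threshold rather than zero.

```python
# Margins in percentage points, copied from the table above: (vs A, vs B, vs C, vs Mira).
margins = {
    "Anti-oracle (always wrong)": (-25, -25, -25, -25),
    "Constant predictor": (9, 10, -7, 0),
    "Mira-style": (19, 19, 19, 0),
    "Real engine": (55, 55, 55, 52),
    "Oracle (perfect)": (75, 75, 75, 74),
}

def verdict(row, min_margin=0):
    """Assumed rule: PASS only if every margin is strictly above min_margin."""
    return "PASS" if min(row) > min_margin else "FAIL"

verdicts = {name: verdict(row) for name, row in margins.items()}
for name, result in verdicts.items():
    print(f"{name}: {result}")
```

Under this rule a single zero or negative margin is enough to fail, which is why the Mira-style row fails despite three margins of +19.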
This isn't hypothetical. Our own first published numbers were inflated by a programming bug that produced exactly this pattern. We retracted them publicly. That retraction is what gave the validated numbers their weight.
I'm telling you about the following incidents because you should know I'm not selling a magic box. I'm selling something that has been honestly tested and has visible flaws.
I tried adding a sophisticated analysis technique to make the AI better. After running it on 10,000 test questions, the result was statistically significant, and slightly worse than not using it. I disabled the feature. The whole point of this kind of testing is to catch yourself when you're wrong. That was the catch.
A subtle Python issue meant some of my settings were silently using old values. The published numbers were higher than the real ones. When I found the bug, I retracted the inflated numbers publicly. The corrected numbers were lower but real.
Akosh tried to install my tool last week. Three things in the README were wrong. The "pip install" command pointed to a package that doesn't exist. The demo command had a typo. The clone URL was a placeholder I forgot to fill in. He got "failed basic bench tests," exactly the kind of report that would normally mean my whole library is broken. It wasn't broken. My install instructions were. Forty-five minutes after he reported it, the fixed version was live. He's named in the project's changelog as the v0.1.2 reporter.
The original methodology paper offered $2,000 cash to anyone who could break the methodology in specific ways. After the install bugs surfaced, I suspended the bounty pending broader validation. Until the implementation has been validated by an external lab, paying for breakage when I haven't yet hardened the entry doors is the wrong order.
This is not a list of fatal flaws. Every active research project has problems like these. It is a list of moments where the system caught itself. That's what you're looking for in any methodology. Not "did it have problems," but "did it surface them."
Honest answer as of today: almost nobody, yet. The methodology paper is public. The code is public. The math is undergraduate-level statistics packaged so you don't need to be a statistician.
The independent validation step (having a major AI research lab run the tool against their own published results and confirm the methodology works) is the highest-priority next step. It has not happened yet. If you're at one of those labs and reading this, that's what I most need.
Other people have built similar things. We stand on:
The contribution here is narrow and specific. The Mira-style imaginary student, packaged as a 30-second four-test gate that anyone can run. Everything else is lineage.
Run falsify-eval doctor. It takes 30 seconds. Tell us if it doesn't work.