Is the search engine actually any good?

SCOPE: WHAT THIS DOES AND DOES NOT TEST

Tests: retrieval and ranking systems. Search engines. Recommendation top-K. The retrieval side of RAG (where the AI fetches source documents before generating an answer). Anything that picks which items from a fixed corpus to show in what order.

Does NOT test: the part of ChatGPT, Claude, or Gemini that writes paragraphs of free-form text. Those are generative language models: different failure modes, different mathematics, and a different test that we have not yet built. Anyone telling you this tool checks "AI in general" is overselling. We're not.

If you have 30 seconds

Search and ranking systems fetch information. Search engines. Recommendation systems. The retrieval side of RAG. They get graded on how often they return the right document for a given query.

Some of them get a high grade by exploiting which answers happen to be most popular in the test, not by understanding the query. We built a 30-second check that spots that pattern. We've now run it on a real public benchmark and shown it works (CS01 / NFCorpus, 323 queries, 5 minutes on a laptop).

Free. Open source. Works on any retrieval or ranking system you point it at.

If you stop reading here

That's all you really need. The rest of this page is for people who want to see how it actually works, what doesn't work, and what we still don't know.

If you have 2 minutes

How a student who didn't study can score 80%.

Imagine a multiple-choice exam on French history. You give it to a hundred students. Most questions are about Napoleon. Roughly 80% of the correct answers happen to be "C." Maybe the teacher who wrote the test had a habit. It happens.

One student, Mira, didn't study. She doesn't speak French. She doesn't know what Napoleon did. But she figured out from looking at past tests that "C" is the most common answer. So she just writes C for every question.

Mira scores 80%.

If you only see her score, you'll think Mira is a French history expert. She isn't. She figured out the test, not the subject.
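Mira's arithmetic is easy to reproduce. The sketch below is a toy, not part of the tool: the 100-question exam, the exact split of the answer key, and the names `answer_key`, `mira`, and `guesser` are all made up for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

OPTIONS = ["A", "B", "C", "D"]
N_QUESTIONS = 100

# Hypothetical answer key: exactly 80 of the 100 correct answers are "C".
answer_key = ["C"] * 80 + ["A"] * 7 + ["B"] * 7 + ["D"] * 6
random.shuffle(answer_key)


def accuracy(answers, key):
    """Fraction of questions answered correctly."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)


# Mira never reads the question: she writes "C" every time.
mira = ["C"] * N_QUESTIONS
# A student who guesses uniformly at random, for comparison.
guesser = [random.choice(OPTIONS) for _ in range(N_QUESTIONS)]

mira_score = accuracy(mira, answer_key)
guesser_score = accuracy(guesser, answer_key)

print(f"Mira (always C): {mira_score:.0%}")
print(f"Random guesser:  {guesser_score:.0%}")
```

Mira's score is set entirely by how lopsided the answer key is. Nothing in her "method" touches the questions.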

This is a documented failure mode in retrieval and ranking systems, and it can make their published scores look brilliant. We've now demonstrated, on a real public benchmark (CS01 / NFCorpus), that a system doing the equivalent of "always pick C" scores nDCG@10 = 0.066, which is non-trivial, without using the query at all. The four-null gate correctly fails it.
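Here is a rough sketch of how a query-blind ranker can earn a non-zero nDCG. It is not the CS01 code. The judgments in `qrels` are invented, the gain is the plain linear variant of nDCG, and the cutoff is 2 rather than 10 only because the toy corpus is tiny.

```python
import math
from collections import Counter


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_docs, relevance, k):
    """nDCG@k for one query; `relevance` maps doc id -> graded relevance."""
    gains = [relevance.get(doc, 0) for doc in ranked_docs[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical relevance judgments: query id -> {doc id: relevance}.
qrels = {
    "q1": {"d1": 1, "d7": 1},
    "q2": {"d1": 1, "d3": 1},
    "q3": {"d1": 1},
    "q4": {"d5": 1, "d9": 1},
}

# "Always pick C": rank documents by how often they are judged relevant,
# then return that same list for every query. The query text is never read.
popularity = Counter(doc for rels in qrels.values() for doc in rels)
popular_ranking = sorted(popularity, key=lambda d: (-popularity[d], d))

K = 2  # toy cutoff; the figure quoted above uses a cutoff of 10
scores = {q: ndcg_at_k(popular_ranking, rels, K) for q, rels in qrels.items()}
mean_ndcg = sum(scores.values()) / len(scores)

for q, s in scores.items():
    print(q, round(s, 3))
print("mean nDCG:", round(mean_ndcg, 3))
```

The toy mean comes out far higher than 0.066 because the corpus is so small. Only the mechanism carries over: a ranking that ignores the query still collects credit wherever popular documents happen to be relevant.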

Where this matters in the real world

Resume screening tools. A search-and-rank system that retrieves candidates can score well on offline metrics by always returning candidates from the most common training-set universities — regardless of the job description. We have not personally audited a named product; the failure mode is well-documented in academic IR literature.

Medical literature search. A search system can score "highly relevant" by always returning the most-cited paper in a specialty, regardless of the actual symptoms in the query. Same mechanism as Mira.

The retrieval side of RAG (the part of a Claude/GPT/Gemini answer that says "here are my sources"). If the retriever is biased toward popular documents, the LLM's answer will be too — even though the LLM itself is fine. The four-null gate tests the retrieval side. The generative side is a separate failure-mode class.

In all three cases the system looks like it's reasoning. It's actually exploiting that some answers are popular. Aggregate metrics alone cannot tell you which.

If you stop reading here

You understand the problem. Mira gets 80% because she figured out the test. Some search-and-rank systems do the same thing. We built a way to spot it, and we've shown it works on a public benchmark. Done.

If you have 5 minutes

Four imaginary students. The exam tells you which one your retriever is.

Our tool gives the same exam to four imaginary students who can't possibly know the real subject. Then it compares the system's score to theirs.

Imaginary Student A. Takes the answer key and randomly rearranges it. Beating this means the system isn't just memorising answer positions.
Imaginary Student B. Picks one random answer for each question. Beating this means the system isn't just guessing.
Imaginary Student C. Picks five random answers per question. Beating this means the system's narrowing-down is meaningful.
Imaginary Student D, "Mira." Always picks the most common answer. Beating this means the system is actually using the query rather than exploiting popularity. This is the new contribution. Most published retrieval evaluations forget about it.
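The four imaginary students can be sketched in a few lines. This is one reading of the descriptions above, not the tool's actual implementation: the function `null_scores`, the one-correct-document-per-question setup, and the choice to learn the "most common answer" from the test's own answer key are all assumptions.

```python
import random
from collections import Counter


def null_scores(gold, corpus, seed=0, k=5):
    """Score four query-blind baselines against a list of correct answers.

    gold   -- one correct document id per question (hypothetical setup)
    corpus -- every document id a student could pick
    """
    rng = random.Random(seed)
    n = len(gold)

    # Student A: take the answer key and randomly rearrange it.
    shuffled = gold[:]
    rng.shuffle(shuffled)
    a = sum(s == g for s, g in zip(shuffled, gold)) / n

    # Student B: one uniformly random pick per question.
    b = sum(rng.choice(corpus) == g for g in gold) / n

    # Student C: k random picks per question; a hit if the answer is among them.
    c = sum(g in rng.sample(corpus, k) for g in gold) / n

    # Student D ("Mira"): always the single most common answer.
    most_common = Counter(gold).most_common(1)[0][0]
    d = sum(most_common == g for g in gold) / n

    return {"A_shuffled_key": a, "B_random": b,
            "C_random_top_k": c, "D_popularity": d}


# Hypothetical exam: 10 questions, 20 documents, "doc0" is correct 8 times.
corpus = [f"doc{i}" for i in range(20)]
gold = ["doc0"] * 8 + ["doc1", "doc2"]

nulls = null_scores(gold, corpus)
for name, score in nulls.items():
    print(f"{name}: {score:.0%}")
```

On a lopsided answer key like this one, Student D scores 80% without reading a single question, which is why a system has to clear that bar as well as the three random ones.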

The interactive demo lets you pick a student type. The third option is the killer demo: it shows what happens when a retrieval system is doing what Mira does. The result shown here is for a student who really studied.

This student's exam score: 79%

VERDICT: ✓ This student really studied.

They beat all four imaginary students by a clear margin (more than 5 percentage points). The score reflects real knowledge.

This demo uses simulated numbers calibrated to a 50-question exam. The real tool uses the same logic on retrieval and ranking systems with hundreds-to-thousands of test queries and proper statistical comparisons. See the CS01 NFCorpus case study for the real numbers.
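This page doesn't spell out what "proper statistical comparisons" means in the real tool. One common choice for comparing two systems on the same queries is a paired bootstrap, sketched below as an assumption rather than a description of the tool. The function name and the per-query scores are invented.

```python
import random


def paired_bootstrap_margin(system, null, n_resamples=2000, seed=0):
    """Mean per-query margin (system minus null) with a 95% bootstrap interval.

    Both lists must hold one score per query, in the same query order.
    """
    if len(system) != len(null):
        raise ValueError("system and null must cover the same queries")
    rng = random.Random(seed)
    diffs = [s - b for s, b in zip(system, null)]
    n = len(diffs)

    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Made-up per-query hits (1 = right document returned, 0 = not).
system_hits = [1] * 70 + [0] * 30
null_hits = [1] * 30 + [0] * 70

margin, lower, upper = paired_bootstrap_margin(system_hits, null_hits)
print(f"margin = {margin:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

The useful question is whether the whole interval sits above the margin you care about. With only a handful of queries the interval gets wide, which is one reason a 50-question demo can only be illustrative.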

If you stop reading here

You've now seen the methodology in one click. The Mira button shows the exact failure mode the tool was built to catch. That's the contribution.

If you have 8 minutes

What it actually says when you point it at a real engine.

The tool was developed and validated on a real Sanskrit and Tamil literary search engine over 10,000 test questions. Here are the actual numbers from the v0.1 release. Each row is a different "student" graded by all four imaginary tests.

Student type	Score	vs A	vs B	vs C	vs Mira	Verdict
Anti-oracle (always wrong)	0%	−25	−25	−25	−25	FAIL
Constant predictor	18%	+9	+10	−7	+0	FAIL
Mira-style	34%	+19	+19	+19	+0	FAIL
Real engine	79%	+55	+55	+55	+52	PASS
Oracle (perfect)	100%	+75	+75	+75	+74	PASS

Look at the third row. The Mira-style student gets 34%. That's positive. It beats imaginary students A, B, and C by 19 percentage points each. Without the Mira test, you'd say this student passed.

But the Mira test gives a margin of zero. The fourth check proves this student isn't really studying. It's just exploiting the popular-answer pattern.
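The verdict column can be recomputed from the margins. The sketch below copies the margins from the table and applies the "more than 5 percentage points over every imaginary student" rule quoted in the demo. Whether the released tool uses exactly this threshold, or a statistical test instead, is an assumption here.

```python
MARGIN_THRESHOLD_PP = 5  # taken from the demo text; assumed, not confirmed

# Margins in percentage points vs. students A, B, C and Mira, from the table.
rows = {
    "Anti-oracle (always wrong)": (-25, -25, -25, -25),
    "Constant predictor": (9, 10, -7, 0),
    "Mira-style": (19, 19, 19, 0),
    "Real engine": (55, 55, 55, 52),
    "Oracle (perfect)": (75, 75, 75, 74),
}


def gate(margins, threshold=MARGIN_THRESHOLD_PP):
    """PASS only if the system clears every baseline by more than the threshold."""
    return "PASS" if all(m > threshold for m in margins) else "FAIL"


verdicts = {name: gate(margins) for name, margins in rows.items()}

# The same rule with only the first three checks (A, B, C), i.e. no Mira test.
three_check_verdicts = {name: gate(margins[:3]) for name, margins in rows.items()}

for name in rows:
    print(f"{name:28s} four checks: {verdicts[name]}   "
          f"three checks: {three_check_verdicts[name]}")
```

With only three checks the Mira-style row flips to PASS. The fourth margin is the only thing standing between that system and a clean bill of health.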

This isn't hypothetical. Our own first published numbers were inflated by a programming bug that produced exactly this pattern. We retracted them publicly. That retraction is what gave the validated numbers their weight.

If you stop reading here

You've seen the methodology applied to real numbers, and seen us catch our own mistake. That's about as honest as a research project can be in public.

If you have 12 minutes

Things I tried that failed.

I'm telling you these because you should know I'm not selling a magic box. I'm selling something that has been honestly tested and has visible flaws.

1. The fancy maths feature that turned out negative.

I tried adding a sophisticated analysis technique to make the AI better. After running it on 10,000 test questions, the difference was statistically significant, and in the wrong direction: slightly worse than not using it. I disabled the feature. The whole point of this kind of testing is to catch yourself when you're wrong. That was the catch.

2. The programming bug that inflated my first numbers.

A subtle Python issue meant some of my settings were silently using old values. The published numbers were higher than the real ones. When I found the bug, I retracted the inflated numbers publicly. The corrected numbers were lower but real.

3. A friend in India broke the install in 90 seconds.

Akosh tried to install my tool last week. Three things in the README were wrong. The "pip install" command pointed to a package that doesn't exist. The demo command had a typo. The clone URL was a placeholder I forgot to fill in. He got "failed basic bench tests," exactly the kind of report that would normally mean my whole library is broken. It wasn't broken. My install instructions were. Forty-five minutes after he reported it, the fixed version was live. He's named in the project's changelog as the v0.1.2 reporter.

4. I had a $2,000 bug bounty. I suspended it.

The original methodology paper offered $2,000 cash to anyone who could break the methodology in specific ways. After the install bugs surfaced, I suspended the bounty pending broader validation. Until the implementation has been validated by an external lab, paying for breakage when I haven't yet hardened the entry doors is the wrong order.

What this list is not

It is not a list of fatal flaws. Every active research project has these. It is a list of moments where the system caught itself. That's what you're looking for in any methodology. Not "did it have problems," but "did it surface them."

If you stop reading here

You've seen the failure history. The next layer is about what's still genuinely uncertain.

If you have 15 minutes

What this tool cannot tell you, and who has actually checked it.

Five honest limits

It only checks systems that fetch documents or rank options. Not the part of ChatGPT, Claude, or Gemini that writes paragraphs of free-form text — those are generative language models, a different problem with different mathematics, and a different test we haven't built.
It can't tell you the AI is "smart," only that it isn't faking. Green light means doing real work. It does not mean good enough for your specific job.
You bring the questions and answers. Garbage exam, garbage verdict.
It assumes test questions are independent. If half your questions are basically the same question rephrased, the tool's confidence will be too high. Known limit.
It doesn't catch plagiarism, training-data leakage, or AI hallucinations. Those are different problems.
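The independence limit is the easiest of the five to see in numbers. The sketch below is a toy with random made-up scores, not the tool's statistics. It takes 50 distinct questions, then pretends each one was asked twice in different words.

```python
import math
import random


def naive_standard_error(scores):
    """Standard error of the mean, treating every score as independent."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(variance / n)


rng = random.Random(0)

# 50 genuinely distinct questions with made-up per-question scores.
distinct = [rng.random() for _ in range(50)]

# The same 50 questions, each "rephrased" once: 100 rows, no new information.
rephrased = distinct * 2

se_distinct = naive_standard_error(distinct)
se_rephrased = naive_standard_error(rephrased)
ratio = se_rephrased / se_distinct

print(f"50 distinct questions: SE = {se_distinct:.4f}")
print(f"100 rows with repeats: SE = {se_rephrased:.4f}")
print(f"ratio = {ratio:.2f}")
```

The error bar shrinks by roughly a factor of the square root of two even though nothing new was measured. That is the sense in which the tool's confidence ends up too high when questions repeat.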

Who has independently checked this

Honest answer as of today: almost nobody, yet. The methodology paper is public. The code is public. The math is undergraduate-level statistics packaged so you don't need to be a statistician.

The independent validation step (having a major AI research lab run the tool against their own published results and confirm the methodology works) is the highest-priority next step. It has not happened yet. If you're at one of those labs and reading this, that's what I most need.

What we're not the first to do

Other people have built similar things. We stand on:

BEIR (Thakur et al., 2021). Diversifies the bench. We diversify the imaginary students.
MTEB (Muennighoff et al., 2022). Leaderboard infrastructure. We are a per-system audit.
OLMES (AI2, 2024). Open evaluation standards. We add the missing fourth check.
RAGAS (Es et al., 2023). Evaluates the answer-generation side. We evaluate the document-fetching side.
The TREC tradition going back to Voorhees, 1998. Meta-evaluation as a discipline.

The contribution here is narrow and specific. The Mira-style imaginary student, packaged as a 30-second four-test gate that anyone can run. Everything else is lineage.

What you can do

Three actions, smallest to largest.

Smallest. Share this page with one person who works in AI, science, healthcare, or finance. Ask whether the systems they trust have been checked this way.
Medium. If you know someone at a research lab (EleutherAI, METR, AI2, HuggingFace, Anthropic, Cohere), forward the technical version and ask them to run the 30-second verification.
Largest. If you build software yourself, the project lives at github.com/spalsh-spec/falsify-eval. Run falsify-eval doctor. Takes 30 seconds. Tell us if it doesn't work.

Search and ranking systems take exams. This catches the ones that didn't really study.
