Stuff AI CAN'T Do

How we score

Jury methodology

How an AI panel rates each capability claim and how those individual votes combine into a single verdict.

⚖ What is the jury?

Each topic on this site (e.g. "Can AI translate Yoruba accurately?") gets reviewed by a rotating panel of AI models — between 3 and 7 of them per check, drawn from different model families and different vendors. We call this panel the jury.

We deliberately do not publish which models sit on a given check, and we never name them in public verdicts. The point of the jury is to capture the consensus of independent reasoning systems, not to advertise specific brands or invite gaming. Internally, administrators can audit which model returned which verdict, so the process remains accountable.

🗳️ What each juror does

Every juror is given the same prompt:

  1. Read the statement (e.g. "Can AI compose a fugue in the style of Bach?")
  2. Return a single verdict: CAN, CAN NOT, or UNDECIDED.
  3. Give a one-sentence reason for the verdict.
  4. If the verdict is CAN, estimate the month and year the capability first reliably emerged.

Each juror answers independently. None of them sees the others' verdicts. This avoids the herd effect you'd get if one model's answer anchored the rest.
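The polling step above can be sketched in a few lines. This is an illustrative sketch, not the site's actual code: the `JurorAnswer` fields, the prompt wording, and the `jurors` callables are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class JurorAnswer:
    verdict: str             # "CAN", "CAN NOT", or "UNDECIDED"
    reason: str              # one-sentence justification
    emerged: Optional[str]   # "YYYY-MM" estimate, only when verdict is "CAN"

# Hypothetical prompt; the real wording is not published.
PROMPT_TEMPLATE = (
    "Statement: {statement}\n"
    "Answer with exactly one verdict: CAN, CAN NOT, or UNDECIDED.\n"
    "Give a one-sentence reason.\n"
    "If CAN, estimate the month and year the capability first reliably emerged."
)

def poll_jury(statement: str,
              jurors: list[Callable[[str], JurorAnswer]]) -> list[JurorAnswer]:
    """Send the identical prompt to every juror.

    Each juror is queried independently and never sees the others'
    answers, which is what prevents anchoring.
    """
    prompt = PROMPT_TEMPLATE.format(statement=statement)
    return [ask(prompt) for ask in jurors]
```

Each element of `jurors` would wrap one model behind a uniform call signature, so the panel can rotate between 3 and 7 members without changing this code.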

📊 How verdicts combine

A statement's status (CAN / CAN NOT / DISPUTED) is decided by the cumulative tally of every juror verdict ever recorded for it — not the most recent check alone. As more checks accumulate over weeks, the tally smooths out noise from any single panel.

The rules, in order:

  • Need at least 2 verdicts. A single juror can't flip a status — the topic stays DISPUTED until a second juror weighs in.
  • Unanimous wins immediately. If every juror agrees (e.g. 3-of-3 say CAN NOT), the verdict settles right away — no ambiguity to resolve.
  • Otherwise, 80% agreement settles it. Once at least 3 verdicts have accumulated, the verdict flips to whichever direction crosses the 80% threshold. For example: 11 say CAN, 1 says CAN NOT → CAN (91%).
  • Below 80% = DISPUTED. If the panel can't agree at 80%+, the topic stays DISPUTED, which is its own honest answer — it means the experts genuinely disagree.
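The four rules above can be expressed as one small function. A minimal sketch, assuming UNDECIDED votes count toward the total tally but toward neither side (the rules don't spell this out):

```python
from collections import Counter

def combine(verdicts: list[str]) -> str:
    """Combine the cumulative juror tally into a topic status.

    verdicts: every "CAN" / "CAN NOT" / "UNDECIDED" verdict ever
    recorded for the topic, not just the latest check.
    """
    total = len(verdicts)
    if total < 2:                      # rule 1: need at least 2 verdicts
        return "DISPUTED"
    tally = Counter(verdicts)
    for side in ("CAN", "CAN NOT"):
        if tally[side] == total:       # rule 2: unanimous settles immediately
            return side
        # rule 3: with 3+ verdicts, crossing 80% settles it
        if total >= 3 and tally[side] / total >= 0.80:
            return side
    return "DISPUTED"                  # rule 4: below 80% stays DISPUTED
```

Running the example from the rules: `combine(["CAN"] * 11 + ["CAN NOT"])` returns `"CAN"`, since 11 of 12 is about 91%.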

🔄 How often jurors run

The jury runs continuously. Stalest topics (longest since their last check) are reviewed first. Every check writes a permanent row in the audit log at the bottom of each topic page, showing how many jurors participated and the verdict breakdown that day.
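Stalest-first scheduling reduces to sorting topics by their last-check timestamp. A sketch under an assumed schema (a mapping from topic id to last-check time; the real storage layout is not described):

```python
import heapq
from datetime import datetime

def next_topics_to_check(last_checked: dict[str, datetime], n: int = 5) -> list[str]:
    """Return the n stalest topics: longest since their last jury check.

    The oldest timestamps sort first, so those topics get re-checked
    before anything reviewed more recently.
    """
    return heapq.nsmallest(n, last_checked, key=last_checked.get)
```

`heapq.nsmallest` avoids fully sorting the topic list when only the next batch is needed.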

Because AI capabilities change month to month, a verdict isn't a one-time judgement — it's the current rolling consensus. A topic that was CAN NOT in March can flip to CAN by June, and the audit log preserves that history.

🧑‍⚖️ Audience votes vs. jury verdicts

The audience bar ("What the audience thinks") and the jury verdict are two separate signals — they do not influence each other.

  • Audience votes are human opinions, useful for spotting where popular intuition differs from expert assessment.
  • Jury verdicts are the source of truth for the CAN / CAN NOT / DISPUTED status pill.

When humans and the jury disagree, that's editorially interesting — often it surfaces an emerging capability the public hasn't caught up to yet, or a hype claim the jury isn't buying.

🤔 Why not name the AIs?

Naming jurors creates problems we want to avoid:

  • Vendor cheerleading — "model X says Y!" makes the site into a marketing channel.
  • Targetable gaming — once people know which models judge, prompts and content can be tuned to game specific ones.
  • Brand bias in your reading — you might trust or distrust a verdict based on which logo cast it, instead of the consensus.

Treating jurors as an anonymous panel keeps the focus on the verdict, not the voter.

Last updated May 2026

Got one we missed?

Add a statement to the atlas. We review weekly.