The test is NOT "ask the AI if it can do X"; every model would simply say yes. Instead, we ask each AI to actually do the thing, then have a separate model judge the result.
Concrete example
Statement: "AI can translate Yoruba accurately."
1️⃣ A "writer" model generates (see the sketch after this list):
- Test prompt — "Translate this English sentence to Yoruba: 'The market opens at six in the morning when the rain stops.' Return only the translation."
- Rubric — "1. Grammatically correct Yoruba. 2. Tense matches source. 3. 'Market' rendered as native equivalent. 4. No English words leak through."
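A minimal sketch of what step 1 might produce in code. The `GeneratedTest` structure, the `WRITER_PROMPT` wording, and the `call_model(name, prompt)` helper are illustrative assumptions rather than the project's actual implementation; the test prompt and rubric are the ones from the example above.

```python
from dataclasses import dataclass

# Hypothetical helper, assumed throughout these sketches:
# def call_model(model_name: str, prompt: str) -> str: ...
# It sends `prompt` to the named model and returns the raw text reply.

@dataclass
class GeneratedTest:
    statement: str      # the capability claim under test
    test_prompt: str    # the single text prompt every panel model receives
    rubric: list[str]   # criteria the judge model will apply

# Illustrative writer prompt (wording is an assumption, not the real template).
WRITER_PROMPT = """For the claim "AI can translate Yoruba accurately",
write (1) one concrete test prompt and (2) a short numbered rubric
that a separate judge model can score objectively."""

yoruba_test = GeneratedTest(
    statement="AI can translate Yoruba accurately",
    test_prompt=(
        "Translate this English sentence to Yoruba: 'The market opens at six "
        "in the morning when the rain stops.' Return only the translation."
    ),
    rubric=[
        "Grammatically correct Yoruba.",
        "Tense matches source.",
        "'Market' rendered as native equivalent.",
        "No English words leak through.",
    ],
)
```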
2️⃣ Each AI on the panel takes the test
Gemini, Mistral, Groq, etc. each receive the prompt and produce their own translation.
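Step 2 is a simple fan-out: the same prompt goes to every panel model. A sketch, again assuming the hypothetical `call_model` helper; the identifiers in `PANEL` are placeholders, not the panel's real configuration.

```python
PANEL = ["gemini", "mistral", "groq-hosted-model"]  # placeholder identifiers

def run_panel(test, call_model):
    """Send the identical test prompt to every panel model and collect raw outputs."""
    return {name: call_model(name, test.test_prompt) for name in PANEL}

# outputs = run_panel(yoruba_test, call_model)
# -> {"gemini": "<Yoruba translation>", "mistral": "<Yoruba translation>", ...}
```

Keeping the prompt identical across models is what makes the resulting verdicts comparable.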
3️⃣ A "judge" model scores every output against the rubric (see the sketch after this list)
- ✅ PASS — all rubric criteria met
- ◐ PARTIAL — most met, but missed something material
- ✗ FAIL — one or more rubric criteria clearly not met
- ○ SKIPPED — task can't be reduced to a single text prompt
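A sketch of the judging step, reusing the hypothetical `call_model` helper. The judge prompt wording and the one-word-reply parsing are assumptions; SKIPPED is not returned by the judge here because, per the list above, it applies when a task can't be expressed as a single text prompt and so never reaches the judge.

```python
VERDICTS = {"PASS", "PARTIAL", "FAIL"}

def judge(test, model_name, output, call_model):
    """Ask the judge model to score one panel output against the rubric."""
    rubric_text = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(test.rubric))
    judge_prompt = (
        f"Rubric:\n{rubric_text}\n\n"
        f"Candidate output from {model_name}:\n{output}\n\n"
        "Reply with exactly one word: PASS (all criteria met), "
        "PARTIAL (most met, something material missed), or FAIL (criteria not met)."
    )
    # "judge-model" is a placeholder identifier for whichever model does the scoring.
    verdict = call_model("judge-model", judge_prompt).strip().upper()
    return verdict if verdict in VERDICTS else "FAIL"  # conservative fallback on malformed replies
```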
Re-run monthly. Every test is logged with the raw AI output, so verdicts are auditable. Capabilities change fast — what failed last month may pass this month.
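And one way the audit trail could be kept: an append-only JSONL log where the raw output sits next to the verdict, so any score can be re-checked when the tests are re-run. Field names and the file path are illustrative assumptions.

```python
import json
import datetime

def log_result(test, model_name, raw_output, verdict, path="panel_log.jsonl"):
    """Append one auditable record; the raw output is stored verbatim beside the verdict."""
    record = {
        "run_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "statement": test.statement,
        "model": model_name,
        "raw_output": raw_output,  # kept so the verdict can be re-checked later
        "verdict": verdict,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```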