Debates over AI benchmarks, and how AI labs report them, are spilling out into public view.
This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.
The truth lies somewhere in between.
In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are frequently used to probe a model’s math ability.
xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”
What is cons@64, you might ask? Well, it’s short for “consensus@64,” and it essentially gives a model 64 tries to answer each problem in a benchmark, taking the answer it produced most frequently as the final answer. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph can make one model appear to surpass another when in reality that isn’t the case.
Grok 3 Reasoning Beta’s and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” (meaning the first score the models got on the benchmark) fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever-so-slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”
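To make the scoring difference concrete, here is a minimal sketch of the two grading rules in Python. Everything in it (the `model` callable, the `answer_key` mapping, the function names) is a hypothetical illustration, not xAI’s or OpenAI’s actual evaluation code:

```python
from collections import Counter

def score_benchmark(model, problems, answer_key, k=64):
    """Grade a benchmark two ways: @1 (first attempt) and cons@k (majority vote).

    `model` is any callable that maps a problem to one sampled answer string;
    `answer_key` maps each problem to its correct answer. Both are hypothetical
    stand-ins for a real evaluation harness.
    """
    first_try_correct = 0
    consensus_correct = 0
    for problem in problems:
        # Sample k independent answers from the model.
        samples = [model(problem) for _ in range(k)]
        # @1: grade only the first sampled answer.
        if samples[0] == answer_key[problem]:
            first_try_correct += 1
        # cons@k: grade whichever answer appeared most often across the k samples.
        majority_answer, _ = Counter(samples).most_common(1)[0]
        if majority_answer == answer_key[problem]:
            consensus_correct += 1
    n = len(problems)
    return first_try_correct / n, consensus_correct / n
```

Because the majority vote smooths over unlucky individual samples, cons@64 will typically score at or above @1 for the same model, which is why mixing the two metrics in a single chart can be so misleading.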
Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:
Hilarious how some people view my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek misinformation
(I actually think Grok looks great there, and openAI’s TTC chicanery behind o3-mini-*high*-pass@”””1″”” deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic — Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations, and about their strengths.