Did xAI lie about Grok 3’s benchmarks?


Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

What is cons@64, you might ask? Well, it’s short for “consensus@64,” and it basically gives a model 64 tries to answer each problem in a benchmark and takes the answers generated most often as the final answers. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in fact, that isn’t the case.
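The majority-vote scoring described above can be sketched in a few lines of Python. This is an illustrative sketch, not xAI’s or OpenAI’s actual evaluation code; the helper names (`cons_at_k`, `score_cons_at_k`) are hypothetical:

```python
from collections import Counter

def cons_at_k(answers):
    """Consensus@k for one problem: of k sampled attempts, take the
    answer that appears most often (majority vote) as the final answer."""
    return Counter(answers).most_common(1)[0][0]

def score_cons_at_k(samples_per_problem, gold_answers):
    """Benchmark score under consensus@k: the fraction of problems
    whose majority-vote answer matches the reference answer."""
    correct = sum(
        cons_at_k(samples) == gold
        for samples, gold in zip(samples_per_problem, gold_answers)
    )
    return correct / len(gold_answers)
```

Because a correct answer only has to win the vote among 64 samples rather than appear on the first try, this metric is typically more forgiving than a single-attempt (“@1”) score, which is exactly why comparing one model’s cons@64 against another’s @1 can mislead.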

Grok 3 Reasoning Beta’s and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” — meaning the first score the models got on the benchmark — fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever-so-slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past — albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations — and their strengths.


