IPTV Techs

Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

For a while now, companies like OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical “reasoning” displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.

The fragility highlighted in these new results supports previous research suggesting that LLMs’ use of probabilistic pattern matching lacks the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. “Current LLMs are not capable of genuine logical reasoning,” the researchers hypothesize based on these results. “Instead, they attempt to replicate the reasoning steps observed in their training data.”

Mix It Up

In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models” (currently available as a preprint paper), the six Apple researchers start with GSM8K’s standardized set of more than 8,000 grade-school-level mathematical word problems, which is commonly used as a benchmark for modern LLMs’ complex reasoning capabilities. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values, so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.
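
The substitution idea the researchers describe can be sketched in a few lines: treat a concrete word problem as a template, then sample fresh names and numbers while recomputing the ground-truth answer so the math stays checkable. This is only an illustration of the general technique; the template, name lists, and value ranges below are invented for the example, not taken from the paper.

```python
import random

# Illustrative template in the spirit of GSM-Symbolic: the surface details
# vary per seed, but the underlying arithmetic structure is unchanged.
TEMPLATE = ("{name} buys {n} building blocks for her {relative}. "
            "Each block costs ${cost}. How much does {name} spend?")

NAMES = ["Sophie", "Bill", "Ana", "Ravi"]
RELATIVES = ["nephew", "brother", "niece"]

def make_variant(seed: int) -> tuple[str, int]:
    """Generate one perturbed question plus its recomputed answer."""
    rng = random.Random(seed)  # seeded, so each variant is reproducible
    name = rng.choice(NAMES)
    relative = rng.choice(RELATIVES)
    n = rng.randint(5, 40)
    cost = rng.randint(1, 9)
    question = TEMPLATE.format(name=name, n=n, relative=relative, cost=cost)
    answer = n * cost  # ground truth is recomputed, so grading stays exact
    return question, answer

question, answer = make_variant(0)
```

Because the answer is recomputed from the sampled values rather than copied from a fixed answer key, a model can be graded exactly on every variant, which is what lets the evaluation detect accuracy drops that pure memorization would hide.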

This approach helps avoid any potential “data contamination” that can result from the static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes don’t alter the actual difficulty of the underlying mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as on GSM8K.

Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.

This kind of variance, both within different GSM-Symbolic runs and compared to GSM8K results, is more than a little surprising since, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any “formal” reasoning but are instead “attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”

Don’t Get Distracted

Still, the overall variance shown in the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That’s a pretty high success rate on either benchmark, regardless of whether or not the model itself is using “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).

The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” benchmark set (short for “no operation”), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”
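
The "no-op" perturbation above amounts to appending a sentence that changes no quantity the solution depends on, then checking whether a model's answer shifts anyway. A minimal sketch of that idea, with an invented base question and helper (the exact wording used in GSM-NoOp is the paper's, not this code's):

```python
# Illustrative GSM-NoOp-style perturbation: the appended clause is
# numerically irrelevant, so a robust solver should give the same
# answer to both the base and perturbed forms of the question.
DISTRACTOR = " Note that five of the kiwis were a bit smaller than average."

def add_noop(question: str) -> str:
    """Append a factually irrelevant statement to a word problem."""
    return question.rstrip() + DISTRACTOR

base = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does he have?")
perturbed = add_noop(base)
```

Scoring then reduces to comparing a model's numeric answer on `base` against its answer on `perturbed`; any divergence indicates the model is reacting to surface wording rather than the problem's actual operations.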

Adding in these red herrings led to what the researchers termed “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits of using simple “pattern matching” to “convert statements into operations without truly understanding their meaning,” the researchers write.

Source link
