An excellent new paper on LLMs from six AI researchers at Apple who were brave enough to challenge the dominant paradigm has just come out.
Everyone actively working with AI should read it, or at least this terrific X thread by senior author Mehrdad Farajtabar, which summarizes what they found. One key passage:
“we found no evidence of formal reasoning in language models …. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!”
One particularly damning result was a new task the Apple team developed, called GSM-NoOp
§
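To make GSM-NoOp concrete, here is a minimal sketch of the kind of probe involved, loosely adapted from the paper’s kiwi example: append an irrelevant (“no-op”) clause to a grade-school word problem and check whether the model’s answer changes. The query_model function below is a hypothetical stand-in for whatever LLM API you happen to use, not the Apple team’s actual harness.

```python
# Minimal sketch of a GSM-NoOp-style robustness probe (illustrative only).
# query_model is a hypothetical stand-in for an actual LLM API call.

def query_model(prompt: str) -> str:
    """Hypothetical: send the prompt to an LLM and return its final numeric answer."""
    raise NotImplementedError("plug in your own model call here")

BASE_PROBLEM = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "On Sunday, he picks double the number of kiwis he did on Friday. "
    "How many kiwis does Oliver have?"
)

# An irrelevant ("no-op") clause: it mentions a number but does not change the answer.
NOOP_CLAUSE = " Five of the kiwis picked on Sunday were a bit smaller than average."

EXPECTED_ANSWER = "190"  # 44 + 58 + 2 * 44

def run_probe() -> None:
    baseline = query_model(BASE_PROBLEM)
    perturbed = query_model(BASE_PROBLEM + NOOP_CLAUSE)
    # A system doing formal reasoning would ignore the no-op clause entirely;
    # the Apple finding is that models often subtract the irrelevant 5.
    print("baseline correct: ", baseline.strip() == EXPECTED_ANSWER)
    print("perturbed correct:", perturbed.strip() == EXPECTED_ANSWER)
    print("answer changed:   ", baseline.strip() != perturbed.strip())
```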
This kind of flaw, in which reasoning fails in light of distracting material, is not new. Robin Jia and Percy Liang of Stanford ran a similar study, with similar results, back in 2017 (which Ernest Davis and I cited in Rebooting AI, in 2019):
§
𝗧𝗵𝗲𝗿𝗲 𝗶𝘀 𝗷𝘂𝘀𝘁 𝗻𝗼 𝘄𝗮𝘆 𝘆𝗼𝘂 𝗰𝗮𝗻 𝗯𝘂𝗶𝗹𝗱 𝗿𝗲𝗹𝗶𝗮𝗯𝗹𝗲 𝗮𝗴𝗲𝗻𝘁𝘀 𝗼𝗻 𝘁𝗵𝗶𝘀 𝗳𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻, where changing a word or two in irrelevant ways or adding a small amount of irrelevant information can give you a different answer.
§
Another manifestation of the lack of sufficiently abstract, formal reasoning in LLMs is the way in which performance often falls apart as problems are made bigger. This comes from a recent analysis of GPT o1 by Subbarao Kambhampati’s team:
Performance is OK on small problems, but quickly tails off.
§
We can see the same thing with integer arithmetic. Falloff on increasingly large multiplication problems has repeatedly been observed, both in older models and newer models. (Compare with a calculator, which would be at 100%.)
Even o1 suffers from this:
§
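For readers who want to try this themselves, here is a rough sketch of how the multiplication falloff is typically measured: generate random n-digit by n-digit products, score exact match, and note that the “calculator” baseline (here, Python’s built-in multiplication) is at 100% by construction. Again, query_model is a hypothetical placeholder for a real model call.

```python
import random

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call that returns a numeric string."""
    raise NotImplementedError("plug in your own model call here")

def multiplication_accuracy(n_digits: int, trials: int = 50) -> float:
    """Exact-match accuracy on random n-digit x n-digit multiplication."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        reply = query_model(f"What is {a} * {b}? Answer with the number only.")
        correct += (reply.strip() == str(a * b))  # a * b is the calculator, right 100% of the time
    return correct / trials

# The pattern repeatedly reported: near-ceiling accuracy for small operands,
# then a sharp decline as the number of digits grows.
# for n in range(1, 10):
#     print(n, multiplication_accuracy(n))
```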
Failure to follow the rules of chess is another continuing failure of formal reasoning:
§
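One way these chess failures get caught is purely mechanical: replay a model’s moves through a rules engine and flag the first illegal one. Here is a small sketch using the python-chess library, with game_moves standing in for moves extracted from a model’s transcript (the specific game is made up for illustration):

```python
import chess  # pip install python-chess

def first_illegal_move(moves_san):
    """Replay SAN moves on a fresh board; return the index of the first illegal move, or None."""
    board = chess.Board()
    for i, move in enumerate(moves_san):
        try:
            board.push_san(move)  # raises ValueError if the move is illegal or unparseable
        except ValueError:
            return i
    return None

# Hypothetical model transcript: the last move ignores where the bishop can actually go.
game_moves = ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Bxe5"]
print(first_illegal_move(game_moves))  # -> 6 (Bxe5 is not a legal move from b5)
```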
Elon Musk’s putative robotaxis are likely to suffer from a similar affliction: they may well work safely in the most common situations, but are also likely to struggle to reason abstractly enough in some circumstances. (We are, however, unlikely ever to get systematic data on this, since the company isn’t transparent about what it has done or what the results are.)
§
The refuge of the LLM fan is always to write off any individual error. The patterns we see here, in the new Apple study, in the other recent work on math and planning (which fits with many previous studies), and even in the anecdotal data on chess, are too broad and systematic for that.
§
The inability of standard neural network architectures to reliably extrapolate — and reason formally — has been the central theme of my own work dating back to 1998 and 2001, and has been a theme in all of my challenges to deep learning, going back to 2012, and to LLMs in 2019.
I strongly believe the current results are robust. After a quarter century of “real soon now” promissory notes, I would want a lot more than hand-waving to be convinced that an LLM-compatible solution is within reach.
What I argued in 2001, in The Algebraic Mind, still holds: symbol manipulation, in which some knowledge is represented truly abstractly in terms of variables and operations over those variables, much as we see in algebra and traditional computer programming, must be part of the mix. Neurosymbolic AI — combining such machinery with neural networks — is likely a necessary condition for going forward.
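To give a concrete flavor of what “operations over variables” means, here is a tiny illustration (just a sketch using the off-the-shelf sympy library, not a proposal for any particular architecture): an algebraic identity stated once over variables holds for every binding of those variables, with no training distribution to fall outside of.

```python
from sympy import symbols, expand, simplify

# A rule stated over variables, not over memorized instances:
x, y = symbols("x y")
identity = expand((x + y) ** 2) - (x**2 + 2 * x * y + y**2)
print(simplify(identity))  # 0: the identity holds for all x and y

# The same abstraction handles any particular binding exactly, no retraining required:
print(((x + y) ** 2).subs({x: 1234, y: 5678}))  # 47775744, i.e. 6912 squared
```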
Gary Marcus is the author of The Algebraic Mind, a 2001 MIT Press book that anticipated the Achilles’ heel of current models. In his most recent book, Taming Silicon Valley (also MIT Press), he discusses the need for alternative research strategies in Chapter 17.