OpenAI researchers have admitted that even the most advanced AI models are still no match for human coders, even though CEO Sam Altman insists they will be able to beat “low-level” software engineers by the end of this year.
In a new paper, the company’s researchers found that even frontier models, the most advanced and boundary-pushing AI systems, “are still unable to solve the majority” of coding tasks.
The researchers used a newly developed benchmark called SWE-Lancer, built on more than 1,400 software engineering tasks from the freelancer site Upwork. Using the benchmark, OpenAI put three large language models (LLMs) to the test: its own o1 reasoning model and flagship GPT-4o, as well as Anthropic’s Claude 3.5 Sonnet.
Specifically, the new benchmark assessed how well the LLMs performed on two types of tasks from Upwork: individual tasks, which involved resolving bugs and implementing fixes for them, and management tasks, which saw the models trying to zoom out and make higher-level decisions. (The models weren’t allowed to access the internet, meaning they couldn’t simply crib answers that had been posted online.)
The models took on tasks cumulatively worth hundreds of thousands of dollars on Upwork, but they were only able to fix surface-level software issues, while remaining unable to actually find bugs in larger projects or track down their root causes. These shoddy and half-baked “solutions” will likely be familiar to anyone who’s worked with AI, which is great at spitting out confident-sounding information that often falls apart on closer inspection.
Though all three LLMs were often able to operate “far faster than a human would,” the paper notes, they also failed to grasp how widespread bugs were or to understand their context, “leading to solutions that are incorrect or insufficiently comprehensive.”
As the researchers explained, Claude 3.5 Sonnet performed better than the two OpenAI models pitted against it, earning more money than o1 and GPT-4o. Still, the majority of its answers were wrong, and according to the researchers, any model would need “higher reliability” to be trusted with real-life coding tasks.
Put more plainly, the paper seems to show that although these frontier models can work quickly and solve narrowly scoped tasks, they’re nowhere near as skilled at handling them as human engineers.
Though these LLMs have advanced rapidly over the past few years and will likely continue to do so, they’re not skilled enough at software engineering to replace real-life people quite yet. Not that that’s stopping CEOs from firing their human coders in favor of immature AI models.
More on AI and coding: Zuckerberg Announces Plans to Automate Facebook Coding Jobs With AI