So the big news this week is that o3, OpenAI’s new language model, got 25% on FrontierMath. Let’s start by explaining what this means.
What is o3? What is FrontierMath?
A language model, as probably most people know, is one of these things like ChatGPT where you can ask it a question and it will write some sentences which are an attempt to give you an answer. There were language models before ChatGPT, and on the whole they couldn’t even write coherent sentences and paragraphs. ChatGPT was really the first public model which was coherent. There have been many other models since. Right now they’re still getting better really fast. How much longer this will go on for nobody knows, but there are lots of people pouring lots of money into this game so it would be a fool who bets on progress slowing down any time soon. o3 is a new language model.
FrontierMath is a secret dataset of “hundreds” of hard maths questions, curated by Epoch AI, and announced last month. “Hundreds” is a quote from the paper (first line of the abstract), but I’ve heard a rumour that when the paper came out there were under 200 questions, although I’ve heard another rumour that apparently more have been added since. As an academic mathematician who has spent their entire life collaborating openly on research problems and sharing my ideas with other people, it frustrates me a little that already in this paragraph we’ve seen more questions than answers: I am not even able to give you a coherent description of some basic facts about this dataset, for example its size. However there is a good reason for the secrecy. Language models train on huge databases of knowledge, so the moment you make a database of maths questions public, the language models will train on it. And then if you ask such a model a question from the database it will probably just rattle off the answer which it already saw.
How difficult is the FrontierMath dataset?
So what are the questions in the FrontierMath dataset like? Here’s what we know. They’re not “prove this theorem!” questions, they’re “find this number!” questions. More precisely, the paper says “Problems had to have definitive, computable answers that could be automatically verified”, and in the five sample problems which were made public from the dataset (Appendix A of the paper, pages 14 to 23) the solutions are all positive whole numbers (one answer is 9811, another is 367707, and the final three solutions are even bigger; clearly these questions are designed in such a way that random guesswork is extremely unlikely to succeed). The sample questions are nontrivial, even to a research mathematician. I understood the statements of all five questions. I could do the third one relatively quickly (I had seen the trick before that the function mapping a natural number n to alpha^n was p-adically continuous in n iff the p-adic valuation of alpha-1 was positive) and I knew exactly how to do the 5th one (it’s a standard trick involving the Weil conjectures for curves) but I didn’t bother doing the algebra to work out the exact 13-digit answer. The first and second questions I knew I couldn’t do, and I figured I might be able to make progress on the 4th one if I put some genuine effort in, but ultimately I didn’t attempt it, I just read the solution. I suspect that a typical smart mathematics undergraduate would struggle to do even one of these questions. To do the first one you would, I imagine, have to be at least a PhD student in analytic number theory. The FrontierMath paper contains some quotes from mathematicians about the difficulty level of the problems. Tao (Fields Medal) says “These are extremely challenging” and suggests that they can only be tackled by a domain expert (and indeed the two sample questions which I could solve are in arithmetic, my area of expertise; I failed to do all of the ones outside my area). Borcherds (also Fields Medal) however is quoted in the paper as saying that machines producing numerical answers “aren’t quite the same as coming up with original proofs”.
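Here, roughly and from memory, is the fact about p-adic continuity which I’m alluding to; this is a sketch of the trick, not a quote from the problem or its official solution, and I’m glossing over edge cases rather than being careful about them.

```latex
% Rough statement, from memory, of the p-adic continuity criterion mentioned above.
Let $p$ be a prime and let $\alpha \in \mathbb{Z}_p \setminus \{0\}$. Then the map
\[
  \mathbb{N} \to \mathbb{Z}_p, \qquad n \mapsto \alpha^{n},
\]
is continuous for the $p$-adic topology on $\mathbb{N}$ if and only if
$v_p(\alpha - 1) > 0$, i.e.\ if and only if $\alpha \equiv 1 \pmod{p}$.
When this holds, the map is uniformly continuous and so extends to a continuous
function $\mathbb{Z}_p \to \mathbb{Z}_p$.
```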
So why make such a dataset? The problem is that grading “hundreds” of answers to “prove this theorem!” questions is expensive (one would not expect a machine to do grading at this level, at least in 2024, so one would have to pay human experts), whereas checking whether hundreds of numbers in one list correspond to hundreds of numbers in another list can be done in a fraction of a second by a computer. As Borcherds pointed out, mathematics researchers spend most of their time trying to come up with proofs or ideas, rather than numbers; however the FrontierMath dataset is still extremely valuable, because the area of AI for mathematics is desperately short of hard datasets, and creating a dataset such as this is very hard work (or equivalently very expensive).
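To make the contrast concrete, here is a toy sketch of what automated grading of “find this number!” answers amounts to. The file names and format are made up by me purely for illustration and have nothing to do with Epoch AI’s actual setup.

```python
# Toy sketch of automated grading for "find this number!" questions.
# The file names and JSON format here are hypothetical, not Epoch AI's harness.
import json

def grade(reference_path: str, submission_path: str) -> float:
    """Return the fraction of questions whose submitted integer matches the reference."""
    with open(reference_path) as f:
        reference = json.load(f)   # e.g. {"q1": 9811, "q2": 367707, ...}
    with open(submission_path) as f:
        submission = json.load(f)  # the model's answers, keyed the same way
    correct = sum(
        submission.get(qid) == answer  # exact integer match, nothing to interpret
        for qid, answer in reference.items()
    )
    return correct / len(reference)

# Grading a "prove this theorem!" script, by contrast, needs an expert human
# to read and judge the argument; there is no one-line check like the above.
```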
So there was an article about the dataset in Science and I was quoted in it as saying “If you have a system that can ace that database, then it’s game over for mathematicians.” Just to be clear: I had nothing to do with the dataset, I’ve only seen the five public questions, and was basing my comments on those. I also said “In my opinion, currently, AI is a long way away from being able to do those questions … but I’ve been wrong before”. And then this week there’s an announcement that the language model o3 got a score of 25 percent on the dataset. I was shocked.
What exactly has happened here?
Why was I shocked? Because my mental model of where “AI” currently is, when it comes to doing mathematics, is “undergrad or pre-undergrad”. It’s getting very good at “Olympiad-style” problems of the sort given to bright high-schoolers. Within a year it’s absolutely clear that AI systems will be passing undergraduate mathematics exams (not least because when you’re setting an undergraduate mathematics exam you ideally need to make sure that you don’t fail 50 percent of the class, so you throw in a couple of standard questions which are very similar to questions that the students have seen already, to guarantee that those with a basic understanding of the course will pass the exam. Machines will easily be able to ace such questions). But the jump from that to having creative ideas at advanced undergrad/early PhD level, beyond recycling standard ideas, seems to me to be quite a big one. For example I was very unimpressed by the ChatGPT answers to the recent Putnam exam posted here; as far as I can see only question B4 was answered adequately by the machine, and most of the other answers are worth one or two out of 10 at most. So I was expecting this dataset to remain pretty much unassailable for a couple of years.
My initial excitement was tempered however by a post from Elliot Glazer from Epoch AI on Reddit, where he claimed that in fact 25 percent of the problems in the dataset are “IMO/undergrad style problems”. This claim is a little confusing because I would be hard pressed to apply such adjectives to any of the five publicly-released problems in the dataset; even the simplest one used the Weil conjectures for curves (or a brute force argument which is probably just about possible but would be extremely painful, as it involves factoring 10^12 degree 3 polynomials over a finite field, although this could certainly be parallelised). This of course raises questions in my mind about what the actual level of the problems in this secret dataset is (or equivalently whether the five public questions are actually a representative sample), but this is not information which we’re likely to have access to. Given this new piece of information that 25 percent of the problems are undergraduate level, perhaps I will revert to being unsurprised again, but I will look forward to being surprised when AI is getting closer to 50 percent on the dataset, because performance at “qual level” (as Elliot describes it: the next 50 percent of the questions) is exactly what I’m waiting to see from these systems; for me this would represent a huge breakthrough.
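Just to illustrate the shape of that brute force alternative: the point is that each polynomial can be handled completely independently, so the work splits trivially across as many machines as you can find. Here is a toy over a tiny field, counting roots rather than doing full factorisations; it is emphatically not the actual problem, just an illustration of why the computation parallelises.

```python
# Toy illustration only: brute-force root-counting for every monic cubic over a
# small finite field F_p, chunked so the work could be spread across many cores.
# The real computation hinted at above would involve around 10^12 cubics.
from concurrent.futures import ProcessPoolExecutor

P = 31  # a small prime, to keep the toy fast

def roots_in_Fp(a: int, b: int, c: int, p: int = P) -> int:
    """Number of x in F_p with x^3 + a*x^2 + b*x + c = 0, by naive evaluation."""
    return sum((x * x * x + a * x * x + b * x + c) % p == 0 for x in range(p))

def chunk_total(a_values) -> int:
    """Total root count over all cubics whose x^2-coefficient lies in a_values."""
    return sum(roots_in_Fp(a, b, c) for a in a_values for b in range(P) for c in range(P))

if __name__ == "__main__":
    chunks = [range(i, min(i + 8, P)) for i in range(0, P, 8)]
    with ProcessPoolExecutor() as pool:             # each chunk is independent...
        total = sum(pool.map(chunk_total, chunks))  # ...so this parallelises trivially
    print(total)
```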
Prove this theorem!
However, as Borcherds points out, even if we ended up with a machine which was super-human at “find this number!” questions, it would still have limited applicability in many areas of research mathematics, where the key question of interest is usually how to “prove this theorem!”. In my mind, the biggest success story in 2024 is DeepMind’s AlphaProof, which solved four out of the six 2024 IMO (International Mathematics Olympiad) problems. These were either “prove this theorem!” or “find a number and furthermore prove that it’s the right number” questions, and for three of them the output of the machine was a fully formalized Lean proof. Lean is an interactive theorem prover with a solid mathematics library, mathlib, containing many of the techniques needed to solve IMO problems and a lot more besides; DeepMind’s system’s solutions were human-checked and verified to be “full marks” solutions. However, we are back at high school level again; whilst the questions are extremely hard, the solutions use only school-level techniques. In 2025 I’m sure we’ll see machines performing at gold medal standard in the IMO. However this now forces us to open up the “grading” can of worms which I’ve already mentioned once, and I’ll end this post by talking a little more about it.
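For readers who haven’t seen Lean: a formalized solution is a machine-checkable artefact. Here is a deliberately trivial example (nothing like an IMO problem, and not taken from AlphaProof; just an illustration of the format): a statement together with a proof, which Lean will check mechanically.

```lean
-- A deliberately trivial example of a formal statement plus a proof.
-- If Lean accepts this, the proof is correct; a marker only has to check that
-- the statement itself matches the question that was actually asked.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```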
Who is marking the machines?
July 2025. I can envisage the following situation. As well as hundreds of the world’s cleverest schoolchildren entering the IMO, there will be machines entering. Hopefully not too many though. Because the systems will be of two types. There will be systems submitting answers in the language of a computer proof checker like Lean (or Rocq, Isabelle, or many others). And there will be language models submitting answers in human language. The big difference between these two kinds of submission is this: if a marker verifies that the statement of the question has been correctly translated into the computer proof checker, then all they need to do is to check that the proof compiles, and then they basically know that it is a “full marks” solution. For the language models we will have a situation like the Putnam solutions above: the computer will write something, it will look convincing, but a human is going to have to read it carefully and grade it, and there is certainly no guarantee that it will be a “full marks” solution. Borcherds is right to remind the AI community that “prove this theorem!” is what we really want to see as mathematicians, and language models are currently at least an order of magnitude less accurate than expert humans when it comes to logical reasoning. I am dreading the inevitable onslaught in a year or two of language model “proofs” of the Riemann hypothesis which will just contain claims which are unclear or incorrect in the middle of 10 pages of correct mathematics which the human will have to wade through to find the line which doesn’t hold up. On the other hand, theorem provers are at least an order of magnitude more accurate: every time I’ve seen Lean not accept a human argument in the mathematical literature, the human has been wrong.
In fact, as mathematicians, we would like to see more than “prove this theorem!”. We would like to see “prove this theorem, correctly, and explain what makes the proof work in a way which we humans can comprehend”. With the language model approach I worry (a lot) about “correctly”, and with the theorem prover approach I worry about “in a way which we humans can comprehend”. There is still a huge amount to be done. Progress is currently happening really quickly. But we are a long way away. When will we “beat the undergraduate barrier”? Nobody knows.