Last year, large language models were shown to be effective at solving math problems at the high school level and above. "Math is the source of great impact, but it's done in much the same way that people have done it for centuries: standing in front of a blackboard," said Patrick Shafto, a DARPA program manager, in a video introducing the program, known as Exponentiating Mathematics (expMath). "The modern world is built on math. Math allows us to model complex systems, such as the way air flows around airplanes, the way financial markets fluctuate, and the way blood flows through the heart. Breakthroughs in advanced math can unlock new technologies, such as cryptography, which is crucial for private messaging and online banking, and data compression, which allows us to stream images and video on the internet."
But advances in math can take years. DARPA wants to speed up the process. The goal of expMath is to encourage mathematicians and AI researchers to develop what DARPA calls "AI co-authors": tools that can break large, complex math problems into smaller, simpler ones that are easier to understand and therefore faster to solve.
For decades, mathematicians have used computers to speed up calculations or to check the correctness of certain mathematical statements. The new hope is that AI might help them crack previously unsolvable problems.
But there's a big difference between AI that can solve high school math problems (which the latest generation of models has mastered) and AI that could, in theory, solve the problems that professional mathematicians spend their careers working on.
On the one hand, these tools could automate some of the tasks that math graduates now do; on the other, they could push human knowledge beyond its current limits.
Here are three ways to think about this gap.
AI needs more than just clever tricks

Large language models aren't good at math. They make things up and can be convinced that "2 + 2 = 5." But newer versions of the technology, especially so-called large reasoning models (LRMs) like OpenAI's o3 and Anthropic's Claude 4 Thinking, are far more capable than their predecessors, and that has mathematicians excited.
This year, many LRMs scored high on the American Invitational Mathematics Exam (AIME), an exam for the top 5% of high school math students in the United States. LRMs try to solve problems step by step, rather than just giving the first answer.
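For a sense of what step-by-step solving looks like, consider a toy problem (far simpler than anything on the AIME, and our illustration rather than an actual exam question): finding the last digit of 7^2025. Written out the way a reasoning model decomposes a solution into small, checkable steps:

```latex
% Illustrative worked example (not an AIME problem): the last digit of 7^{2025}.
\begin{align*}
&\text{Step 1: the last digits of } 7^1, 7^2, 7^3, 7^4 \text{ are } 7, 9, 3, 1,
  \text{ after which the cycle repeats with period } 4.\\
&\text{Step 2: } 2025 = 4 \cdot 506 + 1, \text{ so } 2025 \equiv 1 \pmod{4}.\\
&\text{Step 3: hence } 7^{2025} \text{ ends in the same digit as } 7^1, \text{ namely } 7.
\end{align*}
```

Each intermediate step can be checked on its own, which is part of why this style of output tends to be more reliable than a single-shot answer.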
Meanwhile, new hybrid models that combine a large language model (LLM) with some kind of fact-checking system have also made breakthroughs. Emily de Oliveira Santos, a mathematician at the University of São Paulo in Brazil, points to Google DeepMind's AlphaProof, which pairs an LLM with DeepMind's game-playing model AlphaZero, as an important milestone. Last year, AlphaProof became the first computer program to perform at the level of a silver medalist in the International Mathematical Olympiad, one of the world's most prestigious math competitions.
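The fact-checking half of AlphaProof is the Lean proof assistant: the system writes candidate proofs in Lean's formal language, and Lean accepts a proof only if every inference is valid. As a minimal sketch of what a machine-checkable statement looks like, here is a toy Lean 4 theorem (our own illustration, not an AlphaProof output):

```lean
-- Toy example: addition of natural numbers is commutative.
-- The proof cites a standard-library lemma; the checker verifies every
-- step, so an incorrect proof simply fails to compile.
theorem toy_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Because the checker's verdict is unambiguous, a system like AlphaProof can generate many candidate proofs and keep only those that pass, sidestepping the hallucination problem that plagues plain LLMs.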
In May, Google DeepMind's AlphaEvolve model came up with better results than any yet achieved by humans on more than 50 unsolved math problems and several real-world computer science problems.
The momentum of progress is palpable. "GPT-4's math capabilities are nowhere near undergraduate level," de Oliveira Santos says. "I remember when it was released, I tested it on a topology problem, and it couldn't get more than a few lines in before getting completely lost." But when she gave the same problem to OpenAI's o1, an LRM released in January, it succeeded. Is AI closing in on human mathematicians?