MIT and KAUST Researchers Launch MathNet to Improve AI Reasoning

Can artificial intelligence ever truly master the nuance of creative mathematical reasoning, or is it merely memorizing patterns from a narrow slice of global competition data? This question sits at the heart of a significant new development from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), King Abdullah University of Science and Technology (KAUST), and the company HUMAIN. By assembling MathNet, a massive, verified dataset of proof-based problems, researchers are shifting the focus from simply solving problems to understanding the cultural and structural diversity of mathematics itself.

A Global Archive Built by Hand

For decades, the International Mathematical Olympiad (IMO) has functioned as a quiet engine of global mathematical innovation. Every year, participating nations bring booklets of original, creative problems that are shared among delegations and then effectively vanish from the public record. Shaden Alshammari, an MIT PhD student and lead author on the paper, notes, "Every country brings a booklet of its most novel and most creative problems. They share the booklets with each other, but no one had made the effort to collect them, clean them, and upload them online."

The resulting dataset is staggering in its scale and provenance. Comprising more than 30,000 expert-authored problems and solutions spanning 47 countries, 17 languages, and 143 competitions, it is five times larger than the next-biggest dataset of its kind. Building this required the team to track down 1,595 PDF volumes totaling more than 25,000 pages. Much of the backbone of this archive was provided by Navid Safaei, a longtime IMO community figure who had been collecting and scanning these booklets by hand since 2006. Unlike existing resources that rely on informal community forums, MathNet draws exclusively from official national competition booklets, ensuring the solutions are expert-written and peer-reviewed.

The Reality of AI Reasoning

While recent headlines suggest that frontier AI models have achieved gold-medal performance at the IMO, the reality captured by MathNet is more nuanced. Even GPT-5, the top-performing model tested by the team, averaged 69.3 percent on the main benchmark of 6,400 problems, failing nearly one in three Olympiad-level questions. The study suggests that AI performance is not just a matter of compute power but also of exposure to diverse mathematical traditions.

Furthermore, the data exposes significant blind spots. When problems include visual figures, performance drops across the board, underscoring that visual reasoning remains a persistent hurdle for even the most advanced systems. Perhaps more concerning is the linguistic bias: while some models show high proficiency in English, several open-source models scored 0 percent on problems presented in Mongolian. These findings challenge the assumption that AI is universally capable, demonstrating instead that models often mirror the limitations of their training data.

Limitations to Consider

It is vital to recognize that MathNet is a benchmark, not a panacea. The researchers highlight that retrieval-augmented generation—providing a model with a similar problem to help it solve a new one—is a double-edged sword. While a model like DeepSeek-V3.2-Speciale gained up to 12 percentage points with accurate retrieval, irrelevant retrieval actually degraded performance in roughly 22 percent of cases. Additionally, the challenge of identifying mathematically equivalent problems across different notations remains steep; testing eight state-of-the-art embedding models, the researchers found that even the best identified the correct match only about 5 percent of the time on the first try.
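To make the "correct match on the first try" figure concrete, here is a minimal Python sketch of a top-1 retrieval check. It assumes that each problem has already been turned into an embedding vector and that matches are ranked by cosine similarity; the function names, the toy two-dimensional vectors, and the scoring rule are illustrative assumptions, not the researchers' actual evaluation code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top1_accuracy(query_vecs, corpus_vecs, gold_indices):
    """Fraction of queries whose single nearest corpus vector is the known match."""
    hits = 0
    for query, gold in zip(query_vecs, gold_indices):
        # Index of the corpus vector most similar to this query.
        best = max(range(len(corpus_vecs)),
                   key=lambda i: cosine(query, corpus_vecs[i]))
        hits += (best == gold)
    return hits / len(query_vecs)

# Toy data (hypothetical): three corpus problems, three queries.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]
gold = [0, 1, 2]  # the third query's true match is ranked below a near-duplicate

accuracy = top1_accuracy(queries, corpus, gold)
print(f"top-1 accuracy: {accuracy:.2f}")
```

In this toy setup the third query lands closer to the wrong corpus vector than to its true match, which is the kind of miss that keeps first-try accuracy low when equivalent problems are written in different notations.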

Next Steps for the Mathematical Community

The significance of this work will be further explored when the team presents its findings at the International Conference on Learning Representations (ICLR) in Brazil later this month. For the researchers, the ultimate goal is to foster a more inclusive landscape for students who lack formal training resources, providing them with a centralized, searchable collection of problems from six continents. As the IMO community continues to navigate the impact of AI, future benchmark results on MathNet will help show whether these models can move beyond mere pattern matching to achieve genuine, transferable mathematical intuition. MathNet is currently available for public use at mathnet.csail.mit.edu.