The Final Benchmark
2026-01-14 · 5 min read
Preface
Throughout this post, 'AI System' refers broadly to any artificial intelligence system, whether LLM-based agents, other LLM-scaffold architectures, or entirely different approaches.
Introduction
Current efforts to create AI Systems can be roughly classified into two categories: Intelligence (Reasoning / Capabilities) and Goal Alignment. Goal Alignment refers to whether an AI System will complete a given task as intended or, more broadly, whether it will act in accordance with widely accepted human morals. Intelligence refers to the System's capability to actually fulfill that goal.
Any benchmark for testing AI Systems essentially tests a mix of these two categories.
For example, SWE-bench Verified tests whether an AI System solves a software engineering problem without cheating and whether it has the intelligence to generate the necessary code and use the appropriate tools to solve it. This benchmark leans more towards testing the intelligence of the model. Another example, primarily testing goal alignment, is Petri 2.0. But there are numerous others.
The benchmark I want to introduce in this blog post is Formal Conjectures, which is primarily intelligence focused.
The final Intelligence Benchmark
Formal Conjectures is a benchmark of open problems in math formalized in Lean. It is a crowdsourced, open-source effort coordinated on GitHub and started by researchers working at Google DeepMind.
If you're unfamiliar with Lean or formalization: in short, Lean provides a digital model of mathematical concepts that enables computers to deterministically verify mathematical proofs. You can read more on Lean here. Some rough statistics on the benchmark: as of February 2026 it contains over 800 open problems, and it grows with every new submission.
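To give a feel for the format, here is a sketch of what an entry in the style of Formal Conjectures could look like. This is a hypothetical illustration of my own, not an actual file from the repository: the twin prime conjecture, stated in Lean 4 with Mathlib, with the proof left open via `sorry`.

```lean
import Mathlib

/-- The twin prime conjecture: there are infinitely many primes `p`
    such that `p + 2` is also prime. -/
theorem twin_prime_conjecture :
    ∀ N : ℕ, ∃ p : ℕ, N < p ∧ p.Prime ∧ (p + 2).Prime := by
  sorry
```

A submission replaces natural language with a precise statement like this; the `sorry` marks the open proof obligation that an AI System (or human) would have to discharge.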
Why is math relevant for measuring the intelligence of AI? Mathematics is essentially reasoning made rigorous through symbols. Real-world problems are messier, which is why we call it general reasoning in that context. To support this hypothesis with some data, here is a correlation plot between Reasoning benchmarks and Math benchmarks.
Why I call this the final benchmark
Let me list some strengths of this benchmark:
- Solutions are verified by the minimal Lean Kernel, making it nearly impossible for an AI to cheat.
- The questions are open problems in math, so there is no risk of data leakage or contamination (i.e., the answers appearing in the training data).
- The benchmark essentially never saturates: it is an evolving benchmark, and we can always ask more and harder questions, but more on that later.
There are some slight downsides:
- There can be misformalizations, meaning the formalized Lean statement does not correspond exactly to the intended natural language problem and may therefore be easier (or harder) to solve. This is mitigated by peer-reviewing each submission and periodically running a system that checks for misformalizations.
- Answers must be provided in Lean, a format where AI Systems may currently be less proficient than in natural language math. This is not a major concern, as the gap between natural language math and formalized math will continue to close, and AI Systems can formalize human-made theory themselves (or invent entirely new theory). Frontier LLMs, as an example, have gotten significantly better at writing Lean over the past months.
Overall, this benchmark appears quite robust for measuring AI intelligence.
The Busy Beaver Scale
As mentioned earlier, this benchmark effectively never saturates. This is because mathematics has the peculiar property that easy-to-state questions can be incredibly difficult to answer. Some (famous) examples are Fermat's Last Theorem and the Odd Perfect Number conjecture, the latter of which also appears in the Formal Conjectures Perfect Numbers file.
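For concreteness, the odd perfect number conjecture can be stated in Lean with Mathlib roughly as follows. This is my own sketch and the actual file in the repository may differ; `Nat.Perfect` is Mathlib's predicate for perfect numbers.

```lean
import Mathlib

/-- Conjecture: there is no odd perfect number.
    `Nat.Perfect n` holds when the proper divisors of `n` sum to `n`. -/
theorem no_odd_perfect_number : ¬ ∃ n : ℕ, Odd n ∧ n.Perfect := by
  sorry
```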
Another such simple-to-state-but-hard-to-solve problem is determining the value of BB(n) for a given n, where BB is the Busy Beaver function: BB(n) is the maximum number of shifts an n-state, 2-symbol Turing machine can make on an initially empty tape before halting. But the exact definition is not too important.
For further information on the Busy Beaver Function, see this survey by Scott Aaronson. You can also explore the Busy Beaver Challenge Wiki.
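The Busy Beaver function is easy to experiment with at small scales. Below is a minimal simulation sketch of my own (not from the benchmark or the survey) that runs the known 2-state, 2-symbol champion machine, for which BB(2) = 6 shifts and 4 ones are left on the tape.

```python
# Minimal Turing machine simulator (illustrative sketch).
# A machine maps (state, symbol) -> (write, move, next_state);
# next_state None means the machine halts after this transition.

def run(machine, max_steps=10_000):
    """Run on an all-zero tape; return (shifts, ones) or None if no halt seen."""
    tape, pos, state = {}, 0, "A"
    for step in range(1, max_steps + 1):
        symbol = tape.get(pos, 0)
        write, move, state = machine[(state, symbol)]
        tape[pos] = write
        pos += 1 if move == "R" else -1
        if state is None:                  # halting transition taken
            return step, sum(tape.values())
    return None                            # did not halt within the budget

# The known 2-state, 2-symbol Busy Beaver champion.
bb2 = {
    ("A", 0): (1, "R", "B"), ("A", 1): (1, "L", "B"),
    ("B", 0): (1, "L", "A"), ("B", 1): (1, "R", None),
}
print(run(bb2))  # (6, 4): six shifts, four 1s on the tape
```

Simulating is the easy part; the hard part, as discussed below, is proving that the non-halting machines really never halt.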
This function is interesting because, for each n, it essentially represents a collection of challenges: determining whether specific Turing Machines halt. Most of these challenges are simple, but some are notoriously difficult. Furthermore, the halting of some Turing Machines can be encoded as mathematical problems, a few examples being listed in the Busy Beaver Math Olympiad. They are generally very hard to solve. So determining BB(n) essentially requires solving many novel challenges. Each successive n brings more problems, since the number of n-state machines is (4n + 4)^(2n), and harder ones, as an additional state allows for greater complexity.
An AI System capable of proving the value of BB(6) would be vastly more intelligent than one that can "merely" prove BB(5).[1] So this essentially gives an example of a never-ending list of hard problems, ordered by difficulty, to ask.[2][3]
Other such Intelligence Scales
Of course, one can find many such intelligence measures. Another option in a similar vein to determining Busy Beaver values would be to prove or disprove all well-formed formulas of size n. Another example is determining the exact values of the Ramsey numbers R(m, n) from the field of Graph Theory. However, unlike Busy Beaver values, where the Halting Problem forbids a general solution, the Ramsey number problem may admit a general structure across all values.
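Small Ramsey numbers can even be verified by brute force. The sketch below (my own illustration) confirms R(3, 3) = 6: some 2-coloring of the edges of K_5 avoids a monochromatic triangle, while every 2-coloring of K_6 contains one.

```python
# Brute-force verification of R(3, 3) = 6.
# A 2-coloring assigns each edge of the complete graph K_n the color 0 or 1.
from itertools import combinations

def has_mono_triangle(n: int, coloring: dict) -> bool:
    """True if some triangle has all three edges the same color."""
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

def every_coloring_has_mono_triangle(n: int) -> bool:
    """Check all 2^(n choose 2) edge colorings of K_n."""
    edges = list(combinations(range(n), 2))
    for bits in range(2 ** len(edges)):
        coloring = {e: (bits >> i) & 1 for i, e in enumerate(edges)}
        if not has_mono_triangle(n, coloring):
            return False  # found a triangle-free 2-coloring
    return True

print(every_coloring_has_mono_triangle(5))  # False: K_5 has an escape
print(every_coloring_has_mono_triangle(6))  # True: K_6 forces a triangle
```

Exhaustive search stops scaling almost immediately, which is exactly why exact Ramsey values such as R(5, 5) remain open.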
Personally, I simply like the idea of using Busy Beaver values to classify AI intelligence. One can say that a BB(n)-AI is an AI System capable of proving the value of BB(n).
Conclusion
Many benchmarks are becoming saturated, such as ARC-AGI. This trend has prompted dramatic names like Humanity's Last Exam. Formal Conjectures, by contrast, enables us to test intelligence indefinitely, assuming the AI System actually reveals its intelligence, which is the small element of Goal Alignment this benchmark tests.
In conclusion, Formal Conjectures offers a verifiable, never-saturating, non-leaking, and cheat-proof AI Intelligence Benchmark. But Formal Conjectures is nothing special in that regard; any Formal Proof System, such as Isabelle, Rocq, or HOL Light, can be used to formally state extremely hard-to-prove statements. Formal Conjectures is simply the biggest organized effort of this sort across all Formal Proof Systems.
Addendum (Feb 17, 2026): Just one day after publishing this post, Tristan Sterin — who initiated the BB(5) project — published a post on how well Claude Opus 4.6 can determine BB(4). You can read it here. It positioned Opus 4.6 just shy of BB(4) on the Busy Beaver Scale.
[1] Interestingly, one might run into problems just writing out the full number: it is known that BB(6) exceeds a power tower of exponentials far too tall to write down (see https://wiki.bbchallenge.org/wiki/BB(6)). Storing that number completely would likely require more computer memory than all of humanity has built; we cannot even properly represent the number of digits it has. Nevertheless, we can still prove which Turing machine is the Busy Beaver champion, i.e., the machine that takes the longest to halt. Or, even more generally, just proving which Turing machines halt and which don't: that is the hard part.↩
[2] This claim is not strictly accurate. Due to the uncomputability of the Busy Beaver function, for every computable, consistent axiomatic theory T there is a threshold n_T beyond which the value of BB(n) is unprovable within T (see Proposition 3 in Aaronson's survey). While this creates a theoretical upper bound, n_T is likely sufficiently big for measuring intelligence in the foreseeable future. Moreover, the benchmark can adapt by also asking for a proof of independence. Once independence is proven, the challenge shifts to defining a stronger axiomatic theory. Interestingly, since generating a hierarchical sequence of such theories is also uncomputable, moving to a stronger system requires genuine novelty too.↩
[3] Addendum (Feb 20, 2026): There is also a more practical limit (besides the theoretical limit from footnote 2) on proving Busy Beaver values. For each n, we need a halting proof or disproof of every machine of size n, but the length of the longest such proof also grows uncomputably. So there is a value n_P for which the proof needed to determine BB(n_P) is longer than the number of atoms in the universe, or at least larger than all of humanity's combined computer memory. This value might be much smaller than the theoretical limit n_T, but only time will tell.↩