Huschi's Blog

The Final Benchmark

2026-01-14 · 5 min read

Preface

Throughout this post, 'AI System' refers broadly to any artificial intelligence system, whether LLM-based agents, other LLM-scaffold architectures, or entirely different approaches.

Introduction

Current efforts for creating AI Systems can be roughly classified into two categories: Intelligence (Reasoning / Capabilities) and Goal Alignment. Goal Alignment refers to whether an AI System will complete a given task as intended or, more broadly, whether it will act in accordance with widely accepted human morals. Intelligence refers to the System's capability to actually fulfill that goal.

Any benchmark for testing AI Systems essentially tests a mix of these two categories.

For example, SWE Bench Verified tests whether an AI System solves a software engineering problem without cheating, and whether it has the intelligence to generate the necessary code and use the appropriate tools to solve the problem. This benchmark leans more towards testing the intelligence of the model. Another example, primarily testing goal alignment, is Petri 2.0. But there are numerous others.

The benchmark I want to introduce in this blog post is Formal Conjectures, which is primarily intelligence focused.

The final Intelligence Benchmark

Formal Conjectures is a benchmark of open problems in mathematics formalized in Lean. It is a crowdsourced, open-source effort coordinated on GitHub and started by researchers working at Google DeepMind.

If you're unfamiliar with Lean or formalization: in short, Lean is a digital model of mathematical concepts that enables computers to deterministically verify mathematical proofs. You can read more on Lean here. Some crude statistics on the benchmark: as of February 2026 it contains over 800 open problems, and it grows with every new submission.
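To give a flavor of what a formalized open problem looks like, here is a sketch in Lean 4 / Mathlib style of the odd perfect number conjecture. This is my own illustration, not necessarily the exact statement used in the Formal Conjectures repository:

```lean
import Mathlib

/-- Conjecture (open): there is no odd perfect number.
    `Nat.Perfect n` says the proper divisors of `n` sum to `n`. -/
theorem no_odd_perfect_number : ¬ ∃ n : ℕ, Odd n ∧ Nat.Perfect n := by
  sorry  -- open problems are stated, not proved; the proof is the challenge
</imports>
```

An AI System scores on such a benchmark by replacing the `sorry` with a proof that the Lean kernel accepts.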

Why is math relevant for measuring the intelligence of AI? Mathematics is essentially reasoning made rigorous through symbols. Real-world problems are messier, which is why we call it general reasoning in that context. To support this hypothesis with some data, here is a correlation plot between Reasoning benchmarks and Math benchmarks.

Why I call this the final benchmark

Let me list some strengths of this benchmark:

- Verifiable: every submitted proof is checked deterministically by Lean.
- Never-saturating: the pool of open problems keeps growing, and many are extremely hard.
- Non-leaking: since the problems are open, no solution can already sit in the training data.
- Cheat-proof: a proof either type-checks or it doesn't.

There are some slight downsides:

Overall, this benchmark appears quite robust for measuring AI intelligence.

The Busy Beaver Scale

As mentioned earlier, this benchmark effectively never saturates. This is because mathematics has the peculiar property that easy-to-state questions can be incredibly difficult to answer. Some famous examples are Fermat's Last Theorem and the existence of odd perfect numbers, the latter of which is also in the Formal Conjectures Perfect Numbers File.

Another such simple-to-state-but-hard-to-solve problem is determining the value of BB(n) for a given n, where BB(n) is the Busy Beaver Function: the maximum number of shifts a halting Turing Machine with n states and 2 symbols can make when started on an empty tape. But the exact definition is not too important.
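To make the definition concrete, here is a minimal simulator (a sketch of my own, not part of the benchmark) run on the standard 2-state Busy Beaver champion, whose transition table is well known and witnesses BB(2) = 6:

```python
# (state, read symbol) -> (write symbol, head move, next state); 'H' halts.
# This is the known 2-state, 2-symbol Busy Beaver champion.
CHAMPION_2 = {
    ("A", 0): (1, +1, "B"),
    ("A", 1): (1, -1, "B"),
    ("B", 0): (1, -1, "A"),
    ("B", 1): (1, +1, "H"),
}

def run(machine, max_steps=10_000):
    """Simulate on an all-zero tape; return (shifts, ones written) on halt,
    or None if the machine exceeds the step budget."""
    tape, pos, state = {}, 0, "A"
    for step in range(1, max_steps + 1):
        write, move, state = machine[(state, tape.get(pos, 0))]
        tape[pos] = write
        pos += move
        if state == "H":
            return step, sum(tape.values())
    return None  # did not halt within the budget

print(run(CHAMPION_2))  # -> (6, 4): six shifts, four ones on the tape
```

Determining BB(n) in general is exactly the hard part this simulator sidesteps: one must prove, for every machine of size n, whether it halts at all.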

For further information on the Busy Beaver Function, see this survey by Scott Aaronson. You can also explore the Busy Beaver Challenge Wiki.

This function is interesting because for each n, it essentially represents a collection of challenges: determining whether specific Turing Machines halt. Most of these challenges are simple, but some are notoriously difficult. Furthermore, the halting of some Turing Machines can be encoded as mathematical problems, a few examples of which are listed in the Busy Beaver Math Olympiad. These are generally very hard to solve. So determining BB(n) essentially requires solving many novel challenges. Each successive n brings more problems, since the number of Turing Machines of size n is (4n+1)^(2n), and harder ones, as an additional state allows for greater complexity.
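The machine count can be checked directly: each of the 2n transition-table entries is either the single halting entry or one of the 4n combinations of written symbol, head direction, and next state. A quick sketch:

```python
# Number of n-state, 2-symbol Turing machines in the usual convention:
# each of the 2n table entries is the halting entry or one of the
# 4n combinations (2 write symbols x 2 directions x n next states),
# giving (4n + 1) ** (2n) machines in total.
def machine_count(n):
    return (4 * n + 1) ** (2 * n)

for n in range(1, 6):
    print(n, machine_count(n))  # 25, 6561, 4826809, ...
```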

An AI System capable of proving BB(7) would be vastly more intelligent than one that can "merely" prove BB(6).[1] So this essentially gives an example of an ordered (by difficulty) never-ending list of hard problems to ask.[2][3]

Other such Intelligence Scales

Of course one can find many such intelligence measures. Another option in a similar vein to determining Busy Beaver values would be to prove or disprove all well-formed formulas of size n. Another example is determining the exact values of the Ramsey Numbers from the field of Graph Theory. However, unlike Busy Beaver values, where the Halting Problem forbids a general solution, the Ramsey Number problem may admit a general structure across all values.
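For intuition, the smallest Ramsey number, R(3,3) = 6, can be verified by brute force: K6 forces a monochromatic triangle under every 2-coloring of its edges, while K5 does not. This is an illustrative sketch of my own, not how larger Ramsey numbers are actually attacked:

```python
from itertools import combinations, product

def has_mono_triangle(n, coloring):
    """Does this 2-coloring of K_n's edges contain a one-color triangle?"""
    edges = list(combinations(range(n), 2))
    color = dict(zip(edges, coloring))
    return any(
        color[(a, b)] == color[(a, c)] == color[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

def every_coloring_forced(n):
    """True iff every 2-coloring of K_n has a monochromatic triangle."""
    m = n * (n - 1) // 2  # number of edges of K_n
    return all(has_mono_triangle(n, c) for c in product((0, 1), repeat=m))

print(every_coloring_forced(5), every_coloring_forced(6))  # False True
```

Already R(5,5) is unknown, since the number of colorings to rule out grows as 2^(n(n-1)/2) and no general structure is currently known.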

Personally, I simply like the idea of using Busy Beaver values to classify AI intelligence: a BB(8)-AI is one capable of proving the value of BB(8).

Conclusion

Many benchmarks are becoming saturated, such as ARC-AGI. This trend has prompted dramatic names like Humanity's Last Exam. Formal Conjectures, by contrast, enables us to test intelligence indefinitely, assuming the AI System actually reveals its intelligence, which is the small element of Goal Alignment this benchmark tests.

In conclusion, Formal Conjectures offers a verifiable, never-saturating, non-leaking, and cheat-proof AI Intelligence Benchmark. But Formal Conjectures is nothing special in that regard; any Formal Proof System, such as Isabelle, Rocq, or HOL Light, can be used to formally state extremely hard-to-prove statements. Formal Conjectures is simply the biggest organized effort of this sort across all Formal Proof Systems.

Addendum (Feb 17, 2026): Just one day after publishing this post, Tristan Sterin — who initiated the BB(5) project — published a post on how well Claude Opus 4.6 can determine BB(4). You can read it here. It positioned Opus 4.6 just shy of BB(4) on the Busy Beaver Scale.

  1. Interestingly, one might already run into problems writing out the full number: it is known that BB(6) > 2↑↑↑5 in Knuth's up-arrow notation (see https://wiki.bbchallenge.org/wiki/BB(6)). Storing that number completely would likely require more computer memory than all of humanity has built; we cannot even properly represent the number of digits it has. Nevertheless, we can still prove which Turing Machine is the Busy Beaver champion, i.e., the machine that takes the longest to halt. Even more generally, just proving which Turing Machines halt and which don't: that is the hard part.
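Such bounds are written in Knuth's up-arrow notation, where each extra arrow iterates the previous level. A small sketch (my own illustration) shows how quickly it explodes:

```python
def up(a, k, b):
    """Knuth's a ↑^k b: k = 1 is exponentiation, k + 1 iterates level k."""
    if k == 1:
        return a ** b
    result = 1
    for _ in range(b):
        result = up(a, k - 1, result)
    return result

print(up(2, 2, 4))  # 2↑↑4 = 2^(2^(2^2)) = 65536
# 2↑↑↑5 = 2↑↑(2↑↑↑4) is already far too large to ever evaluate.
```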

  2. This claim is not strictly accurate. Due to the uncomputability of the Busy Beaver function, for every computable, consistent axiomatic theory T there is a threshold nT beyond which the value of BB(n) is unprovable within T (see Proposition 3 in Aaronson's survey). While this creates a theoretical upper bound, nT is likely sufficiently big for measuring intelligence in the foreseeable future. Moreover, the benchmark can adapt by also asking for a proof of independence. Once independence is proven, the challenge shifts to defining a stronger axiomatic theory. Interestingly, since generating a hierarchical sequence of such theories is also uncomputable, moving to a stronger system requires genuine novelty too.

  3. Addendum (Feb 20, 2026): There is also a more practical limit (besides the theoretical limit from footnote 2) to proving Busy Beaver values. For each BB(n), we need a halting proof or disproof of every machine of size n, but the length of the longest such proof also grows uncomputably fast. So there is a value nP for which the proof needed to determine BB(nP) is longer than the number of atoms in the universe, or at least larger than all of humanity's combined computer memory. This value nP might be much smaller than the theoretical limit nT, but only time will tell.

#ai #benchmark #lean #math