In the workshop session following the overview and grand vision for using artificial intelligence (AI) to assist mathematical reasoning, several speakers explored case studies in both mathematics and computer science.
Yann LeCun, Meta Platforms, Inc., moderated the session on case studies in AI to assist mathematical reasoning; he introduced speakers François Charton, Meta AI, and Adam Wagner, Worcester Polytechnic Institute.
Charton presented the goal of solving mathematical problems with a transformer—a type of neural network (NN) model used in large language models (LLMs) like ChatGPT that trains on large amounts of data but cannot necessarily reason by itself. He described mathematical problem-solving with transformers as a translation task. The transformer should be trained on generated sets of problems and solutions in order to translate the problem, representable as a mathematical statement, into the solution, another mathematical statement. For example, one would want the transformer to be able to translate the input 7 + 9 into the output 16,
or to translate the input of a specific polynomial into the output of the polynomial’s roots.
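In this translation framing, both the problem and the solution are flat token sequences for a sequence-to-sequence model. A minimal sketch (hypothetical character-level tokenization, not Charton's exact scheme):

```python
# Hypothetical sketch of the translation framing: the problem and the
# solution are both plain token sequences; the transformer is trained
# to map one to the other, like a machine-translation model.
def tokenize(statement: str) -> list[str]:
    """Split a mathematical statement into single-character tokens."""
    return [ch for ch in statement if not ch.isspace()]

source = tokenize("7 + 9")   # what the transformer reads
target = tokenize("16")      # what it is trained to emit
print(source, target)        # ['7', '+', '9'] ['1', '6']
```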
Charton presented the problem of finding the greatest common divisor (GCD) of two integers. The integers 10 and 32 have a GCD of 2. This can be encoded as sequences of base 10 digits, so the model would train on an input of +, 1, 0, +, 3, 2, with the matching output +, 2. He emphasized that the transformer does not reason mathematically; it learns purely by training on examples. He then presented several instances in which transformers effectively learned to solve mathematical problems, such as computing the symbolic integral of a function (Lample and Charton 2019) and calculating the eigenvalues of a matrix (Charton 2021).
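Generating such training pairs is straightforward; a sketch following the sign-plus-digits encoding in the example above (the helper names are my own, not Charton's):

```python
import math
import random

def encode_int(n: int, base: int = 10) -> list[str]:
    """Encode an integer as a sign token followed by its digits in `base`,
    e.g., encode_int(32) -> ['+', '3', '2']."""
    digits = []
    m = abs(n)
    while True:
        digits.append(str(m % base))
        m //= base
        if m == 0:
            break
    return ["+" if n >= 0 else "-"] + digits[::-1]

def gcd_training_pair(base: int = 10, max_int: int = 1000):
    """One (input sequence, output sequence) example for the GCD task."""
    a, b = random.randint(1, max_int), random.randint(1, max_int)
    src = encode_int(a, base) + encode_int(b, base)
    tgt = encode_int(math.gcd(a, b), base)
    return src, tgt

# The worked example from the text:
print(encode_int(10) + encode_int(32))   # ['+', '1', '0', '+', '3', '2']
print(encode_int(math.gcd(10, 32)))      # ['+', '2']
```

Passing a different `base` to both helpers reproduces the base-30 and other-base variants discussed below.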
Although the model was shown to be effective at solving some problems, questions remain as to whether it had learned the underlying mathematics. Charton presented a few scenarios in which the model could not give a correct answer, regardless of the size of the transformer (i.e., the number of parameters) or the amount of data used in training. Returning to the GCD problem, he showed results for a model trained on randomly generated pairs of integers and their GCDs, encoded in base 30. Achieving 85 percent accuracy after training on 300,000 examples and 94.6 percent accuracy after 450 million examples, the model appears to be effective at finding the GCD. However, when encoding the training data in other bases, the model did not perform as well, with its accuracy depending on the base. Studying the model more closely, Charton found that it was predicting the GCD correctly for specific cases and consistently failing for others. He presented ongoing work using a different way of training the model, in which training data are logarithmically distributed, that seems to be more effective in finding the GCD for pairs of integers in all bases.
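One plausible reading of "logarithmically distributed" training data is log-uniform sampling of the operands, which heavily over-represents small (easier) cases; this sketch is an assumption about the setup, not Charton's exact recipe:

```python
import math
import random

def log_uniform_operand(max_int: int) -> int:
    """Sample an operand log-uniformly on [1, max_int]: small numbers
    appear far more often than large ones, loosely mirroring a
    curriculum in which the model sees many simple cases first.
    (Hypothetical reading of a 'logarithmically distributed' dataset.)"""
    return int(math.exp(random.uniform(0.0, math.log(max_int))))
```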
Charton showed another example in which a model learned to decompose symmetric matrices (Charton 2022). The spectral theorem provides a simple way for humans to solve this problem using a few of the matrix’s properties, such as its eigenvalues. One can ask whether the model, which is trained on examples and is unaware of the underlying mathematics, learns this theorem. After training the model to 92 percent accuracy, Charton tested it on 100,000 matrices and found that though the model predicted the correct decomposition in 92 percent of the cases, it predicted the matrices’ eigenvalues correctly in 99.99 percent of cases. Even when the model failed, it respected certain properties, which suggests that some of the underlying mathematics was learned. Furthermore, when he altered the experiment and stopped training the model after it reached 70 percent accuracy, it still predicted eigenvalues correctly in 99.6 percent of test cases.
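Properties of this kind are easy to check mechanically. As an illustration for the 2x2 symmetric case (a hypothetical helper, not from the paper): any correct eigenvalue prediction must sum to the trace and multiply to the determinant, so a model's output can be screened against these invariants even when the full decomposition is wrong.

```python
def respects_eigen_invariants(matrix, predicted, tol=1e-6):
    """For a 2x2 matrix, the eigenvalues must sum to the trace and
    multiply to the determinant. A model's predicted eigenvalues can
    be checked against these invariants without the exact answer."""
    (a, b), (c, d) = matrix
    trace, det = a + d, a * d - b * c
    lam1, lam2 = predicted
    return abs(lam1 + lam2 - trace) < tol and abs(lam1 * lam2 - det) < tol

# [[2, 1], [1, 2]] has eigenvalues 3 and 1:
print(respects_eigen_invariants([[2, 1], [1, 2]], (3.0, 1.0)))  # True
```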
Finally, Charton presented an example in which a model was trained to predict eigenvalues of matrices. After training on only Wigner matrices, the model performed very well when tested on Wigner matrices
but could not predict eigenvalues of non-Wigner matrices. Training on matrices with Gaussian or Laplace-distributed eigenvalues, however, the model was able to generalize and predict eigenvalues for different kinds of matrices. He underscored that training data matter.
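For reference, a Wigner matrix of the kind used in this experiment can be sampled as a symmetric matrix with independent Gaussian entries (a minimal sketch):

```python
import random

def wigner_matrix(n: int, sigma: float = 1.0) -> list[list[float]]:
    """Sample an n x n Wigner matrix: independent Gaussian entries,
    symmetrized. Its eigenvalue distribution follows Wigner's
    semicircle law, quite unlike the Gaussian- or Laplace-distributed
    spectra that let the model generalize."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            x = random.gauss(0.0, sigma)
            m[i][j] = m[j][i] = x
    return m
```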
To conclude, Charton reiterated a few key points: (1) Test sets are not sufficient; a model can do well on test sets but perform catastrophically on other data. (2) It is important to perform failure case analysis, studying problems when a model fails. (3) Transformers are, amazingly, able to learn the mathematics underlying a problem just by training on examples.
Wagner remarked that reinforcement learning (RL) algorithms have achieved superhuman levels of success in games such as Go and chess, starting with only the rules of the game as input. He asked whether this same idea could be applied to mathematics by framing a mathematical problem as a game and giving it to an NN. He examined several examples of the same type of RL algorithm seeking counterexamples to open conjectures.
Wagner’s first example centered on some of his own work surrounding a conjecture stating that for any graph, the sum of its largest eigenvalue and its matching number is greater than or equal to some function of its number of vertices. The smallest counterexample found for this conjecture has 600 vertices. In general, to use RL to find a counterexample to a conjecture, one determines two things: how to phrase the conjecture as a game, and what the reward function should be. He explained that in this case the conjecture can be framed easily as a game: the algorithm is given edges one at a time and can choose whether to accept or reject the next edge. The reward function can be designed to minimize the sum of the graph’s largest eigenvalue and matching number. Wagner then showed results in which the algorithm, over time, was able to find a counterexample. He acknowledged that this was a “dream scenario,” with both an obvious way to phrase the conjecture as a game and an obvious choice of reward function.
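The reward computation for this "dream scenario" can be sketched directly. The stand-ins below are mine, suitable only for tiny graphs: power iteration approximates the largest adjacency eigenvalue, and the matching number is found by brute force; Wagner's actual implementation surely differs.

```python
import itertools
import math

def largest_eigenvalue(adj, iters=200):
    """Approximate the largest eigenvalue of a nonnegative symmetric
    adjacency matrix by power iteration."""
    n = len(adj)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))
        if lam == 0:
            return 0.0
        v = [x / lam for x in w]
    return lam

def matching_number(edges):
    """Exact maximum matching size by brute force (tiny graphs only)."""
    edges = list(edges)
    for k in range(len(edges), 0, -1):
        for combo in itertools.combinations(edges, k):
            used = [v for e in combo for v in e]
            if len(used) == len(set(used)):  # no shared endpoints
                return k
    return 0

def reward(n, edges):
    """Negative of (largest eigenvalue + matching number): the RL agent
    maximizes this, i.e., minimizes the conjectured lower-bounded sum."""
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    return -(largest_eigenvalue(adj) + matching_number(edges))

# Path on 3 vertices: largest eigenvalue sqrt(2), matching number 1.
print(reward(3, [(0, 1), (1, 2)]))  # approximately -2.414
```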
Wagner shared five more complex examples with unique challenges in which the RL algorithm still successfully refuted an open conjecture. He presented an open conjecture similar to that of the first example, stating that for any graph, the sum of two certain parameters is positive (Aouchiche and Hansen 2016). The game can be phrased the same way, and the reward function can minimize the sum of the two parameters, as before. This time, the best graph the algorithm could find was not quite a counterexample, with the two parameters summing to approximately 0.4. However, the mathematician can easily adjust the graph produced by the
algorithm to find a true counterexample. Thus, he stated that even though the RL algorithm could not find a counterexample on its own, it provided enough insight for a human to solve the problem easily.
Wagner presented a third example related to matrices (Brualdi and Cao 2020), demonstrating the algorithm’s applicability to other types of problems besides graphs. He then gave two more examples where the choice of reward function was less obvious than in the three previous examples. The first was a conjecture on trees (Collins 1989), and the second a question about the existence of two graphs with a special relationship. Through the latter case especially, Wagner explained that sometimes with more complicated conjectures, though a reward function can be devised, RL is not necessarily the best approach for the problem.
Wagner’s final example studied a conjecture posed by Paul Erdős in 1962, which differed from his earlier examples in that any counterexample must be an infinite sequence of graphs. Wagner and his colleague Gwenaël Joret found a way to produce a counterexample with RL by using a standard “trick” in graph theory: constructing a finite graph and then taking a limit to infinity. Wagner remarked that this example demonstrates the power of these methods; if the conjecture had been made today, it would not have remained unsolved for 27 years. However, the example also reveals a significant drawback: although their algorithm found a counterexample and solved the conjecture, the graph it produced is complicated and messy, providing no understanding beyond the falsity of the conjecture. Thomason (1989) offers far more insight into the problem and solution. The understanding that usually arises from human over machine proofs is important to keep in mind, Wagner stressed.
Incorporating questions submitted by workshop participants, LeCun asked each speaker a few questions about their presentations. He asked about Wagner’s specific methodology, and Wagner explained some of the subtleties of his RL method (versus other kinds of machine learning methods) as well as the difficulties of benchmarking. LeCun drew the participants’ attention to the community project Nevergrad,1 a large PyTorch library of gradient-free optimization methods that can be useful as a baseline before studying more sophisticated methods.
Turning to Charton, LeCun inquired about the utility of finding the GCD via transformers, given that efficient algorithms already exist to calculate the GCD deterministically. Charton explained that transformers are useful in this case because it is a simple, easily solved problem. Evaluating the effectiveness of transformers to find the GCD helps guide intuition and define the use cases for which AI may (and may not) be helpful. LeCun also asked about Charton’s current work in which training data for the GCD model are logarithmically distributed. Charton described the motivation of curriculum learning; the logarithmic distribution ensures that the model trains on a greater number of simpler cases, imitating how humans first learn simple problems before learning more difficult ones.
___________________
1 The documentation for Nevergrad can be found at https://facebookresearch.github.io/nevergrad, accessed June 29, 2023.
Delving into more general topics, LeCun posed a question about how combining generative methods like LLMs with computational and mathematical tools like WolframAlpha might be useful in the context of mathematical reasoning. Wagner answered that since LLMs cannot reason on their own, real solutions will likely come from adjusting their architectures to enable reasoning and planning. However, Charton expressed skepticism, noting that this kind of solution may work for simple arithmetic and mathematical word problems but may not be effective for more complex physics problems that require a mixture of several tools at once, especially considering the large amounts of data necessary for training. LeCun asked how researchers could combine the power of LLMs, which cannot reason, with the planning abilities that have been developed in other spaces like AI for game playing. He explained the concept of a Monte Carlo tree search, which is crucial to AI for many games. Games like Go can be represented in a tree with branches for every possible move; mathematical proving can be represented similarly. Because the branches grow exponentially, he continued, success hinges on the ability to identify which branches are most promising and should thus be explored. Charton reiterated that to apply this idea to mathematical proof, one of the main barriers lies in obtaining and effectively using data to train a model to predict efficient next steps.
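The branch-selection rule LeCun alluded to is typically the UCB1 formula at the heart of Monte Carlo tree search; a minimal sketch of the generic rule (not a proof-search system):

```python
import math

def uct_score(total_reward: float, visits: int,
              parent_visits: int, c: float = 1.4) -> float:
    """UCB1 score used in Monte Carlo tree search: balance a branch's
    average payoff (exploitation) against how rarely it has been
    tried (exploration). Unvisited branches are explored first."""
    if visits == 0:
        return float("inf")
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Pick the most promising child (e.g., a candidate next proof step):
children = [(10.0, 20), (3.0, 4), (0.0, 0)]  # (total_reward, visits)
best = max(range(3), key=lambda i: uct_score(*children[i], parent_visits=24))
print(best)  # 2 -- the unvisited branch is tried first
```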
Continuing to explore the question of how to develop an ability in AI to reason and rely on mental models like humans do, Charton emphasized the importance of definitions. It could be useful to add definitions to AI—mirroring the notions mathematicians have on the differences among theorems, lemmas, corollaries, etc.—as well as to build real-world elements into training—mirroring how children easily intuit mathematical concepts from real-world elements. He stressed the value of concept-building and representation at a base level. Wagner described his process of proving mathematical theorems, indicating that both a global overview of the proof and a local picture of the current step are important. Such an architecture would be crucial to advancing AI for mathematical proof. Drawing on Wagner’s description, LeCun remarked on the top-down, hierarchical approach that humans take to complete a proof and wondered whether it would ever be possible for AI to plan hierarchically.
Charton commented that hierarchical planning might be less necessary than many believe. Human processes use hierarchical planning, but NNs are much better at working in high-dimensional spaces than humans. Researchers may be influenced by the human limitations that necessitate hierarchical planning, he explained, and NNs may not need that kind of architecture to be effective.
Speakers Thierry Coquand, University of Gothenburg; Johan Commelin, University of Freiburg; and Greg Morrisett, Cornell University, presented case studies in proof verification and checking. The session was moderated by Talia Ringer, University of Illinois at Urbana-Champaign.
Coquand provided a general introduction to interactive proof assistants, discussing notions of correctness and the motivation for developing and using the tools, followed by a demonstration of a proof assistant in action. He first explained why mathematical correctness can be checked with computers. For knowledge outside of mathematics, one must rely on an external authority, but within mathematics, an objective and purely internal way to ensure correctness exists. Mathematical correctness is verified using the form, not the content, of the argument. He qualified that intuitions and social factors are still crucial to mathematics, but ultimately, an argument is accepted based on the formal rules. Before the advent of computers, absolute precision in mathematics was considered unfeasible. However, mathematicians can now write complex proofs precisely using interactive proof assistants, which he defined as “software that helps mathematicians [and] computer scientists in writing mathematical arguments.” Proof assistants themselves can be trusted to be correct because their kernels are small enough to be verifiable by humans.
To explore motivations for using proof assistants, Coquand pointed out that in both computer science and mathematics, some proofs are so long and complex that it is virtually impossible to check their correctness by hand with full confidence. He gave the Feit-Thompson theorem as an example from mathematics; its proof was 250 pages long (Feit and Thompson 1963). For some applications in computer science, it is essential that all details be correct, making proof assistants useful, he said. seL42 and CompCert3 are two examples of software that were verified using interactive proof assistants. seL4 is an operating system microkernel, and CompCert is a compiler; having proof of correctness lends great value to each. Additional motivation for using proof assistants arises from the way these tools push researchers to consider how to express formal rules for mathematics, which can lead to new insights; for example, recent work revealed new logical insight about the notion of equality—a very basic mathematical notion—and unexpected connections between equality and the field of homotopy theory. Proof assistants can provide a new avenue for mathematicians to organize their ideas and to improve their understanding of complex proofs, which could ultimately lead to producing better proofs.
___________________
2 The seL4 website is https://sel4.systems, accessed July 10, 2023.
3 The CompCert website is https://compcert.org, accessed July 10, 2023.
Coquand explained that the development of proofs has a strong analogy with the development of programs. They have the same issues of modularity, notations, choice of names, and maintenance, and most developments in both areas happen through large team collaborations.
Proof assistants also have the advantage of being interactive, Coquand said. He cited an analogy between proof assistants and video games: “The main advantage of a proof assistant is that it gives the students immediate feedback. Incorrect proofs are rejected right away. Such rejections are an insult and challenge for motivated students. They will persist in their struggle to convince the machine” (Nipkow 2012). Coquand demonstrated the use of an interactive proof assistant by playing Kevin Buzzard and Mohammad Pedramfar’s Natural Number Game,4 which uses the Lean theorem prover. He noted its similarity to a video game in its level design and walked the workshop participants through the interactions with the computer for one level. He remarked that AI has become exceptional at playing some games, so the “game” of proving mathematical theorems could be a natural area to follow.
Commelin stated that interactive theorem provers (ITPs) form a rigorous and precise platform that can serve several functions. He broke down interactive theorem proving into three pieces: (1) formal verification, in which proofs are mechanically verified using a small set of rules, as Coquand discussed; (2) proof automation, in which the computer assists verification through neural or symbolic methods; and (3) libraries, which are necessary to underpin these large mathematical projects. He remarked that libraries are often overlooked but can be a great tool for knowledge management, providing a platform to search, organize, and preserve mathematical knowledge.
___________________
4 The Natural Number Game website is https://www.ma.imperial.ac.uk/~buzzard/xena/natural_number_game, accessed June 27, 2023.
Referencing the Liquid Tensor Experiment5 in which users formally verified a complex proof written by Dustin Clausen and Peter Scholze, Commelin showed workshop participants an example using the Lean theorem prover. The theorem prover contains an input pane where code is written, and a section with the goal state showing the current assumptions in place and what needs to be proved. The code imports prerequisites from the library,6 asserts some global assumptions, and then states the theorem. Within the theorem statement are tactics—sets of instructions for the proof assistant—that build the proof.
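The shape Commelin described (imports, a theorem statement, then tactics) looks like this in a toy Lean 4 example; this is not drawn from the Liquid Tensor Experiment, just a minimal illustration of the structure:

```lean
import Mathlib.Tactic        -- prerequisites pulled from the library

-- A trivial theorem statement; the goal pane would show `a + b = b + a`.
theorem toy_comm (a b : ℕ) : a + b = b + a := by
  -- A tactic: an instruction telling the proof assistant how to build the proof.
  ring
```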
Enumerating some desired features of ITPs, Commelin described the perspectives of readers, authors, and maintainers. For readers, he said that formal proofs should be readable, with good notation and documentation, as well as explorable and interactive. For example, readers want to be able to identify with ease the main idea of a proof or assumptions needed in order to apply the result. He pointed workshop participants toward a tool developed by Patrick Massot and Kyle Miller with interactive elements that allow users to hover over statements and see the current Lean proof state in ordinary language, among other functionalities.7 For authors, Commelin asserted that formal proofs should be flexible (in terms of foundations, input, and output), powerful, and expressive. Authors want a large library of preliminary results and automation that is fast, effective, and integrated into the workflow. The ideal scenario would allow an author to sketch the main idea of a proof and leave the rest to automation. However, he noted that most of these desired features do not yet exist. Finally, for maintainers of libraries, he said that formal proofs should be robust, deterministic, and fast. Maintainers desire tools to refactor large libraries and automation to fix broken proofs, as well as tools like linters and tests that can guarantee quality of both the contributions and the library. He pointed out the tension among what readers, authors, and maintainers value in an ITP.
Turning to a discussion of the state of the art of automation and ITPs, Commelin gave a few examples of ways automation is or could be applied to ITPs. He named decision problems, explaining that some fragments of mathematics, such as Euclidean geometry problems, are decidable and can be solved by certain algorithms. Another opportunity to use automation for ITPs comes from computer algebra. Connections between computer algebra systems and ITPs are emerging, but because a systematic protocol or framework linking the two does not yet exist, Commelin highlighted this as an area for further research. Boolean satisfiability (SAT) and satisfiability modulo theories (SMT, a generalization of SAT) solvers can also be connected to ITPs. One drawback is that SAT/SMT solvers apply only to specific types of mathematics, and it is not clear how to generalize; another is that proofs from these solvers can be enormous, and reconstructing them in ITPs is nontrivial. Finally, neural AI may be applied to ITPs. He cited GPT-f producing a small contribution to mathlib as an example and expressed that promising developments are occurring in the area. However, using neural AI tools in this way has disadvantages: there is not yet integrated tooling with ITPs; the tools give slow, hit-or-miss responses; and they depend on enormous computational power beyond that of consumer hardware, which raises sociological and ecological questions.
___________________
5 The GitHub repository for the Liquid Tensor Experiment can be found at https://github.com/leanprover-community/lean-liquid, accessed June 27, 2023.
6 The mathlib library for the Lean theorem prover can be found at https://github.com/leanprover-community/mathlib, accessed June 27, 2023.
7 This tool can be found at tinyurl.com/LeanIpam, accessed June 27, 2023.
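At their core, the decision procedures behind SAT answer a yes/no question that can, inefficiently, be settled by exhaustive search. The toy sketch below conveys the problem shape only; the clause-learning solvers actually bridged to ITPs work very differently.

```python
from itertools import product

def brute_force_sat(clauses: list[list[int]], n_vars: int) -> bool:
    """Decide satisfiability of a CNF formula by exhaustive search.
    Literal k > 0 means variable k is true; -k means it is false.
    Real SAT solvers (and the SMT solvers generalizing them) use
    clause learning rather than this exponential enumeration."""
    for assignment in product([False, True], repeat=n_vars):
        if all(any((lit > 0) == assignment[abs(lit) - 1] for lit in clause)
               for clause in clauses):
            return True
    return False

print(brute_force_sat([[1, 2], [-1]], 2))  # True  (x1 false, x2 true)
print(brute_force_sat([[1], [-1]], 1))     # False (contradiction)
```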
In closing, Commelin explored ways that AI and ITPs could enhance each other in the future. He imagined that neural and symbolic automation could help ITPs with proof discovery, explanation, and repair, as well as library design. Conversely, ITP libraries could provide large sets of training data. Intermediate goal states, proof terms, and tactic scripts could be useful for the AI that is often only trained on polished end results. He encouraged AI researchers to explore Lean 4 in particular because it exposes an enormous amount of internal structure. Researchers could begin to fine-tune LLMs in a feedback loop with ITPs, teaching the models error recovery in a process similar to how humans learn. Commelin concluded his presentation with an open challenge to create an AI that selects “beautiful” or “important” results from mathlib, suggesting that this is intimately connected to proof discovery and the ways AI can provide meaningful assistance in mathematical research.
Offering a perspective from computer science, Morrisett expressed that one of the main challenges today in software security is ensuring that there are no bugs that can be exploited. The hope in software verification is that proof assistants can check correctness, safety, and security in the massive amounts of code that cannot be checked properly by humans.
Presenting a case from several years ago, Morrisett described an instance in which Google aimed to provide a service allowing users to run native machine code in the Chrome web browser. Intending to sandbox the code (i.e., prevent the code from reading or writing outside a given data segment) to ensure integrity and security, Google developed a
software fault isolation tool called Native Client. The compiler was modified to generate self-policing code that would ensure the code was following the sandbox security policy, and a checker was built to ensure that the machine code was truly self-policing. The checker was roughly 600 lines of code—much simpler, smaller, and more easily verifiable than the compiler. However, when Google ran a competition, numerous teams found bugs in the checker’s code. At this point, Morrisett and some of his students decided to build a formally verified checker for the software fault isolation tool.
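The checker's job can be illustrated with a deliberately toy model; the instruction format here is entirely hypothetical, whereas Native Client's real checker operates on x86 machine code:

```python
def check_self_policing(instructions) -> bool:
    """Toy fault-isolation check: every store must be immediately
    preceded by a 'mask' that clamps the address register into the
    sandbox's data segment. Reject code with an unguarded store."""
    guarded = False
    for op, *operands in instructions:
        if op == "mask":
            guarded = True           # address register now confined
        elif op == "store":
            if not guarded:
                return False         # unguarded write: not self-policing
            guarded = False
        else:
            guarded = False          # anything else invalidates the mask
    return True

print(check_self_policing([("mask", "r1"), ("store", "r1")]))  # True
print(check_self_policing([("store", "r1")]))                  # False
```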
Morrisett noted that constructing code and its proof of correctness together is often easier than verifying code after the fact. With this in mind, he and his students built a completely new checker that had about 80 lines of code. With this shorter length, they were able to prove the checker’s correctness. However, he remarked that to prove correctness completely in a case like this, one needs a model for how to run the code on a machine. Because machines like the Intel x86 have thousands of instructions, which are often ill-specified, much of the project’s effort went toward building and validating this model to reason about the execution of the checker’s code. He noted that this exemplifies a difference between proofs in computer science and mathematics—in computer science, proofs can be straightforward but extremely tedious, with many cases. Thus, automation played a key role in this project. Furthermore, the main challenge often lies not in the proof itself but in modeling the environment connecting to the real world. He remarked that the speed of automation is vital for engaging with the proof in a meaningful way.
As another example, Morrisett presented the Defense Advanced Research Projects Agency’s (DARPA’s) High-Assurance Cyber Military Systems program, which sought to develop a secure software stack for drones. Researchers built a clean-slate software stack that included an seL4 software system and some additional features. They proved system-wide properties, including that the system was memory-safe and ignored malformed and nonauthenticated messages. Although the program was a great success, Morrisett enumerated some challenges that remain: (1) the seL4 kernel is simple but still took 20 person-years to prove correct; (2) the program did not verify all safety and security requirements, just those deemed most important; and (3) models need to be faithful to the environment, and the software running on the drone needs to be verified.
Morrisett stressed that in software security, the value in formalization comes from the machine’s ability to audit proofs with an enormous amount of cases. Additionally, unlike in mathematics, software verification requires constructing and validating models, and constructing decision procedures to scale. He concluded with a “wish list,” pointing to areas where future work may be helpful. For example, a co-pilot for proof
assistants would be useful, including support for synthesis and for summarizing and navigating large proofs. He remarked that he finds himself reproving small data structures and algorithms where a library could be used, but at the moment, finding an existing library is more work than reconstructing a proof himself. Tools for generating real-world models and testing infrastructure to validate the models would also be useful.
Ringer asked Commelin how one verifies, in practice, that the formal statement of a definition or theorem in a library corresponds to the English statement. Commelin replied that for a small proof, one would check by hand that each definition matched the English statement. However, for the Liquid Tensor Experiment, this was impossible because of the recursive nature of definitions; the researchers instead used a form of abductive reasoning and provided many examples of formalized statements corresponding to English statements, establishing reasonable confidence. He referenced a blog post that expounds on this topic (see Topaz 2022).
Noting that all three presenters have substantial experience writing formal proofs, Ringer asked about the most difficult part of writing a formal proof development, and how better automation could help. Commelin named designing the global structure and hierarchy of the project overall, particularly for large experiments like the Liquid Tensor Experiment. Proof repair is another difficult part of these projects due to its tediousness, and he envisioned LLMs being useful in that area. Morrisett described writing definitions as very difficult, which Commelin and Coquand both seconded. Ringer followed up by inquiring about the most useful qualities for practical proof automation. Commelin emphasized that integration into workflows and user experience design are crucial, and Morrisett expressed frustration with the lack of integration between decision procedures. Commelin noted that in Lean, writing tactics to combine other tactics is relatively easy, allowing for smoother integration and making Lean more appealing as a proof assistant.
Aouchiche, M., and P. Hansen. 2016. “Proximity, Remoteness and Distance Eigenvalues of a Graph.” Discrete Applied Mathematics 213:17–25.
Brualdi, R.A., and L. Cao. 2020. “Pattern-Avoiding (0, 1)-Matrices.” arXiv preprint arXiv:2005.00379.
Charton, F. 2021. “Linear Algebra with Transformers.” arXiv preprint arXiv:2112.01898.
Charton, F. 2022. “What Is My Math Transformer Doing?—Three Results on Interpretability and Generalization.” arXiv preprint arXiv:2211.00170.
Collins, K.L. 1989. “On a Conjecture of Graham and Lovász About Distance Matrices.” Discrete Applied Mathematics 25(1–2):27–35.
Feit, W., and J. Thompson. 1963. “Solvability of Groups of Odd Order.” Pacific Journal of Mathematics 13(3):775–1029. http://dx.doi.org/10.2140/pjm.1963.13.775.
Lample, G., and F. Charton. 2019. “Deep Learning for Symbolic Mathematics.” arXiv preprint arXiv:1912.01412.
Nipkow, T. 2012. “Teaching Semantics with a Proof Assistant: No More LSD Trip Proofs.” Pp. 24–38 in Verification, Model Checking, and Abstract Interpretation: 13th International Conference, VMCAI 2012, Philadelphia, PA, USA, January 22–24, 2012. Proceedings 13. Berlin, Germany: Springer Berlin Heidelberg.
Thomason, A. 1989. “A Disproof of a Conjecture of Erdős in Ramsey Theory.” Journal of the London Mathematical Society 2(2):246–255.
Topaz, A. 2022. “Definitions in the Liquid Tensor Experiment.” Lean Community Blog. October 14. https://leanprover-community.github.io/blog/posts/lte-examples.