How to search for theorems in Lean 4

It's August 2024. To say the "theorem search in Lean" ecosystem is thriving is an understatement, last time I used Lean (less than a year ago?) most of these tools didn't exist.

July 24th 2024 Kevin Buzzard was giving the "The state of the art in the formalisation of research-level mathematics" talk during the "Interactive Theorem Proving in Education and Research" workshop (link).
During the questions section, I asked something along the lines of "what do mathematicians use for searching theorems in Lean lately - Loogle, Moogle, ReProver, #find tactic, something else completely?".
Kevin Buzzard responded with something along the lines of "Moogle! Once there was a 6-hour Moogle outage, and that was the only thing zulip discussed".

So - let's start from listing all of our options, perhaps with a mild bias towards Moogle.

NOTE - this is very much WIP, but already explorable.


First of all, an overview.

Overview of searching for theorems in Lean 4: Moogle, LeanSearch.net, Loogle, #find, #library_search, Lean Copilot
Overview of theorem search methods in Lean 4.
"NEURAL" search is a search based on a neural network (all such engines are marked with ✨).
"FORMAL" search is a search based on pattern-matching expressions.

While traditional search engines only begin to experiment with semantic search (say, nn-first approach of exa.ai), Lean theorem search seems to be overpowered by nn-first approaches.

Both Moogle.ai and LeanSearch.net support a formal pattern-matching approach apart from their main nn-based search entry.

Throughout this post, we'll be searching for a simple theorem - not_exists from Mathlib.

theorem not_exists {α : Sort u_1} {p : α → Prop} :
   (¬∃ (x : α), p x) ↔ ∀ (x : α), ¬p x

Neural search engines will be queried with ¬∃ (x : α), p x then ∀ (x : α), ¬p x,
formal search engines will be queried with (¬∃ (x : α), p x) → ∀ (x : α), ¬p x.

In no way is this the best possible testing query - my goal for this post is to showcase the interfaces of these systems for a theorem that everyone can understand.

NEURAL: LeanSearch.net

LinkAuthorsAnnouncement
https://leansearch.netJiang Jiedong,
Guoxiong Gao,
Haocheng Ju
Zulip, 30th July 2024
LeanSearch.net interface

LeanSearch is a semantic search engine.

Long story short - per each theorem in Mathlib, it takes a mathlib formal statement and its context (docstring, related definitions, etc.), and "informalizaes" it using GPT-3 - turns the theorem statement into a natural language description.

Then, it creates the semantic embedding [theorem name; theorem's formal statement in Lean; natural language version of this theorem]  and puts it into the vector database (one vector == one theorem; theorems are considered similar when the cosine distance between these two vectors is small).

When searching for theorem in the LeanSearch interface, the user can type anything they like - LaTeX formulas, approximate theorem name, term descriptions, natural language queries - LeanSearch can digest all of it.

Anecdotally, if you input something obviously silly like theorem name that has many letters "z", semantic search will happily return theorem List.zip_unzip. If you input theorem that they learn in school, semantic search will happily return theorem Nat.sq_sub_sq (a : ℕ) (b : ℕ) : a ^ 2 - b ^ 2 = (a + b) * (a - b).

After the query is submitted, the query text undergoes query augmentation using GPT-4 (something along the lines of "You're a maths expert, please transform this lousy query into proper maths speak AND accompanying Lean 4 code").

After this process, we have vectors representing all theorems in Mathlib, and a vector representing our augmented search query. Using cosine similarity, we find the theorem vectors that are most similar to the query. Those are our search results.

NEURAL: Moogle

LinkAuthorsAnnouncement
https://moogle.ai The Morph LabsZulip, November 2023
Moogle interface

Moogle is a hybrid. In order to perform a formal pattern-matching search, start your query with #find.

Unlike LeanSearch.net, Moogle is yet to publish a paper about their system.

It's unclear whether LeanSearch.net or Moogle.ai performs better, both are worth trying. LeanSearch.net paper claims performance superior to Moogle.ai's, with a caveeat that these are not directly comparable.

As mentioned in the introduction, Moogle seems to be the most popular search engine at the moment - probably because it was the first semantic search engine for Lean. At the moment of writing, LeanSearch.net is one-month-old.

NEURAL: Search-Mathlib (simple model on huggingface)

LinkAuthorsAnnouncement
huggingface.co/spaces/dx2102/search-mathlibDeming XuZulip, 30th July 2024

Published, quite unfortunately, on the same day as LeanSearch.net - this is a simple model on huggingface. Returns pretty good results, but this was merely a training exercise for the author as he states in his introduction on zulip.

I think the zulip message by Deming Xu is worth a read - with only 100 lines of Lean code (adjusted code taken from another library) + 200 lines of Python code (100 of which is just setting up the web page), he managed to create a search engine that performs reasonably well, which is quite inspiring to anyone who wants to experiment with semantic search in Lean 4.

The principles are similar to what LeanSearch.net (and, likely, Moogle) is doing - convert all theorems into vectors, convert a search query into a vector, put all of these vectors into vector database, and determine the theorem vectors closest to the search query vector. The main difference between LeanSearch.net and Deming's huggingface model is that LeanSearch.net enhances both the theorem vectors and the query vector using conversion to natural language - Deming just uses raw Lean data (theorem name, theorem type, theorem docstring).

FORMAL: Loogle (aka Mathlib's #find)

LinkAuthorsAnnouncement
https://loogle.lean-lang.orgJoachim BreitnerZulip
Lean's Loogle interface

Loogle was developed by Joachim Breitner, and was reportedly (link) inspired by Haskell's Hoogle.

Loogle's search is not semantic, it matches expressions by their exact structure. Technically, Loogle is a wrapper around Mathlib's #find - it adds some functionality to the default #find command, and creates a web interface around it. 

Initially, Joachim tried to merge two #find PRs (#6643 and #6363) into Mathlib 4, but the reviews went for too long, so he just went ahead and:
1) factored out #find + batteries into a separate repo (https://github.com/nomeata/loogle), and
2) created https://loogle.lean-lang.org, which uses this #find + batteries internally.

 

Mathlib's #find command

FORMAL: Mathlib's apply? and exact? (aka library_search)

From the docs:

library_search has been renamed to apply? (or exact? if you only want solutions closing the goal)

Both apply? and exact? work on a goal only.
apply? will find all theorems that are legal to execute on a goal.
exact? will find all theorems that are legal to execute on a goal, AND that do close the goal.

Here is the interface for the apply? tactic:

Lean's apply? interface
Lean's exact? interface

Good docs: https://lean-lang.org/blog/2024-4-4-lean-470/.

OTHER: Lean Copilot

LinkAuthorsAnnouncement
https://github.com/lean-dojo/LeanCopilot
(install as a Lean library)
Kaiyu Yang,
others
Zulip,
Twitter

Unlike the search engines above, Lean Copilot takes inspiration from GitHub Copilot - it looks at the current state of your proof, and tries to predict the next useful step.

Lean Copilot can be considered a search engine because, using the select_premises tactic, you can search "relevant theorems to my proof" with it (confusingly enough, when they say "premises" - they mean "theorems"). It will look something like this (image courtesy of LeanCopilot repo):