There are 3 steps that LeanSearch performs to index Mathlib:
For example, if we have the following theorem:
/-- Our theorem -/
theorem odd_plus_odd (a b : Nat) : is_odd a → is_odd b → is_even (a + b) := by
intro ha hb
unfold is_odd is_even at *
sorry
To create the informalization, LeanSearch passes the following information to DeepSeek's LLM:
Theorem name: "odd_plus_odd"
Signature: "(a b : Nat) : is_odd a → is_odd b → is_even (a + b)"
Value:
":= by
intro ha hb
unfold is_odd is_even at *
sorry"
Docstring: "Our theorem"
Kind: "theorem"
Docstring on top of this Lean file: "These are our theorems about natural numbers"
Neighbors:
- even_plus_even: (a b : Nat) : is_even a → is_even b → is_even (a + b)
- odd_plus_odd: (a b : Nat) : is_odd a → is_odd b → is_even (a + b)
- sum_four_nums: (a b c d : Nat) : is_even a → is_even b → is_odd c → is_odd d → is_even (a + b + c + d)
- is_odd: (n : Nat)
Informal name: Odd natural number
Informal description: A natural number \( n \) is odd if \( n \mod 2 = 1 \)
Dependencies:
- is_even: (n : Nat)
Informal name: Even natural number
Informal description: A natural number \( n \) is even if the remainder when divided by 2 is 0, i.e., \( n \equiv 0 \mod 2 \).
- is_odd: (n : Nat)
Informal name: Odd natural number
Informal description: A natural number \( n \) is odd if \( n \mod 2 = 1 \).
"Neighbors"
here are quite literally the theorems that are located below and above of our theorem in our .lean file."Dependencies"
here are the declarations directly mentioned in the signature and the body of the theorem. Technically, dependencies in the body of the theorem are uninteresting and could be removed (we can use what we like to prove the theorem!), though they can give one some context. Still - I would personally remove them from the informalization's input.
Notice the informalizations are in LaTeX. That's because LeanSearch specifically asks for the informalization to be in LaTeX in its user prompt to DeepSeek.
As far as I see the decision to use LaTeX comes from LeanSearch's desire to display natural language descriptions in the interface. Most (all?) user queries are not in LaTeX, they are either in Lean's unicode, or in natural language - embeddings created from unicode informalizations would be better at matching the user's query.
After the informalization (informal name + informal description) are in place, LeanSearch creates the embeddings from {kind} {name} {signature}\n{informal_name}: {informal_description}
. In our example, that would be:
theorem ['even_plus_even'] (a b : Nat) : is_even a → is_even b → is_even (a + b)
Sum of Even Numbers is Even: For any natural numbers \( a \) and \( b \), if \( a \) is even and \( b \) is even, then \( a + b \) is even.