LeanSearch: What Gets Converted Into What

Indexing

There are 3 steps that LeanSearch performs to index Mathlib:

  1. parse the Lean project
    Uses Jixia, open-sourced Lean parser (also developed by LeanSeach team).
    As of May 2025, LeanSearch indexes mathlib with the python -m database jixia <path_to_mathlib> Init,Mathlib command, meaning Lean's "Init" folder (with all descendant folders and files) and Mathlib's "Mathlib" folder (with all descendant folders and files) are indexed.
  2. informalize the parsed project
    Uses DeepSeek api. Costs around 100$ for Mathlib.
  3. create embeddings
    Uses e5-mistral-7b-instruct. Costs nothing, but takes around 3 days on their servers for Mathlib.

Example

1. Jixia

Suppose we have a .lean file with the theorem below, and we parse that file using Jixia.

/-- Our theorem -/
theorem odd_plus_odd (a b : Nat) : is_odd a → is_odd b → is_even (a + b) := by
  intro ha hb
  unfold is_odd is_even at *
  sorry

2. Informalize

To create the informalization, LeanSearch passes the following information to DeepSeek's LLM:

Theorem name: "odd_plus_odd"
Signature: "(a b : Nat) : is_odd a → is_odd b → is_even (a + b)"
Value:
  ":= by
    intro ha hb
    unfold is_odd is_even at *
    sorry"
Docstring: "Our theorem"
Kind: "theorem"
Docstring on top of this Lean file: "These are our theorems about natural numbers"
Neighbors:
  - even_plus_even: (a b : Nat) : is_even a → is_even b → is_even (a + b)
  - odd_plus_odd: (a b : Nat) : is_odd a → is_odd b → is_even (a + b)
  - sum_four_nums: (a b c d : Nat) : is_even a → is_even b → is_odd c → is_odd d → is_even (a + b + c + d)
  - is_odd: (n : Nat)
    Informal name: Odd natural number
    Informal description: A natural number \( n \) is odd if \( n \mod 2 = 1 \)
Dependencies:
  - is_even: (n : Nat)
    Informal name: Even natural number
    Informal description: A natural number \( n \) is even if the remainder when divided by 2 is 0, i.e., \( n \equiv 0 \mod 2 \).
  - is_odd: (n : Nat)
    Informal name: Odd natural number
    Informal description: A natural number \( n \) is odd if \( n \mod 2 = 1 \).

"Neighbors" here are quite literally the theorems that are located below and above of our theorem in our .lean file.
"Dependencies" here are the declarations directly mentioned in the signature and the body of the theorem. Technically, dependencies in the body of the theorem are uninteresting and could be removed (we can use what we like to prove the theorem!), though they can give one some context. Still - I would personally remove them from the informalization's input.

Notice the informalizations are in LaTeX. That's because LeanSearch specifically asks for the informalization to be in LaTeX in its user prompt to DeepSeek. 
As far as I see the decision to use LaTeX comes from LeanSearch's desire to display natural language descriptions in the interface. Most (all?) user queries are not in LaTeX, they are either in Lean's unicode, or in natural language - embeddings created from unicode informalizations would be better at matching the user's query.

3. Create embeddings

After the informalization (informal name + informal description) is in place, LeanSearch creates the embeddings from {kind} {name} {signature}\n{informal_name}: {informal_description}. In our example, that would be:

theorem ['even_plus_even']  (a b : Nat) : is_even a → is_even b → is_even (a + b)
Sum of Even Numbers is Even: For any natural numbers \( a \) and \( b \), if \( a \) is even and \( b \) is even, then \( a + b \) is even.