
Neural Theorem Proving: From Best-First Search to Gold-Medal Proofs

A comprehensive guide to neural theorem proving systems. From best-first search algorithms to planner-enhanced multi-agent architectures—understand how AI systems are achieving state-of-the-art results on formal mathematics.


The Quest for Automated Mathematical Proof

Automated theorem proving represents one of the oldest and most challenging problems in artificial intelligence. The dream is simple yet profound: build systems that can discover and verify mathematical proofs automatically. Unlike other AI domains where "good enough" might suffice, theorem proving demands absolute correctness—a proof is either valid or it isn't.

Recent breakthroughs have transformed this landscape. In 2025, neural theorem provers achieved results that seemed impossible only a few years ago: 95.08% on miniF2F (a benchmark of competition mathematics) and solutions to problems from the International Mathematical Olympiad. These systems don't just guess answers—they construct formal proofs that can be mechanically verified.

This post provides a comprehensive technical guide to BFS-Prover-V2, ByteDance's state-of-the-art theorem proving system for Lean 4. We'll explore every layer of the architecture:

  1. Foundations: Why theorem proving is fundamentally different from other AI tasks
  2. Best-First Search: The core algorithm that guides proof exploration
  3. The Proof Tree: Data structures for tracking search state
  4. The Planning Pipeline: How LLMs decompose hard theorems into tractable subgoals
  5. Training at Scale: Multi-stage expert iteration that overcomes performance plateaus
  6. Inference at Scale: Planner-enhanced multi-agent tree search
  7. Production Engineering: Timeout handling, parallelism, and robustness
Code
┌─────────────────────────────────────────────────────────────────────────┐
│              NEURAL THEOREM PROVING TIMELINE                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1960s-1990s    SYMBOLIC ERA                                            │
│  ───────────    Resolution theorem provers, ATP competitions            │
│                 Pure logic, no learning                                  │
│                                                                          │
│  2016-2019      NEURAL GUIDANCE EMERGES                                 │
│  ──────────     DeepMath, GamePad, early neural ATP                     │
│                 Neural networks guide symbolic search                    │
│                                                                          │
│  2020-2022      LANGUAGE MODELS ENTER                                   │
│  ──────────     GPT-f, PACT, LeanDojo                                   │
│                 Transformers generate proof steps                        │
│                                                                          │
│  2023-2024      SCALING BREAKTHROUGHS                                   │
│  ──────────     AlphaProof, DeepSeek-Prover, Llemma                     │
│                 RL training, expert iteration                            │
│                                                                          │
│  2025           STATE OF THE ART                                        │
│  ─────          BFS-Prover-V2: 95.08% miniF2F                           │
│                 Multi-agent tree search + planning                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part I: Understanding Theorem Proving

Why Theorem Proving is Different

Before diving into algorithms, we must understand why theorem proving poses unique challenges that distinguish it from other AI domains.

The verification asymmetry:

In most machine learning tasks, evaluation is expensive and approximate. Is this image classification correct? Is this translation good? These require human judgment or proxy metrics. Theorem proving is fundamentally different: verification is cheap and exact. Given a candidate proof, we can mechanically check its validity in milliseconds. A proof either type-checks in the proof assistant or it doesn't—there's no gray area.

This asymmetry creates a unique opportunity. We can generate many candidate proofs and filter to those that verify. The challenge shifts from "how do we know if we're right?" to "how do we efficiently explore the space of possible proofs?"

The search space explosion:

A mathematical proof is a sequence of logical steps. At each step, there might be hundreds of potentially applicable tactics (proof steps). A 10-step proof with 100 choices per step creates 100^10 = 10^20 possible paths. Exhaustive search is impossible.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    THE PROOF SEARCH SPACE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                        Theorem Statement                                 │
│                              │                                           │
│              ┌───────────────┼───────────────┐                          │
│              │               │               │                           │
│              ▼               ▼               ▼                           │
│          Tactic 1        Tactic 2        Tactic 3      ... (100+ options)│
│              │               │               │                           │
│         ┌────┴────┐     ┌────┴────┐     ┌────┴────┐                     │
│         │    │    │     │    │    │     │    │    │                     │
│         ▼    ▼    ▼     ▼    ▼    ▼     ▼    ▼    ▼    (100+ each)     │
│        ...  ...  ...   ...  ...  ...   ...  ...  ...                    │
│                                                                          │
│                      Search space: O(b^d)                                │
│                      b = branching factor (~100)                        │
│                      d = proof depth (~10-50)                           │
│                                                                          │
│              MOST PATHS LEAD TO DEAD ENDS                               │
│              Only tiny fraction reach valid proofs                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The credit assignment problem:

When a proof search fails after exploring thousands of paths, which decisions were mistakes? Was the initial approach wrong? Did we miss a crucial insight at step 5? Did we take a correct but inefficient path? Unlike games with clear win/loss signals, proof search failures provide sparse feedback.

Formal language constraints:

Proofs must be written in precise formal languages (Lean, Coq, Isabelle). These languages are unforgiving—a misplaced parenthesis or incorrect type annotation invalidates the entire proof. LLMs trained on informal mathematics must learn the exacting syntax of formal systems.

Proof Assistants and Lean 4

Modern theorem proving systems target proof assistants—interactive software environments where humans and computers collaborate on formal proofs. The most prominent include:

| Proof Assistant | Language | Notable Uses |
|-----------------|----------|--------------|
| Lean 4 | Functional | Mathlib (1M+ lines of math), BFS-Prover |
| Coq | Functional | CompCert (verified C compiler) |
| Isabelle | Functional | seL4 (verified OS kernel) |
| HOL Light | Functional | Flyspeck (Kepler conjecture) |

Why Lean 4 dominates AI theorem proving:

Lean 4 has emerged as the preferred target for neural theorem provers for several reasons:

  1. Mathlib: A vast library of formalized mathematics (undergraduate through research level) provides training data and useful lemmas
  2. LeanDojo: A Python interface that enables programmatic interaction with Lean, essential for reinforcement learning
  3. Modern design: Lean 4's metaprogramming capabilities enable sophisticated tactic automation
  4. Active community: Rapid development of both the system and mathematical libraries

The tactic state paradigm:

In Lean, proofs proceed by transforming a tactic state—a list of goals that remain to be proven. Each tactic transforms the current state into a new state (ideally with simpler or fewer goals).

Code
Initial state:
  ⊢ ∀ n : ℕ, n + 0 = n                    [Goal: prove this statement]

After "intro n":
  n : ℕ
  ⊢ n + 0 = n                              [Now proving for specific n]

After "rfl":
  No goals                                  [Proof complete!]

The art of theorem proving is selecting the right tactic at each step. This is where neural networks enter.
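For reference, here is the same proof written as a stand-alone Lean 4 file, using the two tactics from the trace above (the theorem name is illustrative):

Code
import Mathlib

theorem add_zero_example : ∀ n : ℕ, n + 0 = n := by
  intro n   -- move the universally quantified n into the context
  rfl       -- n + 0 reduces to n by the definition of addition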


Part II: Best-First Search for Theorem Proving

The Core Algorithm

Best-first search (the "BFS" in BFS-Prover) is a fundamental algorithm for exploring large search spaces. Unlike breadth-first search (which explores all nodes at depth d before depth d+1) or depth-first search (which goes deep before backtracking), best-first search maintains a priority queue and always expands the most promising node.

Why best-first search for theorem proving:

  1. Informed exploration: We can use neural networks to estimate which proof states are most likely to lead to success
  2. Anytime behavior: We can stop at any point and return the best proof found so far
  3. Natural parallelism: Multiple workers can explore different branches simultaneously
  4. Backtracking: Unlike greedy search, we can abandon unpromising paths and try alternatives

The priority function:

The key design decision in best-first search is the priority function—how do we score nodes to determine exploration order? BFS-Prover uses cumulative log probability:

\text{Priority}(n) = \sum_{i=1}^{d} \log P(\text{tactic}_i \mid \text{state}_{i-1})

Where d is the depth (the number of tactics on the path from the root to node n). This prioritizes proof paths where the neural network is confident in each step.

The depth reward parameter:

Raw cumulative log probability has a bias: longer paths accumulate more negative log probability (since each step multiplies probability < 1). This can make the search prefer short, dead-end paths over longer successful ones.

BFS-Prover introduces a depth reward parameter α that adjusts priorities:

\text{AdjustedPriority}(n) = \frac{\text{Priority}(n)}{d^{\alpha}}

| α value | Effect | Use case |
|---------|--------|----------|
| 0.0 | Pure cumulative logprob | Prefer confident paths (default) |
| 1.0 | Average logprob per step | Balance depth vs. confidence |
| 2.0 | Strong depth normalization | Encourage deeper exploration |
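
In code, the adjusted priority is a one-liner. The sketch below (the function name is made up for illustration) shows how α trades confidence against depth:

Python
import math

def adjusted_priority(cumulative_logprob: float, depth: int, alpha: float) -> float:
    """Higher is better; alpha interpolates between raw confidence and depth-normalized scores."""
    if depth == 0:
        return 0.0                                  # root node: no tactics applied yet
    return cumulative_logprob / (depth ** alpha)

# A 4-step path whose steps each had probability ~0.5:
logprob = 4 * math.log(0.5)                         # ≈ -2.77
print(adjusted_priority(logprob, 4, alpha=0.0))     # -2.77  (pure cumulative logprob)
print(adjusted_priority(logprob, 4, alpha=1.0))     # ≈ -0.69 (average logprob per step)

Because log probabilities are negative, dividing by d^α with α > 0 makes deep paths less negative, i.e. higher priority, which counteracts the short-path bias described above.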

The Search Loop in Detail

The BFS-Prover search operates in two distinct phases when a plan exists:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              TWO-PHASE SEARCH WITH PLANNING                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PHASE 1: PLAN EXECUTION (if plan exists)                               │
│  ─────────────────────────────────────────                               │
│                                                                          │
│  Plan: [have h1: ..., have h2: ..., have h3: ..., have h4: ...]        │
│                                                                          │
│  For each plan_step in plan:                                            │
│    1. Check ProgressCache for cached progress                           │
│    2. If cached_progress > local_progress:                              │
│       → Load proof trajectory from cache                                │
│       → extend_tree() to replay solved steps                            │
│    3. Apply plan_step + "sorry" to advance context                      │
│    4. Run BFS focused on current subgoal (pp1 mode)                     │
│    5. If subgoal solved:                                                │
│       → Update ProgressCache with trajectory                            │
│       → progress += 1                                                   │
│       → reset_search_root() for next subgoal                            │
│    6. If stuck:                                                         │
│       → Trigger dynamic replanning                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PHASE 2: FINAL SEARCH (after plan completion OR no plan)               │
│  ────────────────────────────────────────────────────────                │
│                                                                          │
│  Run unrestricted BFS on remaining goals                                │
│  Uses full tactic state (pp mode, not pp1)                              │
│  Continues until:                                                        │
│    - Proof found (root.status == PROVED)                                │
│    - All paths exhausted (root.status == FAILED)                        │
│    - Timeout exceeded                                                    │
│    - Priority queue empty                                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The main expansion loop:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              BEST-FIRST SEARCH EXPANSION LOOP                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INITIALIZE:                                                             │
│  ───────────                                                             │
│  root = InternalNode(state=initial_state, depth=0, cumulative_logprob=0)│
│  priority_queue = AsyncPriorityQueue()                                  │
│  nodes = {initial_state: root}  # State → Node cache for deduplication  │
│  priority_queue.put((-0.0, root))  # Negative for max-heap behavior     │
│                                                                          │
│  TERMINATION CONDITIONS:                                                 │
│  ───────────────────────                                                 │
│  • priority_queue.empty()                                               │
│  • total_time > timeout (default 600s)                                  │
│  • root.status == PROVED                                                │
│  • root.status == FAILED                                                │
│                                                                          │
│  MAIN LOOP:                                                              │
│  ──────────                                                              │
│                                                                          │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  1. DEQUEUE                                                  │     │
│     │     priority, node = priority_queue.get()                   │     │
│     │     Skip if: node.out_edges is not None (already explored)  │     │
│     │     Skip if: node.status != OPEN                            │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  2. GET STATE REPRESENTATION                                 │     │
│     │     Plan mode: tactic_state = node.state.pp1                │     │
│     │       (First goal only, case statements removed)            │     │
│     │     Normal mode: tactic_state = node.state.pp               │     │
│     │       (Full pretty-print of all goals)                      │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  3. GENERATE TACTICS                                         │     │
│     │     suggestions = tac_gen.generate(tactic_state, params)    │     │
│     │     Returns: [(tactic₁, logprob₁), ..., (tacticₙ, logprobₙ)]│     │
│     │     Deduplication: keep first occurrence of each unique     │     │
│     │     Filtering: remove sorry, admit, native_decide, bugs     │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  4. EXECUTE EACH TACTIC                                      │     │
│     │     For each (tactic, logprob) in suggestions:              │     │
│     │       response, exec_time = run_tactic(node, tactic)        │     │
│     │                                                              │     │
│     │       response types:                                        │     │
│     │         TacticState → new proof state                       │     │
│     │         ProofFinished → no remaining goals!                 │     │
│     │         ProofGivenUp → proof abandoned                      │     │
│     │         LeanError → tactic failed                           │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  5. CREATE NODES AND EDGES                                   │     │
│     │                                                              │     │
│     │     If response is TacticState:                             │     │
│     │       If state in nodes: reuse existing node (dedup)        │     │
│     │       Else: create new InternalNode                         │     │
│     │                                                              │     │
│     │     If response is ProofFinished/ProofGivenUp:              │     │
│     │       Create ProofFinishedNode                              │     │
│     │                                                              │     │
│     │     If response is LeanError:                               │     │
│     │       Create ErrorNode                                      │     │
│     │                                                              │     │
│     │     Create Edge(tactic, src=node, dst=new_node,             │     │
│     │                 time=exec_time, logprob=logprob)            │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  6. ADD TO QUEUE                                             │     │
│     │     For each new InternalNode with status == OPEN:          │     │
│     │       new_priority = cumulative_logprob / (depth^α)         │     │
│     │       priority_queue.put((-new_priority, new_node))         │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  7. UPDATE STATUS                                            │     │
│     │     node.out_edges = [all created edges]  # Mark explored   │     │
│     │     For each edge.dst: edge.dst.in_edges.append(edge)       │     │
│     │     Trigger status propagation (see next section)           │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  8. EARLY EXIT CHECK (Plan Mode)                             │     │
│     │     If has_plan and only 1 remaining goal:                  │     │
│     │       Break to advance to next plan step                    │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
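
Condensed into Python, the expansion loop looks roughly like this. Class and method names follow the diagrams above and the node/edge sketches in Part III, not the actual source; plan mode, in-edge bookkeeping, and error handling are omitted:

Python
import heapq
import itertools
import time

def best_first_search(root, tac_gen, run_tactic, alpha=0.0, timeout=600):
    tie = itertools.count()                       # tie-breaker so the heap never compares nodes
    queue = [(-0.0, next(tie), root)]             # negated priorities give max-heap behavior
    nodes = {root.state: root}                    # state -> node, for deduplication
    start = time.monotonic()

    while queue and time.monotonic() - start < timeout:
        if root.status in (Status.PROVED, Status.FAILED):
            break
        _, _, node = heapq.heappop(queue)
        if node.out_edges is not None or node.status != Status.OPEN:
            continue                              # already explored or already resolved

        edges = []
        for tactic, logprob in tac_gen.generate(node.state.pp):
            response, exec_time = run_tactic(node, tactic)
            if isinstance(response, TacticState):
                child = nodes.get(response)
                if child is None:                 # new state: create it and cache it
                    child = InternalNode(state=response, depth=node.depth + 1,
                                         cumulative_logprob=node.cumulative_logprob + logprob)
                    nodes[response] = child
            elif isinstance(response, (ProofFinished, ProofGivenUp)):
                child = ProofFinishedNode(response, node.depth + 1)
            else:                                 # LeanError (including tactic timeouts)
                child = ErrorNode(response, node.depth + 1)

            edges.append(Edge(tactic, src=node, dst=child,
                              time=exec_time, logprob=logprob))
            if isinstance(child, InternalNode) and child.status == Status.OPEN:
                priority = child.cumulative_logprob / (child.depth ** alpha)
                heapq.heappush(queue, (-priority, next(tie), child))

        node.out_edges = edges                    # freezes the node; status and distance
                                                  # propagation are triggered at this point
    return root.status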

Part III: The Proof Tree Data Structure

Node Types

BFS-Prover maintains an explicit tree data structure representing the explored proof space. There are three types of nodes:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PROOF TREE NODE TYPES                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                     InternalNode                                 │    │
│  │                                                                  │    │
│  │  Non-terminal state with remaining goals to prove               │    │
│  │                                                                  │    │
│  │  Attributes:                                                     │    │
│  │  ───────────                                                     │    │
│  │  • state: TacticState                                           │    │
│  │      The current proof state (goals + hypotheses)               │    │
│  │                                                                  │    │
│  │  • depth: int                                                   │    │
│  │      Distance from root node (root has depth=0)                 │    │
│  │                                                                  │    │
│  │  • cumulative_logprob: float                                    │    │
│  │      Sum of log probabilities along path from root              │    │
│  │                                                                  │    │
│  │  • _out_edges: Optional[List[Edge]]                             │    │
│  │      None if unexplored, list of edges if explored              │    │
│  │      Setting this "freezes" the node's exploration              │    │
│  │                                                                  │    │
│  │  • _in_edges: List[Edge]                                        │    │
│  │      Edges pointing TO this node (may grow as other paths       │    │
│  │      discover this state through node reuse)                    │    │
│  │                                                                  │    │
│  │  • _status: Status (OPEN | PROVED | FAILED)                     │    │
│  │      Computed lazily, cached with dirty flag                    │    │
│  │                                                                  │    │
│  │  • _distance_to_proof: float                                    │    │
│  │      Minimum edges to reach ProofFinishedNode (inf if unknown)  │    │
│  │                                                                  │    │
│  │  • _marker: bool                                                │    │
│  │      Internal flag for marking subgoal boundaries in plan mode  │    │
│  │                                                                  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                   ProofFinishedNode                              │    │
│  │                                                                  │    │
│  │  Terminal success - no remaining goals                          │    │
│  │                                                                  │    │
│  │  Attributes:                                                     │    │
│  │  ───────────                                                     │    │
│  │  • inner: Union[ProofFinished, ProofGivenUp]                    │    │
│  │      The Lean response indicating completion                    │    │
│  │                                                                  │    │
│  │  • depth: int                                                   │    │
│  │                                                                  │    │
│  │  Constants:                                                      │    │
│  │  ──────────                                                      │    │
│  │  • status: PROVED (always)                                      │    │
│  │  • distance_to_proof: 0 (already at proof)                      │    │
│  │  • is_terminal: True                                            │    │
│  │                                                                  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                      ErrorNode                                   │    │
│  │                                                                  │    │
│  │  Terminal failure - tactic error or timeout                     │    │
│  │                                                                  │    │
│  │  Attributes:                                                     │    │
│  │  ───────────                                                     │    │
│  │  • inner: LeanError                                             │    │
│  │      Error message from Lean or timeout                         │    │
│  │                                                                  │    │
│  │  • depth: int                                                   │    │
│  │                                                                  │    │
│  │  Constants:                                                      │    │
│  │  ──────────                                                      │    │
│  │  • status: FAILED (always)                                      │    │
│  │  • distance_to_proof: math.inf (unreachable)                    │    │
│  │  • is_terminal: True                                            │    │
│  │                                                                  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
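
A minimal Python rendering of these three node types (simplified; the real classes also carry the plan-mode marker and the propagation machinery described below):

Python
import math
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Status(Enum):
    OPEN = "open"        # not yet determined
    PROVED = "proved"    # some path reaches a ProofFinishedNode
    FAILED = "failed"    # every path dead-ends

@dataclass
class InternalNode:
    state: "TacticState"                        # goals + hypotheses at this point
    depth: int                                  # distance from the root
    cumulative_logprob: float                   # sum of log probs along the path
    _out_edges: Optional[List["Edge"]] = None   # None until explored
    _in_edges: List["Edge"] = field(default_factory=list)
    _status: Status = Status.OPEN
    _distance_to_proof: float = math.inf
    is_terminal: bool = False

    @property
    def status(self) -> Status:
        return self._status

    @property
    def distance_to_proof(self) -> float:
        return self._distance_to_proof

@dataclass
class ProofFinishedNode:
    inner: object                               # ProofFinished / ProofGivenUp response
    depth: int
    status: Status = Status.PROVED              # terminal success
    distance_to_proof: float = 0.0
    is_terminal: bool = True

@dataclass
class ErrorNode:
    inner: object                               # LeanError (or timeout) response
    depth: int
    status: Status = Status.FAILED              # terminal failure
    distance_to_proof: float = math.inf
    is_terminal: bool = True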

Edge Structure

Edges represent tactic applications connecting nodes:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                         EDGE STRUCTURE                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                         Edge                                     │    │
│  │                                                                  │    │
│  │  Represents a single tactic application                         │    │
│  │                                                                  │    │
│  │  Attributes:                                                     │    │
│  │  ───────────                                                     │    │
│  │  • tactic: str                                                  │    │
│  │      The tactic text (e.g., "simp [Nat.add_zero]")              │    │
│  │                                                                  │    │
│  │  • src: InternalNode                                            │    │
│  │      Source node (where tactic was applied)                     │    │
│  │                                                                  │    │
│  │  • dst: Node                                                    │    │
│  │      Destination node (result of applying tactic)               │    │
│  │      Can be InternalNode, ProofFinishedNode, or ErrorNode       │    │
│  │                                                                  │    │
│  │  • time: float                                                  │    │
│  │      Wall-clock execution time in seconds                       │    │
│  │                                                                  │    │
│  │  • logprob: float                                               │    │
│  │      Per-token average log probability from model               │    │
│  │      For plan tactics: logprob = 1.0 (placeholder)              │    │
│  │                                                                  │    │
│  │  Methods:                                                        │    │
│  │  ────────                                                        │    │
│  │  • distance_to_proof(): 1 + dst.distance_to_proof               │    │
│  │                                                                  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  LOGPROB CALCULATION:                                                    │
│  ────────────────────                                                    │
│  Raw from vLLM: cumulative_logprob over all tokens                      │
│  Normalized: cumulative_logprob / len(token_ids)                        │
│  This gives per-token average, normalizing for tactic length            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
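
The edge itself is a small record; the one non-obvious detail is the per-token logprob normalization. A sketch with field names mirroring the diagram:

Python
from dataclasses import dataclass

@dataclass
class Edge:
    tactic: str          # e.g. "simp [Nat.add_zero]"
    src: "InternalNode"  # node the tactic was applied to
    dst: object          # InternalNode, ProofFinishedNode, or ErrorNode
    time: float          # wall-clock execution time in seconds
    logprob: float       # per-token average log probability (1.0 for plan steps)

    def distance_to_proof(self) -> float:
        return 1 + self.dst.distance_to_proof

# Normalizing a vLLM completion so longer tactics are not penalized:
#   per_token_logprob = output.cumulative_logprob / len(output.token_ids)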

Status Propagation

Status propagates upward from leaves to root through in-edges:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    STATUS PROPAGATION ALGORITHM                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STATUS VALUES:                                                          │
│  ──────────────                                                          │
│  • OPEN: Not yet determined (default for InternalNode)                  │
│  • PROVED: At least one path leads to ProofFinishedNode                 │
│  • FAILED: All paths lead to ErrorNode or explored dead ends            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ALGORITHM (_recompute_status):                                          │
│  ──────────────────────────────                                          │
│                                                                          │
│  1. If node.status != OPEN:                                             │
│       return  # Status is immutable once set                            │
│                                                                          │
│  2. If node.out_edges is None:                                          │
│       return  # Not yet explored, remain OPEN                           │
│                                                                          │
│  3. Check children status:                                              │
│                                                                          │
│       new_status = FAILED  # Assume worst case                          │
│                                                                          │
│       For each edge in node.out_edges:                                  │
│         child_status = edge.dst.status                                  │
│                                                                          │
│         If child_status == PROVED:                                      │
│           new_status = PROVED                                           │
│           break  # One proved child is enough!                          │
│                                                                          │
│         If child_status == OPEN:                                        │
│           new_status = OPEN  # Can't be FAILED if any child is OPEN    │
│                                                                          │
│  4. If new_status != OPEN:                                              │
│       node._status = new_status                                         │
│                                                                          │
│       # Propagate upward through in-edges                               │
│       For each in_edge in node._in_edges:                               │
│         in_edge.src._recompute_status()  # Recursive call              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INVARIANTS:                                                         │
│  ───────────────                                                         │
│  • Status never changes once set to PROVED or FAILED                    │
│  • PROVED propagates immediately (any child proved → parent proved)     │
│  • FAILED only propagates when ALL children are FAILED                  │
│  • Proof found when root.status == PROVED                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE:                                                                │
│                                                                          │
│         [Root: OPEN]                   [Root: PROVED] ←── propagates up │
│          /       \                      /       \                        │
│     [OPEN]      [OPEN]     →→→    [PROVED]    [OPEN]                   │
│      /  \         |                 /  \         |                       │
│   [ERR] [OPEN]  [ERR]           [ERR] [PF]    [ERR]                     │
│                                          ↑                               │
│                              ProofFinished found!                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
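
Using the Status enum and fields from the node sketch above, the propagation step is a short recursive function:

Python
def recompute_status(node: "InternalNode") -> None:
    """Re-derive a node's status from its children, then push changes toward the root."""
    if node._status != Status.OPEN:
        return                              # PROVED / FAILED are immutable once set
    if node._out_edges is None:
        return                              # unexplored: stays OPEN

    new_status = Status.FAILED              # pessimistic default
    for edge in node._out_edges:
        child_status = edge.dst.status
        if child_status == Status.PROVED:
            new_status = Status.PROVED      # one proved child suffices
            break
        if child_status == Status.OPEN:
            new_status = Status.OPEN        # cannot conclude FAILED while a child is OPEN

    if new_status != Status.OPEN:
        node._status = new_status
        for in_edge in node._in_edges:      # propagate upward through in-edges
            recompute_status(in_edge.src)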

Distance-to-Proof Tracking

Once a proof is found, we need to extract the shortest path:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DISTANCE-TO-PROOF PROPAGATION                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ALGORITHM (_recompute_distance_to_proof):                              │
│  ─────────────────────────────────────────                               │
│                                                                          │
│  1. If node.out_edges is None or empty:                                 │
│       distance = infinity                                               │
│                                                                          │
│  2. Else:                                                               │
│       distance = min(edge.distance_to_proof() for edge in out_edges)   │
│                                                                          │
│       where edge.distance_to_proof() = 1 + edge.dst.distance_to_proof  │
│                                                                          │
│  3. If distance improved (new < old):                                   │
│       node._distance_to_proof = distance                                │
│                                                                          │
│       # Propagate upward                                                │
│       For each in_edge in node._in_edges:                               │
│         in_edge.src._recompute_distance_to_proof()                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PROOF EXTRACTION:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  Starting from root with distance d:                                    │
│    path = []                                                            │
│    node = root                                                          │
│    While node is not terminal:                                          │
│      best_edge = argmin(edge.distance_to_proof() for edge in out_edges)│
│      path.append((node.state, best_edge.tactic))                       │
│      node = best_edge.dst                                               │
│    return path                                                          │
│                                                                          │
│  Result: Optimal (shortest) proof among all discovered paths            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
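
The distance update follows the same pattern; a sketch built on the node and edge classes above:

Python
import math

def recompute_distance_to_proof(node: "InternalNode") -> None:
    """Pull the minimum proof distance up from the children, propagating improvements."""
    if not node._out_edges:
        distance = math.inf                         # unexplored node or dead end
    else:
        distance = min(edge.distance_to_proof() for edge in node._out_edges)
    if distance < node._distance_to_proof:          # only propagate improvements
        node._distance_to_proof = distance
        for in_edge in node._in_edges:
            recompute_distance_to_proof(in_edge.src)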

Node Reuse (Deduplication)

Different tactic sequences can reach the same proof state. BFS-Prover maintains a cache to avoid redundant exploration:

Code
nodes: Dict[TacticState, Node] = {}     # Maps each visited state to its node

# When a tactic produces a TacticState `response`:
if response in nodes:
    dst_node = nodes[response]          # Reuse the existing node
    # The new edge is simply added to dst_node's in_edges
else:
    dst_node = InternalNode(...)        # First time seeing this state
    nodes[response] = dst_node          # Cache it for future deduplication

This dramatically reduces memory usage and computation when multiple paths converge to the same state.


Part IV: Tactic Generation

The TacticGenerator Actor

Tactic generation is handled by a Ray actor wrapping vLLM:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TACTIC GENERATOR                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RAY ACTOR: TacticGenerator                                             │
│  ──────────────────────────                                              │
│                                                                          │
│  Initialization:                                                         │
│  • Wraps vLLM AsyncLLMEngine                                            │
│  • max_num_batched_tokens: 8192                                         │
│  • max_num_seqs: 256 (concurrent sequences)                             │
│  • max_model_len: 16384 (context window)                                │
│  • enable_chunked_prefill: True                                         │
│  • tensor_parallel_size: num_gpus                                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  INPUT FORMAT:                                                           │
│  ─────────────                                                           │
│                                                                          │
│  The model receives: tactic_state + ":::"                               │
│                                                                          │
│  Example input:                                                          │
│  ┌──────────────────────────────────────────┐                           │
│  │ n : ℕ                                    │                           │
│  │ h : n > 0                                │                           │
│  │ ⊢ n + 0 = n:::                           │                           │
│  └──────────────────────────────────────────┘                           │
│                                                                          │
│  The ":::" delimiter signals "generate tactic here"                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SAMPLING PARAMETERS:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  │ Parameter    │ Default │ Purpose                                │    │
│  ├──────────────┼─────────┼────────────────────────────────────────┤    │
│  │ n            │ 16      │ Number of tactic samples               │    │
│  │ temperature  │ 0.7     │ Randomness in sampling                 │    │
│  │ top_p        │ 1.0     │ Nucleus sampling threshold             │    │
│  │ max_tokens   │ 2048    │ Maximum tactic length                  │    │
│  │ logprobs     │ 1       │ Return log probabilities               │    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  OUTPUT PROCESSING:                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  1. For each generated sequence:                                        │
│       text = output.text.strip()                                        │
│       logprob = cumulative_logprob / len(token_ids)  # Normalize        │
│                                                                          │
│  2. Filter out invalid tactics:                                         │
│       • Contains "sorry" → proof placeholder, not real tactic           │
│       • Contains "admit" → proof placeholder                            │
│       • Contains "native_decide" → can cause infinite loops             │
│                                                                          │
│  3. Filter buggy patterns:                                              │
│       • ("rcases" OR "cases'") AND "?_" → incomplete case split         │
│       • "simpa" AND (" _" OR "_ " OR "_," OR ",_") → malformed          │
│                                                                          │
│  4. Return: List[(tactic_text, per_token_logprob)]                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
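
A simplified, synchronous sketch of the generation step. The production system wraps vLLM's AsyncLLMEngine inside a Ray actor and also applies the bug-pattern filters listed above; the model path and function name here are placeholders:

Python
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/prover-checkpoint")        # placeholder model path
params = SamplingParams(n=16, temperature=0.7, top_p=1.0,
                        max_tokens=2048, logprobs=1)

BANNED = ("sorry", "admit", "native_decide")        # placeholders / unsafe tactics

def generate_tactics(tactic_state: str):
    """Sample candidate tactics for one proof state; return (tactic, logprob) pairs."""
    prompt = tactic_state + ":::"                   # ":::" marks "generate tactic here"
    outputs = llm.generate([prompt], params)[0].outputs
    results, seen = [], set()
    for out in outputs:
        text = out.text.strip()
        if not text or text in seen or any(b in text for b in BANNED):
            continue                                # dedupe and drop invalid tactics
        seen.add(text)
        per_token = out.cumulative_logprob / max(len(out.token_ids), 1)
        results.append((text, per_token))
    return results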

State Representation: pp vs pp1

BFS-Prover has two modes for representing the tactic state:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    STATE REPRESENTATION MODES                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FULL MODE (pp) - Used when no plan or in final search                  │
│  ─────────────────────────────────────────────────────                   │
│                                                                          │
│  Shows complete tactic state with all goals:                            │
│                                                                          │
│  ┌──────────────────────────────────────────┐                           │
│  │ case h1                                  │                           │
│  │ n : ℕ                                    │                           │
│  │ h : n > 0                                │                           │
│  │ ⊢ n + 0 = n                              │                           │
│  │                                          │                           │
│  │ case h2                                  │                           │
│  │ m : ℕ                                    │                           │
│  │ ⊢ m * 1 = m                              │                           │
│  │                                          │                           │
│  │ case h3                                  │                           │
│  │ ...more goals...                         │                           │
│  └──────────────────────────────────────────┘                           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FOCUS MODE (pp1) - Used during plan execution                          │
│  ─────────────────────────────────────────────                           │
│                                                                          │
│  Shows only the first goal, cleaned up:                                 │
│                                                                          │
│  Processing via custom property:                                        │
│    1. Extract first paragraph (before "\n\n")                           │
│    2. Remove "case ..." lines via regex                                 │
│    3. Remove "this : True" declarations                                 │
│    4. Strip whitespace                                                  │
│                                                                          │
│  Result:                                                                 │
│  ┌──────────────────────────────────────────┐                           │
│  │ n : ℕ                                    │                           │
│  │ h : n > 0                                │                           │
│  │ ⊢ n + 0 = n                              │                           │
│  └──────────────────────────────────────────┘                           │
│                                                                          │
│  WHY FOCUS MODE HELPS:                                                   │
│  ─────────────────────                                                   │
│  • Reduces noise from irrelevant goals                                  │
│  • Model focuses on immediate subgoal                                   │
│  • Shorter context = faster generation                                  │
│  • Dramatically improves tactic relevance when following a plan         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
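
The pp1 processing amounts to a few string operations. A minimal sketch of the four steps listed above:

Python
import re

def pp1(full_state: str) -> str:
    """Focus-mode view: first goal only, with case markers and trivial facts removed."""
    first = full_state.split("\n\n", 1)[0]                                # 1. first paragraph only
    first = re.sub(r"^case .*\n?", "", first, flags=re.MULTILINE)         # 2. drop "case ..." lines
    first = re.sub(r"^this : True\n?", "", first, flags=re.MULTILINE)     # 3. drop "this : True"
    return first.strip()                                                  # 4. strip whitespace

state = "case h1\nn : ℕ\nh : n > 0\n⊢ n + 0 = n\n\ncase h2\nm : ℕ\n⊢ m * 1 = m"
print(pp1(state))   # -> "n : ℕ\nh : n > 0\n⊢ n + 0 = n"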

Tactic Execution and Timeout

Executing tactics in Lean requires careful timeout handling:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TACTIC EXECUTION                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FUNCTION: run_tac_cpu_timeout(dojo, state, tactic, ...)                │
│  ───────────────────────────────────────────────────────                 │
│                                                                          │
│  Parameters:                                                             │
│  • cpu_sec_limit: 10.0 (default) - Max CPU time                         │
│  • wall_sec_limit: 15.0 (1.5x CPU) - Max wall-clock time                │
│  • poll_interval: 0.5 - How often to check limits                       │
│  • focus_mode: bool - Wrap tactic in "focus" command                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DUAL TIMEOUT MONITORING:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  A background thread monitors BOTH limits simultaneously:               │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │  Monitor Thread:                                                 │    │
│  │                                                                  │    │
│  │  While tactic_running:                                          │    │
│  │    cpu_time = process.cpu_times().user + .system               │    │
│  │    cpu_time += sum(child.cpu_times() for child in children)    │    │
│  │    wall_time = time.monotonic() - start                        │    │
│  │                                                                  │    │
│  │    If cpu_time > cpu_sec_limit:                                 │    │
│  │      raise LeanCPUTimedOut(limit_type="cpu", ...)              │    │
│  │                                                                  │    │
│  │    If wall_time > wall_sec_limit:                               │    │
│  │      raise LeanCPUTimedOut(limit_type="wall", ...)             │    │
│  │      killed_hard = True  # Process tree will be killed         │    │
│  │                                                                  │    │
│  │    sleep(poll_interval)                                         │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TIMEOUT OUTCOMES:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  Soft timeout (CPU limit):                                              │
│    → Return LeanError with timeout message                              │
│    → Create ErrorNode, continue search                                  │
│                                                                          │
│  Hard timeout (wall limit, Lean unresponsive):                          │
│    → Kill entire process tree (SIGKILL)                                 │
│    → Raise DojoRestartRequired                                          │
│    → Caller creates fresh Dojo instance                                 │
│    → Retry up to MAX_DOJO_RESTARTS (3) times                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY DUAL MONITORING:                                                    │
│  ────────────────────                                                    │
│  • CPU time catches infinite loops in Lean                              │
│  • Wall time catches I/O hangs, deadlocks, scheduling issues            │
│  • Child process accounting prevents timeout evasion via subprocess     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
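
A simplified version of the monitoring thread, using psutil for CPU accounting. Helper names are assumed; the real code also kills the process tree and raises DojoRestartRequired on a hard timeout:

Python
import time
import threading
import psutil

def monitor_lean_process(pid: int, stop_event: threading.Event, on_timeout,
                         cpu_sec_limit: float = 10.0, wall_sec_limit: float = 15.0,
                         poll_interval: float = 0.5) -> None:
    """Watch a Lean process and report whichever budget (CPU or wall clock) is exceeded first."""
    proc = psutil.Process(pid)
    start = time.monotonic()
    while not stop_event.is_set():                      # tactic still running
        try:
            times = proc.cpu_times()
            cpu = times.user + times.system
            for child in proc.children(recursive=True): # include subprocess CPU time
                ct = child.cpu_times()
                cpu += ct.user + ct.system
        except psutil.NoSuchProcess:
            return                                      # process already exited
        if cpu > cpu_sec_limit:
            on_timeout("cpu")                           # soft timeout: ErrorNode, keep searching
            return
        if time.monotonic() - start > wall_sec_limit:
            on_timeout("wall")                          # hard timeout: kill tree, restart Dojo
            return
        time.sleep(poll_interval)

# Typical use: threading.Thread(target=monitor_lean_process, args=(pid, stop, handler)).start()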

Part V: The Planning Pipeline

Why Planning Matters

Best-first search alone achieves strong results—but it struggles with theorems requiring long proof chains or non-obvious intermediate lemmas. The search space explodes exponentially with depth, and the neural network's confidence becomes unreliable for distant goals.

The insight: hierarchical decomposition.

Instead of searching for the entire proof at once, first decompose the theorem into intermediate subgoals. Each subgoal is easier to prove, and the planning step can leverage a more powerful reasoning model.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PLANNING IMPACT ON PERFORMANCE                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  WITHOUT PLANNING                    WITH PLANNING                      │
│  ─────────────────                   ─────────────                       │
│                                                                          │
│  Complex theorem                     Complex theorem                     │
│       │                                   │                              │
│       ▼                                   ▼                              │
│  ┌─────────────┐                   ┌──────────────┐                     │
│  │ Direct BFS  │                   │   Planner    │                     │
│  │             │                   │  (GPT-4/LLM) │                     │
│  │ Depth: 15+  │                   └──────┬───────┘                     │
│  │ Many dead   │                          │                             │
│  │ ends        │                   Decompose into                       │
│  └─────────────┘                   [h1, h2, h3, h4, h5]                 │
│                                          │                              │
│                              ┌───────────┼───────────┐                  │
│                              ▼           ▼           ▼                  │
│                          ┌──────┐   ┌──────┐   ┌──────┐                │
│                          │ BFS  │   │ BFS  │   │ BFS  │                │
│                          │ d=3  │   │ d=3  │   │ d=3  │                │
│                          └──────┘   └──────┘   └──────┘                │
│                                                                          │
│  BFS-Prover-V2-32B:                 BFS-Prover-V2-32B + Planner:        │
│  86.1% on miniF2F-test              95.08% on miniF2F-test              │
│                                                                          │
│                     +8.98% absolute improvement!                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Plan Structure and Format

A plan is a sequence of `have` statements—intermediate claims that bridge the theorem's hypotheses to its conclusion.

Example theorem:

Code
theorem singapore2019_r1_p7 (x : ℝ) (hx : Real.tan x = 5) :
    (6 + Real.sin (2 * x)) / (1 + Real.cos (2 * x)) = 83 := by

Generated plan:

Code
have h₁ : Real.sin x = 5 * Real.cos x
have h₂ : Real.sin x ^ 2 = 25 * Real.cos x ^ 2
have h₃ : 26 * Real.cos x ^ 2 = 1
have hsin2x_val : Real.sin (2 * x) = (5 : ℝ) / (13 : ℝ)
have hcos2x_val : Real.cos (2 * x) = -(12 : ℝ) / (13 : ℝ)

Plan JSON format (demo_plan.json):

JSON
{
  "mathd_algebra_17": {
    "plan": [
      "have h_sqrt_simplify : Real.sqrt (16 + 16 * a) = 4 * Real.sqrt (1 + a)",
      "have h_eq_transformed : Real.sqrt (4 + 4 * Real.sqrt (1 + a)) + Real.sqrt (1 + Real.sqrt (1 + a)) = 6",
      "have h_sum_eq : 3 * Real.sqrt (1 + Real.sqrt (1 + a)) = 6"
    ],
    "progress": 0
  }
}

The `progress` field tracks how many subgoals have been solved so far (0 = none) and persists across restarts.
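
To make the format concrete, here is a minimal sketch of reading such a plan file and skipping already-solved subgoals. The helper name is an illustration, not the repository's actual API; only the fields shown above are assumed.

Python
import json

# Hypothetical helper: return the subgoals that still need proving for a
# theorem, based on the `progress` counter in the plan file.
def load_remaining_subgoals(plan_path: str, theorem_id: str) -> list[str]:
    with open(plan_path) as f:
        plans = json.load(f)
    entry = plans[theorem_id]
    return entry["plan"][entry["progress"]:]  # solved subgoals are sliced off

remaining = load_remaining_subgoals("demo_plan.json", "mathd_algebra_17")
print(remaining)  # all three `have` steps while progress == 0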

The Planning Prompt

The planner is a general-purpose reasoning LLM prompted with carefully designed instructions. The prompt must specify the output format precisely and list mandatory rules for valid Lean 4 syntax.

Critical rules for Lean 4 (from prompt.yaml):

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              MANDATORY LEAN 4 RULES IN PLANNER PROMPTS                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ★★★ RULE 1: MOST CRITICAL - EXPLICIT SET/FINSET TYPES ★★★              │
│  ───────────────────────────────────────────────────────                 │
│  This causes ~40% of plan failures if violated!                         │
│                                                                          │
│  ✓ Correct:   ({-1, 0, 1} : Set ℤ)                                      │
│  ✓ Correct:   (Finset.Icc 1 42 : Finset ℕ)                              │
│  ✗ Wrong:     {-1, 0, 1}                                                │
│  ✗ Wrong:     Finset.Icc 1 42                                           │
│                                                                          │
│  Reason: Lean cannot infer element type from set literal notation       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 2: OMIT THE PROOF                                                  │
│  Never provide the proof (no `by` clause). Only the `have` statement.  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 3: VALID LEAN 4 CODE                                               │
│  Output must be type-checkable in Lean 4.10.0 environment.              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 4: USE EXISTING NAMES                                              │
│  Use exact lemma/definition names from Mathlib. Don't invent names.     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 5: NO UNDECLARED VARIABLES                                         │
│  Only use variables from the theorem statement.                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 6: EXPLICIT MULTIPLICATION                                         │
│  ✓ Correct:   a * x                                                     │
│  ✗ Wrong:     ax                                                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 7: NO CHAINED INEQUALITIES                                         │
│  ✓ Correct:   a ≤ x ∧ x ≤ b                                             │
│  ✗ Wrong:     a ≤ x ≤ b                                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 8: CORRECT LOGARITHM FUNCTION                                      │
│  Real.log = natural log only                                            │
│  ✓ Correct:   Real.logb 2 8     (log base 2 of 8)                       │
│  ✗ Wrong:     Real.log 2 8                                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 9: FACTORIAL NOTATION                                              │
│  ✓ Correct:   (n)! or Nat.factorial n                                   │
│  ✗ Wrong:     n!                                                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 10: EXPLICIT NUMERIC TYPE ANNOTATIONS                              │
│  Division: (1 : ℝ) / 2 = 0.5, but (1 : ℤ) / 2 = 0                       │
│  Subtraction: (1 : ℤ) - 2 = -1, but (1 : ℕ) - 2 = 0                     │
│  ✓ Correct:   (a : ℝ) / b, (n : ℤ) - m                                  │
│  ✗ Wrong:     a / b, n - m (ambiguous)                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 11: NO INTERVAL NOTATION                                           │
│  ✓ Correct:   a ≤ x ∧ x ≤ b                                             │
│  ✗ Wrong:     Icc a b                                                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 12: COMPLEX NUMBERS                                                │
│  ✓ Correct:   Complex.I (imaginary unit), Complex.abs z (modulus)       │
│  ✗ Wrong:     i, |z|                                                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 13: AVOID ADVANCED INEQUALITY THEOREMS                             │
│  Avoid Hölder's, Jensen's, etc. Prefer basic simplification steps.      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 14: EQUIVALENCES AS IMPLICATIONS                                   │
│  When handling ↔ (iff), produce implications:                           │
│  - Left-to-right: assume LHS, conclude RHS                              │
│  - Right-to-left: assume RHS, conclude LHS                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 15: Real.pi NOTATION                                               │
│  ✓ Correct:   Real.pi                                                   │
│  ✗ Wrong:     π                                                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 16: FINAL CHECK                                                    │
│  Before outputting, review all rules, especially Set/Finset types.      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
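
In practice, the planner call is just a chat request built from prompt.yaml and the theorem statement. The sketch below assumes a system prompt plus a user template with a `{theorem}` placeholder; the actual keys and layout of prompt.yaml may differ.

Python
import yaml

def build_planning_messages(prompt_path: str, theorem_statement: str) -> list[dict]:
    # Assumed prompt.yaml layout: {"system": ..., "initial_planning": "... {theorem} ..."}
    with open(prompt_path) as f:
        prompts = yaml.safe_load(f)
    return [
        {"role": "system", "content": prompts["system"]},
        {"role": "user",
         "content": prompts["initial_planning"].format(theorem=theorem_statement)},
    ]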

Plan Verification

Generated plans must be verified before use:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PLAN VERIFICATION PROCESS                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CLASS: TheoremData                                                      │
│  METHOD: verify_plan(theorem_id, plan) → (success, error_msg)           │
│                                                                          │
│  PROCESS:                                                                │
│  ────────                                                                │
│                                                                          │
│  1. Load theorem from dojo_data (url, commit, file_path)                │
│                                                                          │
│  2. Initialize Dojo with theorem                                        │
│                                                                          │
│  3. For each step in plan:                                              │
│       a. Execute: dojo.run_tac(state, step)                             │
│       b. Execute: dojo.run_tac(state, "sorry")  # Move to next goal     │
│                                                                          │
│       If step returns LeanError:                                        │
│         return (False, error_message)                                   │
│                                                                          │
│       If step returns wrong type:                                       │
│         return (False, "Unexpected response type")                      │
│                                                                          │
│  4. If all steps pass:                                                  │
│       return (True, "")                                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INSIGHT: "sorry" TACTIC                                            │
│  ────────────────────────────                                            │
│                                                                          │
│  After each `have` step, a "sorry" advances past the subgoal proof:    │
│                                                                          │
│  State: ⊢ have h1 : P                                                   │
│          ⊢ P      ← new goal (prove P)                                  │
│          ⊢ Q      ← rest of proof                                       │
│                                                                          │
│  After "sorry":                                                         │
│  State: h1 : P                                                          │
│          ⊢ Q      ← P is now assumed                                    │
│                                                                          │
│  This tests that plan steps are syntactically valid and type-correct   │
│  without requiring actual proofs.                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
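
A minimal sketch of this verification loop, written against the LeanDojo-style `Dojo`/`run_tac` interface the diagram describes (error handling and timeouts simplified):

Python
from lean_dojo import Dojo, LeanError, Theorem  # assumed LeanDojo-style API

def verify_plan(theorem: Theorem, plan: list[str]) -> tuple[bool, str]:
    """Check that every plan step type-checks, without proving anything."""
    with Dojo(theorem) as (dojo, state):
        for step in plan:
            state = dojo.run_tac(state, step)     # introduce the `have` claim
            if isinstance(state, LeanError):
                return False, state.error
            state = dojo.run_tac(state, "sorry")  # skip the subgoal's proof
            if isinstance(state, LeanError):
                return False, state.error
    return True, ""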

Dynamic Replanning

What happens when proof search gets stuck on a subgoal?

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DYNAMIC REPLANNING                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRIGGER: BFS fails to solve a plan step after timeout                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  REPLANNING INPUT:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  • Original theorem statement                                           │
│  • Proven subgoals: [h1, h2, ...] (verified, immutable)                 │
│  • Stuck subgoal: h3 (the step that failed)                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  REPLANNING RULES (additional to initial planning):                      │
│  ──────────────────────────────────────────────────                      │
│                                                                          │
│  RULE 16: INSERT AFTER PROVEN STEPS                                     │
│    New `have` statements go immediately after proven subgoals.          │
│    Proven steps are immutable - no edits, insertions, or extensions.   │
│                                                                          │
│  RULE 17: PROVIDE COMPLETE PLAN                                         │
│    Output the entire updated plan, not just new steps.                  │
│    Preserves proven subgoals exactly as given.                          │
│                                                                          │
│  RULE 18: ENSURE LOGICAL CONTINUITY                                     │
│    Each new step must be logically sound.                               │
│    Don't immediately repeat the stuck subgoal.                          │
│    Insert intermediate steps between proven work and stuck goal.        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE:                                                                │
│  ────────                                                                │
│                                                                          │
│  Original plan: [h1, h2, h3, h4, h5]                                    │
│  Proven: [h1, h2]                                                       │
│  Stuck: h3                                                              │
│                                                                          │
│  Replanned: [h1, h2, h3a, h3b, h3, h4, h5]                              │
│                       ↑     ↑                                            │
│               New intermediate steps                                     │
│                                                                          │
│  The planner decomposed h3 into smaller steps (h3a, h3b) that          │
│  make h3 easier to prove.                                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
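
A small sketch of how the replanning request might be assembled, and how the planner's answer can be checked against the immutability rules (function names are illustrative, not the repository's API):

Python
def build_replanning_context(theorem: str, plan: list[str], progress: int) -> dict:
    """Split a plan into the immutable proven prefix and the stuck step."""
    return {
        "theorem": theorem,
        "proven_steps": plan[:progress],  # verified subgoals: preserved verbatim
        "stuck_step": plan[progress],     # the step BFS failed to close
    }

def merge_replanned(proven: list[str], new_plan: list[str]) -> list[str]:
    """Accept a replanned list only if it keeps the proven prefix intact."""
    if new_plan[:len(proven)] != proven:
        raise ValueError("Replanned output modified proven subgoals")
    return new_plan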

Progress Caching with Ray

The ProgressCache Ray actor persists progress across restarts:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PROGRESS CACHE                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RAY ACTOR: ProgressCache (named "progress_cache")                      │
│  ─────────────────────────────────────────────────                       │
│                                                                          │
│  DATA STRUCTURE:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  registry[theorem.full_name] = {                                        │
│      'progress': int,  # 0 to len(plan)                                 │
│      'proof': [                                                         │
│          [],  # Tactics for subgoal 0                                   │
│          [],  # Tactics for subgoal 1                                   │
│          ...                                                            │
│          [],  # Tactics for final step                                  │
│      ],                                                                  │
│      'proof_stats': [                                                   │
│          [],  # [(time, logprob), ...] for subgoal 0                   │
│          ...                                                            │
│      ],                                                                  │
│      'preference_pairs': [                                              │
│          [],  # [(state, win_tactic, lose_tactic), ...]                │
│          ...                                                            │
│      ]                                                                   │
│  }                                                                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  METHODS:                                                                │
│  ────────                                                                │
│                                                                          │
│  initialize(theorem, num_subgoals, replace=False)                       │
│    Create empty entry for theorem                                       │
│                                                                          │
│  update(theorem, step_index, proof, proof_stats, preference_pairs)      │
│    1. Validate trajectory via validate_proof(focus_mode=True)           │
│    2. Only update if validation succeeds                                │
│    3. Increment progress                                                │
│                                                                          │
│  get_progress(theorem) → int                                            │
│    Return current progress (number of solved subgoals)                  │
│                                                                          │
│  get_proof_data(theorem) → (proof, stats, pairs)                        │
│    Return accumulated data for all solved subgoals                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY THIS MATTERS:                                                       │
│  ─────────────────                                                       │
│  • Multiple prover agents may work on same theorem                      │
│  • Prevents duplicate work across restarts                              │
│  • Enables resume from checkpoints after crashes                        │
│  • Thread-safe via Ray actor model                                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
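
Here is a simplified sketch of such an actor; the real ProgressCache also validates trajectories before updating and tracks proof statistics and preference pairs, which are omitted here.

Python
import ray

@ray.remote
class ProgressCache:
    """Named singleton actor that records per-theorem proof progress."""

    def __init__(self):
        self.registry = {}

    def initialize(self, theorem_name: str, num_subgoals: int, replace: bool = False):
        if replace or theorem_name not in self.registry:
            self.registry[theorem_name] = {
                "progress": 0,
                "proof": [[] for _ in range(num_subgoals)],
            }

    def update(self, theorem_name: str, step_index: int, proof: list):
        entry = self.registry[theorem_name]
        entry["proof"][step_index] = proof   # real code validates the trajectory first
        entry["progress"] = step_index + 1

    def get_progress(self, theorem_name: str) -> int:
        return self.registry[theorem_name]["progress"]

# Created once and looked up by name from any prover agent:
#   cache = ProgressCache.options(name="progress_cache").remote()
#   cache = ray.get_actor("progress_cache")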

Part VI: Scaling Training

The Performance Plateau Problem

Traditional fine-tuning on theorem proving data quickly hits diminishing returns. After a few epochs, the model stops improving—even with more data or compute.

Why plateaus occur:

  1. Distribution shift: The model learns to prove theorems from the training distribution, but struggles with novel problems
  2. Error accumulation: Small errors in early proof steps compound into dead ends
  3. Memorization: The model memorizes specific proofs rather than learning general strategies
  4. Data staleness: Static training data doesn't capture the model's current failure modes

Multi-Stage Expert Iteration

BFS-Prover addresses plateaus with a multi-stage expert iteration framework:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              MULTI-STAGE EXPERT ITERATION                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OUTER LOOP (triggered when performance plateaus):                      │
│  ────────────────────────────────────────────────                        │
│                                                                          │
│       ┌───────────────────────────────────────────────────────┐         │
│       │                                                        │         │
│       │  1. DATA RE-SYNTHESIS                                  │         │
│       │     • Run current model on all theorems               │         │
│       │     • Collect successful proof trajectories           │         │
│       │     • Extract preference pairs from failures          │         │
│       │                                                        │         │
│       │  2. DATA CURATION                                      │         │
│       │     • Filter low-quality proofs                       │         │
│       │     • Remove duplicates and near-duplicates           │         │
│       │     • Balance difficulty distribution                 │         │
│       │                                                        │         │
│       │  3. FULL RETRAINING                                    │         │
│       │     • Restart from BASE checkpoint (not current!)     │         │
│       │     • Train on curated data                           │         │
│       │     • Prevents overfitting to stale distributions     │         │
│       │                                                        │         │
│       └───────────────────────────────────────────────────────┘         │
│                              │                                           │
│                              ▼                                           │
│                                                                          │
│  INNER LOOP (while performance improves):                               │
│  ─────────────────────────────────────────                               │
│                                                                          │
│       ┌───────────────────────────────────────────────────────┐         │
│       │                                                        │         │
│       │  1. ROLLOUT                                            │         │
│       │     • Run proof search on theorem set                 │         │
│       │     • Collect new proofs and failures                 │         │
│       │                                                        │         │
│       │  2. TACTIC FILTERING                                   │         │
│       │     • Remove low-confidence tactics                   │         │
│       │     • Filter tactics that frequently fail             │         │
│       │     • Keep tactics that lead to proofs                │         │
│       │                                                        │         │
│       │  3. REFINEMENT                                         │         │
│       │     • Fine-tune on filtered data                      │         │
│       │     • Update model for next iteration                 │         │
│       │                                                        │         │
│       └───────────────────────────────────────────────────────┘         │
│                              │                                           │
│                              ▼                                           │
│                     Evaluate performance                                 │
│                              │                                           │
│              ┌───────────────┴───────────────┐                          │
│              │                               │                           │
│         Still improving?              Plateaued?                        │
│              │                               │                           │
│              ▼                               ▼                           │
│      Continue inner loop           Return to outer loop                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INSIGHT: RESTART FROM BASE                                          │
│  ──────────────────────────────                                          │
│  When plateaus occur, DON'T continue from current checkpoint.           │
│  Instead, regenerate all training data, then restart from the           │
│  original base model. This prevents accumulation of training            │
│  artifacts and distribution drift.                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
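
The control flow boils down to a two-level loop. The sketch below uses placeholder callables (rollout, curate, train, evaluate) rather than the actual training code, but it captures the key decision: on a plateau, regenerate the data and retrain from the base checkpoint.

Python
def expert_iteration(base_ckpt, theorems, rollout, curate, train, evaluate,
                     outer_rounds: int = 3, patience: int = 2):
    """Sketch: inner refinement loop inside an outer re-synthesis loop."""
    model = base_ckpt
    for _ in range(outer_rounds):
        # Outer step: re-synthesize data with the current model, then retrain
        # FROM BASE to avoid accumulating distribution drift.
        model = train(base_ckpt, curate(rollout(model, theorems)))
        best, stale = evaluate(model), 0
        while stale < patience:
            # Inner step: rollout -> tactic filtering -> fine-tune.
            model = train(model, curate(rollout(model, theorems)))
            score = evaluate(model)
            if score > best:
                best, stale = score, 0
            else:
                stale += 1   # plateau counter; back to the outer loop when exhausted
    return model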

Preference Pair Extraction

From the search tree of each successful proof, BFS-Prover pairs the winning tactic at every step with sibling tactics that failed, yielding preference pairs for training:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PREFERENCE PAIR EXTRACTION                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FUNCTION: extract_proof_data(root_node)                                │
│  ────────────────────────────────────────                                │
│                                                                          │
│  For each edge in the successful proof trajectory:                      │
│                                                                          │
│    source_node = edge.src                                               │
│    winning_tactic = edge.tactic                                         │
│                                                                          │
│    For each sibling_edge in source_node.out_edges:                      │
│      If sibling_edge.dst is ErrorNode:                                  │
│        losing_tactic = sibling_edge.tactic                              │
│                                                                          │
│        # Create preference pair                                         │
│        pair = (                                                         │
│            source_node.state.pp1,  # or .pp depending on mode           │
│            winning_tactic,                                              │
│            losing_tactic                                                │
│        )                                                                 │
│        preference_pairs.append(pair)                                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE:                                                                │
│  ────────                                                                │
│                                                                          │
│  State: n : ℕ ⊢ n + 0 = n                                               │
│                                                                          │
│  Tactics tried:                                                         │
│    • "rfl"        → ProofFinished (success!)                            │
│    • "simp"       → ErrorNode (failed)                                  │
│    • "ring"       → ErrorNode (failed)                                  │
│                                                                          │
│  Preference pairs generated:                                            │
│    ("n : ℕ ⊢ n + 0 = n", "rfl", "simp")                                │
│    ("n : ℕ ⊢ n + 0 = n", "rfl", "ring")                                │
│                                                                          │
│  USE IN TRAINING:                                                        │
│  ────────────────                                                        │
│  These pairs enable DPO (Direct Preference Optimization) or similar    │
│  ranking objectives that teach the model to prefer successful tactics. │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
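
The extraction itself is a short walk over the proof tree. The sketch below uses simplified stand-ins for the node and edge types described earlier in the post:

Python
from dataclasses import dataclass, field

@dataclass
class Node:
    state_pp: str                 # pretty-printed proof state
    out_edges: list = field(default_factory=list)
    is_error: bool = False

@dataclass
class Edge:
    tactic: str
    src: Node
    dst: Node

def extract_preference_pairs(proof_edges: list[Edge]) -> list[tuple[str, str, str]]:
    """Pair each winning tactic with sibling tactics that hit an error node."""
    pairs = []
    for edge in proof_edges:          # edges along the successful trajectory
        for sibling in edge.src.out_edges:
            if sibling is not edge and sibling.dst.is_error:
                pairs.append((edge.src.state_pp, edge.tactic, sibling.tactic))
    return pairs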

Part VII: Scaling Inference

Multi-Agent Ray Architecture

BFS-Prover uses Ray for distributed computation:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              MULTI-AGENT RAY ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                     RAY CLUSTER                                  │    │
│  │                                                                  │    │
│  │  ┌────────────────────────────────────────────────────────┐     │    │
│  │  │           TACTIC GENERATORS (GPU Actors)                │     │    │
│  │  │                                                         │     │    │
│  │  │   ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐     │     │    │
│  │  │   │TacGen #1│ │TacGen #2│ │TacGen #3│ │TacGen #4│     │     │    │
│  │  │   │ vLLM    │ │ vLLM    │ │ vLLM    │ │ vLLM    │     │     │    │
│  │  │   │ (GPU)   │ │ (GPU)   │ │ (GPU)   │ │ (GPU)   │     │     │    │
│  │  │   └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘     │     │    │
│  │  │        └───────────┴─────┬─────┴───────────┘           │     │    │
│  │  │                          │                              │     │    │
│  │  │   Default: num_tac_gen = 8                             │     │    │
│  │  │   GPUs per generator: num_gpus_per_tac_gen = 1         │     │    │
│  │  │                                                         │     │    │
│  │  └────────────────────────────────────────────────────────┘     │    │
│  │                             │                                    │    │
│  │                             │ Tactic Requests (async)           │    │
│  │                             ▼                                    │    │
│  │  ┌────────────────────────────────────────────────────────┐     │    │
│  │  │              PROVER AGENTS (CPU Actors)                 │     │    │
│  │  │                                                         │     │    │
│  │  │  ┌───────┐ ┌───────┐ ┌───────┐       ┌───────┐        │     │    │
│  │  │  │Prover │ │Prover │ │Prover │  ...  │Prover │        │     │    │
│  │  │  │  #1   │ │  #2   │ │  #3   │       │ #96   │        │     │    │
│  │  │  │       │ │       │ │       │       │       │        │     │    │
│  │  │  │ Dojo  │ │ Dojo  │ │ Dojo  │       │ Dojo  │        │     │    │
│  │  │  │ BFS   │ │ BFS   │ │ BFS   │       │ BFS   │        │     │    │
│  │  │  └───────┘ └───────┘ └───────┘       └───────┘        │     │    │
│  │  │                                                         │     │    │
│  │  │  Default: num_provers = 96                             │     │    │
│  │  │  CPUs per prover: num_cpus_per_prover = 1              │     │    │
│  │  │  MAX_DOJO_RESTARTS = 3 (retries on hard timeout)       │     │    │
│  │  │                                                         │     │    │
│  │  └────────────────────────────────────────────────────────┘     │    │
│  │                             │                                    │    │
│  │                             │ Results                           │    │
│  │                             ▼                                    │    │
│  │  ┌────────────────────────────────────────────────────────┐     │    │
│  │  │              PROGRESS CACHE (Singleton Actor)           │     │    │
│  │  │                                                         │     │    │
│  │  │  • Named "progress_cache"                               │     │    │
│  │  │  • Stores solved subgoals                              │     │    │
│  │  │  • Enables resume from checkpoint                       │     │    │
│  │  │  • Thread-safe via Ray                                  │     │    │
│  │  │                                                         │     │    │
│  │  └────────────────────────────────────────────────────────┘     │    │
│  │                                                                  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  PARALLEL PROVE STRATEGY:                                                │
│  ────────────────────────                                                │
│  • Interleave task submissions across theorems for cache locality       │
│  • Early stopping: cancel remaining futures when proof validated        │
│  • Best result selection: prefer validated > proved > other             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
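
In Ray terms, the two tiers are just actor classes with different resource requirements. A stripped-down sketch follows; the actor bodies are stand-ins, and only the resource plumbing mirrors the defaults above.

Python
import ray

ray.init()

@ray.remote(num_gpus=1)            # one GPU per tactic generator
class TacticGenerator:
    def is_ready(self) -> bool:
        return True

@ray.remote(num_cpus=1)            # one CPU per prover agent
class ProverAgent:
    def __init__(self, generators):
        self.generators = generators   # provers share the generator pool

# Default allocation: 8 GPU generators, 96 CPU provers (requires a cluster
# with the corresponding resources).
generators = [TacticGenerator.remote() for _ in range(8)]
provers = [ProverAgent.remote(generators) for _ in range(96)]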

Configuration Parameters

Complete configuration for proof search:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              COMPLETE CONFIGURATION PARAMETERS                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  MODEL SETTINGS:                                                         │
│  ───────────────                                                         │
│  max_num_batched_tokens    8192     Max tokens per batch                │
│  max_num_seqs              256      Max concurrent sequences            │
│  max_model_len             16384    Context window size                 │
│                                                                          │
│  SAMPLING SETTINGS:                                                      │
│  ─────────────────                                                       │
│  n_sampling_search         16       Tactics per state                   │
│  sampling_temperature      0.7      Generation randomness               │
│  sampling_top_p            1.0      Nucleus sampling threshold          │
│  max_tokens                2048     Max tactic length                   │
│                                                                          │
│  TIMEOUT SETTINGS:                                                       │
│  ────────────────                                                        │
│  timeout_per_theorem       600      Seconds per theorem                 │
│  vllm_timeout              60       Seconds for model generation        │
│  tactic_timeout            10       CPU seconds per tactic              │
│  wall_timeout              15       Wall seconds per tactic (1.5x CPU)  │
│  MODEL_READINESS_TIMEOUT   600      Seconds to wait for GPU warmup      │
│                                                                          │
│  RESOURCE ALLOCATION:                                                    │
│  ────────────────────                                                    │
│  num_tac_gen               8        Tactic generator actors             │
│  num_provers               96       Parallel prover actors              │
│  num_gpus_per_tac_gen      1        GPUs per generator                  │
│  num_cpus_per_prover       1        CPUs per prover                     │
│                                                                          │
│  SEARCH TUNING:                                                          │
│  ─────────────                                                           │
│  depth_reward              0.0      Priority normalization exponent     │
│  pass_k                    1        Attempts per theorem                │
│  MAX_DOJO_RESTARTS         3        Retries on hard timeout             │
│                                                                          │
│  LOGGING:                                                                │
│  ────────                                                                │
│  LOG_RATE                  0.01     Fraction of events to log (1%)      │
│  LOG_RATE_FREQUENT_EVENT   0.001    For very frequent events (0.1%)     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
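
For reference, these defaults map naturally onto a plain config object. The class below is a convenience sketch mirroring the table, not the project's actual configuration schema.

Python
from dataclasses import dataclass

@dataclass
class SearchConfig:
    # Sampling
    n_sampling_search: int = 16        # tactics generated per state
    sampling_temperature: float = 0.7
    sampling_top_p: float = 1.0
    max_tokens: int = 2048
    # Timeouts (seconds)
    timeout_per_theorem: int = 600
    vllm_timeout: int = 60
    tactic_timeout: int = 10           # CPU seconds per tactic
    wall_timeout: int = 15             # wall-clock seconds per tactic
    # Resources
    num_tac_gen: int = 8
    num_provers: int = 96
    # Search tuning
    depth_reward: float = 0.0
    pass_k: int = 1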

Proof Validation

Even after a "successful" proof, BFS-Prover validates the result:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PROOF VALIDATION                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FUNCTION: validate_proof(theorem, trajectory, cls, focus_mode)         │
│  ──────────────────────────────────────────────────────────────          │
│                                                                          │
│  Parameters:                                                             │
│  • theorem: The theorem to prove                                        │
│  • trajectory: List[str] of tactics                                     │
│  • cls: Expected final type (ProofFinished or tuple of types)           │
│  • focus_mode: If True, wrap each tactic in "focus"                     │
│                                                                          │
│  PROCESS:                                                                │
│  ────────                                                                │
│                                                                          │
│  1. Create fresh Dojo instance (timeout=600s)                           │
│                                                                          │
│  2. Get initial state from theorem                                      │
│                                                                          │
│  3. For each tactic in trajectory:                                      │
│       If focus_mode:                                                    │
│         tactic = "focus " + tactic                                      │
│       state = dojo.run_tac(state, tactic)                               │
│                                                                          │
│       If error: return False                                            │
│                                                                          │
│  4. Check: isinstance(final_state, cls)                                 │
│                                                                          │
│  5. Return True only if type matches exactly                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  VALIDATION CONTEXTS:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  • After successful search: Validate full trajectory                    │
│  • Plan progress update: Validate partial trajectory                    │
│  • SearchResult output: Records proof_validation_passed field           │
│                                                                          │
│  WHY VALIDATE:                                                           │
│  ─────────────                                                           │
│  • Catches race conditions in tree status updates                       │
│  • Verifies proof works in clean environment                            │
│  • Guards against Dojo state corruption                                 │
│  • Provides ground truth for metrics                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
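
A condensed sketch of the validation replay, again against an assumed LeanDojo-style interface; the fresh Dojo instance and the final isinstance check are the essential parts.

Python
from lean_dojo import Dojo, LeanError, ProofFinished, Theorem  # assumed API

def validate_proof(theorem: Theorem, trajectory: list[str],
                   expected=ProofFinished, focus_mode: bool = False) -> bool:
    """Replay a tactic trajectory in a fresh Dojo and check the final state type."""
    with Dojo(theorem, hard_timeout=600) as (dojo, state):
        for tactic in trajectory:
            if focus_mode:
                tactic = "focus " + tactic
            state = dojo.run_tac(state, tactic)
            if isinstance(state, LeanError):
                return False
    return isinstance(state, expected)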

Part VIII: Advanced Implementation Details

Pass-k Strategy with Interleaved Submission

BFS-Prover supports multiple independent proof attempts per theorem (`pass_k`). The implementation interleaves attempts across theorems to keep every prover busy and maximize throughput:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              PASS-K INTERLEAVED SUBMISSION STRATEGY                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PROBLEM: Naive sequential attempts waste time                          │
│  ─────────────────────────────────────────────                           │
│  Theorem 1: attempt 1, attempt 2, ..., attempt 512                      │
│  Theorem 2: attempt 1, attempt 2, ..., attempt 512                      │
│  → If theorem 1 solves on attempt 3, we waste 509 attempts              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION: Interleaved round-robin submission                           │
│  ────────────────────────────────────────────────                        │
│                                                                          │
│  Round 1: T1-a1, T2-a1, T3-a1, ..., Tn-a1                              │
│  Round 2: T1-a2, T2-a2, T3-a2, ..., Tn-a2                              │
│  ...                                                                     │
│  Round k: T1-ak, T2-ak, T3-ak, ..., Tn-ak                              │
│                                                                          │
│  EARLY STOPPING:                                                         │
│  ───────────────                                                         │
│  When theorem T1 proves AND validates:                                  │
│    1. Cancel all pending futures for T1                                 │
│    2. Continue other theorems                                           │
│    3. Free up prover slots for remaining work                           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BEST RESULT SELECTION:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  When multiple attempts complete, select best by priority:              │
│                                                                          │
│  Priority 1: proof_validation_passed == True                            │
│              (Trajectory replayed successfully in fresh Dojo)           │
│                                                                          │
│  Priority 2: root.status == PROVED                                      │
│              (Search found proof, not yet validated)                    │
│                                                                          │
│  Priority 3: Any other result                                           │
│              (Failed attempts, for logging)                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  submission_queue = []                                                  │
│  for attempt in range(pass_k):                                          │
│      for theorem in theorems:                                           │
│          submission_queue.append((theorem, attempt))                    │
│                                                                          │
│  # Submit in interleaved order                                          │
│  futures = {}                                                           │
│  for (theorem, attempt) in submission_queue:                            │
│      if theorem not in solved:                                          │
│          future = prover.search_attempt.remote(theorem, attempt)        │
│          futures[future] = (theorem, attempt)                           │
│                                                                          │
│  # Process completions with early stopping                              │
│  while futures:                                                          │
│      done, _ = ray.wait(list(futures.keys()), num_returns=1)           │
│      result = ray.get(done[0])                                          │
│      theorem = futures.pop(done[0])                                     │
│      if result.proof_validation_passed:                                 │
│          solved.add(theorem)                                            │
│          cancel_pending_futures(theorem)                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
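
A condensed, Ray-flavored version of that loop is shown below. The prover's `search_attempt` method and the `proof_validation_passed` field follow the description above; the best-result fallback for unvalidated proofs is omitted for brevity.

Python
import ray

def interleaved_pass_k(provers, theorems, pass_k: int) -> dict:
    """Round-robin pass-k submission with per-theorem early stopping."""
    futures, solved, results, idx = {}, set(), {}, 0
    for attempt in range(pass_k):              # round 1 for all theorems, then round 2, ...
        for theorem in theorems:
            prover = provers[idx % len(provers)]
            idx += 1
            futures[prover.search_attempt.remote(theorem, attempt)] = theorem
    while futures:
        done, _ = ray.wait(list(futures.keys()), num_returns=1)
        theorem = futures.pop(done[0])
        result = ray.get(done[0])
        if theorem in solved:
            continue
        if result.proof_validation_passed:     # validated proof: stop working on it
            solved.add(theorem)
            results[theorem] = result
            for fut, thm in list(futures.items()):
                if thm == theorem:
                    ray.cancel(fut)            # early stopping for this theorem
                    futures.pop(fut)
    return results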

vLLM Engine Configuration

The `TacticGenerator` actor wraps vLLM's asynchronous engine with throughput-oriented settings:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              VLLM ENGINE CONFIGURATION                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ENGINE INITIALIZATION:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  AsyncLLMEngine.from_engine_args(EngineArgs(                            │
│      model=model_path,                                                  │
│      max_num_batched_tokens=8192,      # Tokens per forward pass        │
│      max_num_seqs=256,                 # Concurrent sequences           │
│      max_model_len=16384,              # Context window limit           │
│      enable_chunked_prefill=True,      # Chunk large prompts            │
│      tensor_parallel_size=num_gpus,    # Split across GPUs              │
│      disable_custom_all_reduce=True,   # Standard NCCL communication    │
│      trust_remote_code=True,           # Allow custom model code        │
│  ))                                                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY PARAMETERS EXPLAINED:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  enable_chunked_prefill: True                                           │
│    • Splits long prompts into chunks for KV-cache computation           │
│    • Prevents OOM on very long tactic states                            │
│    • Allows interleaving prefill with decode                            │
│                                                                          │
│  tensor_parallel_size: num_gpus                                         │
│    • Automatically calculated from available GPUs                       │
│    • Splits model layers across GPUs                                    │
│    • Enables 32B model on 4x A100s                                      │
│                                                                          │
│  disable_custom_all_reduce: True                                        │
│    • Uses standard NCCL instead of custom kernels                       │
│    • More compatible across GPU types                                   │
│    • Slightly slower but more reliable                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  HEALTH CHECK MECHANISM:                                                 │
│  ───────────────────────                                                 │
│                                                                          │
│  Before search begins, all generators must pass health check:           │
│                                                                          │
│  ready_count = 0                                                        │
│  while ready_count < num_tac_gen:                                       │
│      for gen in generators:                                             │
│          if gen.is_ready.remote():  # Calls engine.check_health()      │
│              ready_count += 1                                           │
│      if elapsed > MODEL_READINESS_TIMEOUT:                              │
│          raise TimeoutError("Generators not ready")                     │
│      sleep(10)                                                          │
│                                                                          │
│  This ensures vLLM has loaded weights and compiled kernels              │
│  before any proof search begins.                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
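
One subtlety in the readiness loop: in Ray, `gen.is_ready.remote()` returns an ObjectRef rather than a boolean, so the result has to be fetched with `ray.get` before it can be checked. A sketch of the polling loop with that made explicit:

Python
import time
import ray

MODEL_READINESS_TIMEOUT = 600  # seconds, as configured above

def wait_for_generators(generators, poll_interval: float = 10.0) -> None:
    """Block until every tactic generator reports healthy, or raise on timeout."""
    start = time.time()
    while True:
        # is_ready.remote() returns ObjectRefs; ray.get resolves them to booleans.
        ready = ray.get([gen.is_ready.remote() for gen in generators])
        if all(ready):
            return
        if time.time() - start > MODEL_READINESS_TIMEOUT:
            raise TimeoutError("Tactic generators not ready")
        time.sleep(poll_interval)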

The extend_tree() Recovery Method

When resuming from cached progress, BFS-Prover must replay proven tactics:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              EXTEND_TREE() FOR TRAJECTORY REPLAY                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PURPOSE: Replay cached proof trajectories to restore search state      │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SCENARIO:                                                               │
│  ─────────                                                               │
│                                                                          │
│  Prover crashed after solving subgoals 1-3 of a 5-step plan.           │
│  ProgressCache has the proof trajectory for steps 1-3.                  │
│  On restart, we need to rebuild the tree to that state.                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  METHOD: extend_tree(src_node, tactics)                                 │
│  ──────────────────────────────────────                                  │
│                                                                          │
│  Input:                                                                  │
│    src_node: Starting node (usually root)                               │
│    tactics: List[str] of tactics to replay                              │
│                                                                          │
│  Process:                                                                │
│    current = src_node                                                   │
│    for tactic in tactics:                                               │
│      response, time = run_tactic(current, tactic)                       │
│                                                                          │
│      if response is Error:                                              │
│        return None  # Replay failed, cached trajectory invalid          │
│                                                                          │
│      # Create edge with placeholder logprob                             │
│      edge = Edge(                                                       │
│          tactic=tactic,                                                 │
│          src=current,                                                   │
│          dst=new_node_from_response,                                    │
│          time=time,                                                     │
│          logprob=1.0  # Placeholder for cached tactics                  │
│      )                                                                   │
│                                                                          │
│      current.out_edges = [edge]                                         │
│      current = edge.dst                                                 │
│                                                                          │
│    return current  # Final node after replay                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY BEHAVIORS:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  1. logprob = 1.0 for all cached tactics                                │
│     • Cached tactics aren't from model, so no real logprob              │
│     • Placeholder value doesn't affect search (already proven path)     │
│                                                                          │
│  2. Atomic failure                                                       │
│     • If ANY tactic fails, entire extend_tree() returns None           │
│     • No partial tree extension                                         │
│     • Caller must handle failure (usually re-prove from scratch)        │
│                                                                          │
│  3. No queue insertion                                                   │
│     • Replayed nodes not added to priority queue                        │
│     • We're rebuilding known-good path, not exploring                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
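
To make the replay logic concrete, here is a minimal Python sketch of extend_tree over simplified stand-in types. The Node, Edge, and run_tactic names mirror the pseudocode above, but the classes and the error convention (None for a failed tactic) are illustrative assumptions, not the actual BFS-Prover types.

Code
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    state: str
    out_edges: List["Edge"] = field(default_factory=list)

@dataclass
class Edge:
    tactic: str
    src: Node
    dst: Node
    time: float
    logprob: float

def extend_tree(src_node: Node, tactics: List[str], run_tactic) -> Optional[Node]:
    """Replay cached tactics from src_node; return the final node, or None on failure."""
    current = src_node
    for tactic in tactics:
        response, elapsed = run_tactic(current, tactic)
        if response is None:              # stand-in for a Lean error response
            return None                   # atomic failure: the cached trajectory is invalid
        new_node = Node(state=response)
        edge = Edge(tactic=tactic, src=current, dst=new_node,
                    time=elapsed, logprob=1.0)   # placeholder logprob for cached tactics
        current.out_edges = [edge]        # known-good path only; nothing enters the queue
        current = new_node
    return current

# Example with a fake executor that always succeeds:
root = Node(state="⊢ goal")
final = extend_tree(root, ["intro h", "simp"],
                    lambda n, t: (f"{n.state} after {t}", 0.01))
assert final is not None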

Complete Planner Rules (Including Rule 19)

The planner prompt includes additional rules beyond the 16 mentioned earlier:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              COMPLETE PLANNER RULES (INITIAL + REPLANNING)               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INITIAL PLANNING RULES (Rules 1-16):                                   │
│  ────────────────────────────────────                                    │
│  [Previously covered: explicit types, no chained inequalities, etc.]    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DYNAMIC REPLANNING RULES (Rules 16-19):                                │
│  ───────────────────────────────────────                                 │
│                                                                          │
│  RULE 16: INSERT AFTER PROVEN STEPS                                     │
│  ──────────────────────────────────────                                  │
│  New `have` statements go immediately after proven subgoals.            │
│  Proven steps are IMMUTABLE:                                            │
│    • No editing proven step text                                        │
│    • No inserting steps BETWEEN proven steps                            │
│    • No extending proven step with additional conditions                │
│                                                                          │
│  ✓ Correct: [proven1, proven2, NEW_STEP, stuck_step, ...]              │
│  ✗ Wrong:   [proven1, NEW_STEP, proven2, stuck_step, ...]              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 17: PROVIDE COMPLETE PLAN                                         │
│  ──────────────────────────────────                                      │
│  Output the ENTIRE updated plan, not just new steps.                    │
│  Preserves proven subgoals exactly as given.                            │
│                                                                          │
│  Input:                                                                  │
│    Proven: [h1: A, h2: B]                                               │
│    Stuck: h3: C                                                         │
│                                                                          │
│  Output (complete):                                                      │
│    have h1 : A        ← copied exactly                                  │
│    have h2 : B        ← copied exactly                                  │
│    have h2a : B'      ← new intermediate                                │
│    have h3 : C        ← original stuck step                             │
│    have h4 : D        ← remaining steps                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 18: ENSURE LOGICAL CONTINUITY                                     │
│  ─────────────────────────────────────                                   │
│  Each new step must be logically sound.                                 │
│                                                                          │
│  Specific requirements:                                                  │
│    • Don't immediately repeat the stuck subgoal after proven steps     │
│    • Insert meaningful intermediate steps                               │
│    • New steps must logically lead to stuck step                        │
│    • Consider what made the stuck step hard                             │
│                                                                          │
│  ✗ Wrong: [proven1, proven2, stuck_step]  ← no new help                │
│  ✓ Right: [proven1, proven2, helper1, helper2, stuck_step]             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ★★★ RULE 19: FINAL CHECK (CRITICAL) ★★★                                │
│  ─────────────────────────────────────────                               │
│  Before outputting, perform SCRUPULOUS REVIEW:                          │
│                                                                          │
│  Checklist:                                                              │
│    □ All Set/Finset literals have explicit type annotations            │
│    □ No chained inequalities (a ≤ x ≤ b)                               │
│    □ Numeric operations have type annotations                           │
│    □ Using Real.logb (not Real.log) for base logarithms                │
│    □ Using Real.pi (not π)                                             │
│    □ All variables are declared in theorem statement                   │
│    □ Lemma names match Mathlib exactly                                  │
│    □ No implicit multiplication (ax vs a * x)                          │
│    □ Proven steps copied EXACTLY (no modifications)                    │
│                                                                          │
│  This final review catches ~15% of errors missed in initial writing.   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
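
A few Rule 19 checklist items are mechanical enough to check programmatically before a plan is even sent to Lean. The sketch below is a hypothetical lint pass over a drafted list of have statements; the regex patterns and heuristics are illustrative assumptions, not part of the actual planner prompt.

Code
import re
from typing import List

def lint_plan(have_statements: List[str]) -> List[str]:
    """Flag a few mechanically checkable Rule 19 issues in a drafted plan."""
    issues = []
    for i, stmt in enumerate(have_statements):
        # Chained inequality like "a ≤ x ≤ b" (should be split into two steps)
        if re.search(r"[≤<]\s*\S+\s*[≤<]", stmt):
            issues.append(f"step {i}: possible chained inequality")
        # π symbol instead of Real.pi
        if "π" in stmt:
            issues.append(f"step {i}: use Real.pi instead of π")
        # Real.log where a base-b logarithm may be intended
        if "Real.log " in stmt and "Real.logb" not in stmt:
            issues.append(f"step {i}: check whether Real.logb is intended")
    return issues

# Example:
plan = ["have h1 : (2 : ℝ) ≤ x ≤ 5", "have h2 : Real.cos π = -1"]
print(lint_plan(plan))
# ['step 0: possible chained inequality', 'step 1: use Real.pi instead of π']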

Complete Buggy Tactic Filtering

The tactic generator filters specific patterns known to cause issues:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              COMPLETE BUGGY TACTIC FILTERING                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ALWAYS FILTERED (unconditional):                                        │
│  ─────────────────────────────────                                       │
│                                                                          │
│  "sorry"         Proof placeholder, not a real tactic                   │
│  "admit"         Same as sorry, proof placeholder                       │
│  "native_decide" Can cause infinite loops or segfaults                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CONDITIONALLY FILTERED (pattern-based):                                 │
│  ────────────────────────────────────────                                │
│                                                                          │
│  PATTERN 1: Incomplete case splits                                      │
│  ─────────────────────────────────                                       │
│  Triggers: tactic contains ("rcases" OR "cases'") AND "?_"              │
│                                                                          │
│  Why filtered:                                                           │
│    "rcases h with ⟨?_, ?_⟩" indicates model didn't know                │
│    what to name the cases. The ?_ placeholders will fail.              │
│                                                                          │
│  Examples:                                                               │
│    ✗ "rcases h with ⟨?_, h2⟩"                                          │
│    ✗ "cases' h with ?_ h2"                                             │
│    ✓ "rcases h with ⟨h1, h2⟩"      ← concrete names OK                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PATTERN 2: Malformed simpa                                              │
│  ──────────────────────────────                                          │
│  Triggers: tactic contains "simpa" AND any of:                          │
│    • " _"  (space underscore)                                           │
│    • "_ "  (underscore space)                                           │
│    • "_,"  (underscore comma)                                           │
│    • ",_"  (comma underscore)                                           │
│                                                                          │
│  Why filtered:                                                           │
│    These patterns indicate malformed simpa arguments.                   │
│    The underscore placeholders suggest incomplete generation.           │
│                                                                          │
│  Examples:                                                               │
│    ✗ "simpa using _, h"                                                 │
│    ✗ "simpa [h1, _]"                                                    │
│    ✓ "simpa [h1, h2] using h3"     ← concrete args OK                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  def filter_tactic(tactic: str) -> bool:                                │
│      # Always reject                                                    │
│      if any(x in tactic for x in ["sorry", "admit", "native_decide"]): │
│          return False                                                   │
│                                                                          │
│      # Conditional: incomplete case splits                              │
│      if ("rcases" in tactic or "cases'" in tactic) and "?_" in tactic: │
│          return False                                                   │
│                                                                          │
│      # Conditional: malformed simpa                                     │
│      if "simpa" in tactic:                                              │
│          if any(p in tactic for p in [" _", "_ ", "_,", ",_"]):        │
│              return False                                               │
│                                                                          │
│      return True                                                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FILTERING STATISTICS:                                                   │
│  ─────────────────────                                                   │
│  In typical runs:                                                        │
│    • sorry/admit: ~2% of generated tactics                              │
│    • native_decide: ~0.1%                                               │
│    • rcases/?_: ~0.5%                                                   │
│    • simpa/underscore: ~0.3%                                            │
│    • Total filtered: ~3% of generations                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
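
As a usage illustration, the same filter can be applied to a whole batch of sampled tactics before any of them reach Lean. This is a simplified, runnable restatement of the logic in the box, not the exact production code.

Code
def filter_tactic(tactic: str) -> bool:
    """Return True if the tactic should be kept (same rules as above)."""
    if any(x in tactic for x in ["sorry", "admit", "native_decide"]):
        return False
    if ("rcases" in tactic or "cases'" in tactic) and "?_" in tactic:
        return False
    if "simpa" in tactic and any(p in tactic for p in [" _", "_ ", "_,", ",_"]):
        return False
    return True

# Example: keep only well-formed candidates from a sampled batch
sampled = ["simp [h1]", "sorry", "rcases h with ⟨?_, h2⟩", "simpa [h1, h2] using h3"]
kept = [t for t in sampled if filter_tactic(t)]
print(kept)  # ['simp [h1]', 'simpa [h1, h2] using h3']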

Environment Variables

BFS-Prover behavior can be tuned via environment variables:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              ENVIRONMENT VARIABLES                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LOGGING CONTROL:                                                        │
│  ────────────────                                                        │
│                                                                          │
│  LOG_RATE                                                                │
│    Default: 0.01 (1%)                                                   │
│    Purpose: Fraction of standard events to log                          │
│    Usage: export LOG_RATE=0.1  # Log 10% of events                      │
│    Events: Tactic executions, node expansions, state transitions        │
│                                                                          │
│  LOG_RATE_FREQUENT_EVENT                                                 │
│    Default: 0.001 (0.1%)                                                │
│    Purpose: Fraction of very frequent events to log                     │
│    Usage: export LOG_RATE_FREQUENT_EVENT=0.01                           │
│    Events: Individual tactic generations, queue operations              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TIMEOUT CONTROL:                                                        │
│  ────────────────                                                        │
│                                                                          │
│  MODEL_READINESS_TIMEOUT                                                 │
│    Default: 600 seconds                                                 │
│    Purpose: Max wait for vLLM engine initialization                     │
│    Usage: export MODEL_READINESS_TIMEOUT=1200  # 20 minutes            │
│    When to increase: Large models, slow GPU initialization              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RAY CONFIGURATION (standard Ray env vars):                              │
│  ──────────────────────────────────────────                              │
│                                                                          │
│  RAY_ADDRESS                                                             │
│    Purpose: Connect to existing Ray cluster                             │
│    Usage: export RAY_ADDRESS="ray://head-node:10001"                    │
│                                                                          │
│  CUDA_VISIBLE_DEVICES                                                    │
│    Purpose: Restrict visible GPUs                                       │
│    Usage: export CUDA_VISIBLE_DEVICES="0,1,2,3"                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DEBUGGING:                                                              │
│  ──────────                                                              │
│                                                                          │
│  To enable verbose logging during debugging:                            │
│    export LOG_RATE=1.0                                                  │
│    export LOG_RATE_FREQUENT_EVENT=0.1                                   │
│                                                                          │
│  WARNING: Full logging generates ~100MB+ logs per theorem               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
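
Reading these knobs amounts to standard os.environ lookups with the defaults listed above. The helper below is an illustrative sketch, not the project's actual configuration module.

Code
import os

def env_float(name: str, default: float) -> float:
    """Read a float-valued tuning knob from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return float(raw) if raw is not None else default

LOG_RATE = env_float("LOG_RATE", 0.01)                                  # 1% of standard events
LOG_RATE_FREQUENT_EVENT = env_float("LOG_RATE_FREQUENT_EVENT", 0.001)   # 0.1% of frequent events
MODEL_READINESS_TIMEOUT = env_float("MODEL_READINESS_TIMEOUT", 600.0)   # seconds

if __name__ == "__main__":
    print(LOG_RATE, LOG_RATE_FREQUENT_EVENT, MODEL_READINESS_TIMEOUT)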

PlanIOActor and Planner Retry Logic

The planning pipeline uses specialized Ray actors for coordination:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              PLANIOACTOR: CENTRALIZED FILE I/O                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PROBLEM: Concurrent plan writes cause corruption                       │
│  ─────────────────────────────────────────────────                       │
│                                                                          │
│  Multiple PlanGenerator actors work in parallel.                        │
│  If they all write to plan.json simultaneously:                         │
│    • File corruption                                                    │
│    • Lost updates                                                       │
│    • Partial writes                                                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION: Single PlanIOActor serializes all I/O                        │
│  ────────────────────────────────────────────────                        │
│                                                                          │
│  RAY ACTOR: PlanIOActor                                                 │
│  ──────────────────────                                                  │
│                                                                          │
│  Singleton actor that handles ALL file operations:                      │
│                                                                          │
│  class PlanIOActor:                                                     │
│      def __init__(self, plan_file_path):                                │
│          self._plan_file = plan_file_path                               │
│          self._plans = None  # Lazy loaded                              │
│                                                                          │
│      @property                                                          │
│      def plans(self):                                                   │
│          if self._plans is None:                                        │
│              self._plans = load_json(self._plan_file)                   │
│          return self._plans                                             │
│                                                                          │
│      def get_plan(self, theorem_id):                                    │
│          return self.plans.get(theorem_id)                              │
│                                                                          │
│      def save_plan(self, theorem_id, plan):                             │
│          self.plans[theorem_id] = plan                                  │
│          write_json(self._plan_file, self.plans)  # Atomic write        │
│                                                                          │
│      def has_plan(self, theorem_id):                                    │
│          return theorem_id in self.plans                                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BENEFITS:                                                               │
│  ─────────                                                               │
│  • No concurrent write conflicts                                        │
│  • In-memory caching (fast reads after first load)                     │
│  • Single source of truth                                               │
│  • Atomic updates                                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
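
In Ray terms, serialization falls out of routing every read and write through a single actor: calls queue on the actor and execute one at a time. The sketch below shows how such an actor might look and be used; the ray.remote wrapper, eager JSON loading, and file handling are simplified assumptions rather than the actual implementation.

Code
import json
import ray

@ray.remote
class PlanIOActor:
    """Single actor that owns the plan file; all I/O is serialized through it."""

    def __init__(self, plan_file_path: str):
        self._plan_file = plan_file_path
        try:
            with open(plan_file_path) as f:
                self._plans = json.load(f)
        except FileNotFoundError:
            self._plans = {}

    def get_plan(self, theorem_id: str):
        return self._plans.get(theorem_id)

    def save_plan(self, theorem_id: str, plan) -> None:
        self._plans[theorem_id] = plan
        with open(self._plan_file, "w") as f:   # single writer, so no locking needed
            json.dump(self._plans, f)

    def has_plan(self, theorem_id: str) -> bool:
        return theorem_id in self._plans

# Usage from any worker: calls queue on the actor and run one at a time.
# ray.init()
# plan_io = PlanIOActor.remote("plan.json")
# ray.get(plan_io.save_plan.remote("thm_1", ["have h1 : A", "have h2 : B"]))
# print(ray.get(plan_io.get_plan.remote("thm_1")))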

┌─────────────────────────────────────────────────────────────────────────┐
│              PLANNER RETRY LOGIC                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Plan generation can fail for various reasons.                          │
│  BFS-Prover implements automatic retries:                               │
│                                                                          │
│  RETRY LOOP:                                                             │
│  ───────────                                                             │
│                                                                          │
│  def generate_plan_with_retry(theorem, max_retries=3):                  │
│      for attempt in range(max_retries):                                 │
│          # Step 1: Generate plan via LLM                                │
│          plan = planner.generate(theorem.statement)                     │
│                                                                          │
│          if plan is None:                                               │
│              return None  # Generation failed, don't retry              │
│                                                                          │
│          # Step 2: Verify plan syntax in Lean                           │
│          success, error = verifier.verify_plan(theorem, plan)           │
│                                                                          │
│          if success:                                                    │
│              return plan  # Valid plan!                                 │
│                                                                          │
│          # Verification failed, log error and retry                     │
│          log(f"Plan verification failed: {error}")                      │
│                                                                          │
│      return None  # All retries exhausted                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RETRY TRIGGERS:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  Retry on verification failure:                                         │
│    • Syntax error in have statement                                     │
│    • Type mismatch                                                      │
│    • Unknown identifier                                                 │
│    • Invalid Lean 4 syntax                                              │
│                                                                          │
│  DO NOT retry on generation failure:                                    │
│    • API timeout                                                        │
│    • Model refused to generate                                          │
│    • Empty response                                                     │
│                                                                          │
│  RATIONALE: Verification failures might succeed with different         │
│  LLM sampling. Generation failures indicate systemic issues.           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RETRY STATISTICS:                                                       │
│  ─────────────────                                                       │
│  In practice:                                                            │
│    • 75% of plans succeed on first attempt                              │
│    • 15% succeed on retry 2                                             │
│    • 7% succeed on retry 3                                              │
│    • 3% fail all retries                                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
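
Here is a runnable sketch of that policy: verification failures are retried, while a generation failure aborts immediately, matching the rationale above. The planner and verifier are toy stand-ins passed in as callables.

Code
from typing import Callable, List, Optional, Tuple

def generate_plan_with_retry(
    statement: str,
    generate: Callable[[str], Optional[List[str]]],
    verify: Callable[[str, List[str]], Tuple[bool, str]],
    max_retries: int = 3,
) -> Optional[List[str]]:
    for attempt in range(max_retries):
        plan = generate(statement)
        if plan is None:
            return None                      # generation failure: systemic, don't retry
        ok, error = verify(statement, plan)
        if ok:
            return plan                      # verified plan
        print(f"attempt {attempt}: verification failed: {error}")
    return None                              # all retries exhausted

# Example with toy stand-ins: the second sampled plan verifies.
attempts = iter([["have h1 : False"], ["have h1 : 1 + 1 = 2"]])
plan = generate_plan_with_retry(
    "theorem t : 1 + 1 = 2",
    generate=lambda s: next(attempts),
    verify=lambda s, p: (("False" not in p[0]), "type mismatch"),
)
print(plan)  # ['have h1 : 1 + 1 = 2']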

SearchResult Complete Fields

The SearchResult dataclass captures comprehensive information:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              SEARCHRESULT DATACLASS (COMPLETE)                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  @dataclass                                                              │
│  class SearchResult:                                                     │
│      # ─────────────────────────────────────────────────────────────    │
│      # IDENTIFICATION                                                    │
│      # ─────────────────────────────────────────────────────────────    │
│      theorem: Theorem              # The theorem attempted              │
│      attempt_number: int           # Which attempt (0 to pass_k-1)      │
│                                                                          │
│      # ─────────────────────────────────────────────────────────────    │
│      # PROOF DATA (if successful)                                        │
│      # ─────────────────────────────────────────────────────────────    │
│      proof: Optional[List[Tuple[str, str]]]                             │
│          # List of (tactic_state, tactic) pairs                        │
│          # None if no proof found                                       │
│                                                                          │
│      proof_stats: Optional[List[Tuple[float, float]]]                   │
│          # List of (execution_time, logprob) per tactic                │
│          # Enables analysis of proof efficiency                        │
│                                                                          │
│      # ─────────────────────────────────────────────────────────────    │
│      # TRAINING DATA                                                     │
│      # ─────────────────────────────────────────────────────────────    │
│      preference_pairs: List[Tuple[str, str, str]]                       │
│          # List of (state, winning_tactic, losing_tactic)              │
│          # Extracted from successful proof vs failed siblings          │
│          # Used for DPO/preference learning                            │
│                                                                          │
│      # ─────────────────────────────────────────────────────────────    │
│      # SEARCH METRICS                                                    │
│      # ─────────────────────────────────────────────────────────────    │
│      explored_nodes: int           # Total nodes explored               │
│      total_time: float            # Wall clock time in seconds         │
│      vllm_time: float             # Time waiting for model             │
│      lean_time: float             # Time executing tactics             │
│                                                                          │
│      # ─────────────────────────────────────────────────────────────    │
│      # STATUS FLAGS                                                      │
│      # ─────────────────────────────────────────────────────────────    │
│      root_status: Status          # OPEN, PROVED, or FAILED            │
│                                                                          │
│      proof_validation_passed: Optional[bool]                            │
│          # None: Validation not attempted                              │
│          # True: Trajectory replayed successfully                      │
│          # False: Replay failed (possible bug or race condition)       │
│                                                                          │
│      # ─────────────────────────────────────────────────────────────    │
│      # ERROR TRACKING                                                    │
│      # ─────────────────────────────────────────────────────────────    │
│      crashed: bool                # True if search crashed             │
│      crash_reason: Optional[str]  # Exception message if crashed       │
│      dojo_restarts: int           # Number of Dojo restart attempts    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  USAGE IN AGGREGATION:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  After all attempts complete:                                           │
│                                                                          │
│  best_result = None                                                     │
│  for result in results:                                                 │
│      if result.proof_validation_passed == True:                         │
│          best_result = result                                           │
│          break                                                          │
│      if result.root_status == PROVED and best_result is None:          │
│          best_result = result                                           │
│                                                                          │
│  # Aggregate preference pairs across all attempts                       │
│  all_pairs = []                                                         │
│  for result in results:                                                 │
│      all_pairs.extend(result.preference_pairs)                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
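
A trimmed-down, runnable version of the result type and the aggregation rule looks like this. Only the fields needed for aggregation are kept, and the Status enum values are assumptions based on the flags described above.

Code
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

class Status(Enum):
    OPEN = "open"
    PROVED = "proved"
    FAILED = "failed"

@dataclass
class SearchResult:
    attempt_number: int
    root_status: Status
    proof_validation_passed: Optional[bool] = None
    preference_pairs: List[Tuple[str, str, str]] = field(default_factory=list)

def pick_best(results: List[SearchResult]) -> Optional[SearchResult]:
    """Prefer a validated proof; otherwise fall back to the first PROVED result."""
    best = None
    for r in results:
        if r.proof_validation_passed is True:
            return r
        if r.root_status is Status.PROVED and best is None:
            best = r
    return best

results = [
    SearchResult(0, Status.FAILED),
    SearchResult(1, Status.PROVED, proof_validation_passed=False),
    SearchResult(2, Status.PROVED, proof_validation_passed=True),
]
print(pick_best(results).attempt_number)  # 2
all_pairs = [p for r in results for p in r.preference_pairs]  # training data from every attempt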

Process Crash Recovery

BFS-Prover handles various crash scenarios gracefully:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              PROCESS CRASH RECOVERY                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CRASH TYPE 1: Lean Process Already Gone                                │
│  ───────────────────────────────────────                                 │
│                                                                          │
│  Scenario: Lean crashed before we tried to execute a tactic             │
│                                                                          │
│  Detection:                                                              │
│    try:                                                                  │
│        lean_proc = psutil.Process(dojo.proc.pid)                        │
│    except psutil.NoSuchProcess:                                         │
│        # Process already gone!                                          │
│                                                                          │
│  Recovery:                                                               │
│    # Attempt tactic anyway - Dojo might auto-restart                    │
│    return await asyncio.to_thread(dojo.run_tac, state, tac)            │
│                                                                          │
│  Rationale: LeanDojo sometimes restarts Lean internally.               │
│  Attempting the tactic lets Dojo handle recovery.                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CRASH TYPE 2: Hard Timeout (Unresponsive Lean)                         │
│  ──────────────────────────────────────────────                          │
│                                                                          │
│  Scenario: Lean hung, wall-clock timeout triggered                      │
│                                                                          │
│  Detection: LeanCPUTimedOut exception with killed_hard=True             │
│                                                                          │
│  Recovery:                                                               │
│    1. Kill entire process tree:                                         │
│       _kill_tree(process)  # Recursive SIGKILL                          │
│                                                                          │
│    2. Clean up resources:                                               │
│       gc.collect()  # Force garbage collection                          │
│                                                                          │
│    3. Raise DojoRestartRequired                                         │
│                                                                          │
│    4. Caller catches and restarts:                                      │
│       for restart_attempt in range(MAX_DOJO_RESTARTS):  # 3 attempts   │
│           try:                                                          │
│               dojo = create_fresh_dojo(theorem)                         │
│               return await _search(dojo, ...)                           │
│           except DojoRestartRequired:                                   │
│               continue                                                  │
│       raise RuntimeError("Max restarts exceeded")                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CRASH TYPE 3: Child Process Died Mid-Monitoring                        │
│  ───────────────────────────────────────────────                         │
│                                                                          │
│  Scenario: Child process of Lean died while monitoring CPU              │
│                                                                          │
│  Detection: psutil.NoSuchProcess during _cpu_tree() traversal          │
│                                                                          │
│  Recovery:                                                               │
│    def _cpu_tree(proc):                                                 │
│        total = 0.0                                                      │
│        try:                                                             │
│            times = proc.cpu_times()                                     │
│            total += times.user + times.system                           │
│        except psutil.NoSuchProcess:                                     │
│            pass  # Process gone, skip                                   │
│                                                                          │
│        for child in proc.children(recursive=True):                      │
│            try:                                                         │
│                child_times = child.cpu_times()                          │
│                total += child_times.user + child_times.system           │
│            except psutil.NoSuchProcess:                                 │
│                pass  # Child gone, skip                                 │
│                                                                          │
│        return total                                                     │
│                                                                          │
│  Rationale: Child processes can die at any time. Graceful handling     │
│  prevents false timeouts and crashes.                                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CRASH TYPE 4: Double Future Completion                                  │
│  ──────────────────────────────────────                                  │
│                                                                          │
│  Scenario: asyncio Future set twice (race condition)                    │
│                                                                          │
│  Detection: InvalidStateError when setting result                       │
│                                                                          │
│  Recovery:                                                               │
│    def _safe_set_result(future, result):                                │
│        try:                                                             │
│            future.set_result(result)                                    │
│        except asyncio.InvalidStateError:                                │
│            pass  # Already set, ignore                                  │
│                                                                          │
│    def _safe_set_exception(future, exc):                                │
│        try:                                                             │
│            future.set_exception(exc)                                    │
│        except asyncio.InvalidStateError:                                │
│            pass  # Already set, ignore                                  │
│                                                                          │
│  Rationale: Multiple code paths might try to complete the same         │
│  future. Safe setters prevent crashes.                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
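
The last two recovery patterns are easy to show in isolation. The sketch below pairs a safe future-completion helper with a restart loop around a DojoRestartRequired exception; the create_dojo and run_search callables are hypothetical stand-ins for the real entry points.

Code
import asyncio

class DojoRestartRequired(Exception):
    """Raised when Lean must be torn down and recreated (crash type 2)."""

def safe_set_result(future: asyncio.Future, result) -> None:
    """Complete a future, ignoring the race where it was already completed."""
    try:
        future.set_result(result)
    except asyncio.InvalidStateError:
        pass  # already set by another code path

def safe_set_exception(future: asyncio.Future, exc: BaseException) -> None:
    try:
        future.set_exception(exc)
    except asyncio.InvalidStateError:
        pass  # already set by another code path

MAX_DOJO_RESTARTS = 3

async def search_with_restarts(create_dojo, run_search, theorem):
    """Recreate the Lean environment up to MAX_DOJO_RESTARTS times on hard timeouts."""
    for _ in range(MAX_DOJO_RESTARTS):
        dojo = create_dojo(theorem)
        try:
            return await run_search(dojo)
        except DojoRestartRequired:
            continue  # tear down and try again with a fresh Dojo
    raise RuntimeError("Max Dojo restarts exceeded")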

TheoremType Enum and Data Loading

BFS-Prover supports different theorem splits:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              THEOREMTYPE ENUM AND DATA LOADING                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THEOREM TYPES:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  class TheoremType(Enum):                                               │
│      GENERIC = "generic"    # Load all theorems regardless of split    │
│      VALID = "valid"        # Load only validation split               │
│      TEST = "test"          # Load only test split                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  USAGE IN CONFIGURATION:                                                 │
│  ───────────────────────                                                 │
│                                                                          │
│  @dataclass                                                              │
│  class ProofSearchConfig:                                               │
│      theorem_type: str = "generic"  # or "valid" or "test"             │
│      lines_cap: int = 0             # 0 = no limit, >0 = limit for debug│
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DATA LOADING LOGIC:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  def load_theorems(statements_file, dojo_file, theorem_type, cap):      │
│      theorems = []                                                      │
│                                                                          │
│      # Load statement data                                              │
│      with open(statements_file) as f:                                   │
│          for line in f:                                                 │
│              data = json.loads(line)                                    │
│                                                                          │
│              # Filter by type                                           │
│              if theorem_type != "generic":                              │
│                  if data["split"] != theorem_type:                      │
│                      continue                                           │
│                                                                          │
│              theorems.append(data)                                      │
│                                                                          │
│      # Apply cap for debugging                                          │
│      if cap > 0:                                                        │
│          theorems = theorems[:cap]                                      │
│                                                                          │
│      # Match with dojo data                                             │
│      dojo_data = load_dojo_data(dojo_file)                              │
│      return [                                                           │
│          Theorem(                                                       │
│              statement=t["statement"],                                  │
│              full_name=t["id"],                                         │
│              **dojo_data[t["id"]]                                       │
│          )                                                              │
│          for t in theorems                                              │
│      ]                                                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TYPICAL SPLITS:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  miniF2F dataset:                                                        │
│    • valid: 244 theorems (development set)                              │
│    • test: 244 theorems (evaluation set)                                │
│                                                                          │
│  Usage:                                                                  │
│    # Develop/tune on validation set                                     │
│    python main.py --theorem_type valid                                  │
│                                                                          │
│    # Final evaluation on test set                                       │
│    python main.py --theorem_type test                                   │
│                                                                          │
│    # Debug on 10 theorems                                               │
│    python main.py --theorem_type generic --lines_cap 10                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
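
A self-contained version of the split filtering (without the Dojo matching step) might look like the sketch below; the record fields mirror the JSONL format shown above, but the function name and in-memory input are illustrative.

Code
import json
from enum import Enum
from typing import List

class TheoremType(Enum):
    GENERIC = "generic"   # load all theorems regardless of split
    VALID = "valid"       # load only the validation split
    TEST = "test"         # load only the test split

def filter_statements(jsonl_lines: List[str],
                      theorem_type: TheoremType,
                      cap: int = 0) -> List[dict]:
    """Filter statement records by split, then optionally cap the list for debugging."""
    records = []
    for line in jsonl_lines:
        data = json.loads(line)
        if theorem_type is not TheoremType.GENERIC and data["split"] != theorem_type.value:
            continue
        records.append(data)
    return records[:cap] if cap > 0 else records

lines = [
    '{"id": "t1", "split": "valid", "statement": "theorem t1 : 1 + 1 = 2"}',
    '{"id": "t2", "split": "test", "statement": "theorem t2 : 2 + 2 = 4"}',
]
print([r["id"] for r in filter_statements(lines, TheoremType.TEST)])  # ['t2']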

CPU Time Accounting Details

The timeout system uses sophisticated CPU tracking:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              CPU TIME ACCOUNTING (_cpu_tree)                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CHALLENGE: Accurate CPU time for process trees                         │
│  ──────────────────────────────────────────────                          │
│                                                                          │
│  Lean spawns child processes. We need TOTAL CPU time across:            │
│    • Main Lean process                                                  │
│    • All child processes (compiler, elaborator, etc.)                   │
│    • Grandchildren (if any)                                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SNAPSHOT-BASED MONITORING:                                              │
│  ─────────────────────────────                                           │
│                                                                          │
│  Instead of cumulative tracking (which misses retired children),        │
│  we use snapshot-based monitoring:                                      │
│                                                                          │
│  def _cpu_tree(proc: psutil.Process) -> float:                          │
│      """Get CURRENT total CPU time for process tree."""                 │
│      total = 0.0                                                        │
│                                                                          │
│      # Main process CPU time                                            │
│      try:                                                               │
│          times = proc.cpu_times()                                       │
│          total += times.user + times.system                             │
│      except psutil.NoSuchProcess:                                       │
│          return total  # Process gone                                   │
│                                                                          │
│      # Recursively add children                                         │
│      try:                                                               │
│          for child in proc.children(recursive=True):                    │
│              try:                                                       │
│                  child_times = child.cpu_times()                        │
│                  total += child_times.user + child_times.system         │
│              except psutil.NoSuchProcess:                               │
│                  pass  # Child died between listing and query           │
│      except psutil.NoSuchProcess:                                       │
│          pass  # Parent died while getting children                     │
│                                                                          │
│      return total                                                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RETIRED PROCESS ACCOUNTING:                                             │
│  ───────────────────────────                                             │
│                                                                          │
│  Problem: If a child process dies, its CPU time is lost.               │
│                                                                          │
│  Solution: Track "baseline" CPU time of retired processes:              │
│                                                                          │
│  class CPUMonitor:                                                      │
│      def __init__(self):                                                │
│          self.baseline = 0.0  # CPU from dead processes                │
│          self.known_pids = set()                                        │
│                                                                          │
│      def get_total_cpu(self, proc):                                     │
│          current = _cpu_tree(proc)                                      │
│          current_pids = {p.pid for p in proc.children(recursive=True)} │
│                                                                          │
│          # Check for retired processes                                  │
│          retired = self.known_pids - current_pids                       │
│          if retired:                                                    │
│              # Their CPU is already in baseline from previous snapshot │
│              pass                                                       │
│                                                                          │
│          # Update known PIDs                                            │
│          self.known_pids = current_pids                                 │
│                                                                          │
│          return self.baseline + current                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MONITORING LOOP:                                                        │
│  ────────────────                                                        │
│                                                                          │
│  def monitor_with_timeout(proc, cpu_limit, wall_limit, poll=0.5):       │
│      start_wall = time.monotonic()                                      │
│                                                                          │
│      while True:                                                        │
│          cpu_time = _cpu_tree(proc)                                     │
│          wall_time = time.monotonic() - start_wall                      │
│                                                                          │
│          if cpu_time > cpu_limit:                                       │
│              raise LeanCPUTimedOut(                                     │
│                  limit_type="cpu",                                      │
│                  cpu_elapsed=cpu_time,                                  │
│                  wall_elapsed=wall_time,                                │
│                  killed_hard=False                                      │
│              )                                                          │
│                                                                          │
│          if wall_time > wall_limit:                                     │
│              _kill_tree(proc)  # Force kill                             │
│              raise LeanCPUTimedOut(                                     │
│                  limit_type="wall",                                     │
│                  cpu_elapsed=cpu_time,                                  │
│                  wall_elapsed=wall_time,                                │
│                  killed_hard=True  # Triggers Dojo restart              │
│              )                                                          │
│                                                                          │
│          time.sleep(poll)                                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
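
Putting the two ideas together, a minimal monitor might snapshot the process tree on each poll and fold the last observed CPU of any retired child into a baseline, so the reported total never drops when short-lived children exit. The class below is a sketch of that idea using psutil, not the production CPUMonitor; it ignores edge cases such as PID reuse.

Code
import psutil

class CPUMonitor:
    """Snapshot CPU of a process tree, keeping credit for retired children."""

    def __init__(self, proc: psutil.Process):
        self.proc = proc
        self.baseline = 0.0     # CPU seconds credited to already-dead processes
        self.last_seen = {}     # pid -> last observed user+system CPU seconds

    def total_cpu(self) -> float:
        current = {}
        procs = [self.proc]
        try:
            procs += self.proc.children(recursive=True)
        except psutil.NoSuchProcess:
            procs = [self.proc]            # parent gone; children list unavailable
        for p in procs:
            try:
                t = p.cpu_times()
                current[p.pid] = t.user + t.system
            except psutil.NoSuchProcess:
                pass                       # died between listing and query
        # Fold the last observed CPU of any retired pid into the baseline
        for pid, secs in self.last_seen.items():
            if pid not in current:
                self.baseline += secs
        self.last_seen = current
        return self.baseline + sum(current.values())

if __name__ == "__main__":
    monitor = CPUMonitor(psutil.Process())   # monitor this script's own tree
    print(f"{monitor.total_cpu():.3f}s CPU used so far")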

Production Configuration vs Defaults

The shell scripts reveal production tuning that differs from defaults:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              PRODUCTION CONFIGURATION VS DEFAULTS                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Parameter               │ Default  │ Production │ Reason               │
│  ────────────────────────┼──────────┼────────────┼──────────────────────│
│  pass_k                  │ 1        │ 512        │ More attempts        │
│  sampling_temperature    │ 0.7      │ 1.3        │ More diversity       │
│  depth_reward            │ 0.0      │ 2.0        │ Explore deeper       │
│  tactic_timeout          │ 10       │ 12         │ Harder tactics       │
│  timeout_per_theorem     │ 600      │ 600        │ Same                 │
│  n_sampling_search       │ 16       │ 4-16       │ Varies by stage      │
│  num_provers             │ 96       │ 128        │ More parallelism     │
│  num_tac_gen             │ 8        │ 7          │ Memory constraints   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY HIGHER TEMPERATURE (1.3 vs 0.7):                                   │
│  ────────────────────────────────────                                    │
│                                                                          │
│  With pass_k=512 attempts, diversity matters more than consistency.     │
│  Higher temperature:                                                     │
│    • More varied tactic suggestions                                     │
│    • Different proof strategies explored                                │
│    • Less likely to repeat same mistakes                                │
│                                                                          │
│  With pass_k=1, you want temperature=0.7 for reliable single attempts.  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY HIGHER DEPTH_REWARD (2.0 vs 0.0):                                  │
│  ─────────────────────────────────────                                   │
│                                                                          │
│  depth_reward=0.0: Priority = cumulative_logprob                        │
│    → Strongly prefers short paths with high confidence                  │
│    → May miss longer but correct proofs                                 │
│                                                                          │
│  depth_reward=2.0: Priority = cumulative_logprob / depth²               │
│    → Normalizes aggressively by depth                                   │
│    → Willing to explore deeper paths                                    │
│    → Better for complex theorems requiring 10+ steps                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PRODUCTION SHELL SCRIPT (run_local_search.sh):                         │
│  ──────────────────────────────────────────────                          │
│                                                                          │
│  python -m src.search.main \                                            │
│      --model_path "path/to/BFS-Prover-V2-32B" \                         │
│      --pass_k 512 \                                                     │
│      --sampling_temperature 1.3 \                                       │
│      --depth_reward 2.0 \                                               │
│      --tactic_timeout 12 \                                              │
│      --timeout_per_theorem 600 \                                        │
│      --num_provers 128 \                                                │
│      --num_tac_gen 7 \                                                  │
│      --theorem_type test \                                              │
│      --statement_file src/data/minif2f_statements.jsonl \               │
│      --dojo_file src/data/minif2f_dojo.jsonl \                          │
│      --plan_file src/data/minif2f_plan.json                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PLANNING SCRIPT DIFFERENCES (run_local_plan.sh):                       │
│  ────────────────────────────────────────────────                        │
│                                                                          │
│  Planning uses different settings:                                      │
│    • num_generators=7 (PlanGenerator actors)                            │
│    • num_planners=7 (parallel theorem processing)                       │
│    • max_retries=3 (verification retry attempts)                        │
│    • No pass_k (single plan per theorem)                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
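The depth normalization described above is easy to state in code. Here is a minimal sketch, assuming the priority is the cumulative log-probability divided by depth raised to the power depth_reward; the function name is hypothetical, and the box only pins down the depth_reward = 0.0 and 2.0 cases.

Code
def node_priority(cumulative_logprob: float, depth: int, depth_reward: float = 0.0) -> float:
    """Illustrative best-first priority (higher = expanded sooner).

    cumulative_logprob is <= 0, so dividing by depth ** depth_reward pulls
    deep nodes toward zero and raises their priority relative to shallow ones.
    """
    if depth_reward == 0.0:
        # Pure log-probability: short, high-confidence prefixes dominate the queue.
        return cumulative_logprob
    # depth_reward = 2.0 reproduces the "cumulative_logprob / depth^2" rule above.
    return cumulative_logprob / (max(depth, 1) ** depth_reward)

For example, with depth_reward = 2.0 a 10-step node at log-probability -12 scores -0.12 and outranks a 2-step node at -1.0 (score -0.25), which is how the production setting keeps the search from stalling on shallow, high-confidence prefixes.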

Part IX: Benchmark Results and Analysis

Performance on miniF2F

miniF2F is the standard benchmark for neural theorem provers: 488 competition mathematics problems (AMC, AIME, IMO) formalized in Lean, split evenly into 244 validation and 244 test theorems.

Model                        | miniF2F-test | miniF2F-valid
-----------------------------|--------------|--------------
GPT-f (2020)                 | 29.3%        | -
PACT (2022)                  | 33.6%        | -
HTPS (2022)                  | 41.0%        | -
LeanDojo (2023)              | 48.4%        | -
DeepSeek-Prover-V1.5 (2024)  | 63.5%        | 60.2%
BFS-Prover-V2-7B             | 82.4%        | -
BFS-Prover-V2-32B            | 86.1%        | 85.5%
BFS-Prover-V2-32B + Planner  | 95.08%       | 95.5%

Performance on ProofNet

ProofNet is a harder benchmark of problems drawn from undergraduate mathematics textbooks (analysis, algebra, topology), where current provers still struggle:

Model              | ProofNet-test
-------------------|--------------
BFS-Prover-V2-32B  | 41.4%

Analysis: Component Contributions

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              COMPONENT CONTRIBUTION ANALYSIS                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  BASE MODEL (7B)                                                        │
│  ───────────────                                                         │
│  Performance: 82.4% on miniF2F-test                                     │
│                                                                          │
│  + SCALE TO 32B                                                         │
│  ────────────────                                                        │
│  Performance: 86.1% (+3.7%)                                             │
│  Insight: Larger models improve tactic generation quality               │
│                                                                          │
│  + PLANNING                                                              │
│  ───────────                                                             │
│  Performance: 95.08% (+8.98%)                                           │
│  Insight: Hierarchical decomposition is the biggest single gain!        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHERE PLANNING HELPS MOST:                                              │
│  ─────────────────────────────                                           │
│  ✓ Long proofs (10+ steps) - decomposition reduces effective depth     │
│  ✓ Non-obvious lemmas - planner can suggest key intermediate steps     │
│  ✓ Complex algebra - multiple equivalent rewriting paths               │
│  ✓ Inequality chains - direction of manipulation matters               │
│                                                                          │
│  WHERE PLANNING HELPS LESS:                                              │
│  ──────────────────────────                                              │
│  • Short proofs (2-3 steps) - no decomposition needed                  │
│  • Direct application - single powerful tactic suffices                │
│  • Structural proofs - induction/cases with obvious structure          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

