
Neural Theorem Proving: From Best-First Search to Gold-Medal Proofs

A comprehensive guide to neural theorem proving systems. From best-first search algorithms to planner-enhanced multi-agent architectures—understand how AI systems are achieving state-of-the-art results on formal mathematics.


The Quest for Automated Mathematical Proof

Automated theorem proving represents one of the oldest and most challenging problems in artificial intelligence. The dream is simple yet profound: build systems that can discover and verify mathematical proofs automatically. Unlike other AI domains where "good enough" might suffice, theorem proving demands absolute correctness—a proof is either valid or it isn't.

Recent breakthroughs have transformed this landscape. In 2025, neural theorem provers achieved results that seemed impossible only a few years ago: 95.08% on miniF2F (a benchmark of competition mathematics) and solutions to problems from the International Mathematical Olympiad. These systems don't just guess answers—they construct formal proofs that can be mechanically verified.

This post provides a comprehensive technical guide to BFS-Prover-V2, ByteDance's state-of-the-art theorem proving system for Lean 4. We'll explore every layer of the architecture:

  1. Foundations: Why theorem proving is fundamentally different from other AI tasks
  2. Best-First Search: The core algorithm that guides proof exploration
  3. The Proof Tree: Data structures for tracking search state
  4. The Planning Pipeline: How LLMs decompose hard theorems into tractable subgoals
  5. Training at Scale: Multi-stage expert iteration that overcomes performance plateaus
  6. Inference at Scale: Planner-enhanced multi-agent tree search
  7. Production Engineering: Timeout handling, parallelism, and robustness
Code
┌─────────────────────────────────────────────────────────────────────────┐
│              NEURAL THEOREM PROVING TIMELINE                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1960s-1990s    SYMBOLIC ERA                                            │
│  ───────────    Resolution theorem provers, ATP competitions            │
│                 Pure logic, no learning                                  │
│                                                                          │
│  2016-2019      NEURAL GUIDANCE EMERGES                                 │
│  ──────────     DeepMath, GamePad, early neural ATP                     │
│                 Neural networks guide symbolic search                    │
│                                                                          │
│  2020-2022      LANGUAGE MODELS ENTER                                   │
│  ──────────     GPT-f, PACT, LeanDojo                                   │
│                 Transformers generate proof steps                        │
│                                                                          │
│  2023-2024      SCALING BREAKTHROUGHS                                   │
│  ──────────     AlphaProof, DeepSeek-Prover, Llemma                     │
│                 RL training, expert iteration                            │
│                                                                          │
│  2025           STATE OF THE ART                                        │
│  ─────          BFS-Prover-V2: 95.08% miniF2F                           │
│                 Multi-agent tree search + planning                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part I: Understanding Theorem Proving

Why Theorem Proving is Different

Before diving into algorithms, we must understand why theorem proving poses unique challenges that distinguish it from other AI domains.

The verification asymmetry:

In most machine learning tasks, evaluation is expensive and approximate. Is this image classification correct? Is this translation good? These require human judgment or proxy metrics. Theorem proving is fundamentally different: verification is cheap and exact. Given a candidate proof, we can mechanically check its validity in milliseconds. A proof either type-checks in the proof assistant or it doesn't—there's no gray area.

This asymmetry creates a unique opportunity. We can generate many candidate proofs and filter to those that verify. The challenge shifts from "how do we know if we're right?" to "how do we efficiently explore the space of possible proofs?"

The search space explosion:

A mathematical proof is a sequence of logical steps. At each step, there might be hundreds of potentially applicable tactics (proof steps). A 10-step proof with 100 choices per step creates 100^10 = 10^20 possible paths. Exhaustive search is impossible.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    THE PROOF SEARCH SPACE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                        Theorem Statement                                 │
│                              │                                           │
│              ┌───────────────┼───────────────┐                          │
│              │               │               │                           │
│              ▼               ▼               ▼                           │
│          Tactic 1        Tactic 2        Tactic 3      ... (100+ options)│
│              │               │               │                           │
│         ┌────┴────┐     ┌────┴────┐     ┌────┴────┐                     │
│         │    │    │     │    │    │     │    │    │                     │
│         ▼    ▼    ▼     ▼    ▼    ▼     ▼    ▼    ▼    (100+ each)     │
│        ...  ...  ...   ...  ...  ...   ...  ...  ...                    │
│                                                                          │
│                      Search space: O(b^d)                                │
│                      b = branching factor (~100)                        │
│                      d = proof depth (~10-50)                           │
│                                                                          │
│              MOST PATHS LEAD TO DEAD ENDS                               │
│              Only tiny fraction reach valid proofs                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The credit assignment problem:

When a proof search fails after exploring thousands of paths, which decisions were mistakes? Was the initial approach wrong? Did we miss a crucial insight at step 5? Did we take a correct but inefficient path? Unlike games with clear win/loss signals, proof search failures provide sparse feedback.

Formal language constraints:

Proofs must be written in precise formal languages (Lean, Coq, Isabelle). These languages are unforgiving—a misplaced parenthesis or incorrect type annotation invalidates the entire proof. LLMs trained on informal mathematics must learn the exacting syntax of formal systems.

Proof Assistants and Lean 4

Modern theorem proving systems target proof assistants—interactive software environments where humans and computers collaborate on formal proofs. The most prominent include:

| Proof Assistant | Language | Notable Uses |
|-----------------|----------|--------------|
| Lean 4 | Functional | Mathlib (1M+ lines of math), BFS-Prover |
| Coq | Functional | CompCert (verified C compiler) |
| Isabelle | Functional | seL4 (verified OS kernel) |
| HOL Light | Functional | Flyspeck (Kepler conjecture) |

Why Lean 4 dominates AI theorem proving:

Lean 4 has emerged as the preferred target for neural theorem provers for several reasons:

  1. Mathlib: A vast library of formalized mathematics (undergraduate through research level) provides training data and useful lemmas
  2. LeanDojo: A Python interface that enables programmatic interaction with Lean, essential for reinforcement learning
  3. Modern design: Lean 4's metaprogramming capabilities enable sophisticated tactic automation
  4. Active community: Rapid development of both the system and mathematical libraries

The tactic state paradigm:

In Lean, proofs proceed by transforming a tactic state—a list of goals that remain to be proven. Each tactic transforms the current state into a new state (ideally with simpler or fewer goals).

Code
Initial state:
  ⊢ ∀ n : ℕ, n + 0 = n                    [Goal: prove this statement]

After "intro n":
  n : ℕ
  ⊢ n + 0 = n                              [Now proving for specific n]

After "rfl":
  No goals                                  [Proof complete!]

The art of theorem proving is selecting the right tactic at each step. This is where neural networks enter.
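For reference, here is the same proof written as a stand-alone Lean 4 file, using the two tactics from the trace above (the theorem name is illustrative):

Code
import Mathlib

theorem add_zero_example : ∀ n : ℕ, n + 0 = n := by
  intro n   -- move the universally quantified n into the context
  rfl       -- n + 0 reduces to n by the definition of addition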


Part II: Best-First Search for Theorem Proving

The Core Algorithm

Best-first search (the "BFS" in BFS-Prover) is a fundamental algorithm for exploring large search spaces. Unlike breadth-first search (which explores all nodes at depth d before depth d+1) or depth-first search (which goes deep before backtracking), best-first search maintains a priority queue and always expands the most promising node.

Why best-first search for theorem proving:

  1. Informed exploration: We can use neural networks to estimate which proof states are most likely to lead to success
  2. Anytime behavior: We can stop at any point and return the best proof found so far
  3. Natural parallelism: Multiple workers can explore different branches simultaneously
  4. Backtracking: Unlike greedy search, we can abandon unpromising paths and try alternatives

The priority function:

The key design decision in best-first search is the priority function—how do we score nodes to determine exploration order? BFS-Prover uses cumulative log probability:

\text{Priority}(n) = \sum_{i=1}^{d} \log P(\text{tactic}_i \mid \text{state}_{i-1})

Where d is the depth (the number of tactics on the path from the root to node n). This prioritizes proof paths where the neural network is confident in each step.

The depth reward parameter:

Raw cumulative log probability has a bias: longer paths accumulate more negative log probability (since each step multiplies probability < 1). This can make the search prefer short, dead-end paths over longer successful ones.

BFS-Prover introduces a depth reward parameter α that adjusts priorities:

\text{AdjustedPriority}(n) = \frac{\text{Priority}(n)}{d^{\alpha}}

| α value | Effect | Use case |
|---------|--------|----------|
| 0.0 | Pure cumulative logprob | Prefer confident paths (default) |
| 1.0 | Average logprob per step | Balance depth vs. confidence |
| 2.0 | Strong depth normalization | Encourage deeper exploration |
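
In code, the adjusted priority is a one-liner. The sketch below (the function name is made up for illustration) shows how α trades confidence against depth:

Python
import math

def adjusted_priority(cumulative_logprob: float, depth: int, alpha: float) -> float:
    """Higher is better; alpha interpolates between raw confidence and depth-normalized scores."""
    if depth == 0:
        return 0.0                                  # root node: no tactics applied yet
    return cumulative_logprob / (depth ** alpha)

# A 4-step path whose steps each had probability ~0.5:
logprob = 4 * math.log(0.5)                         # ≈ -2.77
print(adjusted_priority(logprob, 4, alpha=0.0))     # -2.77  (pure cumulative logprob)
print(adjusted_priority(logprob, 4, alpha=1.0))     # ≈ -0.69 (average logprob per step)

Because log probabilities are negative, dividing by d^α with α > 0 makes deep paths less negative, i.e. higher priority, which counteracts the short-path bias described above.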

The Search Loop in Detail

The BFS-Prover search operates in two distinct phases when a plan exists:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              TWO-PHASE SEARCH WITH PLANNING                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PHASE 1: PLAN EXECUTION (if plan exists)                               │
│  ─────────────────────────────────────────                               │
│                                                                          │
│  Plan: [have h1: ..., have h2: ..., have h3: ..., have h4: ...]        │
│                                                                          │
│  For each plan_step in plan:                                            │
│    1. Check ProgressCache for cached progress                           │
│    2. If cached_progress > local_progress:                              │
│       → Load proof trajectory from cache                                │
│       → extend_tree() to replay solved steps                            │
│    3. Apply plan_step + "sorry" to advance context                      │
│    4. Run BFS focused on current subgoal (pp1 mode)                     │
│    5. If subgoal solved:                                                │
│       → Update ProgressCache with trajectory                            │
│       → progress += 1                                                   │
│       → reset_search_root() for next subgoal                            │
│    6. If stuck:                                                         │
│       → Trigger dynamic replanning                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PHASE 2: FINAL SEARCH (after plan completion OR no plan)               │
│  ────────────────────────────────────────────────────────                │
│                                                                          │
│  Run unrestricted BFS on remaining goals                                │
│  Uses full tactic state (pp mode, not pp1)                              │
│  Continues until:                                                        │
│    - Proof found (root.status == PROVED)                                │
│    - All paths exhausted (root.status == FAILED)                        │
│    - Timeout exceeded                                                    │
│    - Priority queue empty                                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The main expansion loop:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              BEST-FIRST SEARCH EXPANSION LOOP                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INITIALIZE:                                                             │
│  ───────────                                                             │
│  root = InternalNode(state=initial_state, depth=0, cumulative_logprob=0)│
│  priority_queue = AsyncPriorityQueue()                                  │
│  nodes = {initial_state: root}  # State → Node cache for deduplication  │
│  priority_queue.put((-0.0, root))  # Negative for max-heap behavior     │
│                                                                          │
│  TERMINATION CONDITIONS:                                                 │
│  ───────────────────────                                                 │
│  • priority_queue.empty()                                               │
│  • total_time > timeout (default 600s)                                  │
│  • root.status == PROVED                                                │
│  • root.status == FAILED                                                │
│                                                                          │
│  MAIN LOOP:                                                              │
│  ──────────                                                              │
│                                                                          │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  1. DEQUEUE                                                  │     │
│     │     priority, node = priority_queue.get()                   │     │
│     │     Skip if: node.out_edges is not None (already explored)  │     │
│     │     Skip if: node.status != OPEN                            │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  2. GET STATE REPRESENTATION                                 │     │
│     │     Plan mode: tactic_state = node.state.pp1                │     │
│     │       (First goal only, case statements removed)            │     │
│     │     Normal mode: tactic_state = node.state.pp               │     │
│     │       (Full pretty-print of all goals)                      │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  3. GENERATE TACTICS                                         │     │
│     │     suggestions = tac_gen.generate(tactic_state, params)    │     │
│     │     Returns: [(tactic₁, logprob₁), ..., (tacticₙ, logprobₙ)]│     │
│     │     Deduplication: keep first occurrence of each unique     │     │
│     │     Filtering: remove sorry, admit, native_decide, bugs     │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  4. EXECUTE EACH TACTIC                                      │     │
│     │     For each (tactic, logprob) in suggestions:              │     │
│     │       response, exec_time = run_tactic(node, tactic)        │     │
│     │                                                              │     │
│     │       response types:                                        │     │
│     │         TacticState → new proof state                       │     │
│     │         ProofFinished → no remaining goals!                 │     │
│     │         ProofGivenUp → proof abandoned                      │     │
│     │         LeanError → tactic failed                           │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  5. CREATE NODES AND EDGES                                   │     │
│     │                                                              │     │
│     │     If response is TacticState:                             │     │
│     │       If state in nodes: reuse existing node (dedup)        │     │
│     │       Else: create new InternalNode                         │     │
│     │                                                              │     │
│     │     If response is ProofFinished/ProofGivenUp:              │     │
│     │       Create ProofFinishedNode                              │     │
│     │                                                              │     │
│     │     If response is LeanError:                               │     │
│     │       Create ErrorNode                                      │     │
│     │                                                              │     │
│     │     Create Edge(tactic, src=node, dst=new_node,             │     │
│     │                 time=exec_time, logprob=logprob)            │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  6. ADD TO QUEUE                                             │     │
│     │     For each new InternalNode with status == OPEN:          │     │
│     │       new_priority = cumulative_logprob / (depth^α)         │     │
│     │       priority_queue.put((-new_priority, new_node))         │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  7. UPDATE STATUS                                            │     │
│     │     node.out_edges = [all created edges]  # Mark explored   │     │
│     │     For each edge.dst: edge.dst.in_edges.append(edge)       │     │
│     │     Trigger status propagation (see next section)           │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                           │                                              │
│                           ▼                                              │
│     ┌─────────────────────────────────────────────────────────────┐     │
│     │  8. EARLY EXIT CHECK (Plan Mode)                             │     │
│     │     If has_plan and only 1 remaining goal:                  │     │
│     │       Break to advance to next plan step                    │     │
│     └─────────────────────────────────────────────────────────────┘     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
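
Condensed into Python, the expansion loop looks roughly like this. Class and method names follow the diagrams above and the node/edge sketches in Part III, not the actual source; plan mode, in-edge bookkeeping, and error handling are omitted:

Python
import heapq
import itertools
import time

def best_first_search(root, tac_gen, run_tactic, alpha=0.0, timeout=600):
    tie = itertools.count()                       # tie-breaker so the heap never compares nodes
    queue = [(-0.0, next(tie), root)]             # negated priorities give max-heap behavior
    nodes = {root.state: root}                    # state -> node, for deduplication
    start = time.monotonic()

    while queue and time.monotonic() - start < timeout:
        if root.status in (Status.PROVED, Status.FAILED):
            break
        _, _, node = heapq.heappop(queue)
        if node.out_edges is not None or node.status != Status.OPEN:
            continue                              # already explored or already resolved

        edges = []
        for tactic, logprob in tac_gen.generate(node.state.pp):
            response, exec_time = run_tactic(node, tactic)
            if isinstance(response, TacticState):
                child = nodes.get(response)
                if child is None:                 # new state: create it and cache it
                    child = InternalNode(state=response, depth=node.depth + 1,
                                         cumulative_logprob=node.cumulative_logprob + logprob)
                    nodes[response] = child
            elif isinstance(response, (ProofFinished, ProofGivenUp)):
                child = ProofFinishedNode(response, node.depth + 1)
            else:                                 # LeanError (including tactic timeouts)
                child = ErrorNode(response, node.depth + 1)

            edges.append(Edge(tactic, src=node, dst=child,
                              time=exec_time, logprob=logprob))
            if isinstance(child, InternalNode) and child.status == Status.OPEN:
                priority = child.cumulative_logprob / (child.depth ** alpha)
                heapq.heappush(queue, (-priority, next(tie), child))

        node.out_edges = edges                    # freezes the node; status and distance
                                                  # propagation are triggered at this point
    return root.status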

Part III: The Proof Tree Data Structure

Node Types

BFS-Prover maintains an explicit tree data structure representing the explored proof space. There are three types of nodes:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PROOF TREE NODE TYPES                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                     InternalNode                                 │    │
│  │                                                                  │    │
│  │  Non-terminal state with remaining goals to prove               │    │
│  │                                                                  │    │
│  │  Attributes:                                                     │    │
│  │  ───────────                                                     │    │
│  │  • state: TacticState                                           │    │
│  │      The current proof state (goals + hypotheses)               │    │
│  │                                                                  │    │
│  │  • depth: int                                                   │    │
│  │      Distance from root node (root has depth=0)                 │    │
│  │                                                                  │    │
│  │  • cumulative_logprob: float                                    │    │
│  │      Sum of log probabilities along path from root              │    │
│  │                                                                  │    │
│  │  • _out_edges: Optional[List[Edge]]                             │    │
│  │      None if unexplored, list of edges if explored              │    │
│  │      Setting this "freezes" the node's exploration              │    │
│  │                                                                  │    │
│  │  • _in_edges: List[Edge]                                        │    │
│  │      Edges pointing TO this node (may grow as other paths       │    │
│  │      discover this state through node reuse)                    │    │
│  │                                                                  │    │
│  │  • _status: Status (OPEN | PROVED | FAILED)                     │    │
│  │      Computed lazily, cached with dirty flag                    │    │
│  │                                                                  │    │
│  │  • _distance_to_proof: float                                    │    │
│  │      Minimum edges to reach ProofFinishedNode (inf if unknown)  │    │
│  │                                                                  │    │
│  │  • _marker: bool                                                │    │
│  │      Internal flag for marking subgoal boundaries in plan mode  │    │
│  │                                                                  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                   ProofFinishedNode                              │    │
│  │                                                                  │    │
│  │  Terminal success - no remaining goals                          │    │
│  │                                                                  │    │
│  │  Attributes:                                                     │    │
│  │  ───────────                                                     │    │
│  │  • inner: Union[ProofFinished, ProofGivenUp]                    │    │
│  │      The Lean response indicating completion                    │    │
│  │                                                                  │    │
│  │  • depth: int                                                   │    │
│  │                                                                  │    │
│  │  Constants:                                                      │    │
│  │  ──────────                                                      │    │
│  │  • status: PROVED (always)                                      │    │
│  │  • distance_to_proof: 0 (already at proof)                      │    │
│  │  • is_terminal: True                                            │    │
│  │                                                                  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                      ErrorNode                                   │    │
│  │                                                                  │    │
│  │  Terminal failure - tactic error or timeout                     │    │
│  │                                                                  │    │
│  │  Attributes:                                                     │    │
│  │  ───────────                                                     │    │
│  │  • inner: LeanError                                             │    │
│  │      Error message from Lean or timeout                         │    │
│  │                                                                  │    │
│  │  • depth: int                                                   │    │
│  │                                                                  │    │
│  │  Constants:                                                      │    │
│  │  ──────────                                                      │    │
│  │  • status: FAILED (always)                                      │    │
│  │  • distance_to_proof: math.inf (unreachable)                    │    │
│  │  • is_terminal: True                                            │    │
│  │                                                                  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
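
A minimal Python rendering of these three node types (simplified; the real classes also carry the plan-mode marker and the propagation machinery described below):

Python
import math
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Status(Enum):
    OPEN = "open"        # not yet determined
    PROVED = "proved"    # some path reaches a ProofFinishedNode
    FAILED = "failed"    # every path dead-ends

@dataclass
class InternalNode:
    state: "TacticState"                        # goals + hypotheses at this point
    depth: int                                  # distance from the root
    cumulative_logprob: float                   # sum of log probs along the path
    _out_edges: Optional[List["Edge"]] = None   # None until explored
    _in_edges: List["Edge"] = field(default_factory=list)
    _status: Status = Status.OPEN
    _distance_to_proof: float = math.inf
    is_terminal: bool = False

    @property
    def status(self) -> Status:
        return self._status

    @property
    def distance_to_proof(self) -> float:
        return self._distance_to_proof

@dataclass
class ProofFinishedNode:
    inner: object                               # ProofFinished / ProofGivenUp response
    depth: int
    status: Status = Status.PROVED              # terminal success
    distance_to_proof: float = 0.0
    is_terminal: bool = True

@dataclass
class ErrorNode:
    inner: object                               # LeanError (or timeout) response
    depth: int
    status: Status = Status.FAILED              # terminal failure
    distance_to_proof: float = math.inf
    is_terminal: bool = True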

Edge Structure

Edges represent tactic applications connecting nodes:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                         EDGE STRUCTURE                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                         Edge                                     │    │
│  │                                                                  │    │
│  │  Represents a single tactic application                         │    │
│  │                                                                  │    │
│  │  Attributes:                                                     │    │
│  │  ───────────                                                     │    │
│  │  • tactic: str                                                  │    │
│  │      The tactic text (e.g., "simp [Nat.add_zero]")              │    │
│  │                                                                  │    │
│  │  • src: InternalNode                                            │    │
│  │      Source node (where tactic was applied)                     │    │
│  │                                                                  │    │
│  │  • dst: Node                                                    │    │
│  │      Destination node (result of applying tactic)               │    │
│  │      Can be InternalNode, ProofFinishedNode, or ErrorNode       │    │
│  │                                                                  │    │
│  │  • time: float                                                  │    │
│  │      Wall-clock execution time in seconds                       │    │
│  │                                                                  │    │
│  │  • logprob: float                                               │    │
│  │      Per-token average log probability from model               │    │
│  │      For plan tactics: logprob = 1.0 (placeholder)              │    │
│  │                                                                  │    │
│  │  Methods:                                                        │    │
│  │  ────────                                                        │    │
│  │  • distance_to_proof(): 1 + dst.distance_to_proof               │    │
│  │                                                                  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  LOGPROB CALCULATION:                                                    │
│  ────────────────────                                                    │
│  Raw from vLLM: cumulative_logprob over all tokens                      │
│  Normalized: cumulative_logprob / len(token_ids)                        │
│  This gives per-token average, normalizing for tactic length            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
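
The edge itself is a small record; the one non-obvious detail is the per-token logprob normalization. A sketch with field names mirroring the diagram:

Python
from dataclasses import dataclass

@dataclass
class Edge:
    tactic: str          # e.g. "simp [Nat.add_zero]"
    src: "InternalNode"  # node the tactic was applied to
    dst: object          # InternalNode, ProofFinishedNode, or ErrorNode
    time: float          # wall-clock execution time in seconds
    logprob: float       # per-token average log probability (1.0 for plan steps)

    def distance_to_proof(self) -> float:
        return 1 + self.dst.distance_to_proof

# Normalizing a vLLM completion so longer tactics are not penalized:
#   per_token_logprob = output.cumulative_logprob / len(output.token_ids)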

Status Propagation

Status propagates upward from leaves to root through in-edges:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    STATUS PROPAGATION ALGORITHM                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STATUS VALUES:                                                          │
│  ──────────────                                                          │
│  • OPEN: Not yet determined (default for InternalNode)                  │
│  • PROVED: At least one path leads to ProofFinishedNode                 │
│  • FAILED: All paths lead to ErrorNode or explored dead ends            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ALGORITHM (_recompute_status):                                          │
│  ──────────────────────────────                                          │
│                                                                          │
│  1. If node.status != OPEN:                                             │
│       return  # Status is immutable once set                            │
│                                                                          │
│  2. If node.out_edges is None:                                          │
│       return  # Not yet explored, remain OPEN                           │
│                                                                          │
│  3. Check children status:                                              │
│                                                                          │
│       new_status = FAILED  # Assume worst case                          │
│                                                                          │
│       For each edge in node.out_edges:                                  │
│         child_status = edge.dst.status                                  │
│                                                                          │
│         If child_status == PROVED:                                      │
│           new_status = PROVED                                           │
│           break  # One proved child is enough!                          │
│                                                                          │
│         If child_status == OPEN:                                        │
│           new_status = OPEN  # Can't be FAILED if any child is OPEN    │
│                                                                          │
│  4. If new_status != OPEN:                                              │
│       node._status = new_status                                         │
│                                                                          │
│       # Propagate upward through in-edges                               │
│       For each in_edge in node._in_edges:                               │
│         in_edge.src._recompute_status()  # Recursive call              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INVARIANTS:                                                         │
│  ───────────────                                                         │
│  • Status never changes once set to PROVED or FAILED                    │
│  • PROVED propagates immediately (any child proved → parent proved)     │
│  • FAILED only propagates when ALL children are FAILED                  │
│  • Proof found when root.status == PROVED                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE:                                                                │
│                                                                          │
│         [Root: OPEN]                   [Root: PROVED] ←── propagates up │
│          /       \                      /       \                        │
│     [OPEN]      [OPEN]     →→→    [PROVED]    [OPEN]                   │
│      /  \         |                 /  \         |                       │
│   [ERR] [OPEN]  [ERR]           [ERR] [PF]    [ERR]                     │
│                                          ↑                               │
│                              ProofFinished found!                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
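
Using the Status enum and fields from the node sketch above, the propagation step is a short recursive function:

Python
def recompute_status(node: "InternalNode") -> None:
    """Re-derive a node's status from its children, then push changes toward the root."""
    if node._status != Status.OPEN:
        return                              # PROVED / FAILED are immutable once set
    if node._out_edges is None:
        return                              # unexplored: stays OPEN

    new_status = Status.FAILED              # pessimistic default
    for edge in node._out_edges:
        child_status = edge.dst.status
        if child_status == Status.PROVED:
            new_status = Status.PROVED      # one proved child suffices
            break
        if child_status == Status.OPEN:
            new_status = Status.OPEN        # cannot conclude FAILED while a child is OPEN

    if new_status != Status.OPEN:
        node._status = new_status
        for in_edge in node._in_edges:      # propagate upward through in-edges
            recompute_status(in_edge.src)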

Distance-to-Proof Tracking

Once a proof is found, we need to extract the shortest path:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DISTANCE-TO-PROOF PROPAGATION                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ALGORITHM (_recompute_distance_to_proof):                              │
│  ─────────────────────────────────────────                               │
│                                                                          │
│  1. If node.out_edges is None or empty:                                 │
│       distance = infinity                                               │
│                                                                          │
│  2. Else:                                                               │
│       distance = min(edge.distance_to_proof() for edge in out_edges)   │
│                                                                          │
│       where edge.distance_to_proof() = 1 + edge.dst.distance_to_proof  │
│                                                                          │
│  3. If distance improved (new < old):                                   │
│       node._distance_to_proof = distance                                │
│                                                                          │
│       # Propagate upward                                                │
│       For each in_edge in node._in_edges:                               │
│         in_edge.src._recompute_distance_to_proof()                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PROOF EXTRACTION:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  Starting from root with distance d:                                    │
│    path = []                                                            │
│    node = root                                                          │
│    While node is not terminal:                                          │
│      best_edge = argmin(edge.distance_to_proof() for edge in out_edges)│
│      path.append((node.state, best_edge.tactic))                       │
│      node = best_edge.dst                                               │
│    return path                                                          │
│                                                                          │
│  Result: Optimal (shortest) proof among all discovered paths            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
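
The distance update follows the same pattern; a sketch built on the node and edge classes above:

Python
import math

def recompute_distance_to_proof(node: "InternalNode") -> None:
    """Pull the minimum proof distance up from the children, propagating improvements."""
    if not node._out_edges:
        distance = math.inf                         # unexplored node or dead end
    else:
        distance = min(edge.distance_to_proof() for edge in node._out_edges)
    if distance < node._distance_to_proof:          # only propagate improvements
        node._distance_to_proof = distance
        for in_edge in node._in_edges:
            recompute_distance_to_proof(in_edge.src)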

Node Reuse (Deduplication)

Different tactic sequences can reach the same proof state. BFS-Prover maintains a cache to avoid redundant exploration:

Code
nodes: Dict[TacticState, Node] = {}     # Maps each visited state to its node

# When a tactic produces a TacticState `response`:
if response in nodes:
    dst_node = nodes[response]          # Reuse the existing node
    # The new edge is simply added to dst_node's in_edges
else:
    dst_node = InternalNode(...)        # First time seeing this state
    nodes[response] = dst_node          # Cache it for future deduplication

This dramatically reduces memory usage and computation when multiple paths converge to the same state.


Part IV: Tactic Generation

The TacticGenerator Actor

Tactic generation is handled by a Ray actor wrapping vLLM:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TACTIC GENERATOR                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RAY ACTOR: TacticGenerator                                             │
│  ──────────────────────────                                              │
│                                                                          │
│  Initialization:                                                         │
│  • Wraps vLLM AsyncLLMEngine                                            │
│  • max_num_batched_tokens: 8192                                         │
│  • max_num_seqs: 256 (concurrent sequences)                             │
│  • max_model_len: 16384 (context window)                                │
│  • enable_chunked_prefill: True                                         │
│  • tensor_parallel_size: num_gpus                                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  INPUT FORMAT:                                                           │
│  ─────────────                                                           │
│                                                                          │
│  The model receives: tactic_state + ":::"                               │
│                                                                          │
│  Example input:                                                          │
│  ┌──────────────────────────────────────────┐                           │
│  │ n : ℕ                                    │                           │
│  │ h : n > 0                                │                           │
│  │ ⊢ n + 0 = n:::                           │                           │
│  └──────────────────────────────────────────┘                           │
│                                                                          │
│  The ":::" delimiter signals "generate tactic here"                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SAMPLING PARAMETERS:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  │ Parameter    │ Default │ Purpose                                │    │
│  ├──────────────┼─────────┼────────────────────────────────────────┤    │
│  │ n            │ 16      │ Number of tactic samples               │    │
│  │ temperature  │ 0.7     │ Randomness in sampling                 │    │
│  │ top_p        │ 1.0     │ Nucleus sampling threshold             │    │
│  │ max_tokens   │ 2048    │ Maximum tactic length                  │    │
│  │ logprobs     │ 1       │ Return log probabilities               │    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  OUTPUT PROCESSING:                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  1. For each generated sequence:                                        │
│       text = output.text.strip()                                        │
│       logprob = cumulative_logprob / len(token_ids)  # Normalize        │
│                                                                          │
│  2. Filter out invalid tactics:                                         │
│       • Contains "sorry" → proof placeholder, not real tactic           │
│       • Contains "admit" → proof placeholder                            │
│       • Contains "native_decide" → can cause infinite loops             │
│                                                                          │
│  3. Filter buggy patterns:                                              │
│       • ("rcases" OR "cases'") AND "?_" → incomplete case split         │
│       • "simpa" AND (" _" OR "_ " OR "_," OR ",_") → malformed          │
│                                                                          │
│  4. Return: List[(tactic_text, per_token_logprob)]                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
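
A simplified, synchronous sketch of the generation step. The production system wraps vLLM's AsyncLLMEngine inside a Ray actor and also applies the bug-pattern filters listed above; the model path and function name here are placeholders:

Python
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/prover-checkpoint")        # placeholder model path
params = SamplingParams(n=16, temperature=0.7, top_p=1.0,
                        max_tokens=2048, logprobs=1)

BANNED = ("sorry", "admit", "native_decide")        # placeholders / unsafe tactics

def generate_tactics(tactic_state: str):
    """Sample candidate tactics for one proof state; return (tactic, logprob) pairs."""
    prompt = tactic_state + ":::"                   # ":::" marks "generate tactic here"
    outputs = llm.generate([prompt], params)[0].outputs
    results, seen = [], set()
    for out in outputs:
        text = out.text.strip()
        if not text or text in seen or any(b in text for b in BANNED):
            continue                                # dedupe and drop invalid tactics
        seen.add(text)
        per_token = out.cumulative_logprob / max(len(out.token_ids), 1)
        results.append((text, per_token))
    return results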

State Representation: pp vs pp1

BFS-Prover has two modes for representing the tactic state:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    STATE REPRESENTATION MODES                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FULL MODE (pp) - Used when no plan or in final search                  │
│  ─────────────────────────────────────────────────────                   │
│                                                                          │
│  Shows complete tactic state with all goals:                            │
│                                                                          │
│  ┌──────────────────────────────────────────┐                           │
│  │ case h1                                  │                           │
│  │ n : ℕ                                    │                           │
│  │ h : n > 0                                │                           │
│  │ ⊢ n + 0 = n                              │                           │
│  │                                          │                           │
│  │ case h2                                  │                           │
│  │ m : ℕ                                    │                           │
│  │ ⊢ m * 1 = m                              │                           │
│  │                                          │                           │
│  │ case h3                                  │                           │
│  │ ...more goals...                         │                           │
│  └──────────────────────────────────────────┘                           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FOCUS MODE (pp1) - Used during plan execution                          │
│  ─────────────────────────────────────────────                           │
│                                                                          │
│  Shows only the first goal, cleaned up:                                 │
│                                                                          │
│  Processing via custom property:                                        │
│    1. Extract first paragraph (before "\n\n")                           │
│    2. Remove "case ..." lines via regex                                 │
│    3. Remove "this : True" declarations                                 │
│    4. Strip whitespace                                                  │
│                                                                          │
│  Result:                                                                 │
│  ┌──────────────────────────────────────────┐                           │
│  │ n : ℕ                                    │                           │
│  │ h : n > 0                                │                           │
│  │ ⊢ n + 0 = n                              │                           │
│  └──────────────────────────────────────────┘                           │
│                                                                          │
│  WHY FOCUS MODE HELPS:                                                   │
│  ─────────────────────                                                   │
│  • Reduces noise from irrelevant goals                                  │
│  • Model focuses on immediate subgoal                                   │
│  • Shorter context = faster generation                                  │
│  • Dramatically improves tactic relevance when following a plan         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
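
The pp1 processing amounts to a few string operations. A minimal sketch of the four steps listed above:

Python
import re

def pp1(full_state: str) -> str:
    """Focus-mode view: first goal only, with case markers and trivial facts removed."""
    first = full_state.split("\n\n", 1)[0]                                # 1. first paragraph only
    first = re.sub(r"^case .*\n?", "", first, flags=re.MULTILINE)         # 2. drop "case ..." lines
    first = re.sub(r"^this : True\n?", "", first, flags=re.MULTILINE)     # 3. drop "this : True"
    return first.strip()                                                  # 4. strip whitespace

state = "case h1\nn : ℕ\nh : n > 0\n⊢ n + 0 = n\n\ncase h2\nm : ℕ\n⊢ m * 1 = m"
print(pp1(state))   # -> "n : ℕ\nh : n > 0\n⊢ n + 0 = n"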

Tactic Execution and Timeout

Executing tactics in Lean requires careful timeout handling:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TACTIC EXECUTION                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FUNCTION: run_tac_cpu_timeout(dojo, state, tactic, ...)                │
│  ───────────────────────────────────────────────────────                 │
│                                                                          │
│  Parameters:                                                             │
│  • cpu_sec_limit: 10.0 (default) - Max CPU time                         │
│  • wall_sec_limit: 15.0 (1.5x CPU) - Max wall-clock time                │
│  • poll_interval: 0.5 - How often to check limits                       │
│  • focus_mode: bool - Wrap tactic in "focus" command                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DUAL TIMEOUT MONITORING:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  A background thread monitors BOTH limits simultaneously:               │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │  Monitor Thread:                                                 │    │
│  │                                                                  │    │
│  │  While tactic_running:                                          │    │
│  │    cpu_time = process.cpu_times().user + .system               │    │
│  │    cpu_time += sum(child.cpu_times() for child in children)    │    │
│  │    wall_time = time.monotonic() - start                        │    │
│  │                                                                  │    │
│  │    If cpu_time > cpu_sec_limit:                                 │    │
│  │      raise LeanCPUTimedOut(limit_type="cpu", ...)              │    │
│  │                                                                  │    │
│  │    If wall_time > wall_sec_limit:                               │    │
│  │      raise LeanCPUTimedOut(limit_type="wall", ...)             │    │
│  │      killed_hard = True  # Process tree will be killed         │    │
│  │                                                                  │    │
│  │    sleep(poll_interval)                                         │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TIMEOUT OUTCOMES:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  Soft timeout (CPU limit):                                              │
│    → Return LeanError with timeout message                              │
│    → Create ErrorNode, continue search                                  │
│                                                                          │
│  Hard timeout (wall limit, Lean unresponsive):                          │
│    → Kill entire process tree (SIGKILL)                                 │
│    → Raise DojoRestartRequired                                          │
│    → Caller creates fresh Dojo instance                                 │
│    → Retry up to MAX_DOJO_RESTARTS (3) times                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY DUAL MONITORING:                                                    │
│  ────────────────────                                                    │
│  • CPU time catches infinite loops in Lean                              │
│  • Wall time catches I/O hangs, deadlocks, scheduling issues            │
│  • Child process accounting prevents timeout evasion via subprocess     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
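
A simplified version of the monitoring thread, using psutil for CPU accounting. Helper names are assumed; the real code also kills the process tree and raises DojoRestartRequired on a hard timeout:

Python
import time
import threading
import psutil

def monitor_lean_process(pid: int, stop_event: threading.Event, on_timeout,
                         cpu_sec_limit: float = 10.0, wall_sec_limit: float = 15.0,
                         poll_interval: float = 0.5) -> None:
    """Watch a Lean process and report whichever budget (CPU or wall clock) is exceeded first."""
    proc = psutil.Process(pid)
    start = time.monotonic()
    while not stop_event.is_set():                      # tactic still running
        try:
            times = proc.cpu_times()
            cpu = times.user + times.system
            for child in proc.children(recursive=True): # include subprocess CPU time
                ct = child.cpu_times()
                cpu += ct.user + ct.system
        except psutil.NoSuchProcess:
            return                                      # process already exited
        if cpu > cpu_sec_limit:
            on_timeout("cpu")                           # soft timeout: ErrorNode, keep searching
            return
        if time.monotonic() - start > wall_sec_limit:
            on_timeout("wall")                          # hard timeout: kill tree, restart Dojo
            return
        time.sleep(poll_interval)

# Typical use: threading.Thread(target=monitor_lean_process, args=(pid, stop, handler)).start()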

Part V: The Planning Pipeline

Why Planning Matters

Best-first search alone achieves strong results—but it struggles with theorems requiring long proof chains or non-obvious intermediate lemmas. The search space explodes exponentially with depth, and the neural network's confidence becomes unreliable for distant goals.

The insight: hierarchical decomposition.

Instead of searching for the entire proof at once, first decompose the theorem into intermediate subgoals. Each subgoal is easier to prove, and the planning step can leverage a more powerful reasoning model.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PLANNING IMPACT ON PERFORMANCE                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  WITHOUT PLANNING                    WITH PLANNING                      │
│  ─────────────────                   ─────────────                       │
│                                                                          │
│  Complex theorem                     Complex theorem                     │
│       │                                   │                              │
│       ▼                                   ▼                              │
│  ┌─────────────┐                   ┌──────────────┐                     │
│  │ Direct BFS  │                   │   Planner    │                     │
│  │             │                   │  (GPT-4/LLM) │                     │
│  │ Depth: 15+  │                   └──────┬───────┘                     │
│  │ Many dead   │                          │                             │
│  │ ends        │                   Decompose into                       │
│  └─────────────┘                   [h1, h2, h3, h4, h5]                 │
│                                          │                              │
│                              ┌───────────┼───────────┐                  │
│                              ▼           ▼           ▼                  │
│                          ┌──────┐   ┌──────┐   ┌──────┐                │
│                          │ BFS  │   │ BFS  │   │ BFS  │                │
│                          │ d=3  │   │ d=3  │   │ d=3  │                │
│                          └──────┘   └──────┘   └──────┘                │
│                                                                          │
│  BFS-Prover-V2-32B:                 BFS-Prover-V2-32B + Planner:        │
│  86.1% on miniF2F-test              95.08% on miniF2F-test              │
│                                                                          │
│                     +8.98% absolute improvement!                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Plan Structure and Format

A plan is a sequence of `have` statements—intermediate claims that bridge the theorem's hypotheses to its conclusion.

Example theorem:

Code
theorem singapore2019_r1_p7 (x : ℝ) (hx : Real.tan x = 5) :
    (6 + Real.sin (2 * x)) / (1 + Real.cos (2 * x)) = 83 := by

Generated plan:

Code
have h₁ : Real.sin x = 5 * Real.cos x
have h₂ : Real.sin x ^ 2 = 25 * Real.cos x ^ 2
have h₃ : 26 * Real.cos x ^ 2 = 1
have hsin2x_val : Real.sin (2 * x) = (5 : ℝ) / (13 : ℝ)
have hcos2x_val : Real.cos (2 * x) = -(12 : ℝ) / (13 : ℝ)

Plan JSON format (demo_plan.json):

JSON
{
  "mathd_algebra_17": {
    "plan": [
      "have h_sqrt_simplify : Real.sqrt (16 + 16 * a) = 4 * Real.sqrt (1 + a)",
      "have h_eq_transformed : Real.sqrt (4 + 4 * Real.sqrt (1 + a)) + Real.sqrt (1 + Real.sqrt (1 + a)) = 6",
      "have h_sum_eq : 3 * Real.sqrt (1 + Real.sqrt (1 + a)) = 6"
    ],
    "progress": 0
  }
}

The `progress` field tracks how many subgoals have been solved so far (0 = none) and persists across restarts.
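
To make the format concrete, here is a minimal sketch of reading such a plan file and skipping already-solved subgoals. The helper name is an illustration, not the repository's actual API; only the fields shown above are assumed.

Python
import json

# Hypothetical helper: return the subgoals that still need proving for a
# theorem, based on the `progress` counter in the plan file.
def load_remaining_subgoals(plan_path: str, theorem_id: str) -> list[str]:
    with open(plan_path) as f:
        plans = json.load(f)
    entry = plans[theorem_id]
    return entry["plan"][entry["progress"]:]  # solved subgoals are sliced off

remaining = load_remaining_subgoals("demo_plan.json", "mathd_algebra_17")
print(remaining)  # all three `have` steps while progress == 0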

The Planning Prompt

The planner is a general-purpose reasoning LLM prompted with carefully designed instructions. The prompt must specify the output format precisely and list mandatory rules for valid Lean 4 syntax.

Critical rules for Lean 4 (from prompt.yaml):

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              MANDATORY LEAN 4 RULES IN PLANNER PROMPTS                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ★★★ RULE 1: MOST CRITICAL - EXPLICIT SET/FINSET TYPES ★★★              │
│  ───────────────────────────────────────────────────────                 │
│  This causes ~40% of plan failures if violated!                         │
│                                                                          │
│  ✓ Correct:   ({-1, 0, 1} : Set ℤ)                                      │
│  ✓ Correct:   (Finset.Icc 1 42 : Finset ℕ)                              │
│  ✗ Wrong:     {-1, 0, 1}                                                │
│  ✗ Wrong:     Finset.Icc 1 42                                           │
│                                                                          │
│  Reason: Lean cannot infer element type from set literal notation       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 2: OMIT THE PROOF                                                  │
│  Never provide the proof (no `by` clause). Only the `have` statement.  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 3: VALID LEAN 4 CODE                                               │
│  Output must be type-checkable in Lean 4.10.0 environment.              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 4: USE EXISTING NAMES                                              │
│  Use exact lemma/definition names from Mathlib. Don't invent names.     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 5: NO UNDECLARED VARIABLES                                         │
│  Only use variables from the theorem statement.                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 6: EXPLICIT MULTIPLICATION                                         │
│  ✓ Correct:   a * x                                                     │
│  ✗ Wrong:     ax                                                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 7: NO CHAINED INEQUALITIES                                         │
│  ✓ Correct:   a ≤ x ∧ x ≤ b                                             │
│  ✗ Wrong:     a ≤ x ≤ b                                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 8: CORRECT LOGARITHM FUNCTION                                      │
│  Real.log = natural log only                                            │
│  ✓ Correct:   Real.logb 2 8     (log base 2 of 8)                       │
│  ✗ Wrong:     Real.log 2 8                                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 9: FACTORIAL NOTATION                                              │
│  ✓ Correct:   (n)! or Nat.factorial n                                   │
│  ✗ Wrong:     n!                                                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 10: EXPLICIT NUMERIC TYPE ANNOTATIONS                              │
│  Division: (1 : ℝ) / 2 = 0.5, but (1 : ℤ) / 2 = 0                       │
│  Subtraction: (1 : ℤ) - 2 = -1, but (1 : ℕ) - 2 = 0                     │
│  ✓ Correct:   (a : ℝ) / b, (n : ℤ) - m                                  │
│  ✗ Wrong:     a / b, n - m (ambiguous)                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 11: NO INTERVAL NOTATION                                           │
│  ✓ Correct:   a ≤ x ∧ x ≤ b                                             │
│  ✗ Wrong:     Icc a b                                                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 12: COMPLEX NUMBERS                                                │
│  ✓ Correct:   Complex.I (imaginary unit), Complex.abs z (modulus)       │
│  ✗ Wrong:     i, |z|                                                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 13: AVOID ADVANCED INEQUALITY THEOREMS                             │
│  Avoid Hölder's, Jensen's, etc. Prefer basic simplification steps.      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 14: EQUIVALENCES AS IMPLICATIONS                                   │
│  When handling ↔ (iff), produce implications:                           │
│  - Left-to-right: assume LHS, conclude RHS                              │
│  - Right-to-left: assume RHS, conclude LHS                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 15: Real.pi NOTATION                                               │
│  ✓ Correct:   Real.pi                                                   │
│  ✗ Wrong:     π                                                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 16: FINAL CHECK                                                    │
│  Before outputting, review all rules, especially Set/Finset types.      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
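
In practice, the planner call is just a chat request built from prompt.yaml and the theorem statement. The sketch below assumes a system prompt plus a user template with a `{theorem}` placeholder; the actual keys and layout of prompt.yaml may differ.

Python
import yaml

def build_planning_messages(prompt_path: str, theorem_statement: str) -> list[dict]:
    # Assumed prompt.yaml layout: {"system": ..., "initial_planning": "... {theorem} ..."}
    with open(prompt_path) as f:
        prompts = yaml.safe_load(f)
    return [
        {"role": "system", "content": prompts["system"]},
        {"role": "user",
         "content": prompts["initial_planning"].format(theorem=theorem_statement)},
    ]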

Plan Verification

Generated plans must be verified before use:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PLAN VERIFICATION PROCESS                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CLASS: TheoremData                                                      │
│  METHOD: verify_plan(theorem_id, plan) → (success, error_msg)           │
│                                                                          │
│  PROCESS:                                                                │
│  ────────                                                                │
│                                                                          │
│  1. Load theorem from dojo_data (url, commit, file_path)                │
│                                                                          │
│  2. Initialize Dojo with theorem                                        │
│                                                                          │
│  3. For each step in plan:                                              │
│       a. Execute: dojo.run_tac(state, step)                             │
│       b. Execute: dojo.run_tac(state, "sorry")  # Move to next goal     │
│                                                                          │
│       If step returns LeanError:                                        │
│         return (False, error_message)                                   │
│                                                                          │
│       If step returns wrong type:                                       │
│         return (False, "Unexpected response type")                      │
│                                                                          │
│  4. If all steps pass:                                                  │
│       return (True, "")                                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INSIGHT: "sorry" TACTIC                                            │
│  ────────────────────────────                                            │
│                                                                          │
│  After each `have` step, a "sorry" advances past the subgoal proof:    │
│                                                                          │
│  State: ⊢ have h1 : P                                                   │
│          ⊢ P      ← new goal (prove P)                                  │
│          ⊢ Q      ← rest of proof                                       │
│                                                                          │
│  After "sorry":                                                         │
│  State: h1 : P                                                          │
│          ⊢ Q      ← P is now assumed                                    │
│                                                                          │
│  This tests that plan steps are syntactically valid and type-correct   │
│  without requiring actual proofs.                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
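
A minimal sketch of this verification loop, written against the LeanDojo-style `Dojo`/`run_tac` interface the diagram describes (error handling and timeouts simplified):

Python
from lean_dojo import Dojo, LeanError, Theorem  # assumed LeanDojo-style API

def verify_plan(theorem: Theorem, plan: list[str]) -> tuple[bool, str]:
    """Check that every plan step type-checks, without proving anything."""
    with Dojo(theorem) as (dojo, state):
        for step in plan:
            state = dojo.run_tac(state, step)     # introduce the `have` claim
            if isinstance(state, LeanError):
                return False, state.error
            state = dojo.run_tac(state, "sorry")  # skip the subgoal's proof
            if isinstance(state, LeanError):
                return False, state.error
    return True, ""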

Dynamic Replanning

What happens when proof search gets stuck on a subgoal?

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DYNAMIC REPLANNING                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRIGGER: BFS fails to solve a plan step after timeout                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  REPLANNING INPUT:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  • Original theorem statement                                           │
│  • Proven subgoals: [h1, h2, ...] (verified, immutable)                 │
│  • Stuck subgoal: h3 (the step that failed)                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  REPLANNING RULES (additional to initial planning):                      │
│  ──────────────────────────────────────────────────                      │
│                                                                          │
│  RULE 16: INSERT AFTER PROVEN STEPS                                     │
│    New `have` statements go immediately after proven subgoals.          │
│    Proven steps are immutable - no edits, insertions, or extensions.   │
│                                                                          │
│  RULE 17: PROVIDE COMPLETE PLAN                                         │
│    Output the entire updated plan, not just new steps.                  │
│    Preserves proven subgoals exactly as given.                          │
│                                                                          │
│  RULE 18: ENSURE LOGICAL CONTINUITY                                     │
│    Each new step must be logically sound.                               │
│    Don't immediately repeat the stuck subgoal.                          │
│    Insert intermediate steps between proven work and stuck goal.        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE:                                                                │
│  ────────                                                                │
│                                                                          │
│  Original plan: [h1, h2, h3, h4, h5]                                    │
│  Proven: [h1, h2]                                                       │
│  Stuck: h3                                                              │
│                                                                          │
│  Replanned: [h1, h2, h3a, h3b, h3, h4, h5]                              │
│                       ↑     ↑                                            │
│               New intermediate steps                                     │
│                                                                          │
│  The planner decomposed h3 into smaller steps (h3a, h3b) that          │
│  make h3 easier to prove.                                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
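
A small sketch of how the replanning request might be assembled, and how the planner's answer can be checked against the immutability rules (function names are illustrative, not the repository's API):

Python
def build_replanning_context(theorem: str, plan: list[str], progress: int) -> dict:
    """Split a plan into the immutable proven prefix and the stuck step."""
    return {
        "theorem": theorem,
        "proven_steps": plan[:progress],  # verified subgoals: preserved verbatim
        "stuck_step": plan[progress],     # the step BFS failed to close
    }

def merge_replanned(proven: list[str], new_plan: list[str]) -> list[str]:
    """Accept a replanned list only if it keeps the proven prefix intact."""
    if new_plan[:len(proven)] != proven:
        raise ValueError("Replanned output modified proven subgoals")
    return new_plan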

Progress Caching with Ray

The ProgressCache Ray actor persists progress across restarts:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PROGRESS CACHE                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RAY ACTOR: ProgressCache (named "progress_cache")                      │
│  ─────────────────────────────────────────────────                       │
│                                                                          │
│  DATA STRUCTURE:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  registry[theorem.full_name] = {                                        │
│      'progress': int,  # 0 to len(plan)                                 │
│      'proof': [                                                         │
│          [],  # Tactics for subgoal 0                                   │
│          [],  # Tactics for subgoal 1                                   │
│          ...                                                            │
│          [],  # Tactics for final step                                  │
│      ],                                                                  │
│      'proof_stats': [                                                   │
│          [],  # [(time, logprob), ...] for subgoal 0                   │
│          ...                                                            │
│      ],                                                                  │
│      'preference_pairs': [                                              │
│          [],  # [(state, win_tactic, lose_tactic), ...]                │
│          ...                                                            │
│      ]                                                                   │
│  }                                                                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  METHODS:                                                                │
│  ────────                                                                │
│                                                                          │
│  initialize(theorem, num_subgoals, replace=False)                       │
│    Create empty entry for theorem                                       │
│                                                                          │
│  update(theorem, step_index, proof, proof_stats, preference_pairs)      │
│    1. Validate trajectory via validate_proof(focus_mode=True)           │
│    2. Only update if validation succeeds                                │
│    3. Increment progress                                                │
│                                                                          │
│  get_progress(theorem) → int                                            │
│    Return current progress (number of solved subgoals)                  │
│                                                                          │
│  get_proof_data(theorem) → (proof, stats, pairs)                        │
│    Return accumulated data for all solved subgoals                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY THIS MATTERS:                                                       │
│  ─────────────────                                                       │
│  • Multiple prover agents may work on same theorem                      │
│  • Prevents duplicate work across restarts                              │
│  • Enables resume from checkpoints after crashes                        │
│  • Thread-safe via Ray actor model                                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
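
Here is a simplified sketch of such an actor; the real ProgressCache also validates trajectories before updating and tracks proof statistics and preference pairs, which are omitted here.

Python
import ray

@ray.remote
class ProgressCache:
    """Named singleton actor that records per-theorem proof progress."""

    def __init__(self):
        self.registry = {}

    def initialize(self, theorem_name: str, num_subgoals: int, replace: bool = False):
        if replace or theorem_name not in self.registry:
            self.registry[theorem_name] = {
                "progress": 0,
                "proof": [[] for _ in range(num_subgoals)],
            }

    def update(self, theorem_name: str, step_index: int, proof: list):
        entry = self.registry[theorem_name]
        entry["proof"][step_index] = proof   # real code validates the trajectory first
        entry["progress"] = step_index + 1

    def get_progress(self, theorem_name: str) -> int:
        return self.registry[theorem_name]["progress"]

# Created once and looked up by name from any prover agent:
#   cache = ProgressCache.options(name="progress_cache").remote()
#   cache = ray.get_actor("progress_cache")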

Part VI: Scaling Training

The Performance Plateau Problem

Traditional fine-tuning on theorem proving data quickly hits diminishing returns. After a few epochs, the model stops improving—even with more data or compute.

Why plateaus occur:

  1. Distribution shift: The model learns to prove theorems from the training distribution, but struggles with novel problems
  2. Error accumulation: Small errors in early proof steps compound into dead ends
  3. Memorization: The model memorizes specific proofs rather than learning general strategies
  4. Data staleness: Static training data doesn't capture the model's current failure modes

Multi-Stage Expert Iteration

BFS-Prover addresses plateaus with a multi-stage expert iteration framework:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              MULTI-STAGE EXPERT ITERATION                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OUTER LOOP (triggered when performance plateaus):                      │
│  ────────────────────────────────────────────────                        │
│                                                                          │
│       ┌───────────────────────────────────────────────────────┐         │
│       │                                                        │         │
│       │  1. DATA RE-SYNTHESIS                                  │         │
│       │     • Run current model on all theorems               │         │
│       │     • Collect successful proof trajectories           │         │
│       │     • Extract preference pairs from failures          │         │
│       │                                                        │         │
│       │  2. DATA CURATION                                      │         │
│       │     • Filter low-quality proofs                       │         │
│       │     • Remove duplicates and near-duplicates           │         │
│       │     • Balance difficulty distribution                 │         │
│       │                                                        │         │
│       │  3. FULL RETRAINING                                    │         │
│       │     • Restart from BASE checkpoint (not current!)     │         │
│       │     • Train on curated data                           │         │
│       │     • Prevents overfitting to stale distributions     │         │
│       │                                                        │         │
│       └───────────────────────────────────────────────────────┘         │
│                              │                                           │
│                              ▼                                           │
│                                                                          │
│  INNER LOOP (while performance improves):                               │
│  ─────────────────────────────────────────                               │
│                                                                          │
│       ┌───────────────────────────────────────────────────────┐         │
│       │                                                        │         │
│       │  1. ROLLOUT                                            │         │
│       │     • Run proof search on theorem set                 │         │
│       │     • Collect new proofs and failures                 │         │
│       │                                                        │         │
│       │  2. TACTIC FILTERING                                   │         │
│       │     • Remove low-confidence tactics                   │         │
│       │     • Filter tactics that frequently fail             │         │
│       │     • Keep tactics that lead to proofs                │         │
│       │                                                        │         │
│       │  3. REFINEMENT                                         │         │
│       │     • Fine-tune on filtered data                      │         │
│       │     • Update model for next iteration                 │         │
│       │                                                        │         │
│       └───────────────────────────────────────────────────────┘         │
│                              │                                           │
│                              ▼                                           │
│                     Evaluate performance                                 │
│                              │                                           │
│              ┌───────────────┴───────────────┐                          │
│              │                               │                           │
│         Still improving?              Plateaued?                        │
│              │                               │                           │
│              ▼                               ▼                           │
│      Continue inner loop           Return to outer loop                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INSIGHT: RESTART FROM BASE                                          │
│  ──────────────────────────────                                          │
│  When plateaus occur, DON'T continue from current checkpoint.           │
│  Instead, regenerate all training data, then restart from the           │
│  original base model. This prevents accumulation of training            │
│  artifacts and distribution drift.                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
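
The control flow boils down to a two-level loop. The sketch below uses placeholder callables (rollout, curate, train, evaluate) rather than the actual training code, but it captures the key decision: on a plateau, regenerate the data and retrain from the base checkpoint.

Python
def expert_iteration(base_ckpt, theorems, rollout, curate, train, evaluate,
                     outer_rounds: int = 3, patience: int = 2):
    """Sketch: inner refinement loop inside an outer re-synthesis loop."""
    model = base_ckpt
    for _ in range(outer_rounds):
        # Outer step: re-synthesize data with the current model, then retrain
        # FROM BASE to avoid accumulating distribution drift.
        model = train(base_ckpt, curate(rollout(model, theorems)))
        best, stale = evaluate(model), 0
        while stale < patience:
            # Inner step: rollout -> tactic filtering -> fine-tune.
            model = train(model, curate(rollout(model, theorems)))
            score = evaluate(model)
            if score > best:
                best, stale = score, 0
            else:
                stale += 1   # plateau counter; back to the outer loop when exhausted
    return model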

Preference Pair Extraction

From the search tree of each successful proof, BFS-Prover pairs the winning tactic at every step with sibling tactics that failed, yielding preference pairs for training:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PREFERENCE PAIR EXTRACTION                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FUNCTION: extract_proof_data(root_node)                                │
│  ────────────────────────────────────────                                │
│                                                                          │
│  For each edge in the successful proof trajectory:                      │
│                                                                          │
│    source_node = edge.src                                               │
│    winning_tactic = edge.tactic                                         │
│                                                                          │
│    For each sibling_edge in source_node.out_edges:                      │
│      If sibling_edge.dst is ErrorNode:                                  │
│        losing_tactic = sibling_edge.tactic                              │
│                                                                          │
│        # Create preference pair                                         │
│        pair = (                                                         │
│            source_node.state.pp1,  # or .pp depending on mode           │
│            winning_tactic,                                              │
│            losing_tactic                                                │
│        )                                                                 │
│        preference_pairs.append(pair)                                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE:                                                                │
│  ────────                                                                │
│                                                                          │
│  State: n : ℕ ⊢ n + 0 = n                                               │
│                                                                          │
│  Tactics tried:                                                         │
│    • "rfl"        → ProofFinished (success!)                            │
│    • "simp"       → ErrorNode (failed)                                  │
│    • "ring"       → ErrorNode (failed)                                  │
│                                                                          │
│  Preference pairs generated:                                            │
│    ("n : ℕ ⊢ n + 0 = n", "rfl", "simp")                                │
│    ("n : ℕ ⊢ n + 0 = n", "rfl", "ring")                                │
│                                                                          │
│  USE IN TRAINING:                                                        │
│  ────────────────                                                        │
│  These pairs enable DPO (Direct Preference Optimization) or similar    │
│  ranking objectives that teach the model to prefer successful tactics. │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
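
The extraction itself is a short walk over the proof tree. The sketch below uses simplified stand-ins for the node and edge types described earlier in the post:

Python
from dataclasses import dataclass, field

@dataclass
class Node:
    state_pp: str                 # pretty-printed proof state
    out_edges: list = field(default_factory=list)
    is_error: bool = False

@dataclass
class Edge:
    tactic: str
    src: Node
    dst: Node

def extract_preference_pairs(proof_edges: list[Edge]) -> list[tuple[str, str, str]]:
    """Pair each winning tactic with sibling tactics that hit an error node."""
    pairs = []
    for edge in proof_edges:          # edges along the successful trajectory
        for sibling in edge.src.out_edges:
            if sibling is not edge and sibling.dst.is_error:
                pairs.append((edge.src.state_pp, edge.tactic, sibling.tactic))
    return pairs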

Part VII: Scaling Inference

Multi-Agent Ray Architecture

BFS-Prover uses Ray for distributed computation:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              MULTI-AGENT RAY ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                     RAY CLUSTER                                  │    │
│  │                                                                  │    │
│  │  ┌────────────────────────────────────────────────────────┐     │    │
│  │  │           TACTIC GENERATORS (GPU Actors)                │     │    │
│  │  │                                                         │     │    │
│  │  │   ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐     │     │    │
│  │  │   │TacGen #1│ │TacGen #2│ │TacGen #3│ │TacGen #4│     │     │    │
│  │  │   │ vLLM    │ │ vLLM    │ │ vLLM    │ │ vLLM    │     │     │    │
│  │  │   │ (GPU)   │ │ (GPU)   │ │ (GPU)   │ │ (GPU)   │     │     │    │
│  │  │   └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘     │     │    │
│  │  │        └───────────┴─────┬─────┴───────────┘           │     │    │
│  │  │                          │                              │     │    │
│  │  │   Default: num_tac_gen = 8                             │     │    │
│  │  │   GPUs per generator: num_gpus_per_tac_gen = 1         │     │    │
│  │  │                                                         │     │    │
│  │  └────────────────────────────────────────────────────────┘     │    │
│  │                             │                                    │    │
│  │                             │ Tactic Requests (async)           │    │
│  │                             ▼                                    │    │
│  │  ┌────────────────────────────────────────────────────────┐     │    │
│  │  │              PROVER AGENTS (CPU Actors)                 │     │    │
│  │  │                                                         │     │    │
│  │  │  ┌───────┐ ┌───────┐ ┌───────┐       ┌───────┐        │     │    │
│  │  │  │Prover │ │Prover │ │Prover │  ...  │Prover │        │     │    │
│  │  │  │  #1   │ │  #2   │ │  #3   │       │ #96   │        │     │    │
│  │  │  │       │ │       │ │       │       │       │        │     │    │
│  │  │  │ Dojo  │ │ Dojo  │ │ Dojo  │       │ Dojo  │        │     │    │
│  │  │  │ BFS   │ │ BFS   │ │ BFS   │       │ BFS   │        │     │    │
│  │  │  └───────┘ └───────┘ └───────┘       └───────┘        │     │    │
│  │  │                                                         │     │    │
│  │  │  Default: num_provers = 96                             │     │    │
│  │  │  CPUs per prover: num_cpus_per_prover = 1              │     │    │
│  │  │  MAX_DOJO_RESTARTS = 3 (retries on hard timeout)       │     │    │
│  │  │                                                         │     │    │
│  │  └────────────────────────────────────────────────────────┘     │    │
│  │                             │                                    │    │
│  │                             │ Results                           │    │
│  │                             ▼                                    │    │
│  │  ┌────────────────────────────────────────────────────────┐     │    │
│  │  │              PROGRESS CACHE (Singleton Actor)           │     │    │
│  │  │                                                         │     │    │
│  │  │  • Named "progress_cache"                               │     │    │
│  │  │  • Stores solved subgoals                              │     │    │
│  │  │  • Enables resume from checkpoint                       │     │    │
│  │  │  • Thread-safe via Ray                                  │     │    │
│  │  │                                                         │     │    │
│  │  └────────────────────────────────────────────────────────┘     │    │
│  │                                                                  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  PARALLEL PROVE STRATEGY:                                                │
│  ────────────────────────                                                │
│  • Interleave task submissions across theorems for cache locality       │
│  • Early stopping: cancel remaining futures when proof validated        │
│  • Best result selection: prefer validated > proved > other             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
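
In Ray terms, the two tiers are just actor classes with different resource requirements. A stripped-down sketch follows; the actor bodies are stand-ins, and only the resource plumbing mirrors the defaults above.

Python
import ray

ray.init()

@ray.remote(num_gpus=1)            # one GPU per tactic generator
class TacticGenerator:
    def is_ready(self) -> bool:
        return True

@ray.remote(num_cpus=1)            # one CPU per prover agent
class ProverAgent:
    def __init__(self, generators):
        self.generators = generators   # provers share the generator pool

# Default allocation: 8 GPU generators, 96 CPU provers (requires a cluster
# with the corresponding resources).
generators = [TacticGenerator.remote() for _ in range(8)]
provers = [ProverAgent.remote(generators) for _ in range(96)]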

Configuration Parameters

Complete configuration for proof search:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              COMPLETE CONFIGURATION PARAMETERS                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  MODEL SETTINGS:                                                         │
│  ───────────────                                                         │
│  max_num_batched_tokens    8192     Max tokens per batch                │
│  max_num_seqs              256      Max concurrent sequences            │
│  max_model_len             16384    Context window size                 │
│                                                                          │
│  SAMPLING SETTINGS:                                                      │
│  ─────────────────                                                       │
│  n_sampling_search         16       Tactics per state                   │
│  sampling_temperature      0.7      Generation randomness               │
│  sampling_top_p            1.0      Nucleus sampling threshold          │
│  max_tokens                2048     Max tactic length                   │
│                                                                          │
│  TIMEOUT SETTINGS:                                                       │
│  ────────────────                                                        │
│  timeout_per_theorem       600      Seconds per theorem                 │
│  vllm_timeout              60       Seconds for model generation        │
│  tactic_timeout            10       CPU seconds per tactic              │
│  wall_timeout              15       Wall seconds per tactic (1.5x CPU)  │
│  MODEL_READINESS_TIMEOUT   600      Seconds to wait for GPU warmup      │
│                                                                          │
│  RESOURCE ALLOCATION:                                                    │
│  ────────────────────                                                    │
│  num_tac_gen               8        Tactic generator actors             │
│  num_provers               96       Parallel prover actors              │
│  num_gpus_per_tac_gen      1        GPUs per generator                  │
│  num_cpus_per_prover       1        CPUs per prover                     │
│                                                                          │
│  SEARCH TUNING:                                                          │
│  ─────────────                                                           │
│  depth_reward              0.0      Priority normalization exponent     │
│  pass_k                    1        Attempts per theorem                │
│  MAX_DOJO_RESTARTS         3        Retries on hard timeout             │
│                                                                          │
│  LOGGING:                                                                │
│  ────────                                                                │
│  LOG_RATE                  0.01     Fraction of events to log (1%)      │
│  LOG_RATE_FREQUENT_EVENT   0.001    For very frequent events (0.1%)     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
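
For reference, these defaults map naturally onto a plain config object. The class below is a convenience sketch mirroring the table, not the project's actual configuration schema.

Python
from dataclasses import dataclass

@dataclass
class SearchConfig:
    # Sampling
    n_sampling_search: int = 16        # tactics generated per state
    sampling_temperature: float = 0.7
    sampling_top_p: float = 1.0
    max_tokens: int = 2048
    # Timeouts (seconds)
    timeout_per_theorem: int = 600
    vllm_timeout: int = 60
    tactic_timeout: int = 10           # CPU seconds per tactic
    wall_timeout: int = 15             # wall-clock seconds per tactic
    # Resources
    num_tac_gen: int = 8
    num_provers: int = 96
    # Search tuning
    depth_reward: float = 0.0
    pass_k: int = 1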

Proof Validation

Even after a "successful" proof, BFS-Prover validates the result:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PROOF VALIDATION                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FUNCTION: validate_proof(theorem, trajectory, cls, focus_mode)         │
│  ──────────────────────────────────────────────────────────────          │
│                                                                          │
│  Parameters:                                                             │
│  • theorem: The theorem to prove                                        │
│  • trajectory: List[str] of tactics                                     │
│  • cls: Expected final type (ProofFinished or tuple of types)           │
│  • focus_mode: If True, wrap each tactic in "focus"                     │
│                                                                          │
│  PROCESS:                                                                │
│  ────────                                                                │
│                                                                          │
│  1. Create fresh Dojo instance (timeout=600s)                           │
│                                                                          │
│  2. Get initial state from theorem                                      │
│                                                                          │
│  3. For each tactic in trajectory:                                      │
│       If focus_mode:                                                    │
│         tactic = "focus " + tactic                                      │
│       state = dojo.run_tac(state, tactic)                               │
│                                                                          │
│       If error: return False                                            │
│                                                                          │
│  4. Check: isinstance(final_state, cls)                                 │
│                                                                          │
│  5. Return True only if type matches exactly                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  VALIDATION CONTEXTS:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  • After successful search: Validate full trajectory                    │
│  • Plan progress update: Validate partial trajectory                    │
│  • SearchResult output: Records proof_validation_passed field           │
│                                                                          │
│  WHY VALIDATE:                                                           │
│  ─────────────                                                           │
│  • Catches race conditions in tree status updates                       │
│  • Verifies proof works in clean environment                            │
│  • Guards against Dojo state corruption                                 │
│  • Provides ground truth for metrics                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
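
A condensed sketch of the validation replay, again against an assumed LeanDojo-style interface; the fresh Dojo instance and the final isinstance check are the essential parts.

Python
from lean_dojo import Dojo, LeanError, ProofFinished, Theorem  # assumed API

def validate_proof(theorem: Theorem, trajectory: list[str],
                   expected=ProofFinished, focus_mode: bool = False) -> bool:
    """Replay a tactic trajectory in a fresh Dojo and check the final state type."""
    with Dojo(theorem, hard_timeout=600) as (dojo, state):
        for tactic in trajectory:
            if focus_mode:
                tactic = "focus " + tactic
            state = dojo.run_tac(state, tactic)
            if isinstance(state, LeanError):
                return False
    return isinstance(state, expected)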

Part VIII: Advanced Implementation Details

Pass-k Strategy with Interleaved Submission

BFS-Prover supports multiple independent proof attempts per theorem (`pass_k`). The implementation interleaves attempts across theorems to keep every prover busy and maximize throughput:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              PASS-K INTERLEAVED SUBMISSION STRATEGY                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PROBLEM: Naive sequential attempts waste time                          │
│  ─────────────────────────────────────────────                           │
│  Theorem 1: attempt 1, attempt 2, ..., attempt 512                      │
│  Theorem 2: attempt 1, attempt 2, ..., attempt 512                      │
│  → If theorem 1 solves on attempt 3, we waste 509 attempts              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION: Interleaved round-robin submission                           │
│  ────────────────────────────────────────────────                        │
│                                                                          │
│  Round 1: T1-a1, T2-a1, T3-a1, ..., Tn-a1                              │
│  Round 2: T1-a2, T2-a2, T3-a2, ..., Tn-a2                              │
│  ...                                                                     │
│  Round k: T1-ak, T2-ak, T3-ak, ..., Tn-ak                              │
│                                                                          │
│  EARLY STOPPING:                                                         │
│  ───────────────                                                         │
│  When theorem T1 proves AND validates:                                  │
│    1. Cancel all pending futures for T1                                 │
│    2. Continue other theorems                                           │
│    3. Free up prover slots for remaining work                           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BEST RESULT SELECTION:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  When multiple attempts complete, select best by priority:              │
│                                                                          │
│  Priority 1: proof_validation_passed == True                            │
│              (Trajectory replayed successfully in fresh Dojo)           │
│                                                                          │
│  Priority 2: root.status == PROVED                                      │
│              (Search found proof, not yet validated)                    │
│                                                                          │
│  Priority 3: Any other result                                           │
│              (Failed attempts, for logging)                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  submission_queue = []                                                  │
│  for attempt in range(pass_k):                                          │
│      for theorem in theorems:                                           │
│          submission_queue.append((theorem, attempt))                    │
│                                                                          │
│  # Submit in interleaved order                                          │
│  futures = {}                                                           │
│  for (theorem, attempt) in submission_queue:                            │
│      if theorem not in solved:                                          │
│          future = prover.search_attempt.remote(theorem, attempt)        │
│          futures[future] = (theorem, attempt)                           │
│                                                                          │
│  # Process completions with early stopping                              │
│  while futures:                                                          │
│      done, _ = ray.wait(list(futures.keys()), num_returns=1)           │
│      result = ray.get(done[0])                                          │
│      theorem = futures.pop(done[0])                                     │
│      if result.proof_validation_passed:                                 │
│          solved.add(theorem)                                            │
│          cancel_pending_futures(theorem)                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
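
A condensed, Ray-flavored version of that loop is shown below. The prover's `search_attempt` method and the `proof_validation_passed` field follow the description above; the best-result fallback for unvalidated proofs is omitted for brevity.

Python
import ray

def interleaved_pass_k(provers, theorems, pass_k: int) -> dict:
    """Round-robin pass-k submission with per-theorem early stopping."""
    futures, solved, results, idx = {}, set(), {}, 0
    for attempt in range(pass_k):              # round 1 for all theorems, then round 2, ...
        for theorem in theorems:
            prover = provers[idx % len(provers)]
            idx += 1
            futures[prover.search_attempt.remote(theorem, attempt)] = theorem
    while futures:
        done, _ = ray.wait(list(futures.keys()), num_returns=1)
        theorem = futures.pop(done[0])
        result = ray.get(done[0])
        if theorem in solved:
            continue
        if result.proof_validation_passed:     # validated proof: stop working on it
            solved.add(theorem)
            results[theorem] = result
            for fut, thm in list(futures.items()):
                if thm == theorem:
                    ray.cancel(fut)            # early stopping for this theorem
                    futures.pop(fut)
    return results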

vLLM Engine Configuration

The `TacticGenerator` actor wraps vLLM's asynchronous engine with throughput-oriented settings:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              VLLM ENGINE CONFIGURATION                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ENGINE INITIALIZATION:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  AsyncLLMEngine.from_engine_args(EngineArgs(                            │
│      model=model_path,                                                  │
│      max_num_batched_tokens=8192,      # Tokens per forward pass        │
│      max_num_seqs=256,                 # Concurrent sequences           │
│      max_model_len=16384,              # Context window limit           │
│      enable_chunked_prefill=True,      # Chunk large prompts            │
│      tensor_parallel_size=num_gpus,    # Split across GPUs              │
│      disable_custom_all_reduce=True,   # Standard NCCL communication    │
│      trust_remote_code=True,           # Allow custom model code        │
│  ))                                                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY PARAMETERS EXPLAINED:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  enable_chunked_prefill: True                                           │
│    • Splits long prompts into chunks for KV-cache computation           │
│    • Prevents OOM on very long tactic states                            │
│    • Allows interleaving prefill with decode                            │
│                                                                          │
│  tensor_parallel_size: num_gpus                                         │
│    • Automatically calculated from available GPUs                       │
│    • Splits model layers across GPUs                                    │
│    • Enables 32B model on 4x A100s                                      │
│                                                                          │
│  disable_custom_all_reduce: True                                        │
│    • Uses standard NCCL instead of custom kernels                       │
│    • More compatible across GPU types                                   │
│    • Slightly slower but more reliable                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  HEALTH CHECK MECHANISM:                                                 │
│  ───────────────────────                                                 │
│                                                                          │
│  Before search begins, all generators must pass health check:           │
│                                                                          │
│  ready_count = 0                                                        │
│  while ready_count < num_tac_gen:                                       │
│      for gen in generators:                                             │
│          if gen.is_ready.remote():  # Calls engine.check_health()      │
│              ready_count += 1                                           │
│      if elapsed > MODEL_READINESS_TIMEOUT:                              │
│          raise TimeoutError("Generators not ready")                     │
│      sleep(10)                                                          │
│                                                                          │
│  This ensures vLLM has loaded weights and compiled kernels              │
│  before any proof search begins.                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
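
One subtlety in the readiness loop: in Ray, `gen.is_ready.remote()` returns an ObjectRef rather than a boolean, so the result has to be fetched with `ray.get` before it can be checked. A sketch of the polling loop with that made explicit:

Python
import time
import ray

MODEL_READINESS_TIMEOUT = 600  # seconds, as configured above

def wait_for_generators(generators, poll_interval: float = 10.0) -> None:
    """Block until every tactic generator reports healthy, or raise on timeout."""
    start = time.time()
    while True:
        # is_ready.remote() returns ObjectRefs; ray.get resolves them to booleans.
        ready = ray.get([gen.is_ready.remote() for gen in generators])
        if all(ready):
            return
        if time.time() - start > MODEL_READINESS_TIMEOUT:
            raise TimeoutError("Tactic generators not ready")
        time.sleep(poll_interval)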

The extend_tree() Recovery Method

When resuming from cached progress, BFS-Prover must replay proven tactics:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              EXTEND_TREE() FOR TRAJECTORY REPLAY                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PURPOSE: Replay cached proof trajectories to restore search state      │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SCENARIO:                                                               │
│  ─────────                                                               │
│                                                                          │
│  Prover crashed after solving subgoals 1-3 of a 5-step plan.           │
│  ProgressCache has the proof trajectory for steps 1-3.                  │
│  On restart, we need to rebuild the tree to that state.                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  METHOD: extend_tree(src_node, tactics)                                 │
│  ──────────────────────────────────────                                  │
│                                                                          │
│  Input:                                                                  │
│    src_node: Starting node (usually root)                               │
│    tactics: List[str] of tactics to replay                              │
│                                                                          │
│  Process:                                                                │
│    current = src_node                                                   │
│    for tactic in tactics:                                               │
│      response, time = run_tactic(current, tactic)                       │
│                                                                          │
│      if response is Error:                                              │
│        return None  # Replay failed, cached trajectory invalid          │
│                                                                          │
│      # Create edge with placeholder logprob                             │
│      edge = Edge(                                                       │
│          tactic=tactic,                                                 │
│          src=current,                                                   │
│          dst=new_node_from_response,                                    │
│          time=time,                                                     │
│          logprob=1.0  # Placeholder for cached tactics                  │
│      )                                                                   │
│                                                                          │
│      current.out_edges = [edge]                                         │
│      current = edge.dst                                                 │
│                                                                          │
│    return current  # Final node after replay                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY BEHAVIORS:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  1. logprob = 1.0 for all cached tactics                                │
│     • Cached tactics aren't from model, so no real logprob              │
│     • Placeholder value doesn't affect search (already proven path)     │
│                                                                          │
│  2. Atomic failure                                                       │
│     • If ANY tactic fails, entire extend_tree() returns None           │
│     • No partial tree extension                                         │
│     • Caller must handle failure (usually re-prove from scratch)        │
│                                                                          │
│  3. No queue insertion                                                   │
│     • Replayed nodes not added to priority queue                        │
│     • We're rebuilding known-good path, not exploring                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
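
To make the replay logic concrete, here is a minimal Python sketch of extend_tree over simplified stand-in types. The Node, Edge, and run_tactic names mirror the pseudocode above, but the classes and the error convention (None for a failed tactic) are illustrative assumptions, not the actual BFS-Prover types.

Code
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    state: str
    out_edges: List["Edge"] = field(default_factory=list)

@dataclass
class Edge:
    tactic: str
    src: Node
    dst: Node
    time: float
    logprob: float

def extend_tree(src_node: Node, tactics: List[str], run_tactic) -> Optional[Node]:
    """Replay cached tactics from src_node; return the final node, or None on failure."""
    current = src_node
    for tactic in tactics:
        response, elapsed = run_tactic(current, tactic)
        if response is None:              # stand-in for a Lean error response
            return None                   # atomic failure: the cached trajectory is invalid
        new_node = Node(state=response)
        edge = Edge(tactic=tactic, src=current, dst=new_node,
                    time=elapsed, logprob=1.0)   # placeholder logprob for cached tactics
        current.out_edges = [edge]        # known-good path only; nothing enters the queue
        current = new_node
    return current

# Example with a fake executor that always succeeds:
root = Node(state="⊢ goal")
final = extend_tree(root, ["intro h", "simp"],
                    lambda n, t: (f"{n.state} after {t}", 0.01))
assert final is not None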

Complete Planner Rules (Including Rule 19)

The planner prompt includes additional rules beyond the 16 mentioned earlier:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              COMPLETE PLANNER RULES (INITIAL + REPLANNING)               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INITIAL PLANNING RULES (Rules 1-16):                                   │
│  ────────────────────────────────────                                    │
│  [Previously covered: explicit types, no chained inequalities, etc.]    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DYNAMIC REPLANNING RULES (Rules 16-19):                                │
│  ───────────────────────────────────────                                 │
│                                                                          │
│  RULE 16: INSERT AFTER PROVEN STEPS                                     │
│  ──────────────────────────────────────                                  │
│  New `have` statements go immediately after proven subgoals.            │
│  Proven steps are IMMUTABLE:                                            │
│    • No editing proven step text                                        │
│    • No inserting steps BETWEEN proven steps                            │
│    • No extending proven step with additional conditions                │
│                                                                          │
│  ✓ Correct: [proven1, proven2, NEW_STEP, stuck_step, ...]              │
│  ✗ Wrong:   [proven1, NEW_STEP, proven2, stuck_step, ...]              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 17: PROVIDE COMPLETE PLAN                                         │
│  ──────────────────────────────────                                      │
│  Output the ENTIRE updated plan, not just new steps.                    │
│  Preserves proven subgoals exactly as given.                            │
│                                                                          │
│  Input:                                                                  │
│    Proven: [h1: A, h2: B]                                               │
│    Stuck: h3: C                                                         │
│                                                                          │
│  Output (complete):                                                      │
│    have h1 : A        ← copied exactly                                  │
│    have h2 : B        ← copied exactly                                  │
│    have h2a : B'      ← new intermediate                                │
│    have h3 : C        ← original stuck step                             │
│    have h4 : D        ← remaining steps                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE 18: ENSURE LOGICAL CONTINUITY                                     │
│  ─────────────────────────────────────                                   │
│  Each new step must be logically sound.                                 │
│                                                                          │
│  Specific requirements:                                                  │
│    • Don't immediately repeat the stuck subgoal after proven steps     │
│    • Insert meaningful intermediate steps                               │
│    • New steps must logically lead to stuck step                        │
│    • Consider what made the stuck step hard                             │
│                                                                          │
│  ✗ Wrong: [proven1, proven2, stuck_step]  ← no new help                │
│  ✓ Right: [proven1, proven2, helper1, helper2, stuck_step]             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ★★★ RULE 19: FINAL CHECK (CRITICAL) ★★★                                │
│  ─────────────────────────────────────────                               │
│  Before outputting, perform SCRUPULOUS REVIEW:                          │
│                                                                          │
│  Checklist:                                                              │
│    □ All Set/Finset literals have explicit type annotations            │
│    □ No chained inequalities (a ≤ x ≤ b)                               │
│    □ Numeric operations have type annotations                           │
│    □ Using Real.logb (not Real.log) for base logarithms                │
│    □ Using Real.pi (not π)                                             │
│    □ All variables are declared in theorem statement                   │
│    □ Lemma names match Mathlib exactly                                  │
│    □ No implicit multiplication (ax vs a * x)                          │
│    □ Proven steps copied EXACTLY (no modifications)                    │
│                                                                          │
│  This final review catches ~15% of errors missed in initial writing.   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
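
A few Rule 19 checklist items are mechanical enough to check programmatically before a plan is even sent to Lean. The sketch below is a hypothetical lint pass over a drafted list of have statements; the regex patterns and heuristics are illustrative assumptions, not part of the actual planner prompt.

Code
import re
from typing import List

def lint_plan(have_statements: List[str]) -> List[str]:
    """Flag a few mechanically checkable Rule 19 issues in a drafted plan."""
    issues = []
    for i, stmt in enumerate(have_statements):
        # Chained inequality like "a ≤ x ≤ b" (should be split into two steps)
        if re.search(r"[≤<]\s*\S+\s*[≤<]", stmt):
            issues.append(f"step {i}: possible chained inequality")
        # π symbol instead of Real.pi
        if "π" in stmt:
            issues.append(f"step {i}: use Real.pi instead of π")
        # Real.log where a base-b logarithm may be intended
        if "Real.log " in stmt and "Real.logb" not in stmt:
            issues.append(f"step {i}: check whether Real.logb is intended")
    return issues

# Example:
plan = ["have h1 : (2 : ℝ) ≤ x ≤ 5", "have h2 : Real.cos π = -1"]
print(lint_plan(plan))
# ['step 0: possible chained inequality', 'step 1: use Real.pi instead of π']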

Complete Buggy Tactic Filtering

The tactic generator filters specific patterns known to cause issues:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              COMPLETE BUGGY TACTIC FILTERING                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ALWAYS FILTERED (unconditional):                                        │
│  ─────────────────────────────────                                       │
│                                                                          │
│  "sorry"         Proof placeholder, not a real tactic                   │
│  "admit"         Same as sorry, proof placeholder                       │
│  "native_decide" Can cause infinite loops or segfaults                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CONDITIONALLY FILTERED (pattern-based):                                 │
│  ────────────────────────────────────────                                │
│                                                                          │
│  PATTERN 1: Incomplete case splits                                      │
│  ─────────────────────────────────                                       │
│  Triggers: tactic contains ("rcases" OR "cases'") AND "?_"              │
│                                                                          │
│  Why filtered:                                                           │
│    "rcases h with ⟨?_, ?_⟩" indicates model didn't know                │
│    what to name the cases. The ?_ placeholders will fail.              │
│                                                                          │
│  Examples:                                                               │
│    ✗ "rcases h with ⟨?_, h2⟩"                                          │
│    ✗ "cases' h with ?_ h2"                                             │
│    ✓ "rcases h with ⟨h1, h2⟩"      ← concrete names OK                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PATTERN 2: Malformed simpa                                              │
│  ──────────────────────────────                                          │
│  Triggers: tactic contains "simpa" AND any of:                          │
│    • " _"  (space underscore)                                           │
│    • "_ "  (underscore space)                                           │
│    • "_,"  (underscore comma)                                           │
│    • ",_"  (comma underscore)                                           │
│                                                                          │
│  Why filtered:                                                           │
│    These patterns indicate malformed simpa arguments.                   │
│    The underscore placeholders suggest incomplete generation.           │
│                                                                          │
│  Examples:                                                               │
│    ✗ "simpa using _, h"                                                 │
│    ✗ "simpa [h1, _]"                                                    │
│    ✓ "simpa [h1, h2] using h3"     ← concrete args OK                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  def filter_tactic(tactic: str) -> bool:                                │
│      # Always reject                                                    │
│      if any(x in tactic for x in ["sorry", "admit", "native_decide"]): │
│          return False                                                   │
│                                                                          │
│      # Conditional: incomplete case splits                              │
│      if ("rcases" in tactic or "cases'" in tactic) and "?_" in tactic: │
│          return False                                                   │
│                                                                          │
│      # Conditional: malformed simpa                                     │
│      if "simpa" in tactic:                                              │
│          if any(p in tactic for p in [" _", "_ ", "_,", ",_"]):        │
│              return False                                               │
│                                                                          │
│      return True                                                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FILTERING STATISTICS:                                                   │
│  ─────────────────────                                                   │
│  In typical runs:                                                        │
│    • sorry/admit: ~2% of generated tactics                              │
│    • native_decide: ~0.1%                                               │
│    • rcases/?_: ~0.5%                                                   │
│    • simpa/underscore: ~0.3%                                            │
│    • Total filtered: ~3% of generations                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
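
As a usage illustration, the same filter can be applied to a whole batch of sampled tactics before any of them reach Lean. This is a simplified, runnable restatement of the logic in the box, not the exact production code.

Code
def filter_tactic(tactic: str) -> bool:
    """Return True if the tactic should be kept (same rules as above)."""
    if any(x in tactic for x in ["sorry", "admit", "native_decide"]):
        return False
    if ("rcases" in tactic or "cases'" in tactic) and "?_" in tactic:
        return False
    if "simpa" in tactic and any(p in tactic for p in [" _", "_ ", "_,", ",_"]):
        return False
    return True

# Example: keep only well-formed candidates from a sampled batch
sampled = ["simp [h1]", "sorry", "rcases h with ⟨?_, h2⟩", "simpa [h1, h2] using h3"]
kept = [t for t in sampled if filter_tactic(t)]
print(kept)  # ['simp [h1]', 'simpa [h1, h2] using h3']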

Environment Variables

BFS-Prover behavior can be tuned via environment variables:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              ENVIRONMENT VARIABLES                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LOGGING CONTROL:                                                        │
│  ────────────────                                                        │
│                                                                          │
│  LOG_RATE                                                                │
│    Default: 0.01 (1%)                                                   │
│    Purpose: Fraction of standard events to log                          │
│    Usage: export LOG_RATE=0.1  # Log 10% of events                      │
│    Events: Tactic executions, node expansions, state transitions        │
│                                                                          │
│  LOG_RATE_FREQUENT_EVENT                                                 │
│    Default: 0.001 (0.1%)                                                │
│    Purpose: Fraction of very frequent events to log                     │
│    Usage: export LOG_RATE_FREQUENT_EVENT=0.01                           │
│    Events: Individual tactic generations, queue operations              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TIMEOUT CONTROL:                                                        │
│  ────────────────                                                        │
│                                                                          │
│  MODEL_READINESS_TIMEOUT                                                 │
│    Default: 600 seconds                                                 │
│    Purpose: Max wait for vLLM engine initialization                     │
│    Usage: export MODEL_READINESS_TIMEOUT=1200  # 20 minutes            │
│    When to increase: Large models, slow GPU initialization              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RAY CONFIGURATION (standard Ray env vars):                              │
│  ──────────────────────────────────────────                              │
│                                                                          │
│  RAY_ADDRESS                                                             │
│    Purpose: Connect to existing Ray cluster                             │
│    Usage: export RAY_ADDRESS="ray://head-node:10001"                    │
│                                                                          │
│  CUDA_VISIBLE_DEVICES                                                    │
│    Purpose: Restrict visible GPUs                                       │
│    Usage: export CUDA_VISIBLE_DEVICES="0,1,2,3"                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DEBUGGING:                                                              │
│  ──────────                                                              │
│                                                                          │
│  To enable verbose logging during debugging:                            │
│    export LOG_RATE=1.0                                                  │
│    export LOG_RATE_FREQUENT_EVENT=0.1                                   │
│                                                                          │
│  WARNING: Full logging generates ~100MB+ logs per theorem               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
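
Reading these knobs amounts to standard os.environ lookups with the defaults listed above. The helper below is an illustrative sketch, not the project's actual configuration module.

Code
import os

def env_float(name: str, default: float) -> float:
    """Read a float-valued tuning knob from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return float(raw) if raw is not None else default

LOG_RATE = env_float("LOG_RATE", 0.01)                                  # 1% of standard events
LOG_RATE_FREQUENT_EVENT = env_float("LOG_RATE_FREQUENT_EVENT", 0.001)   # 0.1% of frequent events
MODEL_READINESS_TIMEOUT = env_float("MODEL_READINESS_TIMEOUT", 600.0)   # seconds

if __name__ == "__main__":
    print(LOG_RATE, LOG_RATE_FREQUENT_EVENT, MODEL_READINESS_TIMEOUT)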

PlanIOActor and Planner Retry Logic

The planning pipeline uses specialized Ray actors for coordination:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              PLANIOACTOR: CENTRALIZED FILE I/O                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PROBLEM: Concurrent plan writes cause corruption                       │
│  ─────────────────────────────────────────────────                       │
│                                                                          │
│  Multiple PlanGenerator actors work in parallel.                        │
│  If they all write to plan.json simultaneously:                         │
│    • File corruption                                                    │
│    • Lost updates                                                       │
│    • Partial writes                                                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION: Single PlanIOActor serializes all I/O                        │
│  ────────────────────────────────────────────────                        │
│                                                                          │
│  RAY ACTOR: PlanIOActor                                                 │
│  ──────────────────────                                                  │
│                                                                          │
│  Singleton actor that handles ALL file operations:                      │
│                                                                          │
│  class PlanIOActor:                                                     │
│      def __init__(self, plan_file_path):                                │
│          self._plan_file = plan_file_path                               │
│          self._plans = None  # Lazy loaded                              │
│                                                                          │
│      @property                                                          │
│      def plans(self):                                                   │
│          if self._plans is None:                                        │
│              self._plans = load_json(self._plan_file)                   │
│          return self._plans                                             │
│                                                                          │
│      def get_plan(self, theorem_id):                                    │
│          return self.plans.get(theorem_id)                              │
│                                                                          │
│      def save_plan(self, theorem_id, plan):                             │
│          self.plans[theorem_id] = plan                                  │
│          write_json(self._plan_file, self.plans)  # Atomic write        │
│                                                                          │
│      def has_plan(self, theorem_id):                                    │
│          return theorem_id in self.plans                                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BENEFITS:                                                               │
│  ─────────                                                               │
│  • No concurrent write conflicts                                        │
│  • In-memory caching (fast reads after first load)                     │
│  • Single source of truth                                               │
│  • Atomic updates                                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
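
In Ray terms, serialization falls out of routing every read and write through a single actor: calls queue on the actor and execute one at a time. The sketch below shows how such an actor might look and be used; the ray.remote wrapper, eager JSON loading, and file handling are simplified assumptions rather than the actual implementation.

Code
import json
import ray

@ray.remote
class PlanIOActor:
    """Single actor that owns the plan file; all I/O is serialized through it."""

    def __init__(self, plan_file_path: str):
        self._plan_file = plan_file_path
        try:
            with open(plan_file_path) as f:
                self._plans = json.load(f)
        except FileNotFoundError:
            self._plans = {}

    def get_plan(self, theorem_id: str):
        return self._plans.get(theorem_id)

    def save_plan(self, theorem_id: str, plan) -> None:
        self._plans[theorem_id] = plan
        with open(self._plan_file, "w") as f:   # single writer, so no locking needed
            json.dump(self._plans, f)

    def has_plan(self, theorem_id: str) -> bool:
        return theorem_id in self._plans

# Usage from any worker: calls queue on the actor and run one at a time.
# ray.init()
# plan_io = PlanIOActor.remote("plan.json")
# ray.get(plan_io.save_plan.remote("thm_1", ["have h1 : A", "have h2 : B"]))
# print(ray.get(plan_io.get_plan.remote("thm_1")))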

┌─────────────────────────────────────────────────────────────────────────┐
│              PLANNER RETRY LOGIC                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Plan generation can fail for various reasons.                          │
│  BFS-Prover implements automatic retries:                               │
│                                                                          │
│  RETRY LOOP:                                                             │
│  ───────────                                                             │
│                                                                          │
│  def generate_plan_with_retry(theorem, max_retries=3):                  │
│      for attempt in range(max_retries):                                 │
│          # Step 1: Generate plan via LLM                                │
│          plan = planner.generate(theorem.statement)                     │
│                                                                          │
│          if plan is None:                                               │
│              return None  # Generation failed, don't retry              │
│                                                                          │
│          # Step 2: Verify plan syntax in Lean                           │
│          success, error = verifier.verify_plan(theorem, plan)           │
│                                                                          │
│          if success:                                                    │
│              return plan  # Valid plan!                                 │
│                                                                          │
│          # Verification failed, log error and retry                     │
│          log(f"Plan verification failed: {error}")                      │
│                                                                          │
│      return None  # All retries exhausted                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RETRY TRIGGERS:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  Retry on verification failure:                                         │
│    • Syntax error in have statement                                     │
│    • Type mismatch                                                      │
│    • Unknown identifier                                                 │
│    • Invalid Lean 4 syntax                                              │
│                                                                          │
│  DO NOT retry on generation failure:                                    │
│    • API timeout                                                        │
│    • Model refused to generate                                          │
│    • Empty response                                                     │
│                                                                          │
│  RATIONALE: Verification failures might succeed with different         │
│  LLM sampling. Generation failures indicate systemic issues.           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RETRY STATISTICS:                                                       │
│  ─────────────────                                                       │
│  In practice:                                                            │
│    • 75% of plans succeed on first attempt                              │
│    • 15% succeed on retry 2                                             │
│    • 7% succeed on retry 3                                              │
│    • 3% fail all retries                                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
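
Here is a runnable sketch of that policy: verification failures are retried, while a generation failure aborts immediately, matching the rationale above. The planner and verifier are toy stand-ins passed in as callables.

Code
from typing import Callable, List, Optional, Tuple

def generate_plan_with_retry(
    statement: str,
    generate: Callable[[str], Optional[List[str]]],
    verify: Callable[[str, List[str]], Tuple[bool, str]],
    max_retries: int = 3,
) -> Optional[List[str]]:
    for attempt in range(max_retries):
        plan = generate(statement)
        if plan is None:
            return None                      # generation failure: systemic, don't retry
        ok, error = verify(statement, plan)
        if ok:
            return plan                      # verified plan
        print(f"attempt {attempt}: verification failed: {error}")
    return None                              # all retries exhausted

# Example with toy stand-ins: the second sampled plan verifies.
attempts = iter([["have h1 : False"], ["have h1 : 1 + 1 = 2"]])
plan = generate_plan_with_retry(
    "theorem t : 1 + 1 = 2",
    generate=lambda s: next(attempts),
    verify=lambda s, p: (("False" not in p[0]), "type mismatch"),
)
print(plan)  # ['have h1 : 1 + 1 = 2']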

SearchResult Complete Fields

The SearchResult dataclass captures comprehensive information:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              SEARCHRESULT DATACLASS (COMPLETE)                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  @dataclass                                                              │
│  class SearchResult:                                                     │
│      # ─────────────────────────────────────────────────────────────    │
│      # IDENTIFICATION                                                    │
│      # ─────────────────────────────────────────────────────────────    │
│      theorem: Theorem              # The theorem attempted              │
│      attempt_number: int           # Which attempt (0 to pass_k-1)      │
│                                                                          │
│      # ─────────────────────────────────────────────────────────────    │
│      # PROOF DATA (if successful)                                        │
│      # ─────────────────────────────────────────────────────────────    │
│      proof: Optional[List[Tuple[str, str]]]                             │
│          # List of (tactic_state, tactic) pairs                        │
│          # None if no proof found                                       │
│                                                                          │
│      proof_stats: Optional[List[Tuple[float, float]]]                   │
│          # List of (execution_time, logprob) per tactic                │
│          # Enables analysis of proof efficiency                        │
│                                                                          │
│      # ─────────────────────────────────────────────────────────────    │
│      # TRAINING DATA                                                     │
│      # ─────────────────────────────────────────────────────────────    │
│      preference_pairs: List[Tuple[str, str, str]]                       │
│          # List of (state, winning_tactic, losing_tactic)              │
│          # Extracted from successful proof vs failed siblings          │
│          # Used for DPO/preference learning                            │
│                                                                          │
│      # ─────────────────────────────────────────────────────────────    │
│      # SEARCH METRICS                                                    │
│      # ─────────────────────────────────────────────────────────────    │
│      explored_nodes: int           # Total nodes explored               │
│      total_time: float            # Wall clock time in seconds         │
│      vllm_time: float             # Time waiting for model             │
│      lean_time: float             # Time executing tactics             │
│                                                                          │
│      # ─────────────────────────────────────────────────────────────    │
│      # STATUS FLAGS                                                      │
│      # ─────────────────────────────────────────────────────────────    │
│      root_status: Status          # OPEN, PROVED, or FAILED            │
│                                                                          │
│      proof_validation_passed: Optional[bool]                            │
│          # None: Validation not attempted                              │
│          # True: Trajectory replayed successfully                      │
│          # False: Replay failed (possible bug or race condition)       │
│                                                                          │
│      # ─────────────────────────────────────────────────────────────    │
│      # ERROR TRACKING                                                    │
│      # ─────────────────────────────────────────────────────────────    │
│      crashed: bool                # True if search crashed             │
│      crash_reason: Optional[str]  # Exception message if crashed       │
│      dojo_restarts: int           # Number of Dojo restart attempts    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  USAGE IN AGGREGATION:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  After all attempts complete:                                           │
│                                                                          │
│  best_result = None                                                     │
│  for result in results:                                                 │
│      if result.proof_validation_passed == True:                         │
│          best_result = result                                           │
│          break                                                          │
│      if result.root_status == PROVED and best_result is None:          │
│          best_result = result                                           │
│                                                                          │
│  # Aggregate preference pairs across all attempts                       │
│  all_pairs = []                                                         │
│  for result in results:                                                 │
│      all_pairs.extend(result.preference_pairs)                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
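
A trimmed-down, runnable version of the result type and the aggregation rule looks like this. Only the fields needed for aggregation are kept, and the Status enum values are assumptions based on the flags described above.

Code
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

class Status(Enum):
    OPEN = "open"
    PROVED = "proved"
    FAILED = "failed"

@dataclass
class SearchResult:
    attempt_number: int
    root_status: Status
    proof_validation_passed: Optional[bool] = None
    preference_pairs: List[Tuple[str, str, str]] = field(default_factory=list)

def pick_best(results: List[SearchResult]) -> Optional[SearchResult]:
    """Prefer a validated proof; otherwise fall back to the first PROVED result."""
    best = None
    for r in results:
        if r.proof_validation_passed is True:
            return r
        if r.root_status is Status.PROVED and best is None:
            best = r
    return best

results = [
    SearchResult(0, Status.FAILED),
    SearchResult(1, Status.PROVED, proof_validation_passed=False),
    SearchResult(2, Status.PROVED, proof_validation_passed=True),
]
print(pick_best(results).attempt_number)  # 2
all_pairs = [p for r in results for p in r.preference_pairs]  # training data from every attempt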

Process Crash Recovery

BFS-Prover handles various crash scenarios gracefully:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              PROCESS CRASH RECOVERY                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CRASH TYPE 1: Lean Process Already Gone                                │
│  ───────────────────────────────────────                                 │
│                                                                          │
│  Scenario: Lean crashed before we tried to execute a tactic             │
│                                                                          │
│  Detection:                                                              │
│    try:                                                                  │
│        lean_proc = psutil.Process(dojo.proc.pid)                        │
│    except psutil.NoSuchProcess:                                         │
│        # Process already gone!                                          │
│                                                                          │
│  Recovery:                                                               │
│    # Attempt tactic anyway - Dojo might auto-restart                    │
│    return await asyncio.to_thread(dojo.run_tac, state, tac)            │
│                                                                          │
│  Rationale: LeanDojo sometimes restarts Lean internally.               │
│  Attempting the tactic lets Dojo handle recovery.                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CRASH TYPE 2: Hard Timeout (Unresponsive Lean)                         │
│  ──────────────────────────────────────────────                          │
│                                                                          │
│  Scenario: Lean hung, wall-clock timeout triggered                      │
│                                                                          │
│  Detection: LeanCPUTimedOut exception with killed_hard=True             │
│                                                                          │
│  Recovery:                                                               │
│    1. Kill entire process tree:                                         │
│       _kill_tree(process)  # Recursive SIGKILL                          │
│                                                                          │
│    2. Clean up resources:                                               │
│       gc.collect()  # Force garbage collection                          │
│                                                                          │
│    3. Raise DojoRestartRequired                                         │
│                                                                          │
│    4. Caller catches and restarts:                                      │
│       for restart_attempt in range(MAX_DOJO_RESTARTS):  # 3 attempts   │
│           try:                                                          │
│               dojo = create_fresh_dojo(theorem)                         │
│               return await _search(dojo, ...)                           │
│           except DojoRestartRequired:                                   │
│               continue                                                  │
│       raise RuntimeError("Max restarts exceeded")                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CRASH TYPE 3: Child Process Died Mid-Monitoring                        │
│  ───────────────────────────────────────────────                         │
│                                                                          │
│  Scenario: Child process of Lean died while monitoring CPU              │
│                                                                          │
│  Detection: psutil.NoSuchProcess during _cpu_tree() traversal          │
│                                                                          │
│  Recovery:                                                               │
│    def _cpu_tree(proc):                                                 │
│        total = 0.0                                                      │
│        try:                                                             │
│            times = proc.cpu_times()                                     │
│            total += times.user + times.system                           │
│        except psutil.NoSuchProcess:                                     │
│            pass  # Process gone, skip                                   │
│                                                                          │
│        for child in proc.children(recursive=True):                      │
│            try:                                                         │
│                child_times = child.cpu_times()                          │
│                total += child_times.user + child_times.system           │
│            except psutil.NoSuchProcess:                                 │
│                pass  # Child gone, skip                                 │
│                                                                          │
│        return total                                                     │
│                                                                          │
│  Rationale: Child processes can die at any time. Graceful handling     │
│  prevents false timeouts and crashes.                                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CRASH TYPE 4: Double Future Completion                                  │
│  ──────────────────────────────────────                                  │
│                                                                          │
│  Scenario: asyncio Future set twice (race condition)                    │
│                                                                          │
│  Detection: InvalidStateError when setting result                       │
│                                                                          │
│  Recovery:                                                               │
│    def _safe_set_result(future, result):                                │
│        try:                                                             │
│            future.set_result(result)                                    │
│        except asyncio.InvalidStateError:                                │
│            pass  # Already set, ignore                                  │
│                                                                          │
│    def _safe_set_exception(future, exc):                                │
│        try:                                                             │
│            future.set_exception(exc)                                    │
│        except asyncio.InvalidStateError:                                │
│            pass  # Already set, ignore                                  │
│                                                                          │
│  Rationale: Multiple code paths might try to complete the same         │
│  future. Safe setters prevent crashes.                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
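
The last two recovery patterns are easy to show in isolation. The sketch below pairs a safe future-completion helper with a restart loop around a DojoRestartRequired exception; the create_dojo and run_search callables are hypothetical stand-ins for the real entry points.

Code
import asyncio

class DojoRestartRequired(Exception):
    """Raised when Lean must be torn down and recreated (crash type 2)."""

def safe_set_result(future: asyncio.Future, result) -> None:
    """Complete a future, ignoring the race where it was already completed."""
    try:
        future.set_result(result)
    except asyncio.InvalidStateError:
        pass  # already set by another code path

def safe_set_exception(future: asyncio.Future, exc: BaseException) -> None:
    try:
        future.set_exception(exc)
    except asyncio.InvalidStateError:
        pass  # already set by another code path

MAX_DOJO_RESTARTS = 3

async def search_with_restarts(create_dojo, run_search, theorem):
    """Recreate the Lean environment up to MAX_DOJO_RESTARTS times on hard timeouts."""
    for _ in range(MAX_DOJO_RESTARTS):
        dojo = create_dojo(theorem)
        try:
            return await run_search(dojo)
        except DojoRestartRequired:
            continue  # tear down and try again with a fresh Dojo
    raise RuntimeError("Max Dojo restarts exceeded")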

TheoremType Enum and Data Loading

BFS-Prover supports different theorem splits:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              THEOREMTYPE ENUM AND DATA LOADING                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THEOREM TYPES:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  class TheoremType(Enum):                                               │
│      GENERIC = "generic"    # Load all theorems regardless of split    │
│      VALID = "valid"        # Load only validation split               │
│      TEST = "test"          # Load only test split                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  USAGE IN CONFIGURATION:                                                 │
│  ───────────────────────                                                 │
│                                                                          │
│  @dataclass                                                              │
│  class ProofSearchConfig:                                               │
│      theorem_type: str = "generic"  # or "valid" or "test"             │
│      lines_cap: int = 0             # 0 = no limit, >0 = limit for debug│
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DATA LOADING LOGIC:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  def load_theorems(statements_file, dojo_file, theorem_type, cap):      │
│      theorems = []                                                      │
│                                                                          │
│      # Load statement data                                              │
│      with open(statements_file) as f:                                   │
│          for line in f:                                                 │
│              data = json.loads(line)                                    │
│                                                                          │
│              # Filter by type                                           │
│              if theorem_type != "generic":                              │
│                  if data["split"] != theorem_type:                      │
│                      continue                                           │
│                                                                          │
│              theorems.append(data)                                      │
│                                                                          │
│      # Apply cap for debugging                                          │
│      if cap > 0:                                                        │
│          theorems = theorems[:cap]                                      │
│                                                                          │
│      # Match with dojo data                                             │
│      dojo_data = load_dojo_data(dojo_file)                              │
│      return [                                                           │
│          Theorem(                                                       │
│              statement=t["statement"],                                  │
│              full_name=t["id"],                                         │
│              **dojo_data[t["id"]]                                       │
│          )                                                              │
│          for t in theorems                                              │
│      ]                                                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TYPICAL SPLITS:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  miniF2F dataset:                                                        │
│    • valid: 244 theorems (development set)                              │
│    • test: 244 theorems (evaluation set)                                │
│                                                                          │
│  Usage:                                                                  │
│    # Develop/tune on validation set                                     │
│    python main.py --theorem_type valid                                  │
│                                                                          │
│    # Final evaluation on test set                                       │
│    python main.py --theorem_type test                                   │
│                                                                          │
│    # Debug on 10 theorems                                               │
│    python main.py --theorem_type generic --lines_cap 10                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
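
A self-contained version of the split filtering (without the Dojo matching step) might look like the sketch below; the record fields mirror the JSONL format shown above, but the function name and in-memory input are illustrative.

Code
import json
from enum import Enum
from typing import List

class TheoremType(Enum):
    GENERIC = "generic"   # load all theorems regardless of split
    VALID = "valid"       # load only the validation split
    TEST = "test"         # load only the test split

def filter_statements(jsonl_lines: List[str],
                      theorem_type: TheoremType,
                      cap: int = 0) -> List[dict]:
    """Filter statement records by split, then optionally cap the list for debugging."""
    records = []
    for line in jsonl_lines:
        data = json.loads(line)
        if theorem_type is not TheoremType.GENERIC and data["split"] != theorem_type.value:
            continue
        records.append(data)
    return records[:cap] if cap > 0 else records

lines = [
    '{"id": "t1", "split": "valid", "statement": "theorem t1 : 1 + 1 = 2"}',
    '{"id": "t2", "split": "test", "statement": "theorem t2 : 2 + 2 = 4"}',
]
print([r["id"] for r in filter_statements(lines, TheoremType.TEST)])  # ['t2']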

CPU Time Accounting Details

The timeout system uses sophisticated CPU tracking:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              CPU TIME ACCOUNTING (_cpu_tree)                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CHALLENGE: Accurate CPU time for process trees                         │
│  ──────────────────────────────────────────────                          │
│                                                                          │
│  Lean spawns child processes. We need TOTAL CPU time across:            │
│    • Main Lean process                                                  │
│    • All child processes (compiler, elaborator, etc.)                   │
│    • Grandchildren (if any)                                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SNAPSHOT-BASED MONITORING:                                              │
│  ─────────────────────────────                                           │
│                                                                          │
│  Instead of cumulative tracking (which misses retired children),        │
│  we use snapshot-based monitoring:                                      │
│                                                                          │
│  def _cpu_tree(proc: psutil.Process) -> float:                          │
│      """Get CURRENT total CPU time for process tree."""                 │
│      total = 0.0                                                        │
│                                                                          │
│      # Main process CPU time                                            │
│      try:                                                               │
│          times = proc.cpu_times()                                       │
│          total += times.user + times.system                             │
│      except psutil.NoSuchProcess:                                       │
│          return total  # Process gone                                   │
│                                                                          │
│      # Recursively add children                                         │
│      try:                                                               │
│          for child in proc.children(recursive=True):                    │
│              try:                                                       │
│                  child_times = child.cpu_times()                        │
│                  total += child_times.user + child_times.system         │
│              except psutil.NoSuchProcess:                               │
│                  pass  # Child died between listing and query           │
│      except psutil.NoSuchProcess:                                       │
│          pass  # Parent died while getting children                     │
│                                                                          │
│      return total                                                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RETIRED PROCESS ACCOUNTING:                                             │
│  ───────────────────────────                                             │
│                                                                          │
│  Problem: If a child process dies, its CPU time is lost.               │
│                                                                          │
│  Solution: Track "baseline" CPU time of retired processes:              │
│                                                                          │
│  class CPUMonitor:                                                      │
│      def __init__(self):                                                │
│          self.baseline = 0.0  # CPU from dead processes                │
│          self.known_pids = set()                                        │
│                                                                          │
│      def get_total_cpu(self, proc):                                     │
│          current = _cpu_tree(proc)                                      │
│          current_pids = {p.pid for p in proc.children(recursive=True)} │
│                                                                          │
│          # Check for retired processes                                  │
│          retired = self.known_pids - current_pids                       │
│          if retired:                                                    │
│              # Their CPU is already in baseline from previous snapshot │
│              pass                                                       │
│                                                                          │
│          # Update known PIDs                                            │
│          self.known_pids = current_pids                                 │
│                                                                          │
│          return self.baseline + current                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MONITORING LOOP:                                                        │
│  ────────────────                                                        │
│                                                                          │
│  def monitor_with_timeout(proc, cpu_limit, wall_limit, poll=0.5):       │
│      start_wall = time.monotonic()                                      │
│                                                                          │
│      while True:                                                        │
│          cpu_time = _cpu_tree(proc)                                     │
│          wall_time = time.monotonic() - start_wall                      │
│                                                                          │
│          if cpu_time > cpu_limit:                                       │
│              raise LeanCPUTimedOut(                                     │
│                  limit_type="cpu",                                      │
│                  cpu_elapsed=cpu_time,                                  │
│                  wall_elapsed=wall_time,                                │
│                  killed_hard=False                                      │
│              )                                                          │
│                                                                          │
│          if wall_time > wall_limit:                                     │
│              _kill_tree(proc)  # Force kill                             │
│              raise LeanCPUTimedOut(                                     │
│                  limit_type="wall",                                     │
│                  cpu_elapsed=cpu_time,                                  │
│                  wall_elapsed=wall_time,                                │
│                  killed_hard=True  # Triggers Dojo restart              │
│              )                                                          │
│                                                                          │
│          time.sleep(poll)                                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
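
Putting the two ideas together, a minimal monitor might snapshot the process tree on each poll and fold the last observed CPU of any retired child into a baseline, so the reported total never drops when short-lived children exit. The class below is a sketch of that idea using psutil, not the production CPUMonitor; it ignores edge cases such as PID reuse.

Code
import psutil

class CPUMonitor:
    """Snapshot CPU of a process tree, keeping credit for retired children."""

    def __init__(self, proc: psutil.Process):
        self.proc = proc
        self.baseline = 0.0     # CPU seconds credited to already-dead processes
        self.last_seen = {}     # pid -> last observed user+system CPU seconds

    def total_cpu(self) -> float:
        current = {}
        procs = [self.proc]
        try:
            procs += self.proc.children(recursive=True)
        except psutil.NoSuchProcess:
            procs = [self.proc]            # parent gone; children list unavailable
        for p in procs:
            try:
                t = p.cpu_times()
                current[p.pid] = t.user + t.system
            except psutil.NoSuchProcess:
                pass                       # died between listing and query
        # Fold the last observed CPU of any retired pid into the baseline
        for pid, secs in self.last_seen.items():
            if pid not in current:
                self.baseline += secs
        self.last_seen = current
        return self.baseline + sum(current.values())

if __name__ == "__main__":
    monitor = CPUMonitor(psutil.Process())   # monitor this script's own tree
    print(f"{monitor.total_cpu():.3f}s CPU used so far")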

Production Configuration vs Defaults

The shell scripts reveal production tuning that differs from defaults:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              PRODUCTION CONFIGURATION VS DEFAULTS                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Parameter               │ Default  │ Production │ Reason               │
│  ────────────────────────┼──────────┼────────────┼──────────────────────│
│  pass_k                  │ 1        │ 512        │ More attempts        │
│  sampling_temperature    │ 0.7      │ 1.3        │ More diversity       │
│  depth_reward            │ 0.0      │ 2.0        │ Explore deeper       │
│  tactic_timeout          │ 10       │ 12         │ Harder tactics       │
│  timeout_per_theorem     │ 600      │ 600        │ Same                 │
│  n_sampling_search       │ 16       │ 4-16       │ Varies by stage      │
│  num_provers             │ 96       │ 128        │ More parallelism     │
│  num_tac_gen             │ 8        │ 7          │ Memory constraints   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY HIGHER TEMPERATURE (1.3 vs 0.7):                                   │
│  ────────────────────────────────────                                    │
│                                                                          │
│  With pass_k=512 attempts, diversity matters more than consistency.     │
│  Higher temperature:                                                     │
│    • More varied tactic suggestions                                     │
│    • Different proof strategies explored                                │
│    • Less likely to repeat same mistakes                                │
│                                                                          │
│  With pass_k=1, you want temperature=0.7 for reliable single attempts.  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY HIGHER DEPTH_REWARD (2.0 vs 0.0):                                  │
│  ─────────────────────────────────────                                   │
│                                                                          │
│  depth_reward=0.0: Priority = cumulative_logprob                        │
│    → Strongly prefers short paths with high confidence                  │
│    → May miss longer but correct proofs                                 │
│                                                                          │
│  depth_reward=2.0: Priority = cumulative_logprob / depth²               │
│    → Normalizes aggressively by depth                                   │
│    → Willing to explore deeper paths                                    │
│    → Better for complex theorems requiring 10+ steps                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PRODUCTION SHELL SCRIPT (run_local_search.sh):                         │
│  ──────────────────────────────────────────────                          │
│                                                                          │
│  python -m src.search.main \                                            │
│      --model_path "path/to/BFS-Prover-V2-32B" \                         │
│      --pass_k 512 \                                                     │
│      --sampling_temperature 1.3 \                                       │
│      --depth_reward 2.0 \                                               │
│      --tactic_timeout 12 \                                              │
│      --timeout_per_theorem 600 \                                        │
│      --num_provers 128 \                                                │
│      --num_tac_gen 7 \                                                  │
│      --theorem_type test \                                              │
│      --statement_file src/data/minif2f_statements.jsonl \               │
│      --dojo_file src/data/minif2f_dojo.jsonl \                          │
│      --plan_file src/data/minif2f_plan.json                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PLANNING SCRIPT DIFFERENCES (run_local_plan.sh):                       │
│  ────────────────────────────────────────────────                        │
│                                                                          │
│  Planning uses different settings:                                      │
│    • num_generators=7 (PlanGenerator actors)                            │
│    • num_planners=7 (parallel theorem processing)                       │
│    • max_retries=3 (verification retry attempts)                        │
│    • No pass_k (single plan per theorem)                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
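The depth normalization described above is easy to state in code. Here is a minimal sketch, assuming the priority is the cumulative log-probability divided by depth raised to the power depth_reward; the function name is hypothetical, and the box only pins down the depth_reward = 0.0 and 2.0 cases.

Code
def node_priority(cumulative_logprob: float, depth: int, depth_reward: float = 0.0) -> float:
    """Illustrative best-first priority (higher = expanded sooner).

    cumulative_logprob is <= 0, so dividing by depth ** depth_reward pulls
    deep nodes toward zero and raises their priority relative to shallow ones.
    """
    if depth_reward == 0.0:
        # Pure log-probability: short, high-confidence prefixes dominate the queue.
        return cumulative_logprob
    # depth_reward = 2.0 reproduces the "cumulative_logprob / depth^2" rule above.
    return cumulative_logprob / (max(depth, 1) ** depth_reward)

For example, with depth_reward = 2.0 a 10-step node at log-probability -12 scores -0.12 and outranks a 2-step node at -1.0 (score -0.25), which is how the production setting keeps the search from stalling on shallow, high-confidence prefixes.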

Part IX: Benchmark Results and Analysis

Performance on miniF2F

miniF2F is the standard benchmark for neural theorem provers: 488 competition mathematics problems (AMC, AIME, IMO) formalized in Lean, split evenly into 244 validation and 244 test theorems.

Model                        | miniF2F-test | miniF2F-valid
-----------------------------|--------------|--------------
GPT-f (2020)                 | 29.3%        | -
PACT (2022)                  | 33.6%        | -
HTPS (2022)                  | 41.0%        | -
LeanDojo (2023)              | 48.4%        | -
DeepSeek-Prover-V1.5 (2024)  | 63.5%        | 60.2%
BFS-Prover-V2-7B             | 82.4%        | -
BFS-Prover-V2-32B            | 86.1%        | 85.5%
BFS-Prover-V2-32B + Planner  | 95.08%       | 95.5%

Performance on ProofNet

ProofNet is a harder benchmark of problems drawn from undergraduate mathematics textbooks (analysis, algebra, topology), where current provers still struggle:

Model              | ProofNet-test
-------------------|--------------
BFS-Prover-V2-32B  | 41.4%

Analysis: Component Contributions

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              COMPONENT CONTRIBUTION ANALYSIS                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  BASE MODEL (7B)                                                        │
│  ───────────────                                                         │
│  Performance: 82.4% on miniF2F-test                                     │
│                                                                          │
│  + SCALE TO 32B                                                         │
│  ────────────────                                                        │
│  Performance: 86.1% (+3.7%)                                             │
│  Insight: Larger models improve tactic generation quality               │
│                                                                          │
│  + PLANNING                                                              │
│  ───────────                                                             │
│  Performance: 95.08% (+8.98%)                                           │
│  Insight: Hierarchical decomposition is the biggest single gain!        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHERE PLANNING HELPS MOST:                                              │
│  ─────────────────────────────                                           │
│  ✓ Long proofs (10+ steps) - decomposition reduces effective depth     │
│  ✓ Non-obvious lemmas - planner can suggest key intermediate steps     │
│  ✓ Complex algebra - multiple equivalent rewriting paths               │
│  ✓ Inequality chains - direction of manipulation matters               │
│                                                                          │
│  WHERE PLANNING HELPS LESS:                                              │
│  ──────────────────────────                                              │
│  • Short proofs (2-3 steps) - no decomposition needed                  │
│  • Direct application - single powerful tactic suffices                │
│  • Structural proofs - induction/cases with obvious structure          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

