Self-Improving Language Models with Bidirectional Evolutionary Search

Paper · arXiv 2605.28814 · Published May 27, 2026
Novel LLM Architectures

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable sub-goals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer.

Introduction. Large language models (LLMs) and agentic systems have demonstrated remarkable capabilities on complex reasoning problems [39, 28, 9]. They can even solve open problems across mathematical and scientific domains [26, 15] and surpass the best human performance on tasks such as code generation [21, 47]. In this context, the question of how to do better sampling from LLMs and agentic systems is of critical importance [8]. This is particularly significant for problems at the frontier of model capability, where naive sampling methods may require too many samples to obtain a correct answer or may simply fail [6, 46]. At training time, higher-quality samples enable more effective post-training and self-improvement [48, 45]; at inference time, they serve as a natural mechanism for test-time scaling [49, 40, 19], which can further push the boundary of what models can achieve. Currently, the two dominant sampling methods in post-training, self-improvement, and inference for LLMs and agentic systems are best-of-N sampling and tree search. Best-of-N sampling is simple and efficient.

Discussion / Conclusion. In this paper, we present BES, a bidirectional evolutionary search framework that addresses two fundamental limitations of existing search methods for LLMs and agents: sparse verification signals and confined candidate generation. BES couples forward search, which evolves candidates through expansion and four evolution operators, with backward search, which recursively decomposes problems into verifiable sub-goals to provide dense intermediate feedback. We provide theoretical justification showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolution operators can escape it, and that backward sub-goal decomposition exponentially reduces the number of candidates needed to find a solution.