Why does autoregressive generation fail at constraint satisfaction?

Explores whether the 20-23% performance ceiling on constraint satisfaction benchmarks reflects model limitations or a fundamental architectural mismatch between how LLMs generate tokens and how constraint solvers need to work.

Synthesis note · 2026-05-02 · sourced from Reasoning Methods CoT ToT

The 20-23% ceiling on LR²Bench is not a model-quality issue. It is the empirical price of an architectural mismatch between what constraint satisfaction problems (CSPs) require and what autoregressive transformers can do. A CSP solver maintains multiple partial assignments simultaneously, propagates constraints across them, and discards branches when violations occur. The discard operation is primitive to constraint solving: it is what makes the algorithm a constraint solver rather than a generator that happens to satisfy constraints sometimes.
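To make the discard operation concrete, here is a minimal sketch of chronological backtracking on a graph-coloring CSP. It is an illustration under stated assumptions, not a description of any solver used with LR²Bench: the function name `solve_coloring`, the toy triangle graph, and the `discards` counter are all hypothetical, and this simple variant keeps one partial assignment at a time rather than several.

```python
from typing import Dict, List, Optional, Tuple


def solve_coloring(
    neighbors: Dict[str, List[str]],
    colors: List[str],
) -> Tuple[Optional[Dict[str, str]], int]:
    """Backtracking search; returns (solution or None, number of discards)."""
    variables = list(neighbors)
    assignment: Dict[str, str] = {}
    discards = 0

    def consistent(var: str, color: str) -> bool:
        # Constraint: no already-assigned neighbor may share this color.
        return all(assignment.get(n) != color for n in neighbors[var])

    def search(i: int) -> bool:
        nonlocal discards
        if i == len(variables):
            return True
        var = variables[i]
        for color in colors:
            if consistent(var, color):
                assignment[var] = color
                if search(i + 1):
                    return True
                # The discard: the failed choice is removed from solver state,
                # so it cannot influence any later decision.
                del assignment[var]
                discards += 1
        return False

    found = search(0)
    return (dict(assignment) if found else None), discards


# Hypothetical toy instance: a triangle, where every node touches the other two.
triangle = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
solution, n_discards = solve_coloring(triangle, ["red", "green", "blue"])
no_solution, n_failed_discards = solve_coloring(triangle, ["red", "green"])
```

With three colors the search succeeds without discarding anything; with two colors it exhausts the space and returns `None`, and every abandoned choice has been physically removed from `assignment` by the time it does.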

Autoregressive LLMs have no native discard operator. Every emitted token enters the context window and conditions all subsequent token predictions. "Backtracking" in chain-of-thought is not backtracking in the algorithmic sense: it is forward-writing a new attempt while the failed attempt remains visible in context, biasing the next attempt toward the failed one. The model cannot delete tokens it has already produced; it can only keep generating after them. This is why the result in "Why can't language models reverse learned facts?" is structurally unsurprising, and why "Can large language models translate natural language to logic faithfully?" runs into similar walls: the architecture's commitment direction is one-way.
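The asymmetry can be sketched with two toy data structures. This is an analogy, not a model of a real transformer: `AppendOnlyContext`, `SolverTrail`, and the `"RETRY"` marker are hypothetical names introduced only for illustration. The first structure exposes append as its sole write operation; the second also exposes a pop.

```python
from typing import List


class AppendOnlyContext:
    """Toy stand-in for an autoregressive context (hypothetical class)."""

    def __init__(self) -> None:
        self._tokens: List[str] = []

    def emit(self, token: str) -> None:
        # The only write operation is append; there is deliberately no delete.
        self._tokens.append(token)

    def visible(self) -> List[str]:
        # Everything ever emitted is what the next step would condition on.
        return list(self._tokens)


class SolverTrail:
    """Toy stand-in for a solver's decision stack (hypothetical class)."""

    def __init__(self) -> None:
        self._decisions: List[str] = []

    def commit(self, decision: str) -> None:
        self._decisions.append(decision)

    def discard(self) -> str:
        # Removes the most recent decision from state entirely.
        return self._decisions.pop()

    def visible(self) -> List[str]:
        return list(self._decisions)


# Same scenario in both: try y=1, find it violates a constraint, switch to y=2.
ctx = AppendOnlyContext()
for tok in ["x=1", "y=1", "RETRY", "x=1", "y=2"]:
    ctx.emit(tok)

trail = SolverTrail()
trail.commit("x=1")
trail.commit("y=1")
trail.discard()  # y=1 is gone, not merely annotated as wrong
trail.commit("y=2")

context_view = ctx.visible()
solver_view = trail.visible()
```

After the retry, `context_view` still contains the failed `"y=1"` alongside the correction, while `solver_view` holds only the live decisions.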

For the Last Token framing, this is load-bearing. The stop token is the only true commitment in a generation; every interior token is a soft commitment that biases the trajectory without sealing it. But "soft" here does not mean "retractable"; it means "still influential while pretending not to be." When a large reasoning model (LRM) writes "Wait, let me reconsider," it has not retracted the prior tokens; it has appended a meta-comment about them, and now the model conditions on both the original wrong attempt and the meta-comment. The retraction is performed in language but not in computation.

This converges with "Can symbolic solvers fix how LLMs reason about logic?" from the opposite direction. Symbolic solvers have native retraction; LLMs do not. The hybrid case works because the symbolic component supplies what the architecture lacks. CSPs are the cleanest place to see the gap because constraint violation is a hard signal that cannot be glossed over with reflective language. The 20-23% ceiling is the architecture meeting the wall.
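One way to picture the hybrid division of labor is a loop in which a proposer only ranks candidate values and a symbolic layer owns the state and performs every retraction. The sketch below is an assumption-laden illustration, not the method of any cited work: `hybrid_solve`, `worst_first`, and `increasing` are hypothetical names, and the proposer is a deterministic stub standing in for a language model.

```python
from typing import Callable, Dict, List, Optional

Assignment = Dict[str, int]


def hybrid_solve(
    variables: List[str],
    domain: List[int],
    ok: Callable[[Assignment], bool],
    propose: Callable[[str, Assignment, List[int]], List[int]],
) -> Optional[Assignment]:
    """The proposer orders guesses; the solver checks them and retracts failures."""
    state: Assignment = {}

    def search(i: int) -> bool:
        if i == len(variables):
            return True
        var = variables[i]
        # The proposer sees a copy of the live state only, never failed branches.
        for value in propose(var, dict(state), domain):
            state[var] = value
            if ok(state) and search(i + 1):
                return True
            del state[var]  # retraction happens in the solver, in computation
        return False

    return dict(state) if search(0) else None


ORDER = ["x", "y", "z"]


def increasing(state: Assignment) -> bool:
    # Hard constraint: assigned values must strictly increase in ORDER.
    vals = [state[v] for v in ORDER if v in state]
    return all(a < b for a, b in zip(vals, vals[1:]))


def worst_first(var: str, state: Assignment, domain: List[int]) -> List[int]:
    # Deliberately unhelpful stub proposer: tries the largest values first.
    return sorted(domain, reverse=True)


result = hybrid_solve(ORDER, [1, 2, 3], increasing, worst_first)
```

Even with a proposer that guesses badly, the search reaches the only valid assignment, because each bad guess is removed from `state` before the next proposal is requested.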
