INQUIRING LINE

How do output format constraints compare to input exemplar brittleness?


This explores two ways the shape of a prompt can quietly wreck what the model actually does: constraining the *output* (forcing JSON, a schema, a fixed template) versus depending on *input* examples (the few-shot demonstrations you paste in to steer it). The corpus treats both as symptoms of one deeper fact — for these models, form and content compete for the same limited budget, and neither side of the prompt is as stable as it looks.

On the output side, strict formatting measurably eats reasoning. When a schema is imposed, accuracy drops across multiple models, and loosening the format (keeping the format type but dropping the rigid schema) recovers most of what was lost, which suggests compliance and reasoning draw from the same well (see "Do strict output formats hurt LLM reasoning ability?"). There is an even sharper version of this: models trained to hide their chain-of-thought compute the right answer in their early layers, then *suppress* it in the final layers to emit format-compliant filler tokens. The reasoning remains recoverable from lower-ranked token predictions, so the format requirement buries it rather than erasing it (see "Do transformers hide reasoning before producing filler tokens?").
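One way to picture "keeping the type but dropping the rigid schema" is the sketch below. It is an illustration under assumptions, not the setup used in the cited note: the prompt wording, the function names (`strict_prompt`, `loose_prompt`, `extract_answer`) and the `"answer"` key are all hypothetical. The loose variant still asks for JSON, but lets the model reason in prose first and only requires a final JSON object, which a tolerant parser then pulls out.

```python
import json
import re


def strict_prompt(question: str) -> str:
    """Hypothetical rigid variant: the whole reply must match a fixed schema."""
    return (
        f"{question}\n"
        "Reply with ONLY this JSON and nothing else: "
        '{"step_1": str, "step_2": str, "answer": str}'
    )


def loose_prompt(question: str) -> str:
    """Hypothetical loose variant: same output type (JSON), no rigid schema."""
    return (
        f"{question}\n"
        "Think it through in plain text first. "
        'Then end with a JSON object that has an "answer" key.'
    )


def extract_answer(reply: str):
    """Return the "answer" value from the last parseable JSON object, or None."""
    # Matches flat {...} spans only; nested objects are out of scope for this sketch.
    for candidate in reversed(re.findall(r"\{[^{}]*\}", reply)):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "answer" in obj:
            return obj["answer"]
    return None


# Example reply in the loose style: free-form reasoning, then a small JSON object.
reply = 'Half of 18 is 9, plus 3 gives 12.\n{"answer": "12"}'
print(extract_answer(reply))  # 12
```

The design choice here is to move strictness out of the prompt and into the parser: the model is free to spend tokens on reasoning, and the caller still gets a machine-readable answer. Whether this recovers accuracy for a given model is an empirical question the sketch does not answer.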

On the input side, exemplars turn out to be brittle along four separate axes at once: reordering them produces 3.3% swings, and mismatching their complexity to the problem, giving them too little diversity, or having a different person write them all degrade performance, with up to 28.2% variance across annotators. These compound, which is why hand-curated examples do not transfer reliably across tasks (see "Why do chain-of-thought examples fail across different conditions?"). The unsettling part is *why* exemplars work at all: logically invalid reasoning examples perform nearly as well as valid ones, because the model is copying the *form* of reasoning, not the inference (see "Does logical validity actually drive chain-of-thought gains?"). So exemplars don't teach the model to think; they configure a surface pattern, and that pattern is fragile to cosmetic change.
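The ordering axis is the easiest of the four to probe yourself. The sketch below is a minimal harness under assumptions: `Model` is a hypothetical callable (prompt in, answer out) standing in for whatever LLM client you use, `toy_model` is a deliberately contrived stub that imitates only the last exemplar, and the `Q:`/`A:` prompt layout is illustrative. It does not reproduce the 3.3% figure; it only shows how a swing across orderings can be measured.

```python
import itertools
from typing import Callable, Sequence

Exemplar = tuple[str, str]    # (question, worked answer)
Model = Callable[[str], str]  # hypothetical interface: prompt in, answer out


def build_prompt(exemplars: Sequence[Exemplar], question: str) -> str:
    """Lay out the few-shot exemplars, then the new question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"


def accuracy(model: Model, exemplars: Sequence[Exemplar], test_set) -> float:
    """Exact-match accuracy for one fixed exemplar ordering."""
    hits = sum(
        model(build_prompt(exemplars, q)).strip() == gold for q, gold in test_set
    )
    return hits / len(test_set)


def order_swing(model: Model, exemplars, test_set, max_orders: int = 24):
    """Return (worst, best) accuracy over up to max_orders exemplar orderings."""
    orders = itertools.islice(itertools.permutations(exemplars), max_orders)
    scores = [accuracy(model, order, test_set) for order in orders]
    return min(scores), max(scores)


def toy_model(prompt: str) -> str:
    """Contrived stub: answers numerically only if the LAST exemplar did."""
    blocks = prompt.split("\n\n")
    last_shot_answer = blocks[-2].split("\nA: ")[1]
    question = blocks[-1].removeprefix("Q: ").removesuffix("\nA:")
    a, b = question.split("+")
    return str(int(a) + int(b)) if last_shot_answer.isdigit() else "unsure"


exemplars = [("1+1", "2"), ("2+2", "four")]
test_set = [("3+4", "7"), ("5+5", "10")]
print(order_swing(toy_model, exemplars, test_set))  # (0.0, 1.0) for this toy stub
```

With a real model, `toy_model` would be replaced by a call to your client, and the gap between the two returned numbers is the order sensitivity for that exemplar set. The same loop extends to the other axes by swapping in exemplar sets that differ in complexity, diversity, or author.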

Here's the lateral payoff: both failures are the same shape seen from two ends. Output constraints hurt because the model spends generation capacity on form instead of thought; input exemplars are brittle because they were only ever transmitting form in the first place. Neither is obvious from a benchmark because surface success routinely masks broken substance: models can hit perfect accuracy while their internal representations are fractured and won't survive a perturbation or a distribution shift (see "Can models be smart without organized internal structure?"), and many models that look like they're reasoning about constraints are really just defaulting conservatively, scoring *worse* when the constraint is removed (see "Are models actually reasoning about constraints or just defaulting conservatively?").

The thing you didn't know you wanted to know: format and exemplars aren't separate prompt-engineering knobs at all. Both are levers on the gap between what a model *displays* and what it *computes* — and the corpus suggests the more rigidly you control the display, on either the input or output end, the more you risk paying for it in the computation you actually wanted.


Sources (6 notes)

Do strict output formats hurt LLM reasoning ability?

Schema-specific format requirements cause measurable reasoning decline across multiple models. Removing schema constraints while keeping loose format type recovers most lost performance, suggesting format compliance and reasoning compete for the model's generation capacity.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Why do chain-of-thought examples fail across different conditions?

Human-written CoT exemplars degrade performance when reordered (3.3% swings), mismatched to problem complexity, lacking diversity, or written by different annotators (up to 28.2% variance). These four dimensions compound, making manual exemplar curation unreliable across tasks.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a prompt-engineering researcher re-testing claims about format brittleness in LLMs. The core question remains: Do output format constraints and input exemplar brittleness stem from the same underlying failure mode, or have newer models, training methods, or evaluation harnesses since decoupled them?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key claims:
- Strict output schemas degrade reasoning accuracy; loosening format recovers ~70–80% of losses (arXiv:2408.02442, 2024-08). Models compute correct answers in early layers, then overwrite them for format compliance (arXiv:2412.04537, 2025-03).
- Few-shot exemplars are brittle to reordering (3.3% swings), complexity mismatch, low diversity, and author drift (up to 28.2% variance); logically invalid examples perform nearly as well as valid ones (arXiv:2307.10573, 2023-07; arXiv:2302.12822, 2023-02).
- Identical performance metrics mask fractured internal representations that fail under distribution shift (arXiv:2505.11581, 2025-05). Surface heuristics override implicit constraints (arXiv:2603.29025, 2026-03).

Anchor papers (verify; mind their dates):
- arXiv:2408.02442 (2024-08): Format Restrictions and Performance
- arXiv:2307.10573 (2023-07): Invalid Logic, Equivalent Gains
- arXiv:2412.04537 (2025-03): Hidden Computations in Chain-of-Thought
- arXiv:2505.11581 (2025-05): Representational Optimism critique

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim (schema-reasoning tradeoff, exemplar brittleness, metric masking), judge whether post-training (e.g., arXiv:2504.07912, 2025-04), newer scaffolding (agentic memory, caching, diffusion-based generation arXiv:2502.09992, 2025-02), or tighter evaluation protocols have since relaxed or overturned it. Separate the durable question (likely: does form still compete with substance?) from perishable limitations (e.g., schema overhead). Cite what relaxed each constraint.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — papers claiming format constraints no longer matter, or exemplars are stable, or surface/hidden computation are decoupled.

(3) Propose 2 research questions that assume the regime may have moved: (a) Do RL-aligned models (post-training) re-couple format and reasoning differently than base models? (b) Can multi-agent or recursive architectures (arXiv:2512.24601, 2025-12) sandbox format compliance away from reasoning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
