INQUIRING LINE

Why do smaller models favor code formats while larger models prefer natural language?

This explores whether the corpus explains why small models lean on structured/code-like outputs while large models do better with free-form natural language — and the honest answer is that no note tackles this head-on, but several circle the same territory: small models' relationship to rigid format.


This reads the question as asking about a capacity difference — why structure helps the small and constrains the large — and the corpus doesn't have a paper aimed squarely at that comparison. What it does have is a cluster of findings about *why small models cling to format in the first place*, which is the more interesting half of the story.

The strongest thread is that small models struggle most precisely where rigid output structure is required, and that the fix is teaching them the shape, not the meaning. Small models fine-tuned with DPO on correct-vs-incorrect function-calling examples beat plain supervised fine-tuning because the negative examples directly target *format* failures: the model learns the rigid schema it otherwise fumbles (see "Can small models match large models on function calling?"). Read alongside the finding that small models are genuinely sufficient for the repetitive, well-defined slices of agent work (see "Can small language models handle most agent tasks?"), a picture emerges: code-like formats are a *scaffold*. A constrained output space (call this function, fill these fields) does the structuring work the small model can't generate on its own.
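
To make "call this function, fill these fields" concrete, here is a minimal Python sketch of how a format check could turn candidate outputs into a DPO-style preference pair. Everything in it is an assumption made for illustration: the `get_weather` tool, its schema, the helper names, and the choice to label candidates by format validity alone (the note describes correct and incorrect examples drawn from a large teacher model, not this procedure). The `prompt` / `chosen` / `rejected` layout is a common convention for preference data, not something the note specifies.

```python
import json

# Hypothetical tool schema: argument name -> required Python type.
GET_WEATHER_SCHEMA = {"city": str, "unit": str}


def is_valid_call(raw: str, name: str, schema: dict) -> bool:
    """Return True if `raw` is a JSON function call that matches `schema` exactly."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not even parseable JSON
    if not isinstance(call, dict) or call.get("name") != name:
        return False  # wrong shape or wrong function name
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != set(schema):
        return False  # missing or unexpected fields
    return all(isinstance(args[key], typ) for key, typ in schema.items())


def make_preference_pair(prompt: str, candidates: list, name: str, schema: dict):
    """Pair the first valid candidate (chosen) with the first invalid one (rejected)."""
    chosen = next((c for c in candidates if is_valid_call(c, name, schema)), None)
    rejected = next((c for c in candidates if not is_valid_call(c, name, schema)), None)
    if chosen is None or rejected is None:
        return None  # need one of each to form a pair
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


candidates = [
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}',               # missing "unit"
    '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "C"}}',  # matches schema
]
pair = make_preference_pair("Weather in Oslo?", candidates, "get_weather", GET_WEATHER_SCHEMA)
```

The two candidates mean the same thing; only the shape differs. That is the kind of contrast a negative example can carry and a positive-only SFT example cannot.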

Why might that scaffold matter less, or even get in the way, at larger scale? Two notes hint at the mechanism. MobileLLM shows tiny models gain accuracy from being deep-and-thin, composing abstract concepts up through layers rather than spreading capacity across width (see "Does depth matter more than width for tiny language models?"): abstraction lives in depth, and small models have little of it to spare. And the logit-lens work shows that models trained with hidden chain-of-thought tokens compute their actual answer in early layers, then spend the final layers *suppressing* that representation to emit format-compliant tokens (see "Do transformers hide reasoning before producing filler tokens?"). Format compliance, on this reading, is a tax paid in the late layers: a tax a large, abstraction-rich model can afford to skip in favor of open-ended language, but one a small model is happy to pay because the structure substitutes for reasoning it doesn't have.
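
For readers who haven't met the logit lens, the mechanics are simple: take the hidden state at each layer, push it through the model's output (unembedding) matrix, and see which token it would predict if the model stopped there. The toy Python sketch below shows only that procedure. The vocabulary, matrix, and hidden states are invented numbers chosen to mimic the "answer early, filler late" pattern, not measurements from any model, and the sketch skips the final layer normalization that real implementations typically apply before unembedding.

```python
# Toy logit lens: decode every layer's hidden state with the unembedding matrix.
# All numbers are invented for illustration; they are not measurements.

VOCAB = ["42", ".", "the"]

# Hypothetical unembedding matrix: one row of weights per vocabulary token.
W_U = [
    [1.0, 0.0],  # "42"
    [0.0, 1.0],  # "."
    [0.5, 0.5],  # "the"
]

# Hypothetical hidden states at one token position, one vector per layer.
HIDDEN_BY_LAYER = [
    [2.0, 0.1],  # layer 1
    [1.5, 0.5],  # layer 2
    [0.2, 1.8],  # layer 3 (final)
]


def logit_lens(hidden_by_layer, unembed, vocab):
    """Return the vocabulary ranked by logit (best first) at each layer."""
    ranked = []
    for hidden in hidden_by_layer:
        # Logit for each token is the dot product of its row with the hidden state.
        logits = [sum(w * x for w, x in zip(row, hidden)) for row in unembed]
        order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
        ranked.append([vocab[i] for i in order])
    return ranked


rankings = logit_lens(HIDDEN_BY_LAYER, W_U, VOCAB)
for layer, tokens in enumerate(rankings, start=1):
    print(f"layer {layer}: top token = {tokens[0]!r}, full ranking = {tokens}")
```

In this made-up example the answer token leads in the first two layers and the filler token takes over in the last, while the answer is still present further down the final ranking. That is the shape of the pattern the note describes.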

There's a cautionary undercurrent worth knowing about. When models *appear* to reason inside a rigid format, they're often exploiting the format rather than thinking: defaulting to conservative or template-matched answers that look like structured reasoning but aren't (see "Are models actually reasoning about constraints or just defaulting conservatively?"), or emitting plausible-looking values for problems they recognize by template without actually solving them (see "Do large language models actually perform iterative optimization?"). So a small model's preference for code formats may be less a *strength* than a tell: structure is where pattern-matching can masquerade as competence, which is exactly where a model short on real abstraction would gravitate.

If you want the direct empirical comparison — small-favors-code vs. large-favors-language, measured — this collection doesn't have it. But it gives you the better question underneath: format isn't a stylistic preference, it's a proxy for how much abstraction a model can hold, and the smaller the model, the more the structure is doing the thinking for it.


Sources (6 notes)

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-examining a claim about code-format preference in small models vs. natural-language preference in large models. The question remains open: *why* does model scale shift output modality preference?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as potentially superseded:
• Small models fine-tuned with DPO on function-calling examples beat supervised fine-tuning because negative examples target *format* failures directly, not semantic ones (2024-10, arXiv:2410.18890).
• Small models gain accuracy from depth over width, composing abstraction through layers rather than spreading capacity; tiny models have little depth to spare (2024-02, arXiv:2402.14905).
• Models trained with hidden CoT tokens compute answers in early layers, then suppress those representations in final layers to emit format-compliant tokens; the library reads this as a tax large models can skip and small models pay (2024-12, arXiv:2412.04537).
• Models often exploit format rather than reason within it: defaulting to conservative, template-matched answers that *appear* structured without actual problem-solving (2026-03, arXiv:2603.29025).
• Format preference may signal abstraction ceiling, not reasoning strength (2023–2026 cluster).

Anchor papers (verify; mind their dates):
• arXiv:2410.18890 (2024-10): Small-model function calling via DPO.
• arXiv:2402.14905 (2024-02): Depth-vs.-width scaling for sub-billion models.
• arXiv:2412.04537 (2024-12): Hidden reasoning computations and layer-wise token suppression.
• arXiv:2603.29025 (2026-03): Surface heuristics overriding implicit constraints.

Your task:
(1) RE-TEST EACH CONSTRAINT. Has scaling (new model families, synthetic training data, in-context learning, or tool-use harnesses) since *relaxed* the small model's dependence on rigid format? Has larger models' reluctance to emit code been overturned by prompt engineering, system design (e.g., tool-use APIs), or architectural shifts? Separate the durable question (why abstraction and output modality correlate) from perishable limitations (current small/large model behavior). Cite what resolved constraints, and flag where format-dependence still holds.
(2) Surface the strongest *contradicting or superseding* work from the last ~6 months. Does anything show small models reasoning *without* format scaffolding, or large models thriving *inside* strict schemas?
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Does in-context exemplar diversity flip the format preference?" or "Can LoRA+format-fine-tuning push small models past their abstraction ceiling?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
