INQUIRING LINE


Can an AI solve task combinations it was never trained on, or does it just look that way?

How much of the combinatorial task space must training data cover?

This explores how much of the space of possible task-combinations training data has to span before a model can handle the rest — and the corpus suggests the honest answer is 'more than you'd hope,' with some clever escape hatches.

This question is really asking whether models generalize to task-combinations they never saw, or whether they only replay regions of the space they were trained on. The most sobering result in the collection says the latter. The DataAlchemy experiments (note: 'Does chain-of-thought reasoning actually generalize beyond training data?') show chain-of-thought reasoning degrades *predictably* the moment you shift task, length, or format away from the training distribution. The model keeps producing fluent reasoning, but the logic underneath stops being valid. So in the pessimistic reading, coverage isn't a nice-to-have: capability is bounded to the slice of combinatorial space the data actually touched, and what looks like generalization is interpolation inside that slice.
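One way to make 'the slice the data actually touched' concrete is to enumerate the full grid of task, length, and format and mark which cells the training set hits. The sketch below is illustrative only: the axis values and the training combinations are hypothetical, not taken from the DataAlchemy setup.

```python
from itertools import product

def split_by_coverage(train_combos, axes):
    """Partition the full task grid into seen and unseen combinations.

    axes maps an axis name to its possible values; train_combos is an
    iterable of tuples ordered the same way as the axes.
    """
    full = set(product(*axes.values()))
    seen = full & set(train_combos)
    unseen = full - seen
    return seen, unseen, len(seen) / len(full)

# Hypothetical axes and training cells, chosen only for illustration.
axes = {
    "task": ["rot13", "shift", "reverse"],
    "length": [4, 8, 16],
    "format": ["plain", "json"],
}
train = [("rot13", 4, "plain"), ("shift", 4, "plain"), ("rot13", 8, "json")]

seen, unseen, coverage = split_by_coverage(train, axes)
print(f"{len(seen)} seen, {len(unseen)} unseen, coverage = {coverage:.1%}")
```

Evaluating separately on `seen` and `unseen` cells is what separates interpolation inside the slice from generalization outside it.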

There's a quieter, stranger finding that reframes what 'coverage' even means. Instruction tuning experiments (note: 'Does instruction tuning teach task understanding or output format?') show models trained on *semantically empty or deliberately wrong* instructions perform about as well as those trained on correct ones, and instruction tuning barely edges out a random baseline (43% vs. 42.6%). What transfers isn't understanding of the tasks; it's familiarity with the shape of the output space. If that's true, then a lot of what we think we're covering (task semantics) is irrelevant, and the thing data actually needs to span is the distribution of *answer formats*, which is a much smaller space.

The most practical escape from brute-force coverage is decomposition. Granite's function-calling work (note: 'Can breaking function calling into subtasks improve model generalization?') found that breaking the job into seven atomic subtasks (nested calls, chaining, parallel functions, parameter detection, and so on) and training each explicitly generalizes *better* than dumping one giant umbrella dataset on the model. This is the combinatorial trick: if the space factors into a handful of reusable primitives, you cover the primitives, not their exponential product. DPO training pushes the same idea from the other direction (note: 'Can small models match large models on function calling?'): pairing correct examples with explicit *wrong* ones teaches the boundaries of a subtask cheaply, so small models match large ones without seeing every variant.
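The gap between covering primitives and covering their product is easy to put numbers on. The sketch below assumes, purely for illustration, that subtasks compose as ordered chains with repetition allowed; the Granite work does not claim this exact composition model, and `composition_count` is a hypothetical helper.

```python
def composition_count(num_primitives: int, max_depth: int) -> int:
    """Count ordered chains of 1..max_depth primitives, repetition allowed."""
    return sum(num_primitives ** depth for depth in range(1, max_depth + 1))

primitives = 7  # the seven subtasks named in the source note
for max_depth in (1, 2, 3, 4):
    total = composition_count(primitives, max_depth)
    print(f"chains up to depth {max_depth}: {total}")
```

Under that assumption the primitives stay at 7 while the chains to cover grow to 56, 399, and 2800 as the depth limit goes from 2 to 4, which is why training on primitives scales and training on combinations does not.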

But here's the thing you might not have known you wanted to know: some failures have nothing to do with coverage at all. The 'embers of autoregression' work (note: 'Can we predict where language models will fail?') predicted *in advance* that tasks with low-probability target outputs, such as reciting the alphabet backwards or counting letters, would stay hard no matter how logically trivial they are, because the model is fundamentally a next-token probability machine. You could cover those tasks exhaustively in training and the autoregressive prior would still fight you. So the real answer isn't a single coverage percentage. It's that the space has structure: factorable regions where decomposition lets you cover a fraction and compose the rest, and probability-cursed regions where coverage barely helps.
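A toy model shows why a logically trivial target can still be a low-probability one. The sketch below fits a character bigram model with add-one smoothing to text that only ever contains the forward alphabet, then scores the forward and backward alphabets. It is a stand-in for intuition, not the method or the models used in the 'embers of autoregression' work.

```python
import math
import string
from collections import Counter

def train_bigram(corpus: str, vocab: str):
    """Return a function scoring a string by its smoothed bigram log-probability."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    first_counts = Counter(corpus[:-1])

    def logprob(seq: str) -> float:
        total = 0.0
        for a, b in zip(seq, seq[1:]):
            # Add-one smoothing so unseen transitions get a small nonzero probability.
            total += math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + len(vocab)))
        return total

    return logprob

alphabet = string.ascii_lowercase
corpus = alphabet * 50  # training text contains only the forward alphabet
logprob = train_bigram(corpus, alphabet)

forward = logprob(alphabet)
backward = logprob(alphabet[::-1])
print(f"forward: {forward:.2f}  backward: {backward:.2f}")
```

Both strings are equally simple to describe, but every backward transition is unseen, so the backward alphabet scores far lower under the learned prior.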

If you want to chase the optimistic thread further, look at how training *order* over that space matters too: scheduling structured tasks before open-ended ones changes what survives (note: 'Does training order reshape how models handle different task types?'), which hints that *how* you walk through the combinatorial space may matter as much as how much of it you cover.

Sources 6 notes

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions, and instruction tuning scores 43% against a random baseline's 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
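For reference, the standard DPO objective on a single preference pair can be written in a few lines. This is a minimal sketch of the published loss on scalar log-probabilities, not the training code from the paper behind this note; the function name and the `beta` default are illustrative choices.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy raises the correct call and lowers the incorrect one: loss drops.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The rejected term is what carries the explicit negative example: the loss only falls when the incorrect output becomes less likely relative to the correct one.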

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.


Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
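Assuming BWT here means backward transfer in the usual continual-learning sense (an assumption; the note does not spell it out), it can be computed from a matrix of accuracies recorded after each training stage. The accuracy values below are made up for illustration.

```python
def backward_transfer(acc: list[list[float]]) -> float:
    """Average change on earlier tasks after training on all later ones.

    acc[i][j] is the accuracy on task j measured after finishing stage i.
    Negative values indicate forgetting; positive values indicate that
    later training helped earlier tasks.
    """
    num_tasks = len(acc)
    deltas = [acc[num_tasks - 1][j] - acc[j][j] for j in range(num_tasks - 1)]
    return sum(deltas) / (num_tasks - 1)

# Hypothetical three-stage schedule: rows are stages, columns are tasks.
acc = [
    [0.80, 0.10, 0.05],
    [0.78, 0.70, 0.10],
    [0.75, 0.68, 0.60],
]
bwt = backward_transfer(acc)
print(f"BWT = {bwt:+.3f}")
```

Comparing this number across candidate task orders is one plausible way a schedule could be chosen to limit damage to earlier capabilities.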

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

A Survey on Post-training of Large Language Models (1.70 match)
Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks (1.70 match)
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (1.68 match)
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (0.92 match)
Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning (0.89 match)
When More is Less: Understanding Chain-of-Thought Length in LLMs (0.89 match)
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks (0.88 match)
Hierarchical Reasoning Model (0.88 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether LLM generalization to unseen task-combinations remains bounded by training-data coverage, or whether recent advances (newer models, RL scaling, test-time compute, hybrid reward scheduling) have relaxed the constraints a 2023–2025 library documented.

What a curated library found — and when (these are dated claims, not current truth):
• Chain-of-thought reasoning degrades predictably when task, length, or format shift from training distribution; fluency masks invalid logic (2025-08).
• Instruction tuning transfers output-format distribution (43% vs. 42.6% for a random baseline; wrong or empty instructions perform comparably to correct ones); task semantics may be irrelevant (2023-05).
• Function-calling decomposes into ~7 atomic subtasks; explicit training on primitives generalizes better than monolithic datasets (2024-06, 2024-10).
• DPO on paired correct and *wrong* examples teaches subtask boundaries cheaply; small models match large ones without exhaustive coverage (2024-10).
• Autoregressive priors curse low-probability outputs (alphabet backwards, counting); coverage barely helps for probability-cursed tasks (2025-08).
• Training order over task space matters: structured→open-ended scheduling preserves capability differently than reverse (2025-07).

Anchor papers (verify; mind their dates):
• arXiv:2508.01191 — Chain-of-thought mirage via data-distribution lens (2025-08).
• arXiv:2404.12253 — Self-improvement via imagination & search (2024-04).
• arXiv:2510.13786 — Scaling RL compute for LLMs (2025-10).
• arXiv:2507.14783 — Hybrid reward + task scheduling in multi-task RL (2025-07).

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether test-time compute scaling (reasoning models), RL post-training (reward-driven policy learning), task scheduling (Omni-Thinker, Titans memorization), or hybrid reward schemes since mid-2025 have RELAXED the coverage bottleneck or shifted what "coverage" means. Separate durable question (does composition still face exponential blow-up?) from perishable limitation (can RL + test-time search dissolve brute-force coverage requirements?). Cite what resolved it.
(2) Surface the strongest work from the last 3 months that CONTRADICTS the "coverage-is-destiny" reading—especially hybrid RL, scheduling, or test-time reasoning claims.
(3) Propose 2 research questions that assume the regime may have moved: e.g., "Does scaling test-time compute on a *single* high-value task-primitive now eliminate the need for broad coverage?" and "Can RL post-training reorder the task-combination graph to make sequential coverage more efficient than combinatorial coverage?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
