INQUIRING LINE

Teaching an AI seven specific tool-use sub-skills separately beats training it on thousands of full tool-use examples.

Does training on granular tasks beat training on the full function calling problem?

This explores whether breaking function calling into explicit sub-skills and training each one beats training a model on whole tool-use examples — and what the corpus says about why decomposition might help.

The corpus gives a fairly direct yes, with an interesting wrinkle about *why* it works. The clearest evidence comes from Granite-20B-FunctionCalling, which treats function calling not as one problem but as seven: nested calls, chaining, parallel functions, function-name detection, parameter detection, next-best-function selection, and response generation. Training explicitly across these granular tasks generalized better than umbrella datasets like ToolLLM, and closed the gap with GPT, Claude, and Gemini (see "Can breaking function calling into subtasks improve model generalization?"). The umbrella dataset gives you volume; the decomposed curriculum gives you coverage of the specific failure modes.
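To make the contrast between whole examples and granular tasks concrete, here is a minimal sketch of how one full tool-use example could be split into per-subtask training samples. The `FullExample` schema, the `decompose` helper, and the tool names are all hypothetical and chosen for illustration; this is not Granite's actual data format, and it covers only four of the seven subtasks.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class FullExample:
    """One whole tool-use example (hypothetical schema, for illustration only)."""
    query: str
    tool_names: list   # names of the tools offered to the model
    calls: list        # gold sequence of ToolCall objects
    response: str      # final natural-language answer


def decompose(ex: FullExample) -> list:
    """Split one full example into several narrower training samples."""
    samples = []
    # Function-name detection: predict only which functions are needed.
    samples.append({
        "task": "function_name_detection",
        "input": {"query": ex.query, "tools": ex.tool_names},
        "target": json.dumps([c.name for c in ex.calls]),
    })
    # Parameter detection: the function is given, predict only its arguments.
    for call in ex.calls:
        samples.append({
            "task": "parameter_detection",
            "input": {"query": ex.query, "function": call.name},
            "target": json.dumps(call.arguments, sort_keys=True),
        })
    # Next-best function: given the calls made so far, predict the next one.
    for i, call in enumerate(ex.calls):
        samples.append({
            "task": "next_best_function",
            "input": {"query": ex.query,
                      "already_called": [c.name for c in ex.calls[:i]]},
            "target": call.name,
        })
    # Response generation: produce the final answer after all calls.
    samples.append({
        "task": "response_generation",
        "input": {"query": ex.query, "calls": [c.name for c in ex.calls]},
        "target": ex.response,
    })
    return samples


example = FullExample(
    query="What is the weather in Paris, in Fahrenheit?",
    tool_names=["get_weather", "convert_units", "search_web"],
    calls=[
        ToolCall("get_weather", {"city": "Paris"}),
        ToolCall("convert_units", {"to": "fahrenheit"}),
    ],
    response="Here is the current Paris temperature in Fahrenheit.",
)
samples = decompose(example)
for s in samples:
    print(s["task"], "->", s["target"])
```

Under these assumptions a single whole example yields six narrower samples, each isolating one decision the model has to get right.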

There's a reason decomposition might be the natural grain to train at. Pruning studies show neural networks already tend to implement compositional tasks as isolated subnetworks (ablate one and only its corresponding function breaks), and pretraining makes this modular structure more consistent (see "Do neural networks naturally learn modular compositional structure?"). If the model is internally building separable skill modules anyway, training on granular tasks is arguably training *with* that grain rather than against it. The same instinct shows up at inference time in systems that compose task-specific expert vectors on the fly rather than relying on one monolithic fine-tune (see "Can models dynamically activate expert skills at inference time?"), and in skill-library agents that build complex behaviors from stored simpler ones (see "Can agents learn new skills without forgetting old ones?").

But here's the wrinkle worth sitting with. A separate line of work suggests that what fine-tuning on function calling actually teaches may be narrower than "understanding the task." Models trained on semantically empty or even deliberately wrong instructions match models trained on correct ones; what transfers is knowledge of the *output space*, not task meaning (see "Does instruction tuning teach task understanding or output format?"). Function calling is unusually format-bound (rigid JSON, exact parameter names), so the seven-task decomposition may be winning largely because each granular task drills a distinct slice of that output distribution. This reframes "granular beats whole" as "granular gives more thorough coverage of the format space," not necessarily deeper reasoning.
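What "format-bound" means in practice can be sketched as a strict check on a generated call. The spec layout, the `get_weather` function, and the `validate_call` helper below are assumptions made for illustration, not any particular framework's API.

```python
import json

# Hypothetical tool spec: exact parameter names, one of them required.
SPEC = {"name": "get_weather", "parameters": ["city", "unit"], "required": ["city"]}


def validate_call(raw: str, spec: dict) -> list:
    """Return a list of format errors; an empty list means the call is well-formed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(call, dict):
        return ["top level is not an object"]
    errors = []
    if call.get("name") != spec["name"]:
        errors.append(f"unknown function: {call.get('name')!r}")
    args = call.get("arguments")
    if not isinstance(args, dict):
        return errors + ["missing or non-object 'arguments'"]
    # Parameter names must match the spec exactly; near-synonyms do not count.
    for key in sorted(set(args) - set(spec["parameters"])):
        errors.append(f"unexpected parameter: {key!r}")
    for key in sorted(set(spec["required"]) - set(args)):
        errors.append(f"missing required parameter: {key!r}")
    return errors


good = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
near_miss = '{"name": "get_weather", "arguments": {"location": "Paris"}}'
print(validate_call(good, SPEC))
print(validate_call(near_miss, SPEC))
```

The second call is semantically reasonable (a "location" is a fine way to describe Paris) yet fails outright, which is the sense in which success here depends on knowing the output space rather than the task meaning.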

That reframing predicts what fixes the *remaining* failures. Small models often fail function calling specifically on rigid output format, and DPO (training on correct *and* incorrect examples) beats plain supervised fine-tuning precisely because the negative examples target those format failures directly (see "Can small models match large models on function calling?"). So the fuller picture isn't just granular-vs-whole; it's that the most effective recipe combines decomposed task coverage with negative-example training that pins down the exact format the model keeps getting wrong.
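As a rough sketch of how negative examples enter training, the snippet below builds one preference pair and evaluates the standard per-pair DPO objective, -log σ(β[(log π(chosen) - log π_ref(chosen)) - (log π(rejected) - log π_ref(rejected))]). Two assumptions to flag: the rejected call here is made by synthetically renaming a parameter, which is a stand-in for the teacher-generated incorrect examples the note describes, and the log-probabilities are made-up numbers rather than real model outputs.

```python
import json
import math


def corrupt_parameter_name(call_json: str, old: str, new: str) -> str:
    """Make a format-broken negative by renaming one parameter (illustrative only)."""
    call = json.loads(call_json)
    call["arguments"][new] = call["arguments"].pop(old)
    return json.dumps(call)


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss from sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


chosen = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
rejected = corrupt_parameter_name(chosen, "city", "location")
pair = {"prompt": "What is the weather in Paris?", "chosen": chosen, "rejected": rejected}

# Made-up log-probabilities: the policy already prefers the chosen call slightly
# more than the frozen reference model does.
loss = dpo_loss(policy_chosen=-1.0, policy_rejected=-5.0,
                ref_chosen=-3.0, ref_rejected=-3.0)
print(pair["rejected"], round(loss, 3))
```

The loss falls as the policy raises the well-formed call relative to the malformed one, so the gradient is aimed at exactly the format error the rejected example exhibits, something a supervised loss on correct examples alone never sees.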

One caution the corpus raises: decomposition is not magic generalization. Transformers tend to solve compositional problems by memorizing computation subgraphs and stitching them together, failing badly on genuinely novel combinations (see "Do transformers actually learn systematic compositional reasoning?"), and the order you train sub-skills in mechanically reshapes the result: structured-first curricula avoid the entropy collapse that joint training causes (see "Does training order reshape how models handle different task types?"). So granular training helps, but *which* granular tasks, in *what order*, with *negative examples*: those are the levers that decide whether it actually beats the monolithic approach.

Sources (8 notes)

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% vs. 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Scheduling guided by backward transfer (BWT), which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Break It Down: Evidence for Structural Compositionality in Neural Networks (1.80 match)
Scaling can lead to compositional generalization (1.78 match)
Faith and Fate: Limits of Transformers on Compositionality (1.75 match)
A Survey on Post-training of Large Language Models (1.70 match)
Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks (1.70 match)
Exploring Format Consistency for Instruction Tuning (1.69 match)
How do Transformers Learn Implicit Reasoning? (1.66 match)
Eliciting Reasoning in Language Models with Cognitive Tools (1.63 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether granular task training beats monolithic function-calling fine-tuning. A curated library (2023–2026) found evidence favoring decomposition, but flagged format-copying and compositional brittleness as limits. Your job: separate durable constraints from perishable ones.

What a curated library found — and when (dated claims, not current truth):
• Granite-20B-FunctionCalling decomposed function calling into seven granular tasks (nested calls, chaining, parallel functions, function naming, parameter detection, function ranking, response generation) and outperformed ToolLLM and matched GPT/Claude/Gemini (2024-06).
• Instruction tuning teaches *output format distribution*, not task semantics; models trained on correct vs. deliberately wrong instructions converge (2023-05).
• DPO with negative examples (correct + incorrect outputs) beats supervised fine-tuning on function calling in small models by directly targeting format failures (2024-10).
• Transformers solve compositional tasks via subgraph memorization and stitching; fail on novel combinations outside training distribution (2023-05).
• Multi-task training order (structured-first curricula) prevents entropy collapse in joint training regimes (2025-07).

Anchor papers (verify; mind their dates):
• arXiv:2407.00121 (Granite, 2024-06)
• arXiv:2305.11383 (Instruction tuning / format, 2023-05)
• arXiv:2410.18890 (DPO + small models, 2024-10)
• arXiv:2305.18654 (Compositional limits, 2023-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. Have newer evals or reasoning models (e.g., o1-style chains, in-context learning breakthroughs, or novel tokenization) *relaxed* the format-copying limit or the compositional brittleness? Separate "granular task decomposition is architecturally sound" (likely durable) from "seven specific tasks + supervised FT is sufficient" (likely perishable). What new methods (e.g., synthetic negatives, retrieval-augmented function picking, hierarchical task graphs) have since emerged?
(2) Surface the strongest work from the last ~6 months that contradicts the "format-copying is the bottleneck" finding or proposes a *different* recipe (e.g., end-to-end reasoning, in-weight tool routing, or curriculum learning alternatives).
(3) Propose 2 research questions that assume the training regime may have shifted: (a) Does scaling + multi-modal reasoning bypass the need for explicit negative examples? (b) Can curriculum learning over *compositions* of granular tasks (e.g., chaining → error recovery) outpace fixed-order task sequences?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
