INQUIRING LINE

Can granular function calling tasks learn composition from graph-sampled data?

This explores whether function calling—when broken into granular subtasks like nested calls and chaining—can learn to *compose* those skills from training data sampled out of a graph (knowledge graph paths, tree expansions), and what the corpus says about when that composition actually generalizes versus when it quietly fails.


The corpus has two halves of an answer that are worth holding side by side. First, function calling really does decompose cleanly: training Granite-20B across seven explicit subtasks (nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, response generation) generalizes better than a single umbrella dataset, closing the gap with the frontier models (see: Can breaking function calling into subtasks improve model generalization?). So the 'granular tasks' premise of the question is sound. Small models can even learn the rigid output discipline these tasks demand through DPO on a teacher's correct/incorrect pairs, where the explicit negative examples target exactly the format failures that plague composition (see: Can small models match large models on function calling?).
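To make the role of explicit negatives concrete, here is a minimal sketch of the standard DPO objective for a single correct/incorrect pair, computed from sequence log-probabilities. It assumes those log-probabilities are already available from the policy and a frozen reference model; the function name `dpo_pair_loss`, the numeric values, and the `beta` default are hypothetical and not taken from the cited note.

```python
import math

def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How far the policy has moved from the reference on each completion.
    chosen_shift = policy_chosen - ref_chosen
    rejected_shift = policy_rejected - ref_rejected
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Hypothetical log-probs: "chosen" is a well-formed call, "rejected" a malformed one.
untrained = dpo_pair_loss(-12.0, -11.0, -12.0, -11.0)  # policy equals reference
improved = dpo_pair_loss(-9.0, -15.0, -12.0, -11.0)    # policy now favors the correct call
print(untrained, improved)
```

The loss falls only when the policy raises the correct call relative to the incorrect one, which is the sense in which the negative example does work that plain SFT on correct calls does not.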

The graph-sampling half is where it gets interesting. Sampling structure can hand you compositional supervision almost for free: knowledge-graph curricula derive 24,000 reasoning tasks from medical graph *paths* and produce domain expertise, suggesting structured composition matters more than raw scale (see: Can knowledge graphs teach models deep domain expertise?). Even random tree expansion yields supervision at multiple granularities, with coarse strategy signals from early branches and fine detail from late ones, purely from sampling and with no annotation effort (see: Does tree depth automatically produce supervision at multiple granularities?). For function calling, where a call graph naturally encodes which functions chain into which, this is a strong hint that graph-sampled paths could teach composition rather than just memorized recipes.
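One way to picture graph-sampled supervision for function calling is a random walk over a call graph, where each sampled path becomes a candidate multi-step training example. The sketch below is a minimal illustration under stated assumptions: the `CALL_GRAPH` contents, the function names in it, and the `sample_chains` helper are all hypothetical and do not come from any of the cited notes.

```python
import random

# Hypothetical call graph: an edge f -> g means g can consume f's output.
CALL_GRAPH = {
    "search_flights": ["filter_by_price", "book_flight"],
    "filter_by_price": ["book_flight"],
    "book_flight": ["send_confirmation"],
    "send_confirmation": [],
}

def sample_chain(graph, start, max_len, rng):
    """Random walk from `start`; stops at a sink node or after max_len nodes."""
    chain = [start]
    while len(chain) < max_len and graph[chain[-1]]:
        chain.append(rng.choice(graph[chain[-1]]))
    return chain

def sample_chains(graph, n, max_len=4, seed=0):
    """Sample n chains, each starting from a uniformly chosen node."""
    rng = random.Random(seed)
    starts = sorted(graph)
    return [sample_chain(graph, rng.choice(starts), max_len, rng) for _ in range(n)]

chains = sample_chains(CALL_GRAPH, n=5)
for chain in chains:
    print(" -> ".join(chain))
```

Short prefixes of a walk correspond to coarse decisions (which function to call first), while longer paths carry the finer chaining detail, so one sampler can produce examples at several granularities.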

The catch is that composition learned this way can be an illusion. Transformers often succeed on in-distribution compositional tasks by memorizing computation *subgraphs* from training, not by learning systematic rules, and they fail drastically on novel compositions, with errors compounding across steps (see: Do transformers actually learn systematic compositional reasoning?). So if you only evaluate on combinations that already appear in your graph-sampled data, you may be teaching subgraph lookup dressed as reasoning. The antidote the corpus offers is coverage: standard networks (MLPs, in the cited work) *do* achieve genuine compositional generalization from data and model scaling alone, but only when the training distribution sufficiently covers combinations of the task modules (see: Can neural networks learn compositional skills without symbolic mechanisms?). That reframes the whole question: graph sampling helps precisely to the degree it covers the combinatorial space of function compositions, not because graphs are magic.
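Coverage can be made measurable. As a rough sketch, the snippet below counts which adjacent function pairs permitted by a call graph actually appear in a set of training chains. Pairwise coverage is only a proxy chosen for illustration (longer compositions could be tracked the same way), and the function names, `VALID_PAIRS`, and `TRAIN` data are hypothetical.

```python
def adjacent_pairs(chain):
    """Ordered (caller, next) pairs that occur back to back in one chain."""
    return set(zip(chain, chain[1:]))

def pair_coverage(train_chains, valid_pairs):
    """Fraction of valid pairs seen in training, plus the pairs never seen."""
    seen = set()
    for chain in train_chains:
        seen |= adjacent_pairs(chain)
    covered = seen & valid_pairs
    return len(covered) / len(valid_pairs), valid_pairs - covered

# Hypothetical compositions the call graph allows.
VALID_PAIRS = {
    ("search", "filter"),
    ("search", "book"),
    ("filter", "book"),
    ("book", "notify"),
}
# Hypothetical graph-sampled training chains.
TRAIN = [["search", "filter", "book"], ["book", "notify"]]

coverage, missing = pair_coverage(TRAIN, VALID_PAIRS)
print(coverage)  # 0.75
print(missing)   # {('search', 'book')}
```

A low coverage score, or a `missing` set containing compositions you expect at deployment, is a signal that the sampler needs to visit more of the graph before in-distribution accuracy can be trusted.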

Two deeper cautions sharpen the picture. Networks do tend to implement compositional subroutines in isolated, ablatable subnetworks; modularity is natural, and pretraining makes it more reliable (see: Do neural networks naturally learn modular compositional structure?), which is encouraging for granular function calling. Yet a model can hold all the linearly-decodable features a task needs while its internal organization stays fractured, leaving it brittle to the exact distribution shifts that novel function compositions represent, in ways standard accuracy metrics never reveal (see: Can models be smart without organized internal structure?). So the honest answer: yes, granular function-calling tasks can learn composition from graph-sampled data, since graph and tree sampling are an efficient source of multi-granular compositional signal. But whether that composition is real or memorized depends on coverage of the combination space, and you can only tell the difference by testing on genuinely held-out compositions rather than in-distribution accuracy.
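A held-out composition test can be set up by reserving specific function pairings for evaluation while making sure every individual function still appears in training. The following is a minimal sketch of such a split, not a method from the cited notes; the chain data, `HELD_OUT` pairs, and helper names are hypothetical.

```python
def split_by_held_out_pairs(chains, held_out_pairs):
    """Chains containing any held-out adjacent pair go to test; the rest to train."""
    train, test = [], []
    for chain in chains:
        pairs = set(zip(chain, chain[1:]))
        (test if pairs & held_out_pairs else train).append(chain)
    return train, test

def primitives_covered(train, test):
    """True if every function used in test also appears somewhere in train."""
    train_funcs = {f for chain in train for f in chain}
    test_funcs = {f for chain in test for f in chain}
    return test_funcs <= train_funcs

# Hypothetical graph-sampled chains.
CHAINS = [
    ["search", "filter", "book"],
    ["search", "book"],
    ["filter", "book", "notify"],
    ["search", "book", "notify"],
]
HELD_OUT = {("search", "book")}  # composition reserved for evaluation

train, test = split_by_held_out_pairs(CHAINS, HELD_OUT)
print(train)  # [['search', 'filter', 'book'], ['filter', 'book', 'notify']]
print(test)   # [['search', 'book'], ['search', 'book', 'notify']]
print(primitives_covered(train, test))  # True
```

Reporting accuracy separately on `train`-like and `test` chains exposes the gap between memorized subgraphs and genuine composition; the `primitives_covered` check keeps failures attributable to the unseen pairing rather than to an unseen function.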


Sources (8 notes)

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can knowledge graphs teach models deep domain expertise?

Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.

Does tree depth automatically produce supervision at multiple granularities?

Tree-GRPO's random expansion strategy naturally produces supervision at varying granularities—early branches provide coarse strategy-level signals while late branches provide fine-grained detail supervision. This multi-resolution signal emerges from sampling structure alone, without annotation effort or granularity scheduling.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether granular function-calling tasks can learn genuine compositional reasoning from graph-sampled training data—a question that sits between capability frontier and brittleness risk.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable constraints to verify:
• Granular decomposition works: splitting function calling into seven explicit subtasks (nesting, chaining, parallel execution, name detection, parameter detection, next-best selection, response generation) closes the gap with frontier models; small models trained via DPO on teacher pairs match large-model format discipline (2024-06, 2024-10).
• Graph and tree sampling yield multi-granular compositional supervision almost for free: knowledge-graph curricula generate 24,000+ reasoning tasks from medical graph paths; random tree expansion depth-maps to process supervision granularity without annotation (2025-07, 2025-09).
• Composition can be an illusion: Transformers often memorize linearized subgraph patterns from training rather than learning systematic compositional rules, failing drastically on novel compositions with compounding step errors (2023-05).
• Genuine compositional generalization emerges only with sufficient coverage: standard networks achieve real compositional generalization via scaling alone, but only when training distribution covers task-module combinations (2025-07).
• Modularity emerges naturally but fragility persists: networks do decompose into ablatable, modular subnetworks (2023-01), yet can hold all linearly-decodable features while remaining brittle to distribution shifts that novel function compositions represent (2024-05).

Anchor papers (verify; mind their dates):
• arXiv:2407.00121 (2024-06): Granite-Function Calling Model multi-task learning
• arXiv:2305.18654 (2023-05): Faith and Fate—Transformers and compositionality limits
• arXiv:2507.07207 (2025-07): Scaling and compositional generalization
• arXiv:2507.13966 (2025-07): Bottom-up domain-specific reasoning via knowledge graphs

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (post-2026), novel training methods (multi-agent distillation, synthetic function-call curricula), evaluation harnesses (held-out compositional benchmarks), or tooling (function-calling SDKs, agentic memory) have relaxed or overturned it. Separate the durable question (does graph sampling help composition?) from the perishable limitation (does composition learned this way generalize beyond training coverage?). Cite what resolved it; flag where brittleness still appears to bind.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—anything that directly challenges the "memorization vs. real reasoning" tension or claims genuine compositional transfer from graph-sampled data.
(3) Propose 2 research questions that ASSUME the frontier has shifted: e.g., "Can function-calling agents trained on graph-sampled data generalize to novel API compositions unseen in training?" or "Does agentic caching + retrieval-augmented composition sidestep the coverage problem?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines