INQUIRING LINE

Why do vision and language have different optimal scaling curves?

This explores why image models and text models follow different rules for how much data and compute they each need to keep improving — and what the corpus says about reconciling them in one model.


Vision and language don't improve at the same rate when you scale them up, and the mismatch reveals something about each modality. The cleanest answer in the corpus is that they sit in different regimes. IsoFLOP analysis puts language close to the Chinchilla balance, where data and parameters grow roughly in proportion, while vision is significantly more data-hungry: it keeps wanting more images relative to its size (see "Why do vision and language scale so differently?"). The practical fix that work proposes is sparse mixture-of-experts. Routing tokens to modality-specific experts effectively shifts language toward vision's data-hungry regime, letting both coexist optimally inside a single model instead of forcing one compromise curve on both.
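To make the regime difference concrete, here is a minimal Python sketch of how a fixed compute budget splits between parameters and training data. It assumes the common approximation C ≈ 6·N·D for training FLOPs and a power-law optimum N_opt ∝ C^a. The function name, the 20 tokens-per-parameter anchor, the reference budget, and the 0.4 exponent used for the data-hungry case are illustrative assumptions, not values reported by the cited work.

```python
import math


def compute_optimal_split(flops, param_exponent,
                          tokens_per_param_at_ref=20.0, ref_flops=1e21):
    """Split a FLOP budget into (parameters N, training tokens D).

    Assumptions (hypothetical, for illustration only):
      * training cost C is approximated as 6 * N * D
      * the optimal N grows as C ** param_exponent
      * at ref_flops the optimum uses tokens_per_param_at_ref tokens per parameter
    """
    # Solve 6 * N * (r * N) = ref_flops for N at the reference budget.
    n_ref = math.sqrt(ref_flops / (6.0 * tokens_per_param_at_ref))
    # Scale N away from the reference point with the assumed power law.
    n = n_ref * (flops / ref_flops) ** param_exponent
    # Whatever compute is left after fixing N determines D.
    d = flops / (6.0 * n)
    return n, d


if __name__ == "__main__":
    for budget in (1e21, 1e22, 1e23):
        n_bal, d_bal = compute_optimal_split(budget, param_exponent=0.5)
        n_hun, d_hun = compute_optimal_split(budget, param_exponent=0.4)
        print(f"C={budget:.0e}  balanced D/N={d_bal / n_bal:8.1f}  "
              f"data-hungry D/N={d_hun / n_hun:8.1f}")
```

With an exponent of 0.5 the tokens-per-parameter ratio stays flat as compute grows, which is the balanced regime. With an exponent below 0.5 the ratio keeps climbing, which is what "data-hungry" means in this context.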

Why would the appetites differ in the first place? A useful adjacent framing is that text is a lossy compression of reality: it strips out the physics, geometry, and causal structure that images still carry (see "Are text-only language models fundamentally limited by abstraction?"). Language arrives pre-abstracted by humans, so a model extracts a lot per token; vision carries raw, redundant, high-dimensional signal, so it needs far more examples to distill comparable structure. On this reading, the scaling exponent gap isn't just an accident of architecture. It tracks how much each medium has already been pre-digested before the model ever sees it.

The corpus also pushes back on the idea that there's one universal scaling curve at all. Below a billion parameters, depth beats width for language: at the 125M–350M scale, deep-and-thin models that compose concepts through more layers outperform balanced designs that spread parameters sideways, which cuts against the Kaplan-style picture in which loss tracks parameter count smoothly and shape barely matters (see "Does depth matter more than width for tiny language models?"). And once you fold architectural variables (hidden size, MLP-to-attention ratio, GQA) into the scaling law itself, the "optimal" point moves: there's no single curve, only a curve conditional on the shape you chose (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"). So "different optimal curves" is partly a story about the modality and partly about the fact that scaling laws are local, not universal.
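A rough way to see why shape is a free variable at a fixed size is to count parameters. The Python sketch below uses the standard approximation that a transformer layer holds about 4·d² attention weights plus 2·r·d² MLP weights, where r is the MLP expansion ratio. It ignores embeddings, biases, and normalization, and assumes plain multi-head attention without GQA. The function names and the example shapes are hypothetical and are not taken from the cited notes.

```python
import math


def non_embedding_params(n_layers, d_model, mlp_ratio=4):
    """Approximate non-embedding parameter count of a transformer stack."""
    attn = 4 * d_model * d_model               # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * d_model * d_model    # up- and down-projection
    return n_layers * (attn + mlp)


def width_for_budget(param_budget, n_layers, mlp_ratio=4):
    """Hidden size that spends param_budget across n_layers (same approximation)."""
    per_layer_coeff = 4 + 2 * mlp_ratio
    return math.sqrt(param_budget / (n_layers * per_layer_coeff))


if __name__ == "__main__":
    # Hypothetical deep-and-thin shape used only to set a budget.
    budget = non_embedding_params(n_layers=30, d_model=576)
    for layers in (12, 30):
        width = width_for_budget(budget, layers)
        print(f"{layers:2d} layers -> hidden size ~{width:.0f} "
              f"for ~{budget / 1e6:.0f}M non-embedding parameters")
```

The same budget can be spent on many layers with a small hidden size or few layers with a large one, and changing `mlp_ratio` shifts the trade again. A scaling law fitted on parameter count alone cannot distinguish these shapes, which is why the notes above treat the optimum as conditional on architecture.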

There's a sharper twist when the two modalities are forced together. Verbose chain-of-thought, which generally helps language reasoning, actively degrades fine-grained multimodal perception, because the real bottleneck there is visual attention allocation, not more text tokens (see "Does verbose chain-of-thought actually help multimodal perception tasks?"). That's the scaling-curve divergence showing up at the optimization level: pouring more of language's favored resource into a vision task optimizes the wrong target. The lesson running through all of these is that "scale" isn't one knob. Each modality has its own bottleneck, and the curve you should follow depends on which bottleneck you're actually fighting.


Sources (5 notes)

Why do vision and language scale so differently?

IsoFLOP analysis shows language scales near Chinchilla balance while vision is significantly more data-hungry. Sparse MoE shifts language toward the data-hungry regime, enabling both modalities to coexist optimally in one model by routing tokens to modality-specific experts.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Does verbose chain-of-thought actually help multimodal perception tasks?

Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a scaling-law researcher. The question remains open: why do vision and language exhibit different optimal scaling curves, and what does that reveal about how modalities learn?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable constraints to re-test:
• Language scales near Chinchilla balance (data ∝ parameters); vision is far more data-hungry — keeps wanting more images relative to model size (~2024).
• Sparse mixture-of-experts routing to modality-specific experts can align both modalities to a shared curve inside one model, rather than forcing compromise (~2024).
• Below 1B parameters, depth beats width for language, contradicting smooth Kaplan-style scaling laws (~2024).
• Scaling laws are conditional on architecture (hidden size, MLP-to-attention ratio, GQA); no single universal curve exists (~2024).
• Verbose chain-of-thought degrades multimodal perception tasks; the real bottleneck is visual attention, not text tokens (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2510.18245 (2025-10): Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
• arXiv:2603.03276 (2026-03): Beyond Language Modeling: An Exploration of Multimodal Pretraining
• arXiv:2502.07266 (2025-02): When More is Less: Understanding Chain-of-Thought Length in LLMs
• arXiv:2503.14858 (2025-03): 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Cap

Your task:
(1) RE-TEST EACH CONSTRAINT. For the vision–language scaling divergence, the Chinchilla-balance claim for language, and the depth-vs-width claim, determine whether newer models (GPT-4V follow-ups, Gemini variants), training techniques (continued scaling studies, constitutional AI in vision), or evaluation harnesses have since relaxed, overturned, or confirmed these limits. Separate durable structure (modalities do learn differently) from perishable specifics (exact exponents, optimal mixtures). Cite what resolved or sustained each constraint.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has anyone unified vision and language under one scaling regime? Does multimodal pre-training (arXiv:2603.03276) overturn the data-hunger claim?
(3) Propose 2 research questions that assume the regime may have shifted — e.g., do inference-time scaling (test-time compute) or retrieval-augmented approaches erase the original modality gap? Can architectural unification (shared embedding spaces, learned routing) make the distinction moot?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
