Does verbose chain-of-thought actually help multimodal perception tasks?
Extending RL to MLLMs through longer rationales follows the successful reasoning playbook, but may backfire on perception tasks. This note explores when and why the standard CoT-and-RL recipe fails.
The default playbook for improving LLM reasoning under RL is well known: longer chains of thought, more intermediate tokens, more verbose rationales. This helps on math, code, and multi-hop reasoning. The Reinforced Attention Learning paper documents a domain where it actively hurts: multimodal perception tasks.
The failure mode is structural. Perception tasks — fine-grained visual question-answering, grounding, attribute identification — depend on the model precisely attending to the right region of the visual input. The bottleneck is not what to say; it is what to look at. Verbose rationales pile text tokens on top of the visual attention task, and the optimization signal flows to those text tokens rather than to the underlying attention. The model becomes more elaborate in its descriptions of what it sees without becoming more accurate about what it sees.
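To see why the signal lands on text tokens, it helps to write out the standard REINFORCE-style policy gradient for a sampled output with a single outcome reward. This is the textbook estimator, offered as a sketch and not necessarily the exact objective used in the paper; the split of the output into K rationale tokens and M answer tokens is notation introduced here for illustration.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ R(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right],
\qquad
y = (\underbrace{r_1, \dots, r_K}_{\text{rationale}},\;
     \underbrace{a_1, \dots, a_M}_{\text{answer}}),
\quad T = K + M
```

Every term in the sum is the log-probability of a text token. Attention over image patches appears only as an intermediate quantity inside \pi_\theta, so it is shaped indirectly, through whichever tokens happened to be rewarded. With M fixed, lengthening the rationale means K of the K + M terms belong to the description and not to the answer.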
This contradicts the assumption that the CoT-and-RL recipe is universally beneficial. The recipe works when the bottleneck is reasoning steps that can be made externally visible — math derivations, logical chains, step-by-step planning. It does not work when the bottleneck is the model's internal information-allocation decisions, which are not visible in the token stream and which RL on tokens cannot directly reach.
The diagnostic generalizes. Before applying the verbose-CoT playbook to a new domain, ask: is the bottleneck on this task something the model can verbalize, or is it something happening inside the attention pattern? If verbalizable, verbose CoT and outcome RL help. If internal, they may add noise without addressing the actual problem — and in some cases, as in MLLM perception, may degrade performance by reinforcing the wrong policy object.
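One way to make the diagnostic question concrete is to measure how much of the model's attention over image patches lands inside the region that actually contains the answer. The sketch below assumes you can already extract a 2D attention map over the patch grid and that you have a ground-truth box in patch coordinates. The function name, the box format, and the toy values are hypothetical and do not come from the paper.

```python
from typing import Sequence, Tuple


def attention_mass_in_region(
    attn_map: Sequence[Sequence[float]],
    box: Tuple[int, int, int, int],
) -> float:
    """Return the fraction of total attention mass inside a patch-grid box.

    attn_map: rows of non-negative attention weights over image patches
              (hypothetical input; extraction depends on the model).
    box: (row_start, row_end, col_start, col_end), end-exclusive.
    """
    r0, r1, c0, c1 = box
    total = sum(sum(row) for row in attn_map)
    if total <= 0:
        raise ValueError("attn_map must have positive total mass")
    inside = sum(sum(row[c0:c1]) for row in attn_map[r0:r1])
    return inside / total


# Toy 4x4 patch grid; assume the answer lives in the top-left 2x2 block.
target_box = (0, 2, 0, 2)

# Attention spread evenly over all 16 patches.
uniform = [[1.0] * 4 for _ in range(4)]

# Attention concentrated on the target block (weight 5 inside, 1 outside).
focused = [
    [5.0 if r < 2 and c < 2 else 1.0 for c in range(4)]
    for r in range(4)
]

uniform_mass = attention_mass_in_region(uniform, target_box)  # 4 / 16
focused_mass = attention_mass_in_region(focused, target_box)  # 20 / 32
```

If this fraction stays flat while rationales get longer and accuracy does not improve, that is consistent with an attention-side bottleneck, though it does not prove one.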
This connects to the broader pattern of CoT limitations. CoT is a constrained imitation of the form of reasoning; it cannot reach mechanisms that are not encoded in token sequences. For tasks whose mechanism lives in attention, in latent state trajectories, or in cross-modal alignment, training the verbalization layer is training the wrong thing.
Inquiring lines that use this note as a source (37)
This note is a source for these synthesized inquiries.
- Can parsing screens into structured elements before acting improve vision models?
- What role does visual perception play alongside accessibility tree information?
- What makes multimodal conditioning effective when features are decomposed to the right granularity?
- Why do contrastive reasoning approaches outperform single-path belief evaluation?
- Are potemkin understanding and split-brain syndrome describing the same phenomenon?
- Why do more capable models prefer shorter chains of thought?
- How do semantic failure modes map to attentional and intentional layers?
- Why does pure-vision underperform when parsing semantics and action prediction mix?
- Do self-correction and chain-of-thought prompting reduce hallucination rates?
- What saliency patterns distinguish successful from failed chain-of-thought reasoning?
- Why might latent reasoning capture types of thinking that verbalized CoT cannot?
- What distinguishes perception contribution from decision authority in collaboration?
- What fine-grained distinctions matter most for human situated action in categories?
- Can parallel evaluation reduce position and length bias in LLM judging?
- How does factoring perception from reasoning improve sparse-label learning?
- How do cognitive load dimensions interact with hallucination awareness in prompts?
- Do chain-of-thought prompts help RLVR models predict annotation disagreement?
- Are RLVR models worse than non-reasoning models for subjective annotation?
- What limits RLVR effectiveness beyond mathematical and coding domains?
- Should GUI perception happen inside or outside the foundation model?
- What alternatives to RLHF better preserve truth-seeking in AI outputs?
- Why must world models be nested rather than flat and uniform?
- Why are receiver attention heads narrower in reasoning models than base models?
- Why does chain-of-thought fail to improve multimodal model perception performance?
- Why does reasoning volume fail to improve theory of mind performance?
- What are the scaling law differences between vision and language learning?
- Can dense models partially address modality friction without full expert specialization?
- How does interleaving reasoning with action prevent hallucination?
- What training interventions could close the perception-action gap?
- Why do vision and language have different optimal scaling curves?
- Can multimodal architectures successfully integrate vision without replicating past failures?
- How do sparse mixture-of-experts models resolve modality capacity competition?
- What scaling exponent would audio or other modalities require in a truly multimodal system?
- What visual patterns transfer between infographic and UI tasks when trained jointly?
- How does causal multimodal modeling differ from encoder-decoder architectures?
- Why do multimodal models fail on rare and underrepresented concepts?
- Why do small specialized models match frontier multimodal models on screen tasks?
Related concepts in this collection (4)
- Can optimizing attention patterns improve multimodal RL better than optimizing tokens?
  Standard RL training optimizes token outputs in multimodal models, but the real bottleneck may be where the model attends to visual information. Does steering attention directly outperform indirect optimization through final outputs?
  same paper, the alternative to verbose CoT
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  adjacent: structural limit of CoT that applies broadly
- Does chain of thought reasoning actually explain model decisions?
  When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.
  adjacent: CoT degradation in another domain (agentic pipelines)
- Do large language models actually perform iterative optimization?
  Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
  adjacent: another case where verbose CoT does not address the actual bottleneck
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- When More is Less: Understanding Chain-of-Thought Length in LLMs
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Pixels, Patterns, but No Poetry: To See The World like Humans
- Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
- Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
- Efficient Reasoning with Hidden Thinking
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
- Reasoning Models Are More Easily Gaslighted Than You Think
Original note title
verbose chain-of-thought degrades MLLM perception tasks — text-token RL is the wrong policy objective when the bottleneck is visual grounding