How does instruction density affect model performance?
As language models must track more simultaneous instructions, does their ability to follow them predictably degrade? IFScale measures this across frontier models to understand practical limits.
Production LLM systems routinely require adherence to dozens or hundreds of simultaneous instructions: style guidelines, business rules, compliance standards, tool usage protocols. IFScale measures how performance degrades as instruction density increases, using a business report writing task that scales up to 500 keyword-inclusion instructions.
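A keyword-inclusion instruction is easy to score automatically, which is what makes a benchmark of this shape practical. The following is a minimal sketch of such a scorer, not IFScale's actual evaluation code: the function name `keyword_accuracy`, the whole-word matching rule, and the sample report and keywords are all assumptions made for illustration.

```python
import re

def keyword_accuracy(report: str, keywords: list[str]) -> float:
    """Fraction of required keywords that appear as whole words in the report."""
    if not keywords:
        return 1.0
    text = report.lower()
    hits = sum(
        1
        for kw in keywords
        # \b anchors the match to word boundaries, so "accountable"
        # does not count as satisfying "accountability".
        if re.search(rf"\b{re.escape(kw.lower())}\b", text)
    )
    return hits / len(keywords)

# Hypothetical report and instruction set, for illustration only.
report = "Our strategy stresses accountability and sustainable revenue growth."
keywords = ["accountability", "revenue", "compliance", "strategy"]
print(keyword_accuracy(report, keywords))  # 0.75
```

Instruction density in this sketch is simply `len(keywords)`; sweeping it upward and re-scoring each generated report would give one accuracy-versus-density curve per model.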
Key findings across 20 SOTA models from 7 providers:
Three degradation patterns correlate with model size and reasoning capability:
- Linear decay: steady degradation from the start (smaller models)
- Exponential decay: accelerating degradation as density increases (mid-range models)
- Threshold decay: near-perfect performance maintained until a threshold, then a steep decline (reasoning models such as gemini-2.5-pro and o3 hold up through ~150 instructions)
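For threshold decay, the quantity of practical interest is the last density at which accuracy is still near-perfect. A rough sketch of reading that off an accuracy curve is shown below; the helper `last_stable_density`, the 0.95 floor, and the data points are hypothetical and are not IFScale measurements.

```python
def last_stable_density(curve, floor=0.95):
    """Return the highest density reached before accuracy first drops below `floor`.

    `curve` is an iterable of (instruction_count, accuracy) pairs.
    Returns None if even the lowest density is already below the floor.
    """
    stable = None
    for density, accuracy in sorted(curve):
        if accuracy < floor:
            break
        stable = density
    return stable

# Made-up illustrative numbers shaped like a threshold-decay curve.
curve = [(10, 1.00), (50, 0.99), (100, 0.98), (150, 0.96), (200, 0.85), (300, 0.70)]
print(last_stable_density(curve))  # 150
```

The choice of floor is a judgment call: a stricter floor moves the reported threshold earlier, so it should be fixed before comparing models.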
Primacy effects follow a non-obvious pattern: bias toward earlier instructions is minimal at low density, peaks at 150-200 instructions (where models begin to struggle), then converges toward a ratio of 1.0 (no positional preference) at extreme density (300+). The convergence indicates a shift from selective instruction satisfaction to uniform failure: an "instruction saturation point" where the model is completely overwhelmed.
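One way to make the "converge toward 1.0" reading concrete is to express primacy as a ratio of error rates between late and early instructions. The sketch below assumes that definition (error rate in the final third of the instruction list divided by error rate in the first third); the note does not spell out the exact formula, so treat `primacy_ratio` and the sample data as illustrative.

```python
from typing import Optional

def primacy_ratio(followed: list[bool]) -> Optional[float]:
    """Late-third error rate divided by early-third error rate.

    `followed[i]` is True if the i-th instruction (in prompt order) was satisfied.
    A value above 1.0 means later instructions fail more often (primacy bias);
    1.0 means errors are spread evenly. Returns None when the ratio is undefined.
    """
    third = len(followed) // 3
    if third == 0:
        return None
    early_err = followed[:third].count(False) / third
    late_err = followed[-third:].count(False) / third
    if early_err == 0:
        return None  # no early errors, so the ratio cannot be computed
    return late_err / early_err

# Hypothetical outcomes: early instructions are favoured.
print(primacy_ratio([True, False, True, True, True, True, False, False, True]))  # 2.0
# Uniform failure: every instruction missed, so there is no positional bias.
print(primacy_ratio([False] * 9))  # 1.0
```

Under this definition, a ratio of 1.0 can mean either uniformly good or uniformly bad performance, which is why it has to be read alongside overall accuracy.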
Two error types: omission errors (complete failure to include required terms) and modification errors (morphological variants like "accountable" when "accountability" was required). The distinction has practical implications for prompt design — models may recognize the concept but fail at exact specification.
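The two error types can be told apart mechanically. The sketch below is one possible heuristic, assuming single-word required terms and treating any word that shares a long enough prefix with the required term as a morphological variant; the function name `classify` and the 60% prefix cutoff are assumptions, not the rule used by IFScale.

```python
import re

def classify(report: str, keyword: str, min_shared: float = 0.6) -> str:
    """Label one keyword instruction as 'followed', 'modification', or 'omission'."""
    kw = keyword.lower()
    words = re.findall(r"[a-z0-9'-]+", report.lower())
    if kw in words:
        return "followed"
    # Assumed heuristic: a word sharing the first 60% of the keyword's
    # characters is counted as a morphological variant.
    need = max(1, int(len(kw) * min_shared))
    if any(w[:need] == kw[:need] for w in words):
        return "modification"
    return "omission"

print(classify("We hold every team accountable.", "accountability"))  # modification
print(classify("Revenue grew last quarter.", "accountability"))       # omission
```

A prefix rule is crude (it misses irregular forms and can over-match short stems), but it is enough to separate "concept present, wrong form" from "concept absent" when auditing a prompt.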
Even the best frontier models achieve only 68% accuracy at the maximum density of 500 instructions. Reasoning models (deliberative processing architectures) track instructions robustly up to a critical threshold, which extends the useful range significantly but does not eliminate the ceiling.
Inquiring lines that use this note as a source (16)
This note is a source for these synthesized inquiries.
- How do larger models maintain more parallel tasks than smaller models?
- Can smaller models actually perform well on specific downstream tasks?
- What determines the finite chain length where robustness improvements plateau?
- Why do single function-calling benchmarks mask model weakness in specific areas?
- Why do primacy effects peak at specific instruction densities?
- Does input length alone explain instruction density performance loss?
- Can structured output formats reduce instruction following degradation?
- Why does adjusted compression performance degrade as models scale larger?
- Do instruction-tuned models learn tasks or just output format distributions?
- How do quality, diversity, and complexity create different effects on downstream model performance?
- Why do text-only benchmarks underestimate deployed model capability?
- Does model collapse occur across different architectures or only in specific conditions?
- What deployment context determines which benchmark mode actually matters?
- How does workflow scale change the failure modes of frontier models?
- Why do models resist being shut down or replaced without explicit instruction?
- Why do strong models struggle more with instruction following than mid-tier ones?
Related concepts in this collection (3)
- Why do better reasoning models ignore instructions?
  As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?
  The training-time trade-off: reasoning scales while instruction-following degrades; IFScale quantifies the instruction-following dimension at inference time.
- Does reasoning ability actually degrade with longer inputs?
  Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
  Instruction density degradation may partly be an input-length effect with instruction-specific characteristics.
- Do strict output formats hurt LLM reasoning ability?
  When LLMs must produce structured JSON or XML with specific schemas, does this constrain their capacity for complex reasoning? This matters because production systems often enforce strict formats for parsing convenience.
  Format constraints are one type of instruction; IFScale generalizes to arbitrary instruction density.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- How Many Instructions Can LLMs Follow at Once?
- Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
- Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning
- Complex Logical Instruction Generation
- A Survey on Post-training of Large Language Models
- Are Emergent Abilities in Large Language Models just In-Context Learning?
- When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
Original note title
instruction following performance degrades predictably with instruction density — reasoning models show threshold decay at 150 instructions