MLLM-CBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis

Paper · arXiv 2508.08275 · Published July 31, 2025

Multimodal Large Language Models (MLLMs) rely on continual instruction tuning to adapt to the evolving demands of real-world applications. However, progress in this area is hindered by the lack of rigorous and systematic benchmarks. To address this gap, we present MLLM-CTBench, a comprehensive evaluation benchmark with three key contributions: 1) Multidimensional Evaluation: We combine final answer accuracy with fine-grained CoT reasoning quality assessment, enabled by a specially trained CoT evaluator; 2) Comprehensive Evaluation of Algorithms and Training Paradigms: We benchmark eight continual learning algorithms across four major categories and systematically compare reinforcement learning with supervised finetuning paradigms; 3) Carefully Curated Tasks: We select and organize 16 datasets from existing work, covering six challenging domains.

Introduction. Multimodal Large Language Models (MLLMs) have emerged as foundational architectures for cross-modal understanding and generation, demonstrating impressive capabilities across a variety of tasks. Instruction tuning has further enhanced these models by aligning them with human intent and improving task-specific performance through supervised adaptation (Yu et al. 2024). However, real-world deployment demands continuous adaptation to evolving instructions and domain requirements—a paradigm known as continual instruction tuning (He et al. 2023a), where the model incrementally learns from new tasks while retaining prior capabilities.
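To make "learning new tasks while retaining prior capabilities" concrete, the sketch below computes two metrics widely used in the continual learning literature: average final accuracy and backward transfer. This excerpt does not state which metrics the benchmark itself reports, so this is an illustration under assumptions, not the paper's evaluation protocol. The accuracy matrix, its values, and the function names are all hypothetical.

```python
from typing import List

# acc[i][j] = accuracy on task j, measured after sequentially training
# on tasks 0..i. The matrix layout and the values are hypothetical.
AccMatrix = List[List[float]]


def average_final_accuracy(acc: AccMatrix) -> float:
    """Mean accuracy over all tasks after training on the last task."""
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def backward_transfer(acc: AccMatrix) -> float:
    """Mean change in accuracy on earlier tasks.

    For each task j except the last, compare the accuracy after the final
    training stage with the accuracy measured right after task j was
    learned. A negative value indicates forgetting.
    """
    n = len(acc)
    if n < 2:
        return 0.0
    deltas = [acc[-1][j] - acc[j][j] for j in range(n - 1)]
    return sum(deltas) / len(deltas)


# Illustrative numbers for a three-task sequence (not results from the paper).
acc = [
    [0.80, 0.10, 0.05],
    [0.70, 0.75, 0.10],
    [0.60, 0.65, 0.90],
]

print(f"average final accuracy: {average_final_accuracy(acc):.3f}")
print(f"backward transfer:      {backward_transfer(acc):.3f}")
```

In this toy sequence the first two tasks lose accuracy as later tasks are learned, so backward transfer comes out negative. That drop on earlier tasks is the forgetting that continual learning methods aim to limit.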

Discussion / Conclusion. We present MLLM-CTBench, a benchmark for evaluating continual instruction tuning in MLLMs. It features: (i) a two-tiered evaluation combining answer accuracy and CoT-level diagnostics; (ii) a curated suite of 16 datasets covering six challenging domains; and (iii) comprehensive comparisons of eight continual learning methods and the GRPO
