Long-context LLMs Struggle with Long In-context Learning

Paper · arXiv 2404.02060 · Published April 2, 2024
Test-Time Compute

Large Language Models (LLMs) have made significant strides in handling long sequences. Some models, such as Gemini, even claim to be capable of handling millions of tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their true abilities in more challenging, real-world scenarios. We introduce a benchmark (LongICLBench) for long in-context learning in extreme-label classification using six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input and recognize the massive label spaces in order to make correct predictions. We evaluate 15 long-context LLMs and find that they perform well on less challenging classification tasks with smaller label spaces and shorter demonstrations. However, they struggle with more challenging tasks such as Discovery, which has 174 labels, suggesting a gap in their ability to process long, context-rich sequences. Further analysis reveals a bias towards labels presented later in the sequence and a need for improved reasoning over multiple pieces of information. Our study reveals that long-context understanding and reasoning remain challenging for existing LLMs.
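To make the task format concrete, the following is a minimal sketch of how a long in-context classification prompt might be assembled from labeled demonstrations. The "Text:/Label:" template, the function name build_icl_prompt, and the group_by_label switch are illustrative assumptions, not the benchmark's actual code. Grouping demonstrations by label rather than shuffling them is one simple way to probe how sensitive a model is to where each label appears in the context.

```python
import random
from collections import defaultdict


def build_icl_prompt(demos, query, group_by_label=False, seed=0):
    """Assemble a few-shot classification prompt (hypothetical template).

    demos: list of (text, label) pairs used as in-context demonstrations.
    query: the text whose label the model should predict.
    group_by_label: if True, keep demonstrations of the same label adjacent;
        otherwise shuffle them with a fixed seed.
    """
    ordered = list(demos)
    if group_by_label:
        # Bucket demonstrations by label, preserving first-seen label order.
        by_label = defaultdict(list)
        for text, label in ordered:
            by_label[label].append((text, label))
        ordered = [pair for label in by_label for pair in by_label[label]]
    else:
        # Scatter labels across the context with a reproducible shuffle.
        random.Random(seed).shuffle(ordered)

    blocks = [f"Text: {text}\nLabel: {label}" for text, label in ordered]
    # The query is appended last with an empty label slot for the model to fill.
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demos = [
        ("the flight was delayed", "travel"),
        ("my card was declined", "banking"),
        ("I missed my connection", "travel"),
    ]
    print(build_icl_prompt(demos, "where is my refund", group_by_label=True))
```

With many classes and at least one demonstration per class, a prompt built this way grows with the size of the label space, which is why extreme-label classification is a natural stress test for long-context models.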

Introduction. Large language models have already entered the long-context era. A myriad of LLMs have been released to support long context windows from 32K to 2M tokens. These methods (Hao et al., 2022; Chen et al., 2023a; Peng et al., 2023b; Ratner et al., 2023; Xiao et al., 2024; Jin et al., 2024) can unlock many complex real-world applications, such as long-document question answering, multi-document summarization, long-horizon agent tasks, and repo-level code understanding. One line of research is based on ALiBi (Press et al., 2022) and RoPE (Su et al., 2024) embeddings, which allow Transformers to be trained on short sequences and then applied to longer sequences at inference time. More recently, several approaches (Xiong et al., 2023; Fu et al., 2024; Liu et al., 2024) have enabled models to extrapolate to a 128K context window through continued pre-training. Later, LongRoPE (Ding et al., 2024) was proposed to extend the context window further, to 2M tokens.

Discussion / Conclusion. In summary, our research explores the capability of LLMs on long in-context learning tasks, particularly in extreme-label classification scenarios. We curate a dataset, LongICLBench, consisting of long in-context learning tasks with different difficulty levels in terms of context length. Our study shows that LLMs suffer dramatic performance degradation on the more difficult tasks. Our exploratory experiments further highlight the impact that the distribution of examples within a prompt has on model performance. We hope LongICLBench and our findings contribute to the ongoing efforts to enhance LLMs’ understanding of long contexts.