Context Embeddings for Efficient Answer Generation in RAG

Paper · arXiv 2407.09252 · Published July 12, 2024

Retrieval-Augmented Generation (RAG) overcomes the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer, which slows down decoding and directly increases the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation by a large margin. Our method allows for different compression rates, trading off decoding time for answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates an inference speed-up of up to 5.69× while achieving higher performance than existing efficient context compression methods. Model checkpoints: https://huggingface.co/naver/cocom-v1-128-mistral-7b.

Introduction. Large Language Models (LLMs) are pre-trained on massive amounts of textual data; for instance, Llama 2 [32] was trained on 2 trillion tokens during pre-training. Through billions of learnable parameters, LLMs not only excel at modeling language but also build up a knowledge base that can later be used for question answering. On the other hand, the model is limited to the knowledge contained in the pre-training data. In knowledge-intensive scenarios, relying solely on the parametric memory of the model is often insufficient. To alleviate this, context can be provided explicitly from an external source through a preceding retrieval step (Retrieval-Augmented Generation, RAG). Although LLMs show notable improvements in knowledge-intensive tasks when given additional relevant context, this approach has limitations. A key drawback is that adding more context to the input considerably slows down generation during inference. This occurs because the time and memory requirements of the self-attention mechanism in transformers grow quadratically with input length.
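The scaling argument above can be illustrated with a small back-of-the-envelope sketch. It counts the pairwise attention scores that full self-attention computes when the whole input is processed in one pass, per head and per layer. The function name and the token counts (a 32-token question, 128-token retrieved passages) are hypothetical values chosen for illustration, not figures from the paper, and the sketch ignores optimizations such as key-value caching.

```python
def attention_score_entries(num_tokens: int) -> int:
    """Pairwise attention scores for one head in one layer (full self-attention)."""
    return num_tokens * num_tokens


# Hypothetical sizes, chosen only for illustration.
QUESTION_TOKENS = 32
TOKENS_PER_CONTEXT = 128


def input_length(num_contexts: int) -> int:
    """Total input length when every retrieved context is passed as plain tokens."""
    return QUESTION_TOKENS + num_contexts * TOKENS_PER_CONTEXT


if __name__ == "__main__":
    for num_contexts in (0, 1, 5, 10):
        n = input_length(num_contexts)
        print(f"{num_contexts:>2} contexts: {n:>5} tokens, "
              f"{attention_score_entries(n):>9} attention scores")
```

Doubling the input length quadruples the number of scores, which is why adding retrieved passages is much more expensive than their token count alone suggests.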

Discussion / Conclusion. In this paper, we presented COCOM, our novel approach for context compression. Our main finding is that COCOM accelerates answer generation by reducing the model’s input: multiple contexts are compressed into context embeddings that, once pre-computed, serve to augment answer generation. Our approach maximizes the potential of the LLM by tuning all components, outperforming existing methods for context compression in RAG. By allowing the selection of varying numbers of context compression tokens, our method offers a trade-off between efficiency and effectiveness. This flexibility enables us to balance higher answer quality against faster generation times as needed. Unlike previous methods, our approach allows for the input of multiple contexts, which enhances generation quality and makes the best use of the reduced decoding time. This is because the difference between context in token form and a reduced set of embeddings is most apparent for very long inputs.
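To make the trade-off concrete, the following is a minimal sketch of how the generator's input length could shrink when each context is replaced by a small set of embeddings. It assumes that a context of n tokens is compressed to ceil(n / rate) embeddings; the rounding rule, the function name, the token counts, and the example rates are illustrative assumptions rather than details confirmed by the text above.

```python
import math
from typing import Sequence


def compressed_input_length(
    question_tokens: int,
    context_lengths: Sequence[int],
    compression_rate: int,
) -> int:
    """Input length when each context becomes ceil(n / rate) embeddings.

    The ceil rounding is an assumption made for this sketch.
    """
    embeddings = sum(math.ceil(n / compression_rate) for n in context_lengths)
    return question_tokens + embeddings


if __name__ == "__main__":
    question = 32            # hypothetical question length in tokens
    contexts = [128] * 5     # five hypothetical retrieved passages
    print("uncompressed:", question + sum(contexts))
    for rate in (4, 16, 128):  # illustrative compression rates
        print(f"rate {rate:>3}:", compressed_input_length(question, contexts, rate))
```

With these assumed sizes, the gap between the token-form input and the compressed input widens as more contexts are added, which matches the observation that the benefit is most apparent for very long inputs.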
