DocLLM: A layout-aware generative language model for multimodal document understanding

Paper · arXiv 2401.00908 · Published December 31, 2023
Multimodal Models

Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs in that it avoids expensive image encoders and focuses exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers into a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address the irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset covering four core document intelligence tasks.
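
To make the idea of decomposing attention into disentangled text and spatial matrices more concrete, the following is a minimal sketch in plain Python. It is not the model's actual implementation. It assumes that each token carries a text hidden state and a normalized bounding box, that the attention score is a weighted sum of text-to-text, text-to-spatial, spatial-to-text, and spatial-to-spatial terms, and that values are taken from the text stream only. The function name `disentangled_attention`, the weights `lam_ts`, `lam_st`, `lam_ss`, and the projection keys are all hypothetical names chosen for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax; masked entries (-inf) get weight 0."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def disentangled_attention(text, boxes, w, lam_ts=1.0, lam_st=1.0, lam_ss=1.0):
    """Causal attention whose scores mix text and spatial terms (illustrative).

    text:  n x d text hidden states.
    boxes: n x 4 normalized bounding boxes (x0, y0, x1, y1).
    w:     dict of projections; 'q_t', 'k_t', 'v_t' are d x d,
           'q_s', 'k_s' are 4 x d (assumed shapes, not from the paper).
    """
    q_t, k_t, v_t = (matmul(text, w[k]) for k in ("q_t", "k_t", "v_t"))
    q_s, k_s = matmul(boxes, w["q_s"]), matmul(boxes, w["k_s"])

    # Four disentangled score matrices, one per modality pairing.
    tt = matmul(q_t, transpose(k_t))  # text query, text key
    ts = matmul(q_t, transpose(k_s))  # text query, spatial key
    st = matmul(q_s, transpose(k_t))  # spatial query, text key
    ss = matmul(q_s, transpose(k_s))  # spatial query, spatial key

    n, d = len(text), len(text[0])
    scale = math.sqrt(d)
    attn = []
    for i in range(n):
        row = []
        for j in range(n):
            if j > i:
                row.append(float("-inf"))  # causal mask: no attention to later tokens
            else:
                score = tt[i][j] + lam_ts * ts[i][j] + lam_st * st[i][j] + lam_ss * ss[i][j]
                row.append(score / scale)
        attn.append(softmax(row))

    # Values come from the text stream only (an assumption of this sketch).
    return matmul(attn, v_t), attn


# Toy usage: three tokens, hidden size 2, hand-picked projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
box_proj = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
weights = {"q_t": identity, "k_t": identity, "v_t": identity,
           "q_s": box_proj, "k_s": box_proj}
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
boxes = [[0.1, 0.1, 0.3, 0.2], [0.4, 0.1, 0.6, 0.2], [0.1, 0.5, 0.3, 0.6]]
out, attn = disentangled_attention(text, boxes, weights)
```

Setting the three weights to zero reduces the sketch to ordinary causal self-attention over text, which illustrates why such a decomposition can be viewed as a lightweight extension of a standard LLM: the spatial terms are added alongside the text-only score rather than replacing it.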

Introduction. Documents with rich layouts, including invoices, receipts, contracts, orders, and forms, constitute a significant portion of enterprise corpora. The automatic interpretation and analysis of these documents offer considerable advantages [1], which has spurred the development of AI-driven solutions. These visually rich documents feature complex layouts and bespoke typesetting, and often exhibit variations in templates, formats, and quality. Although Document AI (DocAI) has made tremendous progress in various tasks including extraction, classification, and question answering, there remains a significant performance gap in real-world applications. In particular, accuracy, reliability, contextual understanding, and generalization to previously unseen domains continue to be a challenge [2]. Document intelligence is inherently a multimodal problem, with both the text content and visual layout cues being critical to understanding the documents.

Discussion / Conclusion. In addition to its immediate utility in visually rich document understanding tasks, we posit that DocLLM offers an opportunity to change the landscape of generative pre-training by enabling language models to go beyond next-token prediction in plain text settings. By accommodating complex layout structures, DocLLM allows for e-books, e-publications, and other documents with rich layouts to be incorporated into the pre-training corpus without requiring extensive preprocessing. The spatial-aware reading approach enables the model to perceive the document as inherently structured knowledge. Moreover, the multi-page awareness of both page breaks and document boundaries enhances the model’s ability to comprehend documents of various lengths. This addresses the limitations of previous smaller multimodal models (which are mainly for single-page documents) and the existing multimodal LLMs (which are primarily designed for images). In supervised instruction tuning, we can adhere to the established practices used in other works, based on desired outputs such as text or images.