Context-PEFT: Efficient Multi-Modal, Multi-Task Fine-Tuning
This paper introduces a novel Parameter-Efficient Fine-Tuning (PEFT) framework for multi-modal, multi-task transfer learning with pre-trained language models. PEFT techniques such as LoRA, BitFit and IA3 have demonstrated performance comparable to full fine-tuning of pre-trained models on specific downstream tasks, while requiring significantly fewer trainable parameters and less GPU memory. However, multi-modal fine-tuning often requires either architectural modifications or full fine-tuning. To address this, we propose Context-PEFT, which learns different groups of adaptor parameters based on the token’s domain. This approach enables LoRA-like weight injection without requiring additional architectural changes. Our method is evaluated on the COCO captioning task, where it outperforms full fine-tuning under similar data constraints while offering a substantially more parameter-efficient and computationally economical solution.
Introduction. Recent advancements in Parameter-Efficient Fine-Tuning (PEFT), exemplified by techniques such as LoRA, BitFit, and IA3 [16, 21, 40], have delivered performance on par with full fine-tuning of pre-trained models on specific downstream tasks, while significantly reducing the number of trainable parameters and GPU memory consumption. However, multi-modal transfer learning often requires either architectural adjustments, such as Multi-Modal Causal Attention [39], or full fine-tuning, as demonstrated by LLaVA [23]. In response to these challenges, we introduce a novel PEFT framework, Context-PEFT, which learns different groups of adaptor parameters based on the token’s domain or purpose. This approach requires no further architectural modifications and is adaptable to various PEFT techniques.
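To make the idea of context-specific adaptor groups concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a LoRA-style low-rank update applied to a single frozen linear layer, with one adaptor pair selected per token according to its context label. The class name `ContextLoRALinear`, the context labels `"image"` and `"text"`, and all dimensions are hypothetical and chosen only for illustration. Plain Python lists are used in place of a tensor library to keep the example self-contained.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Create a rows x cols matrix of small Gaussian values."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ContextLoRALinear:
    """Hypothetical sketch: a frozen weight W plus one low-rank
    adaptor pair (A, B) per token context."""

    def __init__(self, d_in, d_out, rank, contexts, seed=0):
        rng = random.Random(seed)
        # Stand-in for a frozen pre-trained weight; it is never updated.
        self.W = rand_matrix(d_out, d_in, rng)
        # One adaptor group per context. A is random and B is zero, so each
        # adaptor starts as a no-op (the usual LoRA initialisation).
        self.adaptors = {
            context: {
                "A": rand_matrix(rank, d_in, rng),
                "B": [[0.0] * rank for _ in range(d_out)],
            }
            for context in contexts
        }

    def forward(self, tokens, token_contexts):
        """Compute W x + B_c (A_c x) per token, where c is the token's context."""
        outputs = []
        for x, context in zip(tokens, token_contexts):
            base = matvec(self.W, x)
            adaptor = self.adaptors[context]
            delta = matvec(adaptor["B"], matvec(adaptor["A"], x))
            outputs.append([b + d for b, d in zip(base, delta)])
        return outputs

    def trainable_params(self):
        """Count adaptor parameters only; W is frozen and excluded."""
        return sum(
            len(a["A"]) * len(a["A"][0]) + len(a["B"]) * len(a["B"][0])
            for a in self.adaptors.values()
        )


if __name__ == "__main__":
    layer = ContextLoRALinear(d_in=8, d_out=8, rank=2, contexts=["image", "text"])
    sequence = [[1.0] * 8, [0.5] * 8, [0.25] * 8]
    labels = ["image", "image", "text"]  # assumed per-token context labels
    hidden = layer.forward(sequence, labels)
    print(len(hidden), "tokens processed;", layer.trainable_params(), "trainable parameters")
```

In this sketch the layer's interface is unchanged: the only extra input is a per-token context label, which is consistent with the claim that no architectural modification is needed. Updating the `"image"` adaptor leaves the outputs for `"text"` tokens untouched, since each token only ever passes through the adaptor for its own context.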
Discussion / Conclusion. The primary application for Context-PEFT is in compute- and data-constrained environments: for example, when there is a low volume of training data, when GPU memory restrictions necessitate smaller models and/or fewer trainable parameters, when limited training time may prevent full fine-tuning from converging to a good minimum, or when models must be deployed for inference on low-resource hardware such as consumer-grade workstations, mobile devices and embedded systems. However, if these constraints are not a concern, it is unlikely that Context-PEFT will outperform full fine-tuning. This is likely because the embedding spaces of SOTA language models are often wide enough to represent multiple modalities without the need for context-specific weights when trained to convergence on a large volume of training data, as shown by Qwen-VL and LLaVA [2, 23]. In this paper, we have introduced Context-PEFT as a multi-modal, multi-task fine-tuning framework that efficiently adapts pre-trained language models to other modalities in a data- and parameter-efficient manner by learning multiple groups of adaptor parameters that are specific to each token’s domain.