Deep Language Networks: Joint Prompt Training of Stacked LLMs using Variational Inference
We view large language models (LLMs) as stochastic language layers in a network, where the learnable parameters are the natural language prompts at each layer. We stack two such layers, feeding the output of one layer to the next. We call the stacked architecture a Deep Language Network (DLN). We first show how to effectively perform prompt optimization for a 1-Layer language network (DLN-1). We then show how to train 2-layer DLNs (DLN-2), where two prompts must be learnt. We consider the output of the first layer as a latent variable to marginalize, and devise a variational inference algorithm for joint prompt training. A DLN-2 reaches higher performance than a single layer, sometimes comparable to few-shot GPT-4 even when each LLM in the network is smaller and less powerful. The DLN code is open source.
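The latent-variable view can be made concrete with a short derivation. The notation here is an assumption introduced for illustration: x is the input, y the target, h the text produced by the first layer, \pi_0 and \pi_1 the two prompts, and q any approximate posterior over h. The bound below is the standard evidence lower bound obtained from Jensen's inequality; it is a sketch of the kind of objective a variational algorithm can optimize, not necessarily the exact objective used for DLN-2.

```latex
\begin{align}
\log p(y \mid x; \pi_0, \pi_1)
  &= \log \sum_{h} p_{\mathrm{LM}}(y \mid h; \pi_1)\, p_{\mathrm{LM}}(h \mid x; \pi_0) \\
  &\ge \sum_{h} q(h) \log
     \frac{p_{\mathrm{LM}}(y \mid h; \pi_1)\, p_{\mathrm{LM}}(h \mid x; \pi_0)}{q(h)} .
\end{align}
```

The sum ranges over all possible strings h, so the exact marginal is intractable; this is what motivates replacing it with a bound that can be estimated from a few sampled hidden texts.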
Introduction. The size of large language models (LLMs) has grown significantly over the last few years, mainly because of emergent capabilities [6, 34], but at considerable technical and societal costs [51, 2, 4]. Recent efforts have focused either on learning smaller models matching the abilities of larger ones on some tasks using distillation [46, 38, 32, 15], or offloading part of the computation to other dedicated components [31, 25, 28, 21]. In the latter case, this is done through carefully crafted instructions to retrieve the necessary information from these additional modules [50, 44, 8, 56, 27]. Our work falls in the general research direction of building modular systems. In particular, we view LLMs as language layers in a Deep Language Network (DLN). The learnable parameters of each layer are the associated natural language prompts, and the LLM at a given layer receives as input the output of the LLM at the previous layer, as in a traditional deep network. Each layer uses a template to organize the conditioning information into a string before calling the LLM (see Fig. 1).
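The layer abstraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the released DLN code: the names `LanguageLayer`, `dln_forward`, and `echo_llm`, as well as the template layout, are hypothetical, and the LLM is replaced by a deterministic stub so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical LLM interface: any function mapping an input string to an output string.
LLM = Callable[[str], str]


@dataclass
class LanguageLayer:
    """One DLN layer: a learnable prompt, a fixed template, and an LLM call."""
    prompt: str  # the layer's learnable parameter (natural language)
    llm: LLM
    template: str = "{prompt}\n\nInput: {x}\nOutput:"  # assumed template layout

    def forward(self, x: str) -> str:
        # Organize the conditioning information into one string, then call the LLM.
        return self.llm(self.template.format(prompt=self.prompt, x=x))


def dln_forward(layers: List[LanguageLayer], x: str) -> str:
    """Feed the output of each layer to the next, as in a feed-forward network."""
    h = x
    for layer in layers:
        h = layer.forward(h)
    return h


def echo_llm(text: str) -> str:
    """Deterministic stand-in for a real LLM so the sketch runs offline."""
    return f"<{text}>"


# A 2-layer network: the prompts below are illustrative initial values only.
layers = [
    LanguageLayer(prompt="Decompose the problem.", llm=echo_llm),
    LanguageLayer(prompt="Answer using the decomposition.", llm=echo_llm),
]
output = dln_forward(layers, "2 + 2 = ?")
```

In this sketch, training would amount to rewriting the `prompt` strings while the template and the LLM stay fixed; swapping `echo_llm` for a real model call changes nothing else in the forward pass.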
Discussion / Conclusion. At a time when new techniques to engineer prompts abound, we believe DLNs could serve as an effective framework to systematically characterize these techniques and identify new opportunities for prompt optimization. Indeed, many prompt tuning techniques could be considered special instances of DLNs. We demonstrated how CoT can be seen as a DLN-2 with a residual connection. Generated Knowledge Prompting [22] could be considered as a fixed forward-only DLN-2 where, in the first layer, an LLM generates related knowledge, and in the second layer, another LLM takes the generated knowledge as input and generates the final answer. Prompting techniques like ReAct [56], Reflexion [41], and Self-Consistency [48] could all be ensembles of DLN-1s with different prompt initializations. Our hope is that the DLN framework could inspire new prompt learning techniques by considering networks with different widths, depths, and connections. Although we only tested 1-layer and 2-layer language networks so far, we already show that the performance of smaller LLMs can be boosted when stacked and prompted properly.
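To illustrate what a residual connection means in this setting, the sketch below shows a 2-layer forward pass in which the second layer sees both the original input and the first layer's output. The function name `dln2_residual`, the string layout, the example prompts, and the recording stub are all hypothetical; a real LLM call would replace `recording_llm`.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: string in, string out


def dln2_residual(x: str, prompt_1: str, prompt_2: str, llm: LLM) -> str:
    """Forward pass of a 2-layer network whose second layer also sees the input x."""
    # Layer 1: produce an intermediate text h (for CoT, a reasoning chain).
    h = llm(f"{prompt_1}\n\nQuestion: {x}")
    # Layer 2: the residual connection passes x alongside h.
    return llm(f"{prompt_2}\n\nQuestion: {x}\nReasoning: {h}")


calls: List[str] = []


def recording_llm(text: str) -> str:
    """Offline stand-in for an LLM: logs the string it received and returns a tag."""
    calls.append(text)
    return f"out{len(calls)}"


answer = dln2_residual(
    "What is 3 * 4?",
    "Let's think step by step.",  # illustrative first-layer prompt
    "Give the final answer.",     # illustrative second-layer prompt
    recording_llm,
)
```

Dropping `Question: {x}` from the second call would give a purely feed-forward variant, in which the second layer conditions on the intermediate text alone.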