Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways

Paper · arXiv 2310.17591 · Published October 26, 2023
Training and Fine-Tuning · NLP and Linguistics

We present Lil-Bevo, our submission to the BabyLM Challenge. We pretrained our masked language models with three ingredients: an initial pretraining with music data, training on shorter sequences before training on longer ones, and masking specific tokens to target some of the BLiMP subtasks. Overall, our baseline models performed above chance, but far below the performance levels of larger LLMs trained on more data. We found that training on short sequences performed better than training on longer sequences. Pretraining on music may help performance marginally, but, if so, the effect seems small. Our targeted Masked Language Modeling augmentation did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks that we were targeting (e.g., Negative Polarity Items). Training performant LLMs on small amounts of data is a difficult but potentially informative task. While some of our techniques showed some promise, more work is needed to explore whether they can improve performance beyond the modest gains reported here. Our code and models are available online.
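The targeted masking idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `targeted_mask`, the word-level tokens, the list of target words, and the masking probabilities are all assumptions chosen for illustration. The sketch simply masks tokens from a target set (here, a few negative polarity items) more often than other tokens.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption, not tied to any tokenizer)


def targeted_mask(tokens, targets, p_target=0.8, p_other=0.15, rng=None):
    """Mask tokens in `targets` with probability p_target, all others with p_other.

    Returns (masked_tokens, labels), where labels holds the original token at
    masked positions and None elsewhere. All probabilities are hypothetical.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        # Targeted words get a higher masking probability than ordinary words.
        p = p_target if tok.lower() in targets else p_other
        if rng.random() < p:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    npi_words = {"any", "ever", "yet"}  # hypothetical target list
    sentence = "I do not have any apples".split()
    print(targeted_mask(sentence, npi_words, rng=random.Random(0)))
```

Setting `p_target=1.0` and `p_other=0.0` makes the behavior deterministic, which is convenient for checking that only the targeted words are masked.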

Introduction. Large Language Models (LLMs) generate complex and largely grammatical strings and display impressive performance with structures traditionally thought to require abstract and hierarchical syntax (Linzen et al., 2016; Linzen and Baroni, 2021; Wilcox et al., 2022; Futrell and Levy, 2019). They have achieved human-like performance at a wide range of natural language tasks (Bubeck et al., 2023; Frank, 2023), particularly those having to do with linguistic form (Mahowald et al., 2023). This state of affairs has led to claims that such models should be taken seriously as cognitive models of human language (Piantadosi, 2023; Baroni, 2022; Frank, 2023), in line with claims from the neuroscience literature to “take mechanistic abstraction seriously” (Cao and Yamins, 2021). One reason that has been posited not to take LLMs seriously as cognitive models, though, is the immense amount of data they are trained on relative to what a human child is exposed to (Warstadt and Bowman, 2022; van Schijndel et al., 2019).

Discussion / Conclusion. Overall, we found that, for BabyLMs, sequence length matters, music pretraining may help a little (though the effect may be spurious), and targeted MLM training may help on specific tasks. These results are far from exhaustive, and we see a number of areas for future improvement using these methods. To fully understand the role of initial pretraining on music, one could construct a series of synthetically generated music datasets, with varying degrees of complexity. Would pretraining on music that is more “language-like” (Lerdahl, 1996) in some sense improve performance on downstream tasks? Perhaps there is a principled way to interpolate between music and language, using the same kind of data format (MIDI). At one end of the spectrum one would have MAESTRO, and at the other end, text that has been encoded into MIDI events. Related to the use of varying sequence lengths, future work could consider improvements in data preprocessing and batching; in particular, knowing the beginning and ending of coherent chunks of text (e.g., dialogues or documents) could help improve the model.
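The two preprocessing ideas in this paragraph, training on short sequences before long ones and respecting document boundaries when batching, can be sketched together. The code below is an illustrative assumption rather than the authors' pipeline: the helper names `chunk_documents` and `short_to_long_schedule` are hypothetical, documents are represented as plain lists of token ids, and the sequence lengths are toy values.

```python
def chunk_documents(docs, max_len):
    """Split each tokenized document into chunks of at most max_len tokens.

    A chunk never spans two documents, so every training sequence comes from
    one coherent piece of text.
    """
    chunks = []
    for doc in docs:
        for start in range(0, len(doc), max_len):
            chunks.append(doc[start:start + max_len])
    return chunks


def short_to_long_schedule(docs, lengths=(4, 8)):
    """Return one (max_len, chunks) training phase per length, shortest first.

    The lengths here are toy values (assumption); a real run would use much
    longer sequences.
    """
    return [(n, chunk_documents(docs, n)) for n in sorted(lengths)]


if __name__ == "__main__":
    toy_docs = [[1, 2, 3, 4, 5], [6, 7]]  # two toy "documents" of token ids
    for max_len, chunks in short_to_long_schedule(toy_docs):
        print(max_len, chunks)
```

In this sketch a short final chunk is kept as-is; whether to pad, drop, or merge such remainders is a design choice the source does not specify.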