The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
Language models have shown impressive performance on tasks within their training distribution, but often struggle with novel problems requiring complex reasoning. We investigate the effectiveness of test-time training (TTT), in which model parameters are temporarily updated during inference using a loss derived from the input data, as a mechanism for improving models’ reasoning capabilities, using the Abstraction and Reasoning Corpus (ARC) as a benchmark. Through systematic experimentation, we identify three crucial components for successful TTT: (1) initial fine-tuning on similar tasks, (2) auxiliary task format and augmentations, and (3) per-instance training. TTT significantly improves performance on ARC tasks, achieving up to a 6× improvement in accuracy compared to base fine-tuned models. Applying TTT to an 8B-parameter language model, we achieve 53% accuracy on ARC’s public validation set, improving the state of the art by nearly 25% for public and purely neural approaches. By ensembling our method with recent program generation approaches, we obtain a state-of-the-art public validation accuracy of 61.9%, matching the average human score.
Introduction. Large-scale neural language models (LMs) excel at performing tasks that occur in their training data, and often at elementary variations or compositions of those tasks (Brown et al., 2020; Todd et al., 2024). Given natural language task specifications or a small number of examples, LMs often successfully infer the desired task and produce an appropriate output. But can LMs also solve new problems, involving non-trivial reasoning, planning, or string manipulation of a kind very different from their pre-training data? This question is central to understanding the capacity of current AI systems for novel skill acquisition, which has been proposed as a key measure of intelligence (Chollet, 2019).
Discussion / Conclusion. In this work, we investigate test-time training and demonstrate that it can significantly improve LM performance on the popular ARC dataset. We find that learning task-specific LoRA adapters and generating augmented test-time datasets using geometric transformations are crucial. We also develop an augmented inference pipeline that uses invertible transformations to generate multiple predictions and then uses self-consistency to select the best candidates. Our overall pipeline applies multiple test-time computation methods, with each component contributing positively. This suggests that not only can test-time compute improve LM performance, but different test-time methods can also complement one another. Our TTT pipeline, combined with an existing method (BARC), achieves state-of-the-art results on the ARC public set and performs comparably to an average human. Our findings suggest that test-time methods could play a pivotal role in advancing the next generation of LMs.
Evaluation Framework. The ARC challenge maintains separate public and private leaderboards, with the private evaluation conducted on hidden tasks.
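To make the augmented inference step from the discussion more concrete, the following is a minimal sketch of the idea: run the predictor on several invertible geometric views of a task, map each prediction back through the inverse transformation, and keep the most frequent candidates. It is an illustration under stated assumptions, not the pipeline used in this work. The names `augmented_vote`, `VIEWS`, and `toy_predict` are hypothetical, the flat majority vote is a simplification of self-consistency selection, the default of two returned candidates is an assumption, and the toy predictor merely stands in for a test-time-trained LM.

```python
from collections import Counter
from typing import Callable, List, Tuple

Grid = List[List[int]]
Demo = Tuple[Grid, Grid]

def identity(g: Grid) -> Grid:
    return [list(row) for row in g]

def rotate90(g: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]

def rotate270(g: Grid) -> Grid:
    """Rotate a grid 90 degrees counter-clockwise (inverse of rotate90)."""
    return [list(row) for row in zip(*g)][::-1]

def flip_h(g: Grid) -> Grid:
    """Mirror a grid left to right (its own inverse)."""
    return [list(row[::-1]) for row in g]

def transpose(g: Grid) -> Grid:
    """Swap rows and columns (its own inverse)."""
    return [list(row) for row in zip(*g)]

# Hypothetical set of (forward, inverse) view pairs; any invertible map works.
VIEWS = [(identity, identity), (rotate90, rotate270),
         (flip_h, flip_h), (transpose, transpose)]

def augmented_vote(predict: Callable[[List[Demo], Grid], Grid],
                   demos: List[Demo], test_input: Grid,
                   top_k: int = 2) -> List[Grid]:
    """Predict under each view, undo the view, and majority-vote the results."""
    votes: Counter = Counter()
    for forward, inverse in VIEWS:
        view_demos = [(forward(x), forward(y)) for x, y in demos]
        prediction = inverse(predict(view_demos, forward(test_input)))
        votes[tuple(map(tuple, prediction))] += 1  # hashable key for counting
    return [[list(row) for row in grid] for grid, _ in votes.most_common(top_k)]

def toy_predict(demos: List[Demo], x: Grid) -> Grid:
    """Stand-in for a model: learns a cell-wise recolouring from the demos,
    but deliberately fails (echoes its input) when the top-left cell is 2."""
    if x[0][0] == 2:
        return [list(row) for row in x]
    mapping = {a: b for gx, gy in demos
               for rx, ry in zip(gx, gy) for a, b in zip(rx, ry)}
    return [[mapping.get(cell, cell) for cell in row] for row in x]

demos = [([[1, 2]], [[5, 7]]), ([[0, 1]], [[0, 5]])]
candidates = augmented_vote(toy_predict, demos, [[0, 1], [2, 0]])
print(candidates)
```

In this toy run the predictor fails under one of the four views, and the vote still ranks the answer agreed on by the other three views first. This is the sense in which predictions from multiple invertible views can be combined by agreement.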