Can Large Language Models Transform Computational Social Science?


Large language models (LLMs) are capable of successfully performing many language processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify and explain social phenomena like persuasiveness and political ideology, then LLMs could augment the computational social science (CSS) pipeline in important ways. This work provides a road map for using LLMs as CSS tools. Towards this end, we contribute a set of prompting best practices and an extensive evaluation pipeline to measure the zero-shot performance of 13 language models on 25 representative English CSS benchmarks. On taxonomic labeling tasks (classification), LLMs fail to outperform the best fine-tuned models but still achieve fair levels of agreement with humans. On free-form coding tasks (generation), LLMs produce explanations that often exceed the quality of crowdworkers’ gold references. We conclude that the performance of today’s LLMs can augment the CSS research pipeline in two ways: (1) serving as zero-shot data annotators on human annotation teams, and (2) bootstrapping challenging creative generation tasks (e.g., explaining the underlying attributes of a text). In summary, LLMs are poised to meaningfully participate in social science analysis in partnership with humans.
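To make the phrase "fair levels of agreement with humans" concrete, the sketch below computes a chance-corrected agreement score between human and model labels. It assumes Cohen's kappa as the statistic, which this excerpt does not confirm. The function name, the label values, and the toy data are all hypothetical illustrations, not results or code from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label lists.

    Assumption: Cohen's kappa is used here purely as an illustration;
    the excerpt does not state which agreement statistic the paper uses.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if both annotators labeled independently,
    # each according to their own marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: human gold labels vs. zero-shot model labels.
human_labels = ["left", "left", "right", "right", "left", "right", "left", "right"]
llm_labels = ["left", "right", "right", "right", "left", "left", "left", "right"]

kappa = cohens_kappa(human_labels, llm_labels)
print(f"kappa = {kappa:.2f}")
```

On this toy data the two annotators agree on 6 of 8 items, and chance agreement is 0.5, so kappa is 0.5. A chance-corrected score is useful here because raw accuracy can look high on imbalanced label sets even when the model mostly predicts the majority class.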

Introduction. The most surprising scientific changes tend to arrive, not from accumulated facts and discoveries, but from the invention of new tools and methodologies that trigger “paradigm shifts” (Kuhn 1962). Computational social science (CSS) (Lazer et al. 2020) was born from the immense growth of human data traces on the Web and the rapid acceleration of computational resources for processing this data. These developments allowed researchers to study language and behavior at an unprecedented scale (Lazer et al. 2009), with both global and fine-grained observations (Golder and Macy 2014). From the early days of content dictionaries (Stone, Dunphy, and Smith 1966), statistical text analysis facilitated CSS research by providing structure to non-numeric data. Now, large language models (LLMs) may be poised to change the CSS landscape by providing such capabilities without custom training data. The goal of this work is to assess the degree to which LLMs can transform CSS.

Discussion / Conclusion. This work presents a comprehensive evaluation of LLMs on a representative suite of CSS tasks. We contribute a robust evaluation pipeline, which allows us to benchmark performance alongside supervised baselines on a wide range of tasks. Our research questions and empirical results are designed to help CSS researchers make decisions about when LLMs are suitable and which models are best suited for different research needs. In summary, we find that LLMs can augment but not entirely replace the traditional CSS research pipeline. More concretely, we make the following recommendations to CSS researchers: Social scientists are not often interested in classification labels or generative codes merely for their own sake. Labeled text is almost always used to explain a wider phenomenon using downstream inferential statistics such as regression.

Lines of inquiry this paper opens 24

Research framings built by reading the notes related to this paper — the questions it feeds into.

How does AI-generated content transformation affect public discourse quality?

Can AI-generated outputs constitute genuine knowledge or valid claims?

How can humans calibrate appropriate trust in AI systems?

Does AI fluency substitute for verifiable accuracy in human judgment?

Why do human raters miss factual errors that domain experts catch?

What mechanisms enable AI systems to generate and spread false beliefs?

Does conversational format create illusions of genuine AI communication?

What does a receiver project onto AI that the system never performed?

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

What factors beyond surface content determine how readers extract meaning differently?

Does removing information about who wrote something change how we interpret it?

How do chatbots affect human self-disclosure and emotional engagement?

Do people who might cheat deliberately choose machines to avoid lying to humans?

Why do reasoning models fail at systematic problem-solving and search?

Does sentence-level granularity capture enough structure for complex reasoning tasks?
