Can large language models help predict results from a complex behavioural science study?

Abstract
We tested whether large language models (LLMs) can help predict results from a complex behavioural science experiment. In study 1, we investigated the performance of the widely used LLMs GPT-3.5 and GPT-4 in forecasting the empirical findings of a large-scale experimental study of emotions, gender, and social perceptions. We found that GPT-4, but not GPT-3.5, matched the performance of a cohort of 119 human experts, with correlations of 0.89 (GPT-4), 0.07 (GPT-3.5) and 0.87 (human experts) between aggregated forecasts and realized effect sizes. In study 2, providing participants from a university subject pool the opportunity to query a GPT-4-powered chatbot significantly increased the accuracy of their forecasts. Results indicate promise for artificial intelligence (AI) to help anticipate, at scale and minimal cost, which claims about human behaviour will find empirical support and which ones will not. Our discussion focuses on avenues for human-AI collaboration in science.
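The accuracy measure quoted in the abstract, a correlation between aggregated forecasts and realized effect sizes, can be illustrated with a minimal sketch. The abstract does not say how forecasts were aggregated or which correlation coefficient was used, so the simple mean across forecasters and the Pearson coefficient below are assumptions. The function names and all numbers are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import mean


def aggregate_forecasts(forecasts_by_forecaster):
    """Average the forecasts for each effect across forecasters (assumed aggregation rule)."""
    n_effects = len(forecasts_by_forecaster[0])
    return [mean(f[i] for f in forecasts_by_forecaster) for i in range(n_effects)]


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical illustration only: three forecasters, four effects.
forecasts = [
    [0.10, 0.30, 0.50, 0.20],
    [0.20, 0.40, 0.40, 0.10],
    [0.00, 0.20, 0.60, 0.30],
]
realized_effects = [0.05, 0.25, 0.55, 0.15]  # made-up effect sizes

aggregated = aggregate_forecasts(forecasts)
accuracy = pearson_r(aggregated, realized_effects)
print(f"correlation between aggregated forecasts and realized effects: {accuracy:.2f}")
```

A value near 1 means the aggregated forecasts rank and scale the effects much as the data did, while a value near 0 means they carry little information about the realized effects.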
Keywords
forecasting, large language models, meta-research
Citation
Lippert S, Dreber A, Johannesson M, Tierney W, Cyrus-Lai W, Uhlmann EL, Emotion Expression Collaboration, Pfeiffer T. (2024). Can large language models help predict results from a complex behavioural science study? R Soc Open Sci. 11(9), 240682.