Browsing by Author "Pfeiffer T"
Now showing 1 - 6 of 6
- Item: Can large language models help predict results from a complex behavioural science study? (The Royal Society, 2024-09)
  Lippert S; Dreber A; Johannesson M; Tierney W; Cyrus-Lai W; Uhlmann EL; Emotion Expression Collaboration; Pfeiffer T
  We tested whether large language models (LLMs) can help predict results from a complex behavioural science experiment. In study 1, we investigated the performance of the widely used LLMs GPT-3.5 and GPT-4 in forecasting the empirical findings of a large-scale experimental study of emotions, gender, and social perceptions. We found that GPT-4, but not GPT-3.5, matched the performance of a cohort of 119 human experts, with correlations of 0.89 (GPT-4), 0.07 (GPT-3.5) and 0.87 (human experts) between aggregated forecasts and realized effect sizes. In study 2, giving participants from a university subject pool the opportunity to query a GPT-4-powered chatbot significantly increased the accuracy of their forecasts. The results indicate promise for artificial intelligence (AI) to help anticipate, at scale and at minimal cost, which claims about human behaviour will find empirical support and which will not. Our discussion focuses on avenues for human-AI collaboration in science.
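As a rough illustration of the headline metric in this item (the correlation between aggregated forecasts and realized effect sizes), the sketch below averages per-effect forecasts across forecasters and computes a Pearson correlation. The arrays and the function name `forecast_correlation` are hypothetical placeholders, not the authors' code or data.

```python
# Minimal sketch (not the authors' code): correlate aggregated forecasts
# with realized effect sizes. All values below are hypothetical placeholders.
import numpy as np

def forecast_correlation(forecasts: np.ndarray, realized: np.ndarray) -> float:
    """Aggregate per-effect forecasts (mean over forecasters) and return the
    Pearson correlation with the realized effect sizes."""
    aggregated = forecasts.mean(axis=0)  # shape: (n_effects,)
    return float(np.corrcoef(aggregated, realized)[0, 1])

# rows = forecasters (or repeated LLM runs), columns = effects
forecasts = np.array([[0.30, 0.10, 0.55],
                      [0.25, 0.05, 0.60]])
realized = np.array([0.28, 0.02, 0.50])  # observed effect sizes
print(forecast_correlation(forecasts, realized))
```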
- Item: Examining the generalizability of research findings from archival data (PNAS, 2022-07-26)
  Delios A; Clemente EG; Wu T; Tan H; Wang Y; Gordon M; Viganola D; Chen Z; Dreber A; Johannesson M; Pfeiffer T; Generalizability Tests Forecasting Collaboration; Uhlmann EL
  This initiative systematically examined the extent to which a large set of archival research findings generalizes across contexts. We repeated the key analyses for 29 original strategic management effects in the same context (direct reproduction) as well as in 52 novel time periods and geographies; 45% of the direct reproductions returned results matching the original reports, as did 55% of tests in different spans of years and 40% of tests in novel geographies. Some original findings were associated with multiple new tests. Reproducibility was the best predictor of generalizability: for the findings that proved directly reproducible, 84% emerged in other available time periods and 57% emerged in other geographies. Overall, only limited empirical evidence emerged for context sensitivity. In a forecasting survey, independent scientists were able to anticipate which effects would find support in tests on new samples.
- Item: Forecasting the publication and citation outcomes of COVID-19 preprints (The Royal Society, 2022-09)
  Gordon M; Bishop M; Chen Y; Dreber A; Goldfedder B; Holzmeister F; Johannesson M; Liu Y; Tran L; Twardy C; Wang J; Pfeiffer T
  Many publications on COVID-19 were released on preprint servers such as medRxiv and bioRxiv. It is unknown how reliable these preprints are, and which ones will eventually be published in scientific journals. In this study, we use crowdsourced human forecasts to predict publication outcomes and future citation counts for a sample of 400 preprints with high Altmetric scores. Most of these preprints were published within 1 year of upload to a preprint server (70%), with a considerable fraction (45%) appearing in a high-impact journal with a journal impact factor of at least 10. On average, the preprints received 162 citations within the first year. We found that forecasters can predict whether preprints will be published after 1 year and whether the publishing journal has high impact. Forecasts are also informative with respect to Google Scholar citations within 1 year of upload. For both types of assessment, we found statistically significant positive correlations between forecasts and observed outcomes. While the forecasts can help provide a preliminary assessment of preprints at a faster pace than traditional peer review, it remains to be investigated whether such an assessment is suited to identifying methodological problems in preprints.
- Item: On the trajectory of discrimination: A meta-analysis and forecasting survey capturing 44 years of field experiments on gender and hiring decisions (Elsevier Inc, 2023-11)
  Schaerer M; Plessis CD; Nguyen MHB; van Aert RCM; Tiokhin L; Lakens D; Clemente EG; Pfeiffer T; Dreber A; Johannesson M; Clark CJ; Uhlmann EL; Abraham AT; Adamus M; Akinci C; Alberti F; Alsharawy AM; Alzahawi S; Anseel F; Arndt F; Balkan B; Baskin E; Bearden CE; Benotsch EG; Bernritter S; Black SR; Bleidorn W; Boysen AP; Brienza JP; Brown M; Brown SEV; Brown JW; Buckley J; Buttliere B; Byrd N; Cígler H; Capitan T; Cherubini P; Chong SY; Ciftci EE; Conrad CD; Conway P; Costa E; Cox JA; Cox DJ; Cruz F; Dawson IGJ; Demiral EE; Derrick JL; Doshi S; Dunleavy DJ; Durham JD; Elbaek CT; Ellis DA; Ert E; Espinoza MP; Füllbrunn SC; Fath S; Furrer R; Fiala L; Fillon AA; Forsgren M; Fytraki AT; Galarza FB; Gandhi L; Garrison SM; Geraldes D; Ghasemi O; Gjoneska B; Gothilander J; Grühn D; Grieder M; Hafenbrädl S; Halkias G; Hancock R; Hantula DA; Harton HC; Hoffmann CP; Holzmeister F; Hoŕak F; Hosch A-K; Imada H; Ioannidis K; Jaeger B; Janas M; Janik B; Pratap KC R; Keel PK; Keeley JW; Keller L; Kenrick DT; Kiely KM; Knutsson M; Kovacheva A; Kovera MB; Krivoshchekov V; Krumrei-Mancuso EJ; Kulibert D; Lacko D; Lemay EP
  A preregistered meta-analysis, including 244 effect sizes from 85 field audits and 361,645 individual job applications, tested for gender bias in hiring practices in female-stereotypical, gender-balanced, and male-stereotypical jobs from 1976 to 2020. A “red team” of independent experts was recruited to increase the rigor and robustness of our meta-analytic approach. A forecasting survey further examined whether laypeople (n = 499 nationally representative adults) and scientists (n = 312) could predict the results. Forecasters correctly anticipated reductions in discrimination against female candidates over time. However, both scientists and laypeople overestimated the continuation of bias against female candidates. Instead, selection bias in favor of male over female candidates was eliminated and, if anything, slightly reversed in sign starting in 2009 for mixed-gender and male-stereotypical jobs in our sample. Forecasters further failed to anticipate that discrimination against male candidates for stereotypically female jobs would remain stable across the decades.
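This item pools 244 effect sizes; the authors' preregistered models are more elaborate than what follows, but a minimal DerSimonian-Laird random-effects pooling illustrates the basic idea of combining effect sizes weighted by inverse variance plus a between-study variance term. The effect sizes, variances, and function name below are hypothetical, and Python is used here purely for illustration rather than the authors' actual tooling.

```python
# Minimal DerSimonian-Laird random-effects meta-analysis sketch
# (hypothetical data, not the paper's analysis).
import numpy as np

def dersimonian_laird(effects: np.ndarray, variances: np.ndarray):
    """Pool effect sizes with a DerSimonian-Laird random-effects model.
    Returns the pooled estimate, its standard error, and tau^2."""
    w = 1.0 / variances                               # fixed-effect weights
    fixed = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - fixed) ** 2)            # Cochran's Q
    df = len(effects) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                     # between-study variance
    w_star = 1.0 / (variances + tau2)                 # random-effects weights
    pooled = np.sum(w_star * effects) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return pooled, se, tau2

# Hypothetical effect sizes (e.g. log odds ratios) and sampling variances
effects = np.array([0.12, -0.05, 0.20, 0.02])
variances = np.array([0.010, 0.015, 0.020, 0.012])
print(dersimonian_laird(effects, variances))
```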
- Item: Predicting replicability—Analysis of survey and prediction market data from large-scale forecasting projects (Public Library of Science (PLoS), 2021-04-14)
  Gordon M; Viganola D; Dreber A; Johannesson M; Pfeiffer T
  The reproducibility of published research has become an important topic in science policy. A number of large-scale replication projects have been conducted to gauge the overall reproducibility in specific academic fields. Here, we present an analysis of data from four studies which sought to forecast the outcomes of replication projects in the social and behavioural sciences, using human experts who participated in prediction markets and answered surveys. Because the number of findings replicated and predicted in each individual study was small, pooling the data offers an opportunity to evaluate hypotheses regarding the performance of prediction markets and surveys at higher power. In total, peer beliefs were elicited for the replication outcomes of 103 published findings. We find there is information within the scientific community about the replicability of scientific findings, and that both surveys and prediction markets can be used to elicit and aggregate this information. Our results show prediction markets can determine the outcomes of direct replications with 73% accuracy (n = 103). Both the prediction market prices and the average survey responses are correlated with outcomes (0.581 and 0.564 respectively, both p < .001). We also found a significant relationship between p-values of the original findings and replication outcomes. The dataset is made available through the R package “pooledmaRket” and can be used to further study community beliefs about replication outcomes as elicited in the surveys and prediction markets.
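The pooled data are distributed through the R package “pooledmaRket”; the sketch below is only an illustration of the evaluation metrics reported in the abstract (classification accuracy when market prices above 0.5 are read as predicting replication, and Pearson correlations of market prices and survey means with binary replication outcomes). The example data and the function name `evaluate_forecasts` are hypothetical.

```python
# Illustrative evaluation of forecasts against binary replication outcomes
# (hypothetical data, not drawn from the pooledmaRket package).
import numpy as np

def evaluate_forecasts(market_prices, survey_means, replicated):
    """Market prices and survey means lie in [0, 1]; `replicated` is 0/1.
    Returns accuracy (price > 0.5 => predict replication) and Pearson
    correlations of both elicitation methods with the outcomes."""
    prices = np.asarray(market_prices)
    surveys = np.asarray(survey_means)
    outcomes = np.asarray(replicated)
    accuracy = np.mean((prices > 0.5) == (outcomes == 1))
    r_market = np.corrcoef(prices, outcomes)[0, 1]
    r_survey = np.corrcoef(surveys, outcomes)[0, 1]
    return accuracy, r_market, r_survey

# Hypothetical placeholder data for a handful of findings
print(evaluate_forecasts([0.8, 0.3, 0.6, 0.2], [0.7, 0.4, 0.6, 0.3], [1, 0, 1, 0]))
```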
- Item: Using prediction markets to predict the outcomes in the Defense Advanced Research Projects Agency's next-generation social science programme (The Royal Society, 2021-07)
  Viganola D; Buckles G; Chen Y; Diego-Rosell P; Johannesson M; Nosek BA; Pfeiffer T; Siegel A; Dreber A
  There is evidence that prediction markets are useful tools to aggregate information on researchers' beliefs about scientific results, including the outcome of replications. In this study, we use prediction markets to forecast the results of novel experimental designs that test established theories. We set up prediction markets for hypotheses tested in the Defense Advanced Research Projects Agency's (DARPA) Next Generation Social Science (NGS2) programme. Researchers were invited to bet on whether 22 hypotheses would be supported or not. We define support as a test result in the same direction as hypothesized, with a Bayes factor of at least 10 (i.e. the observed data are at least 10 times more likely under the tested hypothesis than under the null hypothesis). In addition to betting on this binary outcome, we asked participants to bet on the expected effect size (in Cohen's d) for each hypothesis. Our goal was to recruit at least 50 participants to sign up for these markets; while this goal was met, only 39 participants ended up actually trading. Participants also completed a survey on both the binary result and the effect size. We find that neither the prediction markets nor the surveys performed well in predicting outcomes for NGS2.
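As a sketch of the support criterion defined in this item (an effect in the hypothesized direction with a Bayes factor BF10 of at least 10), the snippet below encodes only the decision rule; computing the Bayes factor itself would depend on the Bayesian test applied to each NGS2 hypothesis, which is not shown here. The function name and example values are hypothetical.

```python
def hypothesis_supported(observed_d: float, hypothesized_sign: int, bf10: float,
                         threshold: float = 10.0) -> bool:
    """Support rule as described in the abstract: the observed effect (Cohen's d)
    must point in the hypothesized direction, and BF10 must be at least
    `threshold` (data >= 10x more likely under H1 than under H0)."""
    right_direction = (observed_d > 0) == (hypothesized_sign > 0)
    return right_direction and bf10 >= threshold

print(hypothesis_supported(observed_d=0.35, hypothesized_sign=+1, bf10=12.4))  # True
print(hypothesis_supported(observed_d=0.35, hypothesized_sign=+1, bf10=4.2))   # False
```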