Browsing by Author "Pfeiffer T"
Now showing 1 - 8 of 8
- Can large language models help predict results from a complex behavioural science study? (The Royal Society, 2024-09). Lippert S; Dreber A; Johannesson M; Tierney W; Cyrus-Lai W; Uhlmann EL; Emotion Expression Collaboration; Pfeiffer T.
  We tested whether large language models (LLMs) can help predict results from a complex behavioural science experiment. In study 1, we investigated the performance of the widely used LLMs GPT-3.5 and GPT-4 in forecasting the empirical findings of a large-scale experimental study of emotions, gender, and social perceptions. We found that GPT-4, but not GPT-3.5, matched the performance of a cohort of 119 human experts, with correlations of 0.89 (GPT-4), 0.07 (GPT-3.5) and 0.87 (human experts) between aggregated forecasts and realized effect sizes. In study 2, giving participants from a university subject pool the opportunity to query a GPT-4-powered chatbot significantly increased the accuracy of their forecasts. These results indicate promise for artificial intelligence (AI) to help anticipate, at scale and at minimal cost, which claims about human behaviour will find empirical support and which will not. Our discussion focuses on avenues for human-AI collaboration in science.
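  The headline numbers above are Pearson correlations between aggregated forecasts and realized effect sizes. A minimal sketch of that comparison, using entirely hypothetical forecast and effect-size values (none of the study's actual data):

  ```python
  # Sketch: correlate aggregated forecasts with realized effect sizes.
  # All numbers below are made up for illustration.

  def pearson(xs, ys):
      """Pearson correlation coefficient of two equal-length sequences."""
      n = len(xs)
      mx, my = sum(xs) / n, sum(ys) / n
      cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      vx = sum((x - mx) ** 2 for x in xs)
      vy = sum((y - my) ** 2 for y in ys)
      return cov / (vx ** 0.5 * vy ** 0.5)

  # Hypothetical aggregated forecasts and realized effect sizes (Cohen's d)
  forecasts = [0.10, 0.25, 0.40, 0.55, 0.70]
  realized  = [0.05, 0.30, 0.35, 0.60, 0.65]

  r = pearson(forecasts, realized)
  print(round(r, 2))
  ```

  A high `r` here, as for GPT-4 and the human experts in the study, means the forecasts rank the effects in roughly the order in which they actually materialized.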
- Examining the generalizability of research findings from archival data (PNAS, 2022-07-26). Delios A; Clemente EG; Wu T; Tan H; Wang Y; Gordon M; Viganola D; Chen Z; Dreber A; Johannesson M; Pfeiffer T; Generalizability Tests Forecasting Collaboration; Uhlmann EL.
  This initiative systematically examined the extent to which a large set of archival research findings generalizes across contexts. We repeated the key analyses for 29 original strategic management effects in the same context (direct reproduction) as well as in 52 novel time periods and geographies; 45% of the direct reproductions returned results matching the original reports, as did 55% of tests in different spans of years and 40% of tests in novel geographies. Some original findings were associated with multiple new tests. Reproducibility was the best predictor of generalizability: of the findings that proved directly reproducible, 84% emerged in other available time periods and 57% emerged in other geographies. Overall, only limited empirical evidence of context sensitivity emerged. In a forecasting survey, independent scientists were able to anticipate which effects would find support in tests on new samples.
- Examining the replicability of online experiments selected by a decision market. (Nature Research, 2024-11-19). Holzmeister F; Johannesson M; Camerer CF; Chen Y; Ho T-H; Hoogeveen S; Huber J; Imai N; Imai T; Jin L; Kirchler M; Ly A; Mandl B; Manfredi D; Nave G; Nosek BA; Pfeiffer T; Sarafoglou A; Schwaiger R; Wagenmakers E-J; Waldén V; Dreber A.
  Here we test the feasibility of using decision markets to select studies for replication and provide evidence about the replicability of online experiments. Social scientists (n = 162) traded on the outcomes of close replications of 41 systematically selected MTurk social science experiments published in PNAS between 2015 and 2018, knowing that the 12 studies with the lowest and the 12 with the highest final market prices would be selected for replication, along with 2 randomly selected studies. The replication rate, based on the statistical-significance indicator, was 83% for the top-12 group and 33% for the bottom-12 group. Overall, 54% of the studies were successfully replicated, with replication effect-size estimates averaging 45% of the original effect-size estimates. The replication rate varied between 54% and 62% under alternative replication indicators. The observed replicability of MTurk experiments is comparable to that of previous systematic replication projects involving laboratory experiments.
- Forecasting the publication and citation outcomes of COVID-19 preprints (The Royal Society, 2022-09). Gordon M; Bishop M; Chen Y; Dreber A; Goldfedder B; Holzmeister F; Johannesson M; Liu Y; Tran L; Twardy C; Wang J; Pfeiffer T.
  Many publications on COVID-19 were released on preprint servers such as medRxiv and bioRxiv. It is unknown how reliable these preprints are and which ones will eventually be published in scientific journals. In this study, we use crowdsourced human forecasts to predict publication outcomes and future citation counts for a sample of 400 preprints with high Altmetric scores. Most of these preprints (70%) were published within 1 year of upload to a preprint server, with a considerable fraction (45%) appearing in a high-impact journal with a journal impact factor of at least 10. On average, the preprints received 162 citations within the first year. We found that forecasters can predict whether preprints will be published after 1 year and whether the publishing journal has high impact. Forecasts are also informative with respect to Google Scholar citations within 1 year of upload to a preprint server. For both types of assessment, we found statistically significant positive correlations between forecasts and observed outcomes. While the forecasts can help provide a preliminary assessment of preprints at a faster pace than traditional peer review, it remains to be investigated whether such an assessment is suited to identifying methodological problems in preprints.
- On the trajectory of discrimination: A meta-analysis and forecasting survey capturing 44 years of field experiments on gender and hiring decisions (Elsevier Inc, 2023-11). Schaerer M; Plessis CD; Nguyen MHB; van Aert RCM; Tiokhin L; Lakens D; Clemente EG; Pfeiffer T; Dreber A; Johannesson M; Clark CJ; Uhlmann EL; Abraham AT; Adamus M; Akinci C; Alberti F; Alsharawy AM; Alzahawi S; Anseel F; Arndt F; Balkan B; Baskin E; Bearden CE; Benotsch EG; Bernritter S; Black SR; Bleidorn W; Boysen AP; Brienza JP; Brown M; Brown SEV; Brown JW; Buckley J; Buttliere B; Byrd N; Cígler H; Capitan T; Cherubini P; Chong SY; Ciftci EE; Conrad CD; Conway P; Costa E; Cox JA; Cox DJ; Cruz F; Dawson IGJ; Demiral EE; Derrick JL; Doshi S; Dunleavy DJ; Durham JD; Elbaek CT; Ellis DA; Ert E; Espinoza MP; Füllbrunn SC; Fath S; Furrer R; Fiala L; Fillon AA; Forsgren M; Fytraki AT; Galarza FB; Gandhi L; Garrison SM; Geraldes D; Ghasemi O; Gjoneska B; Gothilander J; Grühn D; Grieder M; Hafenbrädl S; Halkias G; Hancock R; Hantula DA; Harton HC; Hoffmann CP; Holzmeister F; Hoŕak F; Hosch A-K; Imada H; Ioannidis K; Jaeger B; Janas M; Janik B; Pratap KC R; Keel PK; Keeley JW; Keller L; Kenrick DT; Kiely KM; Knutsson M; Kovacheva A; Kovera MB; Krivoshchekov V; Krumrei-Mancuso EJ; Kulibert D; Lacko D; Lemay EP.
  A preregistered meta-analysis, including 244 effect sizes from 85 field audits and 361,645 individual job applications, tested for gender bias in hiring practices in female-stereotypical, gender-balanced, and male-stereotypical jobs from 1976 to 2020. A "red team" of independent experts was recruited to increase the rigor and robustness of our meta-analytic approach. A forecasting survey further examined whether laypeople (n = 499 nationally representative adults) and scientists (n = 312) could predict the results. Forecasters correctly anticipated reductions in discrimination against female candidates over time. However, both scientists and laypeople overestimated the continuation of bias against female candidates. Instead, selection bias in favor of male over female candidates was eliminated and, if anything, slightly reversed in sign starting in 2009 for mixed-gender and male-stereotypical jobs in our sample. Forecasters further failed to anticipate that discrimination against male candidates for stereotypically female jobs would remain stable across the decades.
- Predicting replicability—Analysis of survey and prediction market data from large-scale forecasting projects (Public Library of Science (PLoS), 2021-04-14). Gordon M; Viganola D; Dreber A; Johannesson M; Pfeiffer T.
  The reproducibility of published research has become an important topic in science policy. A number of large-scale replication projects have been conducted to gauge the overall reproducibility of specific academic fields. Here, we present an analysis of data from four studies which sought to forecast the outcomes of replication projects in the social and behavioural sciences, using human experts who participated in prediction markets and answered surveys. Because the number of findings replicated and predicted in each individual study was small, pooling the data offers an opportunity to evaluate hypotheses about the performance of prediction markets and surveys with higher power. In total, peer beliefs were elicited for the replication outcomes of 103 published findings. We find that there is information within the scientific community about the replicability of scientific findings, and that both surveys and prediction markets can be used to elicit and aggregate this information. Our results show that prediction markets can predict the outcomes of direct replications with 73% accuracy (n = 103). Both the prediction market prices and the average survey responses are correlated with outcomes (0.581 and 0.564, respectively; both p < .001). We also found a significant relationship between the p-values of the original findings and replication outcomes. The dataset is made available through the R package "pooledmaRket" and can be used to further study community beliefs toward replication outcomes as elicited in the surveys and prediction markets.
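  The accuracy figure above comes from reading a final market price as a replication probability and classifying against a threshold. A minimal sketch of that scoring, with made-up prices and outcomes (not the actual pooledmaRket data):

  ```python
  # Sketch: score final market prices against realized replication outcomes.
  # Prices and outcomes below are hypothetical, for illustration only.

  prices   = [0.85, 0.30, 0.62, 0.15, 0.55]   # final market prices in [0, 1]
  outcomes = [1,    0,    1,    0,    0]       # 1 = study replicated

  # classify "will replicate" when the market price exceeds 0.5
  predicted = [1 if p > 0.5 else 0 for p in prices]
  accuracy = sum(p == o for p, o in zip(predicted, outcomes)) / len(outcomes)
  print(accuracy)  # 0.8 for this toy data
  ```

  The same price vector can also be correlated with the binary outcomes, which is how the 0.581 point-biserial-style correlation reported above is obtained in spirit.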
- Predicting the replicability of social and behavioural science claims in COVID-19 preprints (Springer Nature Limited, 2024-12-20). Marcoci A; Wilkinson DP; Vercammen A; Wintle BC; Abatayo AL; Baskin E; Berkman H; Buchanan EM; Capitán S; Capitán T; Chan G; Cheng KJG; Coupé T; Dryhurst S; Duan J; Edlund JE; Errington TM; Fedor A; Fidler F; Field JG; Fox N; Fraser H; Freeman ALJ; Hanea A; Holzmeister F; Hong S; Huggins R; Huntington-Klein N; Johannesson M; Jones AM; Kapoor H; Kerr J; Kline Struhl M; Kołczyńska M; Liu Y; Loomas Z; Luis B; Méndez E; Miske O; Mody F; Nast C; Nosek BA; Simon Parsons E; Pfeiffer T; Reed WR; Roozenbeek J; Schlyfestone AR; Schneider CR; Soh A; Song Z; Tagat A; Tutor M; Tyner AH; Urbanska K; van der Linden S.
  Replications are important for assessing the reliability of published findings. However, they are costly, and it is infeasible to replicate everything. Accurate, fast, lower-cost alternatives such as eliciting predictions could accelerate assessment for rapid policy implementation in a crisis and help guide a more efficient allocation of scarce replication resources. We elicited judgements from participants on 100 claims from preprints about an emerging area of research (the COVID-19 pandemic) using an interactive structured elicitation protocol, and we conducted 29 new high-powered replications. After interacting with their peers, participant groups with lower task expertise ('beginners') updated their estimates and confidence in their judgements significantly more than groups with greater task expertise ('experienced'). For experienced individuals, the average accuracy was 0.57 (95% CI: [0.53, 0.61]) after interaction, and they correctly classified 61% of claims; beginners' average accuracy was 0.58 (95% CI: [0.54, 0.62]), correctly classifying 69% of claims. The difference in accuracy between groups was not statistically significant, and their judgements on the full set of claims were correlated (r(98) = 0.48, P < 0.001). These results suggest that both beginners and more-experienced participants using a structured process have some ability to make better-than-chance predictions about the reliability of 'fast science' under conditions of high uncertainty. However, given the importance of such assessments for making evidence-based critical decisions in a crisis, more research is required to understand who the right experts in forecasting replicability are and how their judgements ought to be elicited.
- Using prediction markets to predict the outcomes in the Defense Advanced Research Projects Agency's next-generation social science programme (The Royal Society, 2021-07). Viganola D; Buckles G; Chen Y; Diego-Rosell P; Johannesson M; Nosek BA; Pfeiffer T; Siegel A; Dreber A.
  There is evidence that prediction markets are useful tools for aggregating information on researchers' beliefs about scientific results, including the outcomes of replications. In this study, we use prediction markets to forecast the results of novel experimental designs that test established theories. We set up prediction markets for hypotheses tested in the Defense Advanced Research Projects Agency's (DARPA) Next Generation Social Science (NGS2) programme. Researchers were invited to bet on whether 22 hypotheses would be supported or not. We define support as a test result in the same direction as hypothesized, with a Bayes factor of at least 10 (i.e. the observed data are at least 10 times more likely under the tested hypothesis than under the null hypothesis). In addition to betting on this binary outcome, we asked participants to bet on the expected effect size (in Cohen's d) for each hypothesis. Our goal of recruiting at least 50 participants was met in sign-ups, but only 39 participants ended up actually trading. Participants also completed a survey covering both the binary result and the effect size. We find that neither prediction markets nor surveys performed well in predicting outcomes for NGS2.
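  The support criterion above can be illustrated as a likelihood ratio between two point hypotheses. This is a hedged sketch with hypothetical numbers; the NGS2 analyses used full Bayes factors, not this two-point simplification:

  ```python
  # Sketch: the "Bayes factor >= 10" support criterion as a likelihood ratio.
  # Effect sizes, standard error, and hypotheses below are illustrative only.
  import math

  def normal_pdf(x, mu, sigma):
      """Density of a normal distribution at x."""
      z = (x - mu) / sigma
      return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

  # hypothetical observed effect (Cohen's d) and its standard error
  observed_d, se = 0.45, 0.10

  # likelihood of the data under the tested hypothesis (d = 0.4)
  # versus under the null hypothesis (d = 0)
  bf = normal_pdf(observed_d, 0.4, se) / normal_pdf(observed_d, 0.0, se)
  print(bf > 10)  # True: data this far from zero easily clear the threshold
  ```

  When the observed effect sits near the hypothesized value and far from zero in standard-error units, this ratio grows very quickly, which is why a threshold of 10 marks fairly strong support.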