623 - Human vs AI: Evaluating Agreement in Primary Outcome Classification for Neonatal RCTs

Saturday, April 25, 2026

3:30pm - 5:45pm ET

Publication Number: 2608.623

Aytana Gonzalez, Boston Children's Hospital, Miami, FL, United States; Grace E. Fuller, University of North Carolina at Chapel Hill School of Medicine, Chapel Hill, NC, United States; Smita Ektare, Boston Children's Hospital, Los Gatos, CA, United States; Kim Ruiz-Arellanos, Beth Israel Deaconess Medical Center, Boston, MA, United States; John Zupancic, Harvard Medical School, Boston, MA, United States; Brian King, Harvard Medical School, Boston, MA, United States; Kristyn S. Beam, Beth Israel Deaconess Medical Center, Needham, MA, United States

Poster Presenting Author(s)

Kristyn S. Beam, MD MPH (she/her/hers)

Neonatologist
Beth Israel Deaconess Medical Center
Needham, Massachusetts, United States

Background: Standardized identification of primary outcomes in randomized controlled trials (RCTs) is critical for accurate synthesis and analysis of results across studies. While artificial intelligence (AI) offers potential to automate this process, its ability to replicate nuanced human judgment remains uncertain, particularly when outcome labeling is ambiguous or inconsistently reported.

Objective: To evaluate agreement between human and AI extraction of primary outcome classifications in neonatal RCTs and to identify sources of discordance.

Design/Methods: We conducted a comparative analysis using a subsample of neonatal RCTs (2018-2022) from the NeoCanon database. Each study’s primary outcome was independently classified by a trained human reviewer and by an AI large language model (LLM) according to a standardized three-tier scheme: explicit, implicit, or unclear. Agreement was assessed qualitatively based on label terminology (e.g., “primary outcome,” “main endpoint”) and section location (methods, results, objectives). Disagreements were reviewed to characterize systematic trends in over- and under-classification.
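As a minimal sketch of how paired classifications of this kind could be tallied, the Python below computes simple percent agreement and counts each (human, AI) disagreement pair. The function names and the six example labels are hypothetical illustrations, not the study's data or the authors' code.

```python
from collections import Counter

# The three-tier scheme described in the methods.
VALID_LABELS = {"explicit", "implicit", "unclear"}


def percent_agreement(human, ai):
    """Return the percentage of studies where both raters chose the same label."""
    if len(human) != len(ai) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    for label in list(human) + list(ai):
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
    matches = sum(h == a for h, a in zip(human, ai))
    return 100.0 * matches / len(human)


def disagreement_pairs(human, ai):
    """Count each (human_label, ai_label) combination where the raters differ."""
    return Counter((h, a) for h, a in zip(human, ai) if h != a)


# Hypothetical labels for six studies (illustration only, not study data).
human = ["explicit", "explicit", "implicit", "unclear", "explicit", "implicit"]
ai = ["explicit", "implicit", "implicit", "unclear", "explicit", "unclear"]

print(f"{percent_agreement(human, ai):.1f}% agreement")
for (h, a), n in disagreement_pairs(human, ai).items():
    print(f"human={h}, AI={a}: {n}")
```

Counting the disagreement pairs separately shows the direction of each mismatch, for example whether the AI tends to label an outcome implicit where the human reviewer labeled it explicit.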

Results: Across 33 RCTs, the human reviewer and the AI agreed on primary outcome extraction in 67% of studies and disagreed in 33%. Extraction was most consistent when the outcome was clearly labeled (e.g., "primary outcome") and listed in the methods section (39% agreement). Disagreement increased when the label was absent or ambiguous (e.g., "main parameters") or when the outcome appeared in the objectives or introduction.

Conclusion(s): Our comparative analysis of primary outcome extraction by humans versus AI showed that contextual factors substantially influence agreement between the two. Specifically, AI performance improved when the manuscript explicitly used the term "primary outcome." Overall, the AI tended toward conservatism, often classifying primary outcomes as implicit when they were framed as objectives or goals. To improve the efficiency of primary outcome extraction in future work, AI could serve as an initial filter, with human reviewers then focusing their efforts on outcomes classified as implicit.