613 - Assessing Large Language Model Performance for Quality Metric Extraction in Neonatal Randomized Controlled Trials
Saturday, April 25, 2026
3:30pm - 5:45pm ET
Publication Number: 2598.613
Smita Ektare, Boston Children's Hospital, Los Gatos, CA, United States; Kim Ruiz-Arellanos, Beth Israel Deaconess Medical Center, Boston, MA, United States; Aytana Gonzalez, Boston Children's Hospital, Miami, FL, United States; Grace E. Fuller, University of North Carolina at Chapel Hill School of Medicine, Chapel Hill, NC, United States; John Zupancic, Harvard Medical School, Boston, MA, United States; Brian King, Harvard Medical School, Boston, MA, United States; Kristyn S. Beam, Beth Israel Deaconess Medical Center, Needham, MA, United States
Neonatologist, Beth Israel Deaconess Medical Center, Needham, Massachusetts, United States
Background: The quality of neonatal randomized controlled trials (RCTs) varies widely, so systematic reviewers need to assess trial quality metrics consistently. Manual assessment requires significant time and resources, which delays processing. Large language models (LLMs) may provide a scalable alternative; however, their capacity to apply standardized quality criteria to neonatal RCTs remains understudied.
Objective: To evaluate the accuracy of an LLM (ChatGPT 4o) in extracting information and answering 32 predefined quality assessment questions about neonatal RCTs. RCTs were sampled from NeoCanon, a validated and comprehensive database of neonatal RCTs.
Design/Methods: We established a human-reviewed reference dataset by applying 32 quality metrics to 600 RCTs (2018-2022). An 80-RCT subset informed iterative prompt engineering. We developed standardized LLM instructions to parse full-text RCTs, normalize text, and map synonyms. Specific instructions directed the LLM to conduct targeted second-pass scans for key CONSORT elements. For each RCT PDF, the model generated binary responses to quality questions covering design, methods, results, and reporting. We compared model outputs to the human reference to assess agreement.
Results: Across 2299 paired responses from a human reviewer and the LLM, the overall disagreement rate was 24.3% over the 32 predefined questions. Per-question disagreement varied widely, from 1.3% to 74.7%. Disagreement was highest on questions involving subjectivity or ambiguity, such as identifying the location of protocol information and the specific blinding details of the study. Other areas of substantial discordance included determining the allocation ratio and the enrollment period.
Conversely, questions with high levels of agreement involved objective information, including stating the trial design, presenting clear objectives, and reporting the number of participants analyzed.
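The agreement comparison described above can be illustrated with a minimal sketch. It assumes each paired response is stored as a (question, human answer, LLM answer) triple with binary answers. The function name, question labels, and toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def disagreement_rates(pairs):
    """Return (overall_rate, per_question_rates) for paired binary answers.

    pairs: iterable of (question_id, human_answer, llm_answer) tuples.
    A pair counts as a disagreement when the two answers differ.
    """
    totals = defaultdict(int)      # paired responses seen per question
    mismatches = defaultdict(int)  # disagreements seen per question
    for question_id, human, llm in pairs:
        totals[question_id] += 1
        if human != llm:
            mismatches[question_id] += 1

    n = sum(totals.values())
    if n == 0:
        raise ValueError("no paired responses")

    overall = sum(mismatches.values()) / n
    per_question = {q: mismatches[q] / totals[q] for q in totals}
    return overall, per_question


# Toy data for illustration only (hypothetical question labels, not study data).
toy_pairs = [
    ("trial_design_stated", True, True),
    ("trial_design_stated", True, True),
    ("allocation_ratio_reported", True, False),
    ("allocation_ratio_reported", False, False),
]
overall, per_question = disagreement_rates(toy_pairs)
print(overall)       # fraction of all pairs where human and LLM differ
print(per_question)  # same fraction, computed separately for each question
```

Computing the rate per question as well as overall makes it easy to separate questions with near-complete agreement from those that need human review.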
Conclusion(s): This study compared LLM and human reviewers for extracting quality data from neonatal RCTs. LLMs agreed with humans on objective metrics but struggled with subjective or nuanced information and data not explicitly reported. LLMs can boost screening efficiency, but human review is likely still necessary for subjective questions to ensure data extraction quality.