Session: Medical Education 10: Simulation and Technology I
158 - Feasibility and Challenges of Large Language Model (LLM)-Generated Neonatal Resuscitation Simulations: A Multicenter Exploratory Study
Monday, April 27, 2026
8:00am - 10:00am ET
Publication Number: 4155.158
Chenguang Xu, The University of Hong Kong - Shenzhen Hospital, Shenzhen, Guangdong, China (People's Republic); Ming Zhou, Shanghai First Maternity and Infant Hospital, Shanghai, Shanghai, China (People's Republic); Dianna Wang, University of Alberta Faculty of Medicine and Dentistry, Edmonton, AB, Canada; Qianshen Zhang, The University of Hong Kong - Shenzhen Hospital, Shenzhen, Guangdong, China (People's Republic); Jiang-Qin Liu, Shanghai First Maternity and Infant Hospital, Tongji University School of Medicine, Shanghai, Shanghai, China (People's Republic); Genevieve Po Gee Fung, The Chinese University of Hong Kong, Hong Kong, N/A, Hong Kong; Mabel Wong, Queen Mary Hospital, Hong Kong, N/A, Hong Kong; Yin Xue, The University of Hong Kong - Shenzhen Hospital, Shenzhen, Guangdong, China (People's Republic); Georg Schmolzer, University of Alberta Faculty of Medicine and Dentistry, Edmonton, AB, Canada; Po-Yin Cheung, University of Alberta Faculty of Medicine and Dentistry, Edmonton, AB, Canada
Professor, University of Alberta Faculty of Medicine and Dentistry, Edmonton, Alberta, Canada
Background: Simulation-based training (SBT) in neonatal resuscitation has a positive impact on educational and neonatal outcomes. However, implementing well-designed SBT places multifaceted demands on instructors. Large language models (LLMs) may be able to generate contextual resuscitation scenarios dynamically, but the feasibility and challenges of LLM-generated simulation scenarios in neonatal resuscitation remain unclear.
Objective: To explore the feasibility and challenges of AI-generated simulation scenarios using ChatGPT and DeepSeek as simulation development tools.
Design/Methods: This was a prospective, multicenter pilot study evaluating the feasibility and challenges of LLM-generated simulation scenarios. Four scenarios (extremely premature infant, placental abruption, birth before arrival, and meconium-stained amniotic fluid) were generated by each of ChatGPT-4o and DeepSeek-R1. Four equivalent scenarios were extracted from each of the Neonatal Resuscitation Program® (NRP®) and RETAIN (a serious game platform). In total, the 16 scenarios were written on standardized templates, coded, and randomized. Nine independent instructors from 5 centers, blinded to group allocation, evaluated the scenarios using a modified Jeffries Simulation Design Scale (JSDS). AI hallucination and qualitative evaluation were also compared among the four groups.
Results: Compared with NRP® scenarios, ChatGPT scenarios received similar overall evaluation scores, whereas DeepSeek and RETAIN scenarios scored lower in overall evaluation, problem-solving efficacy, and scenario fidelity. DeepSeek also performed worse in providing appropriate information. In debriefing design, ChatGPT achieved higher scores than NRP® (P=0.02). Quantitative evaluation revealed no statistically significant difference in AI hallucination between the two LLMs. ChatGPT showed strengths in establishing clear objectives and providing structured debriefing frameworks but did not consistently provide dynamic vital signs. DeepSeek scenarios contained violations of the NRP® algorithm.
Conclusion(s): ChatGPT-generated simulation scenarios might be feasible for facilitating SBT when supervised by NRP® instructors. Deviations from NRP® and other gaps remain in LLM-generated scenarios, which necessitates objective evaluation prior to implementation. Further research assessing educational outcomes and feedback from target learners is essential for the appropriate integration of LLM-generated simulation into SBT.
Figure 1. Overall evaluation of the four groups. Data are presented as median (IQR) scores.
Table 1. Overall evaluation scores for all scenarios and for individual scenarios among the four groups. Data are presented as median (interquartile range). Adjusted P value vs. NRP®: * <0.05; ** <0.01; *** <0.005; **** <0.001.
Table 2. ChatGPT-4o vs. DeepSeek-R1: AI hallucination and comments.