Technology
Session: Technology 1: AI in Pediatrics
Caroline Howard, BS (she/her/hers)
Clinical Researcher
Northwell Health
New York, New York, United States
(a) Proportion of hallucinated references by tool (Claude, ChatGPT, Gemini); the proportion differed across tools (p < .001). In pairwise comparisons, Claude hallucinated significantly less often than both ChatGPT (p < .001) and Gemini (p < .001), and ChatGPT hallucinated significantly less often than Gemini (p < .001).
Peer-review status varied by reference source (three LLMs and human authors; p = .002): 100% of Claude references were peer reviewed, compared with 98% for ChatGPT-4o, 94% for human-obtained references (original authors), and 90% for Gemini. In pairwise comparisons, Claude provided a significantly higher proportion of peer-reviewed references than Gemini (p = .012) and than the human authors (p = .025).
Publication recency varied by reference source (three LLMs and human authors; p = .005). Tukey's HSD indicated that human-obtained references were significantly more recent (mean = 2764.629 days) than Gemini-generated references (mean = 4253.327 days; p = .003).
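
The recency means above are reported in days, which is hard to read at a glance. The short Python sketch below converts them to approximate years. It rests on two assumptions that the caption does not state: that the day counts measure the age of each reference (a smaller value means a more recent publication), and that 365.25 days per year is an acceptable conversion factor. The helper name `days_to_years` is hypothetical and used only for illustration.

```python
# Convert the reported mean reference ages from days to approximate years.
# Assumption: 365.25 days per year (average year length including leap years).
DAYS_PER_YEAR = 365.25


def days_to_years(days: float) -> float:
    """Return the number of years corresponding to a duration in days."""
    return days / DAYS_PER_YEAR


# Means as reported in the caption (Tukey's HSD comparison, p = .003).
human_mean_days = 2764.629
gemini_mean_days = 4253.327

human_years = days_to_years(human_mean_days)
gemini_years = days_to_years(gemini_mean_days)
gap_years = days_to_years(gemini_mean_days - human_mean_days)

print(f"human: {human_years:.1f} y, Gemini: {gemini_years:.1f} y, gap: {gap_years:.1f} y")
# prints "human: 7.6 y, Gemini: 11.6 y, gap: 4.1 y"
```

Under these assumptions, the human-obtained references were on average roughly four years newer than the Gemini-generated ones.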