304 - AI Hallucinated My Bibliography: Examining the Accuracy and Credibility of References Generated by Large Language Models

Friday, April 24, 2026

5:30pm - 8:00pm ET

Publication Number: 1290.304

Caroline Howard, Northwell Health, New York, NY, United States; Joseph Mekhail, Northwell Health, New York, NY, United States; Lillian M. Ravikoff, Northwell Health, Bronx, NY, United States; Samuel James Kraus, Northwell Health, New Hyde Park, NY, United States; Neel Sharma, Northwell Health, New York, NY, United States; Jack P. Brenner, Northwell Health, Lake Success, NY, United States; Melanie Malkasyan, Northwell Health, Fair Lawn, NJ, United States; Hannah Likier, Northwell Health, Flemington, NJ, United States; Saia Kalash, Northwell Health, Lake Success, NY, United States; Ruth Milanaik, Northwell Health, Great Neck, NY, United States

Poster Presenting Author(s)

Caroline Howard, BS (she/her/hers)

Clinical Researcher
Northwell Health
New York, New York, United States

Background: Since their emergence in 2022, large language models (LLMs) have been used to refine text, write code, and organize literature. By 2024, 76% of academic researchers reported using these AI tools in their work. With this uptake, many have turned to LLMs for identifying and formatting references. Several published works, however, have featured "hallucinated" references, citations that appear legitimate but correspond to nonexistent articles.

Objective: To assess the existence, format, credibility, and relevance of references generated by advanced LLMs.

Design/Methods: 30 original investigations (10 each from the Jan 2019 issues of JAMA Pediatrics, Pediatrics, and Frontiers in Pediatrics) were randomly sampled. 3 LLMs (ChatGPT-4o, Claude Sonnet 4, Gemini 1.5 Pro) were provided with each article's methods and results and the journal's reference-style guidelines. Each LLM was prompted to generate 5 references per article, restricted to publications before Jun 1, 2018. As a control, 5 references were randomly sampled from each original article's bibliography. Outcomes were reference existence (real vs hallucinated), formatting correctness, peer-review status, overlap with the original article's bibliography, and publication recency (days since publication).

Results: A chi-square test found an association between LLM type and reference existence (p<.001), with Claude generating fewer hallucinated references (4%) than ChatGPT (18%) and Gemini (36%). Formatting correctness differed by LLM (p<.001): Gemini produced correctly formatted references more often (82%) than Claude (55%; p<.001) and ChatGPT (40%; p<.001) (Figure 1). Peer-review status varied by reference source (3 LLMs, human; p=.002). Claude generated peer-reviewed references more often (100%) than Gemini (90%; p=.012) and the human-obtained control (94%; p=.025) (Figure 2). Reference overlap with the original article's bibliography varied by LLM (p=.034); Claude matched more often (32%) than Gemini (17%; p=.045). A one-way ANOVA indicated that publication recency varied by reference source (p=.005), with human-obtained references more recent than Gemini-generated references (Figure 3).
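As a rough illustration of the chi-square test of independence named above, the following Python sketch rebuilds a 3x2 contingency table for reference existence. The abstract reports only percentages, so the counts here are an assumption: they suppose every model returned all 150 requested references (30 articles x 5 references). This is not the authors' analysis, and the function names are hypothetical.

```python
import math

# ASSUMPTION: counts are not reported in the abstract. They are derived
# from the stated percentages, supposing 30 articles x 5 references = 150
# evaluated references per model.
N_PER_MODEL = 30 * 5
HALLUCINATED_PCT = {"Claude": 4, "ChatGPT": 18, "Gemini": 36}


def build_table(pcts, n):
    """Return {model: (hallucinated, real)} counts from percentages."""
    table = {}
    for model, pct in pcts.items():
        fake = round(n * pct / 100)
        table[model] = (fake, n - fake)
    return table


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x 2 table."""
    rows = list(table.values())
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / grand_total
            stat += (row[j] - expected) ** 2 / expected
    df = (len(rows) - 1) * (2 - 1)
    return stat, df


def p_value_df2(stat):
    """Upper-tail p-value; valid ONLY for 2 degrees of freedom,
    where the chi-square survival function is exactly exp(-x/2)."""
    return math.exp(-stat / 2)


table = build_table(HALLUCINATED_PCT, N_PER_MODEL)
stat, df = chi_square(table)
p = p_value_df2(stat)
print(f"chi2 = {stat:.2f}, df = {df}, p < .001: {p < 0.001}")
```

Under these assumed counts the omnibus test agrees with the reported p<.001. The exact statistic would differ if any model returned fewer than 5 references for some articles.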

Conclusion(s): Claude produced the fewest hallucinated references and had the highest proportion of peer-reviewed references, suggesting that Claude is currently the best equipped of the 3 LLMs to reliably assist scholars in locating high-quality literature. Despite showing promise for reference generation, all 3 LLMs produced some hallucinated references, highlighting that even advanced models can generate references that appear legitimate but do not correspond to real sources. As such, scholars should be cautious about incorporating LLMs into their reference searches without substantial human oversight.

Figure 1. Reference quality across LLM tools (Claude, ChatGPT, Gemini).

(a) Bars show the proportion of hallucinated references by tool (Claude, ChatGPT, Gemini) (p<.001). Pairwise comparisons indicated that Claude hallucinated significantly less than both ChatGPT (p<.001) and Gemini (p<.001), and that ChatGPT hallucinated significantly less than Gemini (p<.001).
(b) Bars show the proportion of references correctly formatted by each tool (Claude, ChatGPT, Gemini) (p<.001). Pairwise comparisons indicated that Gemini generated correctly formatted references significantly more often than Claude (p<.001) and ChatGPT (p<.001).

Figure 2. Reference Peer-Review Status by Reference Source

Peer-review status varied by reference source (3 LLMs and human; p=.002): Claude 100% peer reviewed, ChatGPT-4o 98%, human (original authors) 94%, and Gemini 90%. Pairwise comparisons indicated that Claude provided peer-reviewed references significantly more than Gemini (p=.012) and human-obtained (p=.025) references.

Figure 3. Recency of Generated References (days between publication and Jan 1, 2019) by Reference Source

Publication recency varied by reference source (3 LLMs and human; p=.005). Tukey's HSD indicated that human-obtained references were significantly more recent (mean = 2764.629 days) than Gemini-generated references (mean = 4253.327 days) (p=.003).
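To make the recency measure concrete, the short Python sketch below computes days between a publication date and Jan 1, 2019, and converts the reported mean values to approximate years. The helper names and the 365.25-day year are illustrative assumptions, not part of the study's stated methods.

```python
from datetime import date

# Anchor date taken from the Figure 3 caption.
REFERENCE_DATE = date(2019, 1, 1)


def recency_days(published, reference=REFERENCE_DATE):
    """Days elapsed between a publication date and the anchor date."""
    return (reference - published).days


def days_to_years(days):
    """Approximate conversion; ASSUMES an average year of 365.25 days."""
    return days / 365.25


# Means as reported in the Figure 3 caption.
human_mean_days = 2764.629
gemini_mean_days = 4253.327

print(f"human:  {days_to_years(human_mean_days):.1f} years")
print(f"Gemini: {days_to_years(gemini_mean_days):.1f} years")
```

On this conversion, the human-selected references averaged roughly 7.6 years old and the Gemini-generated references roughly 11.6 years old at the anchor date.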