
Developmental and Behavioral Pediatrics

Session: Developmental and Behavioral Pediatrics 7: Screening

229 - Applications of Generative Artificial Intelligence in Scoring and Interpreting the M-CHAT-R

Monday, April 27, 2026

8:00am - 10:00am ET

Publication Number: 4226.229

Lillian M. Ravikoff, Northwell Health, Bronx, NY, United States; Caroline Howard, Northwell Health, New York, NY, United States; Joseph Mekhail, Northwell Health, New York, NY, United States; Neel Sharma, Northwell Health, New York, NY, United States; Samuel James Kraus, Northwell Health, New Hyde Park, NY, United States; Jack P. Brenner, Northwell Health, Lake Success, NY, United States; Melanie Malkasyan, Northwell Health, Fair Lawn, NJ, United States; Hannah Likier, Northwell Health, Flemington, NJ, United States; Clara Goldman, Northwell Health, Great Neck, NY, United States; Audrey Ng, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Great Neck, NY, United States

Poster Presenting Author(s)

Lillian M. Ravikoff, BA, BS (she/her/hers)

Research Assistant
Northwell
Bronx, New York, United States

Background: Autism spectrum disorder (ASD) is a highly prevalent developmental disorder for which early diagnosis and intervention can significantly improve outcomes. The Modified Checklist for Autism in Toddlers, Revised (M-CHAT-R) is the leading screening tool for early recognition of ASD. The American Academy of Pediatrics recommends M-CHAT-R screening for every 18- and 24-month-old child. Many primary care providers, however, do not routinely complete the screening, with recent studies showing only 73% adherence. The barrier to developmental screening most commonly cited by pediatricians is time limitations (76%).

Objective: To assess the ability of artificial intelligence (AI) to accurately score and interpret the M-CHAT-R and to produce high-quality patient file notes more efficiently than human clinicians.

Design/Methods: A total of 150 unique M-CHAT-R forms were completed and hand-scored by the research team to simulate patients in low-risk, moderate-risk, and high-risk categories (n=50 for each category). A pre-study was conducted to identify which AI model (ChatGPT-4, Claude Opus 4, Gemini Flash 2.5) generated the most comprehensive patient file notes; Claude was selected for the primary analysis [Figure 1]. A prompt engineering pre-study assessed seven different prompts. Prompt 6 was chosen for its ability to produce a high-quality note and to accurately read and score the completed M-CHAT-R. This prompt also alerts users if a form has been improperly completed [Figure 2, Figure 3]. Initial trials were conducted with forms completed in a mix of blue and black ink. A pooled-variance t-test indicated higher scoring accuracy when blue pens were used (p=0.008), leading the researchers to re-run the study with blue ink only. The primary outcome was scoring accuracy; the completion speed of AI compared with human clinicians was also evaluated using a t-test.
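For readers unfamiliar with the hand-scoring step, the following is a minimal sketch of the published M-CHAT-R scoring rule, not the study's own code or prompt. It assumes the standard 20-item form, where "no" is the at-risk answer for every item except items 2, 5, and 12, and the standard cutoffs (0-2, 3-7, 8-20). The function names `score_mchat_r` and `risk_category` are hypothetical, and the label "moderate" follows this abstract's wording.

```python
# Assumption: standard M-CHAT-R scoring rule; names are illustrative only.
N_ITEMS = 20
# Items where "yes" is the at-risk answer; for all other items "no" is at-risk.
REVERSE_ITEMS = {2, 5, 12}


def score_mchat_r(responses):
    """Return the total score for a dict mapping item number (1-20) to "yes"/"no"."""
    if set(responses) != set(range(1, N_ITEMS + 1)):
        raise ValueError("form must answer items 1-20 exactly once")
    total = 0
    for item, answer in responses.items():
        answer = answer.strip().lower()
        if answer not in ("yes", "no"):
            raise ValueError(f"item {item}: unreadable answer {answer!r}")
        at_risk_answer = "yes" if item in REVERSE_ITEMS else "no"
        if answer == at_risk_answer:
            total += 1
    return total


def risk_category(total):
    """Map a total score (0-20) to a risk category using the standard cutoffs."""
    if not 0 <= total <= N_ITEMS:
        raise ValueError("total score must be between 0 and 20")
    if total <= 2:
        return "low"
    if total <= 7:
        return "moderate"
    return "high"


# Usage: a form with every item answered "no" fails the 17 non-reversed items.
all_no = {item: "no" for item in range(1, N_ITEMS + 1)}
print(score_mchat_r(all_no), risk_category(score_mchat_r(all_no)))  # 17 high
```

Raising an error on missing or unreadable answers loosely mirrors the idea of flagging improperly completed forms, although how Prompt 6 detects such forms is not specified in the abstract.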

Results: 98.2% of M-CHAT-R forms scored by AI were placed into the correct risk category. 62% of M-CHAT-R forms were scored correctly, 12% had addition errors in the total score (all within 1 point of the correct score), and 26% produced an error message. Mean time for AI scoring and note writing was 16.97 seconds compared with 273.11 seconds for human clinicians (p < 0.001).
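One way to see why a 1-point addition error rarely changes the risk category is to enumerate the scores at which it could. The sketch below assumes the standard M-CHAT-R cutoffs (0-2 low, 3-7 moderate, 8-20 high); the helper names are hypothetical, and the abstract does not report the score distribution of the simulated forms, so this illustrates the cutoff arithmetic only.

```python
# Assumption: standard M-CHAT-R cutoffs; this is not an analysis of study data.
def risk_category(total):
    """Map a total score (0-20) to a risk category."""
    if total <= 2:
        return "low"
    if total <= 7:
        return "moderate"
    return "high"


def scores_sensitive_to_off_by_one():
    """Return true scores where a +1 or -1 error would change the category."""
    sensitive = set()
    for true_score in range(21):
        for error in (-1, 1):
            reported = true_score + error
            if not 0 <= reported <= 20:
                continue  # a reported total outside 0-20 is not possible
            if risk_category(reported) != risk_category(true_score):
                sensitive.add(true_score)
    return sorted(sensitive)


print(scores_sensitive_to_off_by_one())  # [2, 3, 7, 8]
```

Under these cutoffs, only true scores adjacent to a category boundary can be misclassified by a 1-point error; for the other 17 possible totals the category is unchanged.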

Conclusion(s): AI shows potential for automating M-CHAT-R scoring and patient note generation. Though human intervention is needed when AI produces error output, clinicians can largely rely on AI to correctly identify the ASD risk category (high, moderate, low). AI outperformed manual methods in speed of assessment and quality of patient notes. Future studies should explore the use of AI with other screening tools to improve rates of screening and documentation.

Figure 1: Sample Notes Generated by Three AI Models

Each model was given the same prompt and M-CHAT-R form. ChatGPT tended to produce simple notes explaining only the risk category of the score calculated. Claude tended to produce notes addressing the score calculated, specific items "failed" on the screener, detailed recommendations relating to these items, and clear and specific next steps. Gemini tended to produce notes listing the items "failed" and total score along with next steps written as having already occurred.

Figure 2: Prompt Engineering Pre-Study

Flowchart of the prompts tested, including a brief description of the changes between tests and qualitative observations of the output.

Figure 3: All Tested Prompts

Complete list of all prompts tested during the prompt engineering process. Prompt 6 was selected.