Cross-Sectional Descriptive Study of Comparative Accuracy of ChatGPT, Google Gemini, and Microsoft Copilot in Solving NEET PG Medical Entrance Test

Authors

DOI:

https://doi.org/10.5195/ijms.2025.3989

Keywords:

Artificial Intelligence, NEET PG, Multiple Choice Questions, Graduate Medical Education, Machine Learning

Abstract

Background: Artificial Intelligence (AI) is increasingly applied in healthcare and medical education, with tools capable of assisting in diagnosis, treatment planning, and exam preparation. The NEET-PG (National Eligibility cum Entrance Test, Postgraduate) is India’s national entrance examination for postgraduate medical training, in which case vignettes form a major component of assessment. AI chatbots therefore hold potential as exam-preparation aids. Previous studies have reported variable accuracy of AI tools on medical licensing exams, but head-to-head comparisons across question types, subjects, and platforms are scarce. Given their rapidly growing use by students and educators, establishing the reliability of these tools is critical. This study directly compares three leading AI chatbots.

Objective: To assess and compare the accuracy of ChatGPT-4, Google Gemini, and Microsoft Copilot in solving the NEET-PG 2023 examination and to evaluate their performance across different question types and medical subjects.

Methods: This cross-sectional descriptive study evaluated the performance of three AI chatbots using a validated set of 200 NEET-PG 2023 questions sourced from PrepLadder and verified against standard textbooks. The questions were presented verbatim to ChatGPT-4, Google Gemini, and Microsoft Copilot, each in an independent, separate session to minimize memory bias. Responses were recorded as correct or incorrect against the validated answer key, and accuracy was expressed as the percentage of correct responses. Comparative analyses covered overall accuracy, subject-wise accuracy, and question type (recall, analytical, image-based, case-based). Differences were assessed using the chi-square test, with p < 0.05 considered statistically significant.
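
For illustration, the scoring step amounts to matching each chatbot response against the validated key and reporting the fraction correct. A minimal Python sketch follows; the question IDs, options, and responses in it are hypothetical, not study data.

    # Illustrative sketch of the scoring procedure (hypothetical data).
    answer_key = {"Q1": "B", "Q2": "D", "Q3": "A", "Q4": "C"}   # validated answer key
    responses  = {"Q1": "B", "Q2": "C", "Q3": "A", "Q4": "C"}   # one chatbot's answers

    # A response counts as correct only if it matches the validated key.
    correct = sum(responses[q] == ans for q, ans in answer_key.items())
    accuracy = 100.0 * correct / len(answer_key)                # accuracy = % correct
    print(f"{correct}/{len(answer_key)} correct ({accuracy:.1f}%)")  # 3/4 correct (75.0%)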

Results: Microsoft Copilot achieved the highest overall accuracy with 165/200 correct responses (82.5%), followed by ChatGPT-4 with 161/200 (80.5%) and Google Gemini with 155/200 (77.5%); the difference in overall performance was not statistically significant (χ² = 1.6, p = 0.4). All three chatbots achieved 100% accuracy in Microbiology, Anesthesia, and Psychiatry, whereas accuracy was lower in Community Medicine, Forensic Medicine, Internal Medicine, and Radiology; variation across subjects was not significant (χ² = 2.7, p = 0.9). By question type, recall-based items showed the highest accuracy (85.5%), followed by case-based (82.4%) and analytical (77.3%) items, while image-based questions were the most challenging (mean accuracy 71.0%). Although Copilot performed slightly better on recall and image-based items, differences among the three chatbots by question type were not statistically significant (χ² = 0.35, p = 0.9). Overall, the findings show variability by subject and question format but no significant differences among the three tools.
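
As an illustrative check rather than the study's own analysis code, the overall comparison can be reproduced from the reported counts with a standard chi-square test of independence, here sketched with Python's scipy.

    # Reproducing the overall chi-square comparison from the reported counts.
    # Illustrative sketch (assumes scipy is installed), not the authors' code.
    from scipy.stats import chi2_contingency

    # Rows: Copilot, ChatGPT-4, Gemini; columns: correct, incorrect (of 200 each).
    table = [[165, 35],
             [161, 39],
             [155, 45]]

    chi2, p, dof, _expected = chi2_contingency(table)
    print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.2f}")
    # chi-square ≈ 1.60, df = 2, p ≈ 0.45, consistent with the reported χ² = 1.6, p = 0.4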

Conclusion: All three AI chatbots demonstrated good accuracy in solving NEET-PG questions, performing better on recall-based items and less well on image-based items, reflecting current limitations in multimodal capabilities. They can complement exam preparation by serving as an accessible, interactive platform and an affordable alternative to expensive coaching. In healthcare, AI chatbots hold potential for assisting with diagnosis, treatment planning, triage, and referral, particularly in resource-limited settings. However, concerns regarding data privacy, patient confidentiality, lack of empathy, and erosion of clinical decision-making skills limit their broader adoption. Future research should evaluate evolving versions of these models, larger examination datasets, and integration into structured educational frameworks.

Author Biographies

Kale Sachin Sadanand, MGM Institute of Health Sciences / MGM Medical College and Hospital, Chhatrapati Sambhajinagar (Aurangabad), Maharashtra, India - 431001.

MD, Professor of Pathology

Patil Anuradha Vishwanath, MGM Institute of Health Sciences / MGM Medical College and Hospital, Chhatrapati Sambhajinagar (Aurangabad), Maharashtra, India - 431001.

MD, Associate Professor of Pathology

Jali Nandita Vivekanand, MGM Institute of Health Sciences / MGM Medical College and Hospital, Chhatrapati Sambhajinagar (Aurangabad), Maharashtra, India - 431001.

5th Year (Intern), MBBS

Published

2025-12-31

How to Cite

Bhise, M., Kale, S., Patil, A., & Jali, N. (2025). Cross-Sectional Descriptive Study of Comparative Accuracy of ChatGPT, Google Gemini, and Microsoft Copilot in Solving NEET PG Medical Entrance Test. International Journal of Medical Students, 13, S226. https://doi.org/10.5195/ijms.2025.3989