Mustafa Can Guzelce1, Sefer Ozgur1, Ilker Salli2

1Department of Emergency Medicine, Izmir University of Economics Medicalpoint Hospital, Izmir, Türkiye
2Department of Emergency Medicine, Izmir Tepecik Training and Research Hospital, Izmir, Türkiye

Keywords: Artificial intelligence, board examination, ChatGPT, emergency medicine, GPT‑4, large language models, Turkish Board of Emergency Medicine

Abstract

OBJECTIVES: We aimed to evaluate the performance of a large language model (ChatGPT) in answering official sample questions from the Turkish Board of Emergency Medicine (TBEM). Two versions of the model, GPT-4 and GPT-4o, were assessed to explore consistency and accuracy across iterations.

METHODS: A cross-sectional observational study was conducted using 25 standardized multiple-choice questions publicly released by the TBEM. Each question was manually entered into GPT-4 and GPT-4o through the OpenAI interface. Both models were prompted to select the single best answer from the provided options without additional clarification or training context. Model responses were evaluated for accuracy, consistency upon repetition, and domain-specific error types. This study is compliant with the STROBE statement and the MedinAI reporting guidelines.

RESULTS: GPT-4 correctly answered 20 out of 25 questions (80%) on the first attempt; on repetition, its score improved to 84%. GPT-4o achieved a score of 88% (22/25) on its first attempt and showed consistent results upon a second evaluation, providing identical answers in both trials. Errors occurred in the domains of trauma during pregnancy, pediatric resuscitation, and adult resuscitation protocols. Both models demonstrated strong performance in fact-based domains and in questions involving image descriptions.

CONCLUSION: GPT-4 and GPT-4o performed above the TBEM passing threshold, showing solid accuracy across a range of emergency medicine topics. Both excelled in fact-based and image-related questions. However, they showed limitations in clinical reasoning, particularly in scenarios requiring nuanced judgment. These tools may support examination preparation but should not replace the expertise of trained emergency physicians.

How to cite this article: Guzelce MC, Ozgur S, Salli I. Evaluation of ChatGPT’s performance on emergency medicine board examination questions. Turk J Emerg Med 2026;26:110-5.