Performance of large language models in non-English medical ethics-related multiple choice questions: comparison of ChatGPT performance across versions and languages

Kim, Yoongu; Shin, Soan; Yoo, Sang-Ho

doi:10.1186/s12910-025-01316-z

Full metadata record

DC Field	Value	Language
dc.contributor.author	Kim, Yoongu	-
dc.contributor.author	Shin, Soan	-
dc.contributor.author	Yoo, Sang-Ho	-
dc.date.accessioned	2025-12-19T05:30:41Z	-
dc.date.available	2025-12-19T05:30:41Z	-
dc.date.issued	2025-12	-
dc.identifier.issn	1472-6939	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209933	-
dc.description.abstract	Background: As large language models (LLMs) evolve, assessing their competence in ethically sensitive domains such as medical ethics has become increasingly important. Since medical ethics is a universal component of medical education, disparities in AI performance across languages may result in unequal benefits for learners. Therefore, it is essential to examine performance in non-English contexts. While previous studies have evaluated the performance of Chat Generative Pre-trained Transformer (ChatGPT) on English-language multiple-choice questions (MCQs) in medical ethics, none have examined version-based improvements in non-English contexts. This study therefore evaluated ChatGPT versions 3.5, 4.0, and 4.5 on Korean medical ethics MCQs and their English translations, with a focus on performance trends across versions and languages. Methods: We selected 36 MCQs from the Korean National Medical Licensing Examination and the Comprehensive Clinical Medicine Evaluation databases. Each question was entered ten times per ChatGPT version (3.5, 4.0, 4.5) and language (Korean, English), for a total of 60 trials per question. Additionally, to assess the model's capacity to identify the ethical core without relying on the options provided, 31 of the 36 questions were modified by masking the correct choice. Accuracy was analyzed using independent-samples t-tests and Mann-Whitney U tests, and consistency was assessed using Krippendorff's alpha. Results: Overall, the accuracy and consistency of ChatGPT improved with each version. Version 4.5 achieved near-perfect scores and high reliability in both languages, while version 3.5 showed limited performance, particularly in the Korean test. Performance gaps between languages decreased with model upgrades but remained statistically significant in version 4.5 for some questions. In the masked-answer condition, all versions showed notable drops in accuracy and consistency, with version 4.5 still outperforming earlier versions. However, accuracy in this condition remained below 50%, indicating limitations in the model's autonomous ethical reasoning. Conclusions: ChatGPT demonstrated substantial improvements in medical ethics MCQ performance across versions, particularly in terms of consistency and accuracy. However, performance disparities between languages and reduced accuracy under masked-answer conditions highlight the ongoing limitations of non-English ethical reasoning and context recognition. These findings emphasize the need for further research on language-sensitive fine-tuning and the evaluation of LLMs in specialized ethical domains. The findings suggest that advanced LLMs may serve as valuable supplementary tools in medical education and clinical ethics training. At the same time, the observed language disparities call for context-sensitive adaptations to prevent inequities in practice.	-
dc.format.extent	14	-
dc.language	English	-
dc.language.iso	ENG	-
dc.publisher	BioMed Central	-
dc.title	Performance of large language models in non-English medical ethics-related multiple choice questions: comparison of ChatGPT performance across versions and languages	-
dc.type	Article	-
dc.publisher.location	United Kingdom	-
dc.identifier.doi	10.1186/s12910-025-01316-z	-
dc.identifier.scopusid	2-s2.0-105024319341	-
dc.identifier.wosid	001634304000001	-
dc.identifier.bibliographicCitation	BMC Medical Ethics, v.26, no.1, pp 1 - 14	-
dc.citation.title	BMC Medical Ethics	-
dc.citation.volume	26	-
dc.citation.number	1	-
dc.citation.startPage	1	-
dc.citation.endPage	14	-
dc.description.isOpenAccess	Y	-
dc.description.journalRegisteredClass	scie	-
dc.description.journalRegisteredClass	ssci	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Social Sciences - Other Topics	-
dc.relation.journalResearchArea	Medical Ethics	-
dc.relation.journalResearchArea	Biomedical Social Sciences	-
dc.relation.journalWebOfScienceCategory	Ethics	-
dc.relation.journalWebOfScienceCategory	Medical Ethics	-
dc.relation.journalWebOfScienceCategory	Social Sciences, Biomedical	-
dc.subject.keywordPlus	RELIABILITY	-
dc.subject.keywordAuthor	Artificial intelligence	-
dc.subject.keywordAuthor	Medical ethics	-
dc.subject.keywordAuthor	Medical education	-
dc.subject.keywordAuthor	Multiple-choice questions	-
dc.subject.keywordAuthor	Large language models	-
dc.subject.keywordAuthor	ChatGPT	-
dc.identifier.url	https://link.springer.com/article/10.1186/s12910-025-01316-z	-
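
The abstract names Krippendorff's alpha as the consistency measure across repeated entries of the same question. The sketch below illustrates one way such a figure could be computed; it is not the authors' code. It assumes the nominal variant of alpha and treats each repeated trial as one rating of a question. The function name `krippendorff_alpha_nominal` and the toy answer lists are hypothetical and are not data from the study.

```python
from collections import Counter
from itertools import permutations

# Trial count as described in the abstract: 10 entries per version and language.
# Reading "60 trials" as per question is an interpretation, not a stated fact.
REPEATS, VERSIONS, LANGUAGES = 10, 3, 2
trials_per_question = REPEATS * VERSIONS * LANGUAGES  # 60


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    units: one inner list of answer labels per question (the repeated trials).
    Returns NaN when only one label ever occurs, because alpha is undefined.
    """
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue  # a question with fewer than two answers cannot be paired
        # Every ordered pair of answers within a question contributes 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable answers.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")
    return 1.0 - (n - 1) * observed / expected


# Hypothetical toy answers: two questions, three repeated trials each.
consistent = [["A", "A", "A"], ["B", "B", "B"]]
split = [["A", "B"], ["A", "B"]]
alpha_consistent = krippendorff_alpha_nominal(consistent)
alpha_split = krippendorff_alpha_nominal(split)
```

An alpha of 1 means every repeated entry of a question produced the same answer, while values at or below 0 mean agreement no better than chance.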

Appears in Collections: College of Medicine (Seoul) > Department of Medical Humanities (Seoul) > 1. Journal Articles

Related Researcher

Yoo, Sang-Ho: College of Medicine, Seoul (DEPARTMENT OF MEDICAL HUMANITIES AND ETHICS)
