Bridging the Language Gap: Domain-Specific Dataset Construction for Medical LLMs
- Authors
- Kim, Chae Yeon; Kim, Song Yeon; Cho, Seung Hwan; Kim, Young-Min
- Issue Date
- Aug-2024
- Publisher
- Springer Verlag
- Keywords
- Large Language Model; Mining; Domain Dataset
- Citation
- Communications in Computer and Information Science, v.2160, pp. 134-146
- Pages
- 13
- Indexed
- SCOPUS
- Journal Title
- Communications in Computer and Information Science
- Volume
- 2160
- Start Page
- 134
- End Page
- 146
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/197962
- DOI
- 10.1007/978-981-97-6125-8_11
- ISSN
- 1865-0929; 1865-0937
- Abstract
- The advent of large language models (LLMs) has transformed natural language processing (NLP), with impressive results across tasks such as text generation, translation, and question answering. However, their effectiveness in specialized domains is constrained by the lack of domain-specific data. This paper presents a methodology for constructing domain-specific datasets from domain-specific corpora, addressing the challenges posed by linguistic and cultural differences in non-English-speaking regions. By leveraging mining techniques, the methodology facilitates the construction of datasets tailored to local languages and cultures. A Korean medical corpus served as the foundation for dataset construction, leading to a medical language model that demonstrated high performance and versatility across various NLP tasks. A comparative analysis with models based on Bidirectional Encoder Representations from Transformers (BERT) revealed comparable performance. The objective is to streamline the application of LLMs across diverse domains, thereby enhancing language model efficiency. Future work will apply the proposed methodology to other domains and investigate strategies for extracting domain-specific tasks and vocabulary to improve the quality of domain datasets.