Bridging the Language Gap: Domain-Specific Dataset Construction for Medical LLMs
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Kim, Chae Yeon | - |
| dc.contributor.author | Kim, Song Yeon | - |
| dc.contributor.author | Cho, Seung Hwan | - |
| dc.contributor.author | Kim, Young-Min | - |
| dc.date.accessioned | 2024-11-28T18:31:26Z | - |
| dc.date.available | 2024-11-28T18:31:26Z | - |
| dc.date.issued | 2024-08 | - |
| dc.identifier.issn | 1865-0929 | - |
| dc.identifier.issn | 1865-0937 | - |
| dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/197962 | - |
| dc.description.abstract | The advent of large language models (LLMs) has transformed the field of natural language processing (NLP), demonstrating impressive capabilities across a variety of tasks such as text generation, translation, and question answering. However, their effectiveness in specialized domains is constrained by the lack of domain-specific data. This paper presents an effective methodology for constructing domain-specific datasets using domain-specific corpora, thus overcoming the challenges posed by linguistic and cultural differences in non-English speaking regions. By leveraging mining techniques, this methodology facilitates the construction of datasets tailored to local languages and cultures. A Korean medical corpus served as the foundation for dataset construction, leading to the development of a medical language model that demonstrated high performance and versatility across various NLP tasks. A comparative analysis based on Bidirectional Encoder Representations from Transformers (BERT) revealed comparable performance. The objective is to streamline LLM applications across diverse domains, thereby enhancing language model efficiency. In the future, our efforts will be directed towards implementing the proposed methodology across diverse domains and investigating strategies for extracting domain-specific tasks and vocabulary to enhance the quality of domain datasets. | - |
| dc.format.extent | 13 | - |
| dc.language | English | - |
| dc.language.iso | ENG | - |
| dc.publisher | Springer Verlag | - |
| dc.title | Bridging the Language Gap: Domain-Specific Dataset Construction for Medical LLMs | - |
| dc.type | Article | - |
| dc.publisher.location | Germany | - |
| dc.identifier.doi | 10.1007/978-981-97-6125-8_11 | - |
| dc.identifier.scopusid | 2-s2.0-85200756998 | - |
| dc.identifier.wosid | 001317373400011 | - |
| dc.identifier.bibliographicCitation | Communications in Computer and Information Science, v.2160, pp 134 - 146 | - |
| dc.citation.title | Communications in Computer and Information Science | - |
| dc.citation.volume | 2160 | - |
| dc.citation.startPage | 134 | - |
| dc.citation.endPage | 146 | - |
| dc.type.docType | Proceedings Paper | - |
| dc.description.isOpenAccess | N | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.relation.journalResearchArea | Computer Science | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Interdisciplinary Applications | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Theory & Methods | - |
| dc.subject.keywordPlus | Computational linguistics | - |
| dc.subject.keywordPlus | Data mining | - |
| dc.subject.keywordPlus | Large datasets | - |
| dc.subject.keywordAuthor | Large Language Model | - |
| dc.subject.keywordAuthor | Mining | - |
| dc.subject.keywordAuthor | Domain Dataset | - |
| dc.identifier.url | https://link.springer.com/chapter/10.1007/978-981-97-6125-8_11 | - |
