Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model
- Authors
- Seo Hyun Oh; Min Kang; Youngho Lee
- Issue Date
- Jan-2022
- Publisher
- Korean Society of Medical Informatics
- Keywords
- Artificial Intelligence; Big Data; Medical Informatics; Data Anonymization; Deep Learning
- Citation
- Healthcare Informatics Research, v.28, no.1, pp.16 - 24
- Journal Title
- Healthcare Informatics Research
- Volume
- 28
- Number
- 1
- Start Page
- 16
- End Page
- 24
- URI
- https://scholarworks.bwise.kr/gachon/handle/2020.sw.gachon/83803
- DOI
- 10.4258/hir.2022.28.1.16
- ISSN
- 2093-3681
- Abstract
- Objectives: De-identifying protected health information (PHI) in medical documents is important, and a prerequisite to de-identification is the identification of PHI entity names in clinical documents. This study aimed to compare the performance of three pre-training models that have recently attracted significant attention and to determine which model is more suitable for PHI recognition. Methods: We compared the PHI recognition performance of deep learning models using the i2b2 2014 dataset. We used the three pre-training models—namely, bidirectional encoder representations from transformers (BERT), robustly optimized BERT pre-training approach (RoBERTa), and XLNet (model built based on Transformer-XL)—to detect PHI. After the dataset was tokenized, it was processed using an inside-outside-beginning tagging scheme and WordPiece-tokenized to place it into these models. Further, the PHI recognition performance was investigated using BERT, RoBERTa, and XLNet. Results: Comparing the PHI recognition performance of the three models, it was confirmed that XLNet had a superior F1-score of 96.29%. In addition, when checking PHI entity performance evaluation, RoBERTa and XLNet showed a 30% improvement in performance compared to BERT. Conclusions: Among the pre-training models used in this study, XLNet exhibited superior performance because word embedding was well constructed using the two-stream self-attention method. In addition, compared to BERT, RoBERTa and XLNet showed superior performance, indicating that they were more effective in grasping the context.
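The preprocessing step described in the abstract (labeling words with an inside-outside-beginning scheme, then WordPiece-tokenizing them for the transformer models) can be sketched in plain Python. The toy vocabulary, helper names, and label-propagation convention below are illustrative assumptions, not the authors' implementation.

```python
# Sketch: IOB-labeled words are split into WordPiece-style subwords, and
# each label is propagated to the subwords. The tiny vocabulary here is an
# assumption for illustration only.

VOCAB = {"John", "Smith", "was", "ad", "##mit", "##ted", "in", "Bos", "##ton"}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first WordPiece segmentation."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary match for this span
        start = end
    return pieces

def tag_subwords(tokens, iob_labels):
    """Split each token into subwords; the first subword keeps the original
    IOB tag, and continuations of an entity become I- tags."""
    sub_tokens, sub_labels = [], []
    for token, label in zip(tokens, iob_labels):
        pieces = wordpiece(token)
        sub_tokens.extend(pieces)
        for i, _ in enumerate(pieces):
            if i == 0 or label == "O":
                sub_labels.append(label)
            else:
                sub_labels.append("I-" + label.split("-", 1)[1])
    return sub_tokens, sub_labels
```

For example, `tag_subwords(["John", "Smith", "was", "admitted"], ["B-NAME", "I-NAME", "O", "O"])` yields the subword sequence `["John", "Smith", "was", "ad", "##mit", "##ted"]` with labels `["B-NAME", "I-NAME", "O", "O", "O", "O"]`, which is the per-subword input/label alignment that token-classification fine-tuning of BERT-family models expects.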
- Files in This Item
- There are no files associated with this item.
- Appears in
Collections - College of IT Convergence > Department of Computer Engineering > 1. Journal Articles
Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.