Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model
- Authors
- Seo Hyun Oh; Min Kang; Youngho Lee
- Issue Date
- Jan-2022
- Publisher
- Korean Society of Medical Informatics
- Keywords
- Artificial Intelligence; Big Data; Medical Informatics; Data Anonymization; Deep Learning
- Citation
- Healthcare Informatics Research, v.28, no.1, pp.16 - 24
- Journal Title
- Healthcare Informatics Research
- Volume
- 28
- Number
- 1
- Start Page
- 16
- End Page
- 24
- URI
- https://scholarworks.bwise.kr/gachon/handle/2020.sw.gachon/83803
- DOI
- 10.4258/hir.2022.28.1.16
- ISSN
- 2093-3681
- Abstract
- Objectives: De-identifying protected health information (PHI) in medical documents is important, and a prerequisite to de-identification is the identification of PHI entity names in clinical documents. This study aimed to compare the performance of three pre-training models that have recently attracted significant attention and to determine which model is more suitable for PHI recognition. Methods: We compared the PHI recognition performance of deep learning models using the i2b2 2014 dataset. We used the three pre-training models—namely, bidirectional encoder representations from transformers (BERT), robustly optimized BERT pre-training approach (RoBERTa), and XLNet (model built based on Transformer-XL)—to detect PHI. After the dataset was tokenized, it was processed using an inside-outside-beginning tagging scheme and WordPiece-tokenized to place it into these models. Further, the PHI recognition performance was investigated using BERT, RoBERTa, and XLNet. Results: Comparing the PHI recognition performance of the three models, it was confirmed that XLNet had a superior F1-score of 96.29%. In addition, when checking PHI entity performance evaluation, RoBERTa and XLNet showed a 30% improvement in performance compared to BERT. Conclusions: Among the pre-training models used in this study, XLNet exhibited superior performance because word embedding was well constructed using the two-stream self-attention method. In addition, compared to BERT, RoBERTa and XLNet showed superior performance, indicating that they were more effective in grasping the context.
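The preprocessing step described in the abstract (labeling words with an inside-outside-beginning scheme, then WordPiece-tokenizing them for the transformer models) can be sketched in plain Python. The toy vocabulary, helper names, and label-propagation convention below are illustrative assumptions, not the authors' implementation.

```python
# Sketch: IOB-labeled words are split into WordPiece-style subwords, and
# each label is propagated to the subwords. The tiny vocabulary here is an
# assumption for illustration only.

VOCAB = {"John", "Smith", "was", "ad", "##mit", "##ted", "in", "Bos", "##ton"}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first WordPiece segmentation."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary match for this span
        start = end
    return pieces

def tag_subwords(tokens, iob_labels):
    """Split each token into subwords; the first subword keeps the original
    IOB tag, and continuations of an entity become I- tags."""
    sub_tokens, sub_labels = [], []
    for token, label in zip(tokens, iob_labels):
        pieces = wordpiece(token)
        sub_tokens.extend(pieces)
        for i, _ in enumerate(pieces):
            if i == 0 or label == "O":
                sub_labels.append(label)
            else:
                sub_labels.append("I-" + label.split("-", 1)[1])
    return sub_tokens, sub_labels
```

For example, `tag_subwords(["John", "Smith", "was", "admitted"], ["B-NAME", "I-NAME", "O", "O"])` yields the subword sequence `["John", "Smith", "was", "ad", "##mit", "##ted"]` with labels `["B-NAME", "I-NAME", "O", "O", "O", "O"]`, which is the per-subword input/label alignment that token-classification fine-tuning of BERT-family models expects.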
- Files in This Item
- There are no files associated with this item.
- Appears in
Collections - College of IT Convergence > Department of Computer Engineering > 1. Journal Articles
Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.