Sensitive Data Identification in Structured Data through GenNER Model based on Text Generation and NER

Park, Ji sung; Kim, Gun woo; Lee, Dong ho

doi:10.1145/3398329.3398335

Authors: Park, Ji sung; Kim, Gun woo; Lee, Dong ho

Issue Date: Apr-2020

Publisher: Association for Computing Machinery

Keywords: BiLSTM; CRF; DLP; Korean language; named entity recognition; NLP; Sensitive information; Structured Data; text generation

Citation: ACM International Conference Proceeding Series, pp. 36-40

Pages: 5

Indexed: OTHER

Journal Title: ACM International Conference Proceeding Series

Start Page: 36

End Page: 40

URI: https://scholarworks.bwise.kr/erica/handle/2021.sw.erica/1815

DOI: 10.1145/3398329.3398335

ISSN: 0000-0000

Abstract: Many documents in organizations, from companies to governments, are shared on on-premises storage or in the cloud. Some of those documents may contain sensitive information such as names, social security numbers, and addresses. In particular, a large amount of sensitive information written in Korean has been leaked recently, which can cause severe problems not only for individuals but also for organizations. Data loss prevention (DLP) is therefore needed to protect information. DLP systems based on pattern matching were popular in the past, but they have difficulty handling new types of sensitive data as they appear. To address this problem, sensitive data identification with named entity recognition (NER) has been proposed as a useful method for DLP systems. With NER, the words in a document can be classified into categories such as name and location, and these categories are treated as sensitive information. This approach performs well at identifying sensitive information in unstructured data (e.g., sentences), which carry contextual information, but it is weak at identifying sensitive information in structured data (e.g., personal names in table cells). In practice, a large amount of sensitive information is organized as structured data, and the form of that structured data varies from document to document. Furthermore, NER also has difficulty identifying data written in Korean because of the characteristics of the language. We propose a primary preventive measure for DLP that identifies sensitive data in the tables of Korean documents by combining text generation and NER models, regardless of the form of the tables, and masks it so that documents can be shared without disclosing sensitive information. © 2020 ACM.
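As a rough illustration of the idea in the abstract (give a table cell some sentence context, run NER on the result, then mask cells that contain tagged entities), here is a minimal Python sketch. It is not the authors' GenNER model: the template in `generate_sentence` stands in for a learned text generation model, `toy_ner` stands in for a trained tagger (the keywords mention BiLSTM and CRF), and the name list, the ID-number pattern, and all function names are hypothetical.

```python
import re
from typing import Callable, List

# Hypothetical stand-ins for what a trained NER model would learn.
TOY_NAMES = ("Kim Minsu", "Lee Jiwoo")
ID_LIKE = re.compile(r"\b\d{6}-\d{7}\b")  # assumed shape: six digits, hyphen, seven digits


def generate_sentence(cell: str) -> str:
    """Template-based stand-in for text generation: wrap a bare cell in context."""
    return f"The record mentions {cell} in this document."


def toy_ner(sentence: str) -> List[str]:
    """Toy tagger: return the substrings it considers sensitive entities."""
    found = [name for name in TOY_NAMES if name in sentence]
    found += ID_LIKE.findall(sentence)
    return found


def mask_table(
    table: List[List[str]],
    tagger: Callable[[str], List[str]] = toy_ner,
    mask: str = "***",
) -> List[List[str]]:
    """Mask every cell for which the tagger finds an entity inside the cell text."""
    masked = []
    for row in table:
        new_row = []
        for cell in row:
            entities = tagger(generate_sentence(cell))
            # Only entities that come from the cell itself count, not from the template.
            sensitive = any(entity in cell for entity in entities)
            new_row.append(mask if sensitive else cell)
        masked.append(new_row)
    return masked


table = [
    ["Name", "ID", "Dept"],
    ["Kim Minsu", "900101-1234567", "Sales"],
]
print(mask_table(table))
```

Because each cell is processed on its own, the sketch does not depend on the layout of the table, which mirrors the abstract's goal of working regardless of table form. A real system would replace both stand-ins with trained models.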

Appears in Collections: COLLEGE OF COMPUTING > DEPARTMENT OF ARTIFICIAL INTELLIGENCE > 1. Journal Articles

Related Researcher

Lee, Dong Ho: ERICA College of Software Convergence (DEPARTMENT OF ARTIFICIAL INTELLIGENCE)
