Sensitive Data Identification in Structured Data through GenNER Model based on Text Generation and NER
- Authors
- Park, Ji sung; Kim, Gun woo; Lee, Dong ho
- Issue Date
- Apr-2020
- Publisher
- Association for Computing Machinery
- Keywords
- BiLSTM; CRF; DLP; Korean language; named entity recognition; NLP; Sensitive information; Structured Data; text generation
- Citation
- ACM International Conference Proceeding Series, pp 36 - 40
- Pages
- 5
- Indexed
- OTHER
- Journal Title
- ACM International Conference Proceeding Series
- Start Page
- 36
- End Page
- 40
- URI
- https://scholarworks.bwise.kr/erica/handle/2021.sw.erica/1815
- DOI
- 10.1145/3398329.3398335
- ISSN
- 0000-0000
- Abstract
- Many organizations, from companies to governments, share large numbers of documents on on-premise storage or in the cloud. Some of these documents may contain sensitive information such as names, social security numbers, and addresses. In particular, a large amount of sensitive information written in Korean has been leaked in recent years, which can cause severe problems for both individuals and organizations. Data loss prevention (DLP) is therefore needed to protect information. DLP systems based on pattern matching were popular in the past, but they have difficulty handling new types of sensitive data as they appear. To address this problem, sensitive data identification with named entity recognition (NER) has been proposed as a useful method for DLP systems. NER classifies the words in a document into categories such as name and location, and these categories are treated as sensitive information. This approach performs well on unstructured data (e.g., sentences), which carries contextual information, but it is weak at identifying sensitive information in structured data (e.g., personal names in table cells). In practice, a large amount of sensitive information is organized as structured data, and the form of that structured data varies from document to document. NER also has difficulty identifying data written in Korean because of the characteristics of the language. We propose a primary preventive measure for DLP that identifies sensitive data in the tables of Korean documents by combining text generation and NER models, regardless of the form of the tables, and masks it so that documents can be shared without disclosing sensitive information. © 2020 ACM.
- Appears in Collections
- COLLEGE OF COMPUTING > DEPARTMENT OF ARTIFICIAL INTELLIGENCE > 1. Journal Articles
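
The idea described in the abstract (give a bare table cell some sentence context, tag it with NER, then mask what is tagged as sensitive) can be illustrated with a minimal sketch. This is not the authors' GenNER implementation: the fixed template in `cell_to_sentence` stands in for a learned text generation model, `toy_tagger` is a hypothetical dictionary lookup standing in for a trained NER model, and the sketch assumes a table with a header row, which the paper's approach is said not to require. All names and sample values are invented for illustration.

```python
from typing import Callable, List, Tuple

Table = List[List[str]]
# A tagger maps a sentence to (entity_text, label) pairs.
Tagger = Callable[[str], List[Tuple[str, str]]]

# Labels treated as sensitive in this sketch (assumption).
SENSITIVE_LABELS = {"PERSON", "LOCATION"}


def cell_to_sentence(header: str, value: str) -> str:
    """Wrap a bare cell in a sentence so a tagger sees some context.

    Assumption: a fixed template replaces a learned text generation model.
    """
    return f"The {header} of this record is {value}."


def toy_tagger(sentence: str) -> List[Tuple[str, str]]:
    """Hypothetical stand-in for a trained NER model: a dictionary lookup."""
    gazetteer = {"Kim Minsu": "PERSON", "Seoul": "LOCATION"}
    return [(text, label) for text, label in gazetteer.items() if text in sentence]


def mask_table(table: Table, tagger: Tagger, mask: str = "***") -> Table:
    """Return a copy of the table with sensitive entities in cells masked."""
    headers, rows = table[0], table[1:]
    masked = [list(headers)]
    for row in rows:
        new_row = []
        for header, value in zip(headers, row):
            sentence = cell_to_sentence(header, value)
            for text, label in tagger(sentence):
                # Mask only entities that come from the cell itself.
                if label in SENSITIVE_LABELS and text in value:
                    value = value.replace(text, mask)
            new_row.append(value)
        masked.append(new_row)
    return masked


table = [
    ["name", "city", "item"],
    ["Kim Minsu", "Seoul", "laptop"],
]
masked = mask_table(table, toy_tagger)
print(masked)
```

Passing the tagger in as a function keeps the masking logic independent of the model, so a real NER model could be swapped in without changing `mask_table`.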