Detailed Information


Sensitive Data Identification in Structured Data through GenNER Model based on Text Generation and NER

Authors
Park, Ji sung; Kim, Gun woo; Lee, Dong ho
Issue Date
Apr-2020
Publisher
Association for Computing Machinery
Keywords
BiLSTM; CRF; DLP; Korean language; named entity recognition; NLP; Sensitive information; Structured Data; text generation
Citation
ACM International Conference Proceeding Series, pp. 36-40
Pages
5
Indexed
OTHER
Journal Title
ACM International Conference Proceeding Series
Start Page
36
End Page
40
URI
https://scholarworks.bwise.kr/erica/handle/2021.sw.erica/1815
DOI
10.1145/3398329.3398335
ISSN
0000-0000
Abstract
Many documents in organizations, from companies to governments, are shared through on-premise storage or the cloud. Some of these documents may contain sensitive information such as names, social security numbers, and addresses. In particular, a large amount of sensitive information written in Korean has been leaked recently, which can cause severe problems for individuals and organizations alike. Data loss prevention (DLP) is therefore needed to protect information. DLP systems based on pattern matching were popular in the past, but they have difficulty handling new types of sensitive data as they appear. To address this problem, sensitive data identification with named entity recognition (NER) has been proposed as a useful method for DLP systems. With NER, the words in a document can be classified into categories such as name and location, and these categories are treated as sensitive information. This approach performs well on unstructured data (e.g., sentences), which carries contextual information, but it is weak at identifying sensitive information in structured data (e.g., personal names in table cells). In practice, a large amount of sensitive information is organized as structured data, and the form of that structured data varies from document to document. NER also has difficulty identifying data written in Korean because of the characteristics of the language. We propose a primary preventive measure for DLP that identifies sensitive data in the tables of Korean documents by combining text generation and NER models, regardless of the form of the tables, and masks it so that documents can be shared without disclosing sensitive information. © 2020 ACM.
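The pipeline outlined in the abstract (give each table cell some sentence context, tag it with NER, then mask what is tagged as sensitive) can be illustrated with a minimal Python sketch. This is not the authors' GenNER model. The fixed sentence template stands in for the text generation model, `toy_tagger` stands in for a trained NER model such as a BiLSTM-CRF, and the label names, function names, and sample data are all hypothetical and chosen only so the example runs.

```python
import re
from typing import Callable, List, Set

# Hypothetical label set; the abstract only names "name, location and so on".
SENSITIVE_LABELS: Set[str] = {"PERSON", "LOCATION", "ID_NUMBER"}

# A tagger receives the generated sentence and the raw cell, and returns a label.
Tagger = Callable[[str, str], str]


def cell_to_sentence(header: str, cell: str) -> str:
    """Stand-in for the text generation step: wrap a bare cell in context.

    A fixed template is an assumption made for illustration only.
    """
    return f"The {header} is {cell}."


def toy_tagger(sentence: str, cell: str) -> str:
    """Stand-in for a trained NER model.

    It ignores the generated sentence and relies on a tiny gazetteer and one
    regex; a real model would use the sentence as context.
    """
    known_people = {"Hong Gildong", "Kim Cheolsu"}  # made-up sample names
    known_places = {"Seoul", "Ansan"}
    if cell in known_people:
        return "PERSON"
    if cell in known_places:
        return "LOCATION"
    if re.fullmatch(r"\d{6}-\d{7}", cell):  # 6 digits, hyphen, 7 digits
        return "ID_NUMBER"
    return "O"  # "outside": not a sensitive entity


def mask_table(rows: List[List[str]], tagger: Tagger = toy_tagger,
               mask: str = "***") -> List[List[str]]:
    """Return a copy of the table with sensitive cells replaced by `mask`.

    The first row is treated as the header, which is a simplifying assumption.
    """
    header, body = rows[0], rows[1:]
    masked = [list(header)]
    for row in body:
        out = []
        for col, cell in zip(header, row):
            label = tagger(cell_to_sentence(col, cell), cell)
            out.append(mask if label in SENSITIVE_LABELS else cell)
        masked.append(out)
    return masked


table = [
    ["name", "city", "resident number", "department"],
    ["Hong Gildong", "Seoul", "900101-1234567", "Sales"],  # fabricated sample row
]
masked_table = mask_table(table)
```

Passing the tagger as a parameter keeps the masking logic independent of the model, so a trained NER model could replace `toy_tagger` without changing `mask_table`.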
Appears in
Collections
COLLEGE OF COMPUTING > DEPARTMENT OF ARTIFICIAL INTELLIGENCE > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Lee, Dong Ho
ERICA College of Software Convergence (DEPARTMENT OF ARTIFICIAL INTELLIGENCE)