A Deep-Learned Embedding Technique for Categorical Features Encoding

Dahouda, Mwamba Kasongo; Joe, Inwhee

doi:10.1109/ACCESS.2021.3104357

Cited 4 times in Web of Science

Cited 7 times in Scopus

Full metadata record

DC Field	Value	Language
dc.contributor.author	Dahouda, Mwamba Kasongo	-
dc.contributor.author	Joe, Inwhee	-
dc.date.accessioned	2022-07-06T16:01:47Z	-
dc.date.available	2022-07-06T16:01:47Z	-
dc.date.created	2021-11-22	-
dc.date.issued	2021-08	-
dc.identifier.issn	2169-3536	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/141386	-
dc.description.abstract	Many machine learning algorithms and almost all deep learning architectures cannot process plain text in its raw form. Their input must be numerical in order to solve classification or regression problems. Hence, categorical variables must be encoded into numerical values using encoding techniques. Categorical features are common and often of high cardinality. One-hot encoding in such circumstances leads to very high-dimensional vector representations, raising memory and computability concerns for machine learning models. This paper proposes a deep-learned embedding technique for encoding categorical features on categorical datasets. Our technique is a distributed representation for categorical features in which each category is mapped to a distinct vector, and the properties of the vector are learned while training a neural network. First, we create a data vocabulary that includes only categorical data, and then we use word tokenization to turn each categorical value into a single word. After that, feature learning is introduced to map all of the categorical data from the vocabulary to word vectors. Three different datasets provided by the University of California Irvine (UCI) are used for training. The experimental results show that the proposed deep-learned embedding technique for categorical data achieves an F1 score of 89%, compared with 71% for one-hot encoding, in the case of the long short-term memory (LSTM) model. Moreover, the deep-learned embedding technique uses less memory and generates fewer features than one-hot encoding.	-
dc.language	English	-
dc.language.iso	en	-
dc.publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC	-
dc.title	A Deep-Learned Embedding Technique for Categorical Features Encoding	-
dc.type	Article	-
dc.contributor.affiliatedAuthor	Joe, Inwhee	-
dc.identifier.doi	10.1109/ACCESS.2021.3104357	-
dc.identifier.scopusid	2-s2.0-85113332296	-
dc.identifier.wosid	000686754800001	-
dc.identifier.bibliographicCitation	IEEE ACCESS, v.9, pp.114381 - 114391	-
dc.relation.isPartOf	IEEE ACCESS	-
dc.citation.title	IEEE ACCESS	-
dc.citation.volume	9	-
dc.citation.startPage	114381	-
dc.citation.endPage	114391	-
dc.type.rims	ART	-
dc.type.docType	Article	-
dc.description.journalClass	1	-
dc.description.isOpenAccess	Y	-
dc.description.journalRegisteredClass	scie	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Computer Science	-
dc.relation.journalResearchArea	Engineering	-
dc.relation.journalResearchArea	Telecommunications	-
dc.relation.journalWebOfScienceCategory	Computer Science, Information Systems	-
dc.relation.journalWebOfScienceCategory	Engineering, Electrical & Electronic	-
dc.relation.journalWebOfScienceCategory	Telecommunications	-
dc.subject.keywordPlus	Clustering algorithms	-
dc.subject.keywordPlus	Deep learning	-
dc.subject.keywordPlus	Embeddings	-
dc.subject.keywordPlus	Encoding (symbols)	-
dc.subject.keywordPlus	Learning systems	-
dc.subject.keywordPlus	Long short-term memory	-
dc.subject.keywordPlus	Signal encoding	-
dc.subject.keywordPlus	State assignment	-
dc.subject.keywordPlus	Vectors	-
dc.subject.keywordPlus	Categorical datasets	-
dc.subject.keywordPlus	Categorical features	-
dc.subject.keywordPlus	Categorical variables	-
dc.subject.keywordPlus	Distributed representation	-
dc.subject.keywordPlus	Embedding technique	-
dc.subject.keywordPlus	Learning architectures	-
dc.subject.keywordPlus	Machine learning models	-
dc.subject.keywordPlus	University of California	-
dc.subject.keywordPlus	Learning algorithms	-
dc.subject.keywordAuthor	Encoding	-
dc.subject.keywordAuthor	Numerical models	-
dc.subject.keywordAuthor	Machine learning	-
dc.subject.keywordAuthor	Data models	-
dc.subject.keywordAuthor	Training	-
dc.subject.keywordAuthor	Biological neural networks	-
dc.subject.keywordAuthor	Computational modeling	-
dc.subject.keywordAuthor	Data preprocessing	-
dc.subject.keywordAuthor	categorical variables	-
dc.subject.keywordAuthor	natural language processing	-
dc.subject.keywordAuthor	machine learning	-
dc.identifier.url	https://ieeexplore.ieee.org/document/9512057	-
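
The pipeline summarized in the abstract (build a vocabulary of categorical values, tokenize each value into a single word, then look up one vector per category) can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: the helper names (normalize, build_vocabulary, tokenize, init_embeddings, embed), the toy rows, and the embedding size of 2 are all assumptions. The lookup table is randomly initialized here, whereas the abstract describes the vectors as learned while training a neural network.

```python
import random


def normalize(value):
    # Collapse a categorical value into one lowercase token,
    # e.g. "New York" -> "new_york" (hypothetical tokenization rule).
    return "_".join(str(value).strip().lower().split())


def build_vocabulary(rows):
    # Assign each distinct token an integer index; 0 is reserved for unseen values.
    vocab = {}
    for row in rows:
        for value in row:
            token = normalize(value)
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def tokenize(row, vocab):
    # Replace each categorical value with its vocabulary index (0 if unknown).
    return [vocab.get(normalize(value), 0) for value in row]


def init_embeddings(vocab_size, dim, seed=0):
    # One small random vector per index. In the described technique these
    # weights would be updated by training; here they stay random (assumption).
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size + 1)]


def embed(row, vocab, table):
    # Concatenate the vectors of all categorical values in the row.
    return [x for index in tokenize(row, vocab) for x in table[index]]


# Toy data, invented for illustration only.
rows = [
    ("Red", "Small", "New York"),
    ("Blue", "Large", "San Francisco"),
    ("Red", "Large", "New York"),
]
vocab = build_vocabulary(rows)
table = init_embeddings(len(vocab), dim=2)
encoded = [embed(row, vocab, table) for row in rows]

print(vocab)
print(tokenize(rows[2], vocab))
print(len(encoded[0]))
```

In this sketch the encoded width is the number of categorical columns times the embedding size, independent of how many distinct categories exist, whereas a one-hot encoding grows with the cardinality of each column.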

Files in This Item

A_Deep-Learned_Embedding_Technique_for_Categorical_Features_Encoding.pdf 6.79 MB

Appears in Collections: Seoul College of Engineering > Seoul School of Computer Software > 1. Journal Articles

Related Researcher

Joe, Inwhee: COLLEGE OF ENGINEERING (SCHOOL OF COMPUTER SCIENCE)

Certain data included herein are derived from the © Web of Science of Clarivate Analytics. All rights reserved.
You may not copy or re-distribute this material in whole or in part without the prior written consent of Clarivate Analytics.
