Improving FastText with inverse document frequency of subwords
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Choi J. | - |
dc.contributor.author | Lee S.-W. | - |
dc.date.available | 2020-04-06T07:38:11Z | - |
dc.date.created | 2020-04-02 | - |
dc.date.issued | 2020-05 | - |
dc.identifier.issn | 0167-8655 | - |
dc.identifier.uri | https://scholarworks.bwise.kr/gachon/handle/2020.sw.gachon/26421 | - |
dc.description.abstract | Word embedding is important in natural language processing, and word2vec is known as a representative algorithm. However, word2vec and many other dictionary-based word embedding algorithms create word vectors only for words that appear in the training data, ignoring the morphological features of these words. The FastText algorithm was previously proposed to solve this problem: it creates a word vector from subword vectors, making it possible to create word embeddings even for words never seen during training. Because it relies on morphological features, FastText is strong in syntactic tasks but weak in semantic tasks compared with word2vec. In this paper, we propose a method of improving FastText by using the inverse document frequency of subwords. Our approach is intended to overcome the weakness of FastText in semantic tasks. According to our experiments, the proposed method shows improved results in semantic tests with a small loss in syntactic tests. Our method can be applied to any word embedding algorithm that uses subwords. We additionally tested probabilistic FastText, an algorithm designed to distinguish words with multiple meanings, by adding the inverse document frequency, and the results confirmed improved performance. © 2020 Elsevier B.V. | - |
dc.language | English | - |
dc.language.iso | en | - |
dc.publisher | Elsevier B.V. | - |
dc.relation.isPartOf | Pattern Recognition Letters | - |
dc.title | Improving FastText with inverse document frequency of subwords | - |
dc.type | Article | - |
dc.type.rims | ART | - |
dc.description.journalClass | 1 | - |
dc.identifier.wosid | 000537129300023 | - |
dc.identifier.doi | 10.1016/j.patrec.2020.03.003 | - |
dc.identifier.bibliographicCitation | Pattern Recognition Letters, v.133, pp.165 - 172 | - |
dc.description.isOpenAccess | N | - |
dc.identifier.scopusid | 2-s2.0-85081114333 | - |
dc.citation.endPage | 172 | - |
dc.citation.startPage | 165 | - |
dc.citation.title | Pattern Recognition Letters | - |
dc.citation.volume | 133 | - |
dc.contributor.affiliatedAuthor | Choi J. | - |
dc.contributor.affiliatedAuthor | Lee S.-W. | - |
dc.type.docType | Article | - |
dc.subject.keywordAuthor | FastText | - |
dc.subject.keywordAuthor | Inverse document frequency | - |
dc.subject.keywordAuthor | Word embedding | - |
dc.subject.keywordAuthor | Word2vec | - |
dc.subject.keywordPlus | Embeddings | - |
dc.subject.keywordPlus | Natural language processing systems | - |
dc.subject.keywordPlus | Semantics | - |
dc.subject.keywordPlus | Syntactics | - |
dc.subject.keywordPlus | Embedding algorithms | - |
dc.subject.keywordPlus | FastText | - |
dc.subject.keywordPlus | Inverse Document Frequency | - |
dc.subject.keywordPlus | Morphological features | - |
dc.subject.keywordPlus | Natural language processing | - |
dc.subject.keywordPlus | Semantic tasks | - |
dc.subject.keywordPlus | Word embedding | - |
dc.subject.keywordPlus | Word2vec | - |
dc.subject.keywordPlus | Inverse problems | - |
dc.description.journalRegisteredClass | scie | - |
dc.description.journalRegisteredClass | scopus | - |
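
The abstract describes building a word vector from subword vectors and weighting subwords by their inverse document frequency. The record does not give the paper's formulas, so the following is only a minimal sketch of that general idea under stated assumptions: each vocabulary word is treated as one "document" of its character n-grams, the IDF is smoothed as `log(N / df) + 1`, and the word vector is an IDF-weighted average of subword vectors. The function names (`char_ngrams`, `subword_idf`, `word_vector`), the n-gram range, and the toy data are hypothetical and are not taken from the paper or from the FastText library.

```python
import math
import random
from collections import Counter


def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def subword_idf(vocabulary, n_min=3, n_max=4):
    """Smoothed IDF per subword.

    Assumption: each vocabulary word counts as one 'document', and the
    smoothing term (+1) keeps very common subwords from getting zero weight.
    """
    doc_freq = Counter()
    for word in vocabulary:
        doc_freq.update(set(char_ngrams(word, n_min, n_max)))
    total = len(vocabulary)
    return {gram: math.log(total / df) + 1.0 for gram, df in doc_freq.items()}


def word_vector(word, subword_vectors, idf, dim, default_idf=1.0):
    """IDF-weighted average of the known subword vectors of a word."""
    grams = [g for g in char_ngrams(word) if g in subword_vectors]
    if not grams:
        return [0.0] * dim  # no known subword: fall back to a zero vector
    weights = [idf.get(g, default_idf) for g in grams]
    weight_sum = sum(weights)
    vector = [0.0] * dim
    for gram, weight in zip(grams, weights):
        for k, value in enumerate(subword_vectors[gram]):
            vector[k] += weight * value / weight_sum
    return vector


if __name__ == "__main__":
    vocabulary = ["walking", "talking", "walked", "cat"]  # toy data
    idf = subword_idf(vocabulary)

    # Random stand-ins for trained subword embeddings (illustration only).
    rng = random.Random(0)
    dim = 4
    subword_vectors = {g: [rng.uniform(-1, 1) for _ in range(dim)] for g in idf}

    # "walks" is not in the vocabulary, but it shares subwords with "walking".
    print(word_vector("walks", subword_vectors, idf, dim))
```

In this sketch a suffix such as `ing`, which occurs in several vocabulary words, receives a lower weight than a rarer subword such as `<ca`, so frequent morphological fragments contribute less to the averaged vector.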