Document Similarity Measure Based on the Earth Mover's Distance Utilizing Latent Dirichlet Allocation
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Jang, Min-Hee | - |
dc.contributor.author | Eom, Tae-Hwan | - |
dc.contributor.author | Kim, Sang-Wook | - |
dc.contributor.author | Hwang, Young-Sup | - |
dc.date.accessioned | 2022-07-15T19:01:14Z | - |
dc.date.available | 2022-07-15T19:01:14Z | - |
dc.date.created | 2021-05-14 | - |
dc.date.issued | 2016-01 | - |
dc.identifier.issn | 2040-7459 | - |
dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/155245 | - |
dc.description.abstract | Document similarity is used to search for documents similar to a given query document. Text-based document similarity is computed by comparing the words in documents. Cosine similarity is the most popular text-based document similarity measure; it computes the similarity of two documents from the frequencies of the words they share. Because it counts only exactly matching words, it cannot reflect the semantic similarity between different words that have the same meaning. We propose a new document similarity measure that solves this problem by using the Earth Mover’s Distance (EMD), which makes it possible to compute the semantic similarity of documents. To apply the EMD to a similarity measure, we must address its high computational complexity and define the distance between attributes. The high computational complexity comes from the large number of words in documents. We therefore extract topics from the documents using Latent Dirichlet Allocation (LDA), a generative model of documents. Since the number of topics is much smaller than the number of words, LDA helps reduce the computational complexity. We define the distance between topics using cosine similarity. Experimental results on real-world document databases show that the proposed measure finds similar documents more accurately than cosine similarity because it reflects semantic similarity. | - |
dc.language | Korean | - |
dc.language.iso | ko | - |
dc.publisher | Maxwell Scientific Publications | - |
dc.title | Document Similarity Measure Based on the Earth Mover's Distance Utilizing Latent Dirichlet Allocation | - |
dc.type | Article | - |
dc.contributor.affiliatedAuthor | Kim, Sang-Wook | - |
dc.identifier.doi | 10.19026/rjaset.12.2323 | - |
dc.identifier.bibliographicCitation | Research Journal of Applied Sciences, Engineering and Technology, v.12, no.2, pp.214 - 222 | - |
dc.relation.isPartOf | Research Journal of Applied Sciences, Engineering and Technology | - |
dc.citation.title | Research Journal of Applied Sciences, Engineering and Technology | - |
dc.citation.volume | 12 | - |
dc.citation.number | 2 | - |
dc.citation.startPage | 214 | - |
dc.citation.endPage | 222 | - |
dc.type.rims | ART | - |
dc.type.docType | Regular academic journal (Article, including Perspective Article) | - |
dc.description.journalClass | 1 | - |
dc.description.isOpenAccess | N | - |
dc.description.journalRegisteredClass | other | - |
dc.subject.keywordAuthor | Cosine similarity | - |
dc.subject.keywordAuthor | document similarity | - |
dc.subject.keywordAuthor | earth mover | - |
dc.subject.keywordAuthor | latent dirichlet allocation | - |
dc.subject.keywordAuthor | semantic similarity | - |
dc.identifier.url | https://maxwellsci.com/jp/mspabstract.php?doi=rjaset.12.2323 | - |
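
The pipeline described in the abstract (represent each document by its topic distribution, define a distance between topics from cosine similarity, then compare documents with the EMD) can be illustrated with a small sketch. This is not the authors' implementation. The function names `cosine_similarity`, `topic_distances`, and `emd`, the toy topic-word matrix, and the hand-made topic distributions are all hypothetical; no LDA model is actually fitted. The abstract says only that the topic distance is defined "using the cosine similarity", so taking it as 1 minus the cosine similarity of the topic-word vectors is an assumption, and the transport problem is solved here with a simple successive-shortest-path routine chosen for brevity.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def topic_distances(topic_word):
    """Assumed ground distance: 1 - cosine similarity of the topic-word vectors."""
    k = len(topic_word)
    return [[0.0 if i == j else 1.0 - cosine_similarity(topic_word[i], topic_word[j])
             for j in range(k)] for i in range(k)]

def emd(p, q, dist, eps=1e-12):
    """EMD between topic distributions p and q (each summing to 1).

    Solves the min-cost transport problem by successive shortest paths
    (Bellman-Ford on the residual graph); adequate for a handful of topics.
    """
    k = len(p)
    supply, demand = list(p), list(q)
    flow = [[0.0] * k for _ in range(k)]
    inf = float("inf")
    while max(supply) > eps and max(demand) > eps:
        # Nodes 0..k-1 are the topics of p (sources); k..2k-1 the topics of q (sinks).
        best = [0.0 if supply[i] > eps else inf for i in range(k)] + [inf] * k
        prev = [None] * (2 * k)
        for _ in range(2 * k):
            changed = False
            for i in range(k):
                for j in range(k):
                    # Forward edge i -> j: ship more mass at cost dist[i][j].
                    if best[i] + dist[i][j] < best[k + j] - eps:
                        best[k + j], prev[k + j], changed = best[i] + dist[i][j], i, True
                    # Backward edge j -> i: undo mass already shipped from i to j.
                    if flow[i][j] > eps and best[k + j] - dist[i][j] < best[i] - eps:
                        best[i], prev[i], changed = best[k + j] - dist[i][j], k + j, True
            if not changed:
                break
        # Cheapest sink that still needs mass, then walk the path back to its source.
        end = min((k + j for j in range(k) if demand[j] > eps), key=lambda n: best[n])
        path = [end]
        while prev[path[-1]] is not None:
            path.append(prev[path[-1]])
        path.reverse()
        # The amount moved is limited by supply, demand, and any flow being undone.
        amount = min(supply[path[0]], demand[end - k])
        for a, b in zip(path, path[1:]):
            if a >= k:
                amount = min(amount, flow[b][a - k])
        for a, b in zip(path, path[1:]):
            if a < k:
                flow[a][b - k] += amount
            else:
                flow[b][a - k] -= amount
        supply[path[0]] -= amount
        demand[end - k] -= amount
    return sum(flow[i][j] * dist[i][j] for i in range(k) for j in range(k))

# Hypothetical topics over the vocabulary [car, automobile, engine, recipe].
topic_word = [
    [0.5, 0.1, 0.4, 0.0],  # topic 0: mostly "car"
    [0.1, 0.5, 0.4, 0.0],  # topic 1: mostly "automobile"
    [0.0, 0.0, 0.0, 1.0],  # topic 2: cooking
]
dist = topic_distances(topic_word)

# Hypothetical per-document topic distributions (what LDA inference would supply).
doc_a, doc_b, doc_c = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

print("cosine a-b:", cosine_similarity(doc_a, doc_b), " a-c:", cosine_similarity(doc_a, doc_c))
print("EMD    a-b:", emd(doc_a, doc_b, dist), " a-c:", emd(doc_a, doc_c, dist))
```

In this toy setup, cosine similarity applied directly to the topic vectors scores both pairs as 0 because the documents share no topic. The EMD places `doc_a` much closer to `doc_b` (about 0.38) than to `doc_c` (1.0), because the two car-related topics have similar word distributions.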