Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence

Makara Mao; Sony Peng; Yixuan Yang; Doo-Soon Park

Authors: Makara Mao; Sony Peng; Yixuan Yang; Doo-Soon Park

Issue Date: Aug-2022

Publisher: Korea Information Processing Society (KIPS)

Keywords: Bi-directional Maximal Matching; Khmer Language; Natural Language Processing; Word Corpus; Word Segmentation

Citation: JIPS (Journal of Information Processing Systems), v.18, no.4, pp. 549-561

Pages: 13

Journal Title: JIPS(Journal of Information Processing Systems)

Volume: 18

Number: 4

Start Page: 549

End Page: 561

URI: https://scholarworks.bwise.kr/sch/handle/2021.sw.sch/21466

DOI: 10.3745/JIPS.04.0250

ISSN: 1976-913X; 2092-805X

Abstract: The Khmer script, the official script of Cambodia, is written from left to right without a space separator between words; it is complicated and requires further analysis. Without clear standard guidelines, spaces in Khmer are used inconsistently and informally to separate words in sentences. Therefore, a segmentation method should be discussed in combination with future Khmer natural language processing (NLP) to define appropriate rules for Khmer sentences. NLP's capacity for analyzing large volumes of language data makes it necessary in this scenario. One of the essential components of Khmer language processing is how to split a sentence into a series of words and count the words used in it. Currently, Microsoft Word cannot count Khmer words correctly. This study therefore presents a systematic library that segments Khmer phrases using the bi-directional maximal matching (BiMM) method to address these constraints. In the BiMM algorithm, the paper focuses on the bi-directional implementation of forward maximal matching (FMM) and backward maximal matching (BMM) to improve word segmentation accuracy. A digital tree or prefix tree data structure, also known as a trie, improves segmentation accuracy by finding the children of each word's parent node. The accuracy of BiMM is higher than that of FMM or BMM used independently; moreover, the proposed approach improves dictionary structures and reduces the number of errors. On 94,807 Khmer words, the approach reduces the error by 8.57% compared to the FMM and BMM algorithms.
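
The abstract names three pieces: a trie-backed dictionary, forward and backward maximal matching, and a rule that combines the two. The following is a minimal Python sketch of how those pieces could fit together, not the authors' library. All names (`Trie`, `fmm`, `bmm`, `bimm`) are hypothetical, and the tie-breaking rule (fewer words, then fewer single-character words, then the backward result) is a common BiMM convention assumed here, since the abstract does not state the paper's rule. A Latin toy dictionary stands in for the Khmer word list.

```python
END = ""  # sentinel key marking "a word ends here"; never collides with a 1-char key


class Trie:
    """Prefix tree over dictionary words, stored as nested dicts."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def longest_prefix_len(self, text, start):
        """Length of the longest dictionary word starting at text[start], or 0."""
        node, best = self.root, 0
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if END in node:
                best = i - start + 1
        return best


def fmm(text, trie):
    """Forward maximal matching: greedy longest match, scanning left to right."""
    out, i = [], 0
    while i < len(text):
        n = trie.longest_prefix_len(text, i) or 1  # unknown character: emit it alone
        out.append(text[i:i + n])
        i += n
    return out


def bmm(text, reversed_trie):
    """Backward maximal matching: run FMM on the reversed text, then undo the reversal."""
    return [w[::-1] for w in fmm(text[::-1], reversed_trie)][::-1]


def bimm(text, words):
    """Combine FMM and BMM using the assumed tie-breaking rule described above."""
    fwd = fmm(text, Trie(words))
    bwd = bmm(text, Trie(w[::-1] for w in words))
    if len(fwd) != len(bwd):
        return min(fwd, bwd, key=len)

    def singles(seg):
        return sum(len(w) == 1 for w in seg)

    return fwd if singles(fwd) < singles(bwd) else bwd


toy_words = ["ab", "abc", "cd", "d"]
print(fmm("abcd", Trie(toy_words)))                      # ['abc', 'd']
print(bmm("abcd", Trie(w[::-1] for w in toy_words)))     # ['ab', 'cd']
print(bimm("abcd", toy_words))                           # ['ab', 'cd']
```

In the toy run, the forward pass greedily takes "abc" and strands a single "d", while the backward pass finds "ab" + "cd"; the combined rule keeps the backward result because it has fewer single-character words. The sketch indexes by code point, so real Khmer input would likely need cluster-aware handling of combining signs and subscript consonants before matching.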
