Detailed Information


Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence

Authors
Makara Mao; Sony Peng; Yixuan Yang; Doo-Soon Park (박두순)
Issue Date
Aug-2022
Publisher
Korea Information Processing Society (한국정보처리학회)
Keywords
Bi-directional Maximal Matching; Khmer Language; Natural Language Processing; Word Corpus; Word Segmentation
Citation
JIPS (Journal of Information Processing Systems), v.18, no.4, pp. 549-561
Pages
13
Journal Title
JIPS (Journal of Information Processing Systems)
Volume
18
Number
4
Start Page
549
End Page
561
URI
https://scholarworks.bwise.kr/sch/handle/2021.sw.sch/21466
DOI
10.3745/JIPS.04.0250
ISSN
1976-913X
2092-805X
Abstract
Khmer script, the official script of Cambodia, is written from left to right without a space separator between words; it is complex and calls for further analysis. Without clear standard guidelines, spaces in Khmer are used inconsistently and informally to separate words in sentences. A segmentation method is therefore needed, together with future Khmer natural language processing (NLP) work, to define appropriate rules for Khmer sentences. NLP, with its capacity for large-scale language data analysis, is well suited to this scenario. One of the essential components of Khmer language processing is splitting a sentence into a series of words and counting the words used. Currently, Microsoft Word cannot count Khmer words correctly. This study therefore presents a systematic library that segments Khmer phrases using the bi-directional maximal matching (BiMM) method to address these constraints. The BiMM algorithm combines forward maximal matching (FMM) and backward maximal matching (BMM) to improve word segmentation accuracy. A digital tree or prefix tree data structure, also known as a trie, improves the segmentation procedure by finding the children of each parent node. The accuracy of BiMM is higher than that of FMM or BMM used independently; moreover, the proposed approach improves dictionary structures and reduces the number of errors. With a corpus of 94,807 Khmer words, the approach reduces the error by 8.57% compared to the FMM and BMM algorithms.
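The matching scheme described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' library: the class and function names (`Trie`, `fmm`, `bmm`, `bimm`) are hypothetical, the fallback of emitting an unmatched character as a one-character token is an assumption, and the tie-breaking rule (fewer tokens, then fewer single-character tokens, then the backward result) is a commonly used heuristic that the paper may or may not follow.

```python
from typing import Dict, Iterable, List


class Trie:
    """Prefix tree over dictionary words; each node maps a character to a child node."""

    def __init__(self) -> None:
        self.children: Dict[str, "Trie"] = {}
        self.is_word = False

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def longest_match(self, text: str, start: int) -> int:
        """Length of the longest dictionary word beginning at text[start], or 0."""
        node, best = self, 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best = i - start + 1
        return best


def fmm(text: str, trie: Trie) -> List[str]:
    """Forward maximal matching: greedily take the longest word, left to right."""
    tokens, i = [], 0
    while i < len(text):
        # Assumption: an unmatched character becomes a one-character token.
        n = trie.longest_match(text, i) or 1
        tokens.append(text[i:i + n])
        i += n
    return tokens


def bmm(text: str, reversed_trie: Trie) -> List[str]:
    """Backward maximal matching, done as FMM on the reversed text
    against a trie built from reversed dictionary words."""
    rev_tokens = fmm(text[::-1], reversed_trie)
    return [tok[::-1] for tok in reversed(rev_tokens)]


def bimm(text: str, words: Iterable[str]) -> List[str]:
    """Run FMM and BMM and keep the result preferred by the heuristic."""
    fwd_trie, rev_trie = Trie(), Trie()
    for w in words:
        fwd_trie.insert(w)
        rev_trie.insert(w[::-1])
    f, b = fmm(text, fwd_trie), bmm(text, rev_trie)
    if len(f) != len(b):
        return f if len(f) < len(b) else b  # fewer tokens wins

    def singles(seg: List[str]) -> int:
        return sum(len(tok) == 1 for tok in seg)

    # Tie on token count: fewer one-character tokens wins, else the backward result.
    return f if singles(f) < singles(b) else b
```

A word count for a sentence then follows as `len(bimm(sentence, dictionary))`. The sketch uses Latin letters in its checks only for readability; the same code operates on Khmer strings code point by code point, given a Khmer word list.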
Appears in
Collections
ETC > 1. Journal Articles
