
SwahBERT: Language Model of Swahili

Full metadata record
DC Field | Value | Language
dc.contributor.author | Martin, Gati L. | -
dc.contributor.author | Mswahili, Medard E. | -
dc.contributor.author | Jeong, Young-Seob | -
dc.contributor.author | Woo, Jiyoung | -
dc.date.accessioned | 2022-11-29T06:41:18Z | -
dc.date.available | 2022-11-29T06:41:18Z | -
dc.date.created | 2022-11-28 | -
dc.date.issued | 2022-11 | -
dc.identifier.uri | https://scholarworks.bwise.kr/sch/handle/2021.sw.sch/21858 | -
dc.description.abstract | The rapid development of social networks, electronic commerce, mobile Internet, and other technologies has influenced the growth of Web data. Social media and Internet forums are valuable sources of citizens' opinions, which can be analyzed for community development and user behavior analysis. Unfortunately, the scarcity of resources (i.e., datasets or language models) has become a barrier to the development of natural language processing applications in low-resource languages. Thanks to the recent growth of online forums and news platforms of Swahili, we introduce two datasets of Swahili in this paper: a pre-training dataset of approximately 105MB with 16M words and an annotated dataset of 13K instances for the emotion classification task. The emotion classification dataset is manually annotated by two native Swahili speakers. We pre-trained a new monolingual language model for Swahili, namely SwahBERT, using our collected pre-training data, and tested it with four downstream tasks including emotion classification. We found that SwahBERT outperforms multilingual BERT, a well-known existing language model, in almost all downstream tasks. | -
dc.language | English | -
dc.language.iso | en | -
dc.publisher | ASSOC COMPUTATIONAL LINGUISTICS-ACL | -
dc.title | SwahBERT: Language Model of Swahili | -
dc.type | Article | -
dc.contributor.affiliatedAuthor | Woo, Jiyoung | -
dc.identifier.wosid | 000859869500023 | -
dc.identifier.bibliographicCitation | NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, pp.314 - + | -
dc.relation.isPartOf | NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES | -
dc.citation.title | NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES | -
dc.citation.startPage | 314 | -
dc.citation.endPage | + | -
dc.type.rims | ART | -
dc.type.docType | Proceedings Paper | -
dc.description.journalClass | 3 | -
dc.description.isOpenAccess | N | -
dc.relation.journalResearchArea | Computer Science | -
dc.relation.journalResearchArea | Linguistics | -
dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | -
dc.relation.journalWebOfScienceCategory | Computer Science, Interdisciplinary Applications | -
dc.relation.journalWebOfScienceCategory | Linguistics | -
Files in This Item
There are no files associated with this item.
Appears in Collections
SCH Media Labs > Department of Big Data Engineering > 1. Journal Articles

Related Researcher

Woo, Ji young
College of Software Convergence (Department of AI and Big Data)
