SwahBERT: Language Model of Swahili

Martin, Gati L.; Mswahili, Medard E.; Jeong, Young-Seob; Woo, Jiyoung

Detailed Information

Cited 0 times in Web of Science

Cited 0 times in Scopus

Full metadata record

DC Field	Value	Language
dc.contributor.author	Martin, Gati L.	-
dc.contributor.author	Mswahili, Medard E.	-
dc.contributor.author	Jeong, Young-Seob	-
dc.contributor.author	Woo, Jiyoung	-
dc.date.accessioned	2022-11-29T06:41:18Z	-
dc.date.available	2022-11-29T06:41:18Z	-
dc.date.created	2022-11-28	-
dc.date.issued	2022-11	-
dc.identifier.uri	https://scholarworks.bwise.kr/sch/handle/2021.sw.sch/21858	-
dc.description.abstract	The rapid development of social networks, electronic commerce, mobile Internet, and other technologies has influenced the growth of Web data. Social media and Internet forums are valuable sources of citizens' opinions, which can be analyzed for community development and user behavior analysis. Unfortunately, the scarcity of resources (i.e., datasets or language models) has become a barrier to the development of natural language processing applications in low-resource languages. Thanks to the recent growth of online forums and news platforms of Swahili, we introduce two datasets of Swahili in this paper: a pre-training dataset of approximately 105MB with 16M words and an annotated dataset of 13K instances for the emotion classification task. The emotion classification dataset is manually annotated by two native Swahili speakers. We pre-trained a new monolingual language model for Swahili, namely SwahBERT, using our collected pre-training data, and tested it with four downstream tasks including emotion classification. We found that SwahBERT outperforms multilingual BERT, a well-known existing language model, in almost all downstream tasks.	-
dc.language	English	-
dc.language.iso	en	-
dc.publisher	ASSOC COMPUTATIONAL LINGUISTICS-ACL	-
dc.title	SwahBERT: Language Model of Swahili	-
dc.type	Article	-
dc.contributor.affiliatedAuthor	Woo, Jiyoung	-
dc.identifier.wosid	000859869500023	-
dc.identifier.bibliographicCitation	NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, pp.314 - +	-
dc.relation.isPartOf	NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES	-
dc.citation.title	NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES	-
dc.citation.startPage	314	-
dc.citation.endPage	+	-
dc.type.rims	ART	-
dc.type.docType	Proceedings Paper	-
dc.description.journalClass	3	-
dc.description.isOpenAccess	N	-
dc.relation.journalResearchArea	Computer Science	-
dc.relation.journalResearchArea	Linguistics	-
dc.relation.journalWebOfScienceCategory	Computer Science, Artificial Intelligence	-
dc.relation.journalWebOfScienceCategory	Computer Science, Interdisciplinary Applications	-
dc.relation.journalWebOfScienceCategory	Linguistics	-

Files in This Item: There are no files associated with this item.

Appears in Collections: SCH Media Labs > Department of Big Data Engineering > 1. Journal Articles

Related Researcher

Woo, Ji young: College of Software Convergence (Department of AI and Big Data)

Certain data included herein are derived from the © Web of Science of Clarivate Analytics. All rights reserved.
You may not copy or re-distribute this material in whole or in part without the prior written consent of Clarivate Analytics.
