X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents

Authors
Moradshahi, Mehrad; Shen, Tianhao; Bali, Kalika; Choudhury, Monojit; de Chalendar, Gaël; Goel, Anmol; Kim, Sungkyun; Kodali, Prashant; Kumaraguru, Ponnurangam; Semmar, Nasredine; Semnani, Sina J.; Seo, Jiwon; Seshadri, Vivek; Shrivastava, Manish; Sun, Michael; Yadavalli, Aditya; You, Chaobin; Xiong, Deyi; Lam, Monica S.
Issue Date
Jul-2023
Citation
Association for Computational Linguistics (ACL). Annual Meeting Conference Proceedings, pp 2773 - 2794
Pages
22
Indexed
SCOPUS
Journal Title
Association for Computational Linguistics (ACL). Annual Meeting Conference Proceedings
Start Page
2773
End Page
2794
URI
https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/192898
DOI
10.18653/v1/2023.findings-acl.174
ISSN
0736-587X
Abstract
Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most multilingual prior work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with dictionary-based methods, along with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source.
Appears in Collections
College of Engineering (Seoul) > Division of Computer Software (Seoul) > 1. Journal Articles