Detailed Information


Structure-Aware Multimodal Sequential Learning for Visual Dialog

Full metadata record
DC Field: Value
dc.contributor.author: Kim, Young-Jin
dc.contributor.author: Kim, Min-Jun
dc.contributor.author: An, Kyunghwan
dc.contributor.author: Ahn, Jinwoo
dc.contributor.author: Kim, Jaeseok
dc.contributor.author: Heo, Yu-Jung
dc.contributor.author: Chang, Du-Seong
dc.contributor.author: Kim, Eun-Sol
dc.date.accessioned: 2024-11-28T08:27:22Z
dc.date.available: 2024-11-28T08:27:22Z
dc.date.issued: 2024-03
dc.identifier.issn: 2159-5399
dc.identifier.issn: 2374-3468
dc.identifier.uri: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/195035
dc.description.abstract: With the ability to collect vast amounts of image and natural language data from the web, there has been a remarkable advancement in Large-scale Language Models (LLMs). This progress has led to the emergence of chatbots and dialogue systems capable of fluent conversations with humans. As the variety of devices enabling interactions between humans and agents expands, and the performance of text-based dialogue systems improves, research on visual dialog has recently been proposed. However, visual dialog requires understanding sequences of pairs consisting of images and sentences, making it challenging to gather sufficient data for training large-scale models from the web. In this paper, we propose a new multimodal learning method leveraging existing large-scale models designed for each modality, to enable model training for visual dialog with small visual dialog datasets. The key ideas of our approach are: 1) storing the history or context during the progression of visual dialog in the form of spatiotemporal graphs, and 2) introducing small modulation blocks between modality-specific models and the graphs to align the semantic spaces. For implementation, we introduce a novel structure-aware cross-attention method, which retrieves relevant image and text knowledge for utterance generation from the pretrained models. In our experiments, we achieved new state-of-the-art performance on three visual dialog datasets, including the most challenging one, COMET.
dc.format.extent: 9
dc.language: English
dc.language.iso: ENG
dc.publisher: Association for the Advancement of Artificial Intelligence
dc.title: Structure-Aware Multimodal Sequential Learning for Visual Dialog
dc.type: Article
dc.publisher.location: United Kingdom
dc.identifier.doi: 10.1609/aaai.v38i12.29219
dc.identifier.scopusid: 2-s2.0-85189523321
dc.identifier.wosid: 001241515300025
dc.identifier.bibliographicCitation: Proceedings of the AAAI Conference on Artificial Intelligence, v.38, no.12, pp 13193 - 13201
dc.citation.title: Proceedings of the AAAI Conference on Artificial Intelligence
dc.citation.volume: 38
dc.citation.number: 12
dc.citation.startPage: 13193
dc.citation.endPage: 13201
dc.type.docType: Proceedings Paper
dc.description.isOpenAccess: N
dc.description.journalRegisteredClass: scopus
dc.relation.journalResearchArea: Computer Science
dc.relation.journalWebOfScienceCategory: Computer Science, Artificial Intelligence
dc.relation.journalWebOfScienceCategory: Computer Science, Theory & Methods
dc.subject.keywordPlus: Artificial intelligence
dc.subject.keywordPlus: Large datasets
dc.subject.keywordPlus: Learning systems
dc.subject.keywordPlus: Speech processing
dc.identifier.url: https://ojs.aaai.org/index.php/AAAI/article/view/29219
Appears in Collections
Seoul College of Engineering > Seoul School of Computer Software > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Kim, Eun Sol
COLLEGE OF ENGINEERING (SCHOOL OF COMPUTER SCIENCE)
