Structure-Aware Multimodal Sequential Learning for Visual Dialog

Kim, Young-Jin; Kim, Min-Jun; An, Kyunghwan; Ahn, Jinwoo; Kim, Jaeseok; Heo, Yu-Jung; Chang, Du-Seong; Kim, Eun-Sol

doi:10.1609/aaai.v38i12.29219

Full metadata record

DC Field	Value	Language
dc.contributor.author	Kim, Young-Jin	-
dc.contributor.author	Kim, Min-Jun	-
dc.contributor.author	An, Kyunghwan	-
dc.contributor.author	Ahn, Jinwoo	-
dc.contributor.author	Kim, Jaeseok	-
dc.contributor.author	Heo, Yu-Jung	-
dc.contributor.author	Chang, Du-Seong	-
dc.contributor.author	Kim, Eun-Sol	-
dc.date.accessioned	2024-11-28T08:27:22Z	-
dc.date.available	2024-11-28T08:27:22Z	-
dc.date.issued	2024-03	-
dc.identifier.issn	2159-5399	-
dc.identifier.issn	2374-3468	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/195035	-
dc.description.abstract	The ability to collect vast amounts of image and natural language data from the web has driven remarkable advances in Large-scale Language Models (LLMs). This progress has led to the emergence of chatbots and dialogue systems capable of fluent conversations with humans. As the variety of devices enabling interactions between humans and agents expands, and the performance of text-based dialogue systems improves, research on visual dialog has recently been proposed. However, visual dialog requires understanding sequences of image-sentence pairs, which makes it challenging to gather sufficient data from the web for training large-scale models. In this paper, we propose a new multimodal learning method that leverages existing large-scale models designed for each modality to enable model training for visual dialog with small visual dialog datasets. The key ideas of our approach are: 1) storing the history or context during the progression of visual dialog in the form of spatiotemporal graphs, and 2) introducing small modulation blocks between modality-specific models and the graphs to align the semantic spaces. For implementation, we introduce a novel structure-aware cross-attention method, which retrieves image and text knowledge relevant to utterance generation from the pretrained models. In experiments, we achieve new state-of-the-art performance on three visual dialog datasets, including the most challenging one, COMET.	-
dc.format.extent	9	-
dc.language	English	-
dc.language.iso	ENG	-
dc.publisher	Association for the Advancement of Artificial Intelligence	-
dc.title	Structure-Aware Multimodal Sequential Learning for Visual Dialog	-
dc.type	Article	-
dc.publisher.location	United Kingdom	-
dc.identifier.doi	10.1609/aaai.v38i12.29219	-
dc.identifier.scopusid	2-s2.0-85189523321	-
dc.identifier.wosid	001241515300025	-
dc.identifier.bibliographicCitation	Proceedings of the AAAI Conference on Artificial Intelligence, v.38, no.12, pp 13193 - 13201	-
dc.citation.title	Proceedings of the AAAI Conference on Artificial Intelligence	-
dc.citation.volume	38	-
dc.citation.number	12	-
dc.citation.startPage	13193	-
dc.citation.endPage	13201	-
dc.type.docType	Proceedings Paper	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Computer Science	-
dc.relation.journalWebOfScienceCategory	Computer Science, Artificial Intelligence	-
dc.relation.journalWebOfScienceCategory	Computer Science, Theory & Methods	-
dc.subject.keywordPlus	Artificial intelligence	-
dc.subject.keywordPlus	Large datasets	-
dc.subject.keywordPlus	Learning systems	-
dc.subject.keywordPlus	Speech processing	-
dc.identifier.url	https://ojs.aaai.org/index.php/AAAI/article/view/29219	-

Appears in Collections: Seoul College of Engineering > Seoul School of Computer Software > 1. Journal Articles

Related Researcher

Kim, Eun Sol: COLLEGE OF ENGINEERING (SCHOOL OF COMPUTER SCIENCE)
