VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Ma, M. | - |
dc.contributor.author | Yoon, S. | - |
dc.contributor.author | Kim, J. | - |
dc.contributor.author | Lee, Y. | - |
dc.contributor.author | Kang, S. | - |
dc.contributor.author | Yoo, C.D. | - |
dc.date.accessioned | 2023-03-08T13:53:00Z | - |
dc.date.available | 2023-03-08T13:53:00Z | - |
dc.date.issued | 2020-08 | - |
dc.identifier.issn | 0302-9743 | - |
dc.identifier.issn | 1611-3349 | - |
dc.identifier.uri | https://scholarworks.bwise.kr/cau/handle/2019.sw.cau/63280 | - |
dc.description.abstract | Video Moment Retrieval (VMR) is the task of localizing the temporal moment in an untrimmed video specified by a natural language query. Several fully supervised methods have been proposed for VMR. Unfortunately, acquiring a large number of training videos with labeled temporal boundaries for each query is labor-intensive. This paper explores a method for performing VMR in a weakly-supervised manner (wVMR): training is performed without temporal moment labels, using only the text query that describes a segment of the video. Existing wVMR methods generate multi-scale proposals and apply a query-guided attention mechanism to highlight the most relevant proposal. To leverage the weak supervision, contrastive learning is used, which predicts higher scores for correct video-query pairs than for incorrect pairs. It has been observed that a large number of candidate proposals, a coarse query representation, and a one-way attention mechanism lead to a blurry attention map, which limits localization performance. To address this issue, the Video-Language Alignment Network (VLANet) is proposed, which learns sharper attention by pruning out spurious candidate proposals and applying a multi-directional attention mechanism with a fine-grained query representation. The Surrogate Proposal Selection module selects a proposal based on its proximity to the query in the joint embedding space, substantially reducing the number of candidate proposals, which lowers the computational load and sharpens attention. Next, the Cascaded Cross-modal Attention module considers dense feature interactions and multi-directional attention flows to learn the multi-modal alignment. VLANet is trained end-to-end using a contrastive loss that encourages semantically similar videos and queries to cluster (a minimal code sketch of this objective follows the table). Experiments show that the method achieves state-of-the-art performance on the Charades-STA and DiDeMo datasets. | - |
dc.format.extent | 16 | - |
dc.language | English | - |
dc.language.iso | ENG | - |
dc.publisher | Springer Science and Business Media Deutschland GmbH | - |
dc.title | VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval | - |
dc.type | Article | - |
dc.identifier.doi | 10.1007/978-3-030-58604-1_10 | - |
dc.identifier.bibliographicCitation | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), v.12373, pp. 156-171 | - |
dc.description.isOpenAccess | N | - |
dc.identifier.scopusid | 2-s2.0-85097104217 | - |
dc.citation.endPage | 171 | - |
dc.citation.startPage | 156 | - |
dc.citation.title | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | - |
dc.citation.volume | 12373 | - |
dc.type.docType | Conference Paper | - |
dc.publisher.location | United States | - |
dc.subject.keywordAuthor | Multi-modal learning | - |
dc.subject.keywordAuthor | Video moment retrieval | - |
dc.subject.keywordAuthor | Weakly-supervised learning | - |
dc.description.journalRegisteredClass | scopus | - |
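
The abstract names two mechanisms: Surrogate Proposal Selection, which keeps only the candidate proposals close to the query in the joint embedding space, and a contrastive loss that scores matching video-query pairs above mismatched ones. Below is a minimal Python/PyTorch sketch of those two ideas, not the authors' implementation; the function names, the cosine-similarity scoring, and the margin value are illustrative assumptions.

```python
# Hedged sketch, NOT the paper's code: (1) surrogate proposal selection by
# proximity to the query in a joint embedding space, and (2) a hinge-style
# contrastive loss that ranks the correct video-query pair above incorrect
# ones. All names, dimensions, and the margin value are assumptions.
import torch
import torch.nn.functional as F

def select_surrogate_proposals(proposal_emb, query_emb, keep_k):
    """Keep the k candidate-moment embeddings closest to the query.

    proposal_emb: (num_proposals, dim) embeddings of candidate moments
    query_emb:    (dim,) embedding of the sentence query
    """
    sim = F.cosine_similarity(proposal_emb, query_emb.unsqueeze(0), dim=1)
    topk = sim.topk(keep_k).indices
    return proposal_emb[topk], sim[topk]

def contrastive_loss(pos_score, neg_scores, margin=0.4):
    """The matching pair should score at least `margin` above every
    mismatched pair; violations are penalized linearly."""
    return torch.clamp(margin - pos_score + neg_scores, min=0).mean()

# Toy usage with random features: 20 proposals in a 256-d joint space.
torch.manual_seed(0)
proposals = torch.randn(20, 256)
query = torch.randn(256)
kept, scores = select_surrogate_proposals(proposals, query, keep_k=5)
pos = scores.max()            # best-aligned surviving proposal
neg = torch.rand(8) - 0.5     # stand-in scores for mismatched pairs
print(kept.shape, contrastive_loss(pos, neg).item())
```

Pruning before attention is what the abstract credits for the sharper attention map: with fewer, better-aligned candidates, the downstream cross-modal attention spreads its mass over less noise.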