Cap4Bridge: Caption-Guided Cross-Modal Contextualization with Stochastic Augmentation for Text-Video Retrieval

Jeon, Minju; Kim, Hyungee; Kim, Si-Woo; Oh, Youngtaek; Lee, Soeun; Kim, Dong-Jin

doi:10.1109/ACCESS.2026.3680911

Full metadata record

DC Field	Value	Language
dc.contributor.author	Jeon, Minju	-
dc.contributor.author	Kim, Hyungee	-
dc.contributor.author	Kim, Si-Woo	-
dc.contributor.author	Oh, Youngtaek	-
dc.contributor.author	Lee, Soeun	-
dc.contributor.author	Kim, Dong-Jin	-
dc.date.accessioned	2026-05-09T05:02:06Z	-
dc.date.available	2026-05-09T05:02:06Z	-
dc.date.issued	2026-04	-
dc.identifier.issn	2169-3536	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/212541	-
dc.description.abstract	A key challenge in text-video retrieval is bridging the semantic gap between information-rich videos and concise text queries. Existing methods often address this by incorporating auxiliary captions from Large Language Models (LLMs) or by employing stochastic modeling. However, these approaches face critical challenges: captions can lack domain-specific relevance, while stochastic methods that directly model text embeddings risk distorting the original query's intent. To overcome these issues, we propose Cap4Bridge, a framework that leverages semantic anchors retrieved from a domain-specific caption anchor bank. Our framework introduces two key components: 1) Caption-Guided Cross-Modality Contextualization, which uses a shared co-attention mechanism to enrich both video and text representations with these anchors, and 2) Similarity-Aware Stochastic Augmentation, which applies Gaussian noise scaled by relevance to the retrieved semantic anchors rather than to the query itself. This integrated strategy bridges the fundamental information imbalance by providing complementary context to both modalities and robustly expanding the semantic representation while preserving the original query's intent. Our method achieves strong results across most benchmarks, including R@1 scores of 58.5% on MSRVTT, 51.3% on MSVD, and 63.8% on DiDeMo, demonstrating its high efficacy and generalizability, particularly in challenging cross-domain settings.	-
dc.format.extent	12	-
dc.language	English	-
dc.language.iso	ENG	-
dc.publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC	-
dc.title	Cap4Bridge: Caption-Guided Cross-Modal Contextualization with Stochastic Augmentation for Text-Video Retrieval	-
dc.type	Article	-
dc.publisher.location	United States	-
dc.identifier.doi	10.1109/ACCESS.2026.3680911	-
dc.identifier.scopusid	2-s2.0-105035550342	-
dc.identifier.wosid	001740812900002	-
dc.identifier.bibliographicCitation	IEEE ACCESS, v.14, pp 54442 - 54453	-
dc.citation.title	IEEE ACCESS	-
dc.citation.volume	14	-
dc.citation.startPage	54442	-
dc.citation.endPage	54453	-
dc.type.docType	Article	-
dc.description.isOpenAccess	Y	-
dc.description.journalRegisteredClass	scie	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Computer Science	-
dc.relation.journalResearchArea	Engineering	-
dc.relation.journalResearchArea	Telecommunications	-
dc.relation.journalWebOfScienceCategory	Computer Science, Information Systems	-
dc.relation.journalWebOfScienceCategory	Engineering, Electrical & Electronic	-
dc.relation.journalWebOfScienceCategory	Telecommunications	-
dc.subject.keywordPlus	Benchmarking	-
dc.subject.keywordPlus	Gaussian noise (electronic)	-
dc.subject.keywordPlus	Image retrieval	-
dc.subject.keywordPlus	Learning systems	-
dc.subject.keywordPlus	Modeling languages	-
dc.subject.keywordPlus	Semantics	-
dc.subject.keywordPlus	Stochastic models	-
dc.subject.keywordPlus	Stochastic systems	-
dc.subject.keywordAuthor	Broadcasting	-
dc.subject.keywordAuthor	Broadcast technology	-
dc.subject.keywordAuthor	Filtering	-
dc.subject.keywordAuthor	Filters	-
dc.subject.keywordAuthor	Videos	-
dc.subject.keywordAuthor	Video equipment	-
dc.subject.keywordAuthor	Text to video	-
dc.subject.keywordAuthor	TV	-
dc.subject.keywordAuthor	Video description	-
dc.subject.keywordAuthor	Telecommunications	-
dc.subject.keywordAuthor	Computer vision	-
dc.subject.keywordAuthor	text-video retrieval	-
dc.subject.keywordAuthor	cross-modal learning	-
dc.subject.keywordAuthor	semantic alignment	-
dc.identifier.url	https://ieeexplore.ieee.org/document/11474843	-
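
The abstract describes Similarity-Aware Stochastic Augmentation as adding Gaussian noise, scaled by relevance, to the semantic anchors retrieved from a caption anchor bank, leaving the query embedding untouched. The following is a minimal sketch of that idea only, not the authors' implementation. Everything named here is hypothetical: the functions `cosine`, `search_anchors`, and `augment_anchors`, the parameters `k` and `base_sigma`, and the toy vectors. In particular, the choice of cosine similarity as the relevance score and of a noise standard deviation proportional to it are assumptions; the record does not state how the scaling is defined.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search_anchors(query, anchor_bank, k):
    """Return the k (similarity, anchor) pairs most similar to the query."""
    scored = [(cosine(query, a), a) for a in anchor_bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def augment_anchors(query, anchor_bank, k=2, base_sigma=0.1, rng=None):
    """Perturb the retrieved anchors with Gaussian noise; the query is not modified.

    Assumption: the noise standard deviation is base_sigma * max(sim, 0).
    The source only says the noise is "scaled by relevance".
    """
    rng = rng or random.Random()
    augmented = []
    for sim, anchor in search_anchors(query, anchor_bank, k):
        sigma = base_sigma * max(sim, 0.0)
        augmented.append([x + rng.gauss(0.0, sigma) for x in anchor])
    return augmented


# Toy 3-dimensional embeddings, made up for illustration.
query = [1.0, 0.0, 0.0]
bank = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.0, 0.7]]
noisy = augment_anchors(query, bank, k=2, rng=random.Random(0))
```

Setting `base_sigma=0.0` returns the retrieved anchors unchanged, which is a convenient way to check the retrieval step separately from the noise step.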

Appears in Collections: Seoul College of Engineering > ETC > 1. Journal Articles

Related Researcher

Kim, Dong Jin: COLLEGE OF ENGINEERING (DEPARTMENT OF INTELLIGENCE COMPUTING)