Cap4Bridge: Caption-Guided Cross-Modal Contextualization with Stochastic Augmentation for Text-Video Retrieval (open access)

Authors
Jeon, Minju; Kim, Hyungee; Kim, Si-Woo; Oh, Youngtaek; Lee, Soeun; Kim, Dong-Jin
Issue Date
Apr-2026
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Keywords
Broadcasting; Broadcast technology; Filtering; Filters; Videos; Video equipment; Text to video; TV; Video description; Telecommunications; Computer vision; text-video retrieval; cross-modal learning; semantic alignment
Citation
IEEE Access, vol. 14, pp. 54442-54453
Pages
12
Indexed
SCIE
SCOPUS
Journal Title
IEEE ACCESS
Volume
14
Start Page
54442
End Page
54453
URI
https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/212541
DOI
10.1109/ACCESS.2026.3680911
ISSN
2169-3536
Abstract
A key challenge in text-video retrieval is bridging the semantic gap between information-rich videos and concise text queries. Existing methods often address this by incorporating auxiliary captions from Large Language Models (LLMs) or by employing stochastic modeling. However, these approaches face critical challenges: captions can lack domain-specific relevance, while stochastic methods that directly model text embeddings risk distorting the original query's intent. To overcome these issues, we propose Cap4Bridge, a framework that leverages semantic anchors searched from a domain-specific caption anchor bank. Our framework introduces two key components: 1) Caption-Guided Cross-Modality Contextualization, which uses a shared co-attention mechanism to enrich both video and text representations with these anchors, and 2) Similarity-Aware Stochastic Augmentation, which applies Gaussian noise scaled by relevance to the searched semantic anchors rather than to the query itself. This integrated strategy bridges the fundamental information imbalance by providing complementary context to both modalities and by robustly expanding the semantic representation while preserving the original query's intent. Our method achieves strong performance across most benchmarks, including R@1 scores of 58.5% on MSRVTT, 51.3% on MSVD, and 63.8% on DiDeMo, demonstrating its high efficacy and generalizability, particularly in challenging cross-domain settings.
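The second component described in the abstract, noise applied to the searched anchors rather than to the query, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`search_anchors`, `augment_anchors`), the cosine-similarity search, the top-k selection, the `base_sigma` parameter, and the rule "noise standard deviation = base_sigma x similarity" are all assumptions made for illustration, since the abstract does not specify how the noise is scaled.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search_anchors(query, anchor_bank, k=2):
    """Return the k (similarity, anchor) pairs most similar to the query.

    Hypothetical stand-in for searching a caption anchor bank.
    """
    scored = [(cosine(query, anchor), anchor) for anchor in anchor_bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def augment_anchors(scored_anchors, base_sigma=0.1, rng=None):
    """Add Gaussian noise to each anchor, scaled by its relevance.

    Assumption: the noise standard deviation is base_sigma * similarity.
    The query embedding is never modified.
    """
    rng = rng or random.Random()
    augmented = []
    for sim, anchor in scored_anchors:
        sigma = base_sigma * max(sim, 0.0)  # clamp negative similarity to zero
        augmented.append([x + rng.gauss(0.0, sigma) for x in anchor])
    return augmented


# Toy 3-dimensional embeddings (illustrative values only).
query = [1.0, 0.0, 0.0]
bank = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.0, 0.7]]

top = search_anchors(query, bank, k=2)
noisy = augment_anchors(top, base_sigma=0.1, rng=random.Random(0))
```

Because the perturbation touches only the anchors, the query vector stays exactly as given, which mirrors the abstract's stated goal of preserving the original query's intent.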
Appears in Collections
Seoul College of Engineering > ETC > 1. Journal Articles

Related Researcher

Kim, Dong Jin
COLLEGE OF ENGINEERING (DEPARTMENT OF INTELLIGENCE COMPUTING)