Cap4Bridge: Caption-Guided Cross-Modal Contextualization with Stochastic Augmentation for Text-Video Retrieval
Open access
- Authors
- Jeon, Minju; Kim, Hyungee; Kim, Si-Woo; Oh, Youngtaek; Lee, Soeun; Kim, Dong-Jin
- Issue Date
- Apr-2026
- Publisher
- IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
- Keywords
- Broadcasting; Broadcast technology; Filtering; Filters; Videos; Video equipment; Text to video; TV; Video description; Telecommunications; Computer vision; text-video retrieval; cross-modal learning; semantic alignment
- Citation
- IEEE ACCESS, v.14, pp 54442 - 54453
- Pages
- 12
- Indexed
- SCIE; SCOPUS
- Journal Title
- IEEE ACCESS
- Volume
- 14
- Start Page
- 54442
- End Page
- 54453
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/212541
- DOI
- 10.1109/ACCESS.2026.3680911
- ISSN
- 2169-3536
- Abstract
- A key challenge in text-video retrieval is bridging the semantic gap between information-rich videos and concise text queries. Existing methods often address this by incorporating auxiliary captions from Large Language Models (LLMs) or by employing stochastic modeling. However, these approaches face critical challenges: captions can lack domain-specific relevance, while stochastic methods that directly model text embeddings risk distorting the original query's intent. To overcome these issues, we propose Cap4Bridge, a framework that leverages semantic anchors retrieved from a domain-specific caption anchor bank. Our framework introduces two key components: 1) Caption-Guided Cross-Modality Contextualization, which uses a shared co-attention mechanism to enrich both video and text representations with these anchors, and 2) Similarity-Aware Stochastic Augmentation, which applies Gaussian noise scaled by relevance to the retrieved semantic anchors rather than to the query itself. This integrated strategy bridges the fundamental information imbalance by providing complementary context to both modalities and robustly expanding the semantic representation while preserving the original query's intent. Our method achieves strong results across most benchmarks, including R@1 scores of 58.5% on MSRVTT, 51.3% on MSVD, and 63.8% on DiDeMo, demonstrating its high efficacy and generalizability, particularly in challenging cross-domain settings.
- Appears in Collections
- Seoul College of Engineering > ETC > 1. Journal Articles
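
The Similarity-Aware Stochastic Augmentation idea named in the abstract (Gaussian noise applied to retrieved anchors, scaled by relevance, with the query left untouched) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the cosine-similarity search, the `base_sigma` parameter, and in particular the choice to shrink the noise as similarity grows are all assumptions, since the abstract does not specify the scaling rule.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def search_anchors(query, anchor_bank, k=3):
    """Hypothetical anchor search: top-k bank entries by cosine similarity."""
    sims = l2_normalize(anchor_bank) @ l2_normalize(query)  # (num_anchors,)
    top_idx = np.argsort(-sims)[:k]                          # most similar first
    return top_idx, anchor_bank[top_idx], sims[top_idx]

def augment_anchors(anchors, sims, base_sigma=0.1, rng=None):
    """Add Gaussian noise to the anchors only.

    Assumption: the noise standard deviation is base_sigma * (1 - similarity),
    so highly relevant anchors are perturbed less. The paper may scale differently.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = base_sigma * (1.0 - np.clip(sims, 0.0, 1.0))     # (k,)
    noise = rng.standard_normal(anchors.shape) * sigma[:, None]
    return anchors + noise

# Toy data standing in for caption and query embeddings.
rng = np.random.default_rng(0)
anchor_bank = rng.standard_normal((100, 16))
query = anchor_bank[7] + 0.05 * rng.standard_normal(16)
query_before = query.copy()

top_idx, anchors, sims = search_anchors(query, anchor_bank, k=3)
augmented = augment_anchors(anchors, sims, rng=rng)
```

Because the noise is added to the anchors and never to `query`, the query embedding is identical before and after augmentation, which mirrors the abstract's stated goal of preserving the original query's intent.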
