Cap4Bridge: Caption-Guided Cross-Modal Contextualization with Stochastic Augmentation for Text-Video Retrieval

Jeon, Minju; Kim, Hyungee; Kim, Si-Woo; Oh, Youngtaek; Lee, Soeun; Kim, Dong-Jin

doi:10.1109/ACCESS.2026.3680911

Cap4Bridge: Caption-Guided Cross-Modal Contextualization with Stochastic Augmentation for Text-Video Retrieval (open access)

Authors: Jeon, Minju; Kim, Hyungee; Kim, Si-Woo; Oh, Youngtaek; Lee, Soeun; Kim, Dong-Jin

Issue Date: Apr-2026

Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Keywords: Broadcasting; Broadcast technology; Filtering; Filters; Videos; Video equipment; Text to video; TV; Video description; Telecommunications; Computer vision; text-video retrieval; cross-modal learning; semantic alignment

Citation: IEEE ACCESS, v.14, pp 54442 - 54453

Pages: 12

Indexed: SCIE; SCOPUS

Journal Title: IEEE ACCESS

Volume: 14

Start Page: 54442

End Page: 54453

URI: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/212541

DOI: 10.1109/ACCESS.2026.3680911

ISSN: 2169-3536

Abstract: A key challenge in text-video retrieval is bridging the semantic gap between information-rich videos and concise text queries. Existing methods often address this by incorporating auxiliary captions from Large Language Models (LLMs) or employing stochastic modeling. However, these approaches face critical challenges: captions can lack domain-specific relevance, while stochastic methods that directly model text embeddings risk distorting the original query's intent. To overcome these issues, we propose Cap4Bridge, a framework that leverages semantic anchors searched from a domain-specific caption anchor bank. Our framework introduces two key components: 1) Caption-Guided Cross-Modality Contextualization, which uses a shared co-attention mechanism to enrich both video and text representations with these anchors, and 2) Similarity-Aware Stochastic Augmentation, which applies Gaussian noise scaled by relevance to the searched semantic anchors rather than the query itself. This integrated strategy bridges the fundamental information imbalance by providing complementary context to both modalities and robustly expanding the semantic representation while preserving the original query's intent. Our method achieves strong retrieval performance across most benchmarks, including R@1 scores of 58.5% on MSRVTT, 51.3% on MSVD, and 63.8% on DiDeMo, demonstrating its high efficacy and generalizability, particularly in challenging cross-domain settings.
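
The second component described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (search_anchors, augment_anchors), the base_sigma parameter, the toy embeddings, and in particular the rule that the noise standard deviation grows linearly with cosine similarity are all assumptions made for illustration, since the abstract says only that the noise is "scaled by relevance". The co-attention component is omitted. The sketch shows anchors being searched from a bank and perturbed with Gaussian noise while the query embedding is left untouched.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search_anchors(query, anchor_bank, k=2):
    """Return the k (similarity, anchor) pairs most similar to the query."""
    scored = [(cosine(query, anchor), anchor) for anchor in anchor_bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def augment_anchors(scored_anchors, base_sigma=0.1, rng=None):
    """Add Gaussian noise to each anchor; the query is never modified.

    ASSUMPTION: the noise standard deviation is base_sigma times the
    (non-negative) similarity. The paper may use a different mapping.
    """
    rng = rng or random.Random()
    augmented = []
    for sim, anchor in scored_anchors:
        sigma = base_sigma * max(sim, 0.0)
        augmented.append([x + rng.gauss(0.0, sigma) for x in anchor])
    return augmented


# Toy 3-dimensional embeddings (hypothetical values).
query = [1.0, 0.0, 0.0]
bank = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.0, 0.7]]

top = search_anchors(query, bank, k=2)
noisy = augment_anchors(top, base_sigma=0.1, rng=random.Random(0))
```

Passing a seeded random.Random instance keeps the perturbation reproducible, and setting base_sigma to 0 returns the searched anchors unchanged, which is a convenient check that only the anchors, and never the query, are being perturbed.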

Appears in Collections: Seoul College of Engineering > ETC > 1. Journal Articles

Related Researcher

Kim, Dong Jin: COLLEGE OF ENGINEERING (DEPARTMENT OF INTELLIGENCE COMPUTING)
