Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning (open access)
- Authors
- Jeon, MinJu; Kim, Si-Woo; Kim, Ye-Chan; Kim, HyunGee; Kim, Dong-Jin
- Issue Date
- Nov-2025
- Publisher
- Association for Computational Linguistics
- Citation
- EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp 25777 - 25790
- Pages
- 14
- Indexed
- SCOPUS
- Journal Title
- EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
- Start Page
- 25777
- End Page
- 25790
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/213292
- DOI
- 10.18653/v1/2025.emnlp-main.1308
- Abstract
- Dense video captioning aims to temporally localize events in video and generate captions for each event. While recent works propose end-to-end models, they suffer from two limitations: (1) applying timestamp supervision only to text while treating all video frames equally, and (2) retrieving captions from fixed-size video chunks, overlooking scene transitions. To address these, we propose **Sali4Vid**, a simple yet effective saliency-aware framework. We introduce Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame importance weights, and Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to capture scene transitions and improve caption retrieval. Sali4Vid achieves state-of-the-art results on YouCook2 and ViTT, demonstrating the benefit of jointly improving video weighting and retrieval for dense video captioning.
- Appears in Collections
- College of Engineering (Seoul) > ETC > 1. Journal Articles
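
The two components named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the product-of-two-sigmoids weighting, the cosine-similarity threshold, and every name and parameter here (`saliency_weights`, `segment_by_similarity`, `sharpness`, `threshold`, `fps`) are assumptions chosen for illustration only.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function (avoids overflow for large |x|).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def saliency_weights(num_frames, start, end, fps=1.0, sharpness=1.0):
    """Soft per-frame importance weights from one (start, end) timestamp pair.

    Assumed form: a rising sigmoid at `start` multiplied by a falling
    sigmoid at `end`, so frames inside the event get weights near 1 and
    frames far outside get weights near 0. Times are in seconds.
    """
    weights = []
    for i in range(num_frames):
        t = i / fps
        rise = _sigmoid(sharpness * (t - start))
        fall = _sigmoid(sharpness * (end - t))
        weights.append(rise * fall)
    return weights


def _cosine(a, b):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def segment_by_similarity(features, threshold=0.8):
    """Split a frame-feature sequence into variable-length segments.

    Assumed rule: start a new segment wherever the cosine similarity
    between consecutive frames falls below `threshold`. Returns a list of
    (start_index, end_index_exclusive) pairs.
    """
    if not features:
        return []
    boundaries = [0]
    for i in range(1, len(features)):
        if _cosine(features[i - 1], features[i]) < threshold:
            boundaries.append(i)
    boundaries.append(len(features))
    return list(zip(boundaries[:-1], boundaries[1:]))


if __name__ == "__main__":
    # Event annotated from second 3 to second 6 in a 10-frame clip at 1 fps.
    print([round(w, 3) for w in saliency_weights(10, start=3, end=6)])
    # Toy 2-D frame features with one abrupt change between frames 1 and 2.
    print(segment_by_similarity([[1, 0], [1, 0.1], [0, 1], [0.1, 1]]))
```

In this sketch the weights would multiply the frame features before they reach the captioning model, and each similarity-based segment, rather than a fixed-size chunk, would serve as the query unit for caption retrieval. Both uses are interpretations of the abstract, not details confirmed by it.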
