Detailed Information


Contrasting Multi-Modal Similarity Framework for Video Scene Segmentation (Open Access)

Authors
Park, Jinwoo; Kim, Jungeun; Seok, Jaegwang; Lee, Sukhyun; Kim, Junyeong
Issue Date
Feb-2024
Publisher
Institute of Electrical and Electronics Engineers Inc.
Keywords
contrastive learning; multi-modal reasoning; visual scene segmentation
Citation
IEEE Access, v.12, pp. 32408-32419
Pages
12
Journal Title
IEEE Access
Volume
12
Start Page
32408
End Page
32419
URI
https://scholarworks.bwise.kr/cau/handle/2019.sw.cau/73035
DOI
10.1109/ACCESS.2024.3370676
ISSN
2169-3536
Abstract
This paper proposes a video scene segmentation framework referred to as Contrasting Multi-Modal Similarity (CMS). A video is composed of multiple scenes, which are short stories or semantic units of the video, and each scene consists of multiple shots. The task of video scene segmentation aims to semantically segment long videos, such as movies, into a sequence of scenes by identifying the boundaries of each scene transition. Current video scene segmentation frameworks have primarily relied on visual cues alone, following two major approaches: 1) comparing the visual cues of adjacent shots to distinguish between scenes, and 2) clustering shots based on visual cues. However, videos contain many scenes that are difficult to distinguish using visual information alone, as they often appear similar or ambiguous. Motivated by these issues, we propose CMS, a framework that leverages not only visual cues (i.e., shots) but also textual cues (i.e., captions) to semantically distinguish scenes. CMS proceeds as follows: 1) generate a caption for each shot using a zero-shot captioning model (Caption Generation); 2) construct similarity score matrices for each modality to measure semantic similarities (Similarity Score Calculation); 3) based on these matrices, select similar and dissimilar shots for contrastive training (Similarity Score-based Sampling). Our experiments show that CMS exceeds previous state-of-the-art methods with a relatively simple approach and without complex model architectures. © 2013 IEEE.
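As a rough illustration only (not the authors' implementation), steps 2 and 3 of the abstract can be sketched with placeholder shot and caption embeddings. The fusion rule (averaging the two modality matrices), the value of `k`, and the most/least-similar selection criterion are assumptions for this sketch:

```python
import numpy as np

def cosine_similarity_matrix(embs):
    # L2-normalize each row; the dot product then gives pairwise cosine similarity.
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return normed @ normed.T

def sample_pairs(sim, anchor, k=2):
    # For one anchor shot, take the k most similar shots as positives
    # and the k least similar as negatives (excluding the anchor itself).
    order = np.argsort(sim[anchor])
    order = order[order != anchor]
    return order[-k:], order[:k]  # (positives, negatives)

rng = np.random.default_rng(0)
n_shots, dim = 8, 16
visual = rng.normal(size=(n_shots, dim))   # stand-in for shot embeddings
textual = rng.normal(size=(n_shots, dim))  # stand-in for caption embeddings

# Step 2: per-modality similarity score matrices.
sim_visual = cosine_similarity_matrix(visual)
sim_text = cosine_similarity_matrix(textual)

# Fuse modalities (simple average; the paper's exact fusion may differ).
sim = 0.5 * (sim_visual + sim_text)

# Step 3: similarity-score-based sampling for contrastive training.
positives, negatives = sample_pairs(sim, anchor=0, k=2)
print("positives:", positives, "negatives:", negatives)
```

The sampled positive/negative shot pairs would then feed a standard contrastive objective, pulling same-scene shots together and pushing dissimilar ones apart.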
Appears in
Collections
College of Software > Department of Artificial Intelligence > 1. Journal Articles


Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Kim, Junyeong
College of Software (Department of Artificial Intelligence)
