Detailed Information


Contrasting Multi-Modal Similarity Framework for Video Scene Segmentation (Open Access)

Authors
Park, Jinwoo; Kim, Jungeun; Seok, Jaegwang; Lee, Sukhyun; Kim, Junyeong
Issue Date
Feb-2024
Publisher
Institute of Electrical and Electronics Engineers Inc.
Keywords
contrastive learning; multi-modal reasoning; visual scene segmentation
Citation
IEEE Access, v.12, pp. 32408-32419
Pages
12
Journal Title
IEEE Access
Volume
12
Start Page
32408
End Page
32419
URI
https://scholarworks.bwise.kr/cau/handle/2019.sw.cau/73035
DOI
10.1109/ACCESS.2024.3370676
ISSN
2169-3536
Abstract
This paper proposes a video scene segmentation framework referred to as Contrasting Multi-Modal Similarity (CMS). A video is composed of multiple scenes, which are short stories or semantic units of the video, and each scene consists of multiple shots. The task of video scene segmentation aims to semantically segment long videos, such as movies, into a sequence of scenes by identifying the boundaries of each scene transition. Current video scene segmentation frameworks have primarily relied on visual cues alone, following two major approaches: 1) comparing the visual cues of adjacent shots to distinguish between scenes, and 2) clustering shots based on visual cues. However, videos contain many scenes that are difficult to distinguish using visual information alone, as they often appear similar or ambiguous. Motivated by these issues, we propose CMS, a framework that leverages not only visual cues (i.e., shots) but also textual cues (i.e., captions) to semantically distinguish scenes. CMS proceeds as follows: 1) generate a caption for each shot using a zero-shot captioning model (Caption Generation); 2) construct similarity score matrices for each modality to measure semantic similarities (Similarity Score Calculation); 3) based on these matrices, select similar and dissimilar shots for contrastive training (Similarity Score-based Sampling). Our experiments show that CMS exceeds previous state-of-the-art methods with a relatively simple approach and without complex model architectures. © 2013 IEEE.
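As a rough illustration only (not the authors' implementation), steps 2 and 3 of the abstract can be sketched with placeholder shot and caption embeddings. The fusion rule (averaging the two modality matrices), the value of `k`, and the most/least-similar selection criterion are assumptions for this sketch:

```python
import numpy as np

def cosine_similarity_matrix(embs):
    # L2-normalize each row; the dot product then gives pairwise cosine similarity.
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return normed @ normed.T

def sample_pairs(sim, anchor, k=2):
    # For one anchor shot, take the k most similar shots as positives
    # and the k least similar as negatives (excluding the anchor itself).
    order = np.argsort(sim[anchor])
    order = order[order != anchor]
    return order[-k:], order[:k]  # (positives, negatives)

rng = np.random.default_rng(0)
n_shots, dim = 8, 16
visual = rng.normal(size=(n_shots, dim))   # stand-in for shot embeddings
textual = rng.normal(size=(n_shots, dim))  # stand-in for caption embeddings

# Step 2: per-modality similarity score matrices.
sim_visual = cosine_similarity_matrix(visual)
sim_text = cosine_similarity_matrix(textual)

# Fuse modalities (simple average; the paper's exact fusion may differ).
sim = 0.5 * (sim_visual + sim_text)

# Step 3: similarity-score-based sampling for contrastive training.
positives, negatives = sample_pairs(sim, anchor=0, k=2)
print("positives:", positives, "negatives:", negatives)
```

The sampled positive/negative shot pairs would then feed a standard contrastive objective, pulling same-scene shots together and pushing dissimilar ones apart.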
Appears in
Collections
College of Software > Department of Artificial Intelligence > 1. Journal Articles


Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Kim, Junyeong
College of Software (Department of Artificial Intelligence)
