Compositional Video Understanding with Spatiotemporal Structure-based Transformers

Yun, Hoyeoung; Ahn, Jinwoo; Kim, Minseo; Kim, Eun-Sol

doi:10.1109/CVPR52733.2024.01774

Full metadata record

DC Field	Value	Language
dc.contributor.author	Yun, Hoyeoung	-
dc.contributor.author	Ahn, Jinwoo	-
dc.contributor.author	Kim, Minseo	-
dc.contributor.author	Kim, Eun-Sol	-
dc.date.accessioned	2026-05-12T05:30:32Z	-
dc.date.available	2026-05-12T05:30:32Z	-
dc.date.issued	2024-09	-
dc.identifier.issn	1063-6919	-
dc.identifier.issn	2575-7075	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/212714	-
dc.description.abstract	In this paper, we propose a novel method for understanding complex semantic structures in long video inputs. Conventional video understanding methods have focused on short clips and are trained to produce visual representations of those clips using convolutional neural networks or transformer architectures. However, most real-world videos are long, ranging from minutes to hours, so dividing them into short clips and learning per-clip representations inherently limits understanding of their overall semantic structure. We propose a new algorithm that learns the multi-granular semantic structures of videos by defining spatiotemporal high-order relationships among object-based representations as semantic units. The proposed method includes a new transformer architecture capable of learning spatiotemporal graphs and a compositional learning method that learns disentangled features for each semantic unit. Using the proposed method, we address a challenging video task: compositional generalization for understanding unseen videos. In experiments, we demonstrate new state-of-the-art performance on two challenging video datasets.	-
dc.format.extent	10	-
dc.language	English	-
dc.language.iso	ENG	-
dc.publisher	IEEE	-
dc.title	Compositional Video Understanding with Spatiotemporal Structure-based Transformers	-
dc.type	Article	-
dc.publisher.location	United States	-
dc.identifier.doi	10.1109/CVPR52733.2024.01774	-
dc.identifier.scopusid	2-s2.0-85211444795	-
dc.identifier.wosid	001342515502009	-
dc.identifier.bibliographicCitation	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp 18751 - 18760	-
dc.citation.title	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition	-
dc.citation.startPage	18751	-
dc.citation.endPage	18760	-
dc.type.docType	Proceedings Paper	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Computer Science	-
dc.relation.journalWebOfScienceCategory	Computer Science, Artificial Intelligence	-
dc.relation.journalWebOfScienceCategory	Computer Science, Interdisciplinary Applications	-
dc.relation.journalWebOfScienceCategory	Computer Science, Theory & Methods	-
dc.subject.keywordPlus	Contrastive Learning	-
dc.subject.keywordPlus	Video analysis	-
dc.identifier.url	https://ieeexplore.ieee.org/document/10657973	-

Appears in Collections: Seoul College of Engineering > Seoul School of Computer Software > 1. Journal Articles

Related Researcher

Kim, Eun Sol: COLLEGE OF ENGINEERING (SCHOOL OF COMPUTER SCIENCE)
