
Compositional Video Understanding with Spatiotemporal Structure-based Transformers

Full metadata record
DC Field | Value | Language
dc.contributor.author | Yun, Hoyeoung | -
dc.contributor.author | Ahn, Jinwoo | -
dc.contributor.author | Kim, Minseo | -
dc.contributor.author | Kim, Eun-Sol | -
dc.date.accessioned | 2026-05-12T05:30:32Z | -
dc.date.available | 2026-05-12T05:30:32Z | -
dc.date.issued | 2024-09 | -
dc.identifier.issn | 1063-6919 | -
dc.identifier.issn | 2575-7075 | -
dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/212714 | -
dc.description.abstract | In this paper, we propose a novel method for understanding complex semantic structures in long video inputs. Conventional video understanding methods have focused on short clips and are trained to obtain visual representations of those clips using convolutional neural networks or transformer architectures. However, most real-world videos are long, ranging from minutes to hours, so dividing them into small clips and learning a representation of each clip inherently limits understanding of their overall semantic structure. We propose a new algorithm that learns the multi-granular semantic structures of videos by defining spatiotemporal high-order relationships among object-based representations as semantic units. The proposed method includes a new transformer architecture capable of learning spatiotemporal graphs and a compositional learning method that learns disentangled features for each semantic unit. Using the proposed method, we address a challenging video task: compositional generalization in understanding unseen videos. In experiments, we demonstrate new state-of-the-art performance on two challenging video datasets. | -
dc.format.extent | 10 | -
dc.language | English | -
dc.language.iso | ENG | -
dc.publisher | IEEE | -
dc.title | Compositional Video Understanding with Spatiotemporal Structure-based Transformers | -
dc.type | Article | -
dc.publisher.location | United States | -
dc.identifier.doi | 10.1109/CVPR52733.2024.01774 | -
dc.identifier.scopusid | 2-s2.0-85211444795 | -
dc.identifier.wosid | 001342515502009 | -
dc.identifier.bibliographicCitation | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 18751-18760 | -
dc.citation.title | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | -
dc.citation.startPage | 18751 | -
dc.citation.endPage | 18760 | -
dc.type.docType | Proceedings Paper | -
dc.description.isOpenAccess | N | -
dc.description.journalRegisteredClass | scopus | -
dc.relation.journalResearchArea | Computer Science | -
dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | -
dc.relation.journalWebOfScienceCategory | Computer Science, Interdisciplinary Applications | -
dc.relation.journalWebOfScienceCategory | Computer Science, Theory & Methods | -
dc.subject.keywordPlus | Contrastive Learning | -
dc.subject.keywordPlus | Video analysis | -
dc.identifier.url | https://ieeexplore.ieee.org/document/10657973 | -
Appears in Collections
Seoul College of Engineering > Seoul School of Computer Science > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Kim, Eun Sol
COLLEGE OF ENGINEERING (SCHOOL OF COMPUTER SCIENCE)