Compositional Video Understanding with Spatiotemporal Structure-based Transformers
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Yun, Hoyeoung | - |
| dc.contributor.author | Ahn, Jinwoo | - |
| dc.contributor.author | Kim, Minseo | - |
| dc.contributor.author | Kim, Eun-Sol | - |
| dc.date.accessioned | 2026-05-12T05:30:32Z | - |
| dc.date.available | 2026-05-12T05:30:32Z | - |
| dc.date.issued | 2024-09 | - |
| dc.identifier.issn | 1063-6919 | - |
| dc.identifier.issn | 2575-7075 | - |
| dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/212714 | - |
| dc.description.abstract | In this paper, we propose a novel method for understanding complex semantic structures in long videos. Conventional video understanding methods have focused on short clips, learning visual representations of those clips with convolutional neural networks or transformer architectures. However, most real-world videos are long, ranging from minutes to hours, so dividing them into small clips and learning representations of the clips inherently limits understanding of their overall semantic structure. We propose an algorithm that learns the multi-granular semantic structures of videos by defining spatiotemporal high-order relationships among object-based representations as semantic units. The proposed method includes a new transformer architecture capable of learning spatiotemporal graphs, and a compositional learning method that learns disentangled features for each semantic unit. Using the proposed method, we address a challenging video task: compositional generalization to unseen videos. In experiments, we demonstrate new state-of-the-art performance on two challenging video datasets. | - |
| dc.format.extent | 10 | - |
| dc.language | English | - |
| dc.language.iso | ENG | - |
| dc.publisher | IEEE | - |
| dc.title | Compositional Video Understanding with Spatiotemporal Structure-based Transformers | - |
| dc.type | Article | - |
| dc.publisher.location | United States | - |
| dc.identifier.doi | 10.1109/CVPR52733.2024.01774 | - |
| dc.identifier.scopusid | 2-s2.0-85211444795 | - |
| dc.identifier.wosid | 001342515502009 | - |
| dc.identifier.bibliographicCitation | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp 18751 - 18760 | - |
| dc.citation.title | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | - |
| dc.citation.startPage | 18751 | - |
| dc.citation.endPage | 18760 | - |
| dc.type.docType | Proceedings Paper | - |
| dc.description.isOpenAccess | N | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.relation.journalResearchArea | Computer Science | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Interdisciplinary Applications | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Theory & Methods | - |
| dc.subject.keywordPlus | Contrastive Learning | - |
| dc.subject.keywordPlus | Video analysis | - |
| dc.identifier.url | https://ieeexplore.ieee.org/document/10657973 | - |
