Convolutional Method for Modeling Video Temporal Context Effectively in Transformer
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Park, Hae Sung | - |
| dc.contributor.author | Choi, Yong Suk | - |
| dc.date.accessioned | 2024-11-28T14:00:58Z | - |
| dc.date.available | 2024-11-28T14:00:58Z | - |
| dc.date.issued | 2023-03 | - |
| dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/196713 | - |
| dc.description.abstract | Video understanding remains a challenging task because video understanding models have many parameters to train and must capture detailed spatiotemporal context in video effectively. Recent methods have typically employed either 3D convolution modules or self-attention modules. However, we find that when the self-attention mechanism captures temporal semantics, it often struggles to identify the proper temporal context for video understanding. In this paper, we propose a new method for enhancing temporal modeling by incorporating 3D convolution modules into an attention-based model, the transformer. In particular, we replace the temporal attention of the TimeSformer with a 3D convolution module to improve temporal context learning. In contrast to the TimeSformer, our proposed method can focus on modeling temporal details at the low-level encoders, while gradually shifting its focus to more global temporal context at the high-level encoders. Our method surpasses the TimeSformer by a margin of 2.2% on Something-Something v2, which requires complex temporal modeling to achieve high performance. | - |
| dc.format.extent | 4 | - |
| dc.language | English | - |
| dc.language.iso | ENG | - |
| dc.publisher | ASSOC COMPUTING MACHINERY | - |
| dc.title | Convolutional Method for Modeling Video Temporal Context Effectively in Transformer | - |
| dc.type | Article | - |
| dc.publisher.location | United States | - |
| dc.identifier.doi | 10.1145/3555776.3578481 | - |
| dc.identifier.scopusid | 2-s2.0-85162913929 | - |
| dc.identifier.wosid | 001124308100172 | - |
| dc.identifier.bibliographicCitation | 38TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2023, pp 1205 - 1208 | - |
| dc.citation.title | 38TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2023 | - |
| dc.citation.startPage | 1205 | - |
| dc.citation.endPage | 1208 | - |
| dc.type.docType | Proceedings Paper | - |
| dc.description.isOpenAccess | N | - |
| dc.description.journalRegisteredClass | scie | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.relation.journalResearchArea | Computer Science | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Interdisciplinary Applications | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Theory & Methods | - |
| dc.subject.keywordPlus | 3D modeling | - |
| dc.subject.keywordPlus | Classification (of information) | - |
| dc.subject.keywordPlus | Convolution | - |
| dc.subject.keywordPlus | Semantics | - |
| dc.subject.keywordPlus | Signal encoding | - |
| dc.subject.keywordAuthor | Video classification | - |
| dc.subject.keywordAuthor | Transformer | - |
| dc.subject.keywordAuthor | 3D convolution | - |
| dc.subject.keywordAuthor | Self-attention | - |
| dc.subject.keywordAuthor | Temporal feature | - |
| dc.subject.keywordAuthor | Computer Vision | - |
| dc.identifier.url | https://dl.acm.org/doi/10.1145/3555776.3578481 | - |
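
The abstract describes replacing the temporal attention of the TimeSformer with a 3D convolution module. As a rough illustration of that idea only, the sketch below mixes information across frames with a depthwise convolution along the time axis, followed by a residual connection. It is not the authors' implementation: the function name `temporal_conv_mixing`, the `[frame][patch][channel]` token layout, the zero padding, and the restriction to a kernel that spans time only (equivalent to a 3D kernel of shape k×1×1) are all assumptions made for this example.

```python
from typing import List

# Hypothetical token layout: tokens[frame][patch][channel]
Tokens = List[List[List[float]]]


def temporal_conv_mixing(x: Tokens, kernel: List[List[float]]) -> Tokens:
    """Depthwise convolution along the frame axis, plus a residual connection.

    kernel[j][c] is the weight applied to channel c of the frame at
    temporal offset (j - k // 2), where k = len(kernel). Frames outside
    the clip are treated as zeros (zero padding).
    """
    num_frames, num_patches, dim = len(x), len(x[0]), len(x[0][0])
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    half = k // 2

    out: Tokens = []
    for t in range(num_frames):
        frame = []
        for p in range(num_patches):
            token = []
            for c in range(dim):
                acc = 0.0
                # Each patch position only sees the same position in
                # neighbouring frames, so mixing is purely temporal.
                for j in range(k):
                    s = t + j - half
                    if 0 <= s < num_frames:
                        acc += kernel[j][c] * x[s][p][c]
                # Residual connection around the temporal mixing step.
                token.append(x[t][p][c] + acc)
            frame.append(token)
        out.append(frame)
    return out


# Usage: 4 frames, 1 patch, 1 channel, with a size-3 temporal kernel.
clip = [[[float(t)]] for t in range(4)]
mixed = temporal_conv_mixing(clip, kernel=[[1.0], [0.0], [-1.0]])
```

Unlike temporal self-attention, which lets every frame attend to every other frame at each layer, a kernel of size k only reaches k // 2 frames in each direction, so the temporal receptive field grows as such layers are stacked. This is one plausible reading of the abstract's remark about local temporal detail at low-level encoders and more global context at high-level encoders.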
