Patch-level Representation Learning for Self-supervised Vision Transformers

Yun, Sukmin; Lee, Hankook; Kim, Jaehyung; Shin, Jinwoo

doi:10.1109/CVPR52688.2022.00817

Issue Date: Jun-2022

Publisher: IEEE Computer Society

Keywords: Self-, semi-, meta-, and unsupervised learning

Citation: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8354-8363

Pages: 10

Indexed: SCOPUS

URI: https://scholarworks.bwise.kr/erica/handle/2021.sw.erica/119223

Abstract: Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to improve their performance further by utilizing the architectural advantages of the underlying neural network, as the current state-of-the-art visual pretext tasks for SSL do not enjoy the benefit, i.e., they are architecture-agnostic. In particular, we focus on Vision Transformers (ViTs), which have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks. The unique characteristic of ViT is that it takes a sequence of disjoint patches from an image and processes patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations. To be specific, we enforce invariance against each patch and its neighbors, i.e., each patch treats similar neighboring patches as positive samples. Consequently, training ViTs with SelfPatch learns more semantically meaningful relations among patches (without using human-annotated labels), which can be beneficial, in particular, to downstream tasks of a dense prediction type. Despite its simplicity, we demonstrate that it can significantly improve the performance of existing SSL methods for various visual tasks, including object detection and semantic segmentation. Specifically, SelfPatch significantly improves the recent self-supervised ViT, DINO, by achieving +1.3 AP on COCO object detection, +1.2 AP on COCO instance segmentation, and +2.9 mIoU on ADE20K semantic segmentation.
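The positive-selection idea described in the abstract (each patch treats similar neighboring patches as positive samples) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_positive_neighbors`, the 3x3 adjacency window, the use of cosine similarity, and the parameter `k` are all assumptions made for illustration, and the training loss applied to the selected positives is omitted.

```python
import numpy as np


def select_positive_neighbors(patches, k=2):
    """Pick, for every patch, its k most similar adjacent patches.

    patches: array of shape (H, W, D) holding one D-dim embedding per patch.
    Returns a dict mapping (row, col) -> list of (row, col) positives.
    Assumptions (not from the source): 3x3 adjacency, cosine similarity, top-k.
    """
    h, w, _ = patches.shape
    # L2-normalise so that a dot product equals cosine similarity.
    norms = np.linalg.norm(patches, axis=-1, keepdims=True)
    unit = patches / np.clip(norms, 1e-12, None)

    positives = {}
    for i in range(h):
        for j in range(w):
            candidates = []
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    inside = 0 <= ni < h and 0 <= nj < w
                    if (di, dj) != (0, 0) and inside:
                        sim = float(unit[i, j] @ unit[ni, nj])
                        candidates.append((sim, (ni, nj)))
            # Highest similarity first; keep the top k as positives.
            candidates.sort(key=lambda c: c[0], reverse=True)
            positives[(i, j)] = [pos for _, pos in candidates[:k]]
    return positives


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = rng.normal(size=(4, 4, 8))  # toy 4x4 grid of 8-dim patch embeddings
    print(select_positive_neighbors(grid, k=2)[(1, 1)])
```

In a real ViT the `(H, W, D)` array would come from reshaping the patch-token outputs of the encoder; a self-supervised objective would then pull each patch's representation toward those of its selected positives.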
