S-ViT: Sparse Vision Transformer for Accurate Face Recognition
- Authors
- Kim, Geunsu; Park, Gyudo; Kang, Soohyeok; Woo, Simon S.
- Issue Date
- Mar-2023
- Publisher
- Association for Computing Machinery
- Keywords
- deep learning model compression; face recognition; neural networks; pruning; vision transformer
- Citation
- Proceedings of the ACM Symposium on Applied Computing, pp. 1130-1138
- Pages
- 9
- Indexed
- SCOPUS
- Journal Title
- Proceedings of the ACM Symposium on Applied Computing
- Start Page
- 1130
- End Page
- 1138
- URI
- https://scholarworks.bwise.kr/skku/handle/2021.sw.skku/106826
- DOI
- 10.1145/3555776.3577640
- Abstract
- Most existing face recognition applications based on deep learning use CNN architectures as the feature extractor. However, recent studies have shown that vision transformer-based models often outperform CNN-based models on computer vision tasks. Therefore, in this work, we propose a Sparse Vision Transformer (S-ViT), built on the Vision Transformer (ViT) architecture, to improve face recognition. After training, S-ViT tends to have a sparser weight distribution than ViT, which motivates its name. Unlike the conventional ViT, our proposed S-ViT adopts the image Relative Positional Encoding (iRPE) method for positional encoding. S-ViT is also modified so that all token embeddings, not just the class token, participate in the decoding process. Through extensive experiments, we show that S-ViT outperforms both the other baseline models and the baseline ViT-based models in the closed-set setting. For example, when using ArcFace as the loss function in the identification protocol, S-ViT achieved up to 3.27% higher accuracy than ResNet50. We also show that the ArcFace loss yields greater performance gains for S-ViT than for the baseline models. In addition, S-ViT offers a better cost-performance trade-off because it tends to be more robust to pruning than the underlying ViT model, so it can be deployed more flexibly on target devices with limited resources. © 2023 ACM.
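- Two of the design choices described above lend themselves to a short illustration. The PyTorch sketch below shows (1) forming the face embedding from all encoder token embeddings rather than reading out only the class token, and (2) the ArcFace loss, which adds an angular margin to the target-class logit before softmax. The use of mean pooling, the names ArcFaceHead and pool_all_tokens, and the hyperparameters s=64.0 and m=0.5 are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pool_all_tokens(token_embeddings):
    # token_embeddings: (batch, num_tokens, dim) from the ViT encoder.
    # Average every token (class token and patch tokens alike) into one
    # face embedding; the paper does not specify the pooling, so mean
    # pooling is an assumption here.
    return token_embeddings.mean(dim=1)

class ArcFaceHead(nn.Module):
    """ArcFace: additive angular margin on the target-class logit."""

    def __init__(self, embed_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s = s  # feature scale
        self.m = m  # angular margin (radians)

    def forward(self, embeddings, labels):
        # cos(theta) between L2-normalized embeddings and class centers
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the margin m only to the target-class angle
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, labels)

# Toy usage: 197 tokens (class token + 196 patches), 512-dim embeddings.
tokens = torch.randn(8, 197, 512)
head = ArcFaceHead(embed_dim=512, num_classes=1000)
loss = head(pool_all_tokens(tokens), torch.randint(0, 1000, (8,)))
```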
- Appears in Collections
- Computing and Informatics > Computer Science and Engineering > 1. Journal Articles