Finding Optimal Numerical Format for Sub-8-Bit Post-Training Quantization of Vision Transformers
- Authors
- 이장환; Hwang, Youngdeok; Choi, Jungwook
- Issue Date
- Jun-2023
- Publisher
- Institute of Electrical and Electronics Engineers Inc.
- Keywords
- fixed-point; floating-point; Post-training quantization; vision Transformer
- Citation
- ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp 1 - 5
- Pages
- 5
- Indexed
- SCOPUS
- Journal Title
- ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
- Start Page
- 1
- End Page
- 5
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/196142
- DOI
- 10.1109/ICASSP49357.2023.10096798
- ISSN
- 0736-7791; 1520-6149
- Abstract
- Vision Transformers (ViTs) have gained significant attention for their exceptional model accuracies on computer vision applications, but their demanding memory requirements and computational complexity have hindered active deployment. Post-training quantization (PTQ) is a practical method to tackle this challenge by directly reducing ViT's bit-precision. However, diverse data characteristics across different operations of ViT cannot be well captured solely by a single numerical format (fixed or floating-point). This work proposes an analytical framework that optimizes the numerical format of each matrix multiplication of ViTs for mixed-format sub-8-bit quantization. The extensive evaluation demonstrates that the proposed method can reduce the PTQ error and achieve state-of-the-art accuracy for popular ViT models.
- Appears in Collections
- College of Engineering (Seoul) > School of Electronic Engineering (Seoul) > 1. Journal Articles