FloatMax: An Efficient Accelerator for Transformer-Based Models Exploiting Tensor-Wise Adaptive Floating-Point Quantization
- Authors
- Chung, Seoho; Kim, Kwangrae; Rho, Soomin; Kim, Chanhoon; Chung, Ki-Seok
- Issue Date
- Jan-2025
- Keywords
- Hardware Accelerator; Quantization; Transformer models
- Citation
- IEEE International Conference on Computer Design - VLSI in Computers and Processors, pp. 599-607
- Pages
- 9
- Indexed
- SCOPUS
- Journal Title
- IEEE International Conference on Computer Design - VLSI in Computers and Processors
- Start Page
- 599
- End Page
- 607
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206564
- DOI
- 10.1109/ICCD63220.2024.00096
- ISSN
- 1063-6404; 2576-6996
- Abstract
- The rapid growth of Transformer model sizes demands significant computational resources and memory. To mitigate this cost, quantization, which reduces the bit width used to represent numbers, is being actively studied. However, prior quantization methods that use integers of 8 bits or fewer suffer from accuracy loss due to their lower resolution, because tensors in Transformer models often have non-uniform distributions containing outliers. In this paper, we propose a novel quantization method that utilizes tensor-wise adaptive floating-point quantization. Two key strategies address the issue of accuracy degradation. First, we leverage the characteristics of the floating-point data type, which provides higher precision for normal values and lower precision for outlier values. Second, since the degree of non-uniformity and the presence of outliers in each tensor vary from layer to layer, we adaptively assign different bit widths to the exponent and the mantissa of the floating-point representation for each tensor, quantizing each tensor with a different level of precision. This approach allows high accuracy across the tensor distribution while efficiently managing outliers. We design the FloatMax processing element and decoders to carry out these floating-point computations efficiently. In addition, FloatMax is integrated into a systolic array to accelerate the linear layers and the attention mechanism. Evaluation results show that FloatMax achieves 0.75×, 0.85×, and 0.79× area reduction and 1.70×, 1.84×, and 1.89× average performance improvement over Olive, ANT, and AdaFloat, respectively, without accuracy loss.
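
The abstract's idea of assigning a different exponent/mantissa split to each tensor can be illustrated with a minimal sketch. This is not the FloatMax algorithm or its hardware decoder: the function names, the per-tensor exponent bias, the absence of inf/NaN encodings, and the mean-squared-error selection criterion are all assumptions made here for illustration only.

```python
import math

def quantize_fp(x, exp_bits, man_bits, bias):
    """Round x to the nearest value of a toy float format with 1 sign bit,
    exp_bits exponent bits (>= 1) and man_bits mantissa bits.
    Assumption: no inf/NaN codes; values beyond the range are clamped."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    e_min = 1 - bias                    # exponent of the smallest normal value
    e_max = (2 ** exp_bits - 1) - bias  # exponent of the largest binade
    e = math.frexp(a)[1] - 1            # floor(log2(a)), computed exactly
    e = min(max(e, e_min), e_max)       # subnormals share the e_min spacing
    step = 2.0 ** (e - man_bits)        # spacing of representable values
    q = round(a / step) * step          # Python rounds ties to even
    max_val = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** e_max
    return sign * min(q, max_val)

def tensor_bias(values, exp_bits):
    """Hypothetical per-tensor bias: align the top binade with max |value|."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0
    e_top = math.frexp(max_abs)[1] - 1
    return (2 ** exp_bits - 1) - e_top

def choose_format(values, total_bits=8):
    """Try every exponent/mantissa split of total_bits (1 bit is the sign)
    and return (mse, exp_bits, man_bits) for the lowest-error split.
    Assumption: MSE is used as the selection criterion."""
    best = None
    for exp_bits in range(1, total_bits - 1):
        man_bits = total_bits - 1 - exp_bits
        bias = tensor_bias(values, exp_bits)
        err = sum((v - quantize_fp(v, exp_bits, man_bits, bias)) ** 2
                  for v in values) / len(values)
        if best is None or err < best[0]:
            best = (err, exp_bits, man_bits)
    return best

# A tensor confined to one binade versus one spanning many binades.
narrow = [1.0 + i / 64 for i in range(64)]
wide = [2.0 ** k for k in range(-12, 4)]
print("narrow:", choose_format(narrow))
print("wide:  ", choose_format(wide))
```

In this toy setting, the tensor with a wide dynamic range is assigned more exponent bits than the tightly clustered one, which is the intuition behind giving each tensor its own split rather than fixing one format for the whole model.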
