FloatMax: An Efficient Accelerator for Transformer-Based Models Exploiting Tensor-Wise Adaptive Floating-Point Quantization

Authors
Chung, Seoho; Kim, Kwangrae; Rho, Soomin; Kim, Chanhoon; Chung, Ki-Seok
Issue Date
Jan-2025
Keywords
Hardware Accelerator; Quantization; Transformer models
Citation
IEEE International Conference on Computer Design - VLSI in Computers and Processors, pp. 599-607
Pages
9
Indexed
SCOPUS
Journal Title
IEEE International Conference on Computer Design - VLSI in Computers and Processors
Start Page
599
End Page
607
URI
https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206564
DOI
10.1109/ICCD63220.2024.00096
ISSN
1063-6404
2576-6996
Abstract
The rapid growth of Transformer model sizes demands significant computational resources and memory. To mitigate this cost, quantization, which reduces the bit width used to represent numbers, is being actively studied. However, prior quantization methods that use integers of 8 bits or fewer suffer from accuracy loss caused by lower resolution, because the tensors of Transformer models often have non-uniform distributions containing outliers. In this paper, we propose a novel quantization method that utilizes tensor-wise adaptive floating-point quantization. Two key strategies address the accuracy degradation. First, we leverage a characteristic of the floating-point data type: it provides higher precision for normal values and lower precision for outlier values. Second, since the degree of non-uniformity and the outliers in each tensor vary from layer to layer, we adaptively assign different bit widths to the exponent and the mantissa of the floating-point representation for each tensor, quantizing the tensors with different levels of precision. This approach allows high accuracy across the tensor distribution while efficiently managing outliers. We design the FloatMax processing element and decoders to carry out these floating-point computations efficiently. In addition, FloatMax is integrated into a systolic array to accelerate the linear layers and the attention mechanism. Evaluation results show that FloatMax achieves 0.75×, 0.85×, and 0.79× area reduction and 1.70×, 1.84×, and 1.89× average performance improvement over Olive, ANT, and AdaFloat, respectively, without accuracy loss.
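The idea of choosing an exponent/mantissa split per tensor can be illustrated with a minimal sketch. This is not the FloatMax algorithm, which this record does not describe in detail. The sketch makes several assumptions: a simulated format with one sign bit and no inf/NaN codes, a per-tensor scale that maps the largest magnitude to the top of the format, and selection by mean squared error. The names `fp_max`, `quantize_fp`, `quant_error`, and `choose_format` are hypothetical.

```python
import numpy as np

def fp_max(exp_bits, man_bits):
    """Largest finite value of the simulated format (assumption: no inf/NaN codes)."""
    bias = 2 ** (exp_bits - 1) - 1
    e_max = (2 ** exp_bits - 1) - bias
    return (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max

def quantize_fp(x, exp_bits, man_bits):
    """Round x to the nearest value of a sign + exp_bits + man_bits float format."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    e_min = 1 - bias                        # smallest normal exponent
    e_max = (2 ** exp_bits - 1) - bias
    limit = fp_max(exp_bits, man_bits)
    mag = np.minimum(np.abs(x), limit)      # saturate values beyond the range
    _, exp = np.frexp(mag)                  # mag = m * 2**exp with m in [0.5, 1)
    e = np.clip(exp - 1, e_min, e_max)      # clamp to e_min: small values become subnormal
    step = np.ldexp(1.0, e - man_bits)      # spacing of representable values at exponent e
    q = np.minimum(np.round(mag / step) * step, limit)
    return np.sign(x) * q

def quant_error(tensor, exp_bits, man_bits):
    """Mean squared error after scaling, quantizing, and rescaling a non-empty tensor."""
    x = np.asarray(tensor, dtype=np.float64)
    peak = np.max(np.abs(x))
    # Assumed per-tensor scale: map the largest magnitude to the format maximum.
    scale = peak / fp_max(exp_bits, man_bits) if peak > 0 else 1.0
    x_hat = quantize_fp(x / scale, exp_bits, man_bits) * scale
    return float(np.mean((x - x_hat) ** 2))

def choose_format(tensor, total_bits=8):
    """Try every exponent/mantissa split of total_bits - 1 bits; keep the lowest-error one."""
    best = None
    for exp_bits in range(1, total_bits - 1):
        man_bits = total_bits - 1 - exp_bits
        err = quant_error(tensor, exp_bits, man_bits)
        if best is None or err < best[2]:
            best = (exp_bits, man_bits, err)
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    smooth = rng.normal(size=4096)
    spiky = smooth.copy()
    spiky[:8] *= 200.0                      # inject a few large outliers
    print("smooth tensor:", choose_format(smooth))
    print("spiky tensor: ", choose_format(spiky))
```

Calling `choose_format` once per tensor yields a per-tensor format. In this sketch the trade-off is the one the abstract names: more exponent bits widen the dynamic range to cover outliers, while more mantissa bits raise the precision for typical values.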
Appears in
Collections
College of Engineering (Seoul) > School of Electronic Engineering (Seoul) > 1. Journal Articles