FloatMax: An Efficient Accelerator for Transformer-Based Models Exploiting Tensor-Wise Adaptive Floating-Point Quantization

Chung, Seoho; Kim, Kwangrae; Rho, Soomin; Kim, Chanhoon; Chung, Ki-Seok

doi:10.1109/ICCD63220.2024.00096



Authors: Chung, Seoho; Kim, Kwangrae; Rho, Soomin; Kim, Chanhoon; Chung, Ki-Seok

Issue Date: Jan-2025

Keywords: Hardware Accelerator; Quantization; Transformer models

Citation: IEEE International Conference on Computer Design - VLSI in Computers and Processors, pp 599 - 607

Pages: 9

Indexed: SCOPUS

Journal Title: IEEE International Conference on Computer Design - VLSI in Computers and Processors

Start Page: 599

End Page: 607

URI: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206564

DOI: 10.1109/ICCD63220.2024.00096

ISSN: 1063-6404; 2576-6996

Abstract: The rapid growth in Transformer model size demands substantial computational resources and memory. To mitigate this cost, quantization, which reduces the bit width used to represent numbers, is being actively studied. However, prior quantization methods that use integers of 8 bits or fewer suffer from accuracy loss due to lower resolution, because tensors in Transformer models often have non-uniform distributions containing outliers. In this paper, we propose a novel quantization method that utilizes tensor-wise adaptive floating-point quantization. Two key strategies address the issue of accuracy degradation. First, we leverage the characteristics of the floating-point data type, which provides higher precision for normal values and lower precision for outlier values. Second, since the degree of non-uniformity and the presence of outliers in each tensor vary from layer to layer, we adaptively assign different bit widths to the exponent and the mantissa of the floating-point representation for each tensor, quantizing them with different levels of precision. This approach allows high accuracy across the tensor distribution while efficiently managing outliers. We design the FloatMax processing element and decoders to efficiently carry out these floating-point computations. In addition, FloatMax is integrated into a systolic array to accelerate the linear layer and attention mechanism. Evaluation results show that FloatMax achieves 0.75×, 0.85×, and 0.79× area reduction and 1.70×, 1.84×, and 1.89× average performance improvement over Olive, ANT, and AdaFloat, respectively, without accuracy loss.
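The idea of assigning a different exponent/mantissa split to each tensor can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not state how the bit widths are chosen, so the code below assumes an 8-bit budget with one sign bit, an IEEE-style exponent bias, no codes reserved for infinity or NaN, and a mean-squared-error selection criterion. The names `fp_quantize` and `choose_format` are hypothetical.

```python
import math


def fp_quantize(x, exp_bits, man_bits):
    """Round x to the nearest value of a toy sign/exponent/mantissa format.

    Assumptions (not from the paper): IEEE-style bias, subnormals
    supported, no inf/NaN codes, saturation on overflow.
    """
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    e_min = 1 - bias                    # exponent of the smallest normal value
    e_max = (2 ** exp_bits - 1) - bias  # exponent of the largest binade
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    e = max(math.frexp(mag)[1] - 1, e_min)  # subnormals share the e_min spacing
    step = 2.0 ** (e - man_bits)            # spacing between neighbours in this binade
    q = round(mag / step) * step
    return sign * min(q, max_val)           # saturate instead of overflowing


def choose_format(tensor, total_bits=8):
    """Return (exp_bits, man_bits, mse) for the split with the lowest MSE."""
    best = None
    for exp_bits in range(1, total_bits - 1):
        man_bits = total_bits - 1 - exp_bits  # one bit is reserved for the sign
        mse = sum((v - fp_quantize(v, exp_bits, man_bits)) ** 2
                  for v in tensor) / len(tensor)
        if best is None or mse < best[2]:
            best = (exp_bits, man_bits, mse)
    return best


# A narrow tensor versus one with a single large outlier.
narrow = [0.5 + i / 1000 for i in range(100)]
wide = [0.01 * i for i in range(1, 51)] + [300.0]
print(choose_format(narrow))
print(choose_format(wide))
```

Under these assumptions, a tensor with a narrow value range favours more mantissa bits, while a tensor containing a large outlier favours more exponent bits so that the outlier stays in range. This is the trade-off the abstract describes, although the paper's actual selection procedure may differ.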




Related Researcher: Chung, Ki Seok, College of Engineering (School of Electronic Engineering)
