A 7-nm Four-Core Mixed-Precision AI Chip With 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling

Authors
Lee, Sae Kyu; Agrawal, Ankur; Silberman, Joel; Ziegler, Matthew; Kang, Mingu; Venkataramani, Swagath; Cao, Nianzheng; Fleischer, Bruce; Guillorn, Michael; Cohen, Matthew; Mueller, Silvia M.; Oh, Jinwook; Lutz, Martin; Jung, Jinwook; Koswatta, Siyu; Zhou, Ching; Zalani, Vidhi; Kar, Monodeep; Bonanno, James; Casatuta, Robert; Chen, Chia-Yu; Choi, Jungwook; Haynie, Howard; Herbert, Alyssa; Jain, Radhika; Kim, Kyu-Hyoun; Li, Yulong; Ren, Zhibin; Rider, Scot; Schaal, Marcel; Schelm, Kerstin; Scheuermann, Michael R.; Sun, Xiao; Tran, Hung; Wang, Naigang; Wang, Wei; Zhang, Xin; Shah, Vinay; Curran, Brian; Srinivasan, Vijayalakshmi; Lu, Pong-Fei; Shukla, Sunil; Gopalakrishnan, Kailash; Chang, Leland
Issue Date
Jan-2022
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Keywords
Training; Artificial intelligence; AI accelerators; Inference algorithms; Computer architecture; Bandwidth; System-on-chip; Approximate computing; artificial intelligence (AI); deep neural networks (DNNs); hardware accelerators; machine learning (ML); reduced precision computation
Citation
IEEE JOURNAL OF SOLID-STATE CIRCUITS, v.57, no.1, pp.182 - 197
Indexed
SCIE
SCOPUS
Journal Title
IEEE JOURNAL OF SOLID-STATE CIRCUITS
Volume
57
Number
1
Start Page
182
End Page
197
URI
https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/139877
DOI
10.1109/JSSC.2021.3120113
ISSN
0018-9200
Abstract
Reduced precision computation is a key enabling factor for energy-efficient acceleration of deep learning (DL) applications. This article presents a 7-nm four-core mixed-precision artificial intelligence (AI) chip that supports four compute precisions, namely FP16, Hybrid-FP8 (HFP8), INT4, and INT2, to meet diverse application demands for training and inference. The chip leverages cutting-edge algorithmic advances to demonstrate leading-edge power efficiency for 8-bit floating-point (FP8) training and INT4 inference without model accuracy degradation. A new HFP8 format combined with separation of the floating- and fixed-point pipelines and aggressive circuit/architecture optimization enables performance improvements while maintaining high compute utilization. A high-bandwidth ring protocol enables efficient data communication, while power management using workload-aware clock throttling maximizes performance within a given power budget. The AI chip demonstrates 3.58-TFLOPS/W peak energy efficiency and 26.2-TFLOPS peak performance for HFP8 iso-accuracy training, and 16.9-TOPS/W peak energy efficiency and 104.9-TOPS peak performance for INT4 iso-accuracy inference.
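As a reading aid for the peak figures quoted in the abstract, the short Python sketch below compares the INT4 inference numbers with the HFP8 training numbers. Only the four numeric values come from the abstract; the dictionary layout and the `ratio` helper are illustrative names introduced here. The comparison is between operation counts in different number formats (TOPS versus TFLOPS), and the sketch deliberately does not divide throughput by efficiency to infer power, since the abstract does not say that the peak values share an operating point.

```python
# Peak figures quoted in the abstract (the only values taken from the source).
# The structure and names below are illustrative, not from the article.
PEAK = {
    "HFP8 training": {"throughput_tera": 26.2, "efficiency_tera_per_watt": 3.58},
    "INT4 inference": {"throughput_tera": 104.9, "efficiency_tera_per_watt": 16.9},
}


def ratio(metric, numerator="INT4 inference", denominator="HFP8 training"):
    """Return how many times larger the numerator mode's metric is."""
    return PEAK[numerator][metric] / PEAK[denominator][metric]


throughput_ratio = ratio("throughput_tera")
efficiency_ratio = ratio("efficiency_tera_per_watt")

print(f"INT4 vs HFP8 peak throughput: {throughput_ratio:.2f}x")
print(f"INT4 vs HFP8 peak efficiency: {efficiency_ratio:.2f}x")
```

The throughput ratio comes out close to 4, and the efficiency ratio is somewhat higher; what accounts for the difference is not stated in the abstract.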
Appears in Collections
Seoul College of Engineering > Seoul School of Electronic Engineering > 1. Journal Articles