A 7-nm Four-Core Mixed-Precision AI Chip With 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling
- Authors
- Lee, Sae Kyu; Agrawal, Ankur; Silberman, Joel; Ziegler, Matthew; Kang, Mingu; Venkataramani, Swagath; Cao, Nianzheng; Fleischer, Bruce; Guillorn, Michael; Cohen, Matthew; Mueller, Silvia M.; Oh, Jinwook; Lutz, Martin; Jung, Jinwook; Koswatta, Siyu; Zhou, Ching; Zalani, Vidhi; Kar, Monodeep; Bonanno, James; Casatuta, Robert; Chen, Chia-Yu; Choi, Jungwook; Haynie, Howard; Herbert, Alyssa; Jain, Radhika; Kim, Kyu-Hyoun; Li, Yulong; Ren, Zhibin; Rider, Scot; Schaal, Marcel; Schelm, Kerstin; Scheuermann, Michael R.; Sun, Xiao; Tran, Hung; Wang, Naigang; Wang, Wei; Zhang, Xin; Shah, Vinay; Curran, Brian; Srinivasan, Vijayalakshmi; Lu, Pong-Fei; Shukla, Sunil; Gopalakrishnan, Kailash; Chang, Leland
- Issue Date
- Jan-2022
- Publisher
- IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
- Keywords
- Training; Artificial intelligence; AI accelerators; Inference algorithms; Computer architecture; Bandwidth; System-on-chip; Approximate computing; artificial intelligence (AI); deep neural networks (DNNs); hardware accelerators; machine learning (ML); reduced precision computation
- Citation
- IEEE JOURNAL OF SOLID-STATE CIRCUITS, v.57, no.1, pp.182 - 197
- Indexed
- SCIE; SCOPUS
- Journal Title
- IEEE JOURNAL OF SOLID-STATE CIRCUITS
- Volume
- 57
- Number
- 1
- Start Page
- 182
- End Page
- 197
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/139877
- DOI
- 10.1109/JSSC.2021.3120113
- ISSN
- 0018-9200
- Abstract
- Reduced precision computation is a key enabling factor for energy-efficient acceleration of deep learning (DL) applications. This article presents a 7-nm four-core mixed-precision artificial intelligence (AI) chip that supports four compute precisions, namely FP16, Hybrid-FP8 (HFP8), INT4, and INT2, to meet diverse application demands for training and inference. The chip leverages cutting-edge algorithmic advances to demonstrate leading-edge power efficiency for 8-bit floating-point (FP8) training and INT4 inference without model accuracy degradation. A new HFP8 format combined with separation of the floating- and fixed-point pipelines and aggressive circuit/architecture optimization enables performance improvements while maintaining high compute utilization. A high-bandwidth ring protocol enables efficient data communication, while power management using workload-aware clock throttling maximizes performance within a given power budget. The AI chip demonstrates 3.58-TFLOPS/W peak energy efficiency and 26.2-TFLOPS peak performance for HFP8 iso-accuracy training, and 16.9-TOPS/W peak energy efficiency and 104.9-TOPS peak performance for INT4 iso-accuracy inference.
- Appears in Collections
- Seoul College of Engineering > Seoul Department of Electronic Engineering > 1. Journal Articles
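The peak figures quoted in the abstract can be put side by side with a little arithmetic. The sketch below is only an illustration: the constant and function names are hypothetical, and the numbers are copied from the abstract. It compares the INT4 inference peaks with the HFP8 training peaks. The abstract does not say whether peak performance and peak efficiency were measured at the same voltage and frequency, so the sketch deliberately does not divide one by the other to estimate power.

```python
# Peak figures as quoted in the abstract (names are illustrative, not from the paper).
HFP8_PEAK_TFLOPS = 26.2        # HFP8 training, peak performance
HFP8_PEAK_TFLOPS_PER_W = 3.58  # HFP8 training, peak energy efficiency
INT4_PEAK_TOPS = 104.9         # INT4 inference, peak performance
INT4_PEAK_TOPS_PER_W = 16.9    # INT4 inference, peak energy efficiency


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a plain float."""
    return numerator / denominator


# How much higher the INT4 peaks are than the HFP8 peaks.
throughput_ratio = ratio(INT4_PEAK_TOPS, HFP8_PEAK_TFLOPS)
efficiency_ratio = ratio(INT4_PEAK_TOPS_PER_W, HFP8_PEAK_TFLOPS_PER_W)

print(f"INT4 vs HFP8 peak throughput: {throughput_ratio:.2f}x")
print(f"INT4 vs HFP8 peak efficiency: {efficiency_ratio:.2f}x")
```

On these figures, INT4 peak throughput comes out at roughly four times the HFP8 peak, and INT4 peak efficiency at a somewhat larger multiple of the HFP8 peak.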