System-Level Effects of Soft Errors in Uncore Components

Authors
Cho, Hyungmin; Cheng, Eric; Shepherd, Thomas; Cher, Chen-Yong; Mitra, Subhasish
Issue Date
Sep-2017
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Keywords
Recovery; reliability; resilience; simulation; soft error; uncore components
Citation
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, v.36, no.9, pp.1497 - 1510
Journal Title
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS
Volume
36
Number
9
Start Page
1497
End Page
1510
URI
https://scholarworks.bwise.kr/hongik/handle/2020.sw.hongik/5339
DOI
10.1109/TCAD.2017.2651824
ISSN
0278-0070
Abstract
The effects of soft errors in processor cores have been widely studied. However, little has been published about soft errors in uncore components, such as the memory subsystem and I/O controllers, of a system-on-a-chip (SoC). In this paper, we study how soft errors in uncore components affect system-level behaviors. We have created a new mixed-mode simulation platform that combines simulators at two different levels of abstraction, and achieves 20 000x speedup over register-transfer-level-only simulation. Using this platform, we present the first study of the system-level impact of soft errors inside various uncore components of a large-scale, multicore SoC using the industrial-grade, open-source OpenSPARC T2 SoC design. Our results show that soft errors in uncore components can significantly impact system-level reliability. We also demonstrate that uncore soft errors can create major challenges for traditional system-level checkpoint recovery techniques. To overcome such recovery challenges, we present a new replay recovery technique for uncore components belonging to the memory subsystem. For the L2 cache controller and the dynamic random-access memory controller components of OpenSPARC T2, our new technique reduces the probability that an application run fails to produce correct results due to soft errors by more than 50x with 1.82% and 2.58% chip-level area and power impact, respectively.
Appears in Collections
College of Engineering > Computer Engineering Major > 1. Journal Articles
