SkipReduce: (Interconnection) Network Sparsity to Accelerate Distributed Machine Learning
- Authors
- Kasan, Hans; Abts, Dennis; Choi, Jungwook; Kim, John
- Issue Date
- Oct-2025
- Citation
- IEEE/ACM International Symposium on Microarchitecture (MICRO), vol. Part of 213862, pp. 643-658
- Pages
- 16
- Indexed
- SCOPUS
- Journal Title
- IEEE/ACM International Symposium on Microarchitecture (MICRO)
- Volume
- Part of 213862
- Start Page
- 643
- End Page
- 658
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209447
- DOI
- 10.1145/3725843.3756092
- ISSN
- 1072-4451
- Abstract
- The interconnection network is a critical component for building scalable systems, as its communication bandwidth directly impacts the collective communication performance of distributed training. In this work, we exploit interconnection network sparsity (or communication sparsity) to address challenges of communication performance and scalability. In particular, we identify how gradients (or packets) during communication can be randomly skipped with minimal impact on accuracy. However, skipping gradients in fine granularity (or individually) results in a loss of gradient information without improving communication performance, due to the synchronous nature of collective communication. Thus, we propose coarse-grained skipping where gradient slices are skipped, which enables skipping of some AllReduce steps to accelerate communication. Specifically, we propose SkipReduce collective communication that intentionally skips random gradients during AllReduce. However, a naive implementation of SkipReduce can degrade accuracy by repeatedly skipping gradients from the same node, which introduces bias. To mitigate this accuracy loss, we show how randomizing the skipped gradient slices improves training accuracy with negligible additional runtime. We also observe that not all layers have similar communication sparsity and propose applying SkipReduce selectively where only the sparse layers (or gradients) are skipped to minimize the accuracy impact of SkipReduce. Compared to prior work on communication acceleration, SkipReduce can be seamlessly integrated into existing collective communication libraries with minimal overhead. We implement SkipReduce on top of NCCL's ring-based AllReduce algorithm. Our results show that this method accelerates collective communication while preserving final training accuracy. Compared to baseline AllReduce, SkipReduce provides up to a 1.58× speedup in time-to-accuracy. Beyond this performance gain in data parallelism, this work also discusses the broader implications of SkipReduce, including its application to other parallelism strategies and logical topologies, as well as its benefits as a model regularizer.
