Orchestrating Large-Scale SpGEMMs using Dynamic Block Distribution and Data Transfer Minimization on Heterogeneous Systems
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Park, Taehyeong | - |
dc.contributor.author | Kang, Seokwon | - |
dc.contributor.author | Jang, Myung-Hwan | - |
dc.contributor.author | Kim, Sang-Wook | - |
dc.contributor.author | Park, Yongjun | - |
dc.date.accessioned | 2023-08-22T01:30:19Z | - |
dc.date.available | 2023-08-22T01:30:19Z | - |
dc.date.issued | 2023-04 | - |
dc.identifier.issn | 1084-4627 | - |
dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/189393 | - |
dc.description.abstract | Sparse general matrix-matrix multiplication (SpGEMM) is a major kernel in various emerging applications, such as database management systems, deep learning, graph analysis, and recommendation systems. Since SpGEMM requires extensive computation, many SpGEMM techniques have been implemented on graphics processing units (GPUs) to fully exploit massive data parallelism. However, traditional SpGEMM techniques usually do not fully utilize the GPU because most non-zero elements of the target sparse matrices exist in a few hub nodes, while non-hub nodes have barely any non-zero elements. This data characteristic (a power-law distribution) causes significant performance degradation because of the load imbalance between GPU cores and the low utilization of each core. Recent implementations have attempted to solve this problem with smart pre-/post-processing. However, the net performance hardly improves and sometimes even deteriorates owing to the large overheads. Additionally, non-hub nodes are inherently unsuitable for GPU computing, even after optimization. Furthermore, owing to the rapid growth in GPU computing power and input data size, performance is no longer dominated by kernel execution but by data transfers, such as device-to-host (D2H) transfers and file I/Os. Therefore, this work proposes a Dynamic Block Distributor (DBD), a novel full-system-level SpGEMM orchestration framework for heterogeneous systems that improves overall performance by enabling efficient CPU-GPU collaboration and further minimizing the data-transfer overhead between all system elements. This framework first divides the target matrix into smaller blocks and then offloads the computation of each block to the appropriate computing unit, either the GPU or the CPU, based on its workload type and the status of resource utilization at runtime. It also minimizes the data-transfer overhead with simple but suitable techniques, such as Row Collecting, I/O Overlapping, and I/O Binding. Our experiments showed that this framework reduced the execution latency of SpGEMM, including both kernel execution and D2H transfers, by 3.24x on average, and the overall execution time by 2.07x on average, compared to the baseline cuSPARSE library. | - |
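The abstract's core idea, partitioning a power-law sparse matrix into blocks and routing each block to the GPU or CPU by its workload, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the block size, the hub threshold, and the average-nnz dispatch rule are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of dynamic block distribution: partition the rows of a
# sparse matrix into fixed-size blocks, then route each block to the GPU
# (hub-heavy, dense work) or the CPU (sparse, non-hub work). The threshold
# and block size are illustrative assumptions, not the paper's heuristics.

def dispatch_blocks(row_nnz, block_size=4, hub_threshold=8):
    """Assign each block of rows to 'GPU' or 'CPU' based on the
    average number of non-zeros per row in that block."""
    assignments = []
    for start in range(0, len(row_nnz), block_size):
        block = row_nnz[start:start + block_size]
        avg_nnz = sum(block) / len(block)
        unit = "GPU" if avg_nnz >= hub_threshold else "CPU"
        assignments.append((start, start + len(block), unit))
    return assignments

# Power-law-like row distribution: a few hub rows hold most non-zeros.
rows = [120, 95, 80, 60, 3, 2, 1, 1, 0, 1, 2, 0]
for lo, hi, unit in dispatch_blocks(rows):
    print(f"rows [{lo}, {hi}) -> {unit}")
```

A runtime system in the spirit of DBD would additionally consult live resource utilization before committing each block, and overlap the resulting host-device transfers with computation; this sketch shows only the static workload-type split.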
dc.format.extent | 4 | - |
dc.language | English | - |
dc.language.iso | ENG | - |
dc.publisher | IEEE Computer Society | - |
dc.title | Orchestrating Large-Scale SpGEMMs using Dynamic Block Distribution and Data Transfer Minimization on Heterogeneous Systems | - |
dc.type | Article | - |
dc.publisher.location | United States | - |
dc.identifier.doi | 10.1109/ICDE55515.2023.00189 | - |
dc.identifier.scopusid | 2-s2.0-85167662508 | - |
dc.identifier.bibliographicCitation | Proceedings - International Conference on Data Engineering, v.2023-April, pp 2456 - 2459 | - |
dc.citation.title | Proceedings - International Conference on Data Engineering | - |
dc.citation.volume | 2023-April | - |
dc.citation.startPage | 2456 | - |
dc.citation.endPage | 2459 | - |
dc.type.docType | Conference paper | - |
dc.description.isOpenAccess | N | - |
dc.description.journalRegisteredClass | scopus | - |
dc.subject.keywordPlus | Computer graphics | - |
dc.subject.keywordPlus | Computing power | - |
dc.subject.keywordPlus | Data transfer | - |
dc.subject.keywordPlus | Deep learning | - |
dc.subject.keywordPlus | Electric power distribution | - |
dc.subject.keywordPlus | Matrix algebra | - |
dc.subject.keywordPlus | Program processors | - |
dc.subject.keywordPlus | Graphics processing unit | - |
dc.subject.keywordPlus | Heterogeneous | - |
dc.subject.keywordPlus | Heterogeneous systems | - |
dc.subject.keywordPlus | Hub nodes | - |
dc.subject.keywordPlus | Large scale sparse matrix | - |
dc.subject.keywordPlus | Large-scales | - |
dc.subject.keywordPlus | Matrix multiplication | - |
dc.subject.keywordPlus | Matrix-matrix multiplications | - |
dc.subject.keywordPlus | Performance | - |
dc.subject.keywordPlus | Sparse matrices | - |
dc.subject.keywordPlus | Sparse matrix multiplication | - |
dc.subject.keywordAuthor | GPU | - |
dc.subject.keywordAuthor | heterogeneous | - |
dc.subject.keywordAuthor | large-scale sparse matrix | - |
dc.subject.keywordAuthor | Sparse matrix multiplication | - |
dc.identifier.url | https://ieeexplore.ieee.org/document/10184530 | - |