Discovering Efficient Fused Layer Configurations for Executing Multi-Workloads on Multi-Core NPUs

Lee, Younghyun; Kim, Hyejun; Yu, Yongseung; Cho, Myeongjin; Seo, Jiwon; Park, Yongjun

Full metadata record

DC Field	Value	Language
dc.contributor.author	Lee, Younghyun	-
dc.contributor.author	Kim, Hyejun	-
dc.contributor.author	Yu, Yongseung	-
dc.contributor.author	Cho, Myeongjin	-
dc.contributor.author	Seo, Jiwon	-
dc.contributor.author	Park, Yongjun	-
dc.date.accessioned	2024-11-28T16:30:53Z	-
dc.date.available	2024-11-28T16:30:53Z	-
dc.date.issued	2024-03	-
dc.identifier.issn	1530-1591	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/197597	-
dc.description.abstract	As the AI industry grows rapidly, Neural Processing Units (NPUs) have been developed to deliver AI services more efficiently. One of the most important challenges for NPUs is scheduling tasks to minimize off-chip memory accesses, which can incur significant performance overhead. To reduce memory accesses, multiple convolution layers can be fused into a fused layer group, which offers numerous optimization opportunities. However, in most Convolutional Neural Networks (CNNs), when multiple layers are fused, the on-chip memory utilization of the fused layers gradually decreases, resulting in non-flat memory usage. In this paper, we propose a scheduling search algorithm that optimizes the fusion of multiple convolution layers while reducing peak on-chip memory usage. The proposed algorithm aims to find a schedule that simultaneously optimizes execution time and peak on-chip memory usage, at the cost of a slight increase in off-chip memory accesses. It organizes the search space into a graph of possible partial schedules and then finds the optimal path. As a result of the improved on-chip memory usage, multiple workloads can be executed on multi-core NPUs with increased throughput. Experimental results show that the fusion schedule found by the proposed method reduced on-chip memory usage by 39% while increasing latency by 13%. When the freed on-chip memory was allocated to other workloads and the two workloads were executed concurrently on a multi-core NPU, a 32% performance improvement was achieved.	-
dc.format.extent	6	-
dc.language	English	-
dc.language.iso	ENG	-
dc.title	Discovering Efficient Fused Layer Configurations for Executing Multi-Workloads on Multi-Core NPUs	-
dc.type	Article	-
dc.identifier.scopusid	2-s2.0-85196490523	-
dc.identifier.bibliographicCitation	Proceedings - Design, Automation and Test in Europe, DATE, pp 1 - 6	-
dc.citation.title	Proceedings - Design, Automation and Test in Europe, DATE	-
dc.citation.startPage	1	-
dc.citation.endPage	6	-
dc.type.docType	Conference paper	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	scopus	-
dc.subject.keywordPlus	Convolution	-
dc.subject.keywordPlus	Multilayer neural networks	-
dc.subject.keywordAuthor	Compilers	-
dc.subject.keywordAuthor	Neural networks	-
dc.subject.keywordAuthor	NPU	-
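
The abstract describes the search as a graph of possible partial schedules in which the best path is selected. The sketch below illustrates that general idea only; it is not the authors' algorithm. It assumes a linear chain of layers, treats each cut point between layers as a graph node, and treats each fused group as an edge. The cost model, the per-layer numbers, and the names `best_schedule`, `layer_mem`, and `spill_cost` are all hypothetical and chosen for illustration.

```python
# Toy illustration (NOT the paper's method): pick fused-layer groups for a
# linear chain of layers by finding the cheapest path through a graph whose
# nodes are cut points 0..n and whose edges (i, j) fuse layers i..j-1.

def best_schedule(layer_mem, spill_cost, mem_budget):
    """Return (total_cost, cut_points) for the cheapest feasible schedule.

    layer_mem[i]  : hypothetical on-chip footprint layer i adds to its group
    spill_cost[i] : hypothetical off-chip cost of writing layer i's output
    mem_budget    : peak on-chip memory allowed for any single fused group
    """
    n = len(layer_mem)
    inf = float("inf")
    # best[j] holds the cheapest known partial schedule covering layers 0..j-1.
    best = [(inf, [])] * (n + 1)
    best[0] = (0, [0])
    for j in range(1, n + 1):
        for i in range(j):
            prev_cost, prev_cuts = best[i]
            if prev_cost == inf:
                continue  # no feasible partial schedule ends at cut point i
            if sum(layer_mem[i:j]) > mem_budget:
                continue  # fused group i..j-1 would exceed the on-chip budget
            # Assumed cost: one compute unit per layer, plus one off-chip
            # write for the group's final output only.
            cost = prev_cost + (j - i) + spill_cost[j - 1]
            if cost < best[j][0]:
                best[j] = (cost, prev_cuts + [j])
    return best[n]


if __name__ == "__main__":
    mem = [4, 3, 3, 2, 2]      # made-up numbers
    spill = [8, 6, 6, 4, 4]    # made-up numbers
    print(best_schedule(mem, spill, mem_budget=14))
    print(best_schedule(mem, spill, mem_budget=6))
```

Under this toy model, tightening `mem_budget` forces more cut points, which lowers peak on-chip usage per group but adds off-chip writes. That mirrors the trade-off the abstract reports, though the real cost model and search space are defined in the paper itself.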

Files in This Item: There are no files associated with this item.

Appears in Collections: Seoul College of Engineering > Seoul School of Computer Software > 1. Journal Articles
