LOP+SAMM: DNN Inference Accelerator with Hardware Loop Offloading and Segment-Wise On-Chip Memory Data Synchronization

Lee, Won Kyoo; Rho, Soomin; Chung, Ki-Seok

doi:10.1109/ICECIE66637.2025.11363807

Authors: Lee, Won Kyoo; Rho, Soomin; Chung, Ki-Seok

Issue Date: Feb-2026

Publisher: Institute of Electrical and Electronics Engineers

Keywords: Accelerator; Dataflow Processing; Matrix Multiplication

Citation: Proceedings, International Conference on Electrical, Control and Instrumentation Engineering, ICECIE, pp. 467-474

Pages: 8

Indexed: SCOPUS

Journal Title: Proceedings, International Conference on Electrical, Control and Instrumentation Engineering, ICECIE

Start Page: 467

End Page: 474

URI: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/212286

DOI: 10.1109/ICECIE66637.2025.11363807

ISSN: 2832-9821; 2832-9848

Abstract: Recent demand for computing power in Deep Neural Networks (DNNs) has driven extensive research on accelerators that improve performance. Standard accelerators typically rely on hardware tailored to fixed operations and offload operation scheduling to maximize compute utilization, but this approach limits programmability and scalability to other models. Alternatively, some systems connect a RISC-V host CPU to control the accelerator and improve flexibility; however, for the multi-loop structure of DNN computation, CPU control overhead can limit compute utilization. These approaches also overlook the gains available from efficient on-chip memory data management. We propose a DNN inference accelerator that mitigates this trade-off by combining a Loop Offloading Processor (LOP) and a Scratchpad–Accumulator Mutex Map (SAMM). LOP offloads to hardware the CPU control overhead that arises in nested loops, thereby addressing utilization limits while preserving programmability through loop-wise control. SAMM applies segment-wise mutual exclusion to orchestrate efficient data transfers between on-chip memory and external memory, enabling fine-grained overlap of transfer and computation while preserving the maximum tile size (i.e., across the entire buffer). Compared to the state-of-the-art Gemmini accelerator, our evaluation demonstrates that LOP+SAMM improves performance by 1.13–1.32× across diverse GEMM (General Matrix Multiplication) workloads, reduces external memory accesses by up to 1.51×, decreases scratchpad bank conflicts by up to 15.22×, and achieves 1.09–1.18× end-to-end latency speedups at the model level with only a 1.01× area increase over the Gemmini baseline.
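Illustrative note: the abstract does not describe SAMM's internals, so the following is only a toy timing model of the general idea that locking a buffer per segment, rather than as a whole, lets transfers overlap with computation. It is not the authors' design. The function name `total_cycles`, its parameters, and the cycle counts are all hypothetical, and the numbers it prints are unrelated to the results reported in the paper.

```python
def total_cycles(num_chunks, num_segments, lock_granularity,
                 load_cycles, compute_cycles):
    """Toy model (hypothetical): cycles to stream chunks through a buffer.

    The buffer has `num_segments` segments and is locked in groups of
    `lock_granularity` segments. A group can be reloaded only after the
    data previously held in it has been fully consumed, and it can be
    computed on only after it has been fully loaded.
    lock_granularity == num_segments models one lock for the whole buffer;
    lock_granularity == 1 models segment-wise mutual exclusion.
    """
    assert num_segments % lock_granularity == 0
    assert num_chunks % lock_granularity == 0
    resident_groups = num_segments // lock_granularity  # groups that fit at once
    num_groups = num_chunks // lock_granularity

    load_end, compute_end = [], []
    for j in range(num_groups):
        # The single loader is busy until the previous group has been loaded.
        prev_load = load_end[j - 1] if j > 0 else 0
        # The target segments are free once their previous occupant is consumed.
        freed = compute_end[j - resident_groups] if j >= resident_groups else 0
        le = max(prev_load, freed) + lock_granularity * load_cycles
        # The single compute unit handles groups in order, after their load.
        prev_comp = compute_end[j - 1] if j > 0 else 0
        ce = max(prev_comp, le) + lock_granularity * compute_cycles
        load_end.append(le)
        compute_end.append(ce)
    return compute_end[-1]


# Arbitrary example: 8 chunks, 4 segments, 10 cycles each to load and compute.
print(total_cycles(8, 4, 4, 10, 10))  # whole-buffer lock  -> 160
print(total_cycles(8, 4, 1, 10, 10))  # segment-wise locks -> 90
```

In this model the whole-buffer lock serializes every load and compute phase, while per-segment locks let the loader refill a segment as soon as it has been consumed, so only the first load remains exposed when load and compute times are equal.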

Related Researcher: Chung, Ki Seok, College of Engineering (School of Electronic Engineering)
