LOP+SAMM: DNN Inference Accelerator with Hardware Loop Offloading and Segment-Wise On-Chip Memory Data Synchronization
- Authors
- Lee, Won Kyoo; Rho, Soomin; Chung, Ki-Seok
- Issue Date
- Feb-2026
- Publisher
- Institute of Electrical and Electronics Engineers
- Keywords
- Accelerator; Dataflow Processing; Matrix Multiplication
- Citation
- Proceedings, International Conference on Electrical, Control and Instrumentation Engineering, ICECIE, pp. 467-474
- Pages
- 8
- Indexed
- SCOPUS
- Journal Title
- Proceedings, International Conference on Electrical, Control and Instrumentation Engineering, ICECIE
- Start Page
- 467
- End Page
- 474
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/212286
- DOI
- 10.1109/ICECIE66637.2025.11363807
- ISSN
- 2832-9821; 2832-9848
- Abstract
- Recent demand for computing power in Deep Neural Networks (DNNs) has driven extensive research on accelerators that improve performance. Standard accelerators typically rely on hardware tailored to fixed operations and offload operation scheduling to maximize compute utilization, but this approach limits scalability to other models and programmability. Alternatively, systems connect a RISC-V host CPU to control the accelerator and improve flexibility; however, for the multi-loop structure of DNN computation, CPU control overhead can limit compute utilization. These approaches also overlook the gains available from efficient on-chip memory data management. We propose a DNN inference accelerator that mitigates this trade-off by combining a Loop Offloading Processor (LOP) and a Scratchpad–Accumulator Mutex Map (SAMM). LOP offloads the CPU control overheads that arise in nested loops to hardware, thereby addressing utilization limits while preserving programmability through loop-wise control. SAMM applies segment-wise mutual exclusion to orchestrate efficient data transfers between on-chip memory and external memory, enabling fine-grained overlap of transfer and computation and preserving the maximum tile size (i.e., across the entire buffer). Compared to the state-of-the-art Gemmini accelerator, our evaluation demonstrates that LOP+SAMM improves performance by 1.13–1.32× across diverse GEMM (General Matrix Multiplication) workloads, reduces external memory accesses by up to 1.51×, decreases scratchpad bank conflicts by up to 15.22×, and achieves 1.09–1.18× end-to-end latency speedups at the model level with only a 1.01× area increase over the Gemmini baseline.
- Appears in Collections
- College of Engineering (Seoul) > Department of Electronic Engineering (Seoul) > 1. Journal Articles
