Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits

Authors
Kim, Dowon; Lee, MinJae; Kim, Janghyeon; Kwon, HyuckSung; Jeong, Hyeonggyu; Park, Sang-Soo; Yoon, Minyong; Roh, Si-Dong; Kwon, Yongsuk; So, Jinin; Choi, Jungwook
Issue Date
Dec-2025
Publisher
Institute of Electrical and Electronics Engineers Inc.
Keywords
Long-context LLM inference; Processing-Near-Memory (PNM); Compute Express Link (CXL); Key-Value (KV) cache management; Hybrid GPU-PNM parallelism
Citation
Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pp 1 - 13
Pages
13
Indexed
SCOPUS
Journal Title
Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
Start Page
1
End Page
13
URI
https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/211317
DOI
10.1109/PACT65351.2025.00013
ISSN
1089-795X
Abstract
The expansion of context windows in large language models (LLMs) to multi-million tokens introduces severe memory and compute bottlenecks, particularly in managing the growing Key-Value (KV) cache. While Compute Express Link (CXL) enables non-eviction frameworks that offload the full KV-cache to scalable external memory, these frameworks still suffer from costly data transfers when recalling non-resident KV tokens to limited GPU memory as context lengths increase. This work proposes scalable Processing-Near-Memory (PNM) for 1M-Token LLM Inference, a CXL-enabled KV-cache management system that coordinates memory and computation beyond GPU limits. Our design offloads token page selection to a PNM accelerator within CXL memory, eliminating costly recalls and enabling larger GPU batch sizes. We further introduce a hybrid parallelization strategy and a steady-token selection mechanism to enhance compute efficiency and scalability. Implemented atop a state-of-the-art CXL-PNM system, our solution delivers consistent performance gains for LLMs with up to 405B parameters and 1M-token contexts. Our PNM-only offloading scheme (PNM-KV) and GPU–PNM hybrid with steady-token execution (PnG-KV) achieve up to 21.9× throughput improvement, up to 60× lower energy per token, and up to 7.3× better total cost efficiency than the baseline, demonstrating that CXL-enabled multi-PNM architectures can serve as a scalable backbone for future long-context LLM inference.
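A back-of-the-envelope calculation may help show why the KV cache outgrows GPU memory at these context lengths. The sketch below uses the standard sizing formula for a decoder-only transformer (keys plus values, per layer, per KV head, per token). The layer count, KV-head count, head dimension, and 16-bit precision are illustrative assumptions for a 405B-class model with grouped-query attention; they are not taken from the paper, and the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Bytes needed to store the keys and values of one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Assumed model shape, for illustration only (not stated in the abstract).
cfg = dict(num_layers=126, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(num_tokens=1, **cfg)
total = kv_cache_bytes(num_tokens=1_000_000, **cfg)

print(f"per token: {per_token} bytes")
print(f"1M tokens: {total / 1e9:.1f} GB")
```

Under these assumptions a single 1M-token sequence needs roughly 516 GB for its KV cache alone, far more than the on-package memory of a single present-day GPU, which is the pressure that motivates keeping the full cache in CXL-attached memory.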
Appears in Collections
Seoul College of Engineering > Seoul School of Electronic Engineering > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Choi, Jung wook
COLLEGE OF ENGINEERING (SCHOOL OF ELECTRONIC ENGINEERING)