Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits

Full metadata record
DC Field | Value | Language
dc.contributor.author | Kim, Dowon | -
dc.contributor.author | Lee, MinJae | -
dc.contributor.author | Kim, Janghyeon | -
dc.contributor.author | Kwon, HyuckSung | -
dc.contributor.author | Jeong, Hyeonggyu | -
dc.contributor.author | Park, Sang-Soo | -
dc.contributor.author | Yoon, Minyong | -
dc.contributor.author | Roh, Si-Dong | -
dc.contributor.author | Kwon, Yongsuk | -
dc.contributor.author | So, Jinin | -
dc.contributor.author | Choi, Jungwook | -
dc.date.accessioned | 2026-03-18T00:30:34Z | -
dc.date.available | 2026-03-18T00:30:34Z | -
dc.date.issued | 2025-12 | -
dc.identifier.issn | 1089-795X | -
dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/211317 | -
dc.description.abstract | The expansion of context windows in large language models (LLMs) to multi-million tokens introduces severe memory and compute bottlenecks, particularly in managing the growing Key-Value (KV) cache. While Compute Express Link (CXL) enables non-eviction frameworks that offload the full KV-cache to scalable external memory, these frameworks still suffer from costly data transfers when recalling non-resident KV tokens to limited GPU memory as context lengths increase. This work proposes scalable Processing-Near-Memory (PNM) for 1M-Token LLM Inference, a CXL-enabled KV-cache management system that coordinates memory and computation beyond GPU limits. Our design offloads token page selection to a PNM accelerator within CXL memory, eliminating costly recalls and enabling larger GPU batch sizes. We further introduce a hybrid parallelization strategy and a steady-token selection mechanism to enhance compute efficiency and scalability. Implemented atop a state-of-the-art CXL-PNM system, our solution delivers consistent performance gains for LLMs with up to 405B parameters and 1M-token contexts. Our PNM-only offloading scheme (PNM-KV) and GPU–PNM hybrid with steady-token execution (PnG-KV) achieve up to 21.9× throughput improvement, up to 60× lower energy per token, and up to 7.3× better total cost efficiency than the baseline, demonstrating that CXL-enabled multi-PNM architectures can serve as a scalable backbone for future long-context LLM inference. | -
dc.format.extent | 13 | -
dc.language | English | -
dc.language.iso | ENG | -
dc.publisher | Institute of Electrical and Electronics Engineers Inc. | -
dc.title | Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits | -
dc.type | Article | -
dc.identifier.doi | 10.1109/PACT65351.2025.00013 | -
dc.identifier.scopusid | 2-s2.0-105031900662 | -
dc.identifier.bibliographicCitation | Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pp 1 - 13 | -
dc.citation.title | Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT | -
dc.citation.startPage | 1 | -
dc.citation.endPage | 13 | -
dc.type.docType | Conference paper | -
dc.description.isOpenAccess | N | -
dc.description.journalRegisteredClass | scopus | -
dc.subject.keywordPlus | Cache memory | -
dc.subject.keywordPlus | Memory architecture | -
dc.subject.keywordPlus | Memory management | -
dc.subject.keywordAuthor | Long-context LLM inference | -
dc.subject.keywordAuthor | Processing-Near-Memory (PNM) | -
dc.subject.keywordAuthor | Compute Express Link (CXL) | -
dc.subject.keywordAuthor | Key-Value (KV) cache management | -
dc.subject.keywordAuthor | Hybrid GPU-PNM parallelism | -
dc.identifier.url | https://ieeexplore.ieee.org/document/11282934 | -
Appears in Collections
Seoul College of Engineering > Seoul School of Electronic Engineering > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Choi, Jung wook
COLLEGE OF ENGINEERING (SCHOOL OF ELECTRONIC ENGINEERING)