Detailed Information


Mitigating Quantization Errors Due to Activation Spikes in Gated Linear Unit-Based Large Language Models (Open Access)

Authors
Yang, Jaewoo; Kim, Hayun; Ji, Junyung; Kim, Younghoon
Issue Date
Apr-2025
Publisher
MDPI
Keywords
quantization; LLM; post-training quantization; outliers
Citation
FUTURE INTERNET, v.17, no.4, pp. 1-21
Pages
21
Indexed
SCOPUS
ESCI
Journal Title
FUTURE INTERNET
Volume
17
Number
4
Start Page
1
End Page
21
URI
https://scholarworks.bwise.kr/erica/handle/2021.sw.erica/125234
DOI
10.3390/fi17040185
ISSN
1999-5903
Abstract
Modern large language models (LLMs) achieve state-of-the-art performance through architectural advancements but require high computational costs for inference. Post-training quantization is a widely adopted approach to reduce these costs by quantizing weights and activations to lower precision, such as INT8. However, we identify a critical challenge in activation quantization for GLU (Gated Linear Unit) variants, which are commonly used in the feed-forward networks of modern LLMs like the LLaMA family. Specifically, severe local quantization errors arise due to excessively large activation magnitudes, which we refer to as activation spikes, leading to significant degradation in model performance. Our analysis reveals a systematic pattern of these spikes: they predominantly occur in the FFN (feed-forward network) layers at the early and late layers of the model and are concentrated on a small subset of tokens rather than being uniformly distributed across a token sequence. To mitigate this issue, we propose two empirical methods: Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), which isolate activation spikes during quantization. Extensive experiments demonstrate that our methods effectively improve activation quantization, particularly in coarse-grained quantization schemes, enhancing the performance of LLMs with GLU variants and addressing the limitations of existing quantization techniques. The code for implementing our methods and reproducing the experiments is publicly available in our GitHub repository.
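
The sketch below is not the authors' released code; it is a minimal, self-contained illustration of the failure mode the abstract describes: under coarse-grained (per-tensor) symmetric INT8 quantization, a single activation spike stretches the quantization scale, so all remaining activations are rounded far more coarsely. The tensor size and spike magnitude are arbitrary assumptions chosen for demonstration.

# Minimal sketch (synthetic data): how one activation spike inflates
# per-tensor INT8 quantization error.
import numpy as np

def quantize_int8(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor INT8 quantize-then-dequantize."""
    scale = np.abs(x).max() / 127.0           # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale                           # dequantized approximation

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=4096)         # typical activation values
spiked = acts.copy()
spiked[0] = 500.0                              # a single outlier ("spike")

for name, a in [("no spike", acts), ("with spike", spiked)]:
    mse = np.mean((a - quantize_int8(a)) ** 2)
    print(f"{name}: quantization MSE = {mse:.6f}")
# The spike dominates the max-abs scale, so the ordinary activations lose
# precision -- the local quantization error that QFeM/QFeP aim to isolate.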
Appears in Collections
COLLEGE OF COMPUTING > DEPARTMENT OF ARTIFICIAL INTELLIGENCE > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Kim, Younghoon
ERICA College of Computing (DEPARTMENT OF ARTIFICIAL INTELLIGENCE)
