Detailed Information


Not All Layers Are Equal: A Layer-Wise Adaptive Approach Toward Large-Scale DNN Training

Authors
Ko, Yunyong; Lee, Dongwon; Kim, Sang-Wook
Issue Date
Apr-2022
Publisher
Association for Computing Machinery, Inc
Keywords
large batch training; layer-wise approach; learning rate scaling
Citation
WWW 2022 - Proceedings of the ACM Web Conference 2022, pp.1851 - 1859
Indexed
SCOPUS
Journal Title
WWW 2022 - Proceedings of the ACM Web Conference 2022
Start Page
1851
End Page
1859
URI
https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/138794
DOI
10.1145/3485447.3511989
Abstract
Large-batch training with data parallelism is a widely adopted approach to efficiently train a large deep neural network (DNN) model. Large-batch training, however, often suffers from model quality degradation because of its fewer iterations. To alleviate this problem, learning rate (lr) scaling methods have generally been applied, which increase the learning rate to make each update larger. Unfortunately, we observe that large-batch training with state-of-the-art lr scaling methods still often degrades the model quality when the batch size crosses a specific limit, rendering such lr scaling methods less useful. To explain this phenomenon, we hypothesize that existing lr scaling methods overlook the subtle but important differences across layers in training, which results in the degradation of the overall model quality. From this hypothesis, we propose a novel approach (LENA) to learning rate scaling for large-scale DNN training, employing: (1) a layer-wise adaptive lr scaling to adjust the lr for each layer individually, and (2) a layer-wise state-aware warm-up to track the state of training for each layer and finish its warm-up automatically. A comprehensive evaluation across varying batch sizes demonstrates that LENA achieves the target accuracy (i.e., the accuracy of single-worker training): (1) within the fewest iterations across different batch sizes (up to 45.2% fewer iterations and 44.7% shorter time than the existing state-of-the-art method), and (2) for very large batch sizes, surpassing the limits of all baselines.
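The layer-wise scaling and warm-up rules of LENA are defined in the full paper; the abstract only names the two components. As a rough illustration of the general idea of layer-wise lr scaling (not LENA itself), the following minimal PyTorch sketch gives each parameter tensor its own learning rate and rescales it by a LARS-style norm ratio. The model, the hyperparameters base_lr and trust_coef, and the helper apply_layerwise_lr are all illustrative assumptions.

# Illustrative sketch of layer-wise adaptive lr scaling.
# NOT the LENA algorithm: the scaling rule and hyperparameters are assumed.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
base_lr = 0.1       # global lr chosen for the large batch (assumed value)
trust_coef = 0.001  # per-layer trust coefficient (assumed value)

# One parameter group per parameter tensor (weights and biases separately),
# so each gets its own learning rate.
optimizer = torch.optim.SGD(
    [{"params": [p], "lr": base_lr} for p in model.parameters()],
    momentum=0.9,
)

def apply_layerwise_lr(optimizer, base_lr, trust_coef, eps=1e-8):
    """Rescale each group's lr by a LARS-style ratio ||w|| / ||grad w||."""
    for group in optimizer.param_groups:
        (p,) = group["params"]
        if p.grad is None:
            continue
        w_norm = p.detach().norm()
        g_norm = p.grad.detach().norm()
        ratio = trust_coef * w_norm / (g_norm + eps) if w_norm > 0 else 1.0
        group["lr"] = base_lr * float(ratio)

# Example training step on a dummy batch.
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
apply_layerwise_lr(optimizer, base_lr, trust_coef)
optimizer.step()

Because the lr is recomputed from each layer's weight and gradient norms at every step, layers whose gradients are relatively large take smaller steps and vice versa; LENA additionally tracks per-layer training state to end each layer's warm-up automatically, which this sketch does not attempt.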
Appears in Collections
College of Engineering (Seoul) > School of Computer Software (Seoul) > 1. Journal Articles


Related Researcher

Kim, Sang-Wook
COLLEGE OF ENGINEERING (SCHOOL OF COMPUTER SCIENCE)
