SCALABLE AND PRACTICAL NATURAL GRADIENT FOR LARGE-SCALE DEEP LEARNING
Abstract
Large-scale distributed training of deep neural networks results in models with worse generalization performance as a result of the increase in the effective mini-batch size. Previous approaches attempt to address this problem by varying the learning rate and batch size over epochs and layers, or by ad hoc modifications of batch normalization. We propose Scalable and Practical Natural Gradient Descent (SP-NGD), a principled approach for training models that allows them to attain similar generalization performance to models trained with first-order optimization methods, but with accelerated convergence. Furthermore, SP-NGD scales to large mini-batch sizes with a negligible computational overhead as compared to first-order methods. We evaluated SP-NGD on a benchmark task where highly optimized first-order methods are available as references: training a ResNet-50 model for image classification on ImageNet. We demonstrate convergence to a top-1 validation accuracy of 75.4% in 5.5 minutes using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of 74.9% with an extremely large mini-batch size of 131,072 in 873 steps of SP-NGD.
Existing System
- More complex approaches for manipulating the learning rate have been proposed, such as LARS [8], which uses a different learning rate for each layer by scaling it with the ratio between the layer-wise norms of the weights and the gradients.
- This enabled training with a mini-batch size of 32K without ad hoc modifications, reaching 74.9% accuracy in 14 minutes (64 epochs) [8]. It has been reported that combining LARS with counterintuitive modifications to batch normalization can yield 75.8% accuracy even for a mini-batch size of 65K.
- Hierarchical synchronization of mini-batches has also been proposed, but to the best of the authors' knowledge such methods have not been tested at scale.
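The layer-wise scaling used by LARS can be illustrated with a minimal sketch. It assumes a simplified rule in which each layer's learning rate is the global rate multiplied by the ratio of the weight norm to the gradient norm; weight decay and momentum are omitted. The names `lars_layer_lr`, `lars_step` and `trust_coef` are hypothetical and chosen for illustration only.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_layer_lr(weights, grads, global_lr, trust_coef=0.001, eps=1e-12):
    """Layer-wise learning rate: global_lr * trust_coef * ||w|| / ||g||.

    Simplified for illustration (assumption): no weight decay, no momentum.
    """
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        # Fall back to the global rate when the ratio is undefined.
        return global_lr
    return global_lr * trust_coef * w_norm / (g_norm + eps)


def lars_step(layers, grads, global_lr, trust_coef=0.001):
    """Apply one plain gradient step per layer with its own learning rate.

    `layers` and `grads` map a layer name to a flat list of floats.
    """
    updated = {}
    for name, w in layers.items():
        lr = lars_layer_lr(w, grads[name], global_lr, trust_coef)
        updated[name] = [wi - lr * gi for wi, gi in zip(w, grads[name])]
    return updated
```

Under this rule the size of each layer's update is proportional to the norm of its weights and does not depend on the scale of its gradient, which is why a single global learning rate can serve layers with very different gradient magnitudes.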
Disadvantages
- Our implementation uses a decentralized approach based on MPI/NCCL collective communications among the processes. The decentralized approach has long been used in high-performance computing and is known to scale to thousands of GPUs without modification. Although software such as Horovod can alleviate the problems with parameter servers by working as a TensorFlow wrapper for NCCL, a workable realization of K-FAC requires solving many engineering and modeling challenges, and our solution is the first one that succeeds on a large-scale task.
Proposed System
- Existing work uses SGD with large mini-batches. Contrary to prior claims that models trained with second-order methods do not generalize as well as those trained with SGD, we show that this is not the case, even for extremely large mini-batches. Our SP-NGD framework allowed us to train on 1,024 GPUs and achieve 75.4% accuracy in 5.5 minutes. This is the first work to observe the relationship between the Fisher information matrix (FIM) of ResNet-50 and its training on large mini-batches ranging from 4K to 131K. The advantage of this approach for designing better optimizers is that we start from the most mathematically rigorous form, and every improvement is a systematic design decision based on observation of the FIM.
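As a rough illustration of what a natural gradient step does with the FIM, the sketch below preconditions the mean gradient with the inverse of a damped Fisher matrix. It is not the SP-NGD implementation. It assumes a tiny dense model, an empirical Fisher built from per-example gradients, and an exact linear solve, whereas a practical system such as the one described here relies on approximations like K-FAC. All function names and the `damping` parameter are hypothetical.

```python
def empirical_fisher(per_example_grads, damping=1e-3):
    """Damped empirical Fisher: mean of g g^T over examples, plus damping * I.

    Assumption: the empirical Fisher stands in for the true FIM.
    """
    n = len(per_example_grads)
    d = len(per_example_grads[0])
    fisher = [[0.0] * d for _ in range(d)]
    for g in per_example_grads:
        for i in range(d):
            for j in range(d):
                fisher[i][j] += g[i] * g[j] / n
    for i in range(d):
        fisher[i][i] += damping
    return fisher


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        residual = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = residual / aug[i][i]
    return x


def natural_gradient_step(theta, per_example_grads, lr=1.0, damping=1e-3):
    """theta <- theta - lr * F^{-1} * mean_gradient."""
    n = len(per_example_grads)
    mean_grad = [sum(g[i] for g in per_example_grads) / n
                 for i in range(len(theta))]
    fisher = empirical_fisher(per_example_grads, damping)
    direction = solve(fisher, mean_grad)
    return [t - lr * d for t, d in zip(theta, direction)]
```

The dense solve costs cubic time in the number of parameters, which is infeasible for a network the size of ResNet-50. This is why a structured approximation of the FIM is needed at scale.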
Advantages
- More complex approaches for manipulating the learning rate have been proposed, such as LARS [8], which uses a different learning rate for each layer by scaling it with the ratio between the layer-wise norms of the weights and the gradients. This enabled training with a mini-batch size of 32K without ad hoc modifications, reaching 74.9% accuracy in 14 minutes (64 epochs). It has been reported that combining LARS with counterintuitive modifications to batch normalization can yield 75.8% accuracy even for a mini-batch size of 65K.
