Minimizing Training Time of Distributed Machine Learning by Reducing Data Communication
Abstract:
• Due to the additive property of most machine learning objective functions, training can be distributed across multiple machines.
• Distributed machine learning is an efficient way to cope with the rapid growth of data volume, at the cost of extra inter-machine communication.
• One common implementation is the parameter server system, which contains two types of nodes: worker nodes, which calculate updates, and server nodes, which maintain parameters.
• We observe that inefficient communication between workers and servers may slow down the system.
• Therefore, we formulate a graph partition problem: partition data among workers and parameters among servers such that the total training time is minimized.
• This problem is NP-Complete.
• We investigate a two-step heuristic approach that first partitions data, and then partitions parameters.
• We consider the trade-off between partition time and the savings in training time. In addition, we adapt a multilevel graph partition approach to bipartite graph partitioning.
• We implement both approaches on top of an open-source parameter server platform, PS-lite.
• Experimental results on synthetic and real-world datasets show that both approaches can improve communication efficiency by up to 14 times compared with random partitioning.
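The two-step heuristic above can be sketched as follows. This is a minimal illustration under assumed simplifications, not the paper's actual algorithm: the function name `two_step_partition` and the greedy load-balancing rules are hypothetical stand-ins, and the bipartite structure is given as a list of (data block, parameter) edges.

```python
# Illustrative sketch of a two-step partition heuristic: step 1 spreads
# data blocks across workers (heaviest first, greedy load balancing);
# step 2 places each parameter on the least-loaded server, where a
# parameter's load is the number of distinct workers that touch it.
# All names and rules here are hypothetical, not the paper's method.

from collections import defaultdict

def two_step_partition(edges, n_workers, n_servers):
    """edges: list of (data_block, parameter) pairs of a bipartite graph."""
    # Step 1: assign data blocks to workers, most-connected blocks first.
    degree = defaultdict(int)
    blocks_of = defaultdict(set)      # parameter -> data blocks touching it
    for d, p in edges:
        degree[d] += 1
        blocks_of[p].add(d)
    load = [0] * n_workers
    worker_of = {}
    for d in sorted(degree, key=degree.get, reverse=True):
        w = min(range(n_workers), key=lambda i: load[i])
        worker_of[d] = w
        load[w] += degree[d]

    # Step 2: assign each parameter to the least-loaded server, counting
    # one unit of traffic per distinct worker that uses the parameter.
    server_load = [0] * n_servers
    server_of = {}
    for p, blocks in blocks_of.items():
        users = {worker_of[d] for d in blocks}
        s = min(range(n_servers), key=lambda i: server_load[i])
        server_of[p] = s
        server_load[s] += len(users)
    return worker_of, server_of
```

The point of the sketch is the ordering: once data placement is fixed, each parameter's communication cost depends only on how many distinct workers touch it, which is what the second step can then balance.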
Deep neural networks are good at discovering correlation structures in data in an unsupervised fashion, and are therefore widely used in speech analysis, natural language processing, and computer vision. Information about the structure of the data is stored in a distributed fashion: model information is spread across the layers of the network, and within each layer the weights are distributed across neurons. There are many ways to combine the information spread across the neurons of a layer, and many ways to combine the layers, in order to minimize a loss function (a proxy for how well the neural network is achieving its goals).
• When the neural network first begins to train, for the first several epochs the gradient values are large and the network changes rapidly.
• Compressing these early gradients restricts the network's ability to evolve during those first epochs. To avoid this problem, Yujun Lin et al. use warm-up training in DGC.
• This technique splits the training process into two periods: the warm-up period and the normal training period.
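A warm-up schedule of this kind can be sketched as below. The exponential ramp from 75% to 99.9% sparsity over the first few epochs follows the schedule reported in the DGC paper, but the function name and exact form here are an illustrative assumption, not DGC's implementation.

```python
# Sketch of gradient-sparsity warm-up in the spirit of DGC: early,
# large gradients are compressed only lightly, and sparsity ramps up
# exponentially (0.75, 0.9375, 0.984..., 0.996) before settling at the
# normal-training value of 0.999. Numbers follow the DGC paper's
# reported schedule; the function itself is a hypothetical sketch.

def warmup_sparsity(epoch, warmup_epochs=4, final=0.999):
    """Return the fraction of gradient entries to drop at this epoch."""
    if epoch >= warmup_epochs:
        return final               # normal training period
    return 1.0 - 0.25 ** (epoch + 1)   # warm-up period: exponential ramp
```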
• We study and analyze synchronous and asynchronous weight-update algorithms (such as Parallel SGD, ADMM, and Downpour SGD) and derive the worst-case asymptotic communication cost and computation time for each of these algorithms.
• We communicate the parameters to each machine, where SGD is used to update the weights of the model.
• Once this is done, the weights are sent to the driver machine for aggregation.
• Once the parameters have been aggregated and updated using data from all machines (this is why it is synchronous), they are broadcast to all machines and the whole procedure is repeated.
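One such synchronous round can be sketched as follows. The quadratic loss and NumPy linear model are illustrative stand-ins for any differentiable model, and the function names are hypothetical; parallel workers are simulated sequentially.

```python
# Minimal sketch of one round of synchronous parallel SGD: each worker
# runs SGD on its own data shard starting from the broadcast weights,
# the driver averages the resulting weights, and the average is then
# broadcast back for the next round.

import numpy as np

def local_sgd(w, shard, lr=0.05):
    """One SGD pass over a worker's shard, squared loss (w.x - y)^2."""
    w = w.copy()
    for x, y in shard:
        w -= lr * 2 * (w @ x - y) * x   # gradient of (w.x - y)^2
    return w

def sync_round(w, shards):
    # Workers compute local updates "in parallel" (sequential here),
    # then the driver aggregates by averaging the weights.
    return np.mean([local_sgd(w, s) for s in shards], axis=0)
```

Iterating `sync_round` drives all workers' models toward a common minimizer, at the cost of one aggregation and one broadcast per round.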
• The communication synchronization decides how frequently all local models are synchronized with others, which will influence not only the communication traffic but also the performance and convergence of model training.
• There is thus a trade-off between the communication traffic and the convergence.
• Additionally, different synchronization schemes can be combined with different architectures.
• Tensor partitioning has been proposed for communication scheduling (so that even feed-forward computations can be overlapped with communications) to further reduce the communication cost.
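The synchronization-frequency trade-off above can be sketched in the style of local SGD: each worker takes H local passes between synchronizations, so larger H cuts communication rounds for the same number of gradient steps, while letting local models drift further apart before each averaging. The function name and the quadratic-loss model are illustrative assumptions.

```python
# Sketch of the trade-off between communication traffic and convergence:
# H controls how many local SGD passes each worker runs between
# synchronizations. Fewer synchronizations mean less traffic, but more
# local drift before each averaging step.

import numpy as np

def train_with_sync_period(shards, H, sync_rounds, lr=0.05):
    """Train a linear model on squared loss; return (weights, comm_rounds)."""
    w = np.zeros(len(shards[0][0][0]))
    for _ in range(sync_rounds):
        locals_ = []
        for shard in shards:
            lw = w.copy()
            for _ in range(H):                       # H local passes
                for x, y in shard:
                    lw -= lr * 2 * (lw @ x - y) * x  # SGD step
            locals_.append(lw)
        w = np.mean(locals_, axis=0)                 # one synchronization
    return w, sync_rounds
```

With the same total number of gradient steps, H = 4 uses a quarter of the communication rounds of H = 1; whether accuracy suffers depends on how far the shards' local objectives disagree.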
