Asynchronous Parallel, Sparse Approximated SVRG for High-Dimensional Machine Learning
ABSTRACT :
With the increasing size of datasets and the development of multi-core computers, asynchronous parallel stochastic optimization algorithms such as KroMagnon have gained significant attention. In this paper, we propose a new Sparse approximation and asynchronous parallel Stochastic Variance Reduced Gradient (SSVRG) method for sparse and high-dimensional machine learning problems. Unlike standard SVRG and its asynchronous parallel variant, KroMagnon, the snapshot point of SSVRG is set to the average of all the iterates in the previous epoch, which allows it to take much larger learning rates and makes it more robust to the choice of learning rates. In particular, we use a sparse approximation of the popular SVRG estimator to perform completely sparse updates. Therefore, SSVRG has a much lower per-iteration computational cost than its dense counterpart, SVRG++, and is very friendly to asynchronous parallel implementation. Moreover, we provide convergence guarantees for SSVRG on both strongly convex (SC) and non-strongly convex (non-SC) problems, while existing asynchronous algorithms (e.g., KroMagnon) only have convergence guarantees for SC problems. Finally, we extend SSVRG to non-smooth and asynchronous parallel settings. Numerical results demonstrate that SSVRG converges significantly faster than state-of-the-art asynchronous parallel methods such as KroMagnon, and is usually more than three orders of magnitude faster than SVRG++.
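The epoch structure described above can be sketched as follows. This is a minimal serial sketch based only on the description in the abstract, using a logistic loss for concreteness; the function name ssvrg_epoch, the parameter inv_prob (per-coordinate inverse support probabilities), and the inner-loop length m are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ssvrg_epoch(X, y, w, tilde_w, lr, m, inv_prob, rng):
    """One epoch of a sparse SVRG-style method (illustrative sketch).

    tilde_w  : snapshot point (average of the previous epoch's iterates)
    inv_prob : per-coordinate inverse support probabilities used to
               re-weight the dense full gradient so every inner update
               touches only the support of the sampled example
    Returns the last iterate and the average of this epoch's iterates,
    which serves as the next snapshot point.
    """
    n, d = X.shape
    # full logistic-loss gradient at the snapshot, computed once per epoch
    margins = y * (X @ tilde_w)
    mu = -(X * (y / (1.0 + np.exp(margins)))[:, None]).sum(axis=0) / n

    iterate_sum = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        x_i, y_i = X[i], y[i]
        s = np.nonzero(x_i)[0]                    # support of example i
        g_w  = -y_i * x_i[s] / (1.0 + np.exp(y_i * (x_i @ w)))
        g_sn = -y_i * x_i[s] / (1.0 + np.exp(y_i * (x_i @ tilde_w)))
        # sparse approximation of the SVRG estimator: the dense term mu is
        # restricted to s and re-weighted by inverse support probabilities
        v = g_w - g_sn + inv_prob[s] * mu[s]
        w[s] -= lr * v                            # completely sparse update
        iterate_sum += w                          # a real implementation tracks this lazily
    return w, iterate_sum / m
```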
EXISTING SYSTEM :
• Asynchronous variants of SGD are most competitive when the updates are sparse and have a small overlap, that is, when each update modifies a small and different subset of the coefficients.
• This is typically achieved by updating only coefficients for which the partial gradient at a given iteration is nonzero, but existing schemes such as the lagged updates technique (Schmidt et al., 2016) are not applicable in the asynchronous setting (a minimal sketch of such a sparse update is given after this list).
• We have proven that this algorithm is linearly convergent under a condition on the step size and that it is linearly faster than its sequential counterpart given a bound on the delay.
• Empirical benchmarks show that PROXASAGA is orders of magnitude faster than existing state-of-the-art methods.
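As referenced in the list above, the sketch below illustrates the kind of lock-free, sparse asynchronous SGD update these bullets describe, in a HOGWILD!-style form. It is not PROXASAGA itself; the function name async_sparse_sgd and its parameters are illustrative, and the logistic loss is chosen only for concreteness. Each worker samples an example and writes, without locks, only to the coordinates in that example's support.

```python
import numpy as np
from threading import Thread

def async_sparse_sgd(X, y, lr=0.1, n_threads=4, steps_per_thread=10000, seed=0):
    """HOGWILD!-style sketch: workers share w without locks and each update
    touches only the coordinates in the support of the sampled example."""
    n, d = X.shape
    w = np.zeros(d)                               # shared parameter vector

    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(steps_per_thread):
            i = rng.integers(n)
            x_i, y_i = X[i], y[i]
            s = np.nonzero(x_i)[0]                # sparse support of example i
            g = -y_i * x_i[s] / (1.0 + np.exp(y_i * (x_i @ w)))
            w[s] -= lr * g                        # lock-free, sparse write

    threads = [Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```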
DISADVANTAGE :
• These lock-free, asynchronous algorithms exhibit speedups even when applied to large, non-convex problems, as demonstrated by deep learning systems such as Google’s Downpour SGD and Microsoft’s Project Adam.
• Because of these gains, CYCLADES can actually outperform HOGWILD! in practice on sufficiently sparse problems, despite appearing to require more computational overhead.
• This is in contrast to recent related studies, where the authors provide speedup guarantees via a convergence-to-optimal proof for an asynchronous SGD on a nonconvex problem.
• In particular, we test CYCLADES on a classification problem for text-based data.
PROPOSED SYSTEM :
• We introduce PROXASAGA, a fully asynchronous sparse method inspired by SAGA and its asynchronous parallel variant ASAGA. The proposed algorithm is easy to implement and significantly outperforms the state of the art on several non-smooth, large-scale problems (an illustrative sketch of a sparse proximal step in this spirit is given after this list).
• Building on the recently proposed “perturbed iterate” framework, we have proposed a novel perspective to clarify an important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms.
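As referenced above, the following is a minimal serial sketch of the kind of sparse proximal SAGA step that PROXASAGA builds on. The function name sparse_prox_saga_step, the inverse-probability re-weighting inv_prob, and the choice of an L1 penalty are assumptions made for illustration; the actual algorithm applies such updates asynchronously and with additional care.

```python
import numpy as np

def sparse_prox_saga_step(w, X, y, alpha, alpha_bar, i, lr, lam, inv_prob):
    """One sparse proximal SAGA-style step (illustrative sketch): a SAGA
    variance-reduced gradient on the support of example i, followed by a
    blockwise soft-thresholding (L1) proximal step on those coordinates.
    Modifies w, alpha, and alpha_bar in place.

    alpha     : (n, d) table of last stored per-example gradients
    alpha_bar : running average of the rows of alpha
    """
    x_i, y_i = X[i], y[i]
    s = np.nonzero(x_i)[0]
    # fresh logistic-loss gradient restricted to the support of example i
    g = -y_i * x_i[s] / (1.0 + np.exp(y_i * (x_i @ w)))
    # SAGA estimator with the dense average re-weighted to stay sparse
    v = g - alpha[i, s] + inv_prob[s] * alpha_bar[s]
    z = w[s] - lr * v
    # blockwise proximal operator for the L1 penalty (soft-thresholding)
    w[s] = np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)
    # update the gradient table and its running average
    alpha_bar[s] += (g - alpha[i, s]) / alpha.shape[0]
    alpha[i, s] = g
```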
ADVANTAGE :
• These improved cache locality and coherence properties experimentally lead to substantial performance gains. Combining these results yields the main theorem for CYCLADES.
• Furthermore, for variance reduction techniques such as SAGA and SVRG, CYCLADES yields better accuracy and more significant speedups, with up to 5× performance gains over HOGWILD!-type implementations.
• Hence, it could be the case for an application of interest that we cannot analyze how a serial stochastic-updates (SU) algorithm performs in terms of, say, the accuracy of the solution, but CYCLADES can still provide black-box guarantees for speedup, since our analysis is completely oblivious to the qualitative performance of the serial algorithm.