Theoretical Machine Learning Seminar
On the Dynamics of Gradient Descent for Training Deep Neural Networks
Deep learning builds upon the mysterious ability of gradient-based optimization methods to solve related non-convex problems. However, a complete theoretical understanding is missing even in the simpler setting of training a deep linear neural network, i.e., a composition of linear transformations. We analyze the speed of convergence to global optimum for gradient descent when training a deep linear neural network by minimizing the L2 loss over whitened data. Convergence at a linear rate is guaranteed when the following conditions hold: (i) the dimensions of the hidden layers are at least the minimum of the input and output dimensions; (ii) the weight matrices at initialization are approximately balanced; and (iii) the initial loss is smaller than the loss of any rank-deficient solution. Moreover, in the important case of output dimension 1, i.e., scalar regression, these conditions are met, and hence convergence to global optimum occurs, with constant probability under a random initialization scheme. Our results significantly extend previous analyses, e.g., of deep linear residual networks (Bartlett et al., 2018).
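To make the setting concrete, here is a minimal NumPy sketch of the regime described above (not the authors' code): a depth-3 linear network trained by gradient descent on the L2 loss over whitened data, starting from an exactly balanced initialization. The widths, learning rate, and step count are illustrative assumptions, and the initial end-to-end matrix is chosen as a small positive multiple of the teacher so that condition (iii) holds by construction for the demo; the printed loss should then decay roughly geometrically, i.e., at a linear rate.

import numpy as np

rng = np.random.default_rng(0)
N = 3                                    # depth (number of weight matrices)
d_in, d_h, d_out, m = 10, 10, 1, 500     # hidden width >= min(d_in, d_out): condition (i)

# Whitened data: the empirical second-moment matrix of the inputs is the identity.
X = rng.standard_normal((d_in, m))
L = np.linalg.cholesky(X @ X.T / m)
X = np.linalg.solve(L, X)
w_star = rng.standard_normal((d_out, d_in))   # linear "teacher" generating the targets
Y = w_star @ X

# Exactly balanced initialization (condition (ii)) built from the SVD of a rank-1
# end-to-end matrix A0, so that W_{j+1}^T W_{j+1} = W_j W_j^T for every j.
# Taking A0 to be a small positive multiple of the teacher keeps the initial loss
# below the loss of the zero (rank-deficient) map, i.e. condition (iii); this is
# a convenience for the demo only.
A0 = 0.3 * w_star / np.linalg.norm(w_star)
u, s, vT = np.linalg.svd(A0, full_matrices=False)
c = s[0] ** (1.0 / N)
e1 = np.zeros((d_h, 1))
e1[0, 0] = 1.0
W = [c * e1 @ vT,        # W_1: d_h x d_in
     c * e1 @ e1.T,      # W_2: d_h x d_h
     c * u @ e1.T]       # W_3: d_out x d_h

def product(Ws):
    # End-to-end matrix W_N ... W_1 of the list [W_1, ..., W_N].
    P = Ws[0]
    for Wj in Ws[1:]:
        P = Wj @ P
    return P

def loss(Ws):
    E = product(Ws) @ X - Y
    return 0.5 * np.mean(np.sum(E ** 2, axis=0))

lr = 0.05
for step in range(201):
    P = product(W)
    G = (P @ X - Y) @ X.T / m            # gradient of the loss w.r.t. the end-to-end matrix
    grads = []
    for j in range(N):                   # chain rule through the matrix product
        left = np.eye(d_out) if j == N - 1 else product(W[j + 1:])
        right = np.eye(d_in) if j == 0 else product(W[:j])
        grads.append(left.T @ G @ right.T)
    W = [Wj - lr * g for Wj, g in zip(W, grads)]
    if step % 25 == 0:
        print(f"step {step:3d}   loss {loss(W):.3e}")   # roughly geometric decay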
Key to our analysis is an alignment property between singular vectors of weight matrices in two consecutive layers of a linear network, which holds throughout the gradient flow trajectory. We show that a similar (albeit weaker) invariance property also holds for deep nonlinear neural networks with ReLU activation, which implies that gradient flow automatically balances the magnitudes of different layers.
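The balancing effect in the ReLU case can be illustrated numerically. Below is a small sketch under assumed toy data and hyperparameters (gradient flow approximated by gradient descent with a small step size): for a two-layer ReLU network trained on the squared loss, the per-layer squared Frobenius norms change during training, yet their difference stays nearly constant, reflecting the weaker invariance mentioned above.

import numpy as np

rng = np.random.default_rng(1)
d_in, d_h, m = 5, 20, 100
X = rng.standard_normal((d_in, m))
y = rng.standard_normal((1, m))

W1 = 0.5 * rng.standard_normal((d_h, d_in))
W2 = 0.5 * rng.standard_normal((1, d_h))

lr = 1e-3                                 # small step size, approximating gradient flow
for step in range(2001):
    H = np.maximum(W1 @ X, 0.0)           # ReLU hidden activations
    E = (W2 @ H - y) / m                  # scaled residual of the squared loss
    gW2 = E @ H.T
    gW1 = (W2.T @ E) * (H > 0) @ X.T      # backprop through the ReLU mask
    W1 -= lr * gW1
    W2 -= lr * gW2
    if step % 500 == 0:
        n1, n2 = np.sum(W1 ** 2), np.sum(W2 ** 2)
        print(f"step {step:4d}  ||W1||^2 = {n1:8.4f}  ||W2||^2 = {n2:8.4f}  "
              f"diff = {n1 - n2:8.4f}")   # the difference barely moves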
Joint work with Sanjeev Arora, Nadav Cohen, Noah Golowich, Simon S. Du, and Jason D. Lee.