Deep neural networks (DNNs) have shown impressive empirical performance, but they remain black-box models of data. This is a significant barrier for practitioners, whose choices often rely on trial and error, and it raises many fundamental questions for theorists about why, and under which circumstances, these models can be expected to perform well.
The learning process of a DNN has two major ingredients: a loss function to be minimized, and an optimization algorithm used to search for an optimum in the landscape formed by the values the loss takes for each configuration of the model parameters. Understanding the loss landscape, and how the dynamics unfold on it, is therefore a fundamental problem whose solution would have a significant impact on machine learning.
We build upon existing connections between machine learning and statistical physics to unravel the interplay between landscape and dynamics in three contexts: (a) off-equilibrium, (b) equilibrium and steady state, and (c) criticality, i.e. emergent collective behavior arising from the competition between energy and entropy. From a practitioner’s point of view, these three aspects will provide valuable insight into (a) the learning process, (b) the preconditioning of models, and (c) hyperparameter bounds for learning. The approach we propose uses methods and ideas from the statistical mechanics of disordered systems, and will provide a new bridge between machine learning and physics.
We will carry out an extensive analysis of the roles of the dynamics and of the loss landscape. More concretely, in each of the contexts described above, we will compare the behavior of Langevin dynamics with that of Stochastic Gradient Descent (SGD), and the landscape of DNNs with that of a paradigmatic spin glass model. We expect the following broad outcomes:
- A systematic comparison between Langevin and SGD dynamics.
- A systematic comparison between DNN models and some related complex systems.
- Understanding how the noise in the dynamics affects the effective landscape that is visited, and how this can give rise to emergent collective behaviors of the parameters.
- Using the current understanding of SGD to avoid Hessian calculations in second-order algorithms for the optimization of complex potential energy functionals.
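
To make the Langevin versus SGD comparison concrete, the following is a minimal sketch on a toy problem, not the setup used in this project. It assumes a one-dimensional quadratic loss L(w) = mean_i 0.5 * (w - x_i)^2, an Euler-type discretization of overdamped Langevin dynamics with a hypothetical `temperature` parameter, and mini-batch SGD without replacement. All function and parameter names are illustrative.

```python
import math
import random


def gradient(w, samples):
    """Gradient of L(w) = mean_i 0.5 * (w - x_i)^2 over the given samples."""
    return sum(w - x for x in samples) / len(samples)


def langevin_step(w, data, lr, temperature, rng):
    """Discretized overdamped Langevin update (assumed form):
    full-batch gradient plus Gaussian noise of variance 2 * lr * temperature."""
    noise = rng.gauss(0.0, 1.0)
    return w - lr * gradient(w, data) + math.sqrt(2.0 * lr * temperature) * noise


def sgd_step(w, data, lr, batch_size, rng):
    """SGD update: the only noise source is the random choice of mini-batch."""
    batch = rng.sample(data, batch_size)
    return w - lr * gradient(w, batch)


def run(step, w0, n_steps, **kwargs):
    """Iterate a step function and return the whole trajectory, including w0."""
    trajectory = [w0]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1], **kwargs))
    return trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(1.0, 0.5) for _ in range(200)]  # synthetic toy data
    lang = run(langevin_step, 5.0, 2000, data=data, lr=0.05, temperature=0.01, rng=rng)
    sgd = run(sgd_step, 5.0, 2000, data=data, lr=0.05, batch_size=10, rng=rng)
    print("Langevin final w:", lang[-1])
    print("SGD final w:", sgd[-1])
```

In this sketch the Langevin noise is isotropic and set by `temperature`, whereas the SGD noise depends on the data and on `batch_size`. With `temperature=0`, or with `batch_size` equal to the dataset size, both updates reduce to plain gradient descent.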