# A mean-field limit for certain deep neural networks

@article{Araujo2019AML,
  title   = {A mean-field limit for certain deep neural networks},
  author  = {Dyego Ara{\'u}jo and Roberto Imbuzeiro Oliveira and Daniel Yukimura},
  journal = {arXiv: Statistics Theory},
  year    = {2019}
}

Understanding deep neural networks (DNNs) is a key challenge in the theory of machine learning, with potential applications to the many fields where DNNs have been successfully used. This article presents a scaling limit for a DNN being trained by stochastic gradient descent. Our networks have a fixed (but arbitrary) number $L\geq 2$ of inner layers; $N\gg 1$ neurons per layer; full connections between layers; and fixed weights (or "random features" that are not trained) near the input and…
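The setup in the abstract can be sketched concretely. The following is a minimal illustration under our own assumptions (tanh activations, squared loss, single-sample SGD; names and scalings are illustrative, and the paper's precise setup may differ): only the inner fully connected layers are trained, while the weights adjacent to the input and output are frozen "random features".

```python
import numpy as np

# Sketch of the architecture described above (our assumptions, not the
# paper's code): L inner layers of width N, fully connected, with fixed
# random-feature weights next to the input and output.
rng = np.random.default_rng(0)
d, N, L = 3, 64, 2                              # input dim, width, inner layers

W_in = rng.normal(size=(N, d)) / np.sqrt(d)     # fixed, never trained
W = [rng.normal(size=(N, N)) / np.sqrt(N) for _ in range(L - 1)]  # trained
w_out = rng.normal(size=N) / N                  # fixed, never trained

def sgd_step(x, y, lr=0.1):
    """One SGD step on 0.5*(f(x)-y)^2, updating only the inner layers W."""
    hs = [np.tanh(W_in @ x)]                    # hidden states, layer by layer
    for Wl in W:
        hs.append(np.tanh(Wl @ hs[-1]))
    err = w_out @ hs[-1] - y
    delta = err * w_out * (1 - hs[-1] ** 2)     # grad w.r.t. last pre-activation
    for l in reversed(range(L - 1)):            # backprop through inner layers
        grad = np.outer(delta, hs[l])
        delta = (W[l].T @ delta) * (1 - hs[l] ** 2)
        W[l] -= lr * grad
    return 0.5 * err ** 2
```

Repeated calls on a sample drive the loss down; the mean-field question is what this training dynamic converges to as $N \to \infty$ with $L$ fixed.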

#### 42 Citations

Over-parameterized Deep Neural Networks

- 2020

This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs), which can be used to analyze neural network training. In this framework, a DNN is represented by…

Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks

- Computer Science, Mathematics
- COLT
- 2021

This analysis leads to the first global convergence proof for over-parameterized neural network training with more than $3$ layers in the mean-field regime, and leads to a simpler representation of DNNs, for which the training objective can be reformulated as a convex optimization problem via a suitable re-parameterization.

Mean Field Analysis of Deep Neural Networks

- Mathematics
- 2019

We analyze multi-layer neural networks in the asymptotic regime of simultaneously (A) large network sizes and (B) large numbers of stochastic gradient descent training iterations. We rigorously…

Predicting the outputs of finite deep neural networks trained with noisy gradients

- Mathematics, Physics
- Physical Review E
- 2021

This work considers a DNN training protocol involving noise, weight decay, and finite width, whose outcome corresponds to a certain non-Gaussian stochastic process; the deviation of this process from a Gaussian process (GP) is controlled by the finite width.

A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks

- Computer Science, Physics
- arXiv
- 2020

A mathematically rigorous framework for multilayer neural networks in the mean-field regime, built on the new idea of a non-evolving probability space that allows neural networks of arbitrary widths to be embedded, together with a proof of a global convergence guarantee for two-layer and three-layer networks.

Global Convergence of Three-layer Neural Networks in the Mean Field Regime

- Computer Science, Physics
- ICLR
- 2021

This work develops a rigorous framework to establish the mean-field limit of three-layer networks under stochastic gradient descent training and proposes the idea of a neuronal embedding, which comprises a fixed probability space that encapsulates neural networks of arbitrary sizes.

An analytic theory of shallow networks dynamics for hinge loss classification

- Computer Science, Mathematics
- NeurIPS
- 2020

This paper studies in detail the training dynamics of a simple type of neural network: a single hidden layer trained to perform a classification task. It shows that, in a suitable mean-field limit, this case maps to a single-node learning problem with a time-dependent dataset determined self-consistently from the average node population.

Mathematical Models of Overparameterized Neural Networks

- Computer Science, Mathematics
- Proceedings of the IEEE
- 2021

The analysis focuses on two-layer NNs; the key mathematical models, with their algorithmic implications, are explained, and the challenges in understanding deep NNs are discussed.

Dynamics of Deep Neural Networks and Neural Tangent Hierarchy

- Computer Science, Mathematics
- ICML
- 2020

An infinite hierarchy of ordinary differential equations, the neural tangent hierarchy (NTH), is derived, which captures the gradient descent dynamics of the deep neural network, and it is proved that the truncated hierarchy of the NTH approximates the dynamics of the NTK up to arbitrary precision.
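For orientation, the object this hierarchy tracks is the neural tangent kernel, whose standard definition (our notation; the cited paper's conventions may differ) is:

```latex
% Neural tangent kernel of a network f(x; \theta) at training time t:
\Theta_t(x, x') \;=\; \big\langle \nabla_\theta f(x; \theta_t),\, \nabla_\theta f(x'; \theta_t) \big\rangle .
% Under gradient flow on the training loss \sum_i \ell(f(x_i), y_i),
% the network outputs evolve as
\frac{d}{dt} f(x; \theta_t) \;=\; -\sum_i \Theta_t(x, x_i)\, \partial_f \ell\big(f(x_i; \theta_t), y_i\big),
% and d\Theta_t/dt in turn depends on a higher-order tensor, whose evolution
% depends on the next one, and so on -- the hierarchy truncated in this work.
```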

Feature Learning in Infinite-Width Neural Networks

- Computer Science, Physics
- arXiv
- 2020

It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT, and that any such infinite-width limit can be computed using the Tensor Programs technique.

#### References

Showing 1–10 of 34 references

Mean Field Analysis of Deep Neural Networks

- Mathematics
- 2019

We analyze multi-layer neural networks in the asymptotic regime of simultaneously (A) large network sizes and (B) large numbers of stochastic gradient descent training iterations. We rigorously…

Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

- Mathematics, Physics
- COLT
- 2019

This paper shows that the number of hidden units only needs to be larger than a quantity depending on the regularity properties of the data, and independent of the dimension, and generalizes this analysis to the case of unbounded activation functions.

Mean Field Analysis of Neural Networks

- Mathematics
- 2018

Machine learning, and in particular neural network models, have revolutionized fields such as image, text, and speech recognition. Today, many important real-world applications in these areas are…

A mean field view of the landscape of two-layer neural networks

- Computer Science, Mathematics
- Proceedings of the National Academy of Sciences
- 2018

A compact description of the SGD dynamics is derived in terms of a limiting partial differential equation that allows for “averaging out” some of the complexities of the landscape of neural networks and can be used to prove a general convergence result for noisy SGD.
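The limiting PDE in question is a distributional dynamics for the empirical measure of the neurons; schematically (our notation, suppressing the paper's time-rescaling factors), it takes the form:

```latex
% Two-layer network f(x) = (1/N) \sum_i \sigma_*(x; \theta_i); as N \to \infty,
% SGD moves the neurons' empirical measure \rho_t along a gradient flow:
\partial_t \rho_t \;=\; \nabla_\theta \cdot \big( \rho_t \, \nabla_\theta \Psi(\theta; \rho_t) \big),
\qquad
\Psi(\theta; \rho) \;=\; V(\theta) + \int U(\theta, \bar{\theta}) \, \rho(\mathrm{d}\bar{\theta}),
% where V and U are determined by the data distribution and the activation.
```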

Scaling description of generalization with number of parameters in deep learning

- Mathematics, Computer Science
- arXiv
- 2019

This work relies on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations that affect the generalization error of neural networks.

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

- Computer Science, Mathematics
- ICLR
- 2019

Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.

Neural Networks as Interacting Particle Systems: Asymptotic Convexity of the Loss Landscape and Universal Scaling of the Approximation Error

- Computer Science, Mathematics
- arXiv
- 2018

A Law of Large Numbers and a Central Limit Theorem for the empirical distribution are established, which together show that the approximation error of the network universally scales as $O(n^{-1})$, and the scale and nature of the noise introduced by stochastic gradient descent are quantified.

On Lazy Training in Differentiable Programming

- Computer Science
- NeurIPS
- 2019

This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.

Mean Field Limit of the Learning Dynamics of Multilayer Neural Networks

- Computer Science, Physics
- arXiv
- 2019

This work uncovers a phenomenon in which the behavior of these complex networks -- under suitable scalings and stochastic gradient descent dynamics -- becomes independent of the number of neurons as this number grows sufficiently large.

Understanding deep learning requires rethinking generalization

- Computer Science
- ICLR
- 2017

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth-two neural networks already have perfect finite sample expressivity.