Parameters as interacting particles

Long time convergence and asymptotic error scaling of neural networks

Grant M. Rotskoff, Eric Vanden Eijnden

Research output: Contribution to journal › Conference article

Abstract

The performance of neural networks on high-dimensional data distributions suggests that it may be possible to parameterize a representation of a given high-dimensional function with controllably small errors, potentially outperforming standard interpolation methods. We demonstrate, both theoretically and numerically, that this is indeed the case. We map the parameters of a neural network to a system of particles relaxing with an interaction potential determined by the loss function. We show that in the limit that the number of parameters n is large, the landscape of the mean-squared error becomes convex and the representation error in the function scales as O(n⁻¹). In this limit, we prove a dynamical variant of the universal approximation theorem showing that the optimal representation can be attained by stochastic gradient descent, the algorithm ubiquitously used for parameter optimization in machine learning. In the asymptotic regime, we study the fluctuations around the optimal representation and show that they arise at a scale O(n⁻¹). These fluctuations in the landscape identify the natural scale for the noise in stochastic gradient descent. Our results apply to both single and multi-layer neural networks, as well as standard kernel methods like radial basis functions.
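
To make the particle picture concrete, the following is a minimal, hypothetical sketch (not the authors' code) of the setting the abstract describes: a two-layer network in the mean-field scaling f_n(x) = (1/n) Σ_i c_i σ(w_i · x), whose n parameter pairs (c_i, w_i) play the role of the interacting particles, trained by online stochastic gradient descent on the mean-squared error. The target function, activation, dimensions, and step size are illustrative assumptions, not values from the paper.

import numpy as np

# Minimal sketch (assumed, not from the paper): n "particles" (c_i, w_i)
# parameterize f_n(x) = (1/n) * sum_i c_i * tanh(w_i . x).
rng = np.random.default_rng(0)
d, n = 5, 256                       # input dimension, number of particles

# Hypothetical target: a single fixed random neuron.
w_star = rng.normal(size=d)
f_target = lambda x: np.tanh(x @ w_star)

c = rng.normal(size=n)              # particle amplitudes
W = rng.normal(size=(n, d))         # particle features

def f_n(x):
    # The 1/n mean-field scaling is what produces the O(1/n) error regime.
    return (c * np.tanh(x @ W.T)).mean(axis=-1)

lr, steps, batch = 0.25, 5000, 64
for _ in range(steps):
    x = rng.normal(size=(batch, d))          # fresh samples: online SGD
    phi = np.tanh(x @ W.T)                   # (batch, n) particle features
    err = f_n(x) - f_target(x)               # (batch,) residuals
    # Gradients of 0.5 * mean(err^2); the overall 1/n is dropped so each
    # particle feels an O(1) force (equivalently, time is sped up by n).
    grad_c = err @ phi / batch
    grad_W = ((err[:, None] * c * (1.0 - phi**2)).T @ x) / batch
    c -= lr * grad_c
    W -= lr * grad_W

x_test = rng.normal(size=(2000, d))
print("test MSE:", np.mean((f_n(x_test) - f_target(x_test)) ** 2))

If the abstract's scaling holds, rerunning this sketch with larger n should drive the converged test error down roughly in proportion to 1/n.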

Original language: English (US)
Pages (from-to): 7146-7155
Number of pages: 10
Journal: Advances in Neural Information Processing Systems
Volume: 2018-December
State: Published - Jan 1 2018
Event: 32nd Conference on Neural Information Processing Systems, NeurIPS 2018 - Montreal, Canada
Duration: Dec 2 2018 - Dec 8 2018

Fingerprint

  • Neural networks
  • Multilayer neural networks
  • Learning systems
  • Interpolation

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Cite this

Parameters as interacting particles: Long time convergence and asymptotic error scaling of neural networks. / Rotskoff, Grant M.; Vanden Eijnden, Eric.

In: Advances in Neural Information Processing Systems, Vol. 2018-December, 01.01.2018, p. 7146-7155.

Research output: Contribution to journal › Conference article

@article{7d7887386f3342e187f5cc05a50c5b83,
title = "Parameters as interacting particles: Long time convergence and asymptotic error scaling of neural networks",
abstract = "The performance of neural networks on high-dimensional data distributions suggests that it may be possible to parameterize a representation of a given high-dimensional function with controllably small errors, potentially outperforming standard interpolation methods. We demonstrate, both theoretically and numerically, that this is indeed the case. We map the parameters of a neural network to a system of particles relaxing with an interaction potential determined by the loss function. We show that in the limit that the number of parameters n is large, the landscape of the mean-squared error becomes convex and the representation error in the function scales as O(n − 1 ). In this limit, we prove a dynamical variant of the universal approximation theorem showing that the optimal representation can be attained by stochastic gradient descent, the algorithm ubiquitously used for parameter optimization in machine learning. In the asymptotic regime, we study the fluctuations around the optimal representation and show that they arise at a scale O(n − 1 ). These fluctuations in the landscape identify the natural scale for the noise in stochastic gradient descent. Our results apply to both single and multi-layer neural networks, as well as standard kernel methods like radial basis functions.",
author = "Rotskoff, {Grant M.} and {Vanden Eijnden}, Eric",
year = "2018",
month = "1",
day = "1",
language = "English (US)",
volume = "2018-December",
pages = "7146--7155",
journal = "Advances in Neural Information Processing Systems",
issn = "1049-5258",

}

TY - JOUR

T1 - Parameters as interacting particles

T2 - Long time convergence and asymptotic error scaling of neural networks

AU - Rotskoff, Grant M.

AU - Vanden Eijnden, Eric

PY - 2018/1/1

Y1 - 2018/1/1

N2 - The performance of neural networks on high-dimensional data distributions suggests that it may be possible to parameterize a representation of a given high-dimensional function with controllably small errors, potentially outperforming standard interpolation methods. We demonstrate, both theoretically and numerically, that this is indeed the case. We map the parameters of a neural network to a system of particles relaxing with an interaction potential determined by the loss function. We show that in the limit that the number of parameters n is large, the landscape of the mean-squared error becomes convex and the representation error in the function scales as O(n⁻¹). In this limit, we prove a dynamical variant of the universal approximation theorem showing that the optimal representation can be attained by stochastic gradient descent, the algorithm ubiquitously used for parameter optimization in machine learning. In the asymptotic regime, we study the fluctuations around the optimal representation and show that they arise at a scale O(n⁻¹). These fluctuations in the landscape identify the natural scale for the noise in stochastic gradient descent. Our results apply to both single and multi-layer neural networks, as well as standard kernel methods like radial basis functions.

AB - The performance of neural networks on high-dimensional data distributions suggests that it may be possible to parameterize a representation of a given high-dimensional function with controllably small errors, potentially outperforming standard interpolation methods. We demonstrate, both theoretically and numerically, that this is indeed the case. We map the parameters of a neural network to a system of particles relaxing with an interaction potential determined by the loss function. We show that in the limit that the number of parameters n is large, the landscape of the mean-squared error becomes convex and the representation error in the function scales as O(n⁻¹). In this limit, we prove a dynamical variant of the universal approximation theorem showing that the optimal representation can be attained by stochastic gradient descent, the algorithm ubiquitously used for parameter optimization in machine learning. In the asymptotic regime, we study the fluctuations around the optimal representation and show that they arise at a scale O(n⁻¹). These fluctuations in the landscape identify the natural scale for the noise in stochastic gradient descent. Our results apply to both single and multi-layer neural networks, as well as standard kernel methods like radial basis functions.

UR - http://www.scopus.com/inward/record.url?scp=85064832054&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85064832054&partnerID=8YFLogxK

M3 - Conference article

VL - 2018-December

SP - 7146

EP - 7155

JO - Advances in Neural Information Processing Systems

JF - Advances in Neural Information Processing Systems

SN - 1049-5258

ER -