Deep learning with elastic averaging SGD

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers) is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e., it allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to improved performance. We propose synchronous and asynchronous variants of the new algorithm. We provide a stability analysis of the asynchronous variant in the round-robin scheme and compare it with the more common parallelized method ADMM. We show that the stability of EASGD is guaranteed when a simple stability condition is satisfied, which is not the case for ADMM. We additionally propose a momentum-based version of our algorithm that can be applied in both synchronous and asynchronous settings. The asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and, furthermore, is very communication-efficient.
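
The abstract describes the elastic-averaging update only at a high level; the sketch below is a minimal single-process simulation of the synchronous variant on a toy quadratic objective. The hyperparameter names and values (eta, rho, alpha, the communication period tau, the number of workers p) and the toy loss are illustrative assumptions, not the paper's experimental configuration.

    # Minimal sketch of the synchronous elastic-averaging update described in the
    # abstract. The quadratic toy loss and the hyperparameter values are
    # illustrative assumptions, not the paper's experimental setup.
    import numpy as np

    rng = np.random.default_rng(0)

    def stochastic_grad(x):
        """Noisy gradient of a toy quadratic loss f(x) = 0.5 * ||x||^2."""
        return x + 0.1 * rng.standard_normal(x.shape)

    p, dim = 4, 10        # number of local workers, parameter dimension
    eta, rho = 0.05, 1.0  # SGD learning rate, elastic penalty strength
    alpha = eta * rho     # elastic coupling step shared by workers and master
    tau = 5               # communication period: sync with the master every tau steps

    center = np.zeros(dim)                                  # center variable (master)
    workers = [rng.standard_normal(dim) for _ in range(p)]  # local variables

    for t in range(1, 501):
        # Local SGD steps: each worker explores independently between communications.
        for i in range(p):
            workers[i] = workers[i] - eta * stochastic_grad(workers[i])
        if t % tau == 0:
            # Elastic force: pull each worker toward the center and, symmetrically,
            # move the center toward the average of the workers.
            diffs = [w - center for w in workers]
            for i in range(p):
                workers[i] = workers[i] - alpha * diffs[i]
            center = center + alpha * sum(diffs)

    print("||center|| after training:", np.linalg.norm(center))

Increasing tau reduces communication and lets the local variables drift further from the center variable, which is the exploration effect the abstract refers to; the elastic step then pulls the workers and the center toward each other symmetrically.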

Original language: English (US)
Title of host publication: Advances in Neural Information Processing Systems
Publisher: Neural information processing systems foundation
Pages: 685-693
Number of pages: 9
Volume: 2015-January
State: Published - 2015
Event: 29th Annual Conference on Neural Information Processing Systems, NIPS 2015 - Montreal, Canada
Duration: Dec 7, 2015 - Dec 12, 2015

Other

Other: 29th Annual Conference on Neural Information Processing Systems, NIPS 2015
Country: Canada
City: Montreal
Period: 12/7/15 - 12/12/15

Fingerprint

  • Communication
  • Image classification
  • Parallel processing systems
  • Deep learning
  • Momentum
  • Servers
  • Neural networks
  • Experiments

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Cite this

Zhang, S., Choromanska, A., & LeCun, Y. (2015). Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (Vol. 2015-January, pp. 685-693). Neural information processing systems foundation.

@inproceedings{68bfb11603ee4d489695364818cc1a28,
title = "Deep learning with elastic averaging SGD",
abstract = "We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers) is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e., it allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to improved performance. We propose synchronous and asynchronous variants of the new algorithm. We provide a stability analysis of the asynchronous variant in the round-robin scheme and compare it with the more common parallelized method ADMM. We show that the stability of EASGD is guaranteed when a simple stability condition is satisfied, which is not the case for ADMM. We additionally propose a momentum-based version of our algorithm that can be applied in both synchronous and asynchronous settings. The asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and, furthermore, is very communication-efficient.",
author = "Sixin Zhang and Anna Choromanska and Yann LeCun",
year = "2015",
language = "English (US)",
volume = "2015-January",
pages = "685--693",
booktitle = "Advances in Neural Information Processing Systems",
publisher = "Neural information processing systems foundation",

}
