Low Latency RNN Inference with Cellular Batching

Pin Gao, Yongwei Wu, Lingfan Yu, Jinyang Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Performing inference on pre-trained neural network models must meet low-latency requirements, which are often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, but this approach does not perform well when serving Recurrent Neural Networks with dynamic dataflow graphs. We propose the technique of cellular batching, which improves both the latency and throughput of RNN inference. Unlike existing systems that batch a fixed set of dataflow graphs, cellular batching makes batching decisions at the granularity of an RNN “cell” (a subgraph with shared weights) and dynamically assembles a batched cell for execution as requests join and leave the system. We implemented our approach in a system called BatchMaker. Experiments show that BatchMaker achieves substantially lower latency and higher throughput than existing systems.
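The core idea in the abstract, rebatching at every cell step so that requests can join and leave a batch mid-sequence, can be illustrated with a short sketch. The Python below is a hypothetical illustration of that scheduling loop, not BatchMaker's implementation: the names Request, run_cell, and serve are invented for this example, and the batched tensor execution of the shared cell is reduced to a placeholder.

    from collections import deque

    class Request:
        """One inference request: a sequence of inputs fed through the shared RNN cell."""
        def __init__(self, req_id, tokens):
            self.req_id = req_id
            self.tokens = deque(tokens)  # inputs not yet consumed
            self.state = None            # recurrent state carried across cell steps

    def run_cell(batch):
        # Placeholder for one batched execution of the shared cell. A real system
        # would stack each request's input and state into tensors and invoke the
        # cell kernel once for the whole batch.
        for req in batch:
            req.state = (req.state, req.tokens.popleft())

    def serve(incoming, max_batch=4):
        # Cellular batching loop: a fresh batch is assembled before every cell
        # step, so arriving requests join immediately and finished requests leave
        # without waiting for a fixed graph-level batch to drain.
        active = deque()
        while incoming or active:
            while incoming:                        # new requests join the system
                active.append(incoming.popleft())
            batch = [active.popleft() for _ in range(min(max_batch, len(active)))]
            run_cell(batch)                        # one batched cell step
            for req in batch:
                if req.tokens:
                    active.append(req)             # more cell steps remain
                else:
                    print(f"request {req.req_id} finished")  # leaves the system

    serve(deque([Request(0, "abcd"), Request(1, "ab")]))

Because batching happens per cell step rather than per dataflow graph, the short request above (request 1) exits after two steps instead of idling until the longest sequence in its batch completes, which is the source of the latency improvement the abstract claims.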

Original language: English (US)
Title of host publication: Proceedings of the 13th EuroSys Conference, EuroSys 2018
Publisher: Association for Computing Machinery, Inc
Volume: 2018-January
ISBN (Electronic): 9781450355841
DOIs: https://doi.org/10.1145/3190508.3190541
State: Published - Apr 23 2018
Event: 13th EuroSys Conference, EuroSys 2018 - Porto, Portugal
Duration: Apr 23 2018 - Apr 26 2018

Other

Other: 13th EuroSys Conference, EuroSys 2018
Country: Portugal
City: Porto
Period: 4/23/18 - 4/26/18

Fingerprint

  • Throughput
  • Recurrent neural networks
  • Learning systems
  • Neural networks
  • Experiments

Keywords

  • Batching
  • Dataflow Graph
  • Inference
  • Recurrent Neural Network

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Gao, P., Wu, Y., Yu, L., & Li, J. (2018). Low Latency RNN Inference with Cellular Batching. In Proceedings of the 13th EuroSys Conference, EuroSys 2018 (Vol. 2018-January). Association for Computing Machinery, Inc. https://doi.org/10.1145/3190508.3190541

Low Latency RNN Inference with Cellular Batching. / Gao, Pin; Wu, Yongwei; Yu, Lingfan; Li, Jinyang.

Proceedings of the 13th EuroSys Conference, EuroSys 2018. Vol. 2018-January, Association for Computing Machinery, Inc, 2018.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Gao, P, Wu, Y, Yu, L & Li, J 2018, Low Latency RNN Inference with Cellular Batching. in Proceedings of the 13th EuroSys Conference, EuroSys 2018. vol. 2018-January, Association for Computing Machinery, Inc, 13th EuroSys Conference, EuroSys 2018, Porto, Portugal, 4/23/18. https://doi.org/10.1145/3190508.3190541
Gao P, Wu Y, Yu L, Li J. Low Latency RNN Inference with Cellular Batching. In Proceedings of the 13th EuroSys Conference, EuroSys 2018. Vol. 2018-January. Association for Computing Machinery, Inc. 2018 https://doi.org/10.1145/3190508.3190541
Gao, Pin ; Wu, Yongwei ; Yu, Lingfan ; Li, Jinyang. / Low Latency RNN Inference with Cellular Batching. Proceedings of the 13th EuroSys Conference, EuroSys 2018. Vol. 2018-January, Association for Computing Machinery, Inc, 2018.
@inproceedings{fc412e5ac9bd49df8416cf42caf7f339,
title = "Low Latency RNN Inference with Cellular Batching",
abstract = "Performing inference on pre-trained neural network models must meet low-latency requirements, which are often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, but this approach does not perform well when serving Recurrent Neural Networks with dynamic dataflow graphs. We propose the technique of cellular batching, which improves both the latency and throughput of RNN inference. Unlike existing systems that batch a fixed set of dataflow graphs, cellular batching makes batching decisions at the granularity of an RNN “cell” (a subgraph with shared weights) and dynamically assembles a batched cell for execution as requests join and leave the system. We implemented our approach in a system called BatchMaker. Experiments show that BatchMaker achieves substantially lower latency and higher throughput than existing systems.",
keywords = "Batching, Dataflow Graph, Inference, Recurrent Neural Network",
author = "Pin Gao and Yongwei Wu and Lingfan Yu and Jinyang Li",
year = "2018",
month = "4",
day = "23",
doi = "10.1145/3190508.3190541",
language = "English (US)",
volume = "2018-January",
booktitle = "Proceedings of the 13th EuroSys Conference, EuroSys 2018",
publisher = "Association for Computing Machinery, Inc",

}

TY - GEN

T1 - Low Latency RNN Inference with Cellular Batching

AU - Gao, Pin

AU - Wu, Yongwei

AU - Yu, Lingfan

AU - Li, Jinyang

PY - 2018/4/23

Y1 - 2018/4/23

N2 - Performing inference on pre-trained neural network models must meet low-latency requirements, which are often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, but this approach does not perform well when serving Recurrent Neural Networks with dynamic dataflow graphs. We propose the technique of cellular batching, which improves both the latency and throughput of RNN inference. Unlike existing systems that batch a fixed set of dataflow graphs, cellular batching makes batching decisions at the granularity of an RNN “cell” (a subgraph with shared weights) and dynamically assembles a batched cell for execution as requests join and leave the system. We implemented our approach in a system called BatchMaker. Experiments show that BatchMaker achieves substantially lower latency and higher throughput than existing systems.

AB - Performing inference on pre-trained neural network models must meet low-latency requirements, which are often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, but this approach does not perform well when serving Recurrent Neural Networks with dynamic dataflow graphs. We propose the technique of cellular batching, which improves both the latency and throughput of RNN inference. Unlike existing systems that batch a fixed set of dataflow graphs, cellular batching makes batching decisions at the granularity of an RNN “cell” (a subgraph with shared weights) and dynamically assembles a batched cell for execution as requests join and leave the system. We implemented our approach in a system called BatchMaker. Experiments show that BatchMaker achieves substantially lower latency and higher throughput than existing systems.

KW - Batching

KW - Dataflow Graph

KW - Inference

KW - Recurrent Neural Network

UR - http://www.scopus.com/inward/record.url?scp=85052014907&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85052014907&partnerID=8YFLogxK

U2 - 10.1145/3190508.3190541

DO - 10.1145/3190508.3190541

M3 - Conference contribution

AN - SCOPUS:85052014907

VL - 2018-January

BT - Proceedings of the 13th EuroSys Conference, EuroSys 2018

PB - Association for Computing Machinery, Inc

ER -