A massively parallel adaptive fast-multipole method on heterogeneous architectures

Ilya Lashuk, Aparna Chandramowlishwaran, Harper Langston, Tuan Anh Nguyen, Rahul Sampath, Aashay Shringarpure, Richard Vuduc, Lexing Ying, Denis Zorin, George Biros

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al., ACM/IEEE SC '03), in which we employ both distributed-memory parallelism (via MPI) and shared-memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/Cray-based Kraken system at NSF/NICS) for highly non-uniform point distributions. On GPU-enabled systems, we achieve a 30x speedup for problems of up to 256 million points on 256 GPUs (Lincoln at NSF/NCSA) over a comparable CPU-only implementation. We achieve scalability to such extreme core counts by adopting a new approach to scalable MPI-based tree construction and partitioning, and a new reduction algorithm for the evaluation phase. For the sub-components of the evaluation phase (the direct and approximate interactions, the target evaluation, and the source-to-multipole translations), we use NVIDIA's CUDA framework for GPU acceleration to achieve excellent performance. Doing so requires carefully constructed data structure transformations, which we describe in the paper and whose cost we show is minor. Taken together, these components show promise for ultrascalable FMM in the petascale era and beyond.
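For orientation, the "direct interactions" named in the abstract are plain pairwise sums of a two-body kernel, and the same sum over all points is the quadratic-cost baseline that a fast multipole method approximates. The following is a minimal reference sketch of that sum, not the authors' MPI/CUDA implementation. The 3D Laplace kernel is chosen here only as one common example of a non-oscillatory kernel, and the names laplace_kernel and direct_potentials are hypothetical.

```python
import math


def laplace_kernel(x, y):
    """Free-space 3D Laplace kernel 1 / (4*pi*|x - y|).

    Assumption for illustration: coincident points contribute zero,
    which skips the singular self-interaction.
    """
    r = math.dist(x, y)
    return 0.0 if r == 0.0 else 1.0 / (4.0 * math.pi * r)


def direct_potentials(targets, sources, densities, kernel=laplace_kernel):
    """Reference O(N*M) evaluation of u(x_i) = sum_j K(x_i, y_j) * q_j."""
    return [
        sum(kernel(x, y) * q for y, q in zip(sources, densities))
        for x in targets
    ]


if __name__ == "__main__":
    # Two opposite unit charges; the first target is their midpoint.
    sources = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    densities = [1.0, -1.0]
    targets = [(0.5, 0.0, 0.0), (0.0, 2.0, 0.0)]
    print(direct_potentials(targets, sources, densities))
```

A sum of this form is useful mainly as an accuracy check on small inputs, since its cost grows with the product of the number of targets and sources.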

Original language: English (US)
Title of host publication: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09
DOI: https://doi.org/10.1145/1654059.1654118
State: Published - 2009
Event: Conference on High Performance Computing Networking, Storage and Analysis, SC '09 - Portland, OR, United States
Duration: Nov 14 2009 to Nov 20 2009

Other

Other: Conference on High Performance Computing Networking, Storage and Analysis, SC '09
Country: United States
City: Portland, OR
Period: 11/14/09 to 11/20/09

Fingerprint

Program processors
Data storage equipment
Data structures
Scalability
Graphics processing unit
Costs

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications

Cite this

Lashuk, I., Chandramowlishwaran, A., Langston, H., Nguyen, T. A., Sampath, R., Shringarpure, A., ... Biros, G. (2009). A massively parallel adaptive fast-multipole method on heterogeneous architectures. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09 [1654118] https://doi.org/10.1145/1654059.1654118

@inproceedings{ed4dcc0e127447539bf5105a5517b9a0,
title = "A massively parallel adaptive fast-multipole method on heterogeneous architectures",
author = "Ilya Lashuk and Aparna Chandramowlishwaran and Harper Langston and Nguyen, {Tuan Anh} and Rahul Sampath and Aashay Shringarpure and Richard Vuduc and Lexing Ying and Denis Zorin and George Biros",
year = "2009",
doi = "10.1145/1654059.1654118",
language = "English (US)",
isbn = "9781605587448",
booktitle = "Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09",

}

TY - GEN

T1 - A massively parallel adaptive fast-multipole method on heterogeneous architectures

AU - Lashuk, Ilya

AU - Chandramowlishwaran, Aparna

AU - Langston, Harper

AU - Nguyen, Tuan Anh

AU - Sampath, Rahul

AU - Shringarpure, Aashay

AU - Vuduc, Richard

AU - Ying, Lexing

AU - Zorin, Denis

AU - Biros, George

PY - 2009

Y1 - 2009

UR - http://www.scopus.com/inward/record.url?scp=74049157044&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=74049157044&partnerID=8YFLogxK

U2 - 10.1145/1654059.1654118

DO - 10.1145/1654059.1654118

M3 - Conference contribution

AN - SCOPUS:74049157044

SN - 9781605587448

BT - Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09

ER -