A massively parallel adaptive fast multipole method on heterogeneous architectures

Ilya Lashuk, Aparna Chandramowlishwaran, Harper Langston, Tuan Anh Nguyen, Rahul Sampath, Aashay Shringarpure, Richard Vuduc, Lexing Ying, Denis Zorin, George Biros

Research output: Contribution to journal › Article

Abstract

We describe a parallel fast multipole method (FMM) for highly nonuniform distributions of particles. We employ both distributed-memory parallelism (via MPI) and shared-memory parallelism (via OpenMP and GPU acceleration) to rapidly evaluate two-body nonoscillatory potentials in three dimensions on heterogeneous high-performance computing architectures. We have performed scalability tests with up to 30 billion particles on 196,608 cores on the AMD/Cray-based Jaguar system at ORNL. On a GPU-enabled system (NSF's Keeneland at Georgia Tech/ORNL), we observed 30× speedup over a single-core CPU and 7× speedup over a multicore CPU implementation. By combining GPUs with MPI, we achieve less than 10 ns/particle and six digits of accuracy for a run with 48 million nonuniformly distributed particles on 192 GPUs.
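
The computation the FMM accelerates is the evaluation of a pairwise, nonoscillatory potential (for example, the Laplace kernel phi_i = sum over j != i of q_j / |x_i - x_j|) at every particle. The sketch below is a hedged illustration, not the authors' implementation: it computes that sum directly in O(N^2) work, using MPI to split target particles across ranks and OpenMP to parallelize within a rank, mirroring the distributed-/shared-memory layering described in the abstract. The FMM in the paper approximates the same sum in roughly O(N) work by combining near-field direct sums with far-field multipole/local expansions; the particle count, slab decomposition, and replicated sources here are illustrative assumptions.

// Direct (non-FMM) evaluation of the Laplace potential with MPI + OpenMP.
// Minimal sketch for illustration only; this is not the paper's code.
#include <mpi.h>
#include <omp.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1 << 12;               // toy particle count (illustrative)
    std::vector<double> x(3 * N), q(N);  // positions and charges, replicated on every rank
    std::srand(42);                      // same seed so all ranks build identical particles
    for (int i = 0; i < N; ++i) {
        for (int d = 0; d < 3; ++d) x[3 * i + d] = std::rand() / (double)RAND_MAX;
        q[i] = 1.0 / N;
    }

    // Distributed-memory split: each rank owns a contiguous slab of target particles.
    const int i_begin = rank * N / size;
    const int i_end   = (rank + 1) * N / size;
    std::vector<double> phi(N, 0.0);

    // Shared-memory split: OpenMP threads divide the rank's slab of targets.
    #pragma omp parallel for schedule(dynamic)
    for (int i = i_begin; i < i_end; ++i) {
        double acc = 0.0;
        for (int j = 0; j < N; ++j) {    // an FMM replaces this full loop with near/far-field work
            if (j == i) continue;
            const double dx = x[3 * i]     - x[3 * j];
            const double dy = x[3 * i + 1] - x[3 * j + 1];
            const double dz = x[3 * i + 2] - x[3 * j + 2];
            acc += q[j] / std::sqrt(dx * dx + dy * dy + dz * dz);
        }
        phi[i] = acc;
    }

    // Sum the per-rank slabs so every rank holds the full potential vector.
    MPI_Allreduce(MPI_IN_PLACE, phi.data(), N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) std::printf("phi[0] = %.6f\n", phi[0]);

    MPI_Finalize();
    return 0;
}

Built along the lines of "mpicxx -fopenmp -O2 nbody.cpp" and launched with mpirun, this gives the brute-force reference against which an FMM's accuracy (the "six digits" quoted above) is typically measured.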

Original language: English (US)
Pages (from-to): 101-109
Number of pages: 9
Journal: Communications of the ACM
Volume: 55
Issue number: 5
ISSN: 0001-0782
Publisher: Association for Computing Machinery (ACM)
DOI: 10.1145/2160718.2160740
State: Published - May 2012

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Lashuk, I., Chandramowlishwaran, A., Langston, H., Nguyen, T. A., Sampath, R., Shringarpure, A., ... Biros, G. (2012). A massively parallel adaptive fast multipole method on heterogeneous architectures. Communications of the ACM, 55(5), 101-109. https://doi.org/10.1145/2160718.2160740
