Toward data-driven architectural support in improving the performance of future HPC architectures

George Matheou, Vassos Soteriou, Paraskevas Evripidou

Research output: Contribution to journal - Article

Abstract

We propose architectures based on Data-Driven Multithreading (DDM), a hybrid control-flow/data-flow model, to address the concurrency challenges faced by future High-Performance Computing (HPC) systems. We focus on the design and implementation of an optimized hardware Thread Scheduling Unit (TSU) and its integration into a multi-core system dubbed MiDAS. The TSU is the core of the DDM model and it orchestrates the execution of multiple threads on sequential processors based on data availability. MiDAS was prototyped on a Xilinx Virtex-6 FPGA and extensively evaluated using several micro-benchmarks, showing that it achieves linearly growing performance as the processing core count increases, even when running benchmarks comprising very small problem sizes. Under the largest problem size tested and with all 8 available cores being utilized, MiDAS achieves an average speedup of 7.91×, exhibiting 98.8% utilization efficiency. Further, several results pertaining to the proposed hardware TSU are provided, including FPGA real estate requirements, where it is found that MiDAS's TSU demands relatively small overheads and reduced power consumption, while various TSU operations adhere to low-latency responses. To back said claims, the proposed DDM-based TSU is compared with the Task Superscalar architecture that implements the StarSs programming framework in hardware. The comparison results show that the proposed TSU requires much less of both hardware investment and energy consumption to operate. Specifically, Task Superscalar is found to be 4.94× larger than the DDM-supporting TSU in terms of slice register requirements and 11.34× larger with respect to the slice look-up table count. Last, the hardware TSU is compared with a software TSU implementation offering identical functionalities, with both being run on an FPGA fabric under a synthetic application, where a detailed performance evaluation shows that MiDAS's hardware-implemented TSU significantly outperforms its software-based TSU counterpart.
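
The speedup and utilization figures quoted in the abstract can be related with a short calculation. The sketch below assumes that "utilization efficiency" means the average speedup divided by the number of cores; the abstract does not define the term, so this reading is an assumption, and the function name is hypothetical.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Return speedup divided by core count (1.0 would be ideal linear scaling)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


# Figures quoted in the abstract: 7.91x average speedup on 8 cores.
# Assumption: efficiency is defined as speedup / cores.
efficiency = parallel_efficiency(7.91, 8)
print(f"{efficiency:.1%}")
```

Under that assumption the ratio is 7.91 / 8 = 0.98875, which agrees with the reported 98.8% if the published figure was truncated or derived from an unrounded speedup.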

Original language: English (US)
Pages (from-to): 82-106
Number of pages: 25
Journal: Parallel Computing
Volume: 86
DOI: 10.1016/j.parco.2019.04.011
State: Published - Aug 1 2019

Fingerprint

Data-driven
Thread
High Performance
Scheduling
Computing
Unit
Multithreading
Hardware
Field programmable gate arrays (FPGA)
Superscalar
Field Programmable Gate Array
Slice
Architecture
Count
Benchmark
Hybrid Control
Software
Look-up Table
Requirements
Comparison Result

Keywords

  • Data-driven multithreading
  • Data-flow execution
  • FPGA
  • Hardware thread scheduler
  • HPC
  • Multi-core architecture

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Networks and Communications
  • Computer Graphics and Computer-Aided Design
  • Artificial Intelligence

Cite this

Toward data-driven architectural support in improving the performance of future HPC architectures. / Matheou, George; Soteriou, Vassos; Evripidou, Paraskevas.

In: Parallel Computing, Vol. 86, 01.08.2019, p. 82-106.

@article{490ebb3321f24bec9197d0de82e0eab0,
title = "Toward data-driven architectural support in improving the performance of future HPC architectures",
abstract = "We propose architectures based on Data-Driven Multithreading (DDM), a hybrid control-flow/data-flow model, to address the concurrency challenges faced by future High-Performance Computing (HPC) systems. We focus on the design and implementation of an optimized hardware Thread Scheduling Unit (TSU) and its integration into a multi-core system dubbed MiDAS. The TSU is the core of the DDM model and it orchestrates the execution of multiple threads on sequential processors based on data availability. MiDAS was prototyped on a Xilinx Virtex-6 FPGA and extensively evaluated using several micro-benchmarks, showing that it achieves linearly growing performance as the processing core count increases, even when running benchmarks comprising very small problem sizes. Under the largest problem size tested and with all 8 available cores being utilized, MiDAS achieves an average speedup of 7.91×, exhibiting 98.8{\%} utilization efficiency. Further, several results pertaining to the proposed hardware TSU are provided, including FPGA real estate requirements, where it is found that MiDAS's TSU demands relatively small overheads and reduced power consumption, while various TSU operations adhere to low-latency responses. To back said claims, the proposed DDM-based TSU is compared with the Task Superscalar architecture that implements the StarSs programming framework in hardware. The comparison results show that the proposed TSU requires much less of both hardware investment and energy consumption to operate. Specifically, Task Superscalar is found to be 4.94× larger than the DDM-supporting TSU in terms of slice register requirements and 11.34× larger with respect to the slice look-up table count. Last, the hardware TSU is compared with a software TSU implementation offering identical functionalities, with both being run on an FPGA fabric under a synthetic application, where a detailed performance evaluation shows that MiDAS's hardware-implemented TSU significantly outperforms its software-based TSU counterpart.",
keywords = "Data-driven multithreading, Data-flow execution, FPGA, Hardware thread scheduler, HPC, Multi-core architecture",
author = "George Matheou and Vassos Soteriou and Paraskevas Evripidou",
year = "2019",
month = "8",
day = "1",
doi = "10.1016/j.parco.2019.04.011",
language = "English (US)",
volume = "86",
pages = "82--106",
journal = "Parallel Computing",
issn = "0167-8191",
publisher = "Elsevier",

}

TY - JOUR

T1 - Toward data-driven architectural support in improving the performance of future HPC architectures

AU - Matheou, George

AU - Soteriou, Vassos

AU - Evripidou, Paraskevas

PY - 2019/8/1

Y1 - 2019/8/1

AB - We propose architectures based on Data-Driven Multithreading (DDM), a hybrid control-flow/data-flow model, to address the concurrency challenges faced by future High-Performance Computing (HPC) systems. We focus on the design and implementation of an optimized hardware Thread Scheduling Unit (TSU) and its integration into a multi-core system dubbed MiDAS. The TSU is the core of the DDM model and it orchestrates the execution of multiple threads on sequential processors based on data availability. MiDAS was prototyped on a Xilinx Virtex-6 FPGA and extensively evaluated using several micro-benchmarks, showing that it achieves linearly growing performance as the processing core count increases, even when running benchmarks comprising very small problem sizes. Under the largest problem size tested and with all 8 available cores being utilized, MiDAS achieves an average speedup of 7.91×, exhibiting 98.8% utilization efficiency. Further, several results pertaining to the proposed hardware TSU are provided, including FPGA real estate requirements, where it is found that MiDAS's TSU demands relatively small overheads and reduced power consumption, while various TSU operations adhere to low-latency responses. To back said claims, the proposed DDM-based TSU is compared with the Task Superscalar architecture that implements the StarSs programming framework in hardware. The comparison results show that the proposed TSU requires much less of both hardware investment and energy consumption to operate. Specifically, Task Superscalar is found to be 4.94× larger than the DDM-supporting TSU in terms of slice register requirements and 11.34× larger with respect to the slice look-up table count. Last, the hardware TSU is compared with a software TSU implementation offering identical functionalities, with both being run on an FPGA fabric under a synthetic application, where a detailed performance evaluation shows that MiDAS's hardware-implemented TSU significantly outperforms its software-based TSU counterpart.

KW - Data-driven multithreading

KW - Data-flow execution

KW - FPGA

KW - Hardware thread scheduler

KW - HPC

KW - Multi-core architecture

UR - http://www.scopus.com/inward/record.url?scp=85067951572&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85067951572&partnerID=8YFLogxK

U2 - 10.1016/j.parco.2019.04.011

DO - 10.1016/j.parco.2019.04.011

M3 - Article

AN - SCOPUS:85067951572

VL - 86

SP - 82

EP - 106

JO - Parallel Computing

JF - Parallel Computing

SN - 0167-8191

ER -