A dynamically adjusting gracefully degrading link-level fault-tolerant mechanism for NoCs

Arseniy Vitkovskiy, Vassos Soteriou, Chrysostomos Nicopoulos

Research output: Contribution to journal › Article

Abstract

The rapid scaling of silicon technology has enabled massive transistor integration densities. Nanometer feature sizes, however, are marred by increasing variability and susceptibility to wear-out. Billion-transistor designs, such as chip multiprocessors (CMPs), are especially vulnerable to defects. CMPs rely on a network-on-chip for all their communication needs. A single link failure within this on-chip fabric can impede, halt, or even deadlock, intertile communication, which can render the entire chip multiprocessor useless. In this paper, we present a technique capable of handling very large numbers of permanent wire failures that occur in parallel links either at manufacture-time or at runtime (dynamically). As opposed to marking an entire parallel link as faulty, whenever some wires fail, the proposed methodology employs these partially-faulty links (PFLs) to continue the transfer of information, albeit at a gracefully degraded mode, in order to maintain network connectivity. Furthermore, the presented technique can designate PFLs as fully-faulty when several wires fail, by utilizing appropriate routing algorithms that bypass nonoperational links, while still maintaining load-balance in the vicinity of PFLs. The proposed scheme employs architectural support within the on-chip routers to detect link failures and enable reconfiguration at the granularity of individual wires. Hardware synthesis confirms the low-cost nature of the proposed architecture, and full-system simulations using both synthetic network traffic and real workloads demonstrate its efficacy.
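The idea of a partially-faulty link carrying data in a degraded mode can be illustrated with a minimal Python sketch. This is not the paper's router mechanism, which the abstract does not detail; it only assumes that a flit is serialized over whichever wires still work, so a link with fewer good wires needs more cycles per flit. The function names `degraded_transfer` and `reassemble` and the per-wire boolean fault mask are hypothetical.

```python
def degraded_transfer(flit_bits, wire_ok):
    """Serialize a flit over only the working wires of a link (assumed scheme).

    flit_bits: list of 0/1 values, one per bit of the flit.
    wire_ok:   list of booleans, one per physical wire (True = working).
    Returns one dict per cycle, mapping wire index -> bit sent on that wire.
    """
    good = [i for i, ok in enumerate(wire_ok) if ok]
    if not good:
        # No usable wires: the link would have to be treated as fully faulty
        # and bypassed by the routing algorithm.
        raise ValueError("link is fully faulty; route around it instead")
    cycles = []
    for start in range(0, len(flit_bits), len(good)):
        chunk = flit_bits[start:start + len(good)]
        cycles.append(dict(zip(good, chunk)))
    return cycles


def reassemble(cycles):
    """Rebuild the flit at the receiver by reading wires in ascending order."""
    bits = []
    for cycle in cycles:
        bits.extend(cycle[w] for w in sorted(cycle))
    return bits


flit = [1, 0, 1, 1, 0, 0, 1, 0]
mask = [True, False, True, True, False, True, False, False]  # 4 of 8 wires work
sent = degraded_transfer(flit, mask)
received = reassemble(sent)
```

In this example half of the wires have failed, so the 8-bit flit takes two cycles instead of one, while the link itself stays usable.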

Original language: English (US)
Article number: 6238398
Pages (from-to): 1235-1248
Number of pages: 14
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Volume: 31
Issue number: 8
DOIs: 10.1109/TCAD.2012.2188801
State: Published - Jul 27 2012

Fingerprint

Wire
Transistors
Communication
Routing algorithms
Routers
Wear of materials
Hardware
Silicon
Defects
Costs
Network-on-chip

Keywords

  • Fault-tolerance
  • Networks-on-chip (NoCs)
  • On-chip interconnection networks
  • Router microarchitecture
  • Routing algorithm

ASJC Scopus subject areas

  • Software
  • Computer Graphics and Computer-Aided Design
  • Electrical and Electronic Engineering

Cite this

A dynamically adjusting gracefully degrading link-level fault-tolerant mechanism for NoCs. / Vitkovskiy, Arseniy; Soteriou, Vassos; Nicopoulos, Chrysostomos.

In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 31, No. 8, 6238398, 27.07.2012, p. 1235-1248.

@article{edc9e4dd94914bf6a19d33e214786c6f,
title = "A dynamically adjusting gracefully degrading link-level fault-tolerant mechanism for NoCs",
abstract = "The rapid scaling of silicon technology has enabled massive transistor integration densities. Nanometer feature sizes, however, are marred by increasing variability and susceptibility to wear-out. Billion-transistor designs, such as chip multiprocessors (CMPs), are especially vulnerable to defects. CMPs rely on a network-on-chip for all their communication needs. A single link failure within this on-chip fabric can impede, halt, or even deadlock, intertile communication, which can render the entire chip multiprocessor useless. In this paper, we present a technique capable of handling very large numbers of permanent wire failures that occur in parallel links either at manufacture-time or at runtime (dynamically). As opposed to marking an entire parallel link as faulty, whenever some wires fail, the proposed methodology employs these partially-faulty links (PFLs) to continue the transfer of information, albeit at a gracefully degraded mode, in order to maintain network connectivity. Furthermore, the presented technique can designate PFLs as fully-faulty when several wires fail, by utilizing appropriate routing algorithms that bypass nonoperational links, while still maintaining load-balance in the vicinity of PFLs. The proposed scheme employs architectural support within the on-chip routers to detect link failures and enable reconfiguration at the granularity of individual wires. Hardware synthesis confirms the low-cost nature of the proposed architecture, and full-system simulations using both synthetic network traffic and real workloads demonstrate its efficacy.",
keywords = "Fault-tolerance, Networks-on-chip (NoCs), On-chip interconnection networks, Router microarchitecture, Routing algorithm",
author = "Arseniy Vitkovskiy and Vassos Soteriou and Chrysostomos Nicopoulos",
year = "2012",
month = "7",
day = "27",
doi = "10.1109/TCAD.2012.2188801",
language = "English (US)",
volume = "31",
pages = "1235--1248",
journal = "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems",
issn = "0278-0070",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "8",
}

TY - JOUR

T1 - A dynamically adjusting gracefully degrading link-level fault-tolerant mechanism for NoCs

AU - Vitkovskiy, Arseniy

AU - Soteriou, Vassos

AU - Nicopoulos, Chrysostomos

PY - 2012/7/27

Y1 - 2012/7/27

N2 - The rapid scaling of silicon technology has enabled massive transistor integration densities. Nanometer feature sizes, however, are marred by increasing variability and susceptibility to wear-out. Billion-transistor designs, such as chip multiprocessors (CMPs), are especially vulnerable to defects. CMPs rely on a network-on-chip for all their communication needs. A single link failure within this on-chip fabric can impede, halt, or even deadlock, intertile communication, which can render the entire chip multiprocessor useless. In this paper, we present a technique capable of handling very large numbers of permanent wire failures that occur in parallel links either at manufacture-time or at runtime (dynamically). As opposed to marking an entire parallel link as faulty, whenever some wires fail, the proposed methodology employs these partially-faulty links (PFLs) to continue the transfer of information, albeit at a gracefully degraded mode, in order to maintain network connectivity. Furthermore, the presented technique can designate PFLs as fully-faulty when several wires fail, by utilizing appropriate routing algorithms that bypass nonoperational links, while still maintaining load-balance in the vicinity of PFLs. The proposed scheme employs architectural support within the on-chip routers to detect link failures and enable reconfiguration at the granularity of individual wires. Hardware synthesis confirms the low-cost nature of the proposed architecture, and full-system simulations using both synthetic network traffic and real workloads demonstrate its efficacy.

AB - The rapid scaling of silicon technology has enabled massive transistor integration densities. Nanometer feature sizes, however, are marred by increasing variability and susceptibility to wear-out. Billion-transistor designs, such as chip multiprocessors (CMPs), are especially vulnerable to defects. CMPs rely on a network-on-chip for all their communication needs. A single link failure within this on-chip fabric can impede, halt, or even deadlock, intertile communication, which can render the entire chip multiprocessor useless. In this paper, we present a technique capable of handling very large numbers of permanent wire failures that occur in parallel links either at manufacture-time or at runtime (dynamically). As opposed to marking an entire parallel link as faulty, whenever some wires fail, the proposed methodology employs these partially-faulty links (PFLs) to continue the transfer of information, albeit at a gracefully degraded mode, in order to maintain network connectivity. Furthermore, the presented technique can designate PFLs as fully-faulty when several wires fail, by utilizing appropriate routing algorithms that bypass nonoperational links, while still maintaining load-balance in the vicinity of PFLs. The proposed scheme employs architectural support within the on-chip routers to detect link failures and enable reconfiguration at the granularity of individual wires. Hardware synthesis confirms the low-cost nature of the proposed architecture, and full-system simulations using both synthetic network traffic and real workloads demonstrate its efficacy.

KW - Fault-tolerance

KW - Networks-on-chip (NoCs)

KW - On-chip interconnection networks

KW - Router microarchitecture

KW - Routing algorithm

UR - http://www.scopus.com/inward/record.url?scp=84864116812&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84864116812&partnerID=8YFLogxK

U2 - 10.1109/TCAD.2012.2188801

DO - 10.1109/TCAD.2012.2188801

M3 - Article

AN - SCOPUS:84864116812

VL - 31

SP - 1235

EP - 1248

JO - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

JF - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

SN - 0278-0070

IS - 8

M1 - 6238398

ER -