A fine-grained link-level fault-tolerant mechanism for networks-on-chip

Arseniy Vitkovskiy, Vassos Soteriou, Chrysostomos Nicopoulos

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Silicon technology scaling is continuously enabling denser integration capabilities. However, this comes at the expense of higher variability and susceptibility to wear-out. With an escalating number of on-chip components expected to be defective in near-future chips, modern parallel systems, such as Chip Multi-Processors (CMPs), become especially vulnerable to these faults. Just a single link failure in the underlying Network-on-Chip (NoC) may cause inter-tile communication to halt and even deadlock, rendering the chip useless. While fault-tolerant routing schemes do exist, they can only handle a finite number of link faults. In this paper, we address permanent wire failures, which can occur in on-chip parallel links at manufacture time or while in operation. Instead of marking the entire link as faulty, we present a methodology where the Partially Faulty Link (PFL) can still be used to transfer data between NoC routers, thus maintaining network connectivity, extending the yield and lifetime of the chip, and allowing for graceful performance degradation. To achieve this, we devise architectural augmentations to both the router and link micro-architectures, along with link fault detection, diagnosis, and re-configuration at wire granularity. Statistical link-level fault models demonstrate the usability of PFLs, while relevant load-balancing routing algorithms and low-cost re-transmission mechanisms are also presented and coupled to the proposed architecture. Hardware synthesis demonstrates the feasibility of the proposed extensions to the base NoC architecture. Results obtained from full-system simulations show that high-performance NoCs are realizable in the presence of PFLs.
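
The idea of keeping a partially faulty link in service can be illustrated with a minimal sketch. This is not the paper's design: the abstract does not describe the actual re-configuration logic, so the sketch simply assumes that a flit is serialized over whichever wires are still working, taking extra cycles as wires fail. The names `working_wires`, `cycles_needed`, `send_flit`, and `receive_flit`, as well as the per-wire `fault_mask` representation, are hypothetical.

```python
from math import ceil


def working_wires(fault_mask: list[bool]) -> list[int]:
    """Return the indices of wires that are not marked faulty."""
    return [i for i, faulty in enumerate(fault_mask) if not faulty]


def cycles_needed(flit_bits: int, fault_mask: list[bool]) -> int:
    """Cycles to move one flit when only the working wires carry data (assumed model)."""
    good = len(working_wires(fault_mask))
    if good == 0:
        raise ValueError("link has no working wires")
    return ceil(flit_bits / good)


def send_flit(flit: list[int], fault_mask: list[bool]) -> list[dict[int, int]]:
    """Split the flit into per-cycle {wire index: bit} transfers over working wires."""
    good = working_wires(fault_mask)
    if not good:
        raise ValueError("link has no working wires")
    transfers = []
    for start in range(0, len(flit), len(good)):
        chunk = flit[start:start + len(good)]
        # The last cycle may use only some of the working wires.
        transfers.append(dict(zip(good, chunk)))
    return transfers


def receive_flit(transfers: list[dict[int, int]], fault_mask: list[bool]) -> list[int]:
    """Reassemble the flit at the receiver, which shares the same fault mask."""
    good = working_wires(fault_mask)
    bits = []
    for cycle in transfers:
        bits.extend(cycle[w] for w in good if w in cycle)
    return bits


if __name__ == "__main__":
    mask = [False, True, False, False]   # 4-wire link, wire 1 is faulty
    flit = [1, 0, 1, 1, 0, 1, 0, 0]      # 8-bit flit
    sent = send_flit(flit, mask)
    print(cycles_needed(len(flit), mask), receive_flit(sent, mask) == flit)
```

Under this assumed model a fault-free 4-wire link would move the 8-bit flit in two cycles, while the same link with one faulty wire needs three, which is one way to picture the graceful performance degradation the abstract mentions.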

Original language: English (US)
Title of host publication: 2010 IEEE International Conference on Computer Design, ICCD 2010
Pages: 447-454
Number of pages: 8
DOIs: https://doi.org/10.1109/ICCD.2010.5647663
State: Published - Dec 1 2010
Event: 28th IEEE International Conference on Computer Design, ICCD 2010 - Amsterdam, Netherlands
Duration: Oct 3 2010 to Oct 6 2010

Other

Other: 28th IEEE International Conference on Computer Design, ICCD 2010
Country: Netherlands
City: Amsterdam
Period: 10/3/10 to 10/6/10

Fingerprint

Routers
Wire
Data transfer
Routing algorithms
Tile
Fault detection
Resource allocation
Wear of materials
Hardware
Degradation
Silicon
Communication
Network-on-chip
Costs

ASJC Scopus subject areas

  • Hardware and Architecture
  • Electrical and Electronic Engineering

Cite this

Vitkovskiy, A., Soteriou, V., & Nicopoulos, C. (2010). A fine-grained link-level fault-tolerant mechanism for networks-on-chip. In 2010 IEEE International Conference on Computer Design, ICCD 2010 (pp. 447-454). [5647663] https://doi.org/10.1109/ICCD.2010.5647663

@inproceedings{3d5053290c054b0098ca96c24004647f,
title = "A fine-grained link-level fault-tolerant mechanism for networks-on-chip",
author = "Arseniy Vitkovskiy and Vassos Soteriou and Chrysostomos Nicopoulos",
year = "2010",
month = "12",
day = "1",
doi = "10.1109/ICCD.2010.5647663",
language = "English (US)",
isbn = "9781424489350",
pages = "447--454",
booktitle = "2010 IEEE International Conference on Computer Design, ICCD 2010",

}

TY - GEN

T1 - A fine-grained link-level fault-tolerant mechanism for networks-on-chip

AU - Vitkovskiy, Arseniy

AU - Soteriou, Vassos

AU - Nicopoulos, Chrysostomos

PY - 2010/12/1

Y1 - 2010/12/1

UR - http://www.scopus.com/inward/record.url?scp=78650756156&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78650756156&partnerID=8YFLogxK

U2 - 10.1109/ICCD.2010.5647663

DO - 10.1109/ICCD.2010.5647663

M3 - Conference contribution

AN - SCOPUS:78650756156

SN - 9781424489350

SP - 447

EP - 454

BT - 2010 IEEE International Conference on Computer Design, ICCD 2010

ER -