A dynamically adjusting gracefully degrading link-level fault-tolerant mechanism for NoCs

Arseniy Vitkovskiy, Vassos Soteriou, Chrysostomos Nicopoulos

    Research output: Contribution to journal, Article

    Abstract

    The rapid scaling of silicon technology has enabled massive transistor integration densities. Nanometer feature sizes, however, are marred by increasing variability and susceptibility to wear-out. Billion-transistor designs, such as chip multiprocessors (CMPs), are especially vulnerable to defects. CMPs rely on a network-on-chip for all their communication needs. A single link failure within this on-chip fabric can impede, halt, or even deadlock intertile communication, which can render the entire chip multiprocessor useless. In this paper, we present a technique capable of handling very large numbers of permanent wire failures that occur in parallel links either at manufacture-time or at runtime (dynamically). As opposed to marking an entire parallel link as faulty whenever some wires fail, the proposed methodology employs these partially-faulty links (PFLs) to continue the transfer of information, albeit in a gracefully degraded mode, in order to maintain network connectivity. Furthermore, the presented technique can designate PFLs as fully-faulty when several wires fail, by utilizing appropriate routing algorithms that bypass nonoperational links, while still maintaining load-balance in the vicinity of PFLs. The proposed scheme employs architectural support within the on-chip routers to detect link failures and enable reconfiguration at the granularity of individual wires. Hardware synthesis confirms the low-cost nature of the proposed architecture, and full-system simulations using both synthetic network traffic and real workloads demonstrate its efficacy.
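
The idea of keeping a partially-faulty link in service can be illustrated with a small sketch. This is not the paper's router design, which the abstract does not detail; it only assumes that a flit is serialized over whichever wires still work, and that a link is treated as fully faulty once a chosen number of wires have failed. The function names, the `fail_threshold` parameter, and the lane-mapping order are all hypothetical.

```python
from math import ceil
from typing import Dict, List, Optional


def transfer_cycles(flit_bits: int, wire_ok: List[bool],
                    fail_threshold: int) -> Optional[int]:
    """Cycles needed to send one flit, or None if the link is bypassed.

    fail_threshold is an assumed policy knob: at or above this many
    failed wires the link is declared fully faulty.
    """
    healthy = sum(wire_ok)
    failed = len(wire_ok) - healthy
    if healthy == 0 or failed >= fail_threshold:
        return None  # routing would have to go around this link
    return ceil(flit_bits / healthy)


def send_flit(bits: List[int], wire_ok: List[bool]) -> List[Dict[int, int]]:
    """Split a flit into per-cycle transfers that use only healthy wires."""
    lanes = [i for i, ok in enumerate(wire_ok) if ok]
    if not lanes:
        raise ValueError("no healthy wires left on this link")
    cycles = []
    for start in range(0, len(bits), len(lanes)):
        chunk = bits[start:start + len(lanes)]
        # Map consecutive bits onto healthy wire indices in ascending order.
        cycles.append(dict(zip(lanes, chunk)))
    return cycles


def receive_flit(cycles: List[Dict[int, int]]) -> List[int]:
    """Reassemble the flit by reading wires in ascending index order."""
    bits: List[int] = []
    for cycle in cycles:
        bits.extend(cycle[wire] for wire in sorted(cycle))
    return bits
```

With a 4-wire link that has lost one wire, an 8-bit flit takes three cycles instead of two under these assumptions, and the receiver recovers the original bits as long as both ends agree on which wires are healthy.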

    Original language: English (US)
    Article number: 6238398
    Pages (from-to): 1235-1248
    Number of pages: 14
    Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
    Volume: 31
    Issue number: 8
    DOI: 10.1109/TCAD.2012.2188801
    State: Published - Jul 27 2012

    Fingerprint

    Wire
    Transistors
    Communication
    Routing algorithms
    Routers
    Wear of materials
    Hardware
    Silicon
    Defects
    Costs
    Network-on-chip

    Keywords

    • Fault-tolerance
    • Networks-on-chip (NoCs)
    • On-chip interconnection networks
    • Router microarchitecture
    • Routing algorithm

    ASJC Scopus subject areas

    • Software
    • Computer Graphics and Computer-Aided Design
    • Electrical and Electronic Engineering

    Cite this

    A dynamically adjusting gracefully degrading link-level fault-tolerant mechanism for NoCs. / Vitkovskiy, Arseniy; Soteriou, Vassos; Nicopoulos, Chrysostomos.

    In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 31, No. 8, 6238398, 27.07.2012, p. 1235-1248.

    Vitkovskiy, Arseniy; Soteriou, Vassos; Nicopoulos, Chrysostomos. / A dynamically adjusting gracefully degrading link-level fault-tolerant mechanism for NoCs. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2012; Vol. 31, No. 8. pp. 1235-1248.
    @article{e03201ea7a1c438ab5754f8aaad9d411,
    title = "A dynamically adjusting gracefully degrading link-level fault-tolerant mechanism for NoCs",
    abstract = "The rapid scaling of silicon technology has enabled massive transistor integration densities. Nanometer feature sizes, however, are marred by increasing variability and susceptibility to wear-out. Billion-transistor designs, such as chip multiprocessors (CMPs), are especially vulnerable to defects. CMPs rely on a network-on-chip for all their communication needs. A single link failure within this on-chip fabric can impede, halt, or even deadlock intertile communication, which can render the entire chip multiprocessor useless. In this paper, we present a technique capable of handling very large numbers of permanent wire failures that occur in parallel links either at manufacture-time or at runtime (dynamically). As opposed to marking an entire parallel link as faulty whenever some wires fail, the proposed methodology employs these partially-faulty links (PFLs) to continue the transfer of information, albeit in a gracefully degraded mode, in order to maintain network connectivity. Furthermore, the presented technique can designate PFLs as fully-faulty when several wires fail, by utilizing appropriate routing algorithms that bypass nonoperational links, while still maintaining load-balance in the vicinity of PFLs. The proposed scheme employs architectural support within the on-chip routers to detect link failures and enable reconfiguration at the granularity of individual wires. Hardware synthesis confirms the low-cost nature of the proposed architecture, and full-system simulations using both synthetic network traffic and real workloads demonstrate its efficacy.",
    keywords = "Fault-tolerance, Networks-on-chip (NoCs), On-chip interconnection networks, Router microarchitecture, Routing algorithm",
    author = "Arseniy Vitkovskiy and Vassos Soteriou and Chrysostomos Nicopoulos",
    year = "2012",
    month = "7",
    day = "27",
    doi = "10.1109/TCAD.2012.2188801",
    language = "English (US)",
    volume = "31",
    pages = "1235--1248",
    journal = "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems",
    issn = "0278-0070",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    number = "8",

    }

    TY - JOUR

    T1 - A dynamically adjusting gracefully degrading link-level fault-tolerant mechanism for NoCs

    AU - Vitkovskiy, Arseniy

    AU - Soteriou, Vassos

    AU - Nicopoulos, Chrysostomos

    PY - 2012/7/27

    Y1 - 2012/7/27

    N2 - The rapid scaling of silicon technology has enabled massive transistor integration densities. Nanometer feature sizes, however, are marred by increasing variability and susceptibility to wear-out. Billion-transistor designs, such as chip multiprocessors (CMPs), are especially vulnerable to defects. CMPs rely on a network-on-chip for all their communication needs. A single link failure within this on-chip fabric can impede, halt, or even deadlock intertile communication, which can render the entire chip multiprocessor useless. In this paper, we present a technique capable of handling very large numbers of permanent wire failures that occur in parallel links either at manufacture-time or at runtime (dynamically). As opposed to marking an entire parallel link as faulty whenever some wires fail, the proposed methodology employs these partially-faulty links (PFLs) to continue the transfer of information, albeit in a gracefully degraded mode, in order to maintain network connectivity. Furthermore, the presented technique can designate PFLs as fully-faulty when several wires fail, by utilizing appropriate routing algorithms that bypass nonoperational links, while still maintaining load-balance in the vicinity of PFLs. The proposed scheme employs architectural support within the on-chip routers to detect link failures and enable reconfiguration at the granularity of individual wires. Hardware synthesis confirms the low-cost nature of the proposed architecture, and full-system simulations using both synthetic network traffic and real workloads demonstrate its efficacy.

    KW - Fault-tolerance

    KW - Networks-on-chip (NoCs)

    KW - On-chip interconnection networks

    KW - Router microarchitecture

    KW - Routing algorithm

    UR - http://www.scopus.com/inward/record.url?scp=84864116812&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84864116812&partnerID=8YFLogxK

    U2 - 10.1109/TCAD.2012.2188801

    DO - 10.1109/TCAD.2012.2188801

    M3 - Article

    VL - 31

    SP - 1235

    EP - 1248

    JO - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

    JF - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

    SN - 0278-0070

    IS - 8

    M1 - 6238398

    ER -