Award Abstract # 1420718
SHF: Small: Collaborative Research: A Holistic Design Methodology for Fault-Tolerant and Robust Network-on-Chips (NoCs) Architectures

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: OHIO UNIVERSITY
Initial Amendment Date: July 10, 2014
Latest Amendment Date: June 16, 2015
Award Number: 1420718
Award Instrument: Standard Grant
Program Manager: Yuanyuan Yang
CCF
 Division of Computing and Communication Foundations
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: July 15, 2014
End Date: June 30, 2019 (Estimated)
Total Intended Award Amount: $200,000.00
Total Awarded Amount to Date: $208,000.00
Funds Obligated to Date: FY 2014 = $200,000.00
FY 2015 = $8,000.00
History of Investigator:
  • Avinash Karanth (Principal Investigator)
    karanth@ohio.edu
Recipient Sponsored Research Office: Ohio University
1 OHIO UNIVERSITY
ATHENS
OH  US  45701-2979
(740)593-2857
Sponsor Congressional District: 12
Primary Place of Performance: Ohio University
329 Stocker Center
Athens
OH  US  45701-2979
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): LXHMMWRKN5N8
Parent UEI:
NSF Program(s): Software & Hardware Foundation
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT
01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7941, 9251
Program Element Code(s): 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Technology scaling down to the nanometer regime has aided the growth in transistor counts that has made multi-core architectures a power-efficient approach to harnessing parallelism and improving performance. Consequently, the design of low-latency, high-bandwidth, power-efficient and reliable Network-on-Chips (NoCs) is proving to be one of the most critical challenges to achieving the performance potential of future chips. While multicores are facilitating an enormous integration capacity, aggressive transistor scaling has also led to a steady degradation of device and circuit reliability. Increased device wear-out (due to negative-bias temperature instability (NBTI), electromigration (EM) and hot carrier injection (HCI)) has exacerbated the waning reliability of transistors, thereby resulting in a significant increase in faults (both permanent and transient) and hardware failures. As faults manifest within the NoC substrate, multicore chips are faced with excessive delays and increased power consumption while recovering from them. While NoC reliability research has made significant strides at inter- and intra-router levels, there is still a lack of a holistic design approach that cohesively covers the reliability of the entire NoC architecture, from device wear-out to links and routers, routing protocols, and applications.

This project will develop a holistic design methodology that addresses the reliability of the entire NoC communication infrastructure (device, links, routers, routing algorithms, and topology) while minimizing energy footprint, reducing the area overhead and only marginally impacting performance. To achieve our goal of improving link fault-recovery, this project will develop techniques to maximize the utilization of the inter-router links with minimum power and area overhead. For the router, this project will propose intra-router reliability techniques with the goals of maximizing hardware utilization, reducing redundancy and area overhead, and minimizing router pipeline latency. Further, wear-leveling techniques developed by this project will improve the reliability of NoCs and the lifetime of the chip. Finally, the proposed techniques will be evaluated by developing fault models, injecting them into the NoC, and measuring fault coverage, performance degradation and energy efficiency through extensive modeling and simulation. The holistic design methodology spanning the entire NoC architecture and the reliability techniques developed from this project will positively impact the next-generation multi-core and System-on-Chip (SoC) architectures with improvements in energy efficiency, performance and robustness to hard faults and soft errors. This project will play a major role in education by integrating discovery with teaching and training, and by attracting and training minority students in this field.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 12)
A. Karanth and S. Kaya and A. M. Sikder and A. Louri and S. Laha and D. Carbaugh and H. Xin and J. Wu and D. DiTomaso "Sustainability in Network-on-Chips by Exploring Heterogeneity in Emerging Technologies" IEEE Transactions on Sustainable Computing , v.4 , 2019 , p.293-307 10.1109/TSUSC.2018.2861362
Boraten, Travis and Kodi, Avinash "Mitigation of Hardware Trojan Based Denial-of-Service Attack for Secure NoCs" Journal of Parallel and Distributed Computing , v.111 , 2018 , p.24 10.1016/j.jpdc.2017.06.014
D. DiTomaso and A. K. Kodi and A. Louri and R. Bunescu "Resilient and Power-Efficient Multi-Function Channel Buffers in Network-on-Chip Architectures" IEEE Transactions on Computers , v.64 , 2015 , p.3555-3568 10.1109/TC.2015.2401013
Louri, Ahmed and Collet, Jacques and Karanth, Avinash "Limit of Hardware Solutions for Self-Protecting Fault-Tolerant NoCs" J. Emerg. Technol. Comput. Syst. , v.15 , 2019 , p.4:1-4:16 10.1145/3233986
Matthew Kennedy and Avinash Karanth Kodi "CLAP-NET: Bandwidth adaptive optical crossbar architecture" Journal of Parallel and Distributed Computing , v.100 , 2017 , p.130 - 139 http://dx.doi.org/10.1016/j.jpdc.2016.05.004
M. Kennedy and A. K. Kodi "Laser Pooling: Static and Dynamic Laser Power Allocation for On-Chip Optical Interconnects" Journal of Lightwave Technology , v.35 , 2017 , p.3159 10.1109/JLT.2017.2681960
Q. Fettes and M. Clark and R. Bunescu and A. Karanth and A. Louri "Dynamic Voltage and Frequency Scaling in NoCs with Supervised and Reinforcement Learning Techniques" IEEE Transactions on Computers , v.68 , 2019 , p.375-389 10.1109/TC.2018.2875476
S. Sefton and T. Siddiqui and N. S. Amour and G. Stewart and A. K. Kodi "GARUDA: Designing Energy-Efficient Hardware Monitors From High-Level Policies for Secure Information Flow" IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , v.37 , 2018 , p.2509-2518 10.1109/TCAD.2018.2857041
T. Boraten and A. Kodi "Runtime Techniques to Mitigate Soft Errors in Network-on-Chip (NoC) Architectures" IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , v.37 , 2018 , p.682 10.1109/TCAD.2017.2664066
T. F. Canan and S. Kaya and A. Karanth and H. Xin and A. Louri "Ambipolar SB-FinFETs: A New Path to Ultra-Compact Sub-10 nm Logic Circuits" IEEE Transactions on Electron Devices , v.66 , 2019 , p.255-263 10.1109/TED.2018.2874000

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

In this project, we developed a holistic design methodology that addressed the entire Network-on-Chips communication infrastructure, which is critical for multicore functionality. First, we designed reversible and power-efficient NoC architectures to improve reliability, performance and overall power consumption. To further improve efficiency, we proposed using machine learning techniques to predict the direction of traffic flows on our links. Using decision trees, we predicted the direction of links to be reversed with the goal of minimizing latency and improving energy efficiency. Second, we analyzed the stress levels due to asymmetric traffic patterns and routing algorithms in NoC architectures. In this work, we proposed a novel in-flight, adaptive routing algorithm to reduce the accumulation of electromigration (EM), hot carrier injection (HCI) and negative-bias temperature instability (NBTI) effects on the lifetime of the NoC. The proposed routing algorithm is based on a new metric called Packet-Per-Port (P3), which equalizes the stress throughout the network. The net impact is that network components such as the links and routers will age evenly, thereby maximizing the lifetime of the chip. Third, soft errors in NoCs could be timing errors (data-dependent errors due to process variations and wear-out effects) or data corruption errors (crosstalk). In this task, we proposed a comprehensive fault-prediction system in which we (a) created a methodology to obtain the training/testing sets, (b) trained a machine learning (ML) algorithm to predict timing faults on links, and (c) mitigated soft errors. From these data sets, we trained an ML model that can predict timing faults during runtime and produce one of three outcomes each time a flit uses a link: none of the bits will be in error, few bits will be in error (1-2 bits), or several bits will be in error (> 2 bits).
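The three-way outcome described above can be illustrated with a small hand-written decision tree. This is only a sketch under stated assumptions: the report does not list the features or thresholds of the trained model, so the three features (link temperature, link utilization, bit transitions per flit) and every threshold below are hypothetical placeholders. What it shows is that inference reduces to a short chain of comparisons.

```python
from dataclasses import dataclass

# The three fault classes named in the report.
NO_ERROR = "none"
FEW_BITS = "few (1-2 bits)"
MANY_BITS = "several (>2 bits)"


@dataclass
class LinkSample:
    # Hypothetical features; the report does not say which ones were used.
    temperature_c: float   # link temperature in degrees Celsius
    utilization: float     # fraction of cycles the link was busy, 0.0-1.0
    bit_transitions: int   # bits toggling relative to the previous flit


def predict_fault_class(sample: LinkSample) -> str:
    """Classify one flit traversal using comparisons only.

    The thresholds are made up for illustration; a real tree would be
    learned from training data.
    """
    if sample.temperature_c <= 70.0:
        return NO_ERROR
    if sample.bit_transitions <= 32:
        return NO_ERROR if sample.utilization <= 0.5 else FEW_BITS
    return FEW_BITS if sample.utilization <= 0.5 else MANY_BITS
```

With these placeholder thresholds, a hot, busy link carrying a flit with many bit transitions falls into the "several bits" class, while a cool link is always predicted error-free.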
The model we used for predicting faults is a decision tree because of its low overhead during the testing phase, which consists of a few comparisons rather than more complicated operations such as the multiplications required by other ML models. Fourth, we proposed an adaptive multi-layered on-chip error correction and detection scheme that provides variable-strength fault coverage and graceful network degradation. As packets traverse the network, flits are systematically scrubbed on the switch-to-switch (s2s) layer, on demand, to prevent faults from accumulating into the end-to-end (e2e) layer. To dynamically adjust the level of fault protection, we proposed a configurable s2s encoder capable of adjusting ECC strength for burst errors, crosstalk, and intra-router faults to reduce the average hop count and costly retransmissions. Fifth, we evaluated the impact of near-threshold voltage (NTV) scaling on NoC architectures. Lowering the operating voltage, however, increases the susceptibility of devices to faults and compromises reliability. In this task, we proposed RETUNES: Reliable and Energy-efficient NoC, where NTV scaling is uniquely combined with enhanced reliability for on-chip communication. RETUNES uses multiple voltage modes to manage congestion and energy efficiency of the network. Sixth, we proposed IntelliNoC, an intelligent NoC design framework which introduces architectural innovations and uses reinforcement learning to manage the design complexity and simultaneously optimize performance, energy efficiency, and reliability in a holistic manner. IntelliNoC integrates three NoC architectural techniques, namely (1) multi-function adaptive channels (MFCs) to improve energy efficiency, (2) adaptive error detection/correction and re-transmission control to enhance reliability, and (3) a stress-relaxing bypass feature which dynamically powers off NoC components to prevent overheating and fatigue.
To handle the complex dynamic interactions induced by these techniques, we train a dynamic control policy using Q-learning, with the goal of providing improved fault-tolerance and performance while reducing power consumption and area overhead.
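As an illustration of the Q-learning control policy mentioned above, the following is a minimal tabular sketch of the standard Q-learning update and an epsilon-greedy action choice. It is not the IntelliNoC implementation: the state names and the action names in `ACTIONS` are hypothetical, and the report does not specify the state encoding, reward function, learning rate or discount factor.

```python
import random
from collections import defaultdict

# Hypothetical router operating modes; the report does not enumerate them.
ACTIONS = ["bypass", "ecc_low", "ecc_high"]


def make_q_table():
    """Map (state, action) pairs to values, defaulting to 0.0."""
    return defaultdict(float)


def choose_action(q, state, actions, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, else take the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Standard Q-learning update:
    Q(s, a) += alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]
```

In use, a controller would observe a (hypothetical) state such as a load or temperature bucket each control interval, call `choose_action`, apply the chosen mode, and then call `q_update` with a reward that combines whatever latency, power and error terms the designer selects.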

Publications:

  1. Dominic DiTomaso, Avinash Kodi, Razvan Bunescu and Ahmed Louri, "Resilient and Power-Efficient Multi-Function Channel Buffers in Network-on-Chips (NoCs) using Machine Learning," IEEE Transactions on Computers (TC), vol. 26, no. 12, pp. 3289-3302, December 2015.
  2. Juman Alshraiedeh and Avinash Kodi, "An Adaptive Routing Algorithm to Improve Lifetime Reliability in NoC Architectures," IEEE Defect and Fault Tolerance in VLSI and Nanotechnology Symposium (DFT), Storrs, CT, September 19-20, 2016.
  3. Dominic DiTomaso, Travis Boraten, Avinash Kodi and Ahmed Louri, "Predicting and Mitigating Faults in NoCs using Machine Learning," IEEE/ACM International Conference on Microarchitecture (MICRO-49), Taipei, Taiwan, October 15-19, 2016.
  4. Travis Boraten and Avinash Kodi, "Runtime Fault Tolerant Techniques to Mitigate Soft Errors in Network-on-Chips (NoCs) Architectures," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems (TCAD), vol. 37, no. 3, pp. 682-695, March 2018.
  5. Padmaja Bhamidipati and Avinash Karanth, "Reliable and Power-Efficient Network-on-Chips Using Voltage Scaling Techniques," 33rd International Conference on Computer Design (ICCD), Orlando, Florida, October 7-10, 2018.  
  6. Ke Wang, Ahmed Louri, Avinash Karanth, and Razvan Bunescu, "IntelliNoC: A Holistic Framework for Energy-Efficient and Reliable On-Chip Communication for Manycores," Accepted to appear in 45th IEEE International Symposium on Computer Architecture (ISCA), Phoenix, AZ, June 22-26, 2019.

 


Last Modified: 12/09/2019
Modified by: Avinash Karanth
