
NSF Org: |
CCF Division of Computing and Communication Foundations |
Recipient: |
|
Initial Amendment Date: | July 10, 2014 |
Latest Amendment Date: | June 16, 2015 |
Award Number: | 1420718 |
Award Instrument: | Standard Grant |
Program Manager: |
Yuanyuan Yang
CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering |
Start Date: | July 15, 2014 |
End Date: | June 30, 2019 (Estimated) |
Total Intended Award Amount: | $200,000.00 |
Total Awarded Amount to Date: | $208,000.00 |
Funds Obligated to Date: |
FY 2015 = $8,000.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
1 OHIO UNIVERSITY ATHENS OH US 45701-2979 (740)593-2857 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
329 Stocker Center Athens OH US 45701-2979 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Software & Hardware Foundation |
Primary Program Source: |
01001516DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Technology scaling down to the nanometer regime has driven the growth in transistor counts that has made multi-core architectures a power-efficient approach to harnessing parallelism and improving performance. Consequently, the design of low-latency, high-bandwidth, power-efficient and reliable Networks-on-Chip (NoCs) is proving to be one of the most critical challenges to achieving the performance potential of future chips. While multicores are facilitating an enormous integration capacity, aggressive transistor scaling has also led to a steady degradation of device and circuit reliability. Increased device wear-out (due to negative-bias temperature instability (NBTI), electromigration (EM) and hot carrier injection (HCI)) has exacerbated the waning reliability of transistors, resulting in a significant increase in faults (both permanent and transient) and hardware failures. As faults manifest within the NoC substrate, multicore chips face excessive delays and increased power consumption while recovering from them. While NoC reliability research has made significant strides at the inter- and intra-router levels, there is still no holistic design approach that covers the reliability of the entire NoC architecture in a cohesive manner, from device wear-out, to links and routers, to routing protocols, to applications.
This project will develop a holistic design methodology that addresses the reliability of the entire NoC communication infrastructure (devices, links, routers, routing algorithms, and topology) while minimizing energy footprint, reducing area overhead and only marginally impacting performance. To improve link fault recovery, this project will develop techniques that maximize the utilization of the inter-router links with minimum power and area overhead. For the router, this project will propose intra-router reliability techniques with the goals of maximizing hardware utilization, reducing redundancy and area overhead, and minimizing router pipeline latency. Further, wear-leveling techniques developed by this project will improve the reliability of NoCs and the lifetime of the chip. Finally, the proposed techniques will be evaluated by developing fault models, injecting them into the NoC, and measuring fault coverage, performance degradation and energy efficiency through extensive modeling and simulation. The holistic design methodology spanning the entire NoC architecture and the reliability techniques developed in this project will positively impact next-generation multi-core and System-on-Chip (SoC) architectures with improvements in energy efficiency, performance and robustness to hard faults and soft errors. This project will play a major role in education by integrating discovery with teaching and training, and by attracting and training minority students in this field.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
In this project, we developed a holistic design methodology that addresses the entire Network-on-Chip (NoC) communication infrastructure, which is critical for multicore functionality.
First, we designed reversible and power-efficient NoC architectures to improve reliability and performance and to reduce overall power consumption. To further improve efficiency, we used machine learning to predict the direction of traffic flows on the links: decision trees predict which links should be reversed, with the goal of minimizing latency and improving energy efficiency.
Second, we analyzed the stress levels caused by asymmetric traffic patterns and routing algorithms in NoC architectures. We proposed a novel in-flight adaptive routing algorithm that reduces the accumulated effects of electromigration (EM), hot carrier injection (HCI) and negative-bias temperature instability (NBTI) on NoC lifetime. The algorithm is based on a new metric called Packet-Per-Port (P3), which equalizes stress throughout the network. The net effect is that network components such as links and routers age evenly, which maximizes the lifetime of the chip.
Third, soft errors in NoCs can be timing errors (data-dependent errors due to process variations and wear-out effects) or data corruption errors (crosstalk). In this task, we proposed a comprehensive fault-prediction system in which we (a) created a methodology to obtain the training and testing sets, (b) trained a machine learning (ML) algorithm to predict timing faults on links, and (c) mitigated soft errors. From these data sets, we trained an ML model that predicts timing faults at runtime and produces one of three outcomes each time a flit uses a link: no bits will be in error, a few bits will be in error (1-2 bits), or several bits will be in error (> 2 bits).
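As a rough illustration of a per-flit predictor with these three outcomes, the sketch below hand-codes a tiny tree of comparisons. The feature names (`temperature_c`, `bit_transitions`, `utilization`) and every threshold are hypothetical placeholders chosen for the example; the report does not list the features or the trained tree that the project actually used.

```python
from dataclasses import dataclass

# The three outcome classes named in the report.
NO_ERROR = "none"
FEW_ERRORS = "few (1-2 bits)"
SEVERAL_ERRORS = "several (>2 bits)"


@dataclass
class LinkSample:
    """Hypothetical per-flit features; not taken from the report."""
    temperature_c: float   # assumed local link temperature
    bit_transitions: int   # wires that toggle between consecutive flits
    utilization: float     # assumed recent link utilization in [0, 1]


def count_transitions(prev_flit: int, cur_flit: int) -> int:
    """Count the bit positions that differ between two consecutive flits."""
    return bin(prev_flit ^ cur_flit).count("1")


def predict_fault_class(s: LinkSample) -> str:
    """Classify one link traversal using comparisons only.

    The thresholds are illustrative placeholders. A real tree would be
    learned offline from training data and then hard-wired or tabulated.
    """
    if s.temperature_c < 70.0:
        if s.bit_transitions < 48:
            return NO_ERROR
        return FEW_ERRORS
    if s.bit_transitions < 24:
        return NO_ERROR
    if s.utilization < 0.5:
        return FEW_ERRORS
    return SEVERAL_ERRORS
```

Each prediction walks at most three comparisons, which is the kind of cost profile that makes a tree attractive for per-flit use at runtime.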
The model we use for predicting faults is a decision tree because of its low overhead during the testing phase, which consists of a few comparisons rather than more complicated operations such as the multiplications required by other ML models.
Fourth, we proposed an adaptive multi-layered on-chip error detection and correction scheme that provides variable-strength fault coverage and graceful network degradation. As packets traverse the network, flits are systematically scrubbed, on demand, at the switch-to-switch (s2s) layer to prevent faults from accumulating into the end-to-end (e2e) layer. To dynamically adjust the level of fault protection, we proposed a configurable s2s encoder capable of adjusting ECC strength for burst errors, crosstalk and intra-router faults to reduce the average hop count and costly retransmissions.
Fifth, we evaluated the impact of Near Threshold Voltage (NTV) scaling on NoC architectures. Lowering the operating voltage saves energy, but it also increases the susceptibility of devices to faults and compromises reliability. In this task, we proposed RETUNES: Reliable and Energy-efficient NoC, in which NTV scaling is uniquely combined with enhanced reliability for on-chip communication. RETUNES uses multiple voltage modes to manage congestion and energy efficiency of the network.
Sixth, we proposed IntelliNoC, an intelligent NoC design framework that introduces architectural innovations and uses reinforcement learning to manage the design complexity and simultaneously optimize performance, energy efficiency and reliability in a holistic manner. IntelliNoC integrates three NoC architectural techniques, namely (1) multi-function adaptive channels (MFCs) to improve energy efficiency, (2) adaptive error detection/correction and retransmission control to enhance reliability, and (3) a stress-relaxing bypass feature that dynamically powers off NoC components to prevent overheating and fatigue.
To handle the complex dynamic interactions induced by these techniques, we trained a dynamic control policy using Q-learning, with the goal of providing improved fault tolerance and performance while reducing power consumption and area overhead.
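For readers unfamiliar with Q-learning, the following is a minimal tabular sketch of the kind of control loop such a policy implies. The action names, the state labels and the reward values are assumptions made for illustration; the report does not specify IntelliNoC's actual state encoding, action set or reward function.

```python
import random
from collections import defaultdict

# Hypothetical per-router operating modes (not taken from the report).
ACTIONS = ("low_power", "strong_ecc", "bypass")


class QController:
    """Tabular Q-learning with epsilon-greedy action selection."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor
        self.epsilon = epsilon    # exploration probability
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.rng = random.Random(seed)

    def choose(self, state):
        """Explore with probability epsilon, otherwise act greedily."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Standard update: Q += alpha * (r + gamma * max Q' - Q)."""
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])


# Usage: states here are coarse, made-up labels for router conditions.
ctrl = QController(alpha=0.5, gamma=0.9, epsilon=0.0)
ctrl.update("hot", "bypass", reward=1.0, next_state="cool")
ctrl.update("cool", "low_power", reward=0.0, next_state="hot")
print(ctrl.choose("hot"))
```

In a real design the reward would presumably combine measured latency, power and error statistics, and the table would be kept small enough to fit in router hardware; both points are assumptions here rather than details from the report.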
Publications:
- Dominic DiTomaso, Avinash Kodi, Razvan Bunescu and Ahmed Louri, "Resilient and Power-Efficient Multi-Function Channel Buffers in Network-on-Chips (NoCs) using Machine Learning," IEEE Transactions on Computers (TC), vol. 26, no. 12, pp. 3289-3302, December 2015.
- Juman Alshraiedeh and Avinash Kodi, "An Adaptive Routing Algorithm to Improve Lifetime Reliability in NoC Architectures," IEEE Defect and Fault Tolerance in VLSI and Nanotechnology Symposium (DFT), Storrs, CT, September 19-20, 2016.
- Dominic DiTomaso, Travis Boraten, Avinash Kodi and Ahmed Louri, "Predicting and Mitigating Faults in NoCs using Machine Learning," IEEE/ACM International Conference on Microarchitecture (MICRO-49), Taipei, Taiwan, October 15-19, 2016.
- Travis Boraten and Avinash Kodi, "Runtime Fault Tolerant Techniques to Mitigate Soft Errors in Network-on-Chips (NoCs) Architectures," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems (TCAD), vol. 37, no. 3, pp. 682-695, March 2018.
- Padmaja Bhamidipati and Avinash Karanth, "Reliable and Power-Efficient Network-on-Chips Using Voltage Scaling Techniques," 33rd International Conference on Computer Design (ICCD), Orlando, Florida, October 7-10, 2018.
- Ke Wang, Ahmed Louri, Avinash Karanth, and Razvan Bunescu, "IntelliNoC: A Holistic Framework for Energy-Efficient and Reliable On-Chip Communication for Manycores," Accepted to appear in 45th IEEE International Symposium on Computer Architecture (ISCA), Phoenix, AZ, June 22-26, 2019.
Last Modified: 12/09/2019
Modified by: Avinash Karanth