NSF Org: CCF Division of Computing and Communication Foundations
Recipient:
Initial Amendment Date: July 22, 2017
Latest Amendment Date: July 22, 2017
Award Number: 1725456
Award Instrument: Standard Grant
Program Manager: Danella Zhao, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2017
End Date: August 31, 2022 (Estimated)
Total Intended Award Amount: $520,000.00
Total Awarded Amount to Date: $520,000.00
Funds Obligated to Date:
History of Investigator:
Recipient Sponsored Research Office: 2200 W MAIN ST, DURHAM, NC, US 27705-4640, (919) 684-3030
Sponsor Congressional District:
Primary Place of Performance: Hudson Hall, Durham, NC, US 27708-0001
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): PPoSS-PP of Scalable Systems
Primary Program Source:
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
In light of recent breakthroughs in unsupervised learning algorithms (e.g., generative adversarial networks and dual learning) and the emergence of their applications, three PIs/co-PIs from Duke and UCSB formed a team to design Ula! - an integrated DNN acceleration framework with enhanced unsupervised learning capability. The project revolutionizes DNN research by introducing an integrated unsupervised learning computation framework with three vertically integrated components spanning software (algorithm), hardware (computing), and application (realization). The project echoes the call of the BRAIN Initiative (2013) and the White House's Nanotechnology-Inspired Grand Challenge for Future Computing (2015). The research outcomes will benefit both the Computational Intelligence (CI) and Computer Architecture (CA) industries at large by introducing a synergy between computing paradigms and artificial intelligence (AI). The corresponding education components enhance existing curricula and pedagogy by introducing interdisciplinary modules on software/hardware co-design for AI with creative teaching practices, and give special attention to women and underrepresented minority groups.
The project performs three tasks: (1) at the software level, a generalized hierarchical decision-making (GHDM) system is designed to efficiently execute state-of-the-art unsupervised learning and reinforcement learning processes at substantially reduced computation cost; (2) at the hardware level, a novel DNN computing paradigm is designed with enhanced unsupervised learning support, based on novelties in near-data computing, GPU architecture, and FPGA + heterogeneous platforms; (3) at the application level, the use of Ula! is exploited in scenarios that can greatly benefit from unsupervised learning and reinforcement learning. The developed techniques are also demonstrated and evaluated on three representative computing platforms: GPUs, FPGAs, and emerging nanoscale computing systems, respectively.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
We developed Ula! - an integrated Deep Neural Network (DNN) acceleration framework with advanced learning capability. Ula! consists of three vertically integrated components at the algorithm, hardware, and application levels: 1) reinforcement/supervised-learning-enhanced unsupervised learning; 2) a novel DNN computing paradigm; and 3) technology applications and realizations.
Generative Adversarial Networks (GANs) have recently drawn tremendous attention as a powerful tool for improving the performance of unsupervised and semi-supervised DNN models. While GANs deliver state-of-the-art performance on these AI tasks, this comes at the cost of high computational complexity. We proposed ReGAN - a novel ReRAM-based Process-In-Memory (PIM) accelerator that efficiently reduces off-chip memory accesses. Two techniques, Spatial Parallelism and Computation Sharing, were proposed to further enhance the training efficiency of GANs.
Model compression is significant for the wide adoption of Recurrent Neural Networks (RNNs), both in user devices with limited resources and in business clusters that must respond quickly to large-scale service requests. To alleviate the computational cost of RNNs, we proposed to learn structurally sparse Long Short-Term Memory (LSTM) networks by reducing the sizes of the basic structures within LSTM units, including input updates, gates, hidden states, cell states, and outputs. During training, removing a component of the Intrinsic Sparse Structures (ISS) in an LSTM simultaneously decreases the sizes of all basic structures by one, thereby always maintaining dimension consistency.
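The shrink-everything-by-one property of ISS removal can be illustrated with a small NumPy sketch. This is illustrative only, not the authors' implementation; the fused-weight layout (four stacked gate blocks of H rows over D input + H recurrent columns) is an assumption made for the example.

```python
import numpy as np

def remove_iss_component(W, b, D, H, h):
    """Remove hidden unit h from fused LSTM parameters (sketch).

    W: (4H, D+H) weights for the four gate blocks; b: (4H,) bias.
    Deleting one ISS component removes the matching row in every gate
    block and the matching recurrent column, so all basic structures
    shrink by one and dimensions stay consistent.
    """
    rows = [g * H + h for g in range(4)]   # one row per gate block
    W = np.delete(W, rows, axis=0)
    W = np.delete(W, D + h, axis=1)        # drop the recurrent input column
    b = np.delete(b, rows)
    return W, b

D, H = 8, 5
W = np.random.randn(4 * H, D + H)
b = np.random.randn(4 * H)
W2, b2 = remove_iss_component(W, b, D, H, h=2)
print(W2.shape, b2.shape)   # (16, 12) (16,)
```

Note that both the gate dimension (4H) and the recurrent input dimension (D+H) shrink together, which is exactly the consistency property the paragraph describes.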
Process-in-memory (PIM) architectures such as the Hybrid Memory Cube (HMC) have been used to improve data locality for efficient DNN execution. However, it is still hard to efficiently deploy the large-scale matrix computations of DNNs on HMC because of its coarse-grained packet protocol. We proposed NeuralHMC, the first HMC-based accelerator tailored for efficient DNN execution.
Existing nonvolatile memory-based machine learning accelerators cannot support the computational needs of GAN training. Specifically, the generator uses a new operator, called transposed convolution, which introduces significant resource underutilization when executed on conventional neural network accelerators because it inserts massive numbers of zeros into its input before a convolution operation. We proposed ZARA - a novel zero-free dataflow accelerator for Generative Adversarial Networks in 3D ReRAM.
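The zero-insertion behavior of transposed convolution that motivates ZARA can be sketched in a few lines of NumPy. This is a 1-D illustration of the operator's input expansion, not ZARA's dataflow.

```python
import numpy as np

def zero_inserted_input(x, stride):
    """Insert (stride - 1) zeros between neighboring input elements,
    as a strided transposed convolution does before its internal
    convolution; computing on these zeros is the resource
    underutilization a zero-free dataflow avoids."""
    n = len(x)
    out = np.zeros(n + (n - 1) * (stride - 1), dtype=x.dtype)
    out[::stride] = x
    return out

x = np.array([1.0, 2.0, 3.0])
print(zero_inserted_input(x, stride=2))   # [1. 0. 2. 0. 3.]
```

With stride 2, nearly half of the expanded input is zeros; in 2-D the zero fraction approaches three quarters, which is why skipping them matters.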
Stochastic Gradient Descent (SGD) is a popular training method for DNNs because of its efficiency. However, large-batch SGD tends to converge to sharp minima of the loss surface. We proposed the SmoothOut framework to smooth out sharp minima and thereby improve the generalization of DNNs. In particular, SmoothOut perturbs multiple copies of a DNN by noise injection and averages these copies. Our experimental results showed that SmoothOut improves generalization in both small-batch and large-batch training on top of state-of-the-art solutions.
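The smoothing idea can be sketched in NumPy on a toy loss. This is an illustration of the perturb-and-average principle, not the paper's training procedure; the toy loss, noise range, and number of copies are assumptions.

```python
import numpy as np

def smoothout_loss(loss_fn, w, num_copies=8, a=0.05, rng=None):
    """SmoothOut-style smoothing (sketch): evaluate several copies of
    the weights perturbed by uniform noise in [-a, a] and average the
    losses, which flattens sharp minima of the original surface."""
    rng = rng or np.random.default_rng(0)
    losses = [loss_fn(w + rng.uniform(-a, a, size=w.shape))
              for _ in range(num_copies)]
    return float(np.mean(losses))

# Toy loss with a very narrow (sharp) minimum at the origin.
def sharp(w):
    return float(-np.exp(-np.sum(w ** 2) / 1e-4))

w0 = np.zeros(2)
# The smoothed loss at the sharp minimum is much higher than the raw
# loss there, i.e. smoothing penalizes sharp minima.
print(sharp(w0), smoothout_loss(sharp, w0))
```

A flat, wide minimum would change little under the same perturbation, which is why optimizing the smoothed objective steers training toward flatter solutions.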
Previous work on Convolutional Neural Network (CNN) acceleration uses low-rank approximations of the original convolution layers to reduce computation cost. However, these methods are difficult to apply to sparse models, which limits execution speedup because redundancies within the CNN model are not fully exploited. We argue that kernel-granularity decomposition can be conducted under a low-rank assumption while still exploiting the redundancy within the remaining compact coefficients. Based on this observation, we proposed PENNI, a CNN model compression framework that achieves model compactness and hardware efficiency simultaneously.
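The kernel-granularity decomposition can be sketched with a truncated SVD in NumPy. This is a sketch under our own layout assumptions, not the released PENNI code; the subsequent coefficient sparsification and retraining steps are omitted.

```python
import numpy as np

def decompose_kernels(W, rank):
    """Fit a small shared kernel basis via truncated SVD (sketch).

    W: (C_out, C_in, k, k) conv weights.
    Returns coeffs: (C_out*C_in, rank) per-kernel coefficients and
    basis: (rank, k*k), so every k-by-k kernel becomes a linear
    combination of `rank` shared basis kernels.
    """
    co, ci, k, _ = W.shape
    M = W.reshape(co * ci, k * k)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]

W = np.random.randn(16, 8, 3, 3)
coeffs, basis = decompose_kernels(W, rank=4)
approx = (coeffs @ basis).reshape(W.shape)
print(coeffs.shape, basis.shape, approx.shape)
```

Sparsifying `coeffs` afterwards is what lets this scheme combine low-rank structure with sparsity, instead of having to choose one or the other.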
Depth is a key attribute of DNNs. However, choosing depth is heuristic and requires substantial human effort. We proposed AutoGrow to automate depth discovery in DNNs: starting from a shallow seed architecture, AutoGrow adds new layers as long as the growth improves accuracy; otherwise, it stops growing and thus discovers the depth. Our experiments showed that, applying the same policy to different network architectures, AutoGrow can always discover near-optimal depth on various datasets.
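The growing policy can be sketched as a simple loop. The `train_eval` callback and the stopping threshold are assumptions made for illustration; the actual AutoGrow grows and continues training a single network rather than retraining candidates from scratch.

```python
def autogrow(train_eval, max_depth=20, eps=1e-3):
    """AutoGrow-style sketch: keep adding a layer while it improves
    accuracy by more than eps; stop otherwise and report the
    discovered depth. `train_eval(depth)` is an assumed callback that
    trains a network of that depth and returns validation accuracy."""
    depth, best = 1, train_eval(1)
    while depth < max_depth:
        acc = train_eval(depth + 1)
        if acc <= best + eps:      # growth no longer helps: stop
            break
        depth, best = depth + 1, acc
    return depth, best

# Toy accuracy curve that saturates once depth reaches 4.
def curve(depth):
    return min(0.9, 0.5 + 0.1 * depth)

depth, acc = autogrow(curve)
print(depth, acc)   # growth stops once accuracy saturates
```

On the toy curve the loop stops at the depth where accuracy plateaus, which is the behavior the paragraph describes for real architectures and datasets.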
In modern GPUs, on-chip memory capacity keeps increasing to support thousands of chip-resident threads. This capacity, however, is highly constrained by the large memory-cell area and high static power consumption of conventional SRAM implementations. We proposed to use the emerging multi-level cell (MLC) spin-transfer torque RAM (STT-RAM) technology to implement register files and shared memory in GPUs.
Federated learning (FL) is a popular distributed learning framework that trains a global model through iterative communication between a central server and edge devices. We empirically showed that under extremely strong poisoning attacks, existing defensive methods fail to guarantee the robustness of FL. More importantly, we observed that once the global model is polluted, the impact of an attack persists in subsequent rounds even if no further attacks occur. We proposed a client-based defense, named White Blood Cell for Federated Learning (FL-WBC), which can mitigate model poisoning attacks that have already polluted the global model.
This 5-year research project supported 7 graduate students and produced 20 conference papers and 6 journal papers in total.
Last Modified: 11/23/2022
Modified by: Yiran Chen