
NSF Org: CCF Division of Computing and Communication Foundations
Initial Amendment Date: July 2, 2018
Latest Amendment Date: June 22, 2020
Award Number: 1763747
Award Instrument: Continuing Grant
Program Manager: Almadena Chtchelkanova, achtchel@nsf.gov, (703) 292-7498, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2018
End Date: June 30, 2023 (Estimated)
Total Intended Award Amount: $1,199,849.00
Total Awarded Amount to Date: $1,199,849.00
Funds Obligated to Date: FY 2020 = $441,249.00
Recipient Sponsored Research Office: 3720 S FLOWER ST FL 3, LOS ANGELES, CA, US 90033, (213) 740-7762
Primary Place of Performance: 3740 McClintock Avenue, Los Angeles, CA, US 90089-2565
NSF Program(s): Special Projects - CCF; Software & Hardware Foundation
Primary Program Source: 01002021DB NSF RESEARCH & RELATED ACTIVIT
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Machine learning systems are critical drivers of new technologies such as near-perfect automatic speech recognition, autonomous vehicles, computer vision, and natural language understanding. The underlying inference engine for many of these systems is based on neural networks. Before a neural network can be used for these inference tasks, it must be trained using a data corpus of known input-output pairs. This training process is very computationally intensive, with current systems requiring weeks to months of time on graphics processing units (GPUs) or central processing units in the cloud. As more data becomes available, this problem of long training time is further exacerbated because larger, more effective network models become desirable. The theoretical understanding of neural networks is limited, so experimentation and empirical optimization remain the primary tools for understanding deep neural networks and innovating in the field. However, the ability to conduct larger-scale experiments is becoming concentrated among a few large entities with the necessary financial and computational resources. Even for those with such resources, the painfully long experimental cycle for training neural networks means that large-scale searches and optimizations over the neural network model structure are not performed. The ultimate goal of this research project is to democratize and distribute the ability to conduct large-scale neural network training and model optimizations at high speed, using hardware accelerators. Reducing the training time from weeks to hours will allow researchers to run many more experiments, gaining insight into the fundamental inner workings of deep learning systems. The hardware accelerators are also much more energy efficient than the existing GPU-based training paradigm, so advances made in this project can significantly reduce the energy consumption required for neural network training tasks.
This project comprises an interdisciplinary research plan that spans theory, hardware architecture and design, software control, and system integration. A new class of neural networks that have pre-defined sparsity is being explored. These sparse neural networks are co-designed with a flexible, high-speed, energy-efficient hardware architecture that maximizes circuit speed for any model size in a given Field Programmable Gate Array (FPGA) chip. This algorithm-hardware co-design is a key research theme that differentiates this approach from previous research, which enforces sparsity during the training process in a manner incompatible with parallel hardware acceleration. In particular, the proposed architecture operates on all network layers simultaneously, executing forward- and back-propagation in parallel, fully pipelined across layers. With high-precision arithmetic, a speed-up of about 5X relative to GPUs is expected. Using log-domain arithmetic, these gains are expected to increase to 100X or more. Software and algorithms are being developed to manage multiple FPGA boards, simplifying and automating the model search and training process. These algorithms exploit the ability to reconfigure the FPGAs to trade speed for accuracy, a capability lacking in GPUs. These software tools will also serve as a bridge to popular Python libraries used by the machine learning community.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
We developed new methods for advancing the training of large neural networks using application-specific hardware. This contrasts with the conventional approach of using general-purpose processors, such as graphics processing units (GPUs).
Our work was the first to introduce the concept of pre-defined, structured sparsity. Conventional neural networks have fully-connected or dense architectures, but previous research has shown that after training, many of these connections can be disregarded. This results in sparse connectivity patterns that reduce complexity but do not map well to custom, highly parallel circuit architectures. Our work demonstrated that one can pre-define a structured sparse connection pattern and still maintain excellent learning performance. We also showed how one can co-design such a pre-defined, structured sparsity pattern with a highly parallel circuit architecture for training neural networks. This has the potential to significantly reduce the energy cost of large-scale training and/or enable embedded systems to train large neural networks at the edge.
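The idea of pre-defined, structured sparsity can be illustrated with a small sketch. The connection scheme below (each output unit wired to a fixed, evenly spaced subset of inputs) is a hypothetical pattern for illustration, not the project's actual architecture; the key point is that the mask is fixed before training and applied to both weights and gradients, so pruned connections never reappear.

```python
import numpy as np

def structured_mask(n_in, n_out, fan_in):
    """Pre-defined structured sparsity: each output unit connects to a
    fixed set of `fan_in` inputs, chosen before training (hypothetical
    evenly-spaced scheme; assumes n_in is divisible by fan_in)."""
    mask = np.zeros((n_in, n_out))
    for j in range(n_out):
        idx = (j + np.arange(fan_in) * (n_in // fan_in)) % n_in
        mask[idx, j] = 1.0
    return mask

rng = np.random.default_rng(0)
n_in, n_out, fan_in = 8, 4, 2
mask = structured_mask(n_in, n_out, fan_in)
W = rng.standard_normal((n_in, n_out)) * mask  # weights respect the pattern

x = rng.standard_normal(n_in)
y = x @ W  # forward pass only needs fan_in * n_out multiplies

# During training, gradients are masked too, so pruned weights stay zero:
grad = np.outer(x, np.ones(n_out))  # toy dL/dW for illustration
W -= 0.1 * (grad * mask)
```

Because the pattern is data-independent and regular, a hardware datapath can be laid out for exactly `fan_in * n_out` multiply-accumulate units, which is what makes this style of sparsity amenable to parallel circuit architectures.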
Our work also explored automated model search and training hyper-parameter optimization. We released an open-source software package for broader use in the research and industry communities.
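As a rough illustration of automated hyper-parameter search (this toy random search is illustrative only and does not reproduce the project's released package), the core loop samples configurations from a search space and keeps the best one found:

```python
import random

def random_search(objective, space, n_trials=20, seed=0):
    """Toy random hyper-parameter search: sample configs from `space`,
    evaluate `objective` (lower is better), return the best found."""
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(opts) for k, opts in space.items()}
        loss = objective(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

# Hypothetical search space; the objective stands in for
# "train a model with this config, return validation loss".
space = {"lr": [1e-1, 1e-2, 1e-3], "hidden": [64, 128, 256]}
cfg, loss = random_search(
    lambda c: abs(c["lr"] - 1e-2) + c["hidden"] / 1e4, space)
```

In practice the expensive step is the objective itself (a full training run), which is why the project's use of fast FPGA training makes broader searches feasible.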
We also conducted some of the earliest work on log-number-system (LNS) computational approaches for training neural networks, which eliminate costly multiplier circuits. These LNS approaches have the potential to reduce the area and/or energy consumption of training circuitry by a factor of two.
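The reason LNS arithmetic eliminates multiplier circuits can be seen in a few lines: if a nonzero value is stored as a sign plus the log of its magnitude, multiplication reduces to adding the log parts. This is a minimal sketch of the representation, not the project's hardware format; it also glosses over LNS addition, which requires an approximation and is the hard part in real designs.

```python
import math

def to_lns(x):
    """Represent nonzero x as (sign, log2|x|)."""
    return (1 if x >= 0 else -1, math.log2(abs(x)))

def lns_mul(a, b):
    """Multiply in the log domain: signs multiply, log-magnitudes add.
    In hardware this is just an adder, not a multiplier."""
    sa, la = a
    sb, lb = b
    return (sa * sb, la + lb)

def from_lns(v):
    s, l = v
    return s * 2.0 ** l

x, y = 3.0, -0.5
z = from_lns(lns_mul(to_lns(x), to_lns(y)))  # ≈ -1.5
```

Since multiply-accumulate operations dominate neural network training, replacing each multiplier with an adder is where the projected area and energy savings come from.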
Both pre-defined, structured sparsity and LNS approaches have become widely studied topics in the machine learning field and have received significant uptake in industry.
Beyond the research component of our project, this collaboration led to significant advances in the curriculum at the USC Ming Hsieh Department of Electrical and Computer Engineering. Specifically, this project directly led to the creation of four new graduate-level courses in deep learning and software skills for machine learning, the first undergraduate machine learning class in the department, and a significantly revised MS degree program in Machine Learning and Data Sciences.
Last Modified: 12/08/2023
Modified by: Keith M Chugg
Please report errors in award information by writing to: awardsearch@nsf.gov.