
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: | |
Initial Amendment Date: | June 3, 2019 |
Latest Amendment Date: | June 3, 2019 |
Award Number: | 1916736 |
Award Instrument: | Standard Grant |
Program Manager: | Wei Ding, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | June 15, 2019 |
End Date: | May 31, 2022 (Estimated) |
Total Intended Award Amount: | $164,729.00 |
Total Awarded Amount to Date: | $164,729.00 |
Funds Obligated to Date: | |
History of Investigator: | |
Recipient Sponsored Research Office: | 1000 HILLTOP CIR, BALTIMORE, MD, US 21250-0001, (410) 455-3140 |
Sponsor Congressional District: | |
Primary Place of Performance: | CSEE, 1000 Hilltop Circle, Baltimore, MD, US 21250-0001 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: | |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
In many scientific domains, from healthcare to astronomy, our ability to gather data far outstrips our ability to analyze it. Most data analysis algorithms require all of the data to be available at one central location, but that is not always possible due to either the sheer size of the data or, as in healthcare, privacy concerns. The goal of this project is to develop data analysis algorithms that can be run on distributed datasets, where different physical locations contain a subset of the data. Applications include medical diagnostic tools that are more accurate because they are based on significantly larger datasets than is currently possible, and crowdsourcing data analysis tasks by allowing anyone with some spare compute capacity to participate in a global-scale computation.
The project has two aims. The first is to design and implement an ontologically backed Deep Learning Description Language (DL2) for representing all phases of deep learning, including model structure, hyperparameters, and training methods. DL2 will serve as an interlingua between deep learning frameworks, regardless of the hardware architecture on which they run, to support model sharing, primarily in service of truly distributed learning. The ontological underpinnings of DL2 will support, among other things, explicit reasoning about framework compatibility when sharing models; a "model zoo" that is open to all, not just users of a specific framework; and the ability to formulate semantic queries against model libraries to, for example, find similar models. The second aim is to design, implement, and thoroughly evaluate a number of truly distributed algorithms for deep learning that leverage DL2 for model sharing. Existing approaches to distributed machine learning rely on distributed algorithms that exchange shallow, compact models that are orders of magnitude smaller than modern deep networks, leading to interesting challenges in adapting distributed averaging to deep learning.
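As a rough illustration of the second aim, the sketch below shows the basic distributed-averaging step in PyTorch: each site trains its own copy of a model, and only the parameters, never the data, are combined. This is not the project's DL2-based implementation; the function name average_models and the use of PyTorch's Python API are illustrative assumptions.

```python
# Minimal sketch (assumed, not the project's code) of coordinate-wise
# model averaging: sites train locally, then only parameters are merged.
import copy
import torch


def average_models(site_models):
    """Return a model whose parameters are the element-wise mean of the
    parameters of the locally trained models in `site_models`."""
    merged = copy.deepcopy(site_models[0])
    merged_state = merged.state_dict()
    for name in merged_state:
        stacked = torch.stack(
            [m.state_dict()[name].float() for m in site_models]
        )
        merged_state[name] = stacked.mean(dim=0)
    merged.load_state_dict(merged_state)
    return merged


# Example: combine two site-local models into one shared model.
# global_model = average_models([site_a_model, site_b_model])
```

In a truly distributed setting the averaging itself would also be decentralized, with each node repeatedly averaging with its neighbors rather than reporting to a single coordinator; adapting that step to models as large as modern deep networks is part of the challenge the abstract refers to.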
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Modern machine learning systems rely on lots of data and lots of compute power. Sometimes organizations have data that they do not want to share, such as medical data maintained by hospitals, but they would like to use it collectively with other such organizations to build models. This work developed approaches to machine learning in such settings with the goal of building models that are as good as those that would be learned if all of the data was available at a central location.
This grant funded two PhD students and, for a summer, one undergraduate student who was part of the LSAMP program at UMBC, which seeks to "significantly increase the numbers of (minority) students matriculating into and successfully completing high-quality degree programs in science, technology, engineering and mathematics (STEM) disciplines in order to diversify the STEM workforce."
In this project we showed that it is possible to enlist citizen scientists for large-scale dataset construction in machine learning using distributed computing. Constructing datasets of this size requires a large amount of parallel computation. To facilitate this, we leveraged the open source BOINC distributed computing platform and created the Machine Learning Comprehension at Home (MLC@Home) project. Through this project, we enlisted the help of thousands of volunteers who donate their home computer resources to the project to further scientific causes. Other well-known BOINC projects include SETI@Home and World Community Grid. Volunteers install a unified BOINC client, then choose which projects to donate their computer's resources to. This client a) downloads "work units" from a project's server, b) performs the work on behalf of the project in the background of the user's system when idle, and c) uploads the results to the project server. MLC@Home is the first BOINC project dedicated to machine learning research. MLC@Home's BOINC-enabled application is built using PyTorch's C++ API [13], and supports Windows and Linux platforms with AMD64, ARM, and AARCH64 CPUs and (optionally) NVidia and AMD GPUs. Computations are intentionally set to 32-bit floating point to keep the computations uniform across CPUs and GPUs. MLC's application is open source and available online. As of this writing, MLC@Home has received support from over 2,200 volunteers and 8,000 separate computers, and those numbers are growing every day. These volunteers have trained over 750,000 neural networks. This infrastructure and example show that other disciplines can build large datasets of this form quickly and cheaply.
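The report notes that computations are deliberately pinned to 32-bit floating point so that results stay comparable across the heterogeneous CPUs and GPUs volunteers contribute. A minimal sketch of how that can be done in PyTorch's Python API is below; the actual MLC@Home client is written against the C++ API, so this is illustrative only.

```python
import torch

# Keep every computation in IEEE 32-bit floats so CPU and GPU work
# units produce comparable results across volunteers' machines.
torch.set_default_dtype(torch.float32)

# Disable TF32 fast paths on NVIDIA GPUs, which would otherwise trade
# precision for speed and let GPU results drift from CPU results.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```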
Scientifically, we developed methods for ensuring that complex neural networks learned from different data at different sites can be combined into a single network, which is non-trivial given that training a single network on the same data can lead to very different outcomes in terms of the learned weights. We also worked with a professor and one of his PhD students to understand how our approach can work when some of the compute nodes are quantum, not classical. That professor has implemented algorithms for deep learning (though with much smaller networks) on the D-Wave quantum computer. That exploration is the first that we know of that tries to leverage both types of computation in a single distributed learning effort.
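To see why combining networks trained at different sites is non-trivial, note that two networks can learn essentially the same hidden units in a different order, so naive weight averaging mixes unrelated units. The hypothetical sketch below aligns the hidden units of a single layer by weight similarity before averaging; it illustrates the problem rather than the method developed in this project, and align_hidden_units is an invented name.

```python
# Hypothetical illustration: match hidden units of two first-layer
# weight matrices before averaging them, since unit order is arbitrary.
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_hidden_units(w_ref, w_other):
    """Permute the rows (hidden units) of `w_other` so each row is
    matched to its most similar row in `w_ref`.  Both matrices have
    shape (hidden_units, inputs)."""
    # cost[i, j] = distance between unit i of the reference network
    # and unit j of the other network
    cost = np.linalg.norm(w_ref[:, None, :] - w_other[None, :, :], axis=2)
    _, perm = linear_sum_assignment(cost)
    return w_other[perm]


def merge_first_layer(w_ref, w_other):
    """Average the first-layer weights only after aligning the units."""
    return 0.5 * (w_ref + align_hidden_units(w_ref, w_other))
```

A full treatment would also permute the columns of the following layer to match the reordered units and extend the matching across all layers, which is part of what makes merging deep networks harder than merging shallow models.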
Last Modified: 05/11/2023
Modified by: Tim Oates