
NSF Org: |
CNS Division Of Computer and Network Systems |
Recipient: |
|
Initial Amendment Date: | May 8, 2018 |
Latest Amendment Date: | August 8, 2022 |
Award Number: | 1751009 |
Award Instrument: | Continuing Grant |
Program Manager: |
Ann Von Lehmen
CNS Division Of Computer and Network Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | May 15, 2018 |
End Date: | April 30, 2024 (Estimated) |
Total Intended Award Amount: | $628,029.00 |
Total Awarded Amount to Date: | $628,029.00 |
Funds Obligated to Date: |
FY 2019 = $128,134.00 FY 2020 = $119,913.00 FY 2021 = $123,531.00 FY 2022 = $127,279.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
77 MASSACHUSETTS AVE CAMBRIDGE MA US 02139-4301 (617)253-1000 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
77 Massachusetts Ave. Cambridge MA US 02139-4307 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Networking Technology and Syst |
Primary Program Source: |
01001920DB NSF RESEARCH & RELATED ACTIVIT 01002021DB NSF RESEARCH & RELATED ACTIVIT 01002122DB NSF RESEARCH & RELATED ACTIVIT 01002223DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Modern networks require sophisticated systems and algorithms to manage resources efficiently and deliver high quality of experience to users. These systems are critical to services society has come to rely on, from video streaming to social networks to AI applications. Video streaming, for example, involves numerous systems that control everything from the resolution of the video to the network path and the video download speed based on dynamic network conditions. As networks and applications have become more complex, existing approaches have become inadequate and designing algorithms that deliver high performance in all conditions has become exceedingly difficult. The goal of this research is to address this challenge by developing network systems that learn to manage resources automatically through experience by applying new machine learning techniques. This new paradigm, if successful, will make networks simpler to design, more efficient and cost effective, and able to deliver better services to businesses and consumers.
This project's goal is to develop the algorithmic and systems foundations for designing resource management systems that use modern reinforcement learning and other predictive control techniques to achieve strong performance across heterogeneous networks and applications. To this end, the researchers plan to build a series of practical systems for important applications, including schedulers for cluster computing systems (e.g., for data-parallel analytics workloads), and context-aware network control protocols (e.g., for adaptive streaming of 360-degree virtual reality video). In building these systems, the researchers will tackle fundamental challenges that confront data-driven network resource management, including (i) techniques to represent workloads (e.g., graph-structured jobs) and networks (e.g., topologies, queues, flows) to facilitate learning using neural networks; (ii) techniques to handle challenging resource management problems with large and deep action spaces; (iii) techniques to efficiently collect data across a myriad of devices for learning control models; and (iv) techniques to bootstrap learning models from data collected offline and continually train models safely after deployment.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Today's networks rely on many human-engineered control and resource management heuristics, but the complexity and heterogeneity of modern systems make it difficult to design effective heuristics for different settings. This project developed principled approaches for building systems that autonomously learn to make resource management decisions and adapt to various environments without human intervention.
Intellectual Merit
Learning-based Systems and Benchmarks
Pensieve: A system that uses reinforcement learning (RL) to adapt video bitrate algorithms based on the performance of past choices, tailoring control policies to network and video characteristics.
Decima: An RL-based job scheduler for data processing clusters that learns workload-specific scheduling policies for jobs represented as dataflow graphs. Decima's learned policies outperform standard heuristics, providing significant cost savings by enabling large-scale data processing clusters to run at higher utilization. To build Decima, we developed new techniques, including a variance reduction method for robust RL training in environments with stochastic input processes.
Other contributions in this area include:
- Placeto: An RL system for learning generalizable task placement strategies for parallelizing neural network training jobs on multiple GPUs.
- Neo, Bao, Flow-loss: Learning-based relational query optimizers and robust learned cardinality estimation for database management systems.
- Park: An open benchmark suite for research on learning-augmented computer systems.
Advances in Simulation and Modeling
CausalSim: Addresses a fundamental problem in trace-driven simulation: replaying observed traces, such as network throughput measurements, can be flawed because the simulated interventions (such as new protocols) could have changed those traces. CausalSim uses a novel causal inference algorithm to learn the relationship between interventions and trace observations, modifying the trace during simulation to reflect the effect of interventions.
m3: A scale-free, fast, and accurate model for estimating the tail latency of network transfers in datacenters. m3 trains a neural network to approximate the packet-level simulation function, using a simple flow-level simulator to generate a compact feature map capturing the important characteristics of a network scenario.
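The trace bias that CausalSim targets can be illustrated with a toy example. The sketch below is not CausalSim's algorithm: it assumes, purely for illustration, that achieved throughput is a latent path capacity multiplied by a per-protocol efficiency, and that the efficiency values are known in advance (CausalSim learns the relationship from data). The names `EFFICIENCY`, `naive_replay`, and `counterfactual_replay` are hypothetical.

```python
# Toy illustration of trace bias; this is NOT CausalSim's algorithm.
# Assumption: achieved throughput = latent path capacity * protocol efficiency.
# The efficiency values are hypothetical and assumed known in advance.
EFFICIENCY = {"logged_protocol": 0.6, "new_protocol": 0.9}


def naive_replay(observed_trace):
    """Replay the logged trace unchanged, ignoring the intervention."""
    return list(observed_trace)


def counterfactual_replay(observed_trace, logged, new):
    """Undo the logged protocol's effect, then apply the new protocol's."""
    latent = [x / EFFICIENCY[logged] for x in observed_trace]
    return [c * EFFICIENCY[new] for c in latent]


capacity = [10.0, 20.0, 5.0]  # latent path capacity (Mbps), never observed
observed = [c * EFFICIENCY["logged_protocol"] for c in capacity]
truth = [c * EFFICIENCY["new_protocol"] for c in capacity]

naive = naive_replay(observed)
corrected = counterfactual_replay(observed, "logged_protocol", "new_protocol")

# Worst-case gap between each simulated trace and the true outcome.
naive_error = max(abs(n - t) for n, t in zip(naive, truth))
corrected_error = max(abs(c - t) for c, t in zip(corrected, truth))
print(f"naive error: {naive_error:.2f} Mbps")
print(f"corrected error: {corrected_error:.2f} Mbps")
```

In this toy setting the naive replay underestimates what the new protocol would have achieved, while adjusting the trace for the intervention recovers it. The hard part in practice is that neither the latent capacity nor the efficiency relationship is known.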
Continual, Online Learning in Systems and Their Applications
Many systems must operate in non-stationary environments where workload or operating conditions change over time. Learning-based systems can adapt their models over time to continually optimize for current conditions, enabling the system to leverage simpler models that perform well over a narrower set of inputs and enhancing robustness by reducing reliance on training data coverage.
MMNet: Revisits MIMO detection from an online learning perspective, using a neural network architecture based on iterative soft-thresholding algorithms and an online training algorithm that exploits the locality of channel matrices, accelerating training by more than two orders of magnitude.
AMS: Offloads retraining of specialized DNNs for video analytics to a remote server that continually adapts a small student model running on the edge device for real-time video. AMS introduces techniques to reduce the bandwidth required for over-the-network model adaptation.
RECL: Reduces compute resources required for model retraining in video analytics by reusing specialized DNNs whenever possible, enabling systems to scale to more video streams with limited resources.
SRVC: Combines existing video compression algorithms with a lightweight, content-adaptive super-resolution (SR) neural network, significantly boosting performance with low computation cost. SRVC compresses input video into two bitstreams: a content stream and a model stream, dynamically specializing the SR network for short video segments to achieve significant compression efficiency.
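MMNet, described above, builds its neural network on iterative soft-thresholding algorithms. For readers unfamiliar with that family, here is a minimal sketch of the classical iterative soft-thresholding algorithm (ISTA) for L1-regularized least squares. It is not MMNet's architecture or its online training procedure; the function names and the conservative step-size choice are assumptions made for this illustration.

```python
def soft_threshold(value, threshold):
    """Shrink a scalar toward zero by `threshold` (proximal step for L1)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def ista(A, y, lam, iterations=200):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 with classical ISTA."""
    rows, cols = len(A), len(A[0])
    # Assumption: a conservative step size. The squared Frobenius norm
    # upper-bounds the largest eigenvalue of A^T A, so this step is safe.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * cols
    for _ in range(iterations):
        residual = [
            sum(A[i][j] * x[j] for j in range(cols)) - y[i] for i in range(rows)
        ]
        gradient = [
            sum(A[i][j] * residual[i] for i in range(rows)) for j in range(cols)
        ]
        # Gradient step on the quadratic term, then shrinkage for the L1 term.
        x = [
            soft_threshold(x[j] - step * gradient[j], step * lam)
            for j in range(cols)
        ]
    return x


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
estimate = ista(identity, [3.0, 0.5, -2.0], lam=1.0)
print([round(v, 4) for v in estimate])
```

With an identity matrix the minimizer is simply the soft-thresholded observation, which makes the result easy to check by hand. Learned variants replace fixed quantities such as the step size and threshold with trainable parameters.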
Broader Impacts
This project resulted in 20 technical papers published at top venues in the networking, systems, and machine learning communities. Projects like Pensieve, Decima, and Neo are among the earliest successes in learning-based systems, spurring considerable follow-on research and serving as benchmarks for evaluating new ideas in learned systems.
Our research on learning-based systems has led to broader results in machine learning. For instance, in Decima, the stochastic nature of real-world workloads creates significant variance in the reward signal, making it difficult for RL algorithms to assess the quality of different actions. Decima introduced input-specific baselines to tackle this problem, a concept we generalized for Input-driven Markov Decision Processes where an exogenous, stochastic input process impacts system dynamics. CausalSim led to a theoretical study of causal models with bijective generation mechanisms (BGMs), proving counterfactual identifiability of BGMs in several settings and proposing a practical learning algorithm that generalizes CausalSim's method.
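The benefit of an input-specific baseline can be seen in a one-step toy problem. The sketch below is not Decima's training code. It assumes a two-action policy, a reward equal to an exogenous random input plus a small action effect, and baselines computed in closed form; every constant and function name is hypothetical. Because the input does not depend on the action, subtracting a baseline that depends on the input leaves the policy-gradient estimate unbiased while removing the input's variance.

```python
import math
import random
import statistics

# Hypothetical toy setup: one decision, two actions, reward = input + effect.
THETA = 0.5                      # policy parameter, P(action=1) = sigmoid(THETA)
EFFECT = {0: 0.0, 1: 1.0}        # how much each action adds to the reward
INPUT_MEAN, INPUT_STD = 10.0, 5.0  # exogenous input, independent of the action


def prob_action_one(theta):
    return 1.0 / (1.0 + math.exp(-theta))


def grad_log_prob(theta, action):
    """Derivative of log pi(action) with respect to theta."""
    p1 = prob_action_one(theta)
    return 1.0 - p1 if action == 1 else -p1


def gradient_samples(theta, n, seed=0):
    """Return per-sample gradient estimates under two baselines."""
    rng = random.Random(seed)
    p1 = prob_action_one(theta)
    mean_effect = p1 * EFFECT[1] + (1.0 - p1) * EFFECT[0]
    fixed, input_specific = [], []
    for _ in range(n):
        z = rng.gauss(INPUT_MEAN, INPUT_STD)   # stochastic input
        action = 1 if rng.random() < p1 else 0
        reward = z + EFFECT[action]
        score = grad_log_prob(theta, action)
        # Baseline 1: a single constant, the overall expected reward.
        fixed.append(score * (reward - (INPUT_MEAN + mean_effect)))
        # Baseline 2: the expected reward given this particular input.
        input_specific.append(score * (reward - (z + mean_effect)))
    return fixed, input_specific


fixed, input_specific = gradient_samples(THETA, 20000)
true_gradient = prob_action_one(THETA) * (1.0 - prob_action_one(THETA))

print("true gradient:", round(true_gradient, 4))
print("variance, fixed baseline:", round(statistics.variance(fixed), 4))
print("variance, input-specific:", round(statistics.variance(input_specific), 4))
```

Both estimators average to the same gradient, but the input-specific one has far lower variance here because the noisy input cancels out of every sample. In a real system the conditional expectation is not available in closed form and has to be estimated.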
This project has helped several graduate students and postdocs. Five PhD students have moved into the technology industry, and one postdoc now holds a faculty position.
Last Modified: 06/29/2024
Modified by: Mohammad Alizadeh