Award Abstract # 1751009
CAREER: Data-Driven Network Resource Management Systems

NSF Org: CNS
Division Of Computer and Network Systems
Recipient: MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Initial Amendment Date: May 8, 2018
Latest Amendment Date: August 8, 2022
Award Number: 1751009
Award Instrument: Continuing Grant
Program Manager: Ann Von Lehmen
CNS
 Division Of Computer and Network Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: May 15, 2018
End Date: April 30, 2024 (Estimated)
Total Intended Award Amount: $628,029.00
Total Awarded Amount to Date: $628,029.00
Funds Obligated to Date: FY 2018 = $129,172.00
FY 2019 = $128,134.00
FY 2020 = $119,913.00
FY 2021 = $123,531.00
FY 2022 = $127,279.00
History of Investigator:
  • Mohammad Alizadeh (Principal Investigator)
    alizadeh@csail.mit.edu
Recipient Sponsored Research Office: Massachusetts Institute of Technology
77 MASSACHUSETTS AVE
CAMBRIDGE
MA  US  02139-4301
(617)253-1000
Sponsor Congressional District: 07
Primary Place of Performance: Massachusetts Institute of Technology
77 Massachusetts Ave.
Cambridge
MA  US  02139-4307
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): E2NYLCDML6V1
Parent UEI: E2NYLCDML6V1
NSF Program(s): Networking Technology and Syst
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVIT
01001920DB NSF RESEARCH & RELATED ACTIVIT
01002021DB NSF RESEARCH & RELATED ACTIVIT
01002122DB NSF RESEARCH & RELATED ACTIVIT
01002223DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1045
Program Element Code(s): 736300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Modern networks require sophisticated systems and algorithms to manage resources efficiently and deliver high quality of experience to users. These systems are critical to services society has come to rely on, from video streaming to social networks to AI applications. Video streaming, for example, involves numerous systems that control everything from the resolution of the video to the network path and the video download speed based on dynamic network conditions. As networks and applications have become more complex, existing approaches have become inadequate and designing algorithms that deliver high performance in all conditions has become exceedingly difficult. The goal of this research is to address this challenge by developing network systems that learn to manage resources automatically through experience by applying new machine learning techniques. This new paradigm, if successful, will make networks simpler to design, more efficient and cost effective, and able to deliver better services to businesses and consumers.

This project's goal is to develop the algorithmic and systems foundations for designing resource management systems that use modern reinforcement learning and other predictive control techniques to achieve strong performance across heterogeneous networks and applications. To this end, the researchers plan to build a series of practical systems for important applications, including schedulers for cluster computing systems (e.g., for data-parallel analytics workloads), and context-aware network control protocols (e.g., for adaptive streaming of 360-degree virtual reality video). In building these systems, the researchers will tackle fundamental challenges that confront data-driven network resource management, including (i) techniques to represent workloads (e.g., graph-structured jobs) and networks (e.g., topologies, queues, flows) to facilitate learning using neural networks; (ii) techniques to handle challenging resource management problems with large and deep action spaces; (iii) techniques to efficiently collect data across a myriad of devices for learning control models; and (iv) techniques to bootstrap learning models from data collected offline and continually train models safely after deployment.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 17)
Addanki, Ravichandra and Bojja Venkatakrishnan, Shaileshh and Gupta, Shreyan and Mao, Hongzi and Alizadeh, Mohammad "Placeto: Learning Generalizable Device Placement Algorithms for Distributed Machine Learning" Advances in Neural Information Processing Systems 32 (NIPS 2019), 2019
Alomar, Abdullah and Hamadanian, Pouya and Nasr-Esfahany, Arash and Agarwal, Anish and Alizadeh, Mohammad and Shah, Devavrat "CausalSim: A Causal Framework for Unbiased Trace-Driven Simulation" 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023
Hamadanian, Pouya and Schwarzkopf, Malte and Sen, Siddartha "How Reinforcement Learning Systems Fail and What to do About It" The 2nd Workshop on Machine Learning and Systems (EuroMLSys), 2022
Khani, Mehrdad and Alizadeh, Mohammad and Hoydis, Jakob and Fleming, Phil "Adaptive Neural Signal Detection for Massive MIMO" IEEE Transactions on Wireless Communications, 2020 https://doi.org/10.1109/TWC.2020.2996144
Khani, Mehrdad and Ananthanarayanan, Ganesh and Hsieh, Kevin and Jiang, Junchen and Netravali, Ravi and Shu, Yuanchao and Alizadeh, Mohammad and Bahl, Victor "RECL: Responsive Resource-Efficient Continuous Learning for Video Analytics", 2023
Khani, Mehrdad and Hamadanian, Pouya and Nasr-Esfahany, Arash and Alizadeh, Mohammad "Real-Time Video Inference on Edge Devices via Adaptive Model Streaming" 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021 https://doi.org/10.1109/ICCV48922.2021.00453
Khani, Mehrdad and Sivaraman, Vibhaalakshmi and Alizadeh, Mohammad "Efficient Video Compression via Content-Adaptive Super-Resolution" 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021 https://doi.org/10.1109/ICCV48922.2021.00448
Li, Chenning and Nasr-Esfahany, Arash and Zhao, Kevin and Noorbakhsh, Kimia and Goyal, Prateesh and Alizadeh, Mohammad and Anderson, Thomas "m3: Accurate Flow-Level Performance Estimation using Machine Learning", 2024
Mao, Hongzi and Negi, Parimarjan and Narayan, Akshay and Wang, Hanrui and Yang, Jiacheng and Wang, Haonan and Marcus, Ryan and Addanki, Ravichandra and Khani Shirkoohi, Mehrdad and He, Songtao and Nathan, Vikram and Cangialosi, Frank and Venkatakrishnan, "Park: An Open Platform for Learning-Augmented Computer Systems" Advances in Neural Information Processing Systems 32 (NIPS 2019), 2019
Mao, Hongzi and Schwarzkopf, Malte and Venkatakrishnan, Shaileshh Bojja and Meng, Zili and Alizadeh, Mohammad "Learning scheduling algorithms for data processing clusters" ACM SIGCOMM 2019, 2019 https://doi.org/10.1145/3341302.3342080
Marcus, Ryan and Kipf, Andreas and van Renen, Alexander and Stoia, Mihail and Misra, Sanchit and Kemper, Alfons and Neumann, Thomas and Kraska, Tim "Flow-Loss: Learning Cardinality Estimates That Matter" Proceedings of the VLDB Endowment, v.14, 2021

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Today's networks rely on many human-engineered control and resource management heuristics, but the complexity and heterogeneity of modern systems make it difficult to design effective heuristics for different settings. This project developed principled approaches for building systems that autonomously learn to make resource management decisions and adapt to various environments without human intervention.

Intellectual Merit

Learning-based Systems and Benchmarks

Pensieve: A system that uses reinforcement learning (RL) to adapt video bitrate algorithms based on the performance of past choices, tailoring control policies to network and video characteristics.

Decima: An RL-based job scheduler for data processing clusters that learns workload-specific scheduling policies for jobs represented as dataflow graphs. Decima's learned policies outperform standard heuristics, providing significant cost savings by enabling large-scale data processing clusters to run at higher utilization. To build Decima, we developed new techniques, including a variance reduction method for robust RL training in environments with stochastic input processes.

Other contributions in this area include:

  • Placeto: An RL system for learning generalizable task placement strategies for parallelizing neural network training jobs on multiple GPUs.

  • Neo, Bao, Flow-Loss: Learning-based relational query optimizers and robust learned cardinality estimation for database management systems.

  • Park: An open benchmark suite for research on learning-augmented computer systems.

Advances in Simulation and Modeling

CausalSim: Addresses a fundamental problem in trace-driven simulation, where replaying observed traces, such as network throughput measurements, can be flawed because the simulated interventions (such as new protocols) could have changed the trace. CausalSim uses a novel causal inference algorithm to learn the relationship between interventions and trace observations, modifying the trace during simulation to reflect the effect of interventions.
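
The replay problem described above can be illustrated with a small, hypothetical Python example. It is only a sketch of the bias, not CausalSim's algorithm: the functions `achieved_throughput`, `collect_trace`, and `naive_replay`, the capacity values, and the demand values are all assumptions made for illustration. A logged trace records what a flow achieved under the old policy, so replaying it unchanged under a more aggressive policy understates what that policy would have achieved.

```python
def achieved_throughput(capacity, demand):
    """A flow gets no more than the link capacity and no more than it asks for."""
    return min(capacity, demand)


def collect_trace(capacities, demand):
    """Throughput observed when a policy with the given demand runs on the network."""
    return [achieved_throughput(c, demand) for c in capacities]


def naive_replay(trace, new_demand):
    """Trace-driven replay that treats the logged throughput as if it were capacity."""
    return [min(t, new_demand) for t in trace]


# Latent link capacities (hypothetical values); a real trace never exposes these.
capacities = [2, 5, 8]

logged = collect_trace(capacities, demand=3)    # trace gathered under the old policy
replayed = naive_replay(logged, new_demand=6)   # simulated result for a new policy
truth = collect_trace(capacities, demand=6)     # what the new policy would really see

print("logged  :", logged)
print("replayed:", replayed)
print("truth   :", truth)
```

In this toy setting the replayed values never exceed the logged ones, even though the new policy would have obtained more throughput whenever the old policy was limited by its own demand rather than by the network.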

m3: A scale-free, fast, and accurate model for estimating the tail latency of network transfers in datacenters. m3 trains a neural network to approximate the packet-level simulation function, using a simple flow-level simulator to generate a compact feature map capturing the important characteristics of a network scenario.

Continual, Online Learning in Systems and Their Applications

Many systems must operate in non-stationary environments where workload or operating conditions change over time. Learning-based systems can adapt their models over time to continually optimize for current conditions, enabling the system to leverage simpler models that perform well over a narrower set of inputs and enhancing robustness by reducing reliance on training data coverage.

MMNet: Revisits MIMO detection from an online learning perspective, using a neural network architecture based on iterative soft-thresholding algorithms and an online training algorithm that exploits the locality of channel matrices, accelerating training by more than two orders of magnitude.
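
For readers unfamiliar with the family of algorithms mentioned above, the following is a minimal NumPy sketch of the textbook iterative soft-thresholding algorithm (ISTA) for the problem min_x 0.5 * ||y - Hx||^2 + lam * ||x||_1. It is not MMNet's architecture or training procedure; the function names `soft_threshold` and `ista`, the step-size choice, and the example values are assumptions for illustration only.

```python
import numpy as np


def soft_threshold(v, tau):
    """Shrink each entry of v toward zero by tau, zeroing entries smaller than tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def ista(H, y, lam, num_iters=200):
    """Textbook ISTA: a gradient step on the least-squares term, then shrinkage."""
    # Step size 1/L, where L is the largest eigenvalue of H^T H.
    step = 1.0 / np.linalg.norm(H, 2) ** 2
    x = np.zeros(H.shape[1])
    for _ in range(num_iters):
        grad = H.T @ (H @ x - y)  # gradient of 0.5 * ||y - Hx||^2
        x = soft_threshold(x - step * grad, step * lam)
    return x


# With an identity channel matrix, ISTA reduces to one soft-thresholding of y.
y_demo = np.array([3.0, -0.5, -2.0])
x_demo = ista(np.eye(3), y_demo, lam=1.0)
print(x_demo)
```

Each iteration alternates a linear update with a simple nonlinearity, which is the structure that learned detectors in this family unroll into neural network layers.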

AMS: Offloads retraining of specialized DNNs for video analytics to a remote server that continually adapts a small student model running on the edge device for real-time video. AMS introduces techniques to reduce the bandwidth required for over-the-network model adaptation.

RECL: Reduces compute resources required for model retraining in video analytics by reusing specialized DNNs whenever possible, enabling systems to scale to more video streams with limited resources.

SRVC: Combines existing video compression algorithms with a lightweight, content-adaptive super-resolution (SR) neural network, significantly boosting performance with low computation cost. SRVC compresses input video into two bitstreams: a content stream and a model stream, dynamically specializing the SR network for short video segments to achieve significant compression efficiency.

Broader Impacts

This project resulted in 20 technical papers published at top venues in the networking, systems, and machine learning communities. Projects like Pensieve, Decima, and Neo are among the earliest successes in learning-based systems, spurring considerable follow-on research and serving as benchmarks for evaluating new ideas in learned systems.

Our research on learning-based systems has led to broader results in machine learning. For instance, in Decima, the stochastic nature of real-world workloads creates significant variance in the reward signal, making it difficult for RL algorithms to assess the quality of different actions. Decima introduced input-specific baselines to tackle this problem, a concept we generalized for Input-driven Markov Decision Processes where an exogenous, stochastic input process impacts system dynamics. CausalSim led to a theoretical study of causal models with bijective generation mechanisms (BGMs), proving counterfactual identifiability of BGMs in several settings and proposing a practical learning algorithm that generalizes CausalSim's method.
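
The effect of an input-specific baseline can be seen in a small NumPy sketch. This is a toy one-step problem under stated assumptions, not Decima's training code: the reward model (action plus exogenous noise), the function name `gradient_samples`, and all parameter values are hypothetical. An exogenous input z adds noise to the reward that no action can influence; subtracting a baseline that depends on z removes that noise from the policy gradient estimate without changing its mean.

```python
import numpy as np


def gradient_samples(theta, sigma, n, seed, input_dependent):
    """Per-sample policy gradient estimates for a Bernoulli policy with logit theta.

    Reward is action + z, where z is an exogenous input the policy cannot affect.
    """
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-theta))           # probability of choosing action 1
    z = rng.normal(0.0, sigma, size=n)         # exogenous, stochastic input
    a = (rng.random(n) < p).astype(float)      # sampled actions
    reward = a + z
    if input_dependent:
        baseline = p + z                       # expected reward given the input z
    else:
        baseline = np.full(n, p)               # expected reward averaged over inputs
    score = a - p                              # d/dtheta of log pi(a)
    return (reward - baseline) * score


theta, sigma, n = 0.5, 5.0, 200_000
plain = gradient_samples(theta, sigma, n, seed=0, input_dependent=False)
with_input = gradient_samples(theta, sigma, n, seed=0, input_dependent=True)

print("mean (input-independent baseline):", plain.mean())
print("mean (input-dependent baseline):  ", with_input.mean())
print("variance ratio:", with_input.var() / plain.var())
```

Both estimators target the same gradient, p(1 - p) in this toy problem, but the input-dependent baseline yields a much smaller variance because the exogenous term cancels exactly.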

This project supported several graduate students and postdocs. Five PhD students have transitioned into the technology industry, and one postdoc is now in a faculty position.

Last Modified: 06/29/2024
Modified by: Mohammad Alizadeh

Please report errors in award information by writing to: awardsearch@nsf.gov.
