Award Abstract # 2023468
ASCENT: Collaborative Research: Scaling Distributed AI Systems based on Universal Optical I/O

NSF Org: ECCS
Division of Electrical, Communications and Cyber Systems
Recipient: MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Initial Amendment Date: July 27, 2020
Latest Amendment Date: July 27, 2020
Award Number: 2023468
Award Instrument: Standard Grant
Program Manager: Aranya Chakrabortty
ECCS
 Division of Electrical, Communications and Cyber Systems
ENG
 Directorate for Engineering
Start Date: August 15, 2020
End Date: July 31, 2023 (Estimated)
Total Intended Award Amount: $325,000.00
Total Awarded Amount to Date: $325,000.00
Funds Obligated to Date: FY 2020 = $325,000.00
History of Investigator:
  • Manya Ghobadi (Principal Investigator)
    ghobadi@mit.edu
Recipient Sponsored Research Office: Massachusetts Institute of Technology
77 MASSACHUSETTS AVE
CAMBRIDGE
MA  US  02139-4301
(617)253-1000
Sponsor Congressional District: 07
Primary Place of Performance: Massachusetts Institute of Technology
MA  US  02139-4309
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): E2NYLCDML6V1
Parent UEI: E2NYLCDML6V1
NSF Program(s): ASCENT-Address-Chalg-Eng-Teams
Primary Program Source: 01002021DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 1653
Program Element Code(s): 133Y00
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.041

ABSTRACT

Our society is rapidly becoming reliant on artificial intelligence computation based on neural networks. New algorithms are invented daily, increasing the memory and computational requirements for both inference and training. This explosive growth has created an enormous demand for distributed machine learning (ML) training and inference. Estimates by OpenAI show computational requirements growing steadily by 100x every two years since 2012, roughly 50x faster than the rate of improvement that Moore's Law in the semiconductor industry has delivered over the last half-century. This new computational demand has been partly met by the rapid development of hardware accelerators and of software stacks to support these specialized computations. Hardware accelerators have provided significant speed-ups, but today's training tasks can still take days and even weeks. The reason is that as the number of workers (e.g., compute nodes) increases, the computation time per worker decreases, but the communication requirements between the nodes grow, creating a bottleneck in the interconnect between the compute nodes. Future distributed ML systems will require 1-2 orders of magnitude higher interconnect bandwidth per node, creating a pressing need for entirely new ways to build interconnects for distributed ML systems. This project aims to create a new paradigm for scaling distributed ML computation by developing a scalable interconnect solution based on advances in integrated electronics and photonics that enable direct node-to-node optical fiber connectivity. The proposed cross-stack, collaborative, multi-disciplinary work will educate and train a unique cohort of engineers and scientists who cross the boundaries of machine learning, networking, and electronic-photonic systems and devices, and who are in severe demand. The principal investigators have an established track record of direct engagement with high-school students, providing summer internships at the Berkeley Wireless Research Center and MIT's Women's Technology Program, as well as exemplary undergraduate research activities at Boston University. The educational and outreach activities the PIs have put in place will ensure early exposure and continued training of a new generation of leaders in this field, from K-12 through undergraduate and graduate studies and continuing workforce education, with a special focus on underrepresented students.
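
The compute-versus-communication trade-off described above can be illustrated with a simple back-of-the-envelope model. The sketch below assumes data-parallel training with a ring all-reduce; the model size, per-step compute, per-GPU throughput, and link bandwidth are hypothetical placeholders chosen only to show the trend, not measurements or design points from this project.

    # Illustrative model of why the interconnect becomes the bottleneck in
    # data-parallel training. All constants are hypothetical placeholders.
    model_bytes = 4e9       # 1B-parameter model stored in fp32 (assumed)
    step_flops = 6e15       # total compute per training step (assumed)
    gpu_flops = 1e14        # per-GPU throughput, FLOP/s (assumed)
    link_bps = 100e9 / 8    # 100 Gb/s electrical NIC, in bytes/s (assumed)

    for n in (8, 64, 512, 1024):
        compute_s = step_flops / (n * gpu_flops)        # shrinks as 1/n
        # A ring all-reduce moves ~2*(n-1)/n * model_bytes per worker per step,
        # so communication time stays roughly constant as n grows.
        comm_s = 2 * (n - 1) / n * model_bytes / link_bps
        print(f"{n:5d} workers: compute {compute_s*1e3:8.1f} ms, "
              f"comm {comm_s*1e3:8.1f} ms")

With these assumed numbers, per-worker compute time falls roughly as 1/N while all-reduce time stays nearly flat, so beyond a few hundred workers the interconnect dominates the step time.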

The interconnect has emerged as the key bottleneck in enabling the full potential of distributed ML. Future ML workloads are likely to require tens of Tbps of bandwidth per device. Ubiquitous deployment of logically connected, physically distributed computation across shelf, rack, and row scale can only be enabled by a new universal I/O that provides socket-to-socket communication at the energy, latency, and bandwidth density of in-package interconnects. No such technology currently exists. Silicon-photonics-based optical I/O has the potential to address this critical challenge, but fundamental advances, from chip manufacturing to routing algorithms, are still needed to ensure the scalability of these interconnect systems. To achieve high bandwidth density and energy efficiency, dense wavelength division multiplexing must be used. High-efficiency ring-resonator-based modulators and comb laser sources are needed to enable Tbps rates over each fiber connection and socket bandwidth scaling from 10s to 100s of Tbps. New link architectures, such as the proposed laser-forwarded coherent link, are needed to enable high-efficiency external centralized comb laser sources with modest (sub-mW) power per wavelength per fiber port. The proposed work will also develop new scheduling algorithms, network architectures, and workload parallelism strategies that leverage the bandwidth density and low latency of the universal optical I/O to map large AI workloads with massive datasets onto a scalable distributed compute system.
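
As a rough illustration of how dense wavelength division multiplexing reaches these targets, the sketch below multiplies an assumed per-wavelength data rate by the number of wavelengths per fiber and the number of fiber ports per socket; all three values are illustrative assumptions, not design parameters of the proposed links.

    # Illustrative DWDM bandwidth budget; every value below is an assumption
    # chosen only to show the order of magnitude, not a project design point.
    gbps_per_wavelength = 16     # per-wavelength data rate (assumed)
    wavelengths_per_fiber = 64   # DWDM channel count from a comb source (assumed)
    fibers_per_socket = 32       # optical fiber ports per compute socket (assumed)

    fiber_tbps = gbps_per_wavelength * wavelengths_per_fiber / 1000
    socket_tbps = fiber_tbps * fibers_per_socket

    print(f"Per fiber:  {fiber_tbps:.2f} Tbps")   # ~1 Tbps per fiber connection
    print(f"Per socket: {socket_tbps:.1f} Tbps")  # 10s of Tbps per socket

Under these assumptions, each fiber carries on the order of 1 Tbps and a 32-port socket reaches tens of Tbps; increasing any of the three factors pushes the socket toward the 100s-of-Tbps regime discussed above.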

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Ghobadi, Manya. "Emerging Optical Interconnects for AI Systems." OFC, 2022. https://doi.org/10.1364/OFC.2022.Th1G.1
Khani, Mehrdad; Ghobadi, Manya; Alizadeh, Mohammad; Zhu, Ziyi; Glick, Madeleine; Bergman, Keren; Vahdat, Amin; Klenk, Benjamin; Ebrahimi, Eiman. "SiP-ML: High-Bandwidth Optical Network Interconnects for Machine Learning Training." ACM SIGCOMM, 2021. https://doi.org/10.1145/3452296.3472900
Sludds, Alexander; Hamerly, Ryan; Bandyopadhyay, Saumil; Zhong, Zhizhen; Chen, Zaijun; Bernstein, Liane; Ghobadi, Manya; Englund, Dirk. "Demonstration of WDM-Enabled Ultralow-Energy Photonic Edge Computing." OFC, 2022. https://doi.org/10.1364/OFC.2022.Th3A.3
Wang, Weiyang; Khazraee, Moein; Zhong, Zhizhen; Ghobadi, Manya; Jia, Zhihao; Mudigere, Dheevatsa; Zhang, Ying; Kewitsch, Anthony. "TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs." 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Project outcomes report:

This project aimed to create a set of technologies -- from applications and computation architectures to electronic-photonic links and new fundamental device components -- all optimized to enable the scaling of distributed ML computation. The work blended innovation synergistically across the levels of the design hierarchy, building on the capabilities of breakthrough photonic device and link technologies to create transformative compute platforms.

Our objectives were: (1) to lay the foundation for a scalable distributed ML platform based on photonic I/O interconnects; (2) to develop the underlying link, device, and circuit components that demonstrate the scaling potential of photonic I/Os; and (3) to build a new collaboration between four groups with leading expertise in photonic device design, electronic-photonic link design and integration, and distributed compute system design, carrying this vision and the toolkits developed under it forward to continue the innovations required to meet the computational demands of continuously evolving ML applications and algorithms.

 

The contributions of this project lie in the development of photonic components, interconnect circuits and architectural framework, to realize a scalable distributed ML computational platform:

1) We developed simulation frameworks for large-scale distributed machine learning (Rostam), along with associated hardware experiments, which enabled new silicon-photonic fabric topologies tailored for machine learning systems (SiP-ML) and an associated optimization framework (TopoOpt). The SiP-ML interconnect framework that was developed and evaluated reported speed-ups of up to 9.1x on large GPU clusters (at a scale of 1,024 GPUs). This represents a breakthrough justifying the use of silicon-photonic chiplet interconnects in AI/ML applications at scale.

2) Two chips were designed and fabricated in a high-volume monolithic electronic-photonic process platform, with novel devices and link sub-blocks for this ASCENT project. Chip 1: ring-based transmitters (drivers and modulators) and phase-tracking coherent receivers. Chip 2: electro-optic receiver ring tuning (a single-ring design, WDM ring design-of-experiments structures, and a WDM ring receiver with backend electronics). A number of novel devices were designed and characterized, and papers were published on the results. The infrastructure for link experiments based on these chips has been set up, demonstrating link-level sub-system functionality.

 

This project involved students ranging from the undergraduate to the MS and PhD levels, and exposed them to the multidisciplinary nature of the research, from new electronic-photonic process platforms and design tools, to novel photonic device and circuit designs and link architectures, to network topologies and machine-learning algorithms, enabled by the cross-stack approach of the ASCENT program.


 


Last Modified: 12/27/2023
Modified by: Manya Ghobadi

