
NSF Org: CNS Division of Computer and Network Systems
Recipient:
Initial Amendment Date: September 5, 2014
Latest Amendment Date: February 2, 2018
Award Number: 1405697
Award Instrument: Continuing Grant
Program Manager: Marilyn McClure, mmcclure@nsf.gov, (703) 292-5197, CNS Division of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2014
End Date: September 30, 2020 (Estimated)
Total Intended Award Amount: $750,000.00
Total Awarded Amount to Date: $946,001.00
Funds Obligated to Date: FY 2015 = $149,935.00; FY 2016 = $282,795.00; FY 2018 = $180,001.00
History of Investigator:
Recipient Sponsored Research Office: 300 Turner St NW, Blacksburg, VA 24060-3359, US; (540) 231-5281
Sponsor Congressional District:
Primary Place of Performance: Blacksburg, VA 24060-0001, US
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): Special Projects - CNS; CSR-Computer Systems Research
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT; 01001617DB NSF RESEARCH & RELATED ACTIVIT; 01001819DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Computer applications that process large amounts of information are becoming common in a variety of science domains, such as high-speed physics, economics, genomics, astronomy, and meteorology. The overall goal of this project is to design software tools and technologies that support such applications efficiently on advanced computing systems. Moreover, the hardware used to build such systems often comprises different types of resources, e.g., a conventional processor running alongside specialized graphics processing units, and this heterogeneity presents a major challenge when running applications at the needed large scale. A better understanding of application behavior on emerging hardware is key to sustaining these systems. To this end, the project designs and develops Pythia, software that models and predicts how applications would behave on given hardware. This information is then used to better utilize the resources and achieve scalable, high-performance computing systems.
The intellectual value of this research involves three intermediate research goals:
1) Design an accurate application classifier that uses compile-time program analysis to capture workflow behavior and application characteristics and provides detailed insight into expected runtime application interactions.
2) Design and develop an accurate simulation model that incorporates workflow and application characteristics into a heuristics engine to predict how the application will perform under given conditions and resources.
3) Design a distributed, flexible, efficient, and easy-to-use online oracle framework that captures infrastructure heterogeneity and integrates with live systems to predict application behavior, which in turn can guide application-attuned resource scheduling and management.
Completion of the project will create tools and technologies for the realization of more efficient and scalable computing systems. This work impacts a broad range of disciplines that regularly employ high-performance, large-scale computing systems, especially for data-driven discovery. Consequently, use of Pythia will reduce the time-to-solution for modern and emerging applications, and therefore directly affect our way of life. The educational activities, which include recruiting and mentoring women and minority students, will help produce graduates with highly marketable skill sets. Integrating the research discoveries and software tools, which will be open source and made public, into the educational curriculum will help capture the interest of the next generation of computer scientists.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The project designed, implemented, and evaluated Pythia, a software system aimed at fine-tuning distributed data management frameworks (DSFs), such as Hadoop and associated tools, on emerging heterogeneous resources. Pythia aims to improve the performance of crucial large-scale applications that are routinely employed in a variety of fields, e.g., high-speed physics, healthcare, and finance. Pythia uses compile-time program analysis to build an application model, creates a runtime simulation-based performance oracle to predict application behavior, and uses the predicted behavior to devise dynamic resource management strategies for large-scale computing systems.
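The analysis-oracle-scheduling pipeline can be illustrated with a minimal sketch. Everything below is a hypothetical illustration under assumed names: the feature dictionary stands in for what a compile-time analysis pass might emit, and the linear cost model stands in for the simulation oracle; neither reflects Pythia's actual implementation.

```python
# Hypothetical sketch of a Pythia-style pipeline: static analysis
# produces an application model, a toy cost model acts as the
# performance oracle, and scheduling picks the best resource.

def analyze_application(source_stats):
    """Stand-in for compile-time analysis: extract workload features."""
    return {
        "input_bytes": source_stats["input_bytes"],
        "cpu_ops_per_byte": source_stats.get("cpu_ops_per_byte", 10),
        "shuffle_ratio": source_stats.get("shuffle_ratio", 0.3),
    }

def predict_runtime(model, cluster):
    """Stand-in for the simulation oracle: estimate runtime in seconds
    as compute time plus network (shuffle) time."""
    compute = model["input_bytes"] * model["cpu_ops_per_byte"] / cluster["flops"]
    network = model["input_bytes"] * model["shuffle_ratio"] / cluster["net_bw"]
    return compute + network

def schedule(model, clusters):
    """Application-attuned placement: choose the cluster with the
    lowest predicted runtime for this workload."""
    return min(clusters, key=lambda c: predict_runtime(model, c))
```

The point of the sketch is the separation of concerns: the model is built once per application, while the oracle can be queried repeatedly by a scheduler as resources change.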
It was observed that load imbalance across distributed computing servers is a key obstacle to achieving high performance in DSFs, which led to the development of dynamic solutions that address the imbalance in extant key-value stores powering modern applications and Internet services. The concepts were then extended to design storage-tiering support for data analytics in the cloud. The approach enables cloud service providers to scale their revenue systems in a cost-aware manner by allowing the systems to adapt to an ever-evolving business context. Another challenge in DSFs is the growing heterogeneity of hardware and software components. The work showed that heterogeneous resources can be integrated into Hadoop and similar DSFs while achieving high performance and sustaining the ease of use and flexibility offered by the frameworks.
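One generic way to address such load imbalance dynamically is to migrate hot keys away from overloaded servers. The sketch below is a simple rebalancing heuristic of that kind, offered only as an illustration of the problem; it is not the project's actual algorithm, and all names are hypothetical.

```python
# Hypothetical hot-spot mitigation in a key-value store: move the
# hottest key on the most loaded server to the least loaded server.
from collections import Counter

def rebalance(placement, access_counts):
    """placement: key -> server; access_counts: key -> access hits.
    Returns a new placement with one hot key migrated."""
    load = Counter()
    for key, server in placement.items():
        load[server] += access_counts.get(key, 0)
    if len(load) < 2:
        return dict(placement)  # nothing to balance across
    hot_server = max(load, key=load.get)
    cold_server = min(load, key=load.get)
    # Hottest key currently residing on the overloaded server.
    candidates = [k for k, s in placement.items() if s == hot_server]
    hottest = max(candidates, key=lambda k: access_counts.get(k, 0))
    new_placement = dict(placement)
    new_placement[hottest] = cold_server
    return new_placement
```

A real system would run such a step continuously under thresholds and hysteresis to avoid oscillation; the sketch shows only the core migration decision.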
Modern DSFs increasingly employ the containerized model. In this context, a study of container registries was conducted; its key findings enabled the design of effective caching and prefetching strategies that exploit registry-specific workload characteristics to significantly improve the performance of DSFs. The traces and a trace-replayer tool were made available to the community as open-source resources, which can serve as a solid basis for new research and studies on container registries and container-based virtualization. The work was further extended by designing a new hyperconverged architecture for container registries, which combines all registry servers in a distributed container setup into a tightly connected cluster in which every server plays the same consolidated role: each registry server caches images in its memory, stores images in its local storage, and provides computational resources to process client requests. The design employs a custom consistent-hashing function to take advantage of the layered structure and addressing of images and to load-balance requests across the servers. Evaluation using real production workloads showed that the approach significantly outperforms the conventional registry design and improves latency by an order of magnitude. Moreover, the conducted container analysis allows for making recommendations on the current landscape of containers in HPC and further informs designs for container solutions tailor-made for future HPC applications.
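The consistent-hashing idea behind such a registry cluster can be sketched with a generic hash ring that routes each image layer (addressed by its content digest) to a server. This is a textbook ring with virtual nodes, not the project's custom function, and all names are hypothetical.

```python
# Minimal consistent-hash ring mapping image layers to registry
# servers. Removing a server reassigns only the layers that were
# routed to it; all other layers keep their placement.
import hashlib
from bisect import bisect

class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server contributes `vnodes` points on the ring to
        # spread load more evenly.
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, layer_digest):
        """Route a layer to the first ring point at or after its hash."""
        i = bisect(self.points, self._hash(layer_digest)) % len(self.ring)
        return self.ring[i][1]
```

The stability property is what makes this attractive for a registry: when a server joins or leaves, only the layers whose ring points belonged to that server move, so the per-server caches described above stay mostly warm.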
The approaches have been applied to crucial applications such as big data processing and deep learning. In this context, a deep learning scheduler for multi-GPU HPC systems was designed; it achieved efficient resource usage on representative HPC systems equipped with multiple GPUs, fast networks, powerful CPUs, and abundant memory.
The broader impact of the project included mentoring eight Ph.D. theses (four by women), four MS theses (two by women), and 12 REU students (four women). Outreach modules were also designed for inclusion in classroom instruction and to apprise stakeholders at national labs (e.g., ORNL) and in industry (e.g., IBM Research). All software tools developed as part of the project have been released to the community to enable follow-up research.
Last Modified: 11/29/2020
Modified by: Ali Butt
Please report errors in award information by writing to: awardsearch@nsf.gov.