Award Abstract # 1405697
CSR: Medium: Pythia: An Application Analysis and Online Modeling Based Prediction Framework for Scalable Resource Management

NSF Org: CNS
Division Of Computer and Network Systems
Recipient: VIRGINIA POLYTECHNIC INSTITUTE & STATE UNIVERSITY
Initial Amendment Date: September 5, 2014
Latest Amendment Date: February 2, 2018
Award Number: 1405697
Award Instrument: Continuing Grant
Program Manager: Marilyn McClure
mmcclure@nsf.gov
 (703)292-5197
CNS Division Of Computer and Network Systems
CSE Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2014
End Date: September 30, 2020 (Estimated)
Total Intended Award Amount: $750,000.00
Total Awarded Amount to Date: $946,001.00
Funds Obligated to Date: FY 2014 = $333,270.00
FY 2015 = $149,935.00
FY 2016 = $282,795.00
FY 2018 = $180,001.00
History of Investigator:
  • Ali Butt (Principal Investigator)
    butta@cs.vt.edu
  • Chao Wang (Co-Principal Investigator)
Recipient Sponsored Research Office: Virginia Polytechnic Institute and State University
300 TURNER ST NW
BLACKSBURG
VA  US  24060-3359
(540)231-5281
Sponsor Congressional District: 09
Primary Place of Performance: Virginia Polytechnic Institute and State University
Blacksburg
VA  US  24060-0001
Primary Place of Performance Congressional District: 09
Unique Entity Identifier (UEI): QDE5UHE5XD16
Parent UEI: M515A1DKXAN8
NSF Program(s): Special Projects - CNS,
CSR-Computer Systems Research
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT
01001516DB NSF RESEARCH & RELATED ACTIVIT
01001617DB NSF RESEARCH & RELATED ACTIVIT
01001819DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 9251, 7924
Program Element Code(s): 171400, 735400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Computer applications that process large amounts of information are becoming common in a variety of science domains, such as High-Speed Physics, Economics, Genomics, Astronomy, and Meteorology. The overall goal of this project is to design software tools and technologies to support such applications efficiently on advanced computing systems. The hardware used to implement such advanced systems often combines different types of resources, e.g., a conventional computer processor running alongside specialized graphics processing units, and this heterogeneity presents a major challenge when running the applications at the needed large scale. A better understanding of the applications' behavior on the emerging hardware is key to sustaining these systems. To this end, the project designs and develops Pythia, software that models and predicts how applications would behave on given hardware. This information is then used to better utilize the resources and to achieve scalable, high-performance computing systems.


The intellectual value of this research involves three intermediate research goals. 1) Design an accurate application classifier using compile-time program analysis that captures workflow behavior and application characteristics, and provides detailed insights into expected runtime application interactions. 2) Design and develop an accurate simulation model that incorporates workflow and application characteristics into a heuristics engine to predict how the application will perform under given conditions and resources. 3) Design a distributed, flexible, efficient, and easy-to-use online oracle framework that captures the infrastructure heterogeneity and integrates with live systems to predict application behavior, which in turn can help guide application-attuned resource scheduling and management. Completion of the project will create tools and technologies for realization of more efficient and scalable computing systems. This work impacts a broad range of disciplines that regularly employ high-performance large-scale computing systems, especially for data-driven discovery. Consequently, use of Pythia will reduce the time-to-solution for modern and emerging applications, and therefore directly affect our way of life. The educational activities, which include recruiting and mentoring women and minority students, will help produce graduates with highly marketable skill sets. The integration of the research discoveries and software tools, which will be open source and made public, into the educational curriculum will help capture the interest of the next generation of computer scientists.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 34)
Zhao, Nannan and Tarasov, Vasily and Albahar, Hadeel and Anwar, Ali and Rupprecht, Lukas and Skourtis, Dimitrios and Warke, Amit S. and Mohamed, Mohamed and Butt, Ali R. "Large-Scale Analysis of the Docker Hub Dataset" IEEE International Conference on Cluster Computing (CLUSTER), 2019 10.1109/CLUSTER.2019.8891000
Zhao, Nannan and Tarasov, Vasily and Albahar, Hadeel and Anwar, Ali and Rupprecht, Lukas and Skourtis, Dimitris and Paul, Arnab K. and Chen, Keren and Butt, Ali R. "Large-Scale Analysis of the Docker Images and Performance Implications for Container Storage Systems" IEEE Transactions on Parallel and Distributed Systems, 2020 https://doi.org/10.1109/TPDS.2020.3034517
Zhao, Zhao and Chen, Langshi and Avram, Mihai and Li, Meng and Wang, Guanying and Butt, Ali and Khan, Maleq and Marathe, Madhav and Qiu, Judy and Vullikanti, Anil "Finding and counting tree-like subgraphs using MapReduce" IEEE Transactions on Multi-Scale Computing Systems, 2017 10.1109/TMSCS.2017.2768426
Yue Cheng, M. Safdar Iqbal, Aayush Gupta, and Ali R. Butt "Provider versus Tenant Pricing Games for Hybrid Object Stores in the Cloud" IEEE Internet Computing: Special Issue on Cloud Storage, v.20, 2016, p.28 http://doi.ieeecomputersociety.org/10.1109/MIC.2016.50
Yu, Tingting and Zaman, Tarannum and Wang, Chao "DESCRY: Reproducing System-Level Concurrency Failures" ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE), 2017
Zhao, Nannan and Albahar, Hadeel and Abraham, Subil and Chen, Keren and Tarasov, Vasily and Skourtis, Dimitrios and Rupprecht, Lukas and Anwar, Ali and Butt, Ali R. "DupHunter: Flexible High-Performance Deduplication for Docker Registries" USENIX Annual Technical Conference (ATC'20), 2020
Zhao, Nannan and Anwar, Ali and Cheng, Yue and Salman, Mohammed and Li, Daping and Wan, Jiguang and Xie, Changsheng and He, Xubin and Wang, Feiyi and Butt, Ali R. "Chameleon: An Adaptive Wear Balancer for Flash Clusters" IPDPS .... [proceedings], 2018
Zhao, Nannan and Anwar, Ali and Tarasov, Vasily and Rupprecht, Lukas and Skourtis, Dimitrios and Warke, Amit S and Mohamed, Mohamed and Butt, Ali R. "Slimmer: Weight Loss Secrets for Docker Registries" Work-in-Progress report in the IEEE International Conference on Cloud Computing (CLOUD), 2019
Littley, Michael and Anwar, Ali and Fayyaz, Hannan and Fayyaz, Zeshan and Tarasov, Vasily and Rupprecht, Lukas and Skourtis, Dimitrios and Mohamed, Mohamed and Ludwig, Heiko and Cheng, Yue and Butt, Ali R. "Bolt: Towards a Scalable Docker Registry" Proceedings of the IEEE International Conference on Cloud Computing (CLOUD), 2019
Luna Xu, Seung-Hwan Lim, Ali R. Butt, Sreenivas R. Sukumar, and Ramakrishnan Kannan "FatMan vs. LittleBoy: Scaling up Linear Algebraic Operations in Scale-out Data Platforms" Proceedings of the First Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems (PDSW-DISCS), 2016

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

 

The project designed, implemented, and evaluated Pythia, a software system aimed at fine-tuning distributed data management frameworks (DSFs), such as Hadoop and associated tools, on emerging heterogeneous resources. Pythia aims at improving the performance of crucial large-scale applications that are routinely employed in a variety of fields, e.g., high-speed physics, healthcare, and finance. Pythia uses compile-time program analysis to build an application model, creates a runtime simulation-based performance oracle to predict application behavior, and applies the predicted behavior to devise dynamic resource management strategies for large-scale computing systems.

It was observed that load imbalance across distributed computing servers is a key obstacle to achieving high performance in DSFs, and this led to the development of dynamic solutions to address the imbalance in extant key-value stores that power modern applications and Internet services. The concepts were then extended to design storage tiering support for data analytics in the cloud. The approach enables cloud service providers to scale their revenue systems in a cost-aware manner by helping the systems adapt to the ever-evolving business context. Another challenge in DSFs is the growing heterogeneity in the hardware and software components. The work showed that heterogeneous resources can be integrated into Hadoop and similar DSFs while achieving high performance and sustaining the ease of use and flexibility offered by the frameworks.

Modern DSFs increasingly employ the containerized model. In this context, a study of container registries was conducted, and its key findings enabled the design of effective caching and prefetching strategies that exploit registry-specific workload characteristics to significantly improve the performance of DSFs. The traces and a trace replayer tool were made available to the community as open-source resources, which can serve as a solid basis for new research on container registries and container-based virtualization. The work was further extended by designing a new hyperconverged architecture for container registries, which combines all registry servers in a distributed container setup into a tightly connected cluster in which every server plays the same consolidated role: each registry server caches images in its memory, stores images in its local storage, and provides computational resources to process client requests. The design employs a custom consistent hashing function to take advantage of the layered structure and addressing of images and to load balance requests across different servers. Evaluation using real production workloads showed that the approach significantly outperforms the conventional registry design and improves latency by an order of magnitude. Moreover, the container analysis supports recommendations on the current landscape of containers in HPC and informs the design of container solutions tailor-made for future HPC applications.
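
As an illustration of the consistent hashing idea mentioned above, the following is a minimal sketch of a generic hash ring that maps image layer digests to registry servers. It is not the project's custom hashing function, which this report does not describe; the class name `HashRing`, the server names, the virtual-node count, and the synthetic digests are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, servers, vnodes=64):
        # Each server contributes `vnodes` points so load spreads more evenly.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, layer_digest: str) -> str:
        """Return the server owning the first ring point after the digest."""
        idx = bisect.bisect_right(self._points, _hash(layer_digest))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical registry servers and synthetic content-addressed layer digests.
servers = ["registry-0", "registry-1", "registry-2"]
ring = HashRing(servers)
digests = [
    "sha256:" + hashlib.sha256(str(i).encode()).hexdigest() for i in range(3000)
]
placement = {d: ring.lookup(d) for d in digests}
```

Because layers are addressed by content digest, every request for the same layer resolves to the same server, and adding a server relocates only the layers that fall on the new server's ring segments.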

The approaches have been applied to crucial applications such as big data processing and deep learning. In this context, a deep learning scheduler for multi-GPU HPC systems was designed; it achieved efficient resource usage in representative HPC systems that are equipped with multiple GPUs, fast networks, powerful CPUs, and abundant memory.

The broader impact of the project included mentoring of eight Ph.D. theses (four by women), four MS theses (two by women), and 12 REU students (four women). Outreach modules were also designed for inclusion in classroom instruction and to apprise stakeholders at national labs (e.g., ORNL) and in industry (e.g., IBM Research). All software tools developed as part of the project have been released to the community to enable follow-up research.

 


Last Modified: 11/29/2020
Modified by: Ali Butt

Please report errors in award information by writing to: awardsearch@nsf.gov.
