NSF Award Search: Award # 1337158

Award Abstract # 1337158

XPS: CLCCA: On the Hunt for Correctness and Performance Bugs in Large-scale Programs

NSF Org:	CCF Division of Computing and Communication Foundations
Recipient:	PURDUE UNIVERSITY
Initial Amendment Date:	September 13, 2013
Latest Amendment Date:	April 24, 2015
Award Number:	1337158
Award Instrument:	Standard Grant
Program Manager:	Anindya Banerjee abanerje@nsf.gov (703)292-7885 CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering
Start Date:	September 15, 2013
End Date:	August 31, 2016 (Estimated)
Total Intended Award Amount:	$260,331.00
Total Awarded Amount to Date:	$292,331.00
Funds Obligated to Date:	FY 2013 = $260,331.00 FY 2014 = $16,000.00 FY 2015 = $16,000.00
History of Investigator:	Milind Kulkarni (Principal Investigator) Michael Gribskov (Co-Principal Investigator) Saurabh Bagchi (Co-Principal Investigator)
Recipient Sponsored Research Office:	Purdue University 2550 NORTHWESTERN AVE # 1100 WEST LAFAYETTE IN US 47906 (765)494-1055
Sponsor Congressional District:	04
Primary Place of Performance:	Purdue University 465 Northwestern Ave. West Lafayette IN US 47907-2035
Primary Place of Performance Congressional District:	04
Unique Entity Identifier (UEI):	YRXVL4JYCEF5
Parent UEI:	YRXVL4JYCEF5
NSF Program(s):	Information Technology Researc, Software & Hardware Foundation, Exploiting Parallel&Scalabilty
Primary Program Source:	01001415DB NSF RESEARCH & RELATED ACTIVIT 01001314DB NSF RESEARCH & RELATED ACTIVIT 01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	9251, 7943
Program Element Code(s):	164000, 779800, 828300
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

The scale of computing applications has been dramatically increasing over the past several years. As applications in domains such as computational genomics, data mining, and machine learning are let loose on ever-more-complex problems, the scale of the inputs to these applications has shot up. And as the pursuit of parallelism has led to increasing core counts for servers, and increasing numbers of servers and racks for data centers, the scale of the systems that these applications must run on has also dramatically risen. A critical problem in developing large scale applications is detecting and debugging scaling issues, which are problems with program behavior that emerge only as a program scales up. Scaling issues show up as correctness bugs or performance bottlenecks. Unfortunately, detecting bugs that arise at large scales is difficult. Manually poring through logs or performance profiling individual application processes is not practical. Moreover, the developer may not have access to the inputs and systems necessary to run the application at large scales. This research project aims to develop automated techniques to detect and diagnose correctness and performance bugs for large-scale programs using program behavior modeling, training at small scale runs, and extrapolating to large-scale runs.

To achieve our objectives, we build statistical models that incorporate scale. By relating program scale to program behavior, we can predict how a program behaves at large scales, without ever seeing correct behavior at that scale, and use those predictions to detect and diagnose bugs. The project is structured around three thrusts, each using the computational genomics applications for context. In the first, we build statistical models of program behavior that incorporate scale. In the second, we build statistical techniques for detecting when there is an error and then drilling down to identify potential root causes in the software. In the third, we build a testing tool which will allow us to uncover such scaling issues in an accelerated manner. In aggregate, the project combines in innovative ways applications of static analysis, dynamic instrumentation, modeling, and machine learning-based data analysis. The project will use computational genomics applications, such as Blast, Bowtie, Trinity/Butterfly, and Margin, to evaluate the approach.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Wilke, Andreas and Bischof, Jared and Gerlach, Wolfgang and Glass, Elizabeth and Harrison, Travis and Keegan, Kevin P and Paczian, Tobias and Trimble, William L and Bagchi, Saurabh and Grama, Ananth and others "The MG-RAST metagenomics database and portal in 2015" Nucleic acids research , v.44 , 2016 , p.D590--D59

Kanak Mahadik, Christopher Wright, Milind Kulkarni, Saurabh Bagchi, Somali Chaterji "SARVAVID: A DSL for Computational Genomics Applications" Programming Languages Design and Implementation , 2016

Ghoshal, Asish and Shankar, Raghavendran and Bagchi, Saurabh and Grama, Ananth and Chaterji, Somali "MicroRNA target prediction using thermodynamic and sequence curves" BMC genomics , v.16 , 2015 , p.1

Kanak Mahadik, Somali Chaterji, Bowen Zhou, Milind Kulkarni and Saurabh Bagchi "Orion: Scaling Genomic Sequence Matching with Fine-Grained Parallelization" The International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing) , 2014 , p.449 10.1109/SC.2014.42

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Genomics is an important area of study that holds the promise of unraveling mysteries of how our genes guide our lives and how personalized gene therapies may be used to treat various medical conditions. The increasing adoption of massively-parallel sequencing technologies, known as next generation sequencing (NGS) technologies, has resulted in a situation where computation has become the bottleneck in our ability to ingest all the genetic information and to synthesize actionable knowledge from this information.

Intellectual Merit

We addressed this problem by developing computational approaches, with a focus on the above domain of computational genomics.

Extracting parallelism. We developed a principled way to extract parallelism from applications. We leveraged the approximate nature of some of these applications (e.g., the DNA sequence match does not need to be exact) to aid in the parallelism. We identified coarse-grained, inter-query parallelism and fine-grained, intra-query parallelism inherent in many of the genomics workflows. We developed parallel building blocks for alignment of genomic sequences, assembly of genomes, and pattern mining for determining regulatory elements in the genome, epigenome, and metagenome.

Modeling. The quality of the scaling model determines how well our system can predict the behavior of a program at large scales. We extended existing modeling techniques that can capture a wide variety of scaling behaviors and automatically select which program behaviors to track. We also modeled the data dependence that many of the target algorithms exhibit and came up with techniques to quantify the confidence in the output of the model.

Detection and diagnosis. Because the scaling model captures the relationship between scale and behavior, we used it to predict a program’s behavior even when run at a new, heretofore-unseen scale. Thus, discrepancy between the predicted behavior and the observed behavior was an indicator that there is a manifested software bug. By isolating the source of those deviations to particular program behaviors, such as a particular loop’s executing too many times, we could localize the source of the bug. The key feature of our approach is that the modeling can be done, efficiently, at small scales, and the model applied for detection and diagnosis at large production scales.

Broader Impact

We disseminated the deliverables of the project through open source software, a domain specific language geared toward genomics researchers and practitioners who want to develop scalable and high-performing applications, and tutorials presented at the Biology Division of the Argonne National Laboratory. To further simplify the use of our deliverables, we created a web server that can take queries about which parts of the human epigenome exert a regulatory effect and visualize it. At the backend, this runs a distributed machine learning algorithm that runs a classification task.

Last Modified: 01/15/2017
Modified by: Milind Kulkarni

Images (1 of 4)

Please report errors in award information by writing to: awardsearch@nsf.gov.

Success

Error