
NSF Org: |
CCF Division of Computing and Communication Foundations |
Recipient: |
|
Initial Amendment Date: | September 13, 2013 |
Latest Amendment Date: | April 24, 2015 |
Award Number: | 1337158 |
Award Instrument: | Standard Grant |
Program Manager: |
Anindya Banerjee
abanerje@nsf.gov (703)292-7885 CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 15, 2013 |
End Date: | August 31, 2016 (Estimated) |
Total Intended Award Amount: | $260,331.00 |
Total Awarded Amount to Date: | $292,331.00 |
Funds Obligated to Date: |
FY 2014 = $16,000.00 FY 2015 = $16,000.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
2550 NORTHWESTERN AVE # 1100 WEST LAFAYETTE IN US 47906-1332 (765)494-1055 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
465 Northwestern Ave. West Lafayette IN US 47907-2035 |
Primary Place of
Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): |
Information Technology Researc, Software & Hardware Foundation, Exploiting Parallel&Scalabilty |
Primary Program Source: |
01001314DB NSF RESEARCH & RELATED ACTIVIT 01001415DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
The scale of computing applications has been dramatically increasing over the past several years. As applications in domains such as computational genomics, data mining, and machine learning are let loose on ever-more-complex problems, the scale of the inputs to these applications has shot up. And as the pursuit of parallelism has led to increasing core counts for servers, and increasing numbers of servers and racks for data centers, the scale of the systems that these applications must run on has also dramatically risen. A critical problem in developing large scale applications is detecting and debugging scaling issues, which are problems with program behavior that emerge only as a program scales up. Scaling issues show up as correctness bugs or performance bottlenecks. Unfortunately, detecting bugs that arise at large scales is difficult. Manually poring through logs or performance profiling individual application processes is not practical. Moreover, the developer may not have access to the inputs and systems necessary to run the application at large scales. This research project aims to develop automated techniques to detect and diagnose correctness and performance bugs for large-scale programs using program behavior modeling, training at small scale runs, and extrapolating to large-scale runs.
To achieve our objectives, we build statistical models that incorporate scale. By relating program scale to program behavior, we can predict how a program behaves at large scales, without ever seeing correct behavior at that scale, and use those predictions to detect and diagnose bugs. The project is structured around three thrusts, each using the computational genomics applications for context. In the first, we build statistical models of program behavior that incorporate scale. In the second, we build statistical techniques for detecting when there is an error and then drilling down to identify potential root causes in the software. In the third, we build a testing tool which will allow us to uncover such scaling issues in an accelerated manner. In aggregate, the project combines in innovative ways applications of static analysis, dynamic instrumentation, modeling, and machine learning-based data analysis. The project will use computational genomics applications, such as Blast, Bowtie, Trinity/Butterfly, and Margin, to evaluate the approach.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note:
When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external
site maintained by the publisher. Some full text articles may not yet be available without a
charge during the embargo (administrative interval).
Some links on this page may take you to non-federal websites. Their policies may differ from
this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Genomics is an important area of study that holds the promise of unraveling mysteries of how our genes guide our lives and how personalized gene therapies may be used to treat various medical conditions. The increasing adoption of massively-parallel sequencing technologies, known as next generation sequencing (NGS) technologies, has resulted in a situation where computation has become the bottleneck in our ability to ingest all the genetic information and to synthesize actionable knowledge from this information.
Intellectual Merit
We addressed this problem by developing computational approaches, with a focus on the above domain of computational genomics.
Extracting parallelism. We developed a principled way to extract parallelism from applications. We leveraged the approximate nature of some of these applications (e.g., the DNA sequence match does not need to be exact) to aid in the parallelism. We identified coarse-grained, inter-query parallelism and fine-grained, intra-query parallelism inherent in many of the genomics workflows. We developed parallel building blocks for alignment of genomic sequences, assembly of genomes, and pattern mining for determining regulatory elements in the genome, epigenome, and metagenome.
Modeling. The quality of the scaling model determines how well our system can predict the behavior of a program at large scales. We extended existing modeling techniques that can capture a wide variety of scaling behaviors and automatically select which program behaviors to track. We also modeled the data dependence that many of the target algorithms exhibit and came up with techniques to quantify the confidence in the output of the model.
Detection and diagnosis. Because the scaling model captures the relationship between scale and behavior, we used it to predict a program’s behavior even when run at a new, heretofore-unseen scale. Thus, discrepancy between the predicted behavior and the observed behavior was an indicator that there is a manifested software bug. By isolating the source of those deviations to particular program behaviors, such as a particular loop’s executing too many times, we could localize the source of the bug. The key feature of our approach is that the modeling can be done, efficiently, at small scales, and the model applied for detection and diagnosis at large production scales.
Broader Impact
We disseminated the deliverables of the project through open source software, a domain specific language geared toward genomics researchers and practitioners who want to develop scalable and high-performing applications, and tutorials presented at the Biology Division of the Argonne National Laboratory. To further simplify the use of our deliverables, we created a web server that can take queries about which parts of the human epigenome exert a regulatory effect and visualize it. At the backend, this runs a distributed machine learning algorithm that runs a classification task.
Last Modified: 01/15/2017
Modified by: Milind Kulkarni
Please report errors in award information by writing to: awardsearch@nsf.gov.