NSF Award Search: Award # 0953478

CAREER: Advanced Methods for Post-Deployment Debugging

NSF Org:	CCF Division of Computing and Communication Foundations
Recipient:	UNIVERSITY OF WISCONSIN SYSTEM
Initial Amendment Date:	March 8, 2010
Latest Amendment Date:	July 22, 2013
Award Number:	0953478
Award Instrument:	Continuing Grant
Program Manager:	Sol Greenspan, sgreensp@nsf.gov, (703) 292-7841; CCF Division of Computing and Communication Foundations; CSE Directorate for Computer and Information Science and Engineering
Start Date:	March 15, 2010
End Date:	February 29, 2016 (Estimated)
Total Intended Award Amount:	$499,990.00
Total Awarded Amount to Date:	$499,990.00
Funds Obligated to Date:	FY 2010 = $75,301.00; FY 2011 = $106,347.00; FY 2012 = $112,690.00; FY 2013 = $205,652.00
History of Investigator:	Benjamin Liblit (Principal Investigator) liblit@cs.wisc.edu
Recipient Sponsored Research Office:	University of Wisconsin-Madison, 21 N PARK ST STE 6301, MADISON, WI, US 53715-1218, (608) 262-3822
Sponsor Congressional District:	02
Primary Place of Performance:	University of Wisconsin-Madison, 21 N PARK ST STE 6301, MADISON, WI, US 53715-1218
Primary Place of Performance Congressional District:	02
Unique Entity Identifier (UEI):	LCLSJAGTNZQ7
Parent UEI:
NSF Program(s):	Software & Hardware Foundation, SOFTWARE ENG & FORMAL METHODS
Primary Program Source:	01001011DB NSF RESEARCH & RELATED ACTIVIT; 01001112DB NSF RESEARCH & RELATED ACTIVIT; 01001213DB NSF RESEARCH & RELATED ACTIVIT; 01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	1045, 1187, 7944, 9218, HPCC
Program Element Code(s):	779800, 794400
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Computer technology is rapidly permeating all spheres of society. A computer system that affects the lives of thousands or millions of people creates a massive community of users who have an interest in the correct behavior of that system. Widespread interconnectivity means that we now have the ability to tap this potential.

This work confronts the challenge of diagnosing and mitigating concurrency bugs. A suite of novel instrumentation schemes will be developed for monitoring thread interleaving patterns. Coupled with statistical debugging models developed previously, this lets developers identify bad thread interleavings which constitute root causes of program failure. A new approach to coordinated cross-thread random sampling keeps overheads low while still providing ample data for diagnosis. Static analysis will play a role to further reduce instrumentation load. Prior statistical debugging work was content with diagnosis only, but this project will develop a speculative locking strategy, guided by the statistical models, to avoid and thereby mitigate the effects of a variety of concurrency bugs.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Cathrin Weiss, Cindy Rubio González, and Ben Liblit "Database-Backed Program Analysis for Scalable Error Propagation" 37th International Conference on Software Engineering (ICSE 2015), 2015. ISBN: 978-1-4799-1934-5

Dongdong Deng, Guoliang Jin, Marc de Kruijf, Ang Li, Ben Liblit, Shan Lu, Shanxiang Qi, Jinglei Ren, Karthikeyan Sankaralingam, Linhai Song, Yongwei Wu, Mingxing Zhang, Wei Zhang, and Weimin Zheng "Fixing, Preventing, and Recovering From Concurrency Bugs" Science China Information Sciences, 2015

Peter Ohmann and Ben Liblit "CSIclipse: Presenting Crash Analysis Data to Developers" Proceedings of the 2015 Workshop on Eclipse Technology eXchange (ETX 2015), 2015. DOI: 10.1145/2846650.2846651

Peter Ohmann, David Bingham Brown, Ben Liblit, and Thomas Reps "Recovering Execution Data from Incomplete Observations" 13th International Workshop on Dynamic Analysis (WODA), 2015. DOI: 10.1145/2823363.2823368

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Computer technology is rapidly permeating all spheres of society. A computer system that affects the lives of thousands or millions of people creates a massive community of users who have an interest in the correct behavior of that system. Widespread network connectivity means that we now have the ability to tap this potential. The work supported by this grant explored how to harness the power of large user communities to diagnose a particularly pernicious class of computer defects: software concurrency bugs. These elusive bugs arise when a computer system is performing multiple tasks simultaneously and some of those tasks interfere with each other in destructive, unpredictable ways.

To address these problems, we developed Cooperative Crug Isolation (CCI), a low-overhead instrumentation framework to diagnose production-run failures caused by concurrency bugs (crugs). CCI tracks specific thread interleavings at run-time, and uses statistical models to identify strong failure predictors among these. We offer a varied suite of predicates that represent different trade-offs between complexity and fault isolation capability. We also developed variant random sampling strategies that suit different types of predicates and help keep the run-time overhead low. Experiments show that these schemes span a wide spectrum of performance and diagnosis capabilities, each suitable for different usage scenarios.
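To make "uses statistical models to identify strong failure predictors" concrete, here is a minimal sketch of one well-known scoring scheme from earlier statistical debugging work, the Increase score. It is an illustration under stated assumptions, not CCI's implementation: the report does not say which model CCI uses, and the function name, data layout, and predicate names below are all hypothetical. Each run records whether it failed, which predicates were observed (that is, happened to be sampled), and which of those were observed to be true.

```python
from collections import Counter


def increase_scores(runs):
    """Score predicates by how much being true raises the failure rate.

    `runs` is an iterable of (failed, observed, true_preds) tuples, where
    `observed` is the set of predicates sampled at least once in the run
    and `true_preds` is the subset that was seen to be true.
    (Hypothetical data layout, chosen for illustration only.)
    """
    f_true, s_true = Counter(), Counter()  # failing/successful runs where P was true
    f_obs, s_obs = Counter(), Counter()    # failing/successful runs where P was observed
    for failed, observed, true_preds in runs:
        obs_counter = f_obs if failed else s_obs
        true_counter = f_true if failed else s_true
        for p in observed:
            obs_counter[p] += 1
        for p in true_preds:
            true_counter[p] += 1

    scores = {}
    for p in set(f_obs) | set(s_obs):
        n_true = f_true[p] + s_true[p]
        n_obs = f_obs[p] + s_obs[p]
        if n_true == 0:
            continue  # never seen true: no evidence either way
        failure = f_true[p] / n_true  # failure rate among runs where P was true
        context = f_obs[p] / n_obs    # failure rate among runs where P was merely observed
        scores[p] = failure - context
    return scores


# Hypothetical interleaving predicates for one shared variable.
BOTH = {"remote_write_between_local_accesses", "lock_held_at_access"}
RUNS = [
    (True, BOTH, BOTH),
    (True, BOTH, BOTH),
    (False, BOTH, {"lock_held_at_access"}),
    (False, BOTH, {"lock_held_at_access"}),
    (False, BOTH, {"lock_held_at_access"}),
]
SCORES = increase_scores(RUNS)
```

In this toy data the interleaving predicate is true only in failing runs, so it scores well above zero, while the predicate that is true in every run scores zero and would be discarded as uninformative. Sampling enters only through the `observed` sets: a predicate that was not sampled in a run contributes nothing to that run's counts.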

However, recognizing that a bug exists is only half of the battle. Someone still needs to fix the problem’s root cause. This debugging task is difficult in general; it is especially slow and error-prone for concurrency bugs. With NSF’s generous support, we created CFix, a system that automates the repair of concurrency bugs. CFix works with a wide variety of concurrency-bug detectors, including (but definitely not limited to) our own CCI system. For each failure-inducing interleaving reported by a bug detector, CFix first determines a combination of mutual-exclusion and order relationships that, once enforced, can prevent the buggy interleaving. CFix then uses static analysis and testing to determine which synchronization operations to insert, and where, to enforce the desired mutual-exclusion and order relationships, making a best effort to avoid deadlocks and excessive performance losses. CFix also simplifies its own patches by merging fixes for related bugs. Evaluation using four different types of bug detectors and thirteen real-world concurrency-bug cases shows that CFix can successfully patch these cases without causing deadlocks or excessive performance degradation. Patches automatically generated by CFix are of similar quality to those manually written by developers. An award nomination for part of this work cited it as “one of the first papers to attack the problem of automated bug fixing” of any kind. Thus, this represents a major step forward not only for concurrent software but also for reliable computing in general.
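The two kinds of relationship mentioned above, order and mutual exclusion, can be illustrated with a small hand-written example. This is a sketch only: it is not output from CFix and does not reflect how CFix is implemented, and the class and function names are hypothetical. An order relationship ("initialization happens before any use") is enforced here with an event, and a mutual-exclusion relationship ("updates to shared state do not interleave") with a lock.

```python
import threading


class PatchedState:
    """Shared state with the two kinds of synchronization added by hand."""

    def __init__(self):
        self.buffer = None                # unusable until init() has run
        self.total = 0
        self._ready = threading.Event()   # order: init() before any use()
        self._lock = threading.Lock()     # mutual exclusion for the updates

    def init(self):
        self.buffer = []
        self._ready.set()                 # signal that use() may now proceed

    def use(self, item):
        self._ready.wait()                # order enforcement: block until initialized
        with self._lock:                  # mutual exclusion: update both fields atomically
            self.buffer.append(item)
            self.total += item


def run_demo(n_workers=4, items_per_worker=1000):
    state = PatchedState()

    def worker():
        for _ in range(items_per_worker):
            state.use(1)

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in workers:
        t.start()                         # users deliberately start before the initializer
    initializer = threading.Thread(target=state.init)
    initializer.start()
    for t in workers + [initializer]:
        t.join()
    return state.total, len(state.buffer)
```

Without the event, a worker that runs first would call `append` on `None` and crash. Without the lock, the two updates in `use` could interleave across threads. Note that `init` never takes the lock, so the waiting workers cannot deadlock against it; avoiding that kind of interaction is the concern the report describes as a best effort to avoid deadlocks.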

Concurrency is the future: of this there is no doubt. Our ability to maintain software quality in that concurrent future is, however, very much in doubt. The work sponsored by this research grant represents several major steps forward toward safeguarding that future.

Last Modified: 05/26/2016
Modified by: Benjamin R Liblit
