NSF Award Search: Award # 2027516 - BIGDATA: Collaborative Research: F: Holistic Optimization of Data-Driven Applications

Award Abstract # 2027516

BIGDATA: Collaborative Research: F: Holistic Optimization of Data-Driven Applications

NSF Org:	IIS Division of Information & Intelligent Systems
Recipient:	REGENTS OF THE UNIVERSITY OF CALIFORNIA, THE
Initial Amendment Date:	April 22, 2020
Latest Amendment Date:	April 22, 2020
Award Number:	2027516
Award Instrument:	Standard Grant
Program Manager:	Hector Munoz-Avila hmunoz@nsf.gov (703)292-4481 IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering
Start Date:	April 1, 2020
End Date:	September 30, 2021 (Estimated)
Total Intended Award Amount:	$200,432.00
Total Awarded Amount to Date:	$200,432.00
Funds Obligated to Date:	FY 2015 = $200,432.00
History of Investigator:	Alvin Cheung (Principal Investigator) akcheung@cs.berkeley.edu
Recipient Sponsored Research Office:	University of California-Berkeley 1608 4TH ST STE 201 BERKELEY CA US 94710-1749 (510)643-3891
Sponsor Congressional District:	12
Primary Place of Performance:	University of California-Berkeley CA US 94710-1749
Primary Place of Performance Congressional District:	12
Unique Entity Identifier (UEI):	GS3YEVSS12N6
Parent UEI:
NSF Program(s):	Big Data Science & Engineering
Primary Program Source:	01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7433, 8083
Program Element Code(s):	808300
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

We interact with online shopping and banking websites on a daily basis. Many of these websites are powered by data-driven applications. Such an application often consists of two parts: an application hosted on an application server, and a database management system (DBMS) that maintains persistent data and is hosted on a separate server. Unfortunately, many data-driven applications suffer from performance problems, such as taking a long time to load a page or being unable to scale up to serve a large number of clients simultaneously. The state of the art in discovering and fixing performance problems in data-driven applications is to examine the two parts of the application separately, and doing so misses many opportunities to discover and fix such problems. Unlike prior approaches, in this project we will treat the DBMS and the application in tandem. In particular, we will devise new techniques and tools to help identify performance problems, understand the cause of such problems, and fix them automatically. This project will open up new opportunities in cross-layer program compilation and optimization, with the practical goal of improving the performance of data-driven applications, which will have a significant impact on many aspects of our daily lives. The findings from this project will be incorporated into undergraduate and graduate software engineering, introduction to data management, and compiler classes to be offered at the University of Chicago and the University of Washington. The outreach activities of this project will include engaging and advising students through special programs geared toward under-represented groups, such as the Distributed Research Experiences for Undergraduates (DREU) and the Diversity Workshops, both organized by CRA-W (Computing Research Association -- Women).

Specifically, the proposed research consists of three thrusts: (1) a new cross-layer program analysis framework that produces an end-to-end profile of data-driven applications by understanding the application code, the queries that the application sends to the DBMS, and how the DBMS processes such queries; (2) a program analysis and testing framework that identifies performance problems in data-driven applications by leveraging the end-to-end profile created in (1); and (3) new means to optimize data-driven applications by transforming both the application code and the queries that are issued. These three thrusts will work together to improve the performance of data-driven applications and help programmers detect performance problems during development. Software developed by this project, benchmarks used for evaluation, and performance comparisons with existing techniques will be released to the public domain through the project website. Further information will be available at the project website (https://people.eecs.berkeley.edu/~akcheung/coopt.html).

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Cong Yan, Alvin Cheung, Junwen Yang, Shan Lu "View-Driven Optimization of Database-Backed Web Applications" Conference on Innovative Data Systems Research, 2020

Gabriel Matute, Alvin Cheung, Sarah Chasins "Change in Software Ecosystems: Social Challenges of Automating Upgrades" PLATEAU workshop, 2021

Junwen Yang, Utsav Sethi "Managing data constraints in database-backed web applications" ICSE, 2020 10.1145/3377811.3380375

Junwen Yang, Utsav Sethi, Cong Yan, Shan Lu, Alvin Cheung "Managing Data Constraints in Database-Backed Web Applications" International Conference on Software Engineering, 2020 10.1145/3377812.3390798

Cong Yan, Alvin Cheung "Generating application-specific data layouts for in-memory databases" Proceedings of the VLDB Endowment, v.12, 2019 10.14778/3342263.3342630

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project investigated techniques to identify performance problems in data-driven applications and to improve their performance. Such applications are prevalent in our daily lives: essentially all web pages are data-driven applications in which data is stored persistently in databases and is manipulated and retrieved as the webpage loads.

During this project, we studied three specific aspects of this problem. First, we performed a comprehensive study of 12 representative real-world data-driven applications that are built on top of object-relational mapping (ORM) frameworks. We generalized 9 ORM performance anti-patterns from more than 200 performance issues that we obtained by studying the applications' bug-tracking systems and profiling their latest versions. To demonstrate the impact of these anti-patterns, we manually fixed 64 performance issues in the latest versions and obtained a median speedup of 2x (and up to 39x) with fewer than 5 lines of code change in most cases. Many of the issues we found have been confirmed by developers, and we have implemented ways to identify other code fragments with similar issues as well.

Next, we recognized that many modern database-backed web applications are built upon Object Relational Mapping (ORM) frameworks. While such frameworks ease application development by abstracting persistent data as objects, this convenience comes with a performance cost. In addition to the study above, we studied another 27 real-world open-source applications built on top of the popular Ruby on Rails ORM framework, with the goal of understanding the database-related performance inefficiencies in these applications. We discovered a number of inefficiencies, ranging from physical design issues to how queries are expressed in the application code. We applied static program analysis to identify these issues and measure how prevalent they are, then suggested techniques to alleviate them and measured the potential performance gain as a result.
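
As an illustration of how the way queries are expressed in application code can affect performance, the sketch below contrasts issuing one query per object with issuing a single joined query. This is a commonly discussed ORM inefficiency chosen for illustration; the report does not state that it is one of the issues studied. The schema, the table names (users, posts), and all function names are hypothetical, and Python's built-in sqlite3 module stands in for an ORM-backed web application.

```python
import sqlite3


def build_db():
    """Create a small in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
        INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
        INSERT INTO posts VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
        """
    )
    return conn


def titles_one_query_per_user(conn):
    """Fetch users, then issue a separate query for each user's posts."""
    queries = 0
    result = {}
    users = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    queries += 1
    for user_id, name in users:
        rows = conn.execute(
            "SELECT title FROM posts WHERE user_id = ? ORDER BY id", (user_id,)
        ).fetchall()
        queries += 1  # one extra round trip per user
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_join(conn):
    """Fetch the same data with one joined query."""
    rows = conn.execute(
        """
        SELECT u.name, p.title
        FROM users u LEFT JOIN posts p ON p.user_id = u.id
        ORDER BY u.id, p.id
        """
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # users without posts yield a NULL title
            result[name].append(title)
    return result, 1


if __name__ == "__main__":
    conn = build_db()
    slow, n_slow = titles_one_query_per_user(conn)
    fast, n_fast = titles_single_join(conn)
    print(slow == fast, n_slow, n_fast)
```

Both functions return the same mapping from user name to post titles, but the first issues one query per user plus one for the user list, so its query count grows with the number of users, while the second always issues a single query.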

Finally, web developers face the demanding task of designing informative web pages while keeping the page-load time low. This task has become increasingly challenging as most web content is now generated by processing ever-growing amounts of user data stored in back-end databases. It is difficult for developers to understand the cost of generating every web-page element, not to mention explore and pick the web design with the best trade-off between performance and functionality. In response, we built Panorama, a view-centric and database-aware development environment for web developers. Using database-aware program analysis and a novel IDE design, Panorama provides developers with intuitive information about the cost and the performance-enhancing opportunities behind every HTML element, and suggests various global code refactorings that enable developers to easily explore a wide spectrum of performance and functionality trade-offs.

The code and datasets created in this project have been released as open source: https://hyperloop-rails.github.io. Moreover, concepts developed in this project have been incorporated into courses taught by the PIs at both the undergraduate and graduate levels. The results have also been published and presented at top-tier venues in the software engineering and data management research communities.

Last Modified: 02/06/2022
Modified by: Alvin Cheung
