Award Abstract # 1064505
III: Medium: Collaborative Research: Database-As-A-Service for Long Tail Science

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF WASHINGTON
Initial Amendment Date: August 9, 2011
Latest Amendment Date: July 23, 2013
Award Number: 1064505
Award Instrument: Continuing Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2011
End Date: July 31, 2015 (Estimated)
Total Intended Award Amount: $343,024.00
Total Awarded Amount to Date: $343,024.00
Funds Obligated to Date: FY 2011 = $246,257.00
FY 2013 = $96,767.00
History of Investigator:
  • Bill Howe (Principal Investigator)
    billhowe@uw.edu
  • Dan Suciu (Co-Principal Investigator)
Recipient Sponsored Research Office: University of Washington
4333 BROOKLYN AVE NE
SEATTLE
WA  US  98195-1016
(206)543-4043
Sponsor Congressional District: 07
Primary Place of Performance: University of Washington
4333 BROOKLYN AVE NE
SEATTLE
WA  US  98195-1016
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): HD1WMN6945W6
Parent UEI:
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001112DB NSF RESEARCH & RELATED ACTIVIT
01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7924
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Scientific applications now involve tremendous amounts of data, making database management a critical issue, yet database technology is not keeping pace. This problem is especially acute in the long tail of science: the large number of relatively small labs and individual researchers who collectively produce the majority of scientific results. These researchers lack the IT staff and specialized skills to deploy technology at scale, but have begun to routinely access hundreds of files and potentially terabytes of data to answer a scientific question. This project develops the architecture for a database-as-a-service platform for science. It explores techniques to automate the remaining barriers to use: ingesting data from native sources and automatically bootstrapping an initial set of queries and visualizations, in part by aggressively mining a shared corpus of data, queries, and user activity. It investigates methods to extract global knowledge and patterns while offering scientists access control over their data and some formal privacy guarantees. The Intellectual Merit of this proposal consists of automating non-trivial cognitive tasks associated with data work: information extraction from unstructured data sources, data cleaning, logical schema design, privacy control, visualization, and application-building. As Broader Impacts, the project helps scientists reduce the proportion of time spent "handling data" rather than "doing science." All software resulting from this project is open source, and all findings are disseminated broadly through publications and workshops. Sustainable support for science users of the software is coordinated through the University of Washington eScience Institute. The research is incorporated in both undergraduate and graduate computer science courses, and the software is also incorporated into domain science courses.
The project's outreach activities include advising students through special programs geared toward under-represented groups such as the CRA-W DREU. More information about this project is found at http://escience.washington.edu/dbaas.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Abdussalam Alawini, David Maier, Kristin Tufte, Bill Howe, Rashmi Nandikur "Towards automated prediction of relationships among scientific datasets." Conference on Scientific and Statistical Database Management 2015 , 2015
Bill Howe, Daniel Halperin "Advancing Declarative Query in the Long Tail of Science" Data Engineering Bulletin , v.35 , 2012 , p.16-26
Bill Howe, Daniel Halperin, Francois Ribalet, Sagar Chitnis, and E. Virginia Armbrust "Collaborative Science Workflows in SQL" Computing in Science and Engineering , v.15 , 2013 , p.22-31
Christopher Re, Dan Suciu "Understanding cardinality estimation using entropy maximization" ACM Trans. Database Syst. (TODS) , v.37(1):6 , 2012 10.1145/1807085.1807095
K. Wongsuphasawat, D. Moritz, A. Anand, J. Mackinlay, B. Howe, J. Heer "Voyager: Exploratory analysis via faceted browsing of visualization recommendations" Visualization and Computer Graphics, IEEE Transactions , v.22 , 2015 , p.649

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project, jointly undertaken by University of Washington, University of Michigan and Portland State University, explored new systems and algorithms aimed at delivering advanced database capabilities to the long tail of science -- the smaller labs and individual researchers who have increasingly extreme data requirements but limited IT staff.

We designed and deployed SQLShare, an open source, database-as-a-service platform that reduced the up-front costs of using database technology (installation, configuration, schema design, data ingest) to better fit a research environment where both questions and data sources change rapidly, and to lower barriers to adoption of data management technology. The query and data workload collected over the multi-year lifetime of the project revealed that scientists tend to analyze data in bursts -- a few complex queries over messy, short-lived datasets -- motivating new kinds of database systems, languages, and algorithms. The workload has been released publicly for study by other researchers and represents a primary deliverable of this project.

To complement SQLShare, we developed the VizDeck automatic visualization system. SQLShare enables working with hundreds or thousands of distinct datasets, shifting the bottleneck of insight to visual analysis. VizDeck automatically recommends appropriate visualizations based on the statistical and structural properties of the data, allowing researchers to spend less time crafting visualizations and more time answering questions. In our analysis, we found that the automatic visualization techniques in VizDeck significantly outperformed some commercial systems for visual data-exploration tasks. The VizDeck approach helped usher in a number of important projects in automatic visualization in both the database and visualization communities.
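
As a rough illustration of recommending visualizations from the statistical and structural properties of data, the following Python sketch proposes chart types from column kinds and distinct-value counts. It is not VizDeck's actual method: the rules, the MAX_BARS threshold, and the names column_kind and recommend are all assumptions made for this example.

```python
# Illustrative sketch only; the rules below are assumptions, not VizDeck's algorithm.

MAX_BARS = 12  # hypothetical cutoff: skip bar charts with too many categories


def column_kind(values):
    """Classify a column as 'numeric' or 'categorical' from its values."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "categorical"


def recommend(table):
    """Return candidate charts for a table given as {column_name: [values]}."""
    names = list(table)
    kinds = {name: column_kind(table[name]) for name in names}
    recs = []
    # Single-column charts: histograms for numbers, bar charts for small categories.
    for name in names:
        if kinds[name] == "numeric":
            recs.append(("histogram", name))
        elif len(set(table[name])) <= MAX_BARS:
            recs.append(("bar chart", name))
    # Two-column charts: a scatter plot for each pair of numeric columns.
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if kinds[x] == "numeric" and kinds[y] == "numeric":
                recs.append(("scatter plot", x, y))
    return recs


if __name__ == "__main__":
    sample = {
        "depth": [1.0, 2.5, 4.0],
        "temp": [10.2, 9.8, 9.1],
        "station": ["A", "B", "A"],
    }
    for rec in recommend(sample):
        print(rec)
```

A real recommender would also rank candidates, for example by how informative each chart is likely to be; this sketch only enumerates them.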

We developed the ReConnect and ReDiscover technologies to attack a different stage in the data pipeline: preparing data for publication and sharing in online databases, such as SQLShare. We observed that scientists work with multiple variants and versions of datasets, often represented as spreadsheets, and that determining which datasets in a collection are the most appropriate to upload, combine, analyze and share is an impediment both to sharing and to the application of database technology. We posited that helping a researcher retroactively determine the relationships between datasets could facilitate the process. The ReConnect technology helps scientists determine pairwise connections between datasets, such as containment and complementarity. However, such pairwise analysis still leaves a large burden on the researcher working with even a modest-sized collection of datasets, motivating the ReDiscover system, which applies data profiling, schema matching and machine learning techniques to automate the process of predicting relationships between datasets.
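
To make the idea of pairwise relationships such as containment and complementarity concrete, here is a minimal Python sketch. It is not the ReConnect implementation: the Dataset type, the relate function, and the simplified definitions used (containment as a strict row subset under identical columns, complementarity as identical columns with no shared rows) are assumptions chosen for illustration.

```python
# Illustrative sketch only; definitions are simplified assumptions, not ReConnect's.
from typing import FrozenSet, NamedTuple, Tuple


class Dataset(NamedTuple):
    """A dataset reduced to its column names and a set of row tuples."""
    columns: Tuple[str, ...]
    rows: FrozenSet[tuple]


def relate(a: Dataset, b: Dataset) -> str:
    """Classify the relationship between two datasets."""
    if a.columns != b.columns:
        return "incompatible schemas"
    if a.rows == b.rows:
        return "duplicate"
    if a.rows < b.rows:  # strict subset: every row of a also appears in b
        return "a contained in b"
    if b.rows < a.rows:
        return "b contained in a"
    if a.rows.isdisjoint(b.rows):  # same schema, no shared rows
        return "complementary"
    return "overlapping"


if __name__ == "__main__":
    cols = ("sample_id", "value")
    v1 = Dataset(cols, frozenset({(1, "x")}))
    v2 = Dataset(cols, frozenset({(1, "x"), (2, "y")}))
    v3 = Dataset(cols, frozenset({(3, "z")}))
    print(relate(v1, v2))
    print(relate(v1, v3))
```

Running a check like this over every pair in a collection grows quadratically with the number of datasets, which is one way to see why pairwise analysis alone still leaves a large burden on the researcher.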

We developed the Senbazuru Spreadsheet Database Management System to supply database-style search-and-query capabilities over collections of heterogeneous spreadsheets. The system provides automatic information extraction from a class of spreadsheets with complex implicit relational structures, efficient manual repair of the brittle metadata that extraction produces, and a suite of algorithms and UI tools on mobile and web platforms to support keyword search, structured query, and rich integration over the extracted information.


Last Modified: 12/28/2015
Modified by: Bill Howe

Please report errors in award information by writing to: awardsearch@nsf.gov.