
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | August 9, 2011 |
Latest Amendment Date: | July 23, 2013 |
Award Number: | 1064505 |
Award Instrument: | Continuing Grant |
Program Manager: | Sylvia Spengler, sspengle@nsf.gov, (703)292-7347, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 1, 2011 |
End Date: | July 31, 2015 (Estimated) |
Total Intended Award Amount: | $343,024.00 |
Total Awarded Amount to Date: | $343,024.00 |
Funds Obligated to Date: |
FY 2013 = $96,767.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
4333 BROOKLYN AVE NE SEATTLE WA US 98195-1016 (206)543-4043 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
4333 BROOKLYN AVE NE SEATTLE WA US 98195-1016 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: |
01001314DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
With tremendous amounts of data existing in scientific applications, database management becomes a critical issue, but database technology is not keeping pace. This problem is especially acute in the long tail of science: the large number of relatively small labs and individual researchers who collectively produce the majority of scientific results. These researchers lack the IT staff and specialized skills to deploy technology at scale, but have begun to routinely access hundreds of files and potentially terabytes of data to answer a scientific question. This project develops the architecture for a database-as-a-service platform for science. It explores techniques to automate the remaining barriers to use: ingesting data from native sources and automatically bootstrapping an initial set of queries and visualizations, in part by aggressively mining a shared corpus of data, queries, and user activity. It investigates methods to extract global knowledge and patterns while offering scientists access control over their data, and some formal privacy guarantees. The Intellectual Merit of this proposal consists of automating non-trivial cognitive tasks associated with data work: information extraction from unstructured data sources, data cleaning, logical schema design, privacy control, visualization, and application-building. As Broader Impacts, the project helps scientists reduce the proportion of time spent "handling data" rather than "doing science." All software resulting from this project is open source, and all findings are disseminated broadly through publications and workshops. Sustainable support for science users of the software is coordinated through the University of Washington eScience Institute. The research is incorporated in both undergraduate and graduate computer science courses, and the software is also incorporated into domain science courses.
The project's outreach activities include advising students through special programs geared toward under-represented groups such as the CRA-W DREU. More information about this project is found at http://escience.washington.edu/dbaas.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project, jointly undertaken by University of Washington, University of Michigan and Portland State University, explored new systems and algorithms aimed at delivering advanced database capabilities to the long tail of science -- the smaller labs and individual researchers who have increasingly extreme data requirements but limited IT staff.
We designed and deployed SQLShare, an open source database-as-a-service platform that reduced the up-front costs of using database technology (installation, configuration, schema design, data ingest) to better fit a research environment where both questions and data sources change rapidly, and to lower barriers to adoption of data management technology. The query and data workload collected over the multi-year lifetime of the project revealed that scientists tend to analyze data in bursts -- a few complex queries over messy, short-lived datasets -- motivating new kinds of database systems, languages, and algorithms. The workload has been released publicly for study by other researchers and represents a primary deliverable of this project.
To complement SQLShare, we developed the VizDeck automatic visualization system. SQLShare enables working with hundreds or thousands of distinct datasets, shifting the bottleneck of insight to visual analysis. VizDeck automatically recommends appropriate visualizations based on the statistical and structural properties of the data, allowing researchers to spend less time crafting visualizations and more time answering questions. In our analysis, we found that the automatic visualization techniques in VizDeck significantly outperformed some commercial systems for visual data-exploration tasks. The VizDeck approach helped usher in a number of important projects in automatic visualization in both the database and visualization communities.
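As a rough illustration of what recommending visualizations from properties of the data can mean, the sketch below uses a few hand-written rules over column types. It is not VizDeck's actual method; the function names, the chart labels, and the `max_categories` threshold are all assumptions made for this example.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Table = Dict[str, List[str]]          # column name -> raw cell values
Suggestion = Tuple[str, Tuple[str, ...]]


def column_kind(values: List[str]) -> str:
    """Classify a column as 'numeric' if every cell parses as a float."""
    try:
        for v in values:
            float(v)
        return "numeric"
    except (TypeError, ValueError):
        return "categorical"


def suggest_charts(table: Table, max_categories: int = 12) -> List[Suggestion]:
    """Hypothetical rule set: propose chart types from column kinds.

    Assumed rules (illustrative only):
      - one numeric column            -> histogram
      - two numeric columns           -> scatter plot
      - small categorical + numeric   -> bar chart
    """
    kinds = {name: column_kind(vals) for name, vals in table.items()}
    numeric = [n for n, k in kinds.items() if k == "numeric"]
    categorical = [n for n, k in kinds.items() if k == "categorical"]

    suggestions: List[Suggestion] = []
    for n in numeric:
        suggestions.append(("histogram", (n,)))
    for x, y in combinations(numeric, 2):
        suggestions.append(("scatter", (x, y)))
    for c in categorical:
        # Skip categorical columns with too many distinct values to plot.
        if len(set(table[c])) <= max_categories:
            for n in numeric:
                suggestions.append(("bar", (c, n)))
    return suggestions


if __name__ == "__main__":
    sample = {
        "depth": ["1", "2", "3"],
        "temp": ["10.5", "9.8", "9.1"],
        "station": ["A", "B", "A"],
    }
    for chart, columns in suggest_charts(sample):
        print(chart, columns)
```

A real recommender would also need to rank the candidates, for example by how informative each chart is likely to be; this sketch only enumerates them.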
We developed the ReConnect and ReDiscover technologies to attack a different stage in the data pipeline: preparing data for publication and sharing in online databases, such as SQLShare. We observed that scientists work with multiple variants and versions of datasets, often represented as spreadsheets, and that determining which datasets in a collection are the most appropriate to upload, combine, analyze and share is an impediment to sharing and the application of database technology. We posited that helping a researcher retroactively determine the relationships between datasets could facilitate the process. The ReConnect technology helps scientists determine the pairwise connections between datasets such as containment and complementarity. However, such pairwise analysis still leaves a large burden on the researcher working with even a modest-sized collection of datasets, motivating the ReDiscover system, which applies data profiling, schema matching and machine learning techniques to automate the process of predicting relationships between datasets.
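To make the idea of pairwise connections concrete, here is a minimal sketch that labels two small datasets as duplicates, contained, or complementary. The definitions used are simplifying assumptions for illustration (same columns required; containment means one row set is a strict subset of the other; complementarity means the row sets are disjoint) and are not taken from the ReConnect implementation. All names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Dataset = List[Dict[str, str]]  # one dict per spreadsheet row


def _columns(ds: Dataset) -> FrozenSet[str]:
    """Collect every column name that appears in any row."""
    cols: Set[str] = set()
    for row in ds:
        cols.update(row)
    return frozenset(cols)


def _rows(ds: Dataset, cols: List[str]) -> Set[Tuple]:
    """Represent each row as a tuple in a fixed column order."""
    return {tuple(row.get(c) for c in cols) for row in ds}


def pairwise_relationship(a: Dataset, b: Dataset) -> str:
    """Label the relationship between two datasets (assumed definitions)."""
    cols_a, cols_b = _columns(a), _columns(b)
    if cols_a != cols_b:
        return "different schema"
    cols = sorted(cols_a)
    rows_a, rows_b = _rows(a, cols), _rows(b, cols)
    if rows_a == rows_b:
        return "duplicate"
    if rows_a < rows_b:
        return "a contained in b"
    if rows_b < rows_a:
        return "b contained in a"
    if rows_a.isdisjoint(rows_b):
        return "complementary"
    return "overlapping"


if __name__ == "__main__":
    spring = [{"site": "A", "temp": "1"}]
    full = [{"site": "A", "temp": "1"}, {"site": "B", "temp": "2"}]
    autumn = [{"site": "C", "temp": "3"}]
    print(pairwise_relationship(spring, full))    # a contained in b
    print(pairwise_relationship(spring, autumn))  # complementary
```

Comparing every pair this way grows quadratically with the size of the collection, which is one way to see why manual pairwise analysis becomes a burden.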
We developed the Senbazuru Spreadsheet Database Management System to focus on supplying database-style search-and-query capabilities over collections of heterogeneous spreadsheets. The system provides automatic information extraction from a class of spreadsheets with complex implicit relational structures, efficient manual repair for cases where the extracted metadata is brittle, and a suite of algorithms and UI tools on mobile and web platforms to support keyword search, structured query, and rich integration over the extracted information.
Last Modified: 12/28/2015
Modified by: Bill Howe