Award Abstract # 1447720
BIGDATA: F: DKM: Collaborative Research: Making Big Data Active: From Petabytes to Megafolks in Milliseconds

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF CALIFORNIA IRVINE
Initial Amendment Date: August 26, 2014
Latest Amendment Date: August 26, 2014
Award Number: 1447720
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2014
End Date: June 30, 2019 (Estimated)
Total Intended Award Amount: $784,380.00
Total Awarded Amount to Date: $784,380.00
Funds Obligated to Date: FY 2014 = $784,380.00
History of Investigator:
  • Michael Carey (Principal Investigator)
    mjcarey@ics.uci.edu
  • Nalini Venkatasubramanian (Co-Principal Investigator)
Recipient Sponsored Research Office: University of California-Irvine
160 ALDRICH HALL
IRVINE
CA  US  92697-0001
(949)824-7295
Sponsor Congressional District: 47
Primary Place of Performance: University of California-Irvine
4199 Campus Dr Ste 300 (CompSci)
Irvine
CA  US  92617-3067
Primary Place of Performance Congressional District: 47
Unique Entity Identifier (UEI): MJC5FCYQTPE6
Parent UEI: MJC5FCYQTPE6
NSF Program(s): Information Technology Research
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 8083, 7433, 1640
Program Element Code(s): 164000
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

A wealth of digital information is being generated daily through social networks, blogs, online communities, news sources, and mobile applications in an increasingly sensed world. Organizations and researchers recognize that tremendous value and insight can be gained by capturing this emerging data and making it available for querying and analysis. First-generation Big Data management efforts have been passive in nature -- queries, updates, and/or analysis tasks were mainly scaled to handle very large volumes of data. In contrast, this project will develop new techniques for continuously and reliably capturing Big Data collections (arising from social, mobile, Web, and sensed data sources) and will enable timely delivery of the right information to the relevant end users. In short, this project will provide a scalable foundation for moving from Big Passive Data to Big Active Data. Techniques should be developed to enable the accumulation and monitoring of petabytes of data of potential interest to millions of end users; when "interesting" new data appears, it should be delivered to end users in a time frame measured in (100's of) milliseconds. This project will build such an Active Big Data Management system and make it available as open source to the community. Students will be trained in technologies related to Big Active Data management and applications; such training is critical to addressing the information explosion that social media and the mobile Web are driving today. The general-purpose foundation for active information dissemination from Big Data will have broader impacts in areas such as public safety and public health.

There are many challenges involved in building a foundation for Big Active Data. On the "data in" side, these include resource management in very large-scale, LSM-based storage systems and the provision of a highly available, elastic facility for fast data ingestion. On the "data processing" side, challenges include the parallel evaluation of a large number of declarative data subscriptions over multiple highly partitioned data sets. Amplifying this challenge is a need to efficiently support spatial, temporal, and similarity predicates in data subscriptions. Big Data also makes result ranking and diversification techniques critical in order for large result sets to be manageable. On the "data out" side, challenges include the reliable and timely dissemination of data of interest to a sometimes-connected subscriber base of unprecedented scale. As a software base, this project will be jump-started by using AsterixDB (http://asterixdb.ics.uci.edu/), an open-source Big Data Management System that supports the scalable storage, searching, and analysis of mass quantities of semi-structured data.
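To make the "data processing" challenge concrete, the following is a minimal, self-contained sketch (all names and data are hypothetical, not from the BAD system itself) of evaluating many declarative subscriptions, each carrying a spatial predicate, against a newly arriving data item. A real system would use a spatial index and evaluate subscriptions in parallel over partitioned data rather than a brute-force scan.

```python
import math

def within_radius(lat1, lon1, lat2, lon2, km):
    # Haversine distance between two points, compared against a radius in km.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a)) <= km

# Each subscription pairs a subscriber id with a spatial predicate (illustrative data).
subscriptions = [
    {"user": "u1", "lat": 33.64, "lon": -117.84, "radius_km": 5.0},  # near UCI
    {"user": "u2", "lat": 40.71, "lon": -74.00, "radius_km": 5.0},   # near NYC
]

def match(item):
    # Brute-force predicate evaluation over all subscriptions.
    return [s["user"] for s in subscriptions
            if within_radius(item["lat"], item["lon"],
                             s["lat"], s["lon"], s["radius_km"])]

report = {"id": 7, "lat": 33.65, "lon": -117.85, "kind": "road closure"}
print(match(report))  # only the nearby subscriber matches
```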

For further information see the project web sites at https://www.ics.uci.edu/BigActiveData and http://www.cs.ucr.edu/~tsotras/BigActiveData

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).


(Showing: 1 - 10 of 30)
C. Luo and M. Carey "Efficient Data Ingestion and Query Processing for LSM-Based Storage Systems" PVLDB , v.12 , 2019
C. Pavlopoulou, E. Carman, T. Westmann, M. Carey, and V. Tsotras "A Parallel and Scalable Processor for JSON Data" Proc. Int'l. Conf. on Extending Database Technology (EDBT), Vienna, Austria , 2018
H. Nguyen, M. Sarwar Uddin, and N. Venkatasubramanian: "Multistage Adaptive Load Balancing for Big Active Data Publish Subscribe Systems" Proc. 13th ACM Int'l. Conf. on Distributed and Event-based Systems (DEBS), Darmstadt, Germany, June 24-28, 2019 , 2019
I. Absalyamov, M. Carey, and V. Tsotras "Lightweight Cardinality Estimation in LSM-Based Systems" Proc. ACM SIGMOD Int'l. Conf. on Management of Data, Houston, TX , 2018
J. Jia, C. Li, and M. Carey "Drum: A Rhythmic Approach to Interactive Analytics on Large Data" Proc. IEEE Int'l. Conf. on Big Data, Boston, MA , 2017
M. Carey, S. Jacobs, and V. Tsotras "Breaking BAD: A Data Serving Vision for Big Active Data" Proc. 10th ACM Int'l. Conf. on Distributed and Event-Based Systems (DEBS '16), Irvine, CA , 2016
Md Yusuf Sarwar Uddin and N. Venkatasubramanian "Edge Caching for Enriched Notifications Delivery in Big Active Data" Proc. 38th IEEE Int'l. Conf. on Distributed Computing Systems (ICDCS), July 2-5, 2018, Vienna, Austria , 2018

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

First-generation Big Data management efforts brought frameworks such as Hadoop and Spark that focus on after-the-fact data analytics, document stores that provide scalable key-based record storage and retrieval, and a handful of specialized systems for problems like graph analytics and data stream analysis. The BAD project has aimed to continuously and reliably capture Big Data arising from social, mobile, Web, and sensed data sources and enable timely delivery of information to users with indicated interests. Our aim was to develop techniques to enable the accumulation and monitoring of petabytes of data of potential interest to millions of end users; when "interesting" new data appears, it should be delivered to end users in a timeframe measured in (100's of) milliseconds. The effort involved challenges related to parallel databases, Big Data platforms, stream data management, and publish/subscribe systems. It required scaling out solutions to individual problems as well as creating a coherent overall software architecture.

A priori requirements driving the project were:

  • Incoming data items might not be important in isolation, but in their relationship(s) to other items in the data. Subscriptions must therefore consider data in context, not just the newly arriving items' content.
  • Important information may be missing in the incoming items, instead existing elsewhere within the data as a whole. Results delivered to users must be enriched with other data to provide actionable information to each user.
  • Historical queries and analyses over collected data often yield important insights. Retrospective Big Data analytics must therefore be supported as well.
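The first two requirements above can be sketched in a few lines: an incoming item is judged in context (here, by counting related recent items) and is enriched by joining with stored reference data before delivery. All names and data below are hypothetical illustrations, not BAD system APIs.

```python
# Reference dataset already stored in the system (illustrative).
shelters = {
    "92617": {"name": "Campus Gym", "address": "123 Aldrich Way"},
}
# Recently ingested items used to judge a new item in context.
recent_reports = [{"zip": "92617", "kind": "fire"},
                  {"zip": "92617", "kind": "fire"}]

def enrich(report, threshold=2):
    # Context: the new report matters only if enough related items exist.
    related = [r for r in recent_reports if r["zip"] == report["zip"]]
    if len(related) + 1 < threshold:
        return None
    # Enrichment: join with stored data so the delivered result is actionable.
    return {**report,
            "related_count": len(related),
            "nearest_shelter": shelters.get(report["zip"])}

print(enrich({"zip": "92617", "kind": "fire"}))
```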

Based on those requirements, we designed, built, and evaluated components of a BAD Platform prototype - based on extending Apache AsterixDB - including:

  • A user model and language based on parameterized, query-based channels that actively deliver data of interest to interested channel subscribers.
  • Optimizations to AsterixDB to support rapid continuous data ingestion, and development of an "Active Toolkit" to extend AsterixDB with additional capabilities to support BAD, including richer data feeds and a variety of optimizations to support a scalable subscriber base.
  • A distributed Broker Network that coordinates and manages a large volume of end-user data subscriptions and results, including caching and load balancing mechanisms to support a highly scalable user base.
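The interplay of the first and third components can be sketched as follows: one channel definition, many subscribers, each bound to its own parameter, with a broker that caches each distinct result so subscribers sharing parameters are served from one computation. This is an illustrative Python sketch with hypothetical names, not the BAD channel language or broker API.

```python
# Illustrative event store standing in for ingested data.
events = [{"city": "Irvine", "msg": "flood watch"},
          {"city": "Riverside", "msg": "heat advisory"}]

def channel_query(city):
    # The channel body: a query parameterized per subscriber.
    return tuple(e["msg"] for e in events if e["city"] == city)

class Broker:
    def __init__(self):
        self.subs = {}    # user -> channel parameter
        self.cache = {}   # parameter -> cached channel result

    def subscribe(self, user, city):
        self.subs[user] = city

    def deliver(self):
        out = {}
        for user, city in self.subs.items():
            if city not in self.cache:        # compute once per distinct parameter
                self.cache[city] = channel_query(city)
            out[user] = self.cache[city]
        return out

b = Broker()
b.subscribe("alice", "Irvine")
b.subscribe("bob", "Irvine")       # shares alice's cached result
b.subscribe("carol", "Riverside")
print(b.deliver())
```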

To summarize the project's technology contributions, we have championed a novel Big Data paradigm, Big Active Data, that merges Big Data Management with active data handling capabilities. We built a BAD system prototype, starting from a modern Big Data Platform (Apache AsterixDB), and have demonstrated that it can outperform passive Big Data approaches by an order (or two) of magnitude in practical scenarios. The BAD system is able to consider data in context and to enrich results in ways unavailable in other platforms, and in addition allows for retrospective Big Data analytics. The code (over 20,000 LOC) is available as an open-source Apache project.

Big Data and information dissemination technologies like BAD are crucial for the next generation of computer science students. Working on this project, and with the artifacts that it has produced, has enabled such training at UCI, UCR, and elsewhere. Multiple undergraduate and graduate students have gone on to jobs in industry (at Google, Amazon, Facebook, LinkedIn, and others), and the BAD postdoctoral researcher is now an Assistant Professor of CS at a US research university. The technology developed in this project also has potential societal benefits in domains such as public health, national security, and public safety.

Last Modified: 09/22/2019
Modified by: Michael Carey

Please report errors in award information by writing to: awardsearch@nsf.gov.
