Award Abstract # 1247750
BIGDATA: Mid-Scale: DCM: Collaborative Research: Eliminating the Data Ingestion Bottleneck in Big-Data Applications

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: RUTGERS, THE STATE UNIVERSITY
Initial Amendment Date: September 20, 2012
Latest Amendment Date: July 31, 2015
Award Number: 1247750
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: February 1, 2013
End Date: January 31, 2017 (Estimated)
Total Intended Award Amount: $400,000.00
Total Awarded Amount to Date: $406,000.00
Funds Obligated to Date: FY 2012 = $400,000.00
FY 2015 = $6,000.00
History of Investigator:
  • Martin Farach-Colton (Principal Investigator)
    mlf9579@nyu.edu
Recipient Sponsored Research Office: Rutgers University New Brunswick
3 RUTGERS PLZ
NEW BRUNSWICK
NJ  US  08901-8559
(848)932-0150
Sponsor Congressional District: 12
Primary Place of Performance: Rutgers University New Brunswick
NJ  US  08901-8559
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): M1LVPE5GLSD9
Parent UEI:
NSF Program(s): Algorithmic Foundations,
Big Data Science & Engineering
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 7924, 7926, 8083, 9251
Program Element Code(s): 779600, 808300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Big-data practice suggests that there is a tradeoff between the speed of data ingestion, the ability to answer queries quickly (e.g., via indexing), and the freshness of data. This perceived tradeoff lies, for example, at the heart of the historic division between OLTP (online transaction processing) and OLAP (online analytical processing). In an OLTP database, data gets ingested quickly and the data available for querying is fresh, but analytical queries run prohibitively slowly. In an OLAP data warehouse, data is buffered for off-line indexing so that analytical queries run quickly, but by the time the data gets indexed, it is stale.

This tradeoff has manifestations in the design of all types of storage systems. For example, some file-systems are optimized for reads and others for writes, but workloads generally involve a mixture of reads and writes.

In this project the PIs show that this is not a fundamental tradeoff, but rather a tradeoff imposed by the choice of data structure. The PIs use write-optimized structures, an alternative to traditional indexing methodologies, to build storage systems in which this tradeoff is significantly mitigated or alleviated altogether. The performance promise of such indexing schemes follows from the PIs previous work establishing that write-optimized data structures can speed up both inserts and queries.
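The write-optimized idea can be illustrated with a toy log-structured merge (LSM) store; this is a hedged sketch of the general buffering technique, not the PIs' actual Bε-tree or BetrFS implementation. Inserts land in a small in-memory buffer in O(1) time; when the buffer fills, it is flushed as one sorted run, so many random writes become one sequential write:

```python
import bisect

class TinyLSM:
    """Toy write-optimized store: inserts go to an in-memory buffer,
    which is flushed to an immutable sorted run when full (LSM-style)."""

    def __init__(self, buffer_limit=4):
        self.buffer_limit = buffer_limit
        self.buffer = {}   # recent writes, O(1) insert
        self.runs = []     # list of sorted (key, value) lists, newest last

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_limit:
            self._flush()

    def _flush(self):
        # one sequential "write" of a sorted run replaces many random writes
        self.runs.append(sorted(self.buffer.items()))
        self.buffer = {}

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for run in reversed(self.runs):  # newest run wins on duplicates
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Queries probe the buffer and then binary-search each run; real write-optimized indexes (Bε-trees, LSM-trees with Bloom filters) keep that query cost low as well, which is the tradeoff this project attacks.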

This project addresses the remaining obstacles in the deployment of write-optimized indexes within big-data file-systems and databases. Big data imposes a new set of constraints on any storage system, and the PIs will show how write-optimized indexing can yield order-of-magnitude performance improvements at scale. In particular, this project will show that such techniques are not only applicable today but that they will scale with hardware trends, including the widespread adoption of solid-state disks (SSDs).

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


(Showing: 1 - 10 of 20)
Alberto Apostolico, Maxime Crochemore, Martin Farach-Colton, Zvi Galil and S. Muthukrishnan "40 years of suffix trees" Commun. ACM , v.59 , 2016 , p.66-73
Bender, Michael A and Bose, Ritwik and Chowdhury, Rezaul and McCauley, Samuel "The kissing problem: how to end a gathering when everyone kisses everyone else goodbye" Theory of Computing Systems Special Issue on Fun With Algorithms , 2013 , p.1-16
Bender, Michael A. and Farach-Colton, Martín and Fekete, Sándor P. and Fineman, Jeremy T. and Gilbert, Seth "Reallocation Problems in Scheduling" Algorithmica , v.73 , 2015 , p.389-409 10.1007/s00453-014-9930-4
Martin Farach-Colton, Antonio Fernández Anta and Miguel A. Mosteiro "Optimal Memory-Aware Sensor Network Gossiping (or How to Break the Broadcast Lower Bound)" Theoretical Computer Science , v.472 , 2013 , p.60-80
Martin Farach-Colton and Meng-Tsung Tsai "Exact Sublinear Binomial Sampling" Algorithmica , v.73 , 2015 , p.637-651
Martin Farach-Colton and Miguel A. Mosteiro "Initializing Sensor Networks of Non-uniform Density in the Weak Sensor Model" Algorithmica , v.73 , 2015 , p.87-114
Meng, Jie and McCauley, Samuel and Kaplan, Fulya and Leung, Vitus and Coskun, Ayse K "Simulation and Optimization of HPC Job Allocation for Reducing Communication and Cooling Costs" Sustainable Computing (SUSCOM) Special Issue for the International Green Computing Conference , v.5 , 2015

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

In big data applications, records often arrive faster than they can be organized by traditional methods.  Two recent technological developments show great promise in helping big data applications deal with high-bandwidth data streams: new algorithms and solid state disks.  However, many challenges remain in using these technologies to their full potential.

In this grant, we were able to show how to use these new technologies to achieve substantial performance gains in file systems and databases.  We were also able to prove that these technologies are not a panacea: some performance improvements are beyond the reach of what we now know, and substantial further improvements in our algorithmic understanding and in our hardware will be necessary to make further progress.

Specifically, we have built a file system called BetrFS.  It is faster for many basic operations than traditional file systems.  Recently, we showed that all but one major category of file systems age, which means that their performance drops as they are used.  The one class that does not age is the so-called write-optimized file systems, and in particular BetrFS.

Bloom filters are a space-economical way to filter out search keys that do not appear in a set.  They have been the subject of intensive research.  We have shown that it is possible to make Bloom-filter alternatives that are both smaller (by 40%) and that interact more efficiently with the memory subsystem of the CPU.
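For readers unfamiliar with the baseline structure, here is a minimal classic Bloom filter sketch (not the smaller, cache-efficient alternative developed in this project): k hash probes into an m-bit array, with possible false positives but never false negatives.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes set/test bits in an m-bit array.
    Membership tests may give false positives, never false negatives."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _probes(self, key):
        # derive k probe positions from k salted SHA-256 hashes
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._probes(key):
            self.bits[p] = True

    def contains(self, key):
        # True means "possibly present"; False means "definitely absent"
        return all(self.bits[p] for p in self._probes(key))
```

Because a key that was added always finds all of its bits set, lookups can safely skip any backing store when `contains` returns False, which is exactly why such filters sit in front of on-disk indexes.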

Querying a database is slower than updating it in most cases.  We have shown how to piggyback queries onto insertions, in cases where the deadline for answering the query is not immediate.  We call our approach lazy analytics.

Finally, we have shown lower bounds on maintaining so-called compound dictionaries.  Consider the case of a (physical) library, where books may be organized by author in one index, by title in another, by call number in a third, etc.  We have shown that fast deletions are not achievable in a system that simultaneously supports optimal queries.

In short, we have achieved a wide range of improvements in the performance of basic infrastructure for large data, and we have shown that some performance is not achievable.


Last Modified: 03/03/2017
Modified by: Martin Farach-Colton
