Award Abstract # 1319984
CSR: Small: Yesterday's News: Theory of Staleness under Data Churn

NSF Org: CNS, Division of Computer and Network Systems
Recipient: TEXAS A&M ENGINEERING EXPERIMENT STATION
Initial Amendment Date: August 26, 2013
Latest Amendment Date: August 26, 2013
Award Number: 1319984
Award Instrument: Standard Grant
Program Manager: Marilyn McClure
  mmcclure@nsf.gov
  (703) 292-5197
  CNS, Division of Computer and Network Systems
  CSE, Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2013
End Date: August 31, 2017 (Estimated)
Total Intended Award Amount: $473,420.00
Total Awarded Amount to Date: $473,420.00
Funds Obligated to Date: FY 2013 = $473,420.00
History of Investigator:
  • Dmitri Loguinov (Principal Investigator)
    dmitri@cs.tamu.edu
  • Daren Cline (Co-Principal Investigator)
Recipient Sponsored Research Office: Texas A&M Engineering Experiment Station
3124 TAMU
COLLEGE STATION
TX  US  77843-3124
(979)862-6777
Sponsor Congressional District: 10
Primary Place of Performance: Texas Engineering Experiment Station
MS 3112
College Station
TX  US  77843-3112
Primary Place of Performance Congressional District: 10
Unique Entity Identifier (UEI): QD1MX6N5YTN4
Parent UEI: QD1MX6N5YTN4
NSF Program(s): CSR-Computer Systems Research
Primary Program Source: 01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923
Program Element Code(s): 735400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Many distributed applications in the current Internet are massively replicated to ensure unsurpassed data robustness and scalability; however, constant data churn (i.e., updates at the source) and delayed synchronization lead to staleness and thus lower performance in these systems. The goal of this project is to pioneer a stochastic theory of data replication that can tackle non-trivial dependency issues in the synchronization of general non-Poisson point processes; design more accurate sampling and prediction algorithms for measuring data churn; solve novel multi-source and multi-replica staleness-optimization problems; establish a new fundamental understanding of cooperative and multi-hop replication; and model the non-stationary update processes of real sources.
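
To make the notion of staleness concrete, the following minimal sketch (an illustration added here, not code from the project) simulates the baseline case of lazy replication: the source is updated by a Poisson process with rate lam, the replica re-synchronizes every R time units, and the replica is stale from the first update in a refresh cycle until the next refresh. The Monte-Carlo estimate is checked against the closed form 1 - (1 - e^(-lam*R)) / (lam*R), which holds only under the Poisson assumption; the project's theory targets the general non-Poisson case.

    import math
    import random

    def stale_fraction(update_rate, refresh_interval, cycles, rng):
        # Each refresh cycle starts with a fresh replica; under Poisson churn
        # the time to the cycle's first source update is exponential, and the
        # replica is stale from that update until the next refresh.
        stale_time = 0.0
        for _ in range(cycles):
            first_update = rng.expovariate(update_rate)
            stale_time += max(0.0, refresh_interval - first_update)
        return stale_time / (cycles * refresh_interval)

    rng = random.Random(7)
    lam, R = 1.5, 2.0
    est = stale_fraction(lam, R, 200000, rng)
    exact = 1.0 - (1.0 - math.exp(-lam * R)) / (lam * R)  # Poisson-only closed form
    print("simulated %.4f vs exact %.4f" % (est, exact))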

The now-omnipresent cloud technology has become a vast consumer and generator of data that must be stored, replicated, and streamed to a variety of clients. This project focuses on understanding the theoretical and experimental properties of data evolution and staleness in such systems; its outcomes are likely to impact Internet computing by creating insight that leads to better content-distribution mechanisms, more accurate search results, and ultimately higher satisfaction among everyday users. Furthermore, this project blends a variety of inter-disciplinary scientific areas; reaches out to students at Texas A&M to engage them in research from the early stages of their careers; trains well-rounded PhD students knowledgeable in both the theoretical and experimental aspects of large-scale networked systems; engages under-represented student groups in STEM fields; disseminates information through two new seminars at Texas A&M; and shares data models and experimental results with the public.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

X. Li, D.B.H. Cline, and D. Loguinov "On Sample-Path Staleness in Lazy Data Replication" IEEE/ACM Transactions on Networking , v.24 , 2016
X. Li, D.B.H. Cline, and D. Loguinov "Temporal Update Dynamics under Blind Sampling" IEEE/ACM Transactions on Networking , v.25 , 2017

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Many distributed applications in the current Internet (e.g., cloud computing, world-scale web search, online banking, content-distribution systems, social networks) are now massively replicated, which ensures unsurpassed data robustness and scalability. However, constant data churn (i.e., updates at the source) and delayed synchronization lead to staleness, i.e., deviation of the information being served to clients and/or processed by the system from that at the source. This project studied replication under non-Poisson update processes, general cost functions, and multi-hop operation. Before this work, these topics had remained virtually unexplored, which prevented accurate characterization of extreme-scale distributed systems, creation of alternative replication paradigms, and new theoretical insight into data churn. We addressed these issues by pioneering a stochastic theory of data replication that can tackle non-trivial dependency issues in the synchronization of general non-Poisson point processes, designing more accurate sampling and prediction algorithms for measuring data churn, solving novel multi-source and multi-replica joint staleness-optimization problems, establishing a new fundamental understanding of cooperative and multi-hop replication, and modeling the non-stationary update processes of real sources.
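
As a toy illustration of why the update distribution matters (a hypothetical sketch, not the project's code or data), the comparison below holds the mean update rate fixed and swaps exponential inter-update times for heavier-tailed Pareto ones; the fraction of time a periodically refreshed replica serves stale data changes noticeably, which is the kind of sensitivity a Poisson-only model cannot capture.

    import random

    def stale_fraction(inter_update, refresh_interval, horizon, rng):
        # Generate source update times from a renewal process, then measure how
        # long a replica that re-synchronizes every refresh_interval stays stale.
        t, updates = 0.0, []
        while True:
            t += inter_update(rng)
            if t > horizon:
                break
            updates.append(t)
        stale, idx = 0.0, 0
        cycles = int(horizon / refresh_interval)
        for k in range(cycles):
            start = k * refresh_interval
            end = start + refresh_interval
            while idx < len(updates) and updates[idx] < start:
                idx += 1  # skip updates already covered by an earlier refresh
            if idx < len(updates) and updates[idx] < end:
                stale += end - updates[idx]  # stale from first update to next refresh
        return stale / (cycles * refresh_interval)

    rng = random.Random(42)
    mean = 1.0  # identical mean inter-update time for both processes
    poisson = lambda r: r.expovariate(1.0 / mean)
    # Pareto with shape 3 and scale 2*mean/3 has the same mean but bursty updates
    pareto = lambda r: (2.0 * mean / 3.0) / (1.0 - r.random()) ** (1.0 / 3.0)
    for name, draw in (("Poisson", poisson), ("Pareto ", pareto)):
        print(name, round(stale_fraction(draw, 1.0, 200000.0, rng), 4))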

Accurate modeling of data churn has remained one of the most elusive topics in the literature, despite being of paramount importance to improving the design and scalability of existing networks, databases, cloud applications, and large-scale distributed systems. The theory component of this work formalized data evolution using random-process theory and achieved novel results on staleness, on sampling network sources through random observation, and on recovering interval-censored data with non-parametric estimation algorithms. By studying the effect of various data-update distributions and refresh techniques on staleness, this work also produced insight into how to better design future replication solutions and achieve higher resilience to failure in existing systems. The experimental part of this work measured existing Internet sources (such as Wikipedia and Yelp) to verify the assumptions and performance of the proposed models, and created a novel data-churn characterization that achieved higher fidelity in practice than prior techniques. This, coupled with our unifying modeling framework, has increased the body of practical and theoretical knowledge about caching networks, their performance, their optimality, and various avenues for achieving more scalable operation.
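
One way to see why sampling a source through random observation is subtle (a hedged sketch under a Poisson assumption, not the project's estimator): when a probe reveals only whether the source changed since the last probe, multiple updates between probes collapse into a single detection, so counting detections underestimates the true update rate. For Poisson updates the detection probability per probe is p = 1 - e^(-rate*delta), which can be inverted to recover the rate; general non-Poisson sources, which this project models, require more sophisticated estimators.

    import math
    import random

    def blind_probe(update_rate, probe_interval, horizon, rng):
        # Poisson updates at the source; each probe reveals only WHETHER the
        # source changed since the previous probe, not how many times.
        t, updates = 0.0, []
        while True:
            t += rng.expovariate(update_rate)
            if t > horizon:
                break
            updates.append(t)
        probes = int(horizon / probe_interval)
        changed, idx = 0, 0
        for i in range(1, probes + 1):
            now = i * probe_interval
            saw = False
            while idx < len(updates) and updates[idx] <= now:
                saw = True  # at least one update since the last probe
                idx += 1
            changed += saw
        return changed, probes

    rng = random.Random(3)
    rate, delta = 2.0, 1.0
    changed, probes = blind_probe(rate, delta, 100000.0, rng)
    p = changed / probes                    # fraction of probes that saw a change
    naive = changed / (probes * delta)      # treats each detection as one update
    corrected = -math.log(1.0 - p) / delta  # inverts p = 1 - exp(-rate * delta)
    print("true %.2f, naive %.3f, corrected %.3f" % (rate, naive, corrected))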

This project blended a variety of cross-disciplinary scientific areas, including random-process theory, stochastic modeling, renewal theory, databases, content retrieval, networking, distributed systems, and experimental Internet data sampling and measurement. The educational component of this project reached out to the student population at Texas A&M and engaged them in research activities to which they would not otherwise be exposed. Outcomes include attracting students to cross-disciplinary research programs, training well-rounded PhD students knowledgeable in both the theoretical and experimental aspects of large-scale networked systems, engaging under-represented student groups in STEM fields, and disseminating information through publications and presentations.


Last Modified: 11/30/2017
Modified by: Dmitri Loguinov
