
NSF Org: | CNS Division Of Computer and Network Systems |
Initial Amendment Date: | August 26, 2013 |
Latest Amendment Date: | August 26, 2013 |
Award Number: | 1319984 |
Award Instrument: | Standard Grant |
Program Manager: | Marilyn McClure, mmcclure@nsf.gov, (703) 292-5197, CNS Division Of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2013 |
End Date: | August 31, 2017 (Estimated) |
Total Intended Award Amount: | $473,420.00 |
Total Awarded Amount to Date: | $473,420.00 |
Recipient Sponsored Research Office: | 3124 TAMU, College Station, TX, US 77843-3124, (979) 862-6777 |
Primary Place of Performance: | MS 3112, College Station, TX, US 77843-3112 |
NSF Program(s): | CSR-Computer Systems Research |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Many distributed applications in the current Internet are massively replicated to ensure unsurpassed data robustness and scalability; however, constant data churn (i.e., update of the source) and delayed synchronization lead to staleness and thus lower performance in these systems. The goal of this project is to pioneer a stochastic theory of data replication that can tackle non-trivial dependency issues in synchronization of general non-Poisson point processes, design more accurate sampling and prediction algorithms for measuring data churn, solve novel multi-source and multi-replica staleness-optimization problems, establish new fundamental understanding of cooperative and multi-hop replication, and model non-stationary update processes of real sources.
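To make staleness and the refresh trade-off concrete, here is a minimal Monte Carlo sketch in Python (purely illustrative; the Pareto update model, parameter values, and function names are our own assumptions rather than the project's algorithms). It simulates a source whose updates form a non-Poisson renewal process and a replica that synchronizes every refresh_T time units, and it estimates how often the replica is found stale at the moment it refreshes.

# Illustrative Monte Carlo sketch (not the project's actual models or code):
# a source is updated according to a renewal process with Pareto inter-update
# gaps (a simple non-Poisson point process), and a replica synchronizes with
# the source every refresh_T time units. We estimate the fraction of refresh
# epochs at which the replica is found stale, i.e., the source changed at
# least once since the previous synchronization.
import random

def pareto_gap(alpha, xm):
    """Draw one Pareto(alpha, xm) inter-update gap via inverse-transform sampling."""
    return xm / random.random() ** (1.0 / alpha)

def staleness_at_refresh(refresh_T, alpha=1.5, xm=1.0, horizon=1e5, seed=1):
    """Fraction of refresh epochs at which the replica is already stale."""
    random.seed(seed)
    # generate source update times over [0, horizon]
    t, updates = 0.0, []
    while t < horizon:
        t += pareto_gap(alpha, xm)
        updates.append(t)
    stale, total, idx = 0, 0, 0
    refresh_time = refresh_T
    while refresh_time < horizon:
        # the replica was last synchronized at refresh_time - refresh_T;
        # it is stale now iff some update landed inside that window
        while idx < len(updates) and updates[idx] < refresh_time - refresh_T:
            idx += 1
        if idx < len(updates) and updates[idx] < refresh_time:
            stale += 1
        total += 1
        refresh_time += refresh_T
    return stale / total

if __name__ == "__main__":
    for T in (0.5, 1.0, 2.0, 4.0):
        print(f"refresh interval {T}: P(stale at refresh) ~ {staleness_at_refresh(T):.3f}")

Shrinking the refresh interval lowers staleness but increases synchronization overhead, which is exactly the tension that the staleness-optimization problems described above formalize.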
The now omnipresent cloud technology has become a vast consumer and generator of data that must be stored, replicated, and streamed to a variety of clients. This project focuses on understanding theoretical and experimental properties of data evolution and staleness in such systems, whose outcomes are likely to impact Internet computing through creation of insight that leads to better content-distribution mechanisms, more accurate search results, and ultimately higher satisfaction among everyday users. Furthermore, this project blends a variety of inter-disciplinary scientific areas, reaches out to the student population at Texas A&M to engage them in research activities from early stages of their careers, trains well-rounded PhD students knowledgeable in both theoretical and experimental aspects of large-scale networked systems, engages under-represented student groups in STEM fields, disseminates information through two new seminars at Texas A&M, and shares data models and experimental results with the public.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Many distributed applications in the current Internet (e.g., cloud computing, world-scale web search, online banking, content-distribution systems, social networks) are now massively replicated, which ensures unsurpassed data robustness and scalability. However, constant data churn (i.e., updates at the source) and delayed synchronization lead to staleness, which refers to the deviation of the information being served to clients and/or processed by the system from that at the source. This project studied replication under non-Poisson update processes, general cost functions, and multi-hop operation. Until now, these topics had remained virtually unexplored, which prevented accurate characterization of extreme-scale distributed systems, creation of alternative replication paradigms, and new theoretical insight into data churn. In this work, we addressed these issues by pioneering a stochastic theory of data replication that can tackle non-trivial dependency issues in the synchronization of general non-Poisson point processes, designing more accurate sampling and prediction algorithms for measuring data churn, solving novel multi-source and multi-replica joint staleness-optimization problems, establishing new fundamental understanding of cooperative and multi-hop replication, and modeling non-stationary update processes of real sources.
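For intuition about the classical baseline that this work generalizes (a standard worked example, not a result taken from the project's reports), suppose source updates form a Poisson process with rate \lambda and the replica synchronizes every T time units. At a randomly timed request, the age since the last synchronization is uniform on [0, T], so the probability of serving fresh content is

\[
P(\text{fresh}) = \frac{1}{T}\int_0^T e^{-\lambda a}\,da = \frac{1 - e^{-\lambda T}}{\lambda T},
\qquad
P(\text{stale}) = 1 - \frac{1 - e^{-\lambda T}}{\lambda T}.
\]

For non-Poisson updates the memoryless property no longer holds, the ages of past updates matter, and obtaining analogous results requires the kind of renewal-theoretic treatment developed in this project.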
Accurate modeling of data churn has remained one of the most elusive topics in the literature, despite being of paramount importance to improving the design and scalability of existing networks, databases, cloud applications, and large-scale distributed systems. The theory component of this work formalized data evolution using random process theory and achieved novel results in the understanding of staleness, our ability to sample network sources through random observation, and the recovery of interval-censored data using non-parametric estimation algorithms. By studying the effect of various data-update distributions and refresh techniques on staleness, this work also produced insight into how to better design future replication solutions and achieve higher resilience to failure in existing systems. The experimental part of this work measured existing Internet sources (such as Wikipedia and Yelp) to verify the assumptions and performance of the proposed models and created a novel data-churn characterization that achieved higher fidelity in practice than prior techniques. This, coupled with our unifying modeling framework, has increased the body of practical and theoretical knowledge about caching networks, their performance and optimality, and various avenues for achieving more scalable operation.
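As a concrete (and deliberately simplified) illustration of comparison-based sampling, the sketch below uses a textbook baseline rather than this project's improved estimators; all function names and parameters are hypothetical. A crawler that probes a source every delta time units only learns whether the page changed since the previous visit, i.e., interval-censored information; under a Poisson-update assumption, the observed change fraction can be inverted to recover the update rate.

# Minimal sketch (a classical baseline, not this project's algorithms): estimating
# a source's update rate from comparison-based probing.
import math
import random

def simulate_probes(lam, delta, n_probes, seed=2):
    """Return 0/1 flags: did the source change between consecutive probes?"""
    random.seed(seed)
    flags = []
    for _ in range(n_probes):
        # for a Poisson(lam) source, the number of updates in a window of length
        # delta is Poisson(lam * delta); the probe only reveals whether it is nonzero
        changed = random.random() < 1.0 - math.exp(-lam * delta)
        flags.append(1 if changed else 0)
    return flags

def estimate_rate(flags, delta):
    """Invert P(change) = 1 - exp(-lambda * delta) using the observed change fraction."""
    p_hat = sum(flags) / len(flags)
    if p_hat >= 1.0:  # every probe saw a change: the rate is not identifiable
        return float("inf")
    return -math.log(1.0 - p_hat) / delta

if __name__ == "__main__":
    true_rate, delta = 0.3, 2.0
    flags = simulate_probes(true_rate, delta, n_probes=20000)
    print(f"true update rate {true_rate}, estimated {estimate_rate(flags, delta):.3f}")

Baselines of this kind are known to lose accuracy when updates are bursty or non-Poisson, which is precisely what motivates the more accurate sampling and prediction algorithms and the measurements of real sources described above.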
This project blended a variety of cross-disciplinary scientific areas, including random process theory, stochastic modeling, renewal theory, databases, content retrieval, networking, distributed systems, and experimental Internet data sampling and measurement. The educational component of this project reached out to students at Texas A&M and engaged them in research activities to which they would not otherwise be exposed. Outcomes include the attraction of students to cross-disciplinary research programs, the training of well-rounded PhD students knowledgeable in both theoretical and experimental aspects of large-scale networked systems, the engagement of under-represented student groups in STEM fields, and information dissemination through publications and presentations.
Last Modified: 11/30/2017
Modified by: Dmitri Loguinov