
NSF Org: |
CCF Division of Computing and Communication Foundations |
Recipient: |
|
Initial Amendment Date: | August 14, 2015 |
Latest Amendment Date: | August 14, 2015 |
Award Number: | 1533802 |
Award Instrument: | Standard Grant |
Program Manager: |
Marilyn McClure
mmcclure@nsf.gov (703)292-5197 CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2015 |
End Date: | August 31, 2021 (Estimated) |
Total Intended Award Amount: | $350,000.00 |
Total Awarded Amount to Date: | $350,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
426 AUDITORIUM RD RM 2 EAST LANSING MI US 48824-2600 (517)355-5040 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
428 S. Shaw Lane, Rm 3115 East Lansing MI US 48824-1226 |
Primary Place of
Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Exploiting Parallel&Scalabilty |
Primary Program Source: |
|
Program Reference Code(s): | |
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Auditability is a key property for developing highly scalable and highly available distributed systems, as it enables identifying performance bottlenecks, dependencies among events, and latent concurrency bugs. In turn, for the auditability of a system, time is a key concept. However, there is a gap between the theory and the practice of distributed systems in terms of the use of time. The theory of distributed systems shuns the notion of time and considers asynchronous systems whose event ordering is captured by logical clocks. The practical distributed systems employ NTP synchronized clocks to capture time but tend to use ad hoc methods. This project bridges this gap and provides synchrony-aware system primitives that support building highly auditable, highly scalable, and highly available distributed systems. The project has applications to cloud computing, distributed NewSQL databases, and globally distributed web services. The project enables other broader impacts through enhancing scientific/technological understanding via organizing academic workshops, outreaching to K-12 students, recruitment of minority groups, and distributing tools and software to the community.
To enable highly auditable systems, the project investigates lightweight and efficient designs for an augmented time (AT) primitive. AT combines the theoretical underpinnings of causality and the practicality of physical clocks by identifying how logical/vector clocks can be improved and tuned based on the availability of NTP synchronization. The principle guiding AT design will be "uncertainty resilience". These AT clocks enable highly auditable systems since they can efficiently provide global consistent-state snapshots without needing to wait out clock synchronization uncertainties and without requiring prior coordination. Leveraging on these auditability primitives, the project builds support for scalable and available systems. To enable highly scalable systems, the project investigates design of synchrony-aware coordination primitives, such as barrier synchronization, mutual exclusion, leader election, and causally and totally ordered communication support. The principle guiding the design of the synchrony-aware coordination primitives is "silent consent". These primitives improve performance and efficiency over their asynchronous system counterparts by trading timing information gathered from AT clocks and avoiding explicit communication needed for coordination. Finally, the project will leverage the auditability support provided by AT and will investigate the design of a monitor component that detects and corrects distributed system state corruptions. The principle guiding the design of the monitor component is "centralized oversight and override".
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note:
When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external
site maintained by the publisher. Some full text articles may not yet be available without a
charge during the embargo (administrative interval).
Some links on this page may take you to non-federal websites. Their policies may differ from
this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Our project's major goal is to achieve efficient auditability of distributed systems. Auditability is important because it concerns with identifying performance bottlenecks, dependencies among events, and latent concurrency bugs in distributed systems. Leveraging the auditability primitives and services, our project will then build scalability and availability primitives and services for distributed systems.
We investigated the design space for the hybrid logical clocks. In contrast to previous literature on predicate detection with vector clocks which have space complexity O(n) and are prone to false positives in partially synchronous distributed systems, we developed efficient algorithms for detection of weak conjunctive predicates with the help of hybrid logical clocks.
We investigated the design space for the hybrid vector clocks. In our OPODIS 2015 paper, we showed that the hybrid vector clock size is a sigmoid function with respect to increasing epsilon; it has a slow start but it grows exponentially after a phase transition. We derived formulas to identify the phase transition point and showed that for many practical applications and deployment environments, the size of HVC remains only as a couple entries and substantially less than n, the number of processes. In our Runtime Verification 16 paper we analyzed the effects of the impedance mismatch between the monitor and the underlying program for the detection of distributed system predicates, and found that there is a small interval where the monitor assumptions are hypersensitive to the underlying program environment. We provided analytical derivations for this interval, and also provide experimental support for exploring the sensitivity of predicate detection to the impedance mismatch between the monitor and the program under a partially synchronous system.
We designed a scalable CATOCS leveraging hybrid clocks. To showcase this primitive, we designed and implemented an efficient WAN-scale causal replicated key-value store, called CausalSpartan. In this work, we focused on improving causal consistency in data store systems with the help of HLC. In this work, we rely on (partially) synchronized physical clocks. Specifically, using ideas such as those in GentleRain, we can maintain a stable time such that if stable time at a node is x then all updates upto time x have been received. In turn, if we do not make updates with time greater than x visible then we can ensure causal consistency. One problem with this approach, however, is that if a write needs to be performed at a node where the physical clock is lagging then it can introduce substantial delays. Moreover, this issue is made worse when a single query (such as update on a social media page) results in multiple reads and write on the data store. We demonstrated that these delays can be eliminated with HLC since HLC is a logical clock that allows itself to be updated (and moved ahead) to permit such writes immediately. With our approach, we showed that CausalSpartan can provide a significant benefit with no cost. Specifically, if clocks are perfectly synchronized, there is no overhead in CausalSpartan. And, CausalSpartan is immune to clock drift. Hence, although GentleRain causes PUT latency to become threefold with just 10ms clock drift, CausalSpartan remains unaffected. Moreover, when a single query results in multiple read/writes on data store, the query response time in CausalSpartan is substantially lower. For example, with a clock drift of 10ms and query amplification factor of 100, query response time of CausalSpartan was only 25% of that of GentleRain.
In that work, we implemented, demonstrated, and evaluated our hybrid logical clocks based snapshot service, called Retroscope, on VoldemortDB, HazelCast, and ZooKeeper. We architect this tool and built a horizontally scalable monitoring tool to handle datacenterlevel monitoring and querying. We also extended Retroscope with an SQL like frontend language to query the logs collected for distributed consistent snapshots that satisfy given the given predicates. This work is accepted in IEEE Transactions on Parallel and Distributed Systems.
We also worked on identifying effectiveness of cvf-tolerance (consistency violation faults tolerance) for various distributed programs. We found that many programs are highly tolerant to cvfs (consistency violation faults). This implies that executing them under weakened concurrency is significantly likely to benefit in their performance.
We also focused on developing monitors for distributed systems to identify concurrency bugs. We developed a two-layer monitor that improves the efficiency by over 90%. In this monitor, the first layer is efficient but inaccurate, whereas the second layer is accurate but relatively inefficient. By filtering provided by the first layer, we can reduce the cost of verification substantially.
The work has resulted in training of graduate students Duong Nguyen, Vidhya Tekken Valapil, Mohammad Roohitavaf, Gabe Appleton, Sorrachai Yingchareonthawornchai. It has resulted in several conference and journal publications listed above. Also, the research from this work was incorporated in the classroom at graduate and undergraduate students.
Last Modified: 11/17/2021
Modified by: Sandeep S Kulkarni
Please report errors in award information by writing to: awardsearch@nsf.gov.