NSF Award Search: Award # 1533802

Award Abstract # 1533802

XPS: FULL: FP: Collaborative Research: Synchrony-aware Primitives for Building Highly Auditable, Highly Scalable, Highly Available Distributed Systems

NSF Org:	CCF Division of Computing and Communication Foundations
Recipient:	MICHIGAN STATE UNIVERSITY
Initial Amendment Date:	August 14, 2015
Latest Amendment Date:	August 14, 2015
Award Number:	1533802
Award Instrument:	Standard Grant
Program Manager:	Marilyn McClure mmcclure@nsf.gov (703)292-5197 CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering
Start Date:	September 1, 2015
End Date:	August 31, 2021 (Estimated)
Total Intended Award Amount:	$350,000.00
Total Awarded Amount to Date:	$350,000.00
Funds Obligated to Date:	FY 2015 = $350,000.00
History of Investigator:	Sandeep Kulkarni (Principal Investigator) sandeep@cse.msu.edu
Recipient Sponsored Research Office:	Michigan State University 426 AUDITORIUM RD RM 2 EAST LANSING MI US 48824-2600 (517)355-5040
Sponsor Congressional District:	07
Primary Place of Performance:	Michigan State University 428 S. Shaw Lane, Rm 3115 East Lansing MI US 48824-1226
Primary Place of Performance Congressional District:	07
Unique Entity Identifier (UEI):	R28EKN92ZTZ9
Parent UEI:	VJKZC4D1JN36
NSF Program(s):	Exploiting Parallel&Scalabilty
Primary Program Source:	01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):
Program Element Code(s):	828300
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Auditability is a key property for developing highly scalable and highly available distributed systems, as it enables identifying performance bottlenecks, dependencies among events, and latent concurrency bugs. In turn, for the auditability of a system, time is a key concept. However, there is a gap between the theory and the practice of distributed systems in terms of the use of time. The theory of distributed systems shuns the notion of time and considers asynchronous systems whose event ordering is captured by logical clocks. The practical distributed systems employ NTP synchronized clocks to capture time but tend to use ad hoc methods. This project bridges this gap and provides synchrony-aware system primitives that support building highly auditable, highly scalable, and highly available distributed systems. The project has applications to cloud computing, distributed NewSQL databases, and globally distributed web services. The project enables other broader impacts through enhancing scientific/technological understanding via organizing academic workshops, outreaching to K-12 students, recruitment of minority groups, and distributing tools and software to the community.

To enable highly auditable systems, the project investigates lightweight and efficient designs for an augmented time (AT) primitive. AT combines the theoretical underpinnings of causality and the practicality of physical clocks by identifying how logical/vector clocks can be improved and tuned based on the availability of NTP synchronization. The principle guiding AT design will be "uncertainty resilience". These AT clocks enable highly auditable systems since they can efficiently provide global consistent-state snapshots without needing to wait out clock synchronization uncertainties and without requiring prior coordination. Leveraging on these auditability primitives, the project builds support for scalable and available systems. To enable highly scalable systems, the project investigates design of synchrony-aware coordination primitives, such as barrier synchronization, mutual exclusion, leader election, and causally and totally ordered communication support. The principle guiding the design of the synchrony-aware coordination primitives is "silent consent". These primitives improve performance and efficiency over their asynchronous system counterparts by trading timing information gathered from AT clocks and avoiding explicit communication needed for coordination. Finally, the project will leverage the auditability support provided by AT and will investigate the design of a monitor component that detects and corrects distributed system state corruptions. The principle guiding the design of the monitor component is "centralized oversight and override".

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

(Showing: 1 - 10 of 24)

Show All

Mohammad Roohitavaf, Jung-Sang Ahn, Woon-Hak Kang, Kun Ren, Gene Zhang, Sami Ben-Romdhane, Sandeep S. Kulkarni "Session guarantees with raft and hybrid logical clocks." ICDCN 2019 , 2019

Aleksey Charapko and Ailidani Ailijiang and Murat Demirbas and Sandeep S. Kulkarni "Retroscope: Retrospective Monitoring of Distributed Systems" {IEEE} Trans. Parallel Distrib. Syst. , v.30 , 2019 , p.2582--259 10.1109/TPDS.2019.2911944

Arya T. Gupta and Sandeep S. Kulkarni "Extending Lattice linearity for Self-Stabilizing Algorithms" CoRR , v.abs/210 , 2021

D Nguyen, A Charapko, S Kulkarni, M Demirbas "Optimistic Execution in KeyValueStore" LADC , 2018

Duong N. Nguyen and Aleksey Charapko and Sandeep S. Kulkarni and Murat Demirbas "Using weaker consistency models with monitoring and recovery for improving performance of key-value stores" J. Braz. Comput. Soc. , v.25 , 2019 , p.10:1--10: 10.1186/s13173-019-0091-9

Duong N. Nguyen and Sandeep S. Kulkarni "Technical Report: Benefits of Stabilization versus Rollback in Self-Stabilizing Graph-Based Applications on Eventually Consistent Key-Value Stores" CoRR , v.abs/200 , 2020

Duong N. Nguyen and Sorrachai Yingchareonthawornchai and Vidhya Tekken Valapil and Sandeep S. Kulkarni and Murat Demirbas "Precision, recall, and sensitivity of monitoring partially synchronous distributed programs" Distributed Comput. , v.34 , 2021 , p.319--348 10.1007/s00446-021-00402-w

Duong N. Nguyen, Sandeep S. Kulkarni, Ajoy K. Datta: "Benefit of self-stabilizing protocols in eventually consistent key-value stores: a case study." ICDCN 2019 , 2019

M. Demirbas, S. Kulkarni. "Highly auditable distributed systems" HotCloud , 2015

Mohammad Roohitavaf, Murat Demirbas, Sandeep Kulkarni "CausalSpartan: Causal Consistency for Distributed Data Stores using HybridLogical Clocks" The 36th IEEE International Symposium on Reliable Distributed Systems (SRDS) , 2017

Mohammad Roohitavaf, Sandeep S. Kulkarni: "Automatic Addition of Fault-Tolerance in Presence of Unchangeable Environment Actions" Future Internet , 2019

(Showing: 1 - 10 of 24)

Show All

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Our project's major goal is to achieve efficient auditability of distributed systems. Auditability is important because it concerns with identifying performance bottlenecks, dependencies among events, and latent concurrency bugs in distributed systems. Leveraging the auditability primitives and services, our project will then build scalability and availability primitives and services for distributed systems.

We investigated the design space for the hybrid logical clocks. In contrast to previous literature on predicate detection with vector clocks which have space complexity O(n) and are prone to false positives in partially synchronous distributed systems, we developed efficient algorithms for detection of weak conjunctive predicates with the help of hybrid logical clocks.

We investigated the design space for the hybrid vector clocks. In our OPODIS 2015 paper, we showed that the hybrid vector clock size is a sigmoid function with respect to increasing epsilon; it has a slow start but it grows exponentially after a phase transition. We derived formulas to identify the phase transition point and showed that for many practical applications and deployment environments, the size of HVC remains only as a couple entries and substantially less than n, the number of processes. In our Runtime Verification 16 paper we analyzed the effects of the impedance mismatch between the monitor and the underlying program for the detection of distributed system predicates, and found that there is a small interval where the monitor assumptions are hypersensitive to the underlying program environment. We provided analytical derivations for this interval, and also provide experimental support for exploring the sensitivity of predicate detection to the impedance mismatch between the monitor and the program under a partially synchronous system.

We designed a scalable CATOCS leveraging hybrid clocks. To showcase this primitive, we designed and implemented an efficient WAN-scale causal replicated key-value store, called CausalSpartan. In this work, we focused on improving causal consistency in data store systems with the help of HLC. In this work, we rely on (partially) synchronized physical clocks. Specifically, using ideas such as those in GentleRain, we can maintain a stable time such that if stable time at a node is x then all updates upto time x have been received. In turn, if we do not make updates with time greater than x visible then we can ensure causal consistency. One problem with this approach, however, is that if a write needs to be performed at a node where the physical clock is lagging then it can introduce substantial delays. Moreover, this issue is made worse when a single query (such as update on a social media page) results in multiple reads and write on the data store. We demonstrated that these delays can be eliminated with HLC since HLC is a logical clock that allows itself to be updated (and moved ahead) to permit such writes immediately. With our approach, we showed that CausalSpartan can provide a significant benefit with no cost. Specifically, if clocks are perfectly synchronized, there is no overhead in CausalSpartan. And, CausalSpartan is immune to clock drift. Hence, although GentleRain causes PUT latency to become threefold with just 10ms clock drift, CausalSpartan remains unaffected. Moreover, when a single query results in multiple read/writes on data store, the query response time in CausalSpartan is substantially lower. For example, with a clock drift of 10ms and query amplification factor of 100, query response time of CausalSpartan was only 25% of that of GentleRain.

In that work, we implemented, demonstrated, and evaluated our hybrid logical clocks based snapshot service, called Retroscope, on VoldemortDB, HazelCast, and ZooKeeper. We architect this tool and built a horizontally scalable monitoring tool to handle datacenterlevel monitoring and querying. We also extended Retroscope with an SQL like frontend language to query the logs collected for distributed consistent snapshots that satisfy given the given predicates. This work is accepted in IEEE Transactions on Parallel and Distributed Systems.

We also worked on identifying effectiveness of cvf-tolerance (consistency violation faults tolerance) for various distributed programs. We found that many programs are highly tolerant to cvfs (consistency violation faults). This implies that executing them under weakened concurrency is significantly likely to benefit in their performance.

We also focused on developing monitors for distributed systems to identify concurrency bugs. We developed a two-layer monitor that improves the efficiency by over 90%. In this monitor, the first layer is efficient but inaccurate, whereas the second layer is accurate but relatively inefficient. By filtering provided by the first layer, we can reduce the cost of verification substantially.

The work has resulted in training of graduate students Duong Nguyen, Vidhya Tekken Valapil, Mohammad Roohitavaf, Gabe Appleton, Sorrachai Yingchareonthawornchai. It has resulted in several conference and journal publications listed above. Also, the research from this work was incorporated in the classroom at graduate and undergraduate students.

Last Modified: 11/17/2021
Modified by: Sandeep S Kulkarni

Please report errors in award information by writing to: awardsearch@nsf.gov.

Success

Error