Award Abstract # 1724898
CIF21 DIBBs: PD: OneDataShare: A Universal Data Sharing Building Block for Data-Intensive Applications

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: THE RESEARCH FOUNDATION FOR THE STATE UNIVERSITY OF NEW YORK
Initial Amendment Date: June 9, 2017
Latest Amendment Date: September 15, 2022
Award Number: 1724898
Award Instrument: Standard Grant
Program Manager: Alejandro Suarez
alsuarez@nsf.gov
 (703)292-7092
OAC
 Office of Advanced Cyberinfrastructure (OAC)
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2017
End Date: August 31, 2022 (Estimated)
Total Intended Award Amount: $497,773.00
Total Awarded Amount to Date: $616,469.00
Funds Obligated to Date: FY 2017 = $584,469.00
FY 2018 = $32,000.00
History of Investigator:
  • Tevfik Kosar (Principal Investigator)
    tevfikkosar@gmail.com
  • Jaroslaw Zola (Former Principal Investigator)
  • Tevfik Kosar (Former Principal Investigator)
Recipient Sponsored Research Office: SUNY at Buffalo
520 LEE ENTRANCE STE 211
AMHERST
NY  US  14228-2577
(716)645-2634
Sponsor Congressional District: 26
Primary Place of Performance: SUNY at Buffalo
338 Davis Hall
Buffalo
NY  US  14260-2500
Primary Place of Performance Congressional District: 26
Unique Entity Identifier (UEI): LMCJKRFW5R81
Parent UEI: GMZUKXFDJMA9
NSF Program(s): Data Cyberinfrastructure
Primary Program Source: 01001718DB NSF RESEARCH & RELATED ACTIVIT
01001819DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 068P, 7433, 8048, 9251, 9290
Program Element Code(s): 772600
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Applications in scientific, industrial, and personal spaces now generate more data than ever before. As data become more abundant and data resources become more heterogeneous, accessing, sharing, and disseminating data sets becomes a bigger challenge. Existing technologies for transferring and sharing data suffer from serious shortcomings, including low transfer performance, inflexibility, restricted protocol support, and poor scalability. This project develops a universal data sharing building block for data-intensive applications, dubbed OneDataShare, with three major goals: (1) optimization of end-to-end data transfers and reduction of the time to delivery of the data; (2) interoperation across heterogeneous and incompatible data resources; and (3) prediction of the data delivery time to decrease the uncertainty in real-time decision-making processes.

OneDataShare deliverables include: (1) design and implementation of novel algorithms for application-layer optimization of the data transfer protocol parameters to achieve optimal end-to-end data transfer throughput; (2) development of a universal interface specification for heterogeneous data storage endpoints and a framework for on-the-fly data transfer protocol translation; (3) instrumentation of end-to-end data transfer time prediction capability, and feeding it into real-time scheduling and decision-making processes for advanced provisioning, high-level planning, and co-scheduling of resources; (4) deployment of these capabilities as stand-alone OneDataShare cloud-hosted services to end users; and (5) integration of these capabilities with widely used data scheduling and workflow management tools, and validation in specific applications. OneDataShare services and tools are developed at the application level, and they do not require any changes to the existing infrastructure, nor to the low-level networking stack, although they increase the end-to-end performance of the data movement tasks substantially. These efficient and high-performance data transfer techniques will help the scientific community, industry, and end-users to save significant time and effort in transferring and sharing data.
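As an illustration of deliverable (2), the idea of a universal endpoint interface with on-the-fly translation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the OneDataShare specification: the names Endpoint, BlobEndpoint, BlockEndpoint, and transfer are hypothetical, and the two in-memory classes merely stand in for real storage protocols with different internal layouts.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List


class Endpoint(ABC):
    """Hypothetical common interface that every storage backend adapts to."""

    @abstractmethod
    def read(self, path: str, chunk_size: int) -> Iterator[bytes]:
        """Yield the object at `path` in chunks of at most `chunk_size` bytes."""

    @abstractmethod
    def write(self, path: str, chunks: Iterable[bytes]) -> int:
        """Store the chunks under `path`; return the number of bytes written."""


class BlobEndpoint(Endpoint):
    """Stand-in for a backend that keeps each object as one contiguous blob."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def read(self, path: str, chunk_size: int) -> Iterator[bytes]:
        data = self.blobs[path]
        for offset in range(0, len(data), chunk_size):
            yield data[offset:offset + chunk_size]

    def write(self, path: str, chunks: Iterable[bytes]) -> int:
        self.blobs[path] = b"".join(chunks)
        return len(self.blobs[path])


class BlockEndpoint(Endpoint):
    """Stand-in for a backend that keeps objects as fixed-size blocks."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self.blocks: Dict[str, List[bytes]] = {}

    def read(self, path: str, chunk_size: int) -> Iterator[bytes]:
        data = b"".join(self.blocks[path])
        for offset in range(0, len(data), chunk_size):
            yield data[offset:offset + chunk_size]

    def write(self, path: str, chunks: Iterable[bytes]) -> int:
        data = b"".join(chunks)
        # Re-slice the incoming stream into this backend's native block size.
        self.blocks[path] = [
            data[i:i + self.block_size]
            for i in range(0, len(data), self.block_size)
        ]
        return len(data)


def transfer(src: Endpoint, src_path: str, dst: Endpoint, dst_path: str,
             chunk_size: int = 8) -> int:
    """Copy between any two endpoints through the shared interface."""
    return dst.write(dst_path, src.read(src_path, chunk_size))
```

Because transfer depends only on the shared interface, supporting an additional protocol in this sketch would mean writing one adapter class rather than one translator per pair of protocols.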

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 16)
Arslan, Engin and Kosar, Tevfik "High-Speed Transfer Optimization Based on Historical Analysis and Real-Time Tuning" IEEE Transactions on Parallel and Distributed Systems, v.29, 2018 10.1109/TPDS.2018.2790948
Arslan, Engin and Pehlivan, Bahadir A. and Kosar, Tevfik "Big data transfer optimization through adaptive parameter tuning" Journal of Parallel and Distributed Computing, v.120, 2018 10.1016/j.jpdc.2018.05.003
Di Tacchio, Luigi and Nine, MD S and Kosar, Tevfik and Bulut, Muhammed Fatih and Hwang, Jinho "Cross-Layer Optimization of Big Data Transfer Throughput and Energy Consumption" 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), 2019 10.1109/CLOUD.2019.00017
Guner, Kemal and Nine, MD S and Bulut, M. Fatih and Kosar, Tevfik "FastHLA: Energy-Efficient Mobile Data Transfer Optimization Based on Historical Log Analysis" MobiWac'18 Proceedings of the 16th ACM International Symposium on Mobility Management and Wireless Access, 2018 10.1145/3265863.3265871
Imran, A and Kosar, T "Qualitative Analysis of the Relationship Between Design Smells and Software Engineering Challenges" Proceedings of ACM European Symposium on Software Engineering (ESSE 2022), 2022
Imran, A and Kosar, T "URegM: A Unified Prediction Model of Resource Consumption for Refactoring Software Smells in Open Source Cloud" Proceedings of ACM European Symposium on Software Engineering (ESSE 2022), 2022
Imran, Asif and Nine, Md S. and Guner, Kemal and Kosar, Tevfik "OneDataShare - A Vision for Cloud-hosted Data Transfer Scheduling and Optimization as a Service" Proceedings of the 8th International Conference on Cloud Computing and Services Science, v.1, 2018 10.5220/0006793506160625
Jamil, Hasibul and Rodolph, Lavone and Goldverg, Jacob and Kosar, Tevfik "Energy-Efficient Data Transfer Optimization via Decision-Tree Based Uncertainty Reduction" 2022 International Conference on Computer Communications and Networks (ICCCN), 2022 https://doi.org/10.1109/ICCCN54977.2022.9868866
Kosar, Tevfik and Alan, Ismail and Bulut, M. Fatih "Energy-aware data throughput optimization for next generation internet" Information Sciences, v.476, 2019 10.1016/j.ins.2018.09.065
Nine, Md S. and Guner, Kemal and Huang, Ziyun and Wang, Xiangyu and Xu, Jinhui and Kosar, Tevfik "Big data transfer optimization based on offline knowledge discovery and adaptive sampling" 2017 IEEE International Conference on Big Data, 2017 10.1109/BigData.2017.8257959

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Intellectual Merit:

This project resulted in a proof-of-concept prototype implementation of a universal data access and sharing building block for data-intensive applications, called OneDataShare. It provides optimization of end-to-end data transfer performance with novel application-layer models and algorithms, interoperability between heterogeneous data resources and endpoints, and the ability to accurately predict the data delivery time to decrease the uncertainty in real-time decision-making processes. The project resulted in over 15 peer-reviewed publications in major venues. Some of the most impactful innovative components of OneDataShare include:

(1) A novel dynamic parameter tuning algorithm based on historical data analysis and real-time background traffic probing, called HARP. Most of the previous work in this area is based solely on real-time network probing or static parameter tuning, and therefore either incurs excessive sampling overhead or fails to accurately predict the optimal transfer parameters. Combining historical data analysis with real-time sampling lets HARP tune the application-layer data transfer parameters accurately and efficiently to achieve close-to-optimal end-to-end data transfer throughput with very low overhead. Instead of one-time parameter estimation, HARP uses a feedback loop to adjust the parameter values to changing network conditions in real time. The experimental analyses over various network settings show that HARP outperforms existing solutions by up to 50% in the achieved data transfer throughput. This work was published in IEEE Transactions on Parallel and Distributed Systems (TPDS) in 2018.

(2) Three novel algorithms for application-layer parameter tuning and transfer scheduling to maximize transfer throughput in wide-area networks. These algorithms use heuristic methods to tune the level of control channel pipelining (for small file optimization), the number of parallel data streams per file (for large file optimization), and the number of concurrent file transfers to increase I/O throughput (a technique useful for all types of files). The proposed algorithms improve the transfer throughput by up to 10x compared to the baseline and 7x compared to the state-of-the-art solutions. These algorithms were published in the Journal of Parallel and Distributed Computing (JPDC) in 2018.

(3) A novel two-phase application-layer throughput optimization model for big data transfers based on offline knowledge discovery and adaptive real-time tuning to ensure continuous performance guarantees and fairness among the contending transfers in the shared network. In the offline analysis phase of this model, historical transfer logs are mined to perform knowledge discovery about the transfer characteristics. The online phase uses the discovered knowledge from the offline analysis and real-time investigation of the network condition to optimize the protocol parameters. This novel approach was tested over different networks with different datasets and outperformed its closest competitor by 1.7x and the default case by 5x. It also achieved up to 93% accuracy relative to the optimal achievable throughput on those networks. This work was initially presented at the IEEE BigData 2017 conference, and an extended version was published in the IEEE Transactions on Parallel and Distributed Systems (TPDS) in 2020.

(4) A novel solution for efficient and scalable metadata access for distributed applications across wide-area networks. This novel solution, dubbed SMURF, combines novel pipelining and concurrent transfer mechanisms with reliability, provides distributed continuum caching and semantic locality-aware prefetching strategies to sidestep fetching latency, and achieves scalable and high-performance metadata fetch/prefetch services in the Cloud. It incorporates the phenomenon of semantic locality awareness for increased prefetch prediction rate using real-life application I/O traces from Yahoo! Hadoop audit logs and proposes a novel prefetch predictor. By effectively caching and prefetching metadata based on the access patterns, this novel continuum caching and prefetching mechanism significantly improves the local cache hit rate and reduces the average fetching latency. During approximately 20 million metadata access operations from real audit traces, SMURF achieved 90% accuracy for prefetch prediction and reduced the average fetch latency by 50% compared to the state-of-the-art mechanisms. This work was published in IEEE Transactions on Parallel and Distributed Systems (TPDS) in 2022.
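The general pattern described in items (1) through (3), seeding the transfer parameters from historical logs and then refining them with a real-time feedback loop, can be sketched as follows. This is a minimal illustration using plain hill climbing, not the published HARP algorithm or the two-phase model: the function names, the (concurrency, parallelism, pipelining) tuple layout, and the synthetic throughput function are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical parameter tuple: (concurrency, parallelism, pipelining).
Params = Tuple[int, int, int]


def seed_from_history(history: List[Tuple[Params, float]]) -> Params:
    """Offline step: start from the best-performing setting in past logs."""
    return max(history, key=lambda record: record[1])[0]


def tune(measure: Callable[[Params], float], start: Params,
         rounds: int = 20, max_value: int = 32) -> Tuple[Params, float]:
    """Online step: nudge one parameter at a time and keep what helps."""
    best = start
    best_throughput = measure(best)
    for _ in range(rounds):
        improved = False
        for index in range(3):
            for step in (1, -1):
                values = list(best)
                values[index] = min(max_value, max(1, values[index] + step))
                candidate = (values[0], values[1], values[2])
                if candidate == best:
                    continue
                throughput = measure(candidate)  # one probe of the live transfer
                if throughput > best_throughput:
                    best, best_throughput = candidate, throughput
                    improved = True
        if not improved:
            break  # no neighboring setting is better; stop probing
    return best, best_throughput


def synthetic_throughput(params: Params) -> float:
    """Made-up, noise-free stand-in for a measurement; peaks at (8, 4, 16)."""
    c, p, q = params
    return 1000.0 - ((c - 8) ** 2 + (p - 4) ** 2 + (q - 16) ** 2)


if __name__ == "__main__":
    logs = [((2, 2, 2), 100.0), ((6, 3, 12), 400.0)]
    start = seed_from_history(logs)
    print(tune(synthetic_throughput, start))
```

Starting from the historical seed rather than from default values means fewer live probes are needed before the loop settles, which is the overhead argument made in item (1). Real measurements are noisy, so a practical tuner would need to average or smooth samples before comparing them.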

Broader Impacts:

This project resulted in the training and professional development of 31 graduate students (7 Ph.D. and 24 M.S.) and 11 undergraduate students through active participation in the research and development tasks of the project. In a half-day workshop, the basics of cloud computing, data-intensive computing, and the state-of-the-art OneDataShare data-sharing technologies were introduced to a group of high school students (mostly low-income inner-city kids) interested in STEM education. OneDataShare's novel data transfer optimization, interoperability, and prediction services will drastically increase the end-to-end performance of data transfers and data-intensive applications in different domains which depend on data movement. The cloud-hosted OneDataShare data transfer scheduling and optimization service has the potential to become a key component of the national data access and sharing infrastructure.


Last Modified: 01/04/2023
Modified by: Tevfik Kosar

Please report errors in award information by writing to: awardsearch@nsf.gov.
