
NSF Org: |
OAC Office of Advanced Cyberinfrastructure (OAC) |
Recipient: |
|
Initial Amendment Date: | June 9, 2017 |
Latest Amendment Date: | September 15, 2022 |
Award Number: | 1724898 |
Award Instrument: | Standard Grant |
Program Manager: |
Alejandro Suarez
alsuarez@nsf.gov (703)292-7092 OAC Office of Advanced Cyberinfrastructure (OAC) CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2017 |
End Date: | August 31, 2022 (Estimated) |
Total Intended Award Amount: | $497,773.00 |
Total Awarded Amount to Date: | $616,469.00 |
Funds Obligated to Date: |
FY 2018 = $32,000.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
520 LEE ENTRANCE STE 211 AMHERST NY US 14228-2577 (716)645-2634 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
338 Davis Hall Buffalo NY US 14260-2500 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Data Cyberinfrastructure |
Primary Program Source: |
01001819DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Applications in scientific, industrial, and personal spaces now generate more data than ever before. As data become more abundant and data resources become more heterogeneous, accessing, sharing, and disseminating data sets becomes a greater challenge. Existing technologies for transferring and sharing data suffer from serious shortcomings, including low transfer performance, inflexibility, restricted protocol support, and poor scalability. This project develops a universal data sharing building block for data-intensive applications, dubbed OneDataShare, with three major goals: (1) optimizing end-to-end data transfers and reducing the time to delivery of the data; (2) interoperating across heterogeneous and incompatible data resources; and (3) predicting the data delivery time and decreasing the uncertainty in real-time decision-making processes.
OneDataShare deliverables include: (1) design and implementation of novel algorithms for application-layer optimization of the data transfer protocol parameters to achieve optimal end-to-end data transfer throughput; (2) development of a universal interface specification for heterogeneous data storage endpoints and a framework for on-the-fly data transfer protocol translation; (3) instrumentation of end-to-end data transfer time prediction capability, and feeding it into real-time scheduling and decision-making processes for advanced provisioning, high-level planning, and co-scheduling of resources; (4) deployment of these capabilities as stand-alone OneDataShare cloud-hosted services to end users; and (5) integration of these capabilities with widely used data scheduling and workflow management tools, and validation in specific applications. OneDataShare services and tools are developed at the application level. They require no changes to the existing infrastructure or to the low-level networking stack, yet they substantially increase the end-to-end performance of data movement tasks. These efficient and high-performance data transfer techniques will help the scientific community, industry, and end users save significant time and effort in transferring and sharing data.
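As an illustration of deliverable (2), the following minimal sketch shows one way a common endpoint interface could let a single transfer routine move data between otherwise incompatible backends. It is not the OneDataShare interface specification: the names `Endpoint`, `InMemoryEndpoint`, and `transfer` are hypothetical, and the dict-backed endpoint merely stands in for real protocols.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator


class Endpoint(ABC):
    """Hypothetical common interface that each storage backend would implement."""

    @abstractmethod
    def read(self, path: str, chunk_size: int) -> Iterator[bytes]:
        """Yield the file at `path` as a stream of byte chunks."""

    @abstractmethod
    def write(self, path: str, chunks: Iterable[bytes]) -> int:
        """Store the chunk stream at `path` and return the byte count."""


class InMemoryEndpoint(Endpoint):
    """Stand-in for a real backend; files live in a plain dict."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def read(self, path: str, chunk_size: int) -> Iterator[bytes]:
        data = self.files[path]
        for offset in range(0, len(data), chunk_size):
            yield data[offset:offset + chunk_size]

    def write(self, path: str, chunks: Iterable[bytes]) -> int:
        data = b"".join(chunks)
        self.files[path] = data
        return len(data)


def transfer(src: Endpoint, src_path: str,
             dst: Endpoint, dst_path: str, chunk_size: int = 4) -> int:
    """Stream a file between any two endpoints through the shared interface."""
    return dst.write(dst_path, src.read(src_path, chunk_size))


source, target = InMemoryEndpoint(), InMemoryEndpoint()
source.files["a.bin"] = b"0123456789"
copied = transfer(source, "a.bin", target, "b.bin")
```

Because `transfer` depends only on the abstract interface, supporting a further protocol in this sketch would mean adding one more `Endpoint` subclass rather than a converter for every pair of protocols.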
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Intellectual Merit:
This project resulted in a proof-of-concept prototype implementation of a universal data access and sharing building block for data-intensive applications, called OneDataShare. It provides optimization of end-to-end data transfer performance with novel application-layer models and algorithms, interoperability between heterogeneous data resources and endpoints, and the ability to accurately predict the data delivery time to decrease the uncertainty in real-time decision-making processes. The project resulted in over 15 peer-reviewed publications in major venues. Some of the most impactful innovative components of OneDataShare include:
(1) A novel dynamic parameter tuning algorithm based on historical data analysis and real-time background traffic probing, called HARP. Most previous work in this area relies solely on real-time network probing or static parameter tuning, which either incurs excessive sampling overhead or fails to accurately predict the optimal transfer parameters. Combining historical data analysis with real-time sampling lets HARP tune the application-layer data transfer parameters accurately and efficiently to achieve close-to-optimal end-to-end data transfer throughput with very low overhead. Instead of one-time parameter estimation, HARP uses a feedback loop to adjust the parameter values to changing network conditions in real time. Experimental analyses over various network settings show that HARP outperforms existing solutions by up to 50% in achieved data transfer throughput. This work was published in IEEE Transactions on Parallel and Distributed Systems (TPDS) in 2018.
(2) Three novel algorithms for application-layer parameter tuning and transfer scheduling to maximize transfer throughput in wide-area networks. These algorithms use heuristic methods to tune the level of control channel pipelining (for small file optimization), the number of parallel data streams per file (for large file optimization), and the number of concurrent file transfers to increase I/O throughput (a technique useful for all types of files). The proposed algorithms improve the transfer throughput by up to 10x compared to the baseline and 7x compared to the state-of-the-art solutions. These algorithms were published in the Journal of Parallel and Distributed Computing (JPDC) in 2018.
(3) A novel two-phase application-layer throughput optimization model for big data transfers based on offline knowledge discovery and adaptive real-time tuning to ensure continuous performance guarantees and fairness among contending transfers in the shared network. In the offline analysis phase of this model, historical transfer logs are mined to perform knowledge discovery about the transfer characteristics. The online phase combines the knowledge discovered offline with real-time investigation of the network condition to optimize the protocol parameters. The approach was tested over different networks with different datasets and outperformed its closest competitor by 1.7x and the default case by 5x. It also achieved up to 93% accuracy relative to the optimal achievable throughput on those networks. This work was initially presented at the IEEE BigData 2017 conference, and an extended version was published in IEEE Transactions on Parallel and Distributed Systems (TPDS) in 2020.
(4) A novel solution for efficient and scalable metadata access for distributed applications across wide-area networks. This solution, dubbed SMURF, combines pipelining and concurrent transfer mechanisms with reliability, provides distributed continuum caching and semantic locality-aware prefetching strategies to sidestep fetching latency, and achieves scalable and high-performance metadata fetch/prefetch services in the Cloud. It exploits semantic locality, observed in real-life application I/O traces from Yahoo! Hadoop audit logs, to raise the prefetch prediction rate, and proposes a novel prefetch predictor. By caching and prefetching metadata based on the access patterns, the continuum caching and prefetching mechanism significantly improves the local cache hit rate and reduces the average fetching latency. Over approximately 20 million metadata access operations from real audit traces, SMURF achieved 90% accuracy for prefetch prediction and reduced the average fetch latency by 50% compared to the state-of-the-art mechanisms. This work was published in IEEE Transactions on Parallel and Distributed Systems (TPDS) in 2022.
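The idea shared by items (1) and (3), seeding a parameter from historical logs and then refining it with live measurements, can be sketched roughly as follows. This is not the published HARP or two-phase algorithm: it is a plain hill-climbing stand-in, the functions `seed_from_history` and `tune` are hypothetical, the log format of (concurrency level, throughput) pairs is an assumption, and the throughput curve in the usage lines is synthetic.

```python
from typing import Callable, Dict, List, Tuple


def seed_from_history(history: List[Tuple[int, float]]) -> int:
    """Offline step: pick the concurrency level with the best mean past throughput."""
    samples: Dict[int, List[float]] = {}
    for level, throughput in history:
        samples.setdefault(level, []).append(throughput)
    return max(samples, key=lambda lvl: sum(samples[lvl]) / len(samples[lvl]))


def tune(measure: Callable[[int], float], start: int,
         max_level: int = 32, rounds: int = 20) -> int:
    """Online step: probe neighbouring levels and move while throughput improves."""
    level = start
    best = measure(level)
    for _ in range(rounds):
        candidates = [c for c in (level - 1, level + 1) if 1 <= c <= max_level]
        if not candidates:
            break
        top_throughput, top_level = max((measure(c), c) for c in candidates)
        if top_throughput <= best:
            break  # no neighbour is better: stop at a local optimum
        level, best = top_level, top_throughput
    return level


# Synthetic throughput curve that peaks at a concurrency level of 8 (assumption).
def fake_measure(level: int) -> float:
    return 100.0 - (level - 8) ** 2


past_logs = [(2, 40.0), (4, 80.0), (4, 90.0), (16, 30.0)]
start_level = seed_from_history(past_logs)
tuned_level = tune(fake_measure, start_level)
```

Starting from the historical seed keeps the number of live probes small in this sketch, which reflects the motivation given above for combining history with sampling. A production tuner would also have to keep re-probing as network conditions change, which this sketch does not attempt.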
Broader Impacts:
This project resulted in the training and professional development of 31 graduate students (7 Ph.D. and 24 M.S.) and 11 undergraduate students through active participation in the research and development tasks of the project. In a half-day workshop, the basics of cloud computing, data-intensive computing, and the state-of-the-art OneDataShare data-sharing technologies were introduced to a group of high school students (mostly low-income inner-city kids) interested in STEM education. OneDataShare's novel data transfer optimization, interoperability, and prediction services will drastically increase the end-to-end performance of data transfers and data-intensive applications in different domains that depend on data movement. The cloud-hosted OneDataShare data transfer scheduling and optimization service has the potential to become a key component of the national data access and sharing infrastructure.
Last Modified: 01/04/2023
Modified by: Tevfik Kosar