
NSF Org: OAC Office of Advanced Cyberinfrastructure (OAC)
Recipient:
Initial Amendment Date: July 1, 2020
Latest Amendment Date: October 25, 2024
Award Number: 2019163
Award Instrument: Standard Grant
Program Manager: Deepankar Medhi, dmedhi@nsf.gov, (703) 292-2935, OAC Office of Advanced Cyberinfrastructure (OAC), CSE Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2020
End Date: March 31, 2025 (Estimated)
Total Intended Award Amount: $470,384.00
Total Awarded Amount to Date: $479,484.00
Funds Obligated to Date: FY 2024 = $9,100.00
History of Investigator:
Recipient Sponsored Research Office: 601 S HOWES ST, FORT COLLINS, CO, US 80521-2807, (970) 491-6355
Sponsor Congressional District:
Primary Place of Performance: 200 W Lake St, Fort Collins, CO, US 80521-4593
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): Special Projects - CNS, CISE Research Resources
Primary Program Source: 01002425DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Scientific data transfers have gotten so large that previously rare transmission errors in the Internet are causing some scientific data transfers to be corrupted. The Internet's error checking mechanisms were designed at a time when a megabyte was a large file. Now files can contain terabytes. The old error checking mechanisms are in danger of being overwhelmed. This project seeks to find new error checking mechanisms for the Internet to safely move tomorrow's scientific data efficiently and without errors.
This project addresses two fundamental issues. First, the Internet's checksums and message digests are too small (32-bits) and probably are poorly tuned to today's error patterns. It is a little-known fact that checksums can (and typically should) be designed to reliably catch specific errors. A good checksum is designed to protect against errors that it will actually encounter. So the first step in this project is to collect information about the kinds of transmission errors currently happening in the Internet for a comprehensive study. Second, today's file transfer protocols, if they find a file has been corrupted in transit, simply discard the file and transfer it again. In a world in which the file is huge (tens of terabytes or even petabytes long), that's a tremendous waste. Rather, the file transfer protocol should seek to repair the corrupted parts of the file. As the project collects data about errors, it will also design a new file transfer protocol that can incrementally verify and repair files.
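For illustration, the following minimal Python sketch shows the incremental verify-and-repair idea described above: the received file is hashed block by block, mismatched blocks are identified, and only those blocks would be re-transferred. The block size, function names, and use of SHA-256 are assumptions made for this example, not the protocol developed by the project.

# Illustrative sketch only (not the project's actual protocol): verify a
# received file block by block and report which blocks need to be repaired.
# Block size, names, and the choice of SHA-256 are assumptions for this example.
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB per block, an arbitrary choice

def block_digests(path):
    """Return per-block SHA-256 digests for the file at `path`."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests

def blocks_to_repair(received_path, sender_digests):
    """Compare the receiver's per-block digests with those supplied by the
    sender and return the indices of blocks that differ."""
    received = block_digests(received_path)
    return [i for i, (ours, theirs) in enumerate(zip(received, sender_digests))
            if ours != theirs]

# A receiver could then request only the listed blocks from the sender,
# rather than discarding and re-transferring a multi-terabyte file.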
This project will improve the Internet's ability to support big data transfers, both for science and commerce, for decades to come. Users will be able to transfer big files with confidence that the data will be accurately and efficiently copied over the network. This work will further NSF's Blueprint for a National Cyberinfrastructure Ecosystem by ensuring a world in which networks work efficiently to deliver trustworthy copies of big data to anyone who needs it. Additional information on the project is available at: www.hipft.net
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The central theme of this effort was to better understand file transfer errors in the Internet. There were two hypotheses about the source of the errors: one attributed them to file system errors that took place during the file transfer; the other attributed them to network errors that were not detected by the TCP checksum.
Our effort focused on identifying network errors. A secondary goal, if the network turned out to be the source of the errors, was to develop file transfer protocols able to detect and mitigate those errors.
During this project, we made multiple findings:
- We were able to detect file transfer errors and show that both file system errors and network errors are occurring.
- File transfer errors are much less common than prior studies suggested: we only detected a handful.
- The types of errors occurring on today's Internet are different from the errors that were present (and planned for) when the Internet was being developed in the 1970s.
- We can learn a considerable amount simply by capturing checksums that do not match their data and examining the Hamming distance between the expected and actual checksums (see the sketch after this list).
- We can improve our file transfer protocols to be faster by better utilizing cases where there are multiple repositories storing the same data.
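As one example of the Hamming-distance observation above, the short sketch below (with made-up checksum values and hypothetical names) counts how many bits differ between the checksum carried with the data and the checksum recomputed by the receiver.

# Hypothetical example: measure the Hamming distance between an expected and
# an observed 32-bit checksum to characterize how many bits an error flipped.
def hamming_distance(expected, observed):
    """Number of bit positions in which two checksum values differ."""
    return bin(expected ^ observed).count("1")

expected_crc = 0x1A2B3C4D  # checksum carried with the data (made-up value)
observed_crc = 0x1A2B3C4F  # checksum recomputed over the received data (made-up)
print(hamming_distance(expected_crc, observed_crc))  # prints 1: a single-bit flip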
These results are encouraging. We know far more about the frequency (lower than expected) and source(s) of errors (both disks and networks). We have demonstrated that it may be possible to use multiple repositories as a way to mitigate errors, without a performance penalty.
Because errors are much less common than predicted, we are caught in the predicament of (a) knowing that a problem exists (undetected errors are happening), but (b) not being able to capture enough errors to analyze them properly and determine how best to mitigate their effects.
Finally, we observed that traditional data collection pipelines and testbeds do not lend themselves well to large-scale measurements such as ours. We concluded that a fundamentally different framework is needed to perform network measurement and observation at scale in the exascale era.
Last Modified: 05/23/2025
Modified by: Craig Partridge