Award Abstract # 1931297
Collaborative Research: Elements: Advancing Data Science and Analytics for Water (DSAW)

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: UTAH STATE UNIVERSITY
Initial Amendment Date: September 9, 2019
Latest Amendment Date: April 3, 2020
Award Number: 1931297
Award Instrument: Standard Grant
Program Manager: Varun Chandola
  OAC Office of Advanced Cyberinfrastructure (OAC)
  CSE Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2019
End Date: September 30, 2024 (Estimated)
Total Intended Award Amount: $568,496.00
Total Awarded Amount to Date: $568,496.00
Funds Obligated to Date: FY 2019 = $568,496.00
History of Investigator:
  • Jeffery Horsburgh (Principal Investigator)
    jeff.horsburgh@usu.edu
  • Alfonso Torres-Rua (Co-Principal Investigator)
  • Brian Crookston (Co-Principal Investigator)
  • Tianfang Xu (Co-Principal Investigator)
Recipient Sponsored Research Office: Utah State University
1000 OLD MAIN HL
LOGAN
UT  US  84322-1000
(435)797-1226
Sponsor Congressional District: 01
Primary Place of Performance: Utah State University
1415 Old Main Hill
Logan
UT  US  84322-1415
Primary Place of Performance Congressional District: 01
Unique Entity Identifier (UEI): SPE2YDWHDYU4
Parent UEI:
NSF Program(s): Hydrologic Sciences, Special Initiatives, EnvS-Environmtl Sustainability, Software Institutes, EarthCube
Primary Program Source: 01001920DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 026Z, 077Z, 7923, 8004
Program Element Code(s): 157900, 164200, 764300, 800400, 807400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Scientific challenges in hydrology and water resources, such as understanding impacts of variable climate, sustainability of water supply with population growth and land use change, and impacts of hydrologic change on ecosystems and humans, are increasingly data intensive. The volume of data produced by environmental scientists to study hydrologic systems requires advanced software tools for effective data visualization, analysis, and modeling. Scientists spend much of their time accessing, organizing, and preparing datasets for analysis, which can be a barrier to efficient analysis and can hinder scientific inquiry and advances. This project will develop new software that will enhance scientists' ability to apply advanced data visualization and analysis methods (collectively referred to as "data science" methods) in the hydrology and water resources domain. The project will promote standardized software tools and data formats to help scientists enhance the consistency, shareability, and reproducibility of the analyses they perform - all of which are important in building trust in scientific results. The software developed in the project will make data loading and organization for analysis easier, reducing the time scientists spend choosing appropriate data structures and writing computer code to read and parse data. It will enable users to automatically retrieve data from the HydroShare system, which is a hydrology domain data repository, as well as from important national water data sources like the United States Geological Survey's National Water Information System. The software will automatically load data from these sources into standardized, high performance data structures that are targeted to specific scientific data types and that integrate with visualization, analysis, and other data science capabilities commonly used by scientists in the hydrology and water resources domains.
The project will also reduce the technical burden for water scientists associated with creating a computational environment in which to execute their analyses by installing and maintaining the Python packages it develops within the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) HydroShare-linked JupyterHub environment. Finally, the project will demonstrate the functionality and use of the software by producing a set of educational modules based on real water-data science applications that provide a specific mechanism for delivering the software to the community and promoting its use in classroom and research environments.

Scientific and related management challenges in the water domain are inherently multi-disciplinary, requiring synthesis of data of multiple types from multiple domains. Many data manipulation, visualization, and analysis tasks performed by water scientists are difficult because (1) datasets are becoming larger and more complex; (2) standard data formats for common data types are not always agreed upon, and, when they are, they are not always mapped to an efficient structure for visualization and/or analysis within an analytical environment; and (3) water scientists generally lack training in data intensive scientific methods that would enable them to use new and existing tools to efficiently tackle large and complex datasets. This project will advance Data Science and Analytics for Water (DSAW) by developing: (1) an advanced object data model that maps common water-related data types to high performance data structures within the object-oriented Python language and analytical environment, based upon standard file, data, and content types established by the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) HydroShare system; and (2) two new Python packages that enable users to write Python code for automating retrieval of desired water data, loading it into high performance memory objects specified by the object data model designed in the project, and performing analysis in a reproducible way that can be shared, collaborated around, and formally published for reuse. The project will use domain-specific data science applications to demonstrate how the new Python packages can be paired with the powerful data science capabilities of existing Python packages like pandas, NumPy, and scikit-learn to develop advanced analytical workflows within cloud and desktop environments.
The project aims to extend the data access, collaboration, and archival capabilities of the HydroShare data and model repository and promote its use as a platform for reproducible water-data science. The project also aims to overcome barriers associated with accessing, organizing, and preparing datasets for data science intensive analyses. Overcoming these barriers will be an enabler for transforming scientific inquiries and advancing application of data science methods in the hydrology and water resources domains.
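
As a rough illustration of what an object data model of this kind might look like, the sketch below maps a content type name to a loader that returns a typed in-memory object. It is not the project's design: the class `TimeSeries`, the registry `LOADERS`, the content type key "time_series_csv", and the two-column CSV layout are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimeSeries:
    """Hypothetical in-memory object for one variable observed at one site."""
    variable: str
    units: str
    times: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def append(self, when, value):
        self.times.append(when)
        self.values.append(value)

    def mean(self):
        return sum(self.values) / len(self.values)


# Hypothetical registry: content type name -> loader function.
LOADERS = {}


def register(content_type):
    """Decorator that records a loader under a content type name."""
    def decorator(func):
        LOADERS[content_type] = func
        return func
    return decorator


@register("time_series_csv")
def load_time_series_csv(lines, variable, units):
    """Parse 'time,value' text lines (header first) into a TimeSeries."""
    series = TimeSeries(variable, units)
    for line in lines[1:]:  # skip the header row
        stamp, value = line.strip().split(",")
        series.append(datetime.fromisoformat(stamp), float(value))
    return series


def load(content_type, *args, **kwargs):
    """Dispatch to the loader registered for the given content type."""
    return LOADERS[content_type](*args, **kwargs)
```

With a registry like this, analysis code asks for data by content type and receives an object with a known interface, rather than choosing a data structure and writing parsing code for each new dataset.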

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Hodson, Timothy O. and DeCicco, Laura A. and Hariharan, Jayaram A. and Stanish, Lee F. and Black, Scott and Horsburgh, Jeffery S. "Reproducibility Starts at the Source: R, Python, and Julia Packages for Retrieving USGS Hydrologic Data" Water, v.15, 2023. https://doi.org/10.3390/w15244236
Jones, Amber Spackman and Horsburgh, Jeffery S. and Bastidas Pacheco, Camilo J. and Flint, Courtney G. and Lane, Belize A. "Advancing Hydroinformatics and Water Data Science Instruction: Community Perspectives and Online Learning Resources" Frontiers in Water, v.4, 2022. https://doi.org/10.3389/frwa.2022.901393
Jones, Amber Spackman and Jones, Tanner Lex and Horsburgh, Jeffery S. "Toward automating post processing of aquatic sensor data" Environmental Modelling & Software, v.151, 2022. https://doi.org/10.1016/j.envsoft.2022.105364
Xu, Tianfang and Liang, Feng "Machine learning for hydrologic sciences: An introductory overview" WIREs Water, v.8, 2021. https://doi.org/10.1002/wat2.1533

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The last decade has seen a tremendous advance in the availability of hydrologic and water-related data - both through generation of new datasets and through sharing and publication of existing datasets. Scientific and related management challenges in the water domain are increasingly data intensive, requiring synthesis of data of multiple types from multiple domains using advanced data science tools and methods. However, many data manipulation, visualization, and analysis tasks performed by water scientists and engineers are difficult because datasets have become larger, more numerous, and more complex. This project aimed to advance "water data science" - i.e., application of data science methods in the hydrology and water resources domain - through development of software tools aimed at extracting meaningful information from large and diverse data while lowering barriers to entry and use for existing water domain scientists and a new generation of water data scientists. Major goals of the project included:

1. Enable water-data scientists to more easily share and collaborate around data and analyses.

2. Provide data management, visualization, and analysis tools that advance water-data scientists' data science capabilities.

3. Promote more consistent data workflows, data reuse, and reproducibility of scientific results.

We developed two new Python code packages as software products, along with a Python-based object data model, to help water data scientists work more efficiently with hydrologic data. The first Python package, called "dataretrieval," was developed in collaboration with the United States Geological Survey (USGS) and enables users to efficiently access all of the data stored in the USGS National Water Information System (NWIS), which is one of the most important operational hydrologic data repositories in the U.S. It can also efficiently retrieve and work with data from the Water Quality Portal, which is a cooperative service between USGS and the U.S. Environmental Protection Agency (EPA) that integrates publicly available water quality data from USGS, EPA, and over 400 state, federal, tribal, and local agencies. This package was developed to reduce the amount of time and effort required to find, access, and work with data from these important data sources.
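
For readers unfamiliar with the package, a minimal sketch of a daily-values request is shown below. It assumes the `dataretrieval` package is installed and that its `nwis.get_record` function is available in the installed release; newer releases may organize USGS services differently. The helper names `daily_value_query` and `fetch_daily_values` are hypothetical and not part of the package, and the site number in the usage note is illustrative only.

```python
from datetime import date, timedelta


def daily_value_query(site, days, end=None, parameter_cd="00060"):
    """Build keyword arguments for a daily-values ("dv") request.

    Parameter code 00060 is the USGS code for discharge in cubic feet
    per second. The request covers the `days` days ending on `end`.
    """
    end = end or date.today()
    start = end - timedelta(days=days)
    return {
        "sites": site,
        "service": "dv",
        "start": start.isoformat(),
        "end": end.isoformat(),
        "parameterCd": parameter_cd,
    }


def fetch_daily_values(site, days=30):
    """Retrieve recent daily values for one site as a pandas DataFrame.

    The import is deferred so that the query helper above can be used
    without the package installed; the call itself needs network access.
    """
    import dataretrieval.nwis as nwis

    return nwis.get_record(**daily_value_query(site, days))
```

Calling `fetch_daily_values("10109000")` would request the last 30 days of daily discharge for that site number; the result can then be passed directly to pandas-based plotting or analysis code.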

The second Python code package is called "hsclient" and provides functionality for developing code to integrate water data science workflows (e.g., visualization, analysis, and modeling) with the HydroShare online repository. hsclient enables scientists to more easily share data and reproducible data science analyses in the trusted HydroShare repository, where others can access and use their data and scientific results. It enables users to automatically retrieve data from HydroShare, load the data into high performance data structures that are keyed to specific scientific data types and that integrate with existing visualization, analysis, and data science capabilities available in Python, and then write analysis results back to HydroShare for collaborative sharing and eventual publication. Thus, the hsclient Python package extends the HydroShare repository to simplify common data management tasks, speed up data loading and organization to reduce time to analysis, and enable reusable high performance data analytics for water-data scientists.
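
The retrieve, analyze, and write-back cycle described above can be sketched as follows. This is an illustration under stated assumptions, not the project's reference workflow: it assumes `hsclient` is installed, that the user has a HydroShare account, and that `file_download` returns the local path of the downloaded file. The helper names `summarize_csv` and `round_trip`, the output file name "summary.csv", and the CSV layout are hypothetical.

```python
import csv
import statistics
from pathlib import Path


def summarize_csv(csv_path, column):
    """Return count, mean, and max of one numeric column in a local CSV."""
    with open(csv_path, newline="") as handle:
        values = [
            float(row[column])
            for row in csv.DictReader(handle)
            if row[column] != ""  # skip missing values
        ]
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "max": max(values),
    }


def round_trip(resource_id, file_name, column, workdir="."):
    """Download a CSV from a HydroShare resource, summarize it, and
    upload the summary back to the same resource."""
    # Deferred import: needs hsclient installed and network access.
    from hsclient import HydroShare

    hs = HydroShare()
    hs.sign_in()  # prompts for HydroShare credentials
    resource = hs.resource(resource_id)

    # Assumption: file_download returns the path of the saved file.
    local_path = resource.file_download(file_name, save_path=workdir)
    summary = summarize_csv(local_path, column)

    out_path = Path(workdir) / "summary.csv"
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(summary.keys())
        writer.writerow(summary.values())

    resource.file_upload(str(out_path))
    return summary
```

Keeping the analysis step (`summarize_csv`) separate from the repository calls means the same analysis can be rerun on local files or by a collaborator working from the shared resource.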

Finally, we developed a set of water data science use cases to demonstrate how these software tools can be used by the community of water data scientists, hydrologists, students, and practitioners to get started with analysis right away rather than spending time figuring out how to organize and load their data. These use cases resulted in a set of reusable educational materials that we shared within the HydroLearn online platform for use by instructors who teach water data science concepts in their courses. We also held multiple community engagement workshops where we demonstrated the capabilities of the software and taught participants how to apply the packages to their own use cases. All of the source code for this project is freely and openly available under a liberal open-source license.


Last Modified: 11/15/2024
Modified by: Jeffery S Horsburgh