Award Abstract # 1740633
Collaborative Proposal: EarthCube Integration: Pangeo: An Open Source Big Data Climate Science Platform

NSF Org: OCE
Division Of Ocean Sciences
Recipient: UNIVERSITY CORPORATION FOR ATMOSPHERIC RESEARCH
Initial Amendment Date: August 21, 2017
Latest Amendment Date: August 29, 2022
Award Number: 1740633
Award Instrument: Standard Grant
Program Manager: Sean Kennan
skennan@nsf.gov
 (703)292-7575
OCE
 Division Of Ocean Sciences
GEO
 Directorate for Geosciences
Start Date: September 1, 2017
End Date: August 31, 2022 (Estimated)
Total Intended Award Amount: $466,882.00
Total Awarded Amount to Date: $466,882.00
Funds Obligated to Date: FY 2017 = $466,882.00
History of Investigator:
  • Ryan May (Principal Investigator)
    rmay@ucar.edu
  • Joseph Hamman (Co-Principal Investigator)
  • Kevin Paul (Co-Principal Investigator)
  • Davide Del Vento (Co-Principal Investigator)
  • Kevin Paul (Former Principal Investigator)
  • Ryan May (Former Co-Principal Investigator)
Recipient Sponsored Research Office: University Corporation For Atmospheric Res
3090 CENTER GREEN DR
BOULDER
CO  US  80301-2252
(303)497-1000
Sponsor Congressional District: 02
Primary Place of Performance: National Center for Atmospheric Research
1850 Table Mesa Drive
Boulder
CO  US  80305-5602
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): YEZEE8W5JKA3
Parent UEI:
NSF Program(s): OCE-Ocean Sciences Research, EarthCube
Primary Program Source: 01001718DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 026Z, 7433
Program Element Code(s): 689900, 807400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.050

ABSTRACT



Climate, weather, and ocean simulations (Earth System Models; ESMs) are crucial tools for the study of the Earth system, providing both scientific insight into fundamental dynamics and valuable practical predictions about Earth's future. Continuous increases in ESM spatial resolution have led to more realistic, more detailed physical representations of Earth system processes, while the proliferation of statistical ensembles of simulations has greatly enhanced understanding of uncertainty and internal variability. Hand in hand with this progress has come the generation of petabytes of simulation data, resulting in huge downstream challenges for geoscience researchers. The task of mining ESM output for scientific insights has now itself become a serious Big Data problem. Existing Big Data tools cannot easily be applied to the analysis of ESM data, leading to a building crisis across a wide range of geoscience fields. This is exactly the sort of problem EarthCube was conceived to address. The project will integrate a suite of open-source software tools (the "Pangeo Platform") which together can tackle petabyte-scale ESM datasets. Additionally, training and educational materials for these tools will be developed, distributed widely online, and integrated into existing educational curricula at Columbia. A workshop at NCAR in the final year will help inform the broader community about Pangeo. Collaborators at other US climate modeling centers will encourage adoption of, and participation in, the Pangeo project by their scientists. Beyond climate and related fields, multidimensional numeric arrays are common in many fields of science (e.g. astronomy, materials science, microscopy). However, the dominant Big Data software stack (Hadoop) is oriented towards tabular text-based data structures and cannot easily ingest petabyte-scale multidimensional numeric arrays. The proposed work thus has the potential to transform Data Science itself, enabling analysis of such datasets via a novel, highly scalable, highly flexible tool with a syntax familiar to disciplinary researchers.

The core technologies are the Python packages Dask, a flexible parallel computing library which provides dynamic task scheduling, and Xarray, a wrapper layer over Dask data structures which provides user-friendly metadata tracking, indexing, and visualization. These tools interface with netCDF datasets and understand CF conventions. They will be brought to bear on four high-impact Geoscience Use Cases in atmospheric science, land-surface hydrology, and physical oceanography. Disciplinary scientists will define workflows for each use case and interact with computational scientists to demonstrate, benchmark, and optimize the software. The resulting software improvements will be contributed back to the upstream open source projects, ensuring long-term sustainability of the platform. The end result will be a robust new software toolkit for climate science and beyond. This toolkit will enhance the Data Science aspect of EarthCube. Implementation of these tools on the cloud will also be tested, taking advantage of an agreement between commercial cloud service providers and NSF for the BIGDATA solicitation.
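The idea behind the task scheduling described above can be illustrated with a small, standard-library-only sketch. It is a toy illustration and not Dask's or Xarray's actual API or implementation: a large array is split into chunks, the work on each chunk is recorded as an independent task, and a pool of workers runs the tasks before the partial results are combined. All names here (`chunked`, `partial_sum`, `build_mean_graph`, `compute_mean`) and the sample data are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, size):
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum(chunk):
    """Reduce one chunk to a (sum, count) pair that can be combined later."""
    return sum(chunk), len(chunk)


def build_mean_graph(values, chunk_size):
    """Describe a mean as one independent task per chunk.

    Nothing is computed here: the returned dict only records which
    function should be applied to which chunk (a lazy description of work).
    """
    return {
        f"partial-{i}": (partial_sum, chunk)
        for i, chunk in enumerate(chunked(values, chunk_size))
    }


def compute_mean(tasks, max_workers=4):
    """Run every per-chunk task on a thread pool, then combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, arg) for func, arg in tasks.values()]
        partials = [future.result() for future in futures]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical stand-in for one variable of model output.
temperatures = [float(t) for t in range(1, 101)]
graph = build_mean_graph(temperatures, chunk_size=10)  # 10 tasks, no work done yet
mean_temperature = compute_mean(graph)                 # work happens here
```

Because each chunk is reduced independently, no single task needs the whole dataset at once, which is the property that lets this style of computation extend to data larger than memory.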





PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


May, Ryan M. and Goebbert, Kevin H. and Thielen, Jonathan E. and Leeman, John R. and Camron, M. Drew and Bruick, Zachary and Bruning, Eric C. and Manser, Russell P. and Arms, Sean C. and Marsh, Patrick T. "MetPy: A Meteorological Python Library for Data Analysis and Visualization" Bulletin of the American Meteorological Society, v.103, 2022. https://doi.org/10.1175/BAMS-D-21-0125.1

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The Pangeo project is a novel interdisciplinary collaboration among scientists and software developers aimed at improving the way scientists work with very large datasets. The ultimate goal of this work is to enable transformative new scientific discoveries, enhance reproducibility, and make the scientific process more fun and less frustrating. Over the four years since its inception, Pangeo, and this award specifically (the first federal funding for Pangeo), have helped transform geoscience informatics and cyber-infrastructure.

Improvements to the Pangeo Platform have already led to scientific use cases (found on the Geoscience Use Cases page of the project website: https://pangeo.io/use_cases/) that demonstrate how to conduct exploratory, interactive analysis of large datasets that could not be accomplished with prior serial toolsets. Over the last year, these use cases have grown in number and complexity. Many use cases have been moved to a tested gallery (https://gallery.pangeo.io). This technology has proven useful and flexible for conducting reproducible science in the cloud, and not just on HPC systems such as NCAR's NWSC. Demonstrated deployments of the Pangeo platform exist on Google Cloud Platform, AWS, and Microsoft Azure. This technology has the potential to be transformative, allowing scientists to conduct truly interactive analyses of data that in the past required specialized, batch-job-style analysis.

Part of this work has included transforming how data is stored, cataloged, and distributed over the web. In particular, our work has helped introduce a new cloud-native data format, Zarr, into the geoscience space. Zarr is now officially supported as an underlying storage format for the NetCDF storage library by Unidata (https://www.unidata.ucar.edu/blogs/developer/en/entry/overview-of-zarr-support-in). Zarr was also recently accepted as an OGC Community Standard (https://www.ogc.org/standards/community).

We have contributed to the widespread adoption of Python, and Xarray in particular, as the go-to tool for analyzing environmental data. This is evident, for example, from NCAR's Pivot to Python (https://www.ncl.ucar.edu/Document/Pivot_to_Python/). The reach of these impacts has been felt through contributions across the scientific Python ecosystem, from generic tools like Dask and Xarray to more domain-specific tools like MetPy.

In addition to these main areas of impact, our software tools have supported hundreds of peer-reviewed publications from scientists around the world.

Last Modified: 01/20/2023
Modified by: Ryan May

Please report errors in award information by writing to: awardsearch@nsf.gov.
