Award Abstract # 1261715
CIF21 DIBBs: Long Term Access to Large Scientific Data Sets: The SkyServer and Beyond

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: THE JOHNS HOPKINS UNIVERSITY
Initial Amendment Date: September 12, 2013
Latest Amendment Date: July 12, 2019
Award Number: 1261715
Award Instrument: Cooperative Agreement
Program Manager: Amy Walton
awalton@nsf.gov
 (703)292-4538
OAC
 Office of Advanced Cyberinfrastructure (OAC)
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2013
End Date: March 31, 2020 (Estimated)
Total Intended Award Amount: $9,549,659.00
Total Awarded Amount to Date: $10,449,659.00
Funds Obligated to Date: FY 2013 = $7,603,723.00
FY 2016 = $1,268,997.00
FY 2017 = $1,076,939.00
FY 2019 = $500,000.00
History of Investigator:
  • Alexander Szalay (Principal Investigator)
    aszalay1@jhu.edu
  • Charles Meneveau (Co-Principal Investigator)
  • Aniruddha Thakar (Co-Principal Investigator)
  • Randal Burns (Co-Principal Investigator)
  • Michael Rippin (Co-Principal Investigator)
  • Steven Salzberg (Former Co-Principal Investigator)
Recipient Sponsored Research Office: Johns Hopkins University
3400 N CHARLES ST
BALTIMORE
MD  US  21218-2608
(443)997-1898
Sponsor Congressional District: 07
Primary Place of Performance: Johns Hopkins University
3400 North Charles Street
Baltimore
MD  US  21218-2608
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): FTMTDMBR29C7
Parent UEI: GS4PNKTRNKL3
NSF Program(s): Data Cyberinfrastructure
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVIT
01001718DB NSF RESEARCH & RELATED ACTIVIT
01001920DB NSF RESEARCH & RELATED ACTIVIT
01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 8084, 8048
Program Element Code(s): 772600
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

The Project aims to create a sustainable collaborative ecosystem built around several large scientific data sets for the broader science community. Building on the expertise developed for the Sloan Digital Sky Survey (SDSS) SkyServer and its associated projects, the Project will formalize the main system components and reengineer them to be much more reusable.

The Project will take full ownership of the Sloan Digital Sky Survey archive and will provide a robust environment for its continued operations, using an economy of scale enabled by common, shared building blocks derived from the existing SDSS SkyServer framework, based upon a large, scalable database system.

Using these building blocks, the team will build and operate open data archives from large observations and numerical simulations, including computational fluid dynamics, ocean circulation and astrophysics, reaching PB scales. The Project will further extend the tools to the life sciences, such as large-scale, next-generation genome sequencing experiments and high-throughput neuroscience imaging data. The resulting distributed, parallel database framework will be linked to small, user-created data sets that can also be used collaboratively, in conjunction with each other and with the large data collections.

The Project will work with selected communities to help deploy and serve data using our building blocks, demonstrating portability, generality and economies of scale; will help and encourage other institutions and communities to use the tools, while seeking collaborations that result in disruptive changes; and will build tools that shorten the time needed to deploy new services and applications and to rapidly test new ideas.

The Project will enable individual users to bring their "small data" and analyze it collaboratively in the context of the large data.
Our particular goals are:

(i) Take full ownership of the SDSS Archive (database and flat files) and ensure a scalable and robust environment for its continued operation;

(ii) Build upon our decade-long effort on SDSS and its ad-hoc spinoffs, through reengineering its components into portable and general building blocks;

(iii) Systematically address curation issues arising from using a service-oriented architecture (SOA), and the resulting service life-cycle;

(iv) Work with projects from additional scientific domains to help deploy and serve data using our building blocks, demonstrating portability, generality and economies of scale;

(v) Develop scalable extensions to our database cluster in order to deal with large numerical simulations scaling up to petabytes, and turn them into open numerical laboratories;

(vi) Use our CasJobs Collaborative Environment to address the problem of small but complex data in the "Long Tail" of science.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Almansi, M., Haine, T. W. N., Pickart, R. S., Magaldi, M. G., Gelderloos, R., & Mastropole, D. "High-Frequency Variability in the Circulation and Hydrography of the Denmark Strait Overflow from a High-Resolution Numerical Model" Journal of Physical Oceanography , v.47 , 2017 , p.2999 doi:10.1175/JPO-D-17-0129.1
Collado-Torres, L., Nellore, A., Kammers, K., Ellis, S. E., Taub, M. A., Hansen, K. D., Jaffe, A. E., Langmead, B., & Leek, J. T. "Reproducible RNA-seq analysis using recount2" Nature Biotechnology , v.35 , 2017 , p.319 doi:10.1038/nbt.3838
Danish, M. & Meneveau, C. "Multi-scale analysis of the invariants of velocity gradient tensor in isotropic turbulence" Physical Review Fluids , v.3 , 2018 , p.044604 doi:10.1103/PhysRevFluids.3.044604
Kim, D., Lekić, V., Ménard, B., Baron, D., & Taghizadeh-Popp, M. "Sequencing seismograms: A panoptic view of scattering in the core-mantle boundary region" Science , 2020 , p.1123
Elsas, J. H., Szalay, A. S., & Meneveau, C. "Geometry and scaling laws of excursion and iso-sets of enstrophy and dissipation in isotropic turbulence" Journal of Turbulence , v.19 , 2018 , p.297 doi:10.1080/14685248.2018.1424995
Fraser, N.J., Inall, M.E., Magaldi, M.G., Haine, T.W.N., & Jones, S.C. "Wintertime Fjord-Shelf Interaction and Ice Sheet Melting in Southeast Greenland" Journal of Geophysical Research Oceans , v.123 , 2018 , p.9156 doi:10.1029/2018JC014435
Medvedev, D., Lemson, G., & Rippin, M. "SciServer Compute: Bringing Analysis Close to the Data" Proceedings of Astronomical Data Analysis Software and Systems (ADASS) XXVI , 2017
Nummelin, A. "Statistical Inversion of Surface Ocean Kinematics from Sea Surface Temperature Observations" Journal of Atmospheric & Oceanic Technology , v.35 , 2018 , p.1913 doi:10.1175/JTECH-D-18-0057.1
Rippin, M., Lemson, G., Thakar, A., Medvedev, D., & Taghizadeh-Popp, M. "SciServer: Collaborative Science Platform" Gateways 2018: The 13th Gateway Computing Environments Conference , 2018
Taghizadeh-Popp, M., Lemson, G., Kim, J.-W., & Rippin, M. "SciServer: a collaborative workspace for data analysis, sharing and storage in the cloud" Proceedings of Astronomical Data Analysis Software and Systems (ADASS) XXVII , 2018
Zhao, W., Lee, J., Meneveau, C., & Zaki, T. "Application of a self-organizing map to identify the turbulent-boundary-layer interface in a transitional flow" Physical Review Fluids , v.4 , 2019 , p.023902-1 doi:10.1103/PhysRevFluids.4.023902

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Over the last 20 years, the amount of available scientific data has doubled approximately every year - every year adds more measurements of the universe than existed in all of human history. Computers and information technology have struggled to keep pace with this data avalanche, and researchers have had to develop new skills to make sense of it all. New approaches are needed to store, analyze, visualize, and publish data at such high volumes. Furthermore, data sets have grown too large to be downloaded and processed locally - science must be done online, close to the data, if it is to be done at all.

Building on 20+ years of experience that our team at Johns Hopkins University's Institute for Data-Intensive Engineering and Science (IDIES) has garnered hosting and serving large astronomy data sets, we have created SciServer (www.sciserver.org), a full-featured "science platform" that enables scientists to conduct data-driven research with the largest scientific datasets, entirely online.

In 2001, we debuted SkyServer (skyserver.sdss.org), the first online portal for data-intensive astronomy. SkyServer gave the entire world free access to Terabytes of data from the Sloan Digital Sky Survey (SDSS), an ongoing worldwide effort to make a three-dimensional map of the universe. SkyServer has been one of the most widely used public science portals, resulting in thousands of research publications and millions of students and citizen scientists learning science by doing science with real data.

SciServer greatly increases the capabilities of SkyServer and expands its content from astronomy to all sciences (hence the change in name).

SciServer provides a complete online environment for working with big scientific data sets, with features for researchers, educators, and data providers - all accessible through a single web destination. This includes:

  • A versatile user interface accessible to novices and professional researchers alike (see Figs 1-2)

  • Components supporting query, computation, visualization, and analysis of data sets in any field of science

  • Server-side access to Petabytes of data, covering topics such as astronomy, oceanography, genomics, and materials science.

  • The ability to store personal data sets online

  • Mechanisms to support sharing of public and private datasets - promoting public datasets while keeping private datasets securely accessible only by designated members of a research team

  • Educational tools for the next generation of data scientists and their instructors, designed to make sharing of datasets and activities as simple as possible for both in-person and virtual classroom environments

Like our previous projects, SciServer gives users the ability to submit free-form queries to Terabyte-scale databases, and to save data of interest to a personal database. SciServer Compute extends this functionality greatly with tools that allow users to write programs in a range of computer languages, which can access powerful scientific software libraries, without having to download, install, or configure anything. These programs run online on computers that have fast access to data servers, and all scripts can be executed in both interactive and batch mode. See Figure 3 for some examples.
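The query-then-save workflow described above can be illustrated with a minimal local sketch. It uses Python's built-in sqlite3 module as a stand-in for a hosted database; it does not use any SciServer or SkyServer interface, and the table, column, and function names (catalog, my_bright_objects, save_query_result) are hypothetical, chosen only for illustration.

```python
import sqlite3

# Stand-in for a large shared catalog; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalog (obj_id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)"
)
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        (1, 10.5, -1.2, 17.3),
        (2, 11.0, 0.4, 21.8),
        (3, 185.2, 15.9, 16.1),
        (4, 185.4, 16.0, 19.5),
    ],
)


def save_query_result(conn, table_name, query):
    """Run a free-form SELECT and keep its rows in a new 'personal' table."""
    # table_name is interpolated into SQL, so restrict it to a plain identifier.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    conn.execute(f"CREATE TABLE {table_name} AS {query}")
    return conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]


# Free-form query: keep only the brighter objects (smaller magnitude).
n_saved = save_query_result(
    conn,
    "my_bright_objects",
    "SELECT obj_id, ra, dec, mag FROM catalog WHERE mag < 18.0",
)
bright_ids = [
    row[0]
    for row in conn.execute("SELECT obj_id FROM my_bright_objects ORDER BY obj_id")
]
print(n_saved, bright_ids)
```

The saved table can then be queried again or joined against other tables without rerunning the original selection, which is the general idea behind keeping data of interest in a personal database.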

Each SciServer user receives 10 GB of stable, backed-up private storage, plus access to Terabytes of shared scratch space for short-term storage of larger data products. Curators of large datasets who wish to disseminate their data publicly through SciServer can contact us, and we will help them share their data, either with all SciServer users or with invited users or groups, with custom read/write/share permissions.

The full SciServer system launched in June 2018 and has grown steadily in breadth and scope. We support more than 50 research projects across many different scientific disciplines, as illustrated in Figure 4. New data sets include extensions to catalogues from SDSS and other astronomical surveys, as well as data from the fields of medicine and bioinformatics, social sciences and even literature. SciServer also provides access to truly virtual data, produced by large computer simulations of, for example, the universe, ocean circulation and materials, which can be compared to "real" data on the same platform.

Separate instances of SciServer have been deployed at the Max Planck Institute for Extraterrestrial Physics in Munich, where it supports data from the eROSITA X-ray satellite, as well as at the National Institute of Standards and Technology (NIST), where it is being used to investigate the effects of Hurricane Maria on Puerto Rico in 2017.

SciServer has been used in courses at JHU and at other sites such as the University of St. Andrews in Scotland. An exciting new capability is to spin up a temporary SciServer instance in a commercial cloud, which courses or schools can use to manage their own users and data sets.

Most importantly, we support a community of over 8,000 users, including many educators who are using SciServer to teach data science and computational thinking to the next generation of scientists.

Last Modified: 08/12/2020
Modified by: Michael Rippin

Please report errors in award information by writing to: awardsearch@nsf.gov.
