Award Abstract # 1040114
MRI: Development of Data-Scope - A Multi-Petabyte Generic Data Analysis Environment for Science

NSF Org: OAC (Office of Advanced Cyberinfrastructure)
Recipient: THE JOHNS HOPKINS UNIVERSITY
Initial Amendment Date: September 21, 2010
Latest Amendment Date: September 21, 2010
Award Number: 1040114
Award Instrument: Standard Grant
Program Manager: Amy Walton
awalton@nsf.gov
(703) 292-4538
OAC: Office of Advanced Cyberinfrastructure (OAC)
CSE: Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2010
End Date: September 30, 2014 (Estimated)
Total Intended Award Amount: $2,087,760.00
Total Awarded Amount to Date: $2,087,760.00
Funds Obligated to Date: FY 2010 = $2,087,760.00
History of Investigator:
  • Alexander Szalay (Principal Investigator)
    aszalay1@jhu.edu
  • Charles Meneveau (Co-Principal Investigator)
  • Andreas Terzis (Co-Principal Investigator)
  • Scott Zeger (Co-Principal Investigator)
  • Kenneth Church (Co-Principal Investigator)
Recipient Sponsored Research Office: Johns Hopkins University
3400 N CHARLES ST
BALTIMORE
MD  US  21218-2608
(443)997-1898
Sponsor Congressional District: 07
Primary Place of Performance: Johns Hopkins University
3400 N CHARLES ST
BALTIMORE
MD  US  21218-2608
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): FTMTDMBR29C7
Parent UEI: GS4PNKTRNKL3
NSF Program(s): Major Research Instrumentation
Primary Program Source: 01001011DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 1189
Program Element Code(s): 118900
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

This award funds the development and deployment of the Data-Scope, a computational instrument specifically designed to enable data analysis tasks that are simply not possible today. The instrument's unprecedented capabilities combine approximately five Petabytes of storage with a sequential IO bandwidth close to 500 GBytes/sec and 600 Teraflops of GPU computing. The need to keep acquisition costs and power consumption low while maintaining high performance and storage capacity introduces difficult tradeoffs. The Data-Scope will provide extreme data analysis performance over PB-scale datasets at the expense of generic features such as fault tolerance and ease of management. This is acceptable, however, because the Data-Scope is a research instrument rather than a traditional computational facility.
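
A quick back-of-the-envelope calculation (ours, not the abstract's) shows why these two figures belong together: at the stated sequential bandwidth, the entire store can be scanned in a few hours, which is what makes exploratory analysis over the whole dataset practical.

  # Illustrative arithmetic only; capacity and bandwidth are the
  # figures quoted in the abstract, everything else is assumed.
  capacity_bytes = 5e15            # ~5 Petabytes of storage
  bandwidth_bytes_per_s = 500e9    # ~500 GBytes/sec sequential IO

  scan_seconds = capacity_bytes / bandwidth_bytes_per_s
  print(f"Full sequential scan: {scan_seconds:,.0f} s "
        f"(~{scan_seconds / 3600:.1f} hours)")
  # -> Full sequential scan: 10,000 s (~2.8 hours)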

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Eyink, Gregory; Vishniac, Ethan; Lalescu, Cristian; Aluie, Hussein; Kanov, Kalin; Bürger, Kai; Burns, Randal; Meneveau, Charles; Szalay, Alexander "Flux-freezing breakdown in high-conductivity magnetohydrodynamic turbulence" Nature, v.497, 2013, p.466. doi:10.1038/nature12128
Koszalka, Inga; Haine, Thomas W. N.; Magaldi, Marcello G. "Fates and Travel Times of Denmark Strait Overflow Water in the Irminger Basin" J. Phys. Oceanogr., v.43, 2013, p.2611. doi:10.1175/JPO-D-13-023.1
Treib, M.; Bürger, K.; Reichl, F.; Meneveau, C.; Szalay, A.; Westermann, R. "Turbulence Visualization at the Terascale on Desktop PCs" IEEE Transactions on Visualization and Computer Graphics, v.18, 2012, p.2169. doi:10.1109/TVCG.2012.274
Budavari, Tamas; Dobos, Laszlo; Szalay, Alexander S. "SkyQuery: Federating Astronomy Archives" Computing in Science and Engineering, v.15, 2013, p.12
von Appen, W.-J.; Koszalka, I.; Pickart, R. S.; Haine, T. W. N.; Mastropole, D.; Magaldi, M. G.; Valdimarsson, H.; Girton, J.; Jochumsen, K.; Krahmann, G. "The East Greenland Spill Jet as an important component of the Meridional Overturning Circulation" Deep Sea Res. I, v.92, 2014, p.75. doi:10.1016/j.dsr.2014.06.002
Yang, L.; Silk, J.; Szalay, A.; Wyse, R.; Bozek, B.; Madau, P. "Dark matter contribution to Galactic diffuse gamma-ray emission" Physical Review D, v.89, 2013, p.063530. doi:10.1103/PhysRevD.89.063530

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The nature of science is changing – it is increasingly limited by our ability to analyze the large amounts of complex data generated by our instruments and simulations: we see the emergence of Jim Gray’s “Fourth Paradigm” of science. Computers themselves are becoming the source of much new data – the largest numerical simulations of nature today are on par with the experimental data sets in size. This is not simply a computational problem; it requires a fresh look and a holistic approach. We need to combine scalable algorithms and statistical tools with novel hardware and software solutions, like a deep integration of GPU computing with database indexing and fast spatial search capabilities. We proposed to build a new kind of instrument, a ‘Data-Scope’, capable of observing very large amounts of scientific data, with unique features in its design.
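
To make the "fast spatial search" idea concrete, here is a minimal Python sketch of a uniform-grid spatial index, the simplest structure behind such queries. This is our illustration, not the Data-Scope's actual database code; every name and parameter in it is invented for the example.

  import numpy as np

  # Minimal uniform-grid spatial index: bin 3-D points into cells so
  # that a neighborhood query touches only a handful of cells instead
  # of scanning every point. Purely illustrative.
  class GridIndex:
      def __init__(self, points, cell_size):
          self.points = np.asarray(points, dtype=float)
          self.cell_size = float(cell_size)
          self.cells = {}
          for i, p in enumerate(self.points):
              key = tuple((p // self.cell_size).astype(int))
              self.cells.setdefault(key, []).append(i)

      def query(self, center, radius):
          """Return indices of all points within `radius` of `center`."""
          center = np.asarray(center, dtype=float)
          lo = ((center - radius) // self.cell_size).astype(int)
          hi = ((center + radius) // self.cell_size).astype(int)
          hits = []
          for kx in range(lo[0], hi[0] + 1):
              for ky in range(lo[1], hi[1] + 1):
                  for kz in range(lo[2], hi[2] + 1):
                      for i in self.cells.get((kx, ky, kz), []):
                          if np.linalg.norm(self.points[i] - center) <= radius:
                              hits.append(i)
          return hits

  # Example: one million random points, neighbors of the origin.
  rng = np.random.default_rng(0)
  index = GridIndex(rng.uniform(-1, 1, size=(1_000_000, 3)), cell_size=0.1)
  print(len(index.query([0.0, 0.0, 0.0], radius=0.05)))

Choosing the cell size on the order of the typical query radius keeps each lookup confined to a few cells, so query cost scales with the local point density rather than with the size of the dataset.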

In science today, tackling data-intensive problems at the 5-10 TB scale is easy: one can perform these analyses at a typical departmental computing facility. Problems at 50-100 TB are quite difficult, but there are about 10-15 universities in the world that can analyze such data sets. When one needs to deal with a petabyte of data, there are fewer than a handful of places anywhere in the world that can address the challenge. At the same time, many projects are crossing the 100 TB boundary today. Astrophysics, High Energy Physics, Environmental Science, Computational Fluid Dynamics, Genomics and Bioinformatics are all encountering data challenges in the several hundred terabyte range and beyond – even within a single university. The large data sets are here, but the off-the-shelf solutions for their analysis are not!

The Data-Scope instrument has unique capabilities: it combines about 6.5 Petabytes of storage with a sequential IO bandwidth exceeding 500 GBytes/sec and 120 Teraflops of GPU computing. Keeping the cost of the instrument down while its performance and storage capacity stay very high, all at low power consumption, requires tradeoffs. The Data-Scope was tuned to provide extreme data analysis performance over petabytes at the expense of some generic features. It is a highly specialized tool for studying data, a microscope for data: a “Data-Scope”, which is why we consider it more akin to a research instrument than to a traditional computational facility. Since its commissioning it has enabled analysis tasks that would otherwise have been extremely difficult. Two of JHU’s Nobel Laureates and their students are among the early users of the Data-Scope.
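
These figures imply a balance point worth spelling out (our arithmetic, using only the numbers quoted above): the GPUs can afford roughly 240 floating-point operations per byte streamed from disk, so any analysis cheaper than that is IO-bound, which is exactly the regime the design targets.

  # Illustrative arithmetic from the figures quoted above.
  io_bandwidth = 500e9   # bytes/sec of sequential IO
  gpu_flops = 120e12     # GPU floating-point operations/sec

  print(f"Balance point: {gpu_flops / io_bandwidth:.0f} flops per byte streamed")
  # -> Balance point: 240 flops per byte streamed

  # A full scan of the 6.5 PB store at this bandwidth:
  print(f"Full scan: {6.5e15 / io_bandwidth / 3600:.1f} hours")
  # -> Full scan: 3.6 hours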

This new, data-intensive nature of science is becoming more important by the day. There is a similar vacuum in our ability to handle large data sets now as there was in the ’90s, when the concept of the Beowulf cluster emerged. Many universities and scientific disciplines are looking for a new template that would enable them to address PB-scale data analysis problems. By providing an inexpensive hardware and software architecture, we feel we can substantially accelerate the development of data-intensive science across the whole country. To speed the acceptance of this approach, we have collaborated with researchers across many different disciplines and institutions nationwide (Los Alamos, Oak Ridge, UCSC, NMSU, UW, UC, UIC, UIUC). The Data-Scope is hosting public services on some of the largest data sets in astronomy and fluid mechanics, both observational and simulated. Our public turbulence database services (close to 500 Terabytes) have delivered over 10 trillion data points to the world. Students and postdoctoral fellows using the Data-Scope are gaining a substantial career advantage – these will be the job skills of the 21st century.
