
NSF Org: OAC Office of Advanced Cyberinfrastructure (OAC)
Recipient: Johns Hopkins University
Initial Amendment Date: September 21, 2010
Latest Amendment Date: September 21, 2010
Award Number: 1040114
Award Instrument: Standard Grant
Program Manager: Amy Walton, awalton@nsf.gov, (703) 292-4538, OAC Office of Advanced Cyberinfrastructure (OAC), CSE Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2010
End Date: September 30, 2014 (Estimated)
Total Intended Award Amount: $2,087,760.00
Total Awarded Amount to Date: $2,087,760.00
Funds Obligated to Date:
History of Investigator:
Recipient Sponsored Research Office: 3400 N CHARLES ST, BALTIMORE, MD, US 21218-2608, (443) 997-1898
Sponsor Congressional District:
Primary Place of Performance: 3400 N CHARLES ST, BALTIMORE, MD, US 21218-2608
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): Major Research Instrumentation
Primary Program Source:
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
1040114
Szalay
This award funds the development and deployment of the Data-Scope, a computational instrument specifically designed to enable data analysis tasks that are simply not possible today. The instrument's unprecedented capabilities combine approximately five Petabytes of storage with a sequential IO bandwidth close to 500 GBytes/sec and 600 Teraflops of GPU computing. The need to keep acquisition costs and power consumption low while maintaining high performance and storage capacity introduces difficult tradeoffs. The Data-Scope will provide extreme data analysis performance over PB-scale datasets at the expense of generic features such as fault tolerance and ease of management. This is, however, acceptable, since the Data-Scope is a research instrument rather than a traditional computational facility.
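To put these headline figures in perspective, a back-of-the-envelope calculation shows what roughly 500 GBytes/sec of sequential bandwidth means at petabyte scale. The short Python sketch below uses only the numbers quoted in the abstract; the decimal (SI) interpretation of PB and GB is our assumption.

```python
# Back-of-the-envelope scan time for the Data-Scope, using the figures
# quoted in the abstract above. Decimal (SI) units are assumed.

CAPACITY_BYTES = 5e15             # ~5 PB of storage
BANDWIDTH_BYTES_PER_SEC = 500e9   # ~500 GBytes/sec aggregate sequential IO

scan_seconds = CAPACITY_BYTES / BANDWIDTH_BYTES_PER_SEC
print(f"One full sequential pass: {scan_seconds:,.0f} s "
      f"= {scan_seconds / 3600:.1f} hours")
# -> One full sequential pass: 10,000 s = 2.8 hours
```

At that rate a brute-force pass over the entire store completes in under three hours, which is what makes whole-dataset scanning, rather than sampling, a practical analysis strategy.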
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The nature of science is changing – it is increasingly limited by our ability to analyze the large amounts of complex data generated by our instruments and simulations: we see the emergence of Jim Gray’s “Fourth Paradigm” of science. Computers themselves are becoming the source of a lot of new data – the sizes of the largest numerical simulations of nature today are on par with the experimental data sets. This is not simply a computational problem, but rather requires a fresh look, and a holistic approach. We need to combine scalable algorithms and statistical tools with novel hardware and software solutions, like a deep integration of GPU computing with database indexing and fast spatial search capabilities. We propose to build a new kind of instrument, a ‘Data-Scope’, that is capable of observing very large amounts of scientific data, with unique features in its design.
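The "fast spatial search capabilities" mentioned above refer to the kind of database-side indexing this team pioneered, for example the zone-indexing scheme Gray and Szalay described for astronomical cross-matching. The sketch below is a minimal, hypothetical illustration of that idea, not the Data-Scope's actual implementation: points are bucketed into fixed-height declination zones, so a radius query touches only a few adjacent buckets instead of the whole table. The zone height and the flat-sky distance test are simplifying assumptions.

```python
from collections import defaultdict
from math import floor

# Minimal, hypothetical sketch of zone-based spatial bucketing: points are
# grouped into horizontal zones of fixed height, so a radius query inspects
# only the zones the search circle overlaps.

ZONE_HEIGHT = 0.5  # zone height in degrees; an assumed tuning parameter

zones = defaultdict(list)  # zone id -> list of (ra, dec) points

def insert(ra, dec):
    zones[floor(dec / ZONE_HEIGHT)].append((ra, dec))

def neighbors(ra, dec, radius):
    """Return points within `radius` degrees (flat-sky approximation)."""
    lo = floor((dec - radius) / ZONE_HEIGHT)
    hi = floor((dec + radius) / ZONE_HEIGHT)
    hits = []
    for z in range(lo, hi + 1):          # only a few zones are scanned
        for pra, pdec in zones[z]:
            if (pra - ra) ** 2 + (pdec - dec) ** 2 <= radius ** 2:
                hits.append((pra, pdec))
    return hits

insert(10.0, 41.2); insert(10.1, 41.3); insert(250.0, -20.0)
print(neighbors(10.0, 41.25, 0.2))  # finds only the two nearby points
```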
In the sciences today, tackling data-intensive problems at the 5-10 TB scale is easy: one can perform these analyses at a typical generic departmental computing facility. 50-100 TB problems are quite difficult, but there are about 10-15 universities in the world that can analyze such data sets. When one needs to deal with a petabyte of data, there are fewer than a handful of places anywhere in the world that can address this challenge. At the same time, there are many projects that are crossing the 100 TB boundary today. Astrophysics, High Energy Physics, Environmental Science, Computational Fluid Dynamics, Genomics and Bioinformatics are all encountering data challenges in the several hundred terabyte range and beyond – even within a single university. The large data sets are here, but the off-the-shelf solutions for their analyses are not!
The Data-Scope instrument has unique capabilities: it combines about 6.5 Petabytes of storage with a sequential IO bandwidth exceeding 500 GBytes/sec and 120 Teraflops of GPU computing. In order to keep the cost of the instrument down and its performance and storage capacity very high, all at low power consumption, there must be tradeoffs. The Data-Scope was tuned to provide extreme data analysis performance over petabytes at the expense of some generic features. It is a highly specialized tool to study data, a microscope for data: a "Data-Scope", which is why we consider it more similar to a research instrument than to a traditional computational facility. Since its commissioning, it has enabled certain analysis tasks that would have been extremely difficult otherwise. Two of JHU's Nobel Laureates and their students are among the early users of the Data-Scope.
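The gap between the data tiers described earlier is easy to quantify. The sketch below compares the time for one sequential pass over a dataset on the Data-Scope (using the roughly 500 GBytes/sec figure from this report) against a hypothetical departmental facility; the 2 GBytes/sec aggregate bandwidth assumed for the latter is purely illustrative and is not taken from the report.

```python
# Illustrative scan times for the data tiers discussed above. The
# Data-Scope bandwidth is from the report; the departmental figure
# (2 GB/s aggregate) is an assumed, hypothetical value.

TB = 1e12  # decimal terabyte, assumed convention

systems = {
    "Data-Scope (~500 GB/s)": 500e9,
    "departmental facility (assumed 2 GB/s)": 2e9,
}
dataset_sizes_tb = [10, 100, 1000]  # the 10 TB / 100 TB / 1 PB tiers

for name, bandwidth in systems.items():
    for size_tb in dataset_sizes_tb:
        hours = size_tb * TB / bandwidth / 3600
        print(f"{name}: {size_tb:>4} TB scan takes {hours:7.2f} hours")
```

Under these assumptions a petabyte pass finishes in about half an hour on the Data-Scope but takes nearly six days at the assumed departmental bandwidth, which is the practical sense in which PB-scale analysis is out of reach for generic facilities.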
This new, data-intensive nature of science is becoming increasingly important by the day. There is a similar vacuum in our ability to handle large data sets now as there was in the 90s, when the concept of the Beowulf cluster emerged. Many universities and scientific disciplines are looking for a new template that would enable them to address PB-scale data analysis problems. By providing an inexpensive hardware and software architecture, we feel that we can substantially accelerate the development of data-intensive science in the whole country. In order to accelerate the acceptance of the proposed approach, we will collaborate with researchers across many different disciplines and across many different institutions nationwide (Los Alamos, Oak Ridge, UCSC, NMSU, UW, UC, UIC, UIUC). The Data-Scope is hosting public services on some of the largest data sets in astronomy and fluid mechanics, both observational and simulated. Our public turbulence database services (close to 500 Terabytes) have delivered over 10 trillion data points to the world. Students and postdoctoral fellows using the Data-Scope are gaining a substantial career advantage – these will be the job skills of the 21st century.