
NSF Org: OAC Office of Advanced Cyberinfrastructure (OAC)
Recipient:
Initial Amendment Date: September 26, 2013
Latest Amendment Date: May 9, 2018
Award Number: 1261721
Award Instrument: Cooperative Agreement
Program Manager: Amy Walton, awalton@nsf.gov, (703) 292-4538, OAC Office of Advanced Cyberinfrastructure (OAC), CSE Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2013
End Date: September 30, 2018 (Estimated)
Total Intended Award Amount: $7,600,000.00
Total Awarded Amount to Date: $8,914,035.00
Funds Obligated to Date: FY 2014 = $1,246,850.00; FY 2015 = $2,697,399.00; FY 2016 = $67,185.00
History of Investigator:
Recipient Sponsored Research Office: 5000 FORBES AVE, PITTSBURGH, PA, US 15213-3890, (412) 268-8746
Sponsor Congressional District:
Primary Place of Performance: PA, US 15213-3815
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): Information Technology Research, Data Cyberinfrastructure
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT; 01001415RB NSF RESEARCH & RELATED ACTIVIT; 01001516DB NSF RESEARCH & RELATED ACTIVIT; 01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
The Pittsburgh Supercomputing Center (PSC) will carry out an accelerated development pilot project to create, deploy, and test software building blocks and hardware implementing functionalities specifically designed to support data-analytic capabilities for data-intensive scientific research. Building on the successful Data Supercell (DSC) technology, which replaced a conventional tape-based archive with a disk-based system to economically provide the much lower latency and higher bandwidth data access necessary for data-intensive activities, PSC will implement and bring to production quality additional functionalities important to such work. These include improved local performance, additional abilities for remote data access and storage, enhanced data integrity, data tagging, and improved manageability. PSC will work with partners in diverse fields of science, initially chosen from biology, astronomy, and computer science, who will provide scientific and technology drivers and system validation.
The project will leverage current NSF cyberinfrastructure investments in data-analytic systems at PSC. Those investments include DSC; Blacklight, an SGI UV1000 with 2 × 16 TB of hardware-enabled cache-coherent shared memory; and Sherlock, a YarcData "Urika" graph-analytic appliance that also supports a globally accessible shared memory. Both compute systems are very capable for data-analytic applications, and their tight coupling to the pilot storage system will allow synergistic development of analytical capabilities alongside increasingly sophisticated mechanisms for data handling. Working with the new, multi-petabyte data store, they will constitute a system specifically optimized for data-intensive work, as contrasted with conventional HPC systems. Blacklight will be upgraded with more powerful technology, specifically architected to satisfy the more demanding needs of data analytics, in years 3 and 4. If the pilot is successful, PSC will engage the NSF to consider larger-scale deployment aiming at exascale capacity.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The Data Exacell (DXC) was a research pilot project at the Pittsburgh Supercomputing Center (PSC) to create, deploy, and test software and hardware data infrastructure building blocks to enable the coupling of data analytics with innovative storage for scientific research. The DXC project developed and disseminated building blocks and engaged deeply with the research community, particularly with researchers from fields that traditionally have not used high performance computing (HPC), to create valuable new applications. The building blocks developed by the DXC were reused to create production infrastructure of great national value, and many of the pilot applications became important community applications that are now in widespread use.
The key outcomes of the Data Exacell project were the data infrastructure building blocks it generated, which were subsequently reused in other prominent infrastructure and research projects and made available under open source licenses in public repositories. The building blocks with the highest potential for reuse were:
1. SLASH2, a wide-area, network-friendly, distributed filesystem;
2. the Weldable Overlay Knack File System (WOKFS) toolkit;
3. psync, a multiple-stream, rsync-compatible file transport tool (illustrated in the sketch below);
4. DXC optimizations to the AdaptFS filesystem for in-filesystem analytics;
5. virtual machine (VM) support for distributed applications;
6. LDAP VM authentication for group-specific hosts and VMs; and
7. heterogeneous scheduling for regular CPU nodes, large-memory CPU nodes, GPUs, and Hadoop.
Of those, building blocks 1–4 are distributed under an open source license on GitHub, and SLASH2 was deployed at PSC, the Texas Advanced Computing Center (TACC), the National Radio Astronomy Observatory (NRAO), the Minnesota Supercomputing Institute (MSI), and the University of Wyoming. Building blocks 5–7 were reused to design and implement PSC’s Bridges system.
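psync's actual implementation is not reproduced in this report; purely as an illustration of the multiple-stream idea it embodies, namely fanning a set of files out across concurrent transfer workers rather than pushing them through one serial stream, here is a minimal, self-contained Python sketch. The paths, stream count, and function names are hypothetical, and local copying stands in for network transport:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def copy_tree_multistream(src_root: str, dst_root: str, streams: int = 8) -> None:
    """Copy every file under src_root to dst_root using several
    concurrent streams, so one slow file cannot serialize the job."""
    src, dst = Path(src_root), Path(dst_root)
    files = [p for p in src.rglob("*") if p.is_file()]

    def copy_one(path: Path) -> None:
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # preserves timestamps, like rsync -t

    # Each worker drives one stream; file I/O releases the GIL, so
    # threads overlap disk and network latency effectively.
    with ThreadPoolExecutor(max_workers=streams) as pool:
        list(pool.map(copy_one, files))

if __name__ == "__main__":
    # Hypothetical paths for illustration only.
    copy_tree_multistream("/data/source", "/data/destination", streams=8)
```

A real multi-stream tool such as psync adds rsync-style metadata and delta handling on top of this fan-out pattern; the sketch shows only the concurrency skeleton.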
Pilot applications in data-intensive research areas were used to motivate, test, demonstrate, and improve the DXC building blocks. The pilot projects spanned diverse research areas including genomics, data integration and fusion, machine learning for multimedia data, radio astronomy, causal analysis for biomedical big data, analysis of streaming social media data with specific application to epidemiology, computational notebooks for teaching digital scholarship, and prototyping the Brain Image Library. The pilot applications were selected according to their ability to advance research through: high data volume, variety, and/or velocity; novel approaches to data management or organization; novel approaches to data integration or fusion; integration of data analytic components into workflows; and complementarity to other DXC pilot applications.
The Data Exacell project made a profound impact on the principal discipline of converged architecture for high-performance data analytics (HPDA) leveraging novel storage. This was shown most clearly by transitioning ideas proven in the Data Exacell to the large-scale production architecture of Bridges, which went on to serve 15,179 users at 726 institutions working on over 1,883 projects (statistics as of January 2019). Spanning a heterogeneous collection of compute nodes with a parallel filesystem proved to be a game-changing advance that enabled applications and gateways and has been replicated in other prominent HPC+HPDA systems worldwide.
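The report does not specify the scheduler interface behind this heterogeneous design, but the routing idea can be sketched under an assumed Slurm-style setup. The partition names and memory threshold below are hypothetical; only the sbatch flags (--partition, --mem, --gres) are standard Slurm syntax:

```python
def pick_partition(mem_gb: int, gpus: int, hadoop: bool) -> str:
    """Map a job's resource needs onto a node class, mirroring the
    regular / large-memory / GPU / Hadoop split described above.
    Partition names are hypothetical."""
    if hadoop:
        return "hadoop"
    if gpus > 0:
        return "gpu"
    if mem_gb > 128:  # boundary between regular and large-memory is illustrative
        return "large-memory"
    return "regular"

def sbatch_args(mem_gb: int, gpus: int = 0, hadoop: bool = False) -> list[str]:
    """Assemble an sbatch command line for the chosen partition."""
    args = [
        "sbatch",
        f"--partition={pick_partition(mem_gb, gpus, hadoop)}",
        f"--mem={mem_gb}G",
    ]
    if gpus:
        args.append(f"--gres=gpu:{gpus}")  # standard Slurm GPU request
    return args

# A 3 TB genome-assembly job lands on the large-memory partition:
print(sbatch_args(mem_gb=3000))
# ['sbatch', '--partition=large-memory', '--mem=3000G']
```

The point of the split is that a single front end can serve very different workloads: the scheduler, not the user, decides which hardware class a job occupies.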
The Data Exacell also positively impacted other disciplines by providing a unique hardware and software architecture on which applications could be built, especially those requiring data-intensive computing and different kinds of processing. Three prominent and highly successful examples that arose from Data Exacell pilot applications are the following:
- Genomics, Causal Discovery, and Machine Learning: For genomics, including application and democratization of causal discovery algorithms and application of machine learning, the Pittsburgh Genome Resource Repository (PGRR) and TCGA Expedition were built using the Data Exacell.
- Bioinformatics and Data-Intensive Workflows: Galaxy, a workflow framework serving an extremely popular gateway for bioinformatics, was extended using the Data Exacell to allow launching genome sequence assembly jobs on the DXC’s large-memory nodes.
- Neuroscience: The Brain Image Library (BIL) was piloted on the Data Exacell. BIL is accepting unique, high-resolution confocal fluorescence microscopy data for mouse, rat, and marmoset brains; the project team curates the data and maintains accurate, useful metadata so that datasets and data products can be provided on demand to researchers who need them.
The Data Exacell project also included diverse outreach activities to broaden engagement and support workforce development. PSC delivered 15 XSEDE HPC Monthly Workshops on Big Data featuring high-performance data analytics topics directly related to the Data Exacell, including graph analytics with RDF and SPARQL, Hadoop, Spark, and artificial intelligence. Each workshop consisted of interactive HD video instruction, originating at PSC and telecast to remote locations, with hands-on exercises to ensure that participants left ready to apply the skills they learned to their own research problems. The workshops reached 5,015 participants at 64 institutions.
The Data Exacell also supported undergraduate and high school internships through which students developed skills in big data, system monitoring, filesystems, wide-area networking, and HPC.
Last Modified: 03/28/2019
Modified by: Nicholas Nystrom