Award Abstract # 1418195
Harnessing Scalable Libraries for Statistical Computing on Modern Architectures and Bringing Statistics to Large Scale Computing

NSF Org: DMS
Division Of Mathematical Sciences
Recipient: UNIVERSITY OF TENNESSEE
Initial Amendment Date: August 14, 2014
Latest Amendment Date: August 11, 2016
Award Number: 1418195
Award Instrument: Continuing Grant
Program Manager: Christopher Stark
DMS Division Of Mathematical Sciences
MPS Directorate for Mathematical and Physical Sciences
Start Date: August 15, 2014
End Date: July 31, 2019 (Estimated)
Total Intended Award Amount: $600,000.00
Total Awarded Amount to Date: $600,000.00
Funds Obligated to Date: FY 2014 = $440,000.00
FY 2015 = $85,000.00
FY 2016 = $75,000.00
History of Investigator:
  • George Ostrouchov (Principal Investigator)
    ostrouchovg@utk.edu
Recipient Sponsored Research Office: University of Tennessee Knoxville
201 ANDY HOLT TOWER
KNOXVILLE
TN  US  37996-0001
(865)974-3466
Sponsor Congressional District: 02
Primary Place of Performance: University of Tennessee Knoxville
1 Circle Park
Knoxville
TN  US  37996-0003
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): FN2YCS2YAUW3
Parent UEI: LXG4F9K8YZK5
NSF Program(s): OFFICE OF MULTIDISCIPLINARY AC,
CI REUSE,
CDS&E-MSS
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT
01001516DB NSF RESEARCH & RELATED ACTIVIT
01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 8084, 9150, 9263
Program Element Code(s): 125300, 689200, 806900
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.049

ABSTRACT

This project aims to increase participation in high performance computing (HPC) on medium- to large-scale platforms by the statistics community. Theoretical statisticians can potentially make strong contributions to science where big data and HPC are involved, yet when implementing on large platforms they face low-level programming languages, libraries, and runtime environments that pose a barrier high enough to keep most from entering. This project is centered on enabling exactly this community to experiment at a large scale by bridging most of the barriers while using state-of-the-art approaches from the HPC community. Broader impacts of this research include opening a new avenue for HPC scalable software reuse by the statistics and the data science communities, thus providing additional and more data-oriented feedback to HPC software research. Further, an HPC-engaged statistics community can bring statistical science to modern issues in supercomputing that are increasingly in need of statistical thinking for quantifying uncertainty.

The open source R programming language and environment for statistical computing is an ideal vehicle for the project as it currently dominates new work in statistics and it is widely used and rising in popularity in many other data-enabled science communities. This project will connect the R language to highly scalable HPC libraries at interfaces that make long-term sense and in a way that in most cases requires no change from current programming practice. In addition, ease-of-use components will be developed inside R for intuitive use of these libraries for big data input and data manipulation on large computing platforms and to bridge HPC runtime environments. Outreach consisting of documentation, examples, a schedule of tutorials at a number of key conferences, and workshops will be used to bring the results of this project to the statistics and other data-enabled science communities.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Drew Schmidt, Wei-Chen Chen, and George Ostrouchov. "Introducing a New Client/Server Framework for Big Data Analytics with the R Language." XSEDE16, 2016.
Drew Schmidt, Wei-Chen Chen, Michael Matheson, and George Ostrouchov. "Programming with BIG Data in R: Scaling Analytics from One to Thousands of Nodes." Big Data Research, 2017. ISSN 2214-5796.
Wei-Chen Chen, Drew Schmidt, and George Ostrouchov. "Interactive Terabytes with pbdR." The R User Conference 2016, 2016.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project made notable contributions to open source software for statistical computing research on medium to large supercomputing systems. Several software packages were developed and jointly released as pbdR-1.0. The latest package versions and documentation are available on the web site https://pbdr.org, which links to open source repositories on GitHub and CRAN as well as to build scripts for Docker and Singularity.

An independent evaluation and comparison with other linear algebra (LA) based analytics software on medium systems [1] notes that pbdR "... outperformed all the other systems in almost all cases on dense data" and that "Overall, pbdR is best suited to users who want to rapidly prototype new LA based analysis algorithms at scale." Further testament to project success is the early technology transfer of pbdR software to the Cray Urika-CS AI and Analytics Suite, the analytics software stack on Cray's supercomputing platforms.

The software contributions include new R packages as well as updates to previously released packages, all available from https://pbdr.org. The new packages, developed specifically under this project, fall broadly into three categories:

A Client-Server Interface: Data analysis is a discovery process that is best prototyped in an interactive computing environment. HPC and all pbdR infrastructure so far have been for batch computing. To reconcile the two approaches, we built packages for an interactive client-server environment that is powered by pbdR scalable statistical computing infrastructure. It has three major components: (1) remoter, which allows one R session to control another R session; (2) pbdZMQ, which provides high-level asynchronous messaging for distributed applications; and (3) the client-server framework, which starts a collection of cooperating R sessions and uses the first two components to control and communicate with the server sessions. The high-level messaging component (pbdZMQ) was also adopted by the immensely popular Jupyter notebook for its connection to R.

Parallel Data Input: The package pbdIO provides chunking options for reading large arrays or large collections of files from parallel file systems into the collective distributed memory of a cluster computer. The ability to read different files or portions of files by several processors simultaneously speeds up what is often the slowest step in the analysis of big data. Two more packages intended for specific binary data formats common in simulation science were developed: pbdNCDF4 is for reading NetCDF4 format binary files, which are common in climate and environmental sciences; pbdADIOS reads ADIOS bp format binary files, which are popular in several simulation science applications on supercomputing platforms.
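
The chunked parallel read described above can be illustrated with a minimal sketch. This is not the pbdIO API: it is written in Python, the function names chunk_bounds and read_rows are hypothetical, and it assumes a flat binary file of fixed-size rows that are split into contiguous, balanced blocks, one per process. Each process computes its own block and reads only that byte range, so several processes can read different portions of the same file at once.

```python
def chunk_bounds(n_items, n_ranks, rank):
    """Return the half-open range [start, stop) of items owned by `rank`.

    Hypothetical helper: items are split into contiguous blocks whose
    sizes differ by at most one, with the larger blocks going to the
    lowest-numbered ranks.
    """
    if n_ranks <= 0 or not 0 <= rank < n_ranks:
        raise ValueError("rank must satisfy 0 <= rank < n_ranks")
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def read_rows(path, n_rows, row_bytes, n_ranks, rank):
    """Read only this rank's block of fixed-size rows from a binary file.

    Assumes every row occupies exactly `row_bytes` bytes and that the
    file has no header (an assumption made for this sketch).
    """
    start, stop = chunk_bounds(n_rows, n_ranks, rank)
    with open(path, "rb") as f:
        f.seek(start * row_bytes)  # jump straight to this rank's first row
        return f.read((stop - start) * row_bytes)
```

In a real run, the rank and the number of ranks would come from the parallel runtime (for example MPI) rather than being passed by hand, and formats such as NetCDF4 or ADIOS bp would be read through their own libraries instead of raw byte offsets.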

pbdR Installer and Containers: While complex software installation is usually handled by systems administrators, statistical computing software is often not familiar to them. We developed a pbdR installer that automates this process on most platforms. We also developed Docker containers that enable installation of the full pbdR environment on Docker-enabled platforms. These are usually cloud computing or personal computing platforms where the goals are primarily development and training in building scalable R analytics. We also provided Singularity scripts, where the goals are scaling and production on supercomputing platforms.

One of the goals of the project was software reuse. The new packages pbdZMQ, pbdNCDF4, and pbdADIOS all incorporate scalable software developed by other communities, which we make available with R convenience and a simplified syntax that is made possible by R intelligence (ability to infer parameters from context and metadata). These packages are new additions and continue the project philosophy of not reinventing the wheel and introducing HPC standards when this makes sense.

Outreach to the statistics and other data science communities over the five-year span of the project took the form of 11 half-day to full-day tutorials and 12 regular presentations at national and international venues including the Joint Statistical Meetings, International Statistical Institute World Statistics Congress, and useR! - International R User Conference. Most presentations were invited and included keynotes at the National Institute of Standards and Technology and at the First Workshop for High Performance Technical Computing in Dynamic Languages. A presentation at the Intel Developer Conference, held in conjunction with Supercomputing 2016, won the People's Choice Award in the Technical Computing Track.

The project also had a mentoring component, which provided support for several graduate students through a graduate research assistantship and several summer internships. The students had access to large computing systems where pbdR project-specific as well as other parallel statistical computing components could be used. The students received instruction on the use of these systems and mentoring for their individual research directions.


[1] Anthony Thomas and Arun Kumar. "A Comparative Evaluation of Systems for Scalable Linear Algebra-based Analytics." Proceedings of the VLDB Endowment, Volume 11, No. 13, September 2018, pp. 2168-2182.


Last Modified: 11/25/2019
Modified by: George Ostrouchov
