NSF Award Search: Award # 1317131 - Collaborative Research: Efficient Parallel Iterative Monte Carlo Methods for Statistical Analysis of Big Data

Award Abstract # 1317131

Collaborative Research: Efficient Parallel Iterative Monte Carlo Methods for Statistical Analysis of Big Data

NSF Org:	DMS Division Of Mathematical Sciences
Recipient:	TEXAS A & M UNIVERSITY
Initial Amendment Date:	June 26, 2013
Latest Amendment Date:	June 26, 2013
Award Number:	1317131
Award Instrument:	Standard Grant
Program Manager:	Andrew Pollington adpollin@nsf.gov (703)292-4878 DMS Division Of Mathematical Sciences MPS Directorate for Mathematical and Physical Sciences
Start Date:	August 1, 2013
End Date:	July 31, 2015 (Estimated)
Total Intended Award Amount:	$220,000.00
Total Awarded Amount to Date:	$220,000.00
Funds Obligated to Date:	FY 2013 = $19,501.00
History of Investigator:	Faming Liang (Principal Investigator) fmliang@purdue.edu Xingfu Wu (Co-Principal Investigator)
Recipient Sponsored Research Office:	Texas A&M University 400 HARVEY MITCHELL PKY S STE 300 COLLEGE STATION TX US 77845-4375 (979)862-6777
Sponsor Congressional District:	10
Primary Place of Performance:	Texas A&M University College Station TX US 77843-3143
Primary Place of Performance Congressional District:	10
Unique Entity Identifier (UEI):	JF6XLNB4CDJ5
Parent UEI:
NSF Program(s):	CDS&E-MSS, CDS&E
Primary Program Source:	01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	9263
Program Element Code(s):	806900, 808400
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.049

ABSTRACT

The integration of computer technology into science and daily life has enabled the collection of massive volumes of data. To analyze these data, one may have to resort to parallel and distributed architectures. While the parallel and distributed architectures present new capabilities for storage and manipulation of big data, it is unclear, from the inferential point of view, how the current statistical methodology can be transported to the paradigm of big data. Also, growing data size typically comes together with a growing complexity of data structures and of the models needed to account for the structures. Although iterative Monte Carlo algorithms, such as the Markov chain Monte Carlo (MCMC), stochastic approximation, and expectation-maximization (EM) algorithms, have proven to be very powerful and typically unique computational tools for analyzing data of complex structures, they are infeasible for big data as for which a large number of iterations and a complete scan of the full dataset for each iteration are typically required. Big data have put a great challenge on the current statistical methodology. The investigators propose a general principle for developing Monte Carlo algorithms that are feasible for big data and workable on parallel and distributed architectures; that is, using Monte Carlo averages calculated in parallel from subsamples to approximate the quantities that originally need to calculate from the full dataset. This principle avoids the requirement for repeated scans of full data in algorithm iterations, while enabling the algorithm to produce statistically sensible solutions to the problem under consideration. Under this principle, a general algorithm, the so-called subsampling approximation-based parallel stochastic approximation algorithm, is proposed for parameter estimation for big data problems. Unlike the existing algorithms, such as the bag of little bootstraps, aggregated estimation equation, and split-and-conquer algorithms, the proposed algorithm works for the problems for which the observations are generally dependent. Under the same principle, a subsampling approximation-based parallel Metropolis-Hastings algorithm is proposed for Bayesian analysis of big data, and a subsampling approximation-based parallel Monte Carlo EM algorithm is proposed for parameter estimation for the big data problems with missing observations. In addition to the subsampling approximation-based parallel iterative Monte Carlo algorithms, an embarrassingly parallel MCMC algorithm is proposed for Bayesian analysis of big data based on the popular idea of divide-and-conquer. Various schemes of dataset partition and results aggregation are proposed. The validity of the proposed parallel iterative Monte Carlo algorithms, including both the subsampling approximation-based and embarrassingly parallel ones, will be rigorously studied. The proposed algorithms will be applied to spatio-temporal modeling of satellite climate data, genome-wide association study, and stream data analysis.

The intellectual merit of this project is to propose a general principle for statistical analysis of big data: Using Monte Carlo averages of subsamples to approximate the quantities that originally need to calculate from the full dataset. This principle provides a general strategy for transporting the current statistical methodology to the paradigm of big data. Under this principle, a few subsampling approximation-based parallel iterative Monte Carlo algorithms are proposed. The proposed algorithms address the core problem of big data analysis:how to make a statistically sensible analysis for big data while avoiding repeated scans of the full dataset? This project will have broader impacts because big data are ubiquitous throughout almost all fields of science and technology. A successful research program in theory and methods of parallel iterative Monte Carlo computations can have immense benefit widely throughout science and technology. The research results will be disseminated to the communities of interest, such as atmospheric science, biomedical science, engineering, and social science, via direct collaboration with researchers in these disciplines, conference presentations, books, and papers to be published in academic journals. The project will have also significant impacts on education through direct involvement of graduate students in the project and incorporation of results into undergraduate and graduate courses. In addition, the package Distributed Iterative Statistical Computing (DISC) that will be developed under this project is designed to provide a platform for Ph.D. students and researchers like the investigators with network-connected computers to experiment new ideas of developing efficient iterative Monte Carlo algorithms in parallel or, more exactly, grid computing environments.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Cheng, Yichen, Gao, Xin, and Liang, Faming "Bayesian Peak Picking for NMR Spectra" Genomics, Proteomics and Bioinformatics , v.12 , 2014 , p.37 doi: 10.1016/j.gpb.2013.07.003

Please report errors in award information by writing to: awardsearch@nsf.gov.

Success

Error