
NSF Org: |
DMS Division Of Mathematical Sciences |
Recipient: |
|
Initial Amendment Date: | August 1, 2016 |
Latest Amendment Date: | August 16, 2018 |
Award Number: | 1613035 |
Award Instrument: | Continuing Grant |
Program Manager: |
Gabor Szekely
DMS Division Of Mathematical Sciences MPS Directorate for Mathematical and Physical Sciences |
Start Date: | August 1, 2016 |
End Date: | July 31, 2020 (Estimated) |
Total Intended Award Amount: | $200,000.00 |
Total Awarded Amount to Date: | $200,000.00 |
Funds Obligated to Date: |
FY 2017 = $66,485.00 FY 2018 = $69,015.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
1033 MASSACHUSETTS AVE STE 3 CAMBRIDGE MA US 02138-5366 (617)495-5501 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
1 Oxford St, 715 Science Center Cambridge MA US 02138-2901 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | STATISTICS |
Primary Program Source: |
01001718DB NSF RESEARCH & RELATED ACTIVIT 01001819DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): | |
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.049 |
ABSTRACT
With the ever-growing amount of data in many application areas, effective methods for detecting factors that influence the value of a response variable are in high demand. It is increasingly important to develop methods for detecting variables that exert significant nonlinear effects on the response. Inspired by the sliced inverse regression method developed in the early 1990s, the PI proposes a general framework for developing effective variable selection strategies in nonlinear systems of high dimension. The PI will further study the theoretical properties of these variable selection algorithms. The proposed theoretical investigation will clarify the limitations of existing dimension-reduction techniques when the dimensionality grows with the sample size.
With the ever-growing amount of data in many application areas, effective methods for detecting factors that may influence the value of a target quantity of interest (the response variable) are in high demand. This problem, termed "variable (or feature) selection" in regression modeling and statistical learning, is a long-standing one in statistics and machine learning. The PI focuses here on the detection of factors that may exert nonlinear and/or interactive effects on the response variable. Recent studies from the PI's group reveal that sliced inverse regression (SIR) and inverse modeling strategies provide a powerful framework for developing effective variable selection strategies in nonlinear systems of high dimension. The PI aims to develop more robust and effective tools for detecting such complex relationships and to study the theoretical properties of SIR-based algorithms. The proposed method will also be applicable to robust variable selection in classification problems. The proposed theoretical investigations will provide (a) a theoretical understanding of the limitations of existing dimension-reduction techniques when the dimensionality grows with the sample size; (b) guidance on the construction of the sparsity conditions needed to guarantee consistency of variable selection in ultra-high dimensional nonlinear problems; (c) the optimal convergence rate that the best possible learning algorithm can achieve in such settings; and (d) theoretical justification of whether the proposed algorithms achieve, or come close to, this optimality.
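For readers unfamiliar with SIR, the standard textbook formulation can be summarized as follows. This is a sketch of the classical setting rather than the specific models studied under this award; the symbols $x$, $y$, $\beta_k$, $f$, $\varepsilon$, $\Sigma$, $d$, and $p$ are notation assumed here for illustration and do not appear in the abstract.

```latex
% Multiple index model: the response depends on the p-dimensional predictor x
% only through d linear projections, with d much smaller than p.
\[
  y = f\bigl(\beta_1^{\top} x, \ldots, \beta_d^{\top} x, \varepsilon\bigr),
  \qquad d \ll p .
\]
% Under the linearity condition on the distribution of x (satisfied, for
% example, by elliptically symmetric x), the centered inverse regression
% curve lies in a d-dimensional subspace:
\[
  \mathbb{E}[x \mid y] - \mathbb{E}[x]
  \in \operatorname{span}\{\Sigma \beta_1, \ldots, \Sigma \beta_d\},
  \qquad \Sigma = \operatorname{Cov}(x).
\]
% SIR therefore estimates the directions as the leading solutions b of the
% generalized eigenproblem
\[
  \operatorname{Cov}\bigl(\mathbb{E}[x \mid y]\bigr)\, b = \lambda\, \Sigma\, b ,
\]
% with E[x | y] approximated by averaging x within slices of the sorted y.
```

In this notation, variable selection amounts to identifying the coordinates of $x$ on which at least one $\beta_k$ is nonzero.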
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
As massive data are being generated routinely in this "big data" era, we expect to see a growing need for powerful and reliable statistical learning tools to discover patterns in these data. The proposal focuses on the development of systematic and practical tools for discovering nonlinear and interactive patterns among a set of potential variables, which are important steps and tasks in many scientific research areas. While most current efforts have focused on linear systems in high-dimensional problems, the proposed approaches provide a novel way of detecting nonlinear relationships and discovering how certain candidate predictors interact with each other to influence the response variable. The research promises to bring the power of these new high-dimensional data analysis methods and theory to bear on many important application areas such as genetics, bioinformatics, Internet commerce, and financial data analysis.
In our theoretical and methodological studies, this grant supported us in developing a series of theoretical results and methods for analyzing a class of nonlinear high-dimensional models, namely the index models, which assume that the response is related to the predictors through a low-dimensional projection in a nonlinear fashion. This is a very general class of models, yet it has very good interpretability. We found an interesting connection between fitting a semi-parametric index model and the regular Lasso algorithm, leading to a very efficient and effective algorithm for conducting variable selection and fitting index models. We also derived the first minimax optimality result for sliced inverse regression in high dimensions. We found that many classification models, such as multi-category logistic regression models, can be unified under the index model framework, leading to a robust stepwise variable and interaction selection method. We further investigated how margins, entropy, and feature interactions connect to each other and developed a novel algorithm, IMMIGRATE, for feature selection with interactions.
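To make the index-model idea concrete, the following is a minimal sketch of classical low-dimensional sliced inverse regression on simulated data. It is not the high-dimensional or Lasso-based procedure developed under this award; the function name `sir_directions`, the cubic link, and all simulation settings are assumptions chosen purely for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def sir_directions(X, y, n_slices=10, n_directions=1):
    """Classical sliced inverse regression (illustrative sketch only).

    Returns a (p, n_directions) array whose columns are unit-norm
    estimates of the index directions.
    """
    n, p = X.shape

    # Whiten the predictors: Z = (X - mean) @ Sigma^{-1/2}.
    # Assumes n > p so the sample covariance is invertible.
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(sigma)
    sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mu) @ sigma_inv_sqrt

    # Sort observations by the response and cut them into slices.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)

    # Weighted covariance of the within-slice means of Z.
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)

    # Leading eigenvectors of M, mapped back to the original scale.
    _, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :n_directions]  # take the largest ones
    B = sigma_inv_sqrt @ top
    return B / np.linalg.norm(B, axis=0)


# Simulated single-index model: y = (beta' x)^3 + noise (hypothetical setup).
rng = np.random.default_rng(0)
n, p = 2000, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[0], beta[1] = 1.0, -1.0
beta /= np.linalg.norm(beta)
y = (X @ beta) ** 3 + 0.5 * rng.standard_normal(n)

beta_hat = sir_directions(X, y)[:, 0]
# The direction is identified only up to sign, so compare |cosine|.
alignment = abs(float(beta_hat @ beta))
```

In this simulation only the first two predictors enter the model, so the estimated direction should load mainly on those two coordinates. The whitening step requires more observations than predictors, which is exactly the restriction that high-dimensional variants of SIR are designed to remove.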
We have also developed a high-dimensional tree-based Bayesian model, called Bayesian forests; an improved, parallelizable multiple-try MCMC algorithm; and a novel way of using the Wang-Landau algorithm to accelerate MCMC computation and Bayesian model and variable selection.
On the application side, the grant supported us in developing a flexible tool (CLIC) for integrating the vast amount of gene-expression microarray data from multiple sources to predict gene-gene interactions, gene-function, and gene-module relationships. We have also developed a Monte Carlo-based tool for exploring the conformational space of protein folding. In addition, we have developed a Bayesian method for detecting convergent regulatory evolution regions in the genome, providing new and convincing explanations of why and how certain remotely related animals developed very similar traits (such as why the emu, rhea, and ostrich all lost their flying ability during evolution; see Fig 1 for the evolutionary relationships of these three and other relevant birds, and Fig 2 for some Bayesian analysis results).
Last Modified: 12/03/2020
Modified by: Jun S Liu