Award Abstract # 1613035
Variable Selection via Inverse Modeling for Detecting Nonlinear Relationships

NSF Org: DMS
Division Of Mathematical Sciences
Recipient: PRESIDENT AND FELLOWS OF HARVARD COLLEGE
Initial Amendment Date: August 1, 2016
Latest Amendment Date: August 16, 2018
Award Number: 1613035
Award Instrument: Continuing Grant
Program Manager: Gabor Szekely
DMS
 Division Of Mathematical Sciences
MPS
 Directorate for Mathematical and Physical Sciences
Start Date: August 1, 2016
End Date: July 31, 2020 (Estimated)
Total Intended Award Amount: $200,000.00
Total Awarded Amount to Date: $200,000.00
Funds Obligated to Date: FY 2016 = $64,500.00
FY 2017 = $66,485.00
FY 2018 = $69,015.00
History of Investigator:
  • Jun Liu (Principal Investigator)
    jliu@stat.harvard.edu
Recipient Sponsored Research Office: Harvard University
1033 MASSACHUSETTS AVE STE 3
CAMBRIDGE
MA  US  02138-5366
(617)495-5501
Sponsor Congressional District: 05
Primary Place of Performance: President and Fellows of Harvard College
1 Oxford St, 715 Science Center
Cambridge
MA  US  02138-2901
Primary Place of Performance Congressional District: 05
Unique Entity Identifier (UEI): LN53LCFJFL45
Parent UEI:
NSF Program(s): STATISTICS
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVIT
01001718DB NSF RESEARCH & RELATED ACTIVIT
01001819DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):
Program Element Code(s): 126900
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.049

ABSTRACT

With the ever-growing amount of data in many application areas, effective methods for detecting factors influencing the value of a response variable are in high demand. It is of growing importance to develop methods for detecting variables that exert significant nonlinear effects on the response. Inspired by the sliced inverse regression method developed in the early 1990s, the PI proposes a general framework for developing effective variable selection strategies in nonlinear systems of high dimension. The PI will further study theoretical properties of these variable selection algorithms. The proposed theoretical investigation will provide a theoretical understanding of the limitations of existing dimension-reduction techniques when the dimensionality grows with the sample size.

With the ever-growing amount of data in many application areas, effective methods for detecting factors that may influence the value of a target quantity of interest (response variable) are in high demand. The problem is termed "variable (or feature) selection" in regression modeling and statistical learning, and is a long-standing problem in statistics and machine learning. The PI focuses here on the detection of factors that may exert nonlinear and/or interactive effects on the response variable. Recent studies from the PI's group reveal that the sliced inverse regression (SIR) and inverse modeling strategies provide a powerful framework for developing effective variable selection strategies in nonlinear systems of high dimension. The PI aims at developing more robust and effective tools for detecting such complex relationships and studying theoretical properties of SIR-based algorithms. The proposed method will also be applicable to robust variable selection for classification problems. The proposed theoretical investigations will provide (a) a theoretical understanding of the limitations of existing dimension-reduction techniques when the dimensionality grows with the sample size; (b) guidance on the construction of necessary sparsity conditions that can guarantee consistency of variable selection in ultra-high dimensional nonlinear problems; (c) the optimal convergence rate that the best possible learning algorithm can achieve in such settings; and (d) theoretical justification of whether the proposed algorithms achieve, or come close to, this optimal rate.
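
For readers unfamiliar with SIR, the following is a minimal sketch of the classical low-dimensional procedure (standardize the predictors, slice the sorted response, and eigen-decompose the covariance of the slice means). It is an illustration only, not the high-dimensional or sparse variants studied under this award. The function name `sir_directions`, the slice count, the cubic link, and the simulated data are all assumptions made for this example, and the sketch requires more samples than predictors so that the sample covariance is invertible.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_directions=1):
    """Classical sliced inverse regression (illustrative sketch, needs n > p)."""
    n, p = X.shape
    # Center X and whiten it with the inverse square root of the sample covariance.
    Xc = X - X.mean(axis=0)
    sigma = Xc.T @ Xc / n
    evals, evecs = np.linalg.eigh(sigma)
    sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ sigma_inv_sqrt
    # Partition observations into slices of roughly equal size by sorted response.
    slices = np.array_split(np.argsort(y), n_slices)
    # Weighted covariance of the within-slice means of the whitened predictors.
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of M, mapped back to the original predictor scale.
    _, vecs = np.linalg.eigh(M)           # eigenvalues come back in ascending order
    top = vecs[:, ::-1][:, :n_directions]
    B = sigma_inv_sqrt @ top
    return B / np.linalg.norm(B, axis=0)

# Hypothetical single-index example: y depends on x only through beta^T x.
rng = np.random.default_rng(0)
n, p = 2000, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:2] = 1.0                            # only the first two predictors matter
y = (X @ beta) ** 3 + 0.5 * rng.standard_normal(n)

B_hat = sir_directions(X, y)
# Absolute cosine between the estimated and true directions (sign is arbitrary).
cosine = abs(B_hat[:, 0] @ beta) / np.linalg.norm(beta)
```

In this toy setting the coordinates of the estimated direction with large magnitude indicate the relevant predictors, which is the sense in which inverse modeling can be used for variable selection. Note that a link symmetric about zero (for example a pure square) is a known failure case for this basic form of SIR.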

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


(Showing: 1 - 10 of 12)
Dai C and Liu JS "The Wang-Landau Algorithm as Stochastic Optimization and its Acceleration" Physical Review E , v.101 , 2020 , p.033301 10.1103/PhysRevE.101.033301
Lin, Qian and Li, Xinran and Huang, Dongming and Liu, Jun S. "On the optimality of sliced inverse regression in high dimensions" The Annals of Statistics , v.49 , 2021 https://doi.org/10.1214/19-AOS1813
Qian Lin, Zhigen Zhao, and Jun S Liu "ON CONSISTENCY AND SPARSITY FOR SLICED INVERSE REGRESSION IN HIGH DIMENSIONS" Annals of Statistics , v.46 , 2018 , p.580 doi:10.1214/17-AOS1561
Qian Lin, Zhigen Zhao, and Jun S Liu "Sparse Sliced Inverse Regression Via Lasso" Journal of the American Statistical Association , v.114 , 2019 , p.1726 https://doi.org/10.1080/01621459.2018.1520115
Shihao Yang, Yang Chen, Espen Bernton, Jun S. Liu "On parallelizable Markov chain Monte Carlo algorithms with waste-recycling" Statistics and Computing , v.28 , 2018 , p.1073 https://doi.org/10.1007/s11222-017-9780-4
Viktoriya Krakovna, Chenguang Dai, Jun S Liu "Interpretable selection and visualization of features and interactions using Bayesian forests" Statistics and Its Interface , v.11 , 2018 , p.503-513 DOI: http://dx.doi.org/10.4310/SII.2018.v11.n3.a12
Xufei Wang and Jun S Liu "Generalized R-squared for detecting dependence" Biometrika , v.104 , 2017 , p.129 https://doi.org/10.1093/biomet/asw071
Yang Li, Alexis A Jourdain, Sarah E Calvo, Jun S Liu, Vamsi K Mootha "CLIC, a tool for expanding biological pathways based on co-expression across thousands of datasets" PLOS Computational Biology , v.13 , 2017 , p.e1005653 https://doi.org/10.1371/journal.pcbi.1005653
Yang Li and Jun S Liu "Robust Variable and Interaction Selection for Logistic Regression and General Index Models" Journal of the American Statistical Association , v.114 , 2019 , p.271 DOI: 10.1080/01621459.2017.1401541
Yang Li, Shaoyang Ning, Sarah E Calvo, Vamsi K Mootha, Jun S Liu "Bayesian Hidden Markov Tree Models for Clustering Genes with Shared Evolutionary History" Annals of Applied Statistics , v.13 , 2019 , p.606 doi:10.1214/18-AOAS1208
Zhao R, Hong P, and Liu JS "IMMIGRATE: A Margin-based Feature Selection Method with Interaction Terms" Entropy , v.22 , 2020 , p.291 https://doi.org/10.3390/e22030291

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

As massive data are being generated routinely in this "big data" era, we expect to see a growing need for powerful and reliable statistical learning tools to discover patterns in these data. The proposal focuses on the development of systematic and practical tools for discovering nonlinear and interactive patterns among a set of potential variables, which are important steps and tasks in many scientific research areas. While most current efforts have focused on linear systems in high-dimensional problems, the proposed approaches provide a novel way of detecting nonlinear relationships and discovering how certain candidate predictors interact with each other to influence the response variable. The research promises to bring the power of these new high-dimensional data analysis methods and theory to bear on many important application areas such as genetics, bioinformatics, Internet commerce, and financial data analysis.

In our theoretical and methodological studies, this grant supported us in completing a series of theoretical results and methods for analyzing a class of nonlinear high-dimensional models, i.e., the index models, which assume that the response is related to the predictors through a low-dimensional projection in a nonlinear fashion. It is a very general class of models, yet has very good interpretability. We find an interesting connection between fitting a semi-parametric index model and the regular Lasso algorithm, leading to a very efficient and effective algorithm for conducting variable selection and fitting index models. We also derive the first minimax optimality result for sliced inverse regression in high dimensions. We found that many classification models, such as multi-category logistic regression models, can be unified under the index model framework, thus leading to a robust stepwise variable and interaction selection method. We further investigate how margins, entropy, and feature interactions connect to each other and develop a novel algorithm, IMMIGRATE, to perform feature selection with interactions.
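
For reference, the index model described above is commonly written as follows. The notation (the link function $f$, the directions $\beta_k$, the predictor covariance $\Sigma$, and the noise $\varepsilon$) is assumed here for illustration and is not taken from the report.

```latex
% Multiple index model: y depends on x only through d linear projections.
y = f\bigl(\beta_1^{\top} x, \ldots, \beta_d^{\top} x, \varepsilon\bigr),
\qquad x \in \mathbb{R}^{p}, \quad d \ll p .

% Key fact behind sliced inverse regression (under the linearity condition
% on x, e.g. elliptically distributed predictors): the centered inverse
% regression curve lies in the span of the \Sigma \beta_k,
\mathbb{E}[x \mid y] - \mathbb{E}[x]
\in \operatorname{span}\{\Sigma \beta_1, \ldots, \Sigma \beta_d\},
\qquad \Sigma = \operatorname{Cov}(x),

% so the column space of \Sigma^{-1}\operatorname{Cov}(\mathbb{E}[x \mid y])
% is contained in \operatorname{span}\{\beta_1, \ldots, \beta_d\}.
```

Variable selection in this setting amounts to identifying the coordinates of $x$ on which some $\beta_k$ is nonzero, without having to estimate the unknown link $f$.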

We have also developed a high-dimensional tree-based Bayesian model, called the Bayesian forests, an improved parallelizable multiple-try MCMC algorithm, and a novel way of using the Wang-Landau algorithm to accelerate MCMC computation and Bayesian model and variable selections.

On the application side, the grant supported the development of a flexible tool (CLIC) for integrating the vast amount of gene-expression microarray data from multiple sources to predict gene-gene interactions, gene-function, and gene-module relationships. We have also developed a Monte Carlo-based tool for exploring the conformational space of protein folding. We have also developed a Bayesian method for detecting convergent regulatory evolution regions in the genome and providing new and convincing explanations of why and how certain remotely related animals developed very similar traits (such as why emu, rhea, and ostrich all lost their flying ability during evolution; see Fig 1 for the evolutionary relationships of these three and other relevant birds, and Fig 2 for some Bayesian analysis results).


Last Modified: 12/03/2020
Modified by: Jun S Liu

