Award Abstract # 1712943
Scalable Methods for Classification of Heterogeneous High-Dimensional Data

NSF Org: DMS
Division Of Mathematical Sciences
Recipient: TEXAS A & M UNIVERSITY
Initial Amendment Date: May 11, 2017
Latest Amendment Date: May 11, 2017
Award Number: 1712943
Award Instrument: Standard Grant
Program Manager: Gabor Szekely
DMS Division Of Mathematical Sciences
MPS Directorate for Mathematical and Physical Sciences
Start Date: July 1, 2017
End Date: June 30, 2020 (Estimated)
Total Intended Award Amount: $162,539.00
Total Awarded Amount to Date: $162,539.00
Funds Obligated to Date: FY 2017 = $162,539.00
History of Investigator:
  • Irina Gaynanova (Principal Investigator)
    irinagn@umich.edu
Recipient Sponsored Research Office: Texas A&M University
400 HARVEY MITCHELL PKY S STE 300
COLLEGE STATION
TX  US  77845-4375
(979)862-6777
Sponsor Congressional District: 10
Primary Place of Performance: Texas A&M University Main Campus
College Station
TX  US  77845-4375
Primary Place of Performance Congressional District: 10
Unique Entity Identifier (UEI): JF6XLNB4CDJ5
Parent UEI:
NSF Program(s): STATISTICS
Primary Program Source: 01001718DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 8083
Program Element Code(s): 126900
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.049

ABSTRACT

Recent technological advances have enabled routine collection of large-scale high-dimensional data in the biomedical fields. For example, in cancer research it is common to use multiple high-throughput technology platforms to measure genotype, gene expression levels, and methylation levels. One of the main challenges in the analysis of such data is the identification of key biological measurements that can be used to classify the subject into a known cancer subtype. While significant progress has been made in the development of computationally efficient classification methods to address this challenge, existing methods do not adequately take into account the heterogeneity across the cancer subtypes and the mixed types of measurements (binary/count/continuous) across technology platforms. As such, existing methods may fail to identify relevant biological patterns. The goal of this project is to develop new classification methods that explicitly take into account the type and heterogeneity of measurements. While the primary focus is on methodology, high priority will be given to computational considerations and software development to encourage dissemination and ensure ease of use for domain scientists.

Regularized linear discriminant methods are commonly used for simultaneous classification and variable selection due to their interpretability and computational efficiency. These methods, however, rely on unrealistic assumptions: equal covariance matrices across groups and normally distributed measurements. This project aims to address these limitations of current discriminant approaches, and has three objectives: (1) to develop computationally efficient quadratic classification rules that perform variable selection; (2) to generalize the discriminant analysis framework to non-normal measurements; (3) to develop a classification framework for mixed-type data from multiple technology platforms collected on the same set of subjects. The key methodological innovation is the combination of sparse low-rank singular value decomposition, which enables computational efficiency, with the geometric interpretation of linear discriminant analysis, which allows for the construction of nonlinear classification rules by redefining the space for discrimination.
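The equal-covariance limitation mentioned above can be illustrated with a minimal sketch. It is a textbook one-dimensional comparison of a linear (shared variance) and a quadratic (group-specific variance) Gaussian discriminant score, not the sparse method developed in this project. The toy data and all function names are assumptions made for illustration only.

```python
import math

def mean_and_variance(xs):
    """Return the sample mean and maximum-likelihood variance of a list."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var

def linear_score(x, mean, pooled_var, prior):
    # Shared variance: the x**2 term is common to all groups and drops out,
    # so the score is linear in x.
    return x * mean / pooled_var - mean ** 2 / (2 * pooled_var) + math.log(prior)

def quadratic_score(x, mean, var, prior):
    # Group-specific variance: the x**2 term stays, giving a quadratic boundary.
    return -0.5 * math.log(var) - (x - mean) ** 2 / (2 * var) + math.log(prior)

# Hypothetical toy data: both groups are centered at zero but differ in spread.
group0 = [-1.0, -0.5, 0.5, 1.0]
group1 = [-4.0, -3.0, 3.0, 4.0]

m0, v0 = mean_and_variance(group0)
m1, v1 = mean_and_variance(group1)
pooled = (len(group0) * v0 + len(group1) * v1) / (len(group0) + len(group1))
prior = 0.5  # equal prior probabilities, assumed for the example

def linear_gap(x):
    """Difference of the two linear scores; zero means the rule cannot decide."""
    return linear_score(x, m0, pooled, prior) - linear_score(x, m1, pooled, prior)

def predict_quadratic(x):
    """Return the group (0 or 1) with the larger quadratic score."""
    s0 = quadratic_score(x, m0, v0, prior)
    s1 = quadratic_score(x, m1, v1, prior)
    return 0 if s0 >= s1 else 1
```

In this toy setting the group means coincide, so the linear scores tie at every point, while the quadratic scores separate the groups by distance from zero. This is the kind of heterogeneity in group covariances that a linear rule cannot capture.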

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Gaynanova, Irina "Prediction and estimation consistency of sparse multi-class penalized optimal scoring" Bernoulli, v.26, 2020. https://doi.org/10.3150/19-BEJ1126
Gaynanova, Irina and Wang, Tianying "Sparse quadratic classification rules via linear dimension reduction" Journal of Multivariate Analysis, v.169, 2019. https://doi.org/10.1016/j.jmva.2018.09.011
Lapanowski, Alexander F. and Gaynanova, Irina "Sparse feature selection in kernel discriminant analysis via optimal scoring" Proceedings of Machine Learning Research, v.89, 2019.
Yoon, Grace and Carroll, Raymond J. and Gaynanova, Irina "Sparse semiparametric canonical correlation analysis for data of mixed types" Biometrika, 2020. https://doi.org/10.1093/biomet/asaa007
Yoon, Grace and Gaynanova, Irina and Müller, Christian L. "Microbial Networks in SPRING - Semi-parametric Rank-Based Correlation and Partial Correlation Estimation for Quantitative Microbiome Data" Frontiers in Genetics, v.10, 2019. https://doi.org/10.3389/fgene.2019.00516
Yuan, Dongbang and Gaynanova, Irina "Double-Matched Matrix Decomposition for Multi-View Data" Journal of Computational and Graphical Statistics, 2022. https://doi.org/10.1080/10618600.2022.2067860
Zhang, Yunfeng and Gaynanova, Irina "Joint association and classification analysis of multiview data" Biometrics, v.78, 2021. https://doi.org/10.1111/biom.13536

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project was motivated by the need to develop more flexible classification methods for modern large-scale biological data, e.g. gene expression and microbiome data. The restrictive assumptions of existing classification methods (e.g. a linear discrimination boundary and normality of measurements) do not match the complexity of actual data, which may lead to a failure to identify relevant biological patterns. The investigator pursued three topics in this direction: development of computationally efficient quadratic classification rules (relaxation of linearity), generalization of the classification framework to non-normal measurements (relaxation of normality), and a classification framework for data collected from multiple technology platforms (simultaneous analysis of data from multiple sources).

The results achieved in this project are summarized as follows. First, new nonlinear classification rules were developed with higher prediction accuracy and better interpretability (thanks to variable selection). These new methods have dramatically shorter computation times than competing methods, owing to the optimization algorithms and software design developed in the project. Second, the project has led to an improved theoretical understanding of the similarities and differences between classification and linear regression problems, which provided theoretical support for observed similarities in performance. Finally, the project advanced the methods for joint analyses of multi-view data (data collected on the same subjects from different sources). The project has led to the creation of a new truncated model for zero-inflated data, which have become common with advances in modern sequencing technologies. The new model accounts for limited sequencing depth (many observed zeros are due to censoring rather than actual zero values), and allows joint analysis of mixed types of measurements (binary, continuous, zero-inflated). The proposed estimation framework leads to a more accurate characterization of underlying associations between the measurements than existing methods that assume normality. The project also resulted in the creation of a new method for simultaneous non-Gaussian component analysis, which is useful for finding joint structure in multiple neuroimaging modalities with highly non-Gaussian measurements. These methods were distributed through refereed publications across the statistics, machine learning and genomics communities, and as multiple R packages.
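The appeal of rank-based association measures for non-normal data can be sketched with a small example. The code below is not the project's estimator. It uses the classical Kendall's tau and the sine transform that recovers the latent correlation for continuous pairs under a Gaussian copula model. The truncated (zero-inflated) case described above requires a different transform that is not shown here. The data and function names are assumptions made for illustration.

```python
import math

def kendall_tau(x, y):
    """Kendall's tau (tau-a): concordant minus discordant pairs over all pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            sx = (x[i] > x[j]) - (x[i] < x[j])  # sign of the difference in x
            sy = (y[i] > y[j]) - (y[i] < y[j])  # sign of the difference in y
            total += sx * sy
    return total / (n * (n - 1) / 2)

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def latent_correlation(tau):
    # Classical relation for continuous pairs under a Gaussian copula.
    return math.sin(math.pi * tau / 2)

# Hypothetical measurements, made up for illustration.
x = [0.1, 0.4, 0.35, 0.8, 1.2, 0.9]
y = [0.2, 0.3, 0.5, 0.7, 1.1, 1.3]

# A strictly increasing transformation makes y heavily skewed (non-normal).
y_skewed = [math.exp(3 * v) for v in y]

tau_raw = kendall_tau(x, y)
tau_skewed = kendall_tau(x, y_skewed)
```

Because Kendall's tau depends only on the ordering of the values, `tau_raw` and `tau_skewed` coincide, whereas `pearson(x, y)` and `pearson(x, y_skewed)` differ. That invariance is what makes a rank-based estimate usable without a normality assumption on the observed measurements.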

As for the broader impacts of the project, the investigator served as a research mentor for one (female) postdoc and four graduate students (one female, three male), thereby supporting the future STEM workforce. The investigator also developed a new core PhD course on Statistical Computing focused on contemporary optimization algorithms, reproducible research practices and R packages, with multiple case studies for the course arising directly from the project.

Last Modified: 10/10/2020
Modified by: Irina Gaynanova

Please report errors in award information by writing to: awardsearch@nsf.gov.
