
NSF Org: | DMS Division Of Mathematical Sciences |
Recipient: | |
Initial Amendment Date: | May 11, 2017 |
Latest Amendment Date: | May 11, 2017 |
Award Number: | 1712943 |
Award Instrument: | Standard Grant |
Program Manager: | Gabor Szekely, DMS Division Of Mathematical Sciences, MPS Directorate for Mathematical and Physical Sciences |
Start Date: | July 1, 2017 |
End Date: | June 30, 2020 (Estimated) |
Total Intended Award Amount: | $162,539.00 |
Total Awarded Amount to Date: | $162,539.00 |
Funds Obligated to Date: | |
History of Investigator: | |
Recipient Sponsored Research Office: | 400 HARVEY MITCHELL PKY S STE 300, COLLEGE STATION, TX, US 77845-4375, (979) 862-6777 |
Sponsor Congressional District: | |
Primary Place of Performance: | College Station, TX, US 77845-4375 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | STATISTICS |
Primary Program Source: | |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.049 |
ABSTRACT
Recent technological advances have enabled routine collection of large-scale, high-dimensional data in the biomedical fields. For example, in cancer research it is common to use multiple high-throughput technology platforms to measure genotype, gene expression levels, and methylation levels. One of the main challenges in the analysis of such data is the identification of key biological measurements that can be used to classify subjects into known cancer subtypes. While significant progress has been made in the development of computationally efficient classification methods to address this challenge, existing methods do not adequately account for the heterogeneity across cancer subtypes or the mixed types of measurements (binary/count/continuous) across technology platforms, and as a result may fail to identify relevant biological patterns. The goal of this project is to develop new classification methods that explicitly take into account the type and heterogeneity of measurements. While the primary focus is on methodology, high priority will be given to computational considerations and software development to encourage dissemination and ensure ease of use for domain scientists.
Regularized linear discriminant methods are commonly used for simultaneous classification and variable selection due to their interpretability and computational efficiency. These methods, however, rely on unrealistic assumptions of equal group-covariance matrices and normality of measurements. This project aims to address the limitations of current discriminant approaches and has three objectives: (1) to develop computationally efficient quadratic classification rules that perform variable selection; (2) to generalize the discriminant analysis framework to non-normal measurements; (3) to develop a classification framework for mixed-type data coming from multiple technology platforms collected on the same set of subjects. The key methodological innovation is the combination of sparse low-rank singular value decomposition, which enables computational efficiency, with the geometric interpretation of linear discriminant analysis, which allows for the construction of nonlinear classification rules by redefining the space for discrimination.
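For context, the following minimal sketch in R (using the standard MASS package, not the project's own software) contrasts the classical linear and quadratic discriminant rules that the project's sparse, computationally efficient variants build on; the simulated groups deliberately violate the equal-covariance assumption that linear discriminant analysis requires.

    # Minimal sketch (not the project's method): classical LDA vs. QDA from MASS
    # on data that violate the equal group-covariance assumption of LDA.
    library(MASS)
    set.seed(1)
    n  <- 200
    x1 <- mvrnorm(n, mu = c(0, 0), Sigma = diag(2))
    x2 <- mvrnorm(n, mu = c(1, 1), Sigma = matrix(c(3, 1, 1, 2), 2, 2))
    x  <- rbind(x1, x2)
    y  <- factor(rep(c("A", "B"), each = n))
    fit_lda <- lda(x, y)   # common covariance matrix: linear boundary
    fit_qda <- qda(x, y)   # group-specific covariances: quadratic boundary
    mean(predict(fit_lda, x)$class == y)   # apparent accuracy, linear rule
    mean(predict(fit_qda, x)$class == y)   # apparent accuracy, quadratic rule

In the project's setting the number of variables far exceeds the sample size, so rules of this kind must be combined with variable selection and low-rank structure; the sketch only illustrates the linear-versus-quadratic distinction.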
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available without a charge during the embargo (administrative interval). Some links on this page may take you to non-federal websites; their policies may differ from this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project was motivated by the need to develop more flexible classification methods for modern large-scale biological data, e.g., gene expression and microbiome data. The restrictive assumptions of existing classification methods (e.g., a linear discrimination boundary and normality of measurements) do not match the complexity of actual data, which may lead to a failure to identify relevant biological patterns. The investigator pursued three topics in this direction: the development of computationally efficient quadratic classification rules (relaxation of linearity), the generalization of the classification framework to non-normal measurements (relaxation of normality), and a classification framework for data collected from multiple technology platforms (simultaneous analysis of data from multiple sources).
The results achieved in this project are summarized as follows. First, new nonlinear classification rules were developed with higher prediction accuracy and better interpretability (thanks to variable selection). These new methods have dramatically smaller computation times than competing approaches, owing to the optimization algorithms and software design developed in the project. Second, the project led to an improved theoretical understanding of the similarities and differences between classification and linear regression problems, which provided theoretical support for observed similarities in performance. Finally, the project advanced methods for joint analyses of multi-view data (data collected on the same subjects from different sources). The project led to the creation of a new truncated model for zero-inflated data, which have become common with modern sequencing technologies. The new model accounts for the limited sequencing depth (many observed zeros are due to censoring rather than actual zero values) and allows mixed types of measurements (binary, continuous, zero-inflated) to be analyzed jointly. The proposed estimation framework leads to a more accurate characterization of the underlying associations between the measurements than existing methods that assume normality. The project also resulted in a new method for simultaneous non-Gaussian component analysis, which is useful for finding joint structure in multiple neuroimaging modalities with highly non-Gaussian measurements. These methods were disseminated through refereed publications in the statistics, machine learning, and genomics communities, and as multiple R packages.
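To illustrate the censoring idea behind the truncated model, the following hypothetical simulation in R (not the estimator developed in the project) shows how zeros produced by a detection limit attenuate the naive correlation relative to the association on the latent scale.

    # Hypothetical simulation: zeros created by censoring at a detection limit
    # make the naive correlation understate the true latent association.
    set.seed(2)
    n <- 1000
    Sigma <- matrix(c(1, 0.6, 0.6, 1), 2, 2)            # true latent correlation 0.6
    z  <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
    x1 <- z[, 1]
    x2 <- pmax(z[, 2], 0)       # values below the limit are observed as zeros
    mean(x2 == 0)               # roughly half of the observations are zeros
    cor(x1, z[, 2])             # association on the latent scale (about 0.6)
    cor(x1, x2)                 # naive estimate on the zero-inflated data is attenuated

The estimation framework described above is designed to recover the latent association from such zero-inflated observations rather than treating the observed zeros as true values.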
As for the broader impacts of the project, the investigator served as a research mentor for one (female) postdoc and four graduate students (one female, three male), thereby supporting the future STEM workforce. The investigator also developed a new core PhD course on Statistical Computing focused on contemporary optimization algorithms, reproducible research practices, and R package development, with multiple case studies for the course arising directly from the project.
Last Modified: 10/10/2020
Modified by: Irina Gaynanova