
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | February 16, 2017 |
Latest Amendment Date: | June 14, 2023 |
Award Number: | 1652943 |
Award Instrument: | Continuing Grant |
Program Manager: |
Sylvia Spengler
sspengle@nsf.gov (703)292-7347 IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | February 15, 2017 |
End Date: | January 31, 2024 (Estimated) |
Total Intended Award Amount: | $409,641.00 |
Total Awarded Amount to Date: | $529,641.00 |
Funds Obligated to Date: |
FY 2018 = $93,873.00 FY 2019 = $99,021.00 FY 2020 = $117,290.00 FY 2021 = $104,443.00 FY 2022 = $16,000.00 FY 2023 = $16,000.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
1500 ILLINOIS ST GOLDEN CO US 80401-1887 (303)273-3000 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
1500 Illinois St Golden CO US 80401-1887 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: |
01002324DB NSF RESEARCH & RELATED ACTIVIT 01001718DB NSF RESEARCH & RELATED ACTIVIT 01001819DB NSF RESEARCH & RELATED ACTIVIT 01001920DB NSF RESEARCH & RELATED ACTIVIT 01002021DB NSF RESEARCH & RELATED ACTIVIT 01002122DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
The goal of this CAREER project is to identify and establish a new robust data mining framework for better modeling, understanding, and analyzing brain imaging genomics data, one that combines sparsity-induced learning models with new and more efficient computational algorithms. The proposed research is innovative and crucial not only for facilitating the development of new data mining techniques, but also for addressing emerging scientific questions in brain imaging genomics and for supporting the BRAIN Initiative, which was recently unveiled by the U.S. Government and has become a national goal. Integrated with the research are educational goals: to create and broadly disseminate new curricular and K-12 outreach materials that focus both on the challenges of processing large-scale, heterogeneous-modal, high-dimensional data and on the principles behind the robust data mining techniques for alleviating them.
This project focuses on designing principled data mining algorithms for analyzing multi-modal brain imaging genomics data to yield a mechanistic understanding from gene to brain function and to phenotypic outcomes. Of particular interest are (1) large-scale non-convex sparse learning models with linearly convergent algorithms, (2) multi-task, multi-dimensional data integration algorithms with linear computational cost, and (3) evaluation and validation in large-scale brain imaging genomics studies. The research in this project will enable new computational applications in a large number of research areas. The educational materials developed as part of this project will give K-12 students a taste of some of the many fascinating topics in machine learning and data mining, while communicating to students the relevance of their mathematics and science classes to futures in engineering.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Scientific outcomes for intellectual merits:
1. We developed multiple sparse multi-view learning algorithms for identifying biomarkers for early detection of Alzheimer’s disease (AD).
AD is a degenerative brain disease that affects millions of people around the world. As populations in the United States and worldwide age, the prevalence of AD will only increase, and its social and financial costs will place a growing burden on families and caregivers across the globe. By combining genetic information, brain scans, and clinical data gathered over time through the Alzheimer’s Disease Neuroimaging Initiative (ADNI), we developed a new joint regression and classification model that performs well in identifying relevant genetic and phenotypic biomarkers in patients with AD. As shown in Fig. 1, the proposed method consists of three major components. First, we use L2,1-norm regularization to associate input features over time and generate a sparse solution. Second, we use a group L1-norm regularization proposed in our previous work to globally associate the weights of the input imaging and genetic modalities, where a modality is a single data grouping (e.g., brain imaging data, genetic data, or diagnostic data). The group L1-norm regularization can determine which input modality is most effective at predicting a particular output. Third, we incorporate trace norm regularization to capture relationships that occur within modalities.
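The three regularizers named above can be illustrated with a minimal NumPy sketch. The report does not give the exact objective, so the definitions below are common textbook forms and should be read as assumptions: the L2,1-norm as the sum of row-wise Euclidean norms, the group L1-norm as the sum of Euclidean norms of each (modality, output) block, and the trace norm as the sum of singular values. The function names, the weight matrix W, and the modality grouping are hypothetical and chosen only for illustration.

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (encourages row sparsity)."""
    return float(np.sum(np.linalg.norm(W, axis=1)))

def group_l1_norm(W, groups):
    """Sum, over modalities and output columns, of the Euclidean norm of
    each block of weights (assumed definition of the group L1-norm).

    groups: list of row-index lists, one list per modality (hypothetical).
    """
    total = 0.0
    for rows in groups:
        block = W[rows, :]                       # weights of one modality
        total += np.sum(np.linalg.norm(block, axis=0))  # one norm per output
    return float(total)

def trace_norm(W):
    """Sum of the singular values of W (encourages low rank)."""
    return float(np.sum(np.linalg.svd(W, compute_uv=False)))

# Toy weight matrix: 3 input features (rows) by 2 outputs (columns).
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])
# Hypothetical grouping: features 0-1 form one modality, feature 2 another.
groups = [[0, 1], [2]]

print(l21_norm(W))               # 6.0
print(group_l1_norm(W, groups))  # 8.0
print(trace_norm(W))             # sqrt(34), roughly 5.831
```

In a penalized model, each of these values would typically be multiplied by its own tuning weight and added to the regression and classification loss; how the project's method weights and combines them is not stated in this report.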
2. We developed several data representation and compression methods and applied them to the analysis of multimodal imaging, biomarker, genomics and transcriptomics data sets.
To aid automatic AD diagnosis, many longitudinal learning models have been proposed to predict clinical outcomes and/or disease status. These models, however, often fail to handle the missing entries in patients' temporal phenotypic records, which can convey valuable information about AD progression. Another challenge in AD studies is how to integrate heterogeneous genotypic and phenotypic biomarkers to improve diagnosis prediction. To cope with these challenges, as illustrated in Fig. 2, we proposed a longitudinal multi-modal method that learns enriched genotypic and phenotypic biomarker representations as fixed-length vectors. These vectors simultaneously capture the baseline neuroimaging measurements of the entire dataset and, for every participant, the progressive variations across a varying number of follow-up measurements from different biomarker sources. The learned global and local projections are aligned by a soft constraint, and a structured-sparsity norm is used to uncover the multi-modal structure of the heterogeneous biomarker measurements. We conducted extensive experiments on the ADNI data using one genotypic and two phenotypic biomarkers. The empirical results show that the learned enriched biomarker representations are more effective in predicting the outcomes of various cognitive assessments. Moreover, our model identified disease-relevant biomarkers that are supported by existing medical findings, which further supports the validity of our method from a clinical perspective.
3. We enhanced a few machine learning models that build the theoretical foundations of machine learning.
(1) Principal Component Analysis (PCA) is one of the most broadly used methods for analyzing high-dimensional data. However, most existing studies on PCA aim to minimize the reconstruction error measured by the Euclidean distance, although in some fields, such as text analysis in information retrieval, analysis using the angle distance is known to be more effective. To this end, we proposed a novel PCA formulation that adds a constraint on the factors to unify the Euclidean distance and the angle distance. (2) The traditional Linear Discriminant Analysis (LDA) objective aims to minimize a ratio of squared Euclidean distances, which may not perform optimally on noisy datasets. One limitation is that the mean calculations use the squared ℓ2-norm distance to center the data, which is not valid when the objective depends on other distance functions. A second problem is that there is no generalized optimization algorithm for solving different robust LDA objectives. In addition, most existing algorithms can only guarantee a locally optimal solution rather than a globally optimal one. Recognizing these limitations, we reviewed multiple robust loss functions and proposed a new, generalized robust objective for LDA.
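As background for item (1), the short sketch below checks a standard identity: for unit-length vectors, the squared Euclidean distance equals 2 - 2cos(θ), where θ is the angle between them. This is one well-known reason a norm constraint can tie the two distances together. It is not the project's PCA formulation, which this report does not specify, and the helper names are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit Euclidean length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def squared_euclidean(x, y):
    """Squared Euclidean distance between x and y."""
    return float(np.sum((x - y) ** 2))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = unit([3.0, 4.0])   # (0.6, 0.8)
y = unit([4.0, 3.0])   # (0.8, 0.6)

lhs = squared_euclidean(x, y)
rhs = 2.0 - 2.0 * cosine_similarity(x, y)
print(np.isclose(lhs, rhs))  # True: both sides are about 0.08
```

Without the unit-length constraint the identity fails, which is why the two distances can rank neighbors differently on unnormalized data such as raw term counts.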
Other outcomes for broader impacts:
We have published about 37 full-length papers related to this project in peer-reviewed conference proceedings and journals.
This project supported three Ph.D. students at Colorado School of Mines. Two of them have graduated; the third is currently a fourth-year Ph.D. student in the Department of Computer Science who expects to graduate next year and is seeking an academic position.
This project also supported sixteen undergraduate REU students. The work of these students, together with their graduate student mentors supported by this project, has led to more than 10 manuscripts published in or submitted to top-tier peer-reviewed journals.
The research materials produced in this project are used in teaching several undergraduate and graduate courses at Colorado School of Mines.
Last Modified: 05/20/2024
Modified by: Hua Wang