
NSF Org: | DMS Division Of Mathematical Sciences |
Recipient: |
|
Initial Amendment Date: | June 20, 2020 |
Latest Amendment Date: | June 20, 2020 |
Award Number: | 2013905 |
Award Instrument: | Standard Grant |
Program Manager: | Yong Zeng, yzeng@nsf.gov, (703)292-7299, DMS Division Of Mathematical Sciences, MPS Directorate for Mathematical and Physical Sciences |
Start Date: | August 1, 2020 |
End Date: | July 31, 2023 (Estimated) |
Total Intended Award Amount: | $225,000.00 |
Total Awarded Amount to Date: | $225,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 201 OLD MAIN, UNIVERSITY PARK, PA, US 16802-1503, (814)865-1372 |
Sponsor Congressional District: |
|
Primary Place of Performance: | University Park, PA, US 16802-7000 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | STATISTICS, MSPA-INTERDISCIPLINARY |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.049 |
ABSTRACT
Rapid technological advances in devices and computer systems continue to expand our capacity to collect and store data. Clustering is often the first-stage analysis performed to discover patterns, gain insights, and extract knowledge from the massive amounts of data routinely encountered in science, engineering, and commercial domains. For instance, in biomedical studies, clustering is used to reveal pathological subgroups and help researchers form new hypotheses for in-depth investigation. It is thus imperative to develop new clustering methods to meet the ever-increasing challenges posed by data that are highly complex, huge in volume, and drawn from distributed sources. In this project, novel statistical and optimization-based approaches and software packages will be developed to address these challenges. Graduate students will be trained to conduct research at the forefront of machine learning. The research results will be used to enrich courses and outreach educational materials in data science.
A prominent statistical paradigm for clustering is based on mixture models, which are objective, parsimonious, and not biased toward known clusters, and which offer a probabilistic framework that can be extended and interpreted in standard ways. For high-dimensional large-scale data, existing mixture-model-based methods have fundamental limitations. Furthermore, a big data environment can require the integration of clustering results at distributed sites, a problem called multi-source clustering. This research will advance cluster analysis in several respects. First, the hidden Markov model on variable blocks (HMM-VB), a special Gaussian mixture model (GMM), is developed to tackle high dimensionality. The estimation of HMM-VB will be enhanced by computationally efficient methods to identify the latent variable block structure and by mixture factor analyzers. Second, leveraging the latent states of HMM-VB, a new variable selection approach will be developed for clustering high-dimensional data. Third, the emerging topic of multi-source clustering will be studied. New methods based on optimal transport and the Wasserstein barycenter will be developed for aggregating clustering results from multiple sources. Applications in biomedical areas will be pursued.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
We have published 7 journal papers on a variety of research topics covered by this project. The methodologies developed address fundamental challenges in clustering for multi-source, multi-modality, and distributed data. Experiments show that our new methods achieve state-of-the-art performance. The clustering and classification methods developed in this project have wide potential applications in science and engineering. In our experiments, we focused on applications in biomedical areas. In addition, our work to improve the robustness of surrogate DNN models has wide potential impact because surrogate models are used to approximate complex simulation models in many science and engineering fields, such as weather simulation models and physical simulation systems for materials. Algorithms developed with support from this grant have been implemented and made available to the public through GitHub. We describe several of the research results below.
We have extended the framework of optimal transport (OT) to optimal transport with relaxed marginal constraints (OT-RMC). We demonstrated how this extension brings considerable flexibility to various matching problems. In particular, we have developed a new method based on OT-RMC to align clusters obtained from multiple sources. We consider the case in which the clusters from different sources are not all identical; in other words, each source may contain clusters unique to itself. Conventional OT excludes such cases. We have conducted experiments on several real-world datasets and found the algorithm competitive in both accuracy and speed.
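For reference, the conventional (Kantorovich) OT problem can be written as below. The notation is an assumption made here for illustration and does not come from the report: p and q are the weight vectors of the m clusters in one source and the n clusters in another (each summing to 1), c_ij is the cost of matching cluster i to cluster j, and pi_ij is the mass transported between them.

```latex
\min_{\pi \ge 0} \; \sum_{i=1}^{m}\sum_{j=1}^{n} c_{ij}\,\pi_{ij}
\quad \text{subject to} \quad
\sum_{j=1}^{n} \pi_{ij} = p_i \;\; (i = 1,\dots,m), \qquad
\sum_{i=1}^{m} \pi_{ij} = q_j \;\; (j = 1,\dots,n).
```

The two equality constraints force all the mass of every cluster to be matched to clusters in the other source, which is why a cluster unique to one source cannot simply be left unmatched. OT-RMC relaxes these marginal constraints; its exact form is not stated in this report and is not reproduced here.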
Multi-view data can be created from different sources, by several technologies, and in multiple modalities. Information integrated from multi-view data has set a frontier for making discoveries in many fields. Various methods have been developed for multi-view clustering. However, these methods have important limitations, such as the requirement of pooling data from multiple views, the lack of choice in the clustering algorithm used within each view, and the neglect of complementary information across views. We have developed a new approach for multi-view clustering, namely, covering point set merge (CPSmerge) analysis, which does not pool data or concatenate variables across the views. The main idea is to maximize clustering stability by merging clusters formed by the Cartesian product of clustering labels acquired in individual views. Our method also quantifies the contribution of each view to the formation of any cluster. The method can be readily applied and incorporated into existing clustering pipelines because the algorithm adopted for any view is unrestricted. This flexibility, lacking in many multi-view clustering methods, enables us to leverage advanced single-view clustering algorithms. Importantly, our method accounts for both consensus and complementary effects between different views. In contrast, existing ensemble algorithms focus on seeking a consensus among clustering results obtained in different views, implicitly assuming that these results are variations of one clustering structure.
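As a minimal sketch of the starting point described above, the following Python snippet forms the Cartesian-product clusters from per-view labels. It covers only this first step; the stability-maximizing merge that defines CPSmerge is not shown. The function name `product_clusters` and the toy labels are hypothetical and chosen here for illustration.

```python
from collections import Counter


def product_clusters(view_labels):
    """Combine per-view cluster labels into product labels.

    view_labels: list of label lists, one per view, all of the same length.
    Returns one tuple per sample, e.g. (label_in_view_0, label_in_view_1).
    """
    n = len(view_labels[0])
    if any(len(labels) != n for labels in view_labels):
        raise ValueError("every view must label the same samples")
    return list(zip(*view_labels))


# Hypothetical labels for five samples, clustered separately in two views
# (any clustering algorithm could have produced them).
view_a = [0, 0, 1, 1, 1]
view_b = [0, 1, 1, 1, 0]

combined = product_clusters([view_a, view_b])
sizes = Counter(combined)  # number of samples in each product cluster
```

Two samples share a product cluster only if they agree in every view, so the product partition is at least as fine as each single-view partition. A merging step would then coarsen it.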
We have developed a new approach to integrate clustering results from multiple sources by computing the Minimized Aggregated Wasserstein (MAW) barycenter of Gaussian mixture models (GMMs). The potential use of this algorithm goes beyond integrating clustering results because computing the MAW barycenter is a general way to find an "average" of GMMs. Our proposed algorithm for clustering integration scales well with the data dimension and the number of mixture components, with complexity independent of data size. We demonstrate that the new method achieves better clustering results on several single-cell RNA-seq data sets than some other popular methods.
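To give a feel for what an "average" under the Wasserstein metric means, here is a small sketch for the simplest building block: single univariate Gaussians, not full GMMs and not the MAW algorithm itself. It relies on two standard facts: the squared 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2) is (m1 - m2)^2 + (s1 - s2)^2, and the barycenter of univariate Gaussians is the Gaussian whose mean and standard deviation are the weighted averages of the inputs. The function names are hypothetical.

```python
def w2_squared(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2)."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2


def gaussian_barycenter_1d(means, stds, weights):
    """Wasserstein-2 barycenter of univariate Gaussians.

    Returns (mean, std) of the barycenter; weights are normalized to sum to 1.
    """
    total = sum(weights)
    w = [x / total for x in weights]
    mean = sum(wi * mi for wi, mi in zip(w, means))
    std = sum(wi * si for wi, si in zip(w, stds))
    return mean, std


# Equal-weight barycenter of N(0, 1^2) and N(4, 3^2).
bary_mean, bary_std = gaussian_barycenter_1d([0.0, 4.0], [1.0, 3.0], [1.0, 1.0])
d_left = w2_squared(bary_mean, bary_std, 0.0, 1.0)
d_right = w2_squared(bary_mean, bary_std, 4.0, 3.0)
```

With equal weights the barycenter is equidistant from the two inputs. Note that its standard deviation is the average of the input standard deviations, whereas an ordinary mixture of the two densities would be bimodal.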
By combining clustering techniques and DNNs, we developed a new approach to enhance the interpretability of complex supervised learning models. Although DNN models have become very popular in machine learning, their lack of interpretability has limited their use in mission-critical applications. In this work, we bridge DNNs with traditional statistical models so that we have a mechanism to trade off interpretability against accuracy. We proposed the novel idea of "co-supervision" by DNN models. We have successfully trained mixtures of linear models that achieve accuracy close to that of DNNs while maintaining interpretability.
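The general idea of approximating a complex model by several local linear models can be sketched as follows. This is a toy illustration under strong assumptions, not the project's co-supervision method: the function `teacher` is a hypothetical stand-in for a trained DNN, the partition in `assign` is fixed by hand rather than learned by clustering, and each region gets an ordinary least-squares line fitted to the teacher's outputs.

```python
def teacher(x):
    """Hypothetical stand-in for the prediction of a trained DNN."""
    return abs(x)


def assign(x):
    """Hand-picked hard partition of the input space (assumption)."""
    return 0 if x < 0 else 1


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Fit one linear model per region, using the teacher's outputs as targets.
xs = [i / 10 for i in range(-30, 31)]
models = {}
for k in (0, 1):
    xk = [x for x in xs if assign(x) == k]
    yk = [teacher(x) for x in xk]
    models[k] = fit_line(xk, yk)


def predict(x):
    """Predict with the linear model of the region that x falls into."""
    slope, intercept = models[assign(x)]
    return slope * x + intercept
```

Each region's prediction can be read directly from its slope and intercept, which is the sense in which such a model stays interpretable while tracking the more complex teacher.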
Three graduate students have conducted research in this project. The project provided them with a unique opportunity to explore topics at the interface of mixture modeling, deep learning, and optimal transport. They were also exposed to applications in biomedical areas. The project also inspired the undergraduate honors thesis research topics of two students. One student studied the robustness of an algorithm that explains the decision of a DNN locally using linear models. The other student studied how to feed back explanations of a DNN's image classification decisions to enhance the robustness of the DNN. The graduate and undergraduate students include two female students and one Hispanic student.
Last Modified: 08/01/2023
Modified by: Jia Li