Award Abstract # 2013905
Cluster Analysis for High-Dimensional and Multi-Source Data

NSF Org: DMS
Division Of Mathematical Sciences
Recipient: THE PENNSYLVANIA STATE UNIVERSITY
Initial Amendment Date: June 20, 2020
Latest Amendment Date: June 20, 2020
Award Number: 2013905
Award Instrument: Standard Grant
Program Manager: Yong Zeng
yzeng@nsf.gov
 (703)292-7299
DMS
 Division Of Mathematical Sciences
MPS
 Directorate for Mathematical and Physical Sciences
Start Date: August 1, 2020
End Date: July 31, 2023 (Estimated)
Total Intended Award Amount: $225,000.00
Total Awarded Amount to Date: $225,000.00
Funds Obligated to Date: FY 2020 = $225,000.00
History of Investigator:
  • Jia Li (Principal Investigator)
  • Lynn Lin (Co-Principal Investigator)
Recipient Sponsored Research Office: Pennsylvania State Univ University Park
201 OLD MAIN
UNIVERSITY PARK
PA  US  16802-1503
(814)865-1372
Sponsor Congressional District: 15
Primary Place of Performance: Pennsylvania State Univ University Park
University Park
PA  US  16802-7000
Primary Place of Performance
Congressional District:
Unique Entity Identifier (UEI): NPM2J7MSCF61
Parent UEI:
NSF Program(s): STATISTICS,
MSPA-INTERDISCIPLINARY
Primary Program Source: 01002021DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 068Z, 079Z
Program Element Code(s): 126900, 745400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.049

ABSTRACT

Rapid technological advances in devices and computer systems continue to grow our capacity to collect and store data. Clustering is often the first-stage analysis performed to discover patterns, gain insights, and extract knowledge from the massive amounts of data routinely faced in science, engineering, and commercial domains. For instance, in biomedical studies, clustering is used to reveal pathological subgroups and help researchers form new hypotheses for in-depth investigation. It is thus imperative to develop new clustering methods to meet the ever-increasing challenges of data that are highly complex, huge in volume, and drawn from distributed sources. In this project, novel statistical and optimization-based approaches and software packages will be developed to address these challenges. Graduate students will be trained to conduct research at the forefront of machine learning. The research results will be used to enrich courses and outreach educational materials in data science.

A prominent statistical paradigm for clustering is based on mixture models. This paradigm is objective, parsimonious, not biased toward known clusters, and has a probabilistic framework that can be extended and interpreted in standard ways. For high-dimensional large-scale data, existing mixture-model-based methods have fundamental limitations. Furthermore, a big data environment can require the integration of clustering results at distributed sites, a problem called multi-source clustering. This research will advance cluster analysis in multiple respects. First, the hidden Markov model on variable blocks (HMM-VB), a special Gaussian mixture model (GMM), is developed to tackle high dimensionality. The estimation of HMM-VB will be enhanced by computationally efficient methods to identify the latent variable block structure and by mixture factor analyzers. Second, leveraging the latent states of HMM-VB, a new variable selection approach will be developed for clustering high-dimensional data. Third, the emerging topic of multi-source clustering will be studied. New methods based on optimal transport and the Wasserstein barycenter will be developed for aggregating clustering results from multiple sources. Applications in biomedical areas will be pursued.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Li, Jia and Lin, Lin "Optimal Transport With Relaxed Marginal Constraints" IEEE Access, v.9, 2021 https://doi.org/10.1109/ACCESS.2021.3072613
Lin, Lin and Shi, Wei and Ye, Jianbo and Li, Jia "Multisource Single-Cell Data Integration by MAW Barycenter for Gaussian Mixture Models" Biometrics, v.79, 2022 https://doi.org/10.1111/biom.13630
Seo, Beomseok and Lin, Lin and Li, Jia "Mixture of Linear Models Co-supervised by Deep Neural Networks" Journal of Computational and Graphical Statistics, v.31, 2022 https://doi.org/10.1080/10618600.2022.2107533
Zhang, Jingxuan and Li, Jia and Lin, Lin "Statistical and machine learning methods for immunoprofiling based on single-cell data" Human Vaccines & Immunotherapeutics, v.19, 2023 https://doi.org/10.1080/21645515.2023.2234792
Zhang, Lixiang and Li, Jia "Robust deep neural network surrogate models with uncertainty quantification via adversarial training" Statistical Analysis and Data Mining: The ASA Data Science Journal, v.16, 2023 https://doi.org/10.1002/sam.11610
Zhang, Lixiang and Lin, Lin and Li, Jia "Multi-view clustering by CPS-merge analysis with application to multimodal single-cell data" PLOS Computational Biology, v.19, 2023 https://doi.org/10.1371/journal.pcbi.1011044
Zhang, Lixiang and Lin, Lin and Li, Jia "VtNet: A neural network with variable importance assessment" Stat, v.10, 2021 https://doi.org/10.1002/sta4.325

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

We have published 7 journal papers on a variety of research topics covered by this project. The methodologies developed address fundamental challenges in clustering for multi-source, multi-modality, and distributed data. Experiments show that our new methods have achieved state-of-the-art performance. The clustering and classification methods developed in this project have wide potential applications in science and engineering. In our experiments, we focused on applications in biomedical areas. In addition, our work to improve the robustness of surrogate DNN models has wide potential impact because surrogate models are used to approximate complex simulation models in many science and engineering fields, for instance, weather simulation models and physical simulation systems for materials. Algorithms developed with support from this grant have been implemented and made available to the public through GitHub. We describe several of the research results below.

We have extended the framework of optimal transport (OT) to optimal transport with relaxed marginal constraints (OT-RMC). We demonstrated how this extension can bring much flexibility in various matching problems. In particular, we have developed a new method based on OT-RMC to align clusters obtained from multiple sources. We consider the case when the clusters from different sources are not all identical. In other words, each source may contain clusters unique to itself. Conventional OT excludes such cases. We have conducted experiments on several real-world datasets and found the algorithm competitive in both accuracy and speed.
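To make the role of the marginal constraints concrete, the following minimal Python sketch measures how far a candidate matching between two sets of clusters deviates from the equality constraints that conventional OT imposes. The function name `marginal_violation`, the cluster weights, and the two example plans are hypothetical and chosen only for illustration. The sketch does not implement OT-RMC, whose exact relaxation is not described in this report.

```python
def marginal_violation(plan, p, q):
    """Largest absolute gap between the plan's marginals and the target weights.

    plan[i][j] is the mass matched between cluster i of source A and
    cluster j of source B; p and q are the cluster weights in A and B.
    Conventional OT requires this gap to be zero.
    """
    row_sums = [sum(row) for row in plan]
    col_sums = [sum(col) for col in zip(*plan)]
    row_gap = max(abs(r - p_i) for r, p_i in zip(row_sums, p))
    col_gap = max(abs(c - q_j) for c, q_j in zip(col_sums, q))
    return max(row_gap, col_gap)


# Hypothetical weights: source A has three clusters, source B has two.
p = [0.4, 0.4, 0.2]
q = [0.5, 0.5]

# A plan that satisfies both marginals: every cluster in A is matched.
full_plan = [[0.4, 0.0], [0.1, 0.3], [0.0, 0.2]]

# A plan that leaves the third cluster of A unmatched, as one might want
# when that cluster is unique to source A.
partial_plan = [[0.4, 0.0], [0.0, 0.4], [0.0, 0.0]]

print(marginal_violation(full_plan, p, q))
print(marginal_violation(partial_plan, p, q))
```

The second plan violates the equality constraints, so conventional OT cannot return it. Relaxing the marginal constraints is what admits matchings of this kind.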

Multi-view data can be created from different sources, by several technologies, and in multiple modalities. Information integrated from multi-view data has set a frontier for making discoveries in many fields. Various methods have been developed for multi-view clustering. However, these methods have important limitations such as the requirement of pooling data in multiple views, the lack of choices for clustering algorithms used within each view, and the neglect of complementary information across views. We have developed a new approach for multi-view clustering, namely, covering point set merge (CPS-merge) analysis, without pooling data or concatenating variables across the views. The main idea is to maximize clustering stability by merging clusters formed by the Cartesian product of clustering labels acquired in individual views. Our method also quantifies the contribution of each view to the formation of any cluster. The method can be readily applied and incorporated into existing clustering pipelines because the algorithm adopted for any view is unrestricted. This flexibility, lacking in many multi-view clustering methods, enables us to leverage advanced single-view clustering algorithms. Importantly, our method accounts for both consensus and complementary effects between different views. In contrast, existing ensemble algorithms focus on seeking a consensus for clustering results obtained in different views, implicitly assuming that these results are variations of one clustering structure.
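As a small illustration of the starting point described above, the Python sketch below forms the clusters given by the Cartesian product of per-view labels: two samples fall in the same product cluster only if they share a label in every view. The function name `product_clusters` and the toy labels are hypothetical. The subsequent merge step that maximizes clustering stability is the core of CPS-merge analysis and is not implemented here.

```python
from collections import defaultdict


def product_clusters(view_labels):
    """Group sample indices by their tuple of per-view cluster labels.

    view_labels is a list with one label sequence per view; all sequences
    must describe the same samples in the same order.
    """
    n = len(view_labels[0])
    if any(len(labels) != n for labels in view_labels):
        raise ValueError("every view must label the same number of samples")
    groups = defaultdict(list)
    for index, combo in enumerate(zip(*view_labels)):
        groups[combo].append(index)
    return dict(groups)


# Toy example: six samples clustered independently in two views.
view1 = [0, 0, 1, 1, 1, 0]
view2 = ["a", "b", "a", "a", "b", "b"]

clusters = product_clusters([view1, view2])
for combo, members in sorted(clusters.items()):
    print(combo, members)
```

Because only the label sequences are needed, each view can use any clustering algorithm and no raw data has to be pooled across views.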

We have developed a new approach to integrate clustering results from multiple sources by computing the MAW barycenter of Gaussian mixture models. The potential usage of this algorithm goes beyond integrating clustering results because the algorithm for computing the Minimized Aggregated Wasserstein (MAW) barycenter of Gaussian Mixture Models (GMMs) is a general approach to find an "average" of GMMs. Our proposed algorithm for clustering integration scales well with the data dimension and the number of mixture components, with complexity independent of data size. We demonstrate that the new method achieves better clustering results on several single-cell RNA-seq data sets than some other popular methods.
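The Python sketch below hints at why such a computation can be independent of data size: it works only with the parameters of the mixture components. It assumes univariate Gaussians, for which the squared 2-Wasserstein distance has the closed form (m1 - m2)^2 + (s1 - s2)^2, and builds the component-to-component cost matrix between two GMMs. Our understanding is that the MAW distance is obtained by solving an optimal transport problem over such a cost matrix with the mixture weights as marginals; that optimization and the barycenter computation are omitted here. The function names and the example mixtures are hypothetical.

```python
def w2_squared_1d(mean1, std1, mean2, std2):
    """Squared 2-Wasserstein distance between two univariate Gaussians."""
    return (mean1 - mean2) ** 2 + (std1 - std2) ** 2


def component_cost_matrix(gmm_a, gmm_b):
    """Pairwise squared W2 distances between the components of two GMMs.

    Each GMM is a list of (weight, mean, std) tuples. The weights are not
    used here; they would serve as marginals in the transport problem.
    """
    return [
        [w2_squared_1d(mean_a, std_a, mean_b, std_b) for _, mean_b, std_b in gmm_b]
        for _, mean_a, std_a in gmm_a
    ]


# Hypothetical mixtures estimated at two different sources.
gmm_a = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
gmm_b = [(0.3, 0.0, 1.0), (0.7, 5.0, 2.0)]

cost = component_cost_matrix(gmm_a, gmm_b)
for row in cost:
    print(row)
```

The cost matrix has one entry per pair of components, so its size depends on the number of mixture components rather than on the number of data points.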

By combining clustering techniques and DNNs, we developed a new approach to enhance the interpretability of complex supervised learning models. Although DNN models have become very popular in machine learning, the fact that they are not interpretable has limited their use in mission-critical applications. In this work, we bridge DNNs with traditional statistical models such that we have a mechanism to trade off interpretability against accuracy. We proposed the novel idea of "co-supervision" by DNN models. We have successfully trained mixtures of linear models that achieve accuracy close to that of DNNs while maintaining interpretability.

Three graduate students have conducted research in this project. This project provided them with a unique opportunity to explore topics at the interface of mixture modeling, deep learning, and optimal transport. They were also exposed to applications in biomedical areas. This project also inspired the undergraduate honors thesis research topics of two students. One student studied the robustness of an algorithm that explains the decision of a DNN locally using linear models. The other student studied how to feed explanations of a DNN's image classification decisions back into the model to enhance its robustness. The graduate and undergraduate students include two female students and one Hispanic student.

 


Last Modified: 08/01/2023
Modified by: Jia Li

Please report errors in award information by writing to: awardsearch@nsf.gov.
