
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | August 24, 2016 |
Latest Amendment Date: | August 24, 2016 |
Award Number: | 1617583 |
Award Instrument: | Standard Grant |
Program Manager: | Sylvia Spengler, sspengle@nsf.gov, (703)292-7347, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2016 |
End Date: | August 31, 2021 (Estimated) |
Total Intended Award Amount: | $499,361.00 |
Total Awarded Amount to Date: | $499,361.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 5000 FORBES AVE PITTSBURGH PA US 15213-3815 (412)268-8746 |
Sponsor Congressional District: |
|
Primary Place of Performance: | 5000 Forbes Avenue Pittsburgh PA US 15213-3890 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Latent variable models (LVMs), which extract hidden information such as topics, themes, or disease patterns from raw data, play an important role in electronic health record (EHR) management and applications. With the dramatic increase in the volume and complexity of EHR data, current LVMs face several new challenges: inadequate capture of rare patterns that exist in only a small number of patients in a population (also known as long-tail patterns), redundancy among the patterns discovered, and low computational efficiency. All of these seriously impair the value of EHR data in driving high-quality personalized medicine. There is a critical need to develop new methods that transform conventional LVMs into ones that circumvent these limitations, so that EHR data can be used more effectively and reliably in healthcare applications. This project addresses this need by developing a new class of techniques, "diversity-inducing machine learning models", which promote rare patterns and condense redundant ones at high computational efficiency, enabling more effective pattern discovery and knowledge extraction from complex and heterogeneous (e.g., textual, image, and time series) EHR data.
Specifically, this project contains the following research components: 1. Develop a new regularized LVM learning framework that encourages the basis of the latent space toward a more diversity-inducing geometry with less redundancy, thereby achieving long-tail pattern coverage and better interpretability in both Euclidean and Hilbert space settings. 2. Develop a diversity-promoting Bayesian LVM learning framework that enables efficient inference of posterior probability distributions, to facilitate quantification of uncertainty and alleviate overfitting. 3. Theoretically analyze the diversity-inducing techniques proposed in 1 and 2 to understand how they affect the generalization error of supervised LVMs, the posterior contraction rate of unsupervised LVMs, and the information geometry of the distributions induced by LVMs. 4. Apply the diversified LVMs to healthcare applications. This project also provides rich opportunities for multi-disciplinary education and research training at the undergraduate, graduate, and professional levels.
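As an illustration of component 1, the following minimal sketch shows one way a penalty can discourage redundant latent components. It is not taken from the project's publications: the Frobenius-norm penalty on the Gram matrix is one common orthogonality-promoting choice, and the names `orthogonality_penalty`, `regularized_loss`, and `strength` are hypothetical.

```python
def gram(vectors):
    """Gram matrix of pairwise inner products between the component vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def orthogonality_penalty(vectors):
    """Squared Frobenius distance between the Gram matrix and the identity.

    The value is zero when the components are orthonormal and grows as
    components overlap (become redundant) or drift away from unit norm.
    """
    g = gram(vectors)
    size = len(g)
    return sum(
        (g[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(size)
        for j in range(size)
    )


def regularized_loss(data_loss, vectors, strength=0.1):
    """Hypothetical objective: a data-fit term plus a weighted diversity penalty."""
    return data_loss + strength * orthogonality_penalty(vectors)


# Two nearly parallel components (redundant) versus two orthonormal ones.
redundant = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(orthogonality_penalty(redundant))  # clearly positive
print(orthogonality_penalty(diverse))    # zero for an orthonormal set
```

Under this assumed formulation, minimizing `regularized_loss` trades data fit against overlap between components, so a component that merely duplicates a frequent pattern is penalized and capacity is freed for less frequent ones.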
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Project outcome report:
Small: A New Approach to Latent Space Learning with Diversity-Inducing Regularization and Applications to Healthcare Data Analytics
We aim to develop generic, mathematically sound, and computationally efficient diversity-promoting techniques for latent variable models (LVMs) that address the challenges current LVMs face (poor coverage of long-tail patterns, redundancy among discovered patterns, and low computational efficiency), and thereby facilitate long-tail and more interpretable pattern and knowledge discovery from healthcare and generic big data. Our plan consists of the following thrusts:
1: Developing a diversity-promoting regularized latent space estimation framework
2: Developing a diversity-promoting Bayesian LVM learning framework
3: Theoretical analysis of the diversity-inducing techniques proposed in 1 and 2
4: Applying the diversified LVMs to healthcare applications
Over the past 4 years, we have followed the plan outlined above and achieved a rich body of scientific and software development outcomes. Below is a brief summary of these results:
For the core methodological development of Latent Space Learning with Diversity-Inducing Regularization, we have made the following innovations:
- Learning latent space models with angular constraints
- A new diversity-promoting regularizer based on uncorrelation and evenness
- Orthogonality-promoting distance metric learning: convex relaxation and theoretical analysis
- Near-orthogonality regularization in kernel methods
- Diversity-promoting Bayesian learning of latent variable models
- Tradeoffs of Linear Mixed Models in Genome-wide Association Studies
- A Network-structured High-dimension Variable Selection Method with P-values, for Gene Set Prioritization with Transcriptome Association Study Guided by Regulatory Network
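To make the idea behind angle-based diversity concrete, here is a small sketch that scores a set of components by their pairwise angles. The specific score (mean pairwise angle minus the variance of those angles) is an assumption chosen for illustration and is not confirmed by this report as the project's regularizer; the function names are hypothetical.

```python
import math
from itertools import combinations


def pairwise_angles(vectors):
    """Non-obtuse angle (in radians) between every pair of component vectors."""
    angles = []
    for u, v in combinations(vectors, 2):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        # abs() folds the angle into [0, pi/2]; min() guards against rounding above 1.
        cosine = min(1.0, abs(dot) / (norm_u * norm_v))
        angles.append(math.acos(cosine))
    return angles


def angular_diversity(vectors):
    """Illustrative score: mean pairwise angle minus the variance of the angles.

    Larger values mean the components are, on average, farther apart and
    evenly spread. Requires at least two nonzero vectors.
    """
    angles = pairwise_angles(vectors)
    mean = sum(angles) / len(angles)
    variance = sum((a - mean) ** 2 for a in angles) / len(angles)
    return mean - variance


orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
clustered = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]

print(angular_diversity(orthogonal))
print(angular_diversity(clustered))
```

A score like this could be added (with a negative sign) to a training loss, or used as a constraint, so that components are pushed toward large and evenly distributed angles; the orthogonal set above scores higher than the clustered one.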
On applications to healthcare problems, here are the major outcomes:
- A Generalized Zero-Shot Text Classification Framework for ICD Coding
- A Multimodal Machine Learning Framework for Automated ICD Coding
- A Neural Architecture for Automated ICD Coding
- Automatic Generation of Medical Imaging Reports
- Effective Use of Bidirectional Language Model for Medical Named Entity Recognition
- Nonoverlap-promoting variable selection
- Unsupervised Pseudo-Labeling for Extractive Summarization on Electronic Health Records
Overall, our work led to 14 publications in top machine learning and healthcare conferences, and at least 4 graduate students have been supported in part by this grant, including Dr. Pengtao Xie, who graduated in 2019 and is now an assistant professor at UCSD.
Regarding broader impact, this is the first time that diversity-promoting learning has been systematically studied. The study was conducted in both frequentist and Bayesian statistics, covering various regularizers, Bayesian priors, optimization algorithms, Bayesian inference algorithms, theoretical analysis, and extensive empirical evaluations. It lays a solid foundation for a potentially prominent new paradigm of learning, diversity-promoting learning, which enables us to address several fundamental issues in machine learning, including 1) how to better capture infrequent patterns; 2) how to reduce model size without sacrificing modeling power; 3) how to reduce generalization error; and 4) how to improve interpretability. The techniques developed in our work have been widely adopted in computer vision (CV), natural language processing (NLP), and healthcare applications, and we believe we have fulfilled the goals outlined in the original proposal.
Last Modified: 01/31/2022
Modified by: Eric P Xing