Award Abstract # 1650080
EAGER: Novel sampling algorithms for scaling up spectral methods for unsupervised learning

NSF Org: IIS (Division of Information & Intelligent Systems)
Recipient: GEORGE WASHINGTON UNIVERSITY (THE)
Initial Amendment Date: August 23, 2016
Latest Amendment Date: August 23, 2016
Award Number: 1650080
Award Instrument: Standard Grant
Program Manager: Weng-keen Wong
  IIS: Division of Information & Intelligent Systems
  CSE: Directorate for Computer and Information Science and Engineering
Start Date: August 15, 2016
End Date: July 31, 2018 (Estimated)
Total Intended Award Amount: $90,000.00
Total Awarded Amount to Date: $90,000.00
Funds Obligated to Date: FY 2016 = $90,000.00
History of Investigator:
  • Claire Monteleoni (Principal Investigator)
    cmontel@colorado.edu
Recipient Sponsored Research Office: George Washington University
1918 F ST NW
WASHINGTON
DC  US  20052-0042
(202)994-0728
Sponsor Congressional District: 00
Primary Place of Performance: George Washington University
800 22nd Street NW
Washington
DC  US  20052-0058
Primary Place of Performance Congressional District: 00
Unique Entity Identifier (UEI): ECR5E2LU5BL6
Parent UEI:
NSF Program(s): Robust Intelligence
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7495, 7916
Program Element Code(s): 749500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

In the era of big data, unsupervised learning has become increasingly important. At a high level, unsupervised learning serves to reduce the size of the data while capturing its important underlying structure. For a powerful and widely used family of unsupervised learning techniques (those based on spectral methods), scaling up to large data sets poses significant computational challenges. This research project will develop extremely simple and lightweight sampling techniques for scaling up this family of unsupervised learning methods. Since big data is ubiquitous, these research advances are likely to be transformative to a range of fields. This project will benefit society through the research team's ongoing collaborations in climate science, agriculture, and finance. The team will also continue to engage the computer science community in this endeavor, by training students, developing tutorials, and broadening the participation of women and minorities in computing.

This project will advance machine learning research by scaling up spectral methods for the analysis of large data sets. While spectral methods for the unsupervised learning tasks of clustering and embedding have found wide success in a variety of practical applications, scaling them up to large data sets poses significant computational challenges. In particular, the storage and computation needed to handle the affinity matrix (a matrix of pairwise similarities between data points) can be prohibitive. A promising alternative is to approximate this matrix in some sense. The goal of this project is to provide simple approximation techniques that manage the tradeoff between space and time complexity on one hand and the quality of the approximation on the other. The proposed approach involves sampling techniques that address this goal by exploiting latent structure in a data set, in order to minimize the amount of information that needs to be stored to (approximately) represent it. This leads to techniques that speed up the computation and reduce the memory requirements of spectral methods, while simultaneously providing better approximations. The project will also continue the team's momentum on leveraging advances in machine learning for data-driven discovery.
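To make the bottleneck concrete, the sketch below builds the full n-by-n Gaussian affinity matrix and extracts a spectral embedding from its top eigenvectors. This is a generic textbook-style pipeline, not the project's algorithms; the function name `spectral_embedding` and the choice of an RBF kernel are illustrative assumptions. The dense matrix `A` is exactly the O(n^2) object whose storage and computation the project aims to avoid.

```python
import numpy as np

def spectral_embedding(X, sigma=1.0, k=2):
    """Minimal spectral-embedding sketch: build the full n-by-n
    Gaussian affinity matrix, symmetrically normalize it, and keep
    its top-k eigenvectors. Storing A costs O(n^2) memory."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    A = np.exp(-sq / (2 * sigma ** 2))                   # Gaussian affinities
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))                      # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)                          # eigenvalues in ascending order
    return vecs[:, -k:]                                  # top-k eigenvectors as embedding

# Two well-separated blobs; their rows separate in the embedding.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
emb = spectral_embedding(X)
```

Note that both the O(n^2) memory for `A` and the eigendecomposition cost grow quickly with n, which motivates the sampling-based approximations studied in the project.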

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Mohan, Mahesh and Monteleoni, Claire. "Beyond the Nyström approximation: Speeding up spectral clustering using uniform sampling and weighted kernel k-means." Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI 2017), 2017.
Mohan, Mahesh and Monteleoni, Claire. "Exploiting sparsity to improve the accuracy of Nyström-based large-scale spectral clustering." 2017 International Joint Conference on Neural Networks (IJCNN), 2017. doi:10.1109/IJCNN.2017.7965829
Tang, Cheng and Monteleoni, Claire. "Convergence Rate of Stochastic k-means." Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR vol. 54, 2017.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Many types of data are abundantly available in raw form, i.e., prior to being labeled for any classification or regression task (tasks typically referred to as supervised learning). For example, the number of photographs available online astronomically exceeds the number that have been labeled with any text at all, let alone meaningful descriptions. When labels are not readily available, the resulting machine learning problem is known as unsupervised learning. The goal is typically to extract some latent structure in the data, such as features or clusters, in order to summarize the data, to make sense of it, or to reduce its size before further stages of the machine learning pipeline.


Spectral methods for the unsupervised learning tasks of clustering and embedding data have found wide success in a variety of practical applications, particularly on data that can be represented as a graph. However, scaling these methods up to large data sets poses significant computational challenges. In particular, the storage and computation needed to handle the affinity matrix (a matrix of pairwise similarities between data points) can be prohibitive. Our work on this project has contributed to scaling up spectral methods to big data. We have developed two lightweight algorithms that approximate the affinity matrix. Our approach helps to better manage the tradeoff between the computational burden and the quality of the approximation needed for finding meaningful clusters in the data.
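The project's two algorithms are not reproduced here, but the general Nyström idea they build on (named in the publications above) can be sketched: approximate the n-by-n affinity matrix from a uniform sample of m landmark columns, so that only an n-by-m block C and an m-by-m block W are ever formed. The function `nystrom_affinity` and its parameters are our own illustrative names under that assumption, not the project's code.

```python
import numpy as np

def nystrom_affinity(X, m, sigma=1.0, seed=0):
    """Nystrom sketch: approximate the n-by-n Gaussian affinity
    matrix from m uniformly sampled landmark columns, storing only
    an n-by-m block C and an m-by-m block W."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)            # uniform landmark sample
    sq = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1)
    C = np.exp(-sq / (2 * sigma ** 2))                         # n-by-m affinities
    W = C[idx]                                                 # m-by-m landmark block
    return C @ np.linalg.pinv(W) @ C.T                         # rank-m approximation of A

# 100 points, but only 100x20 and 20x20 blocks are ever computed.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
A_hat = nystrom_affinity(X, m=20)
```

With m = n (sampling every point) the reconstruction recovers the exact affinity matrix, and for m << n the memory footprint drops from O(n^2) to O(nm), which is the tradeoff between cost and approximation quality described above.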


To gain insight into our clustering research, we also analyzed clustering heuristics that have demonstrated strong empirical performance in a variety of applications but lacked solid theoretical foundations. In particular, we analyzed the convergence of stochastic k-means variants, e.g., online k-means and mini-batch k-means, which have enjoyed widespread success and deployment in several popular machine learning software packages.
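Mini-batch k-means, one of the heuristics mentioned, can be sketched as follows. This is a generic textbook-style version with a decaying per-center 1/count learning rate, assumed here for illustration; it is not taken from the project's analysis, and the function and parameter names are hypothetical.

```python
import numpy as np

def minibatch_kmeans(X, centers, batch=10, iters=100, seed=0):
    """Generic mini-batch k-means sketch: each step assigns a small
    uniform batch to the nearest centers and moves each center toward
    its assigned points with a decaying 1/count rate, so every center
    tracks the running mean of the points assigned to it."""
    rng = np.random.default_rng(seed)
    centers = centers.astype(float).copy()
    counts = np.zeros(len(centers))
    for _ in range(iters):
        B = X[rng.choice(len(X), size=batch, replace=False)]       # random mini-batch
        d = ((B[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # batch-to-center distances
        for j, x in zip(d.argmin(axis=1), B):
            counts[j] += 1
            eta = 1.0 / counts[j]                                  # decaying per-center rate
            centers[j] = (1 - eta) * centers[j] + eta * x
    return centers

# Two tight blobs; rough initial guesses converge to the blob means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
final = minibatch_kmeans(X, centers=np.array([[1.0, 1.0], [4.0, 4.0]]))
```

Because each iteration touches only a small batch rather than the full data set, the per-step cost is independent of n, which is precisely why these variants scale to the large data sets this project targets.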


This project had an impact on the fields of machine learning and artificial intelligence, and on the training of graduate students.



Last Modified: 01/08/2019
Modified by: Claire Monteleoni
