
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | August 23, 2016 |
Latest Amendment Date: | August 23, 2016 |
Award Number: | 1650080 |
Award Instrument: | Standard Grant |
Program Manager: | Weng-keen Wong, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 15, 2016 |
End Date: | July 31, 2018 (Estimated) |
Total Intended Award Amount: | $90,000.00 |
Total Awarded Amount to Date: | $90,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
1918 F ST NW WASHINGTON DC US 20052-0042 (202)994-0728 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
800 22nd Street NW Washington DC US 20052-0058 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Robust Intelligence |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
In the era of big data, unsupervised learning has become increasingly important. At a high level, unsupervised learning serves to reduce the size of the data while capturing its important underlying structure. For a powerful and widely used family of unsupervised learning techniques (those based on spectral methods), scaling up to large data sets poses significant computational challenges. This research project will develop extremely simple and lightweight sampling techniques for scaling up this family of unsupervised learning methods. Since big data is ubiquitous, these research advances are likely to be transformative to a range of fields. This project will benefit society through the research team's ongoing collaborations in climate science, agriculture, and finance. The team will also continue to engage the computer science community in this endeavor by training students, developing tutorials, and broadening the participation of women and minorities in computing.
This project will advance machine learning research by scaling up spectral methods for the analysis of large data sets. While spectral methods for the unsupervised learning tasks of clustering and embedding have found wide success in a variety of practical applications, scaling them up to large data sets poses significant computational challenges. In particular, the storage and computation needed to handle the affinity matrix (a matrix of pairwise similarities between data points) can be prohibitive. A promising alternative is to approximate this matrix rather than store and process it in full. The goal of this project is to provide simple approximation techniques that manage the tradeoff between space and time complexity on the one hand and the quality of the approximation on the other. The proposed approach involves sampling techniques that address this goal by exploiting latent structure in a data set, in order to minimize the amount of information that needs to be stored to (approximately) represent it. This leads to techniques that speed up the computation and reduce the memory requirements of spectral methods, while simultaneously providing better approximations. The project will also continue the team's momentum on leveraging advances in machine learning for data-driven discovery.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Many types of data are abundantly available in raw form, i.e., before being labeled for any classification or regression task (tasks typically referred to as supervised learning). For example, the number of photographs available online astronomically exceeds the number that have been labeled with any text at all, let alone meaningful text. When labels are not readily available, the resulting machine learning problem is known as unsupervised learning. The goal is typically to extract some latent structure in the data, such as features or clusters, in order to summarize the data, to make sense of it, or to reduce its size before further stages of the machine learning pipeline.
Spectral methods for the unsupervised learning tasks of clustering and embedding data have found wide success in a variety of practical applications, particularly on data that can be represented as a graph. However, scaling these methods up to large data sets poses significant computational challenges. In particular, the storage and computation needed to handle the affinity matrix (a matrix of pairwise similarities between data points) can be prohibitive. Our work on this project has contributed to scaling up spectral methods to big data. We have developed two lightweight algorithms that approximate the affinity matrix. Our approach helps to better manage the tradeoff between the computational burden and the quality of the approximation needed for finding meaningful clusters in the data.
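To illustrate the general idea of approximating an affinity matrix by sampling, the sketch below uses the standard Nyström method. This is not the project's own algorithm, which this report does not specify; Nyström is swapped in only as a well-known example of the same family. The Gaussian similarity, the function names, the `gamma` value, and the synthetic two-cluster data are all assumptions made for illustration.

```python
import numpy as np

def rbf_affinity(X, Y, gamma=1.0):
    """Gaussian (RBF) similarities between the rows of X and the rows of Y."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def nystrom_factors(X, num_landmarks, gamma=1.0, seed=0):
    """Return C (n x m) and pinv(W) (m x m) so that A is approximated by C @ pinv(W) @ C.T.

    Only n * m similarities are computed instead of all n * n.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=num_landmarks, replace=False)
    C = rbf_affinity(X, X[idx], gamma)  # similarities to the sampled points only
    W = C[idx]                          # m x m block among the sampled points
    return C, np.linalg.pinv(W, rcond=1e-10)

# Hypothetical data: two well-separated Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)), rng.normal(3.0, 0.3, (100, 2))])

C, W_pinv = nystrom_factors(X, num_landmarks=20, gamma=0.5)

# The full matrix is built here only to measure the approximation error;
# in a large-scale setting it would never be formed.
A_full = rbf_affinity(X, X, gamma=0.5)
A_approx = C @ W_pinv @ C.T
rel_err = np.linalg.norm(A_full - A_approx) / np.linalg.norm(A_full)
print(f"stored entries: {C.size + W_pinv.size} vs {A_full.size}, relative error: {rel_err:.2e}")
```

Increasing `num_landmarks` generally improves the approximation at the cost of more storage and computation, which is the tradeoff described above.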
To gain insight into our clustering research, we also analyzed clustering heuristics that have demonstrated strong empirical performance on a variety of applications but lacked solid theoretical foundations. In particular, we analyzed the convergence of stochastic variants of the k-means algorithm, e.g., online k-means and mini-batch k-means, which have enjoyed widespread success and deployment in several popular machine learning software packages.
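For readers unfamiliar with these heuristics, the following is a minimal sketch of one common form of mini-batch k-means, in which each center is updated with a per-center step size of 1/count. It is an illustration under stated assumptions, not the exact variant analyzed in this project; the function names, batch size, iteration count, and synthetic data are hypothetical.

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=32, num_iters=200, seed=0):
    """Mini-batch k-means with a per-center learning rate of 1 / (points seen)."""
    rng = np.random.default_rng(seed)
    # Initialize the centers at k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(num_iters):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign every batch point to its nearest center (centers fixed for the batch).
        d = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each assigned center toward its point with a decaying step size.
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = (1.0 - eta) * centers[j] + eta * x
    return centers

def kmeans_cost(X, centers):
    """Mean squared distance from each point to its nearest center."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()

# Hypothetical data: two well-separated Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)), rng.normal(3.0, 0.3, (100, 2))])

centers = minibatch_kmeans(X, k=2)
final_cost = kmeans_cost(X, centers)
baseline_cost = kmeans_cost(X, X.mean(axis=0, keepdims=True))  # one center at the global mean
print(f"mini-batch k-means cost: {final_cost:.3f}, single-center baseline: {baseline_cost:.3f}")
```

Setting `batch_size=1` turns the same update into an online k-means rule, which processes one point at a time.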
This project had an impact on the fields of machine learning and artificial intelligence, and on the training of graduate students.
Last Modified: 01/08/2019
Modified by: Claire Monteleoni