Award Abstract # 1422830
AF: Small: Geometry and High-dimensional Inference

NSF Org: CCF (Division of Computing and Communication Foundations)
Recipient: OHIO STATE UNIVERSITY, THE
Initial Amendment Date: July 25, 2014
Latest Amendment Date: April 25, 2017
Award Number: 1422830
Award Instrument: Standard Grant
Program Manager: Rahul Shah
CCF — Division of Computing and Communication Foundations
CSE — Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2014
End Date: January 31, 2018 (Estimated)
Total Intended Award Amount: $450,000.00
Total Awarded Amount to Date: $450,000.00
Funds Obligated to Date: FY 2014 = $450,000.00
History of Investigator:
  • Mikhail Belkin (Principal Investigator)
    mbelkin@ucsd.edu
  • Luis Rademacher (Former Principal Investigator)
  • Mikhail Belkin (Former Co-Principal Investigator)
Recipient Sponsored Research Office: Ohio State University
1960 KENNY RD
COLUMBUS
OH  US  43210-1016
(614)688-8735
Sponsor Congressional District: 03
Primary Place of Performance: Ohio State University
OH  US  43210-1063
Primary Place of Performance Congressional District: 03
Unique Entity Identifier (UEI): DLWBSLWAJWR1
Parent UEI: MN4MDDMN8529
NSF Program(s): Algorithmic Foundations
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7926
Program Element Code(s): 779600
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Data analysis is ubiquitous in a broad range of application fields, from computer graphics to geographic information systems, from sensor networks to social networks, from economics to medicine. It represents a fundamental problem in computational science. The project will advance the theoretical understanding of fundamental issues behind data analysis, and develop practical algorithms that will be useful for a broad range of problems in science and engineering.

The project addresses the fundamental problem of reconstructing the structure of probability distributions from sampled data. It will investigate the use of tensor-based and other higher-order methods, in particular those that allow for efficient optimization. The project lies at the interface of theoretical computer science, machine learning, signal processing, and statistics, and has potential impact in all of these fields. In recent years there has been a resurgence of interest in tensor methods in data analysis and inference, particularly in theoretical computer science. These methods are expected to prove useful in a variety of applications in machine learning, signal processing, and other fields.

The project will develop algorithms for solving a range of problems, including blind source separation, spectral clustering, inference in mixture models, and estimating the geometry of distributions. It will analyze the complexity of these and related problems. In particular, it will strive to understand computational efficiency and its dependence on the dimension of the space, studying "the curses and blessings of dimensionality". It will also address a somewhat mysterious discrepancy between sample complexity and algorithmic complexity in our current understanding of many high-dimensional inference problems.

The results of this work will be disseminated to the broad scientific community through journal and conference publications and presentations in various venues, including tutorials. The goals of this project include implementing the practical algorithms and making the software available online. The research results will also be incorporated into the curriculum of graduate classes taught by the PI and the co-PI. Graduate students supported by this project will receive extensive training in theory, algorithm development, and applications.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


J. Eldridge, M. Belkin, Y. Wang, "Unperturbed: spectral analysis beyond Davis-Kahan," Algorithmic Learning Theory (ALT), 2018.
Joseph Anderson, Navin Goyal, Anupama Nandi, and Luis Rademacher, "Heavy-Tailed Analogues of the Covariance Matrix for ICA," Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), San Francisco, California, USA, February 4-9, 2017.
Justin Eldridge, Mikhail Belkin, Yusu Wang, "Graphons, mergeons, and so on!" Advances in Neural Information Processing Systems (NIPS), 2016.
Mikhail Belkin, Luis Rademacher, James Voss, "Eigenvectors of Orthogonally Decomposable Functions," SIAM Journal on Computing, 2018.
Siyuan Ma, Mikhail Belkin, "Diving into the shallows: a computational perspective on large-scale shallow learning," Advances in Neural Information Processing Systems (NIPS), 2017.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The amount of data that we generate and collect grows at an accelerating pace. If we want to extract useful information from this data, we need better and better tools. One source of difficulty is the growing number of attributes, or dimensionality, which makes data analysis computationally expensive. This research project started with the aim of developing better tools for the analysis of high-dimensional data. The proposed approach is to use geometry as a way of understanding the structure of data, namely, to answer the question: What is the shape of our data?


A central problem in data analysis is to find an insightful representation of the data. The geometry of this problem can be seen in the following analogy: to understand the shape of a three-dimensional object from a single viewing angle, we may need to choose a good orientation of the object. In real-world data, the number of attributes is commonly far greater than three, and the representation problem tends to become both more difficult and more relevant as that number grows. One of the main outcomes of this project is a unified framework for discovering a good representation of a dataset. Among the representations considered are those in which the attributes behave independently (known as Independent Component Analysis). We provided and analyzed simple yet general methods that have strong theoretical guarantees even when the data is corrupted by noise. We have also developed practical algorithms for the important problem of clustering, i.e., grouping objects together based on the similarities between their attributes.
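The Independent Component Analysis setting mentioned above can be illustrated with a minimal sketch. This is a textbook-style illustration on synthetic data, not the project's own algorithm: it mixes two independent non-Gaussian signals, whitens the mixture, and then grid-searches for the rotation that makes the components look most non-Gaussian (here measured by a simple kurtosis-based contrast).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Two independent non-Gaussian sources (synthetic, for illustration only)
s1 = rng.uniform(-1, 1, n)   # sub-Gaussian (negative excess kurtosis)
s2 = rng.laplace(0, 1, n)    # super-Gaussian (positive excess kurtosis)
S = np.vstack([s1, s2])

# Observe only an unknown linear mixture X = A @ S
A = np.array([[2.0, 1.0], [1.0, 1.5]])
X = A @ S

# Whiten: center, decorrelate, and rescale to identity covariance
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / n)
Z = np.diag(d ** -0.5) @ E.T @ X

def excess_kurtosis(y):
    return np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2

def rotation(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

# After whitening, demixing in 2-D reduces to a single rotation angle;
# maximize the sum of squared kurtoses over a grid of angles.
best_R = max(
    (rotation(t) for t in np.linspace(0, np.pi / 2, 180)),
    key=lambda R: sum(excess_kurtosis(r) ** 2 for r in R @ Z),
)
Y = best_R @ Z

# Each recovered component should match one true source up to sign and scale
corr = np.abs(np.corrcoef(np.vstack([Y, S]))[:2, 2:])
print(corr.max(axis=1))
```

In two dimensions the demixing step after whitening reduces to a single rotation angle, which is why a grid search suffices; practical methods such as FastICA replace the grid search with fixed-point iterations that scale to many dimensions.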


Last Modified: 06/11/2018
Modified by: Mikhail Belkin

