
NSF Org: | CCF Division of Computing and Communication Foundations |
Recipient: | Columbia University |
Initial Amendment Date: | June 3, 2013 |
Latest Amendment Date: | June 3, 2013 |
Award Number: | 1319788 |
Award Instrument: | Standard Grant |
Program Manager: | Tracy Kimbrel, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | June 1, 2013 |
End Date: | May 31, 2016 (Estimated) |
Total Intended Award Amount: | $471,875.00 |
Total Awarded Amount to Date: | $471,875.00 |
Recipient Sponsored Research Office: | 615 W 131ST ST, NEW YORK, NY, US 10027-7922, (212) 854-6851 |
Primary Place of Performance: | 2960 Broadway, New York, NY, US 10027-6902 |
NSF Program(s): | ALGORITHMS |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
A long and successful line of research in machine learning deals with algorithms that learn from "labeled" data, where a target function is assumed to provide a label for each data point. A major focus of theoretical work has been to develop efficient algorithms for learning different classes of target functions. Recent years have witnessed a data explosion across many domains of science and society, but much of this newly available data consists simply of example points (DNA sequences, sensor readings, smartphone user locations, etc.) without any labels. A natural model of such scenarios is that data points are generated according to some unknown probability distribution (typically over an extremely large domain). The goal of the proposed work is to study the learnability of different classes of probability distributions given access to samples drawn from the distributions. This is closely analogous to the framework of learning from labeled data sketched above, but with probability distributions playing the role of functions as the objects to be learned.
In this project, the PI will perform theoretical research on developing computationally efficient algorithms for learning and testing various natural types of probability distributions over extremely large domains. (Testing algorithms are algorithms which, instead of trying to accurately model an unknown distribution, have the more modest goal of testing whether or not the distribution has some property of interest.) Specific problems the PI will address include: (1) Developing efficient algorithms to learn and test univariate probability distributions that satisfy various natural kinds of "shape constraints" on the underlying probability density function. Preliminary results suggest that dramatic improvements in efficiency may be possible for algorithms that are designed to exploit this type of structure. (2) Developing efficient algorithms for learning and testing complex distributions that result from the aggregation of many independent simple sources of randomness.
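As a concrete illustration of the "testing" setting described above, the following sketch (an editorial illustration, not an algorithm from this award) implements the classical collision-based uniformity tester: it draws samples from an unknown distribution over a domain of size n, counts how many pairs of samples collide, and compares the empirical collision rate to the 1/n rate that the uniform distribution would produce. The sample size and threshold below are illustrative choices.

    import random
    from collections import Counter

    def collision_uniformity_test(samples, domain_size, epsilon):
        # The uniform distribution over n elements has collision probability 1/n;
        # a distribution that is epsilon-far from uniform (in total variation
        # distance) has collision probability at least (1 + 4*epsilon**2)/n.
        # Accept "uniform" if the empirical collision rate falls below a point
        # between those two values.
        counts = Counter(samples)
        m = len(samples)
        colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
        total_pairs = m * (m - 1) // 2
        collision_rate = colliding_pairs / total_pairs
        return collision_rate <= (1 + 2 * epsilon ** 2) / domain_size

    # Usage sketch (illustrative parameters): accept genuinely uniform samples,
    # reject samples drawn uniformly from only half of the domain.
    n = 1000
    uniform_samples = [random.randrange(n) for _ in range(20000)]
    far_samples = [random.randrange(n // 2) for _ in range(20000)]
    print(collision_uniformity_test(uniform_samples, n, epsilon=0.25))   # True
    print(collision_uniformity_test(far_samples, n, epsilon=0.25))       # False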
The algorithms that the PI will work to develop can provide useful modelling tools in data-rich environments and may serve as a "computational substrate" on which large-scale machine learning applications can be developed for real-world problems spanning a broad range of application areas. Other important focuses of the grant are to train graduate students through research collaboration, to disseminate research results through seminar talks, survey articles, and other publications, and to continue ongoing outreach activities aimed at increasing elementary school students' interest in topics from theoretical computer science.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
A well-known class of problems in statistics, usually referred to as "density estimation" problems, involves inferring an unknown probability distribution on the basis of samples drawn from the distribution. The main goal of this project was to study problems of this sort from the vantage point of theoretical computer science. The statistics perspective typically focuses on how much data is required to perform the relevant inference tasks; in theoretical computer science, the emphasis is not only on data requirements but also on the computational efficiency of algorithms that solve the relevant problems. This project aimed to develop provably correct and efficient algorithms for inferring various types of probability distributions. Below we describe two of the main results achieved in the course of this project.

One significant result was a new and general approach for learning "univariate" probability distributions (these are simply distributions where each draw from the distribution is a single number; a very simple example of such a distribution is the outcome obtained from rolling a die, since the possible distinct outcomes are simply the numbers one through six). The PI developed a general algorithm that can efficiently learn any univariate distribution whose probability density function can be approximated by a collection of polynomial curves over different portions of the domain (i.e., any distribution that can be approximated by a piecewise polynomial function). The algorithm obtained is provably best possible in terms of the amount of data it requires in order to produce a hypothesis distribution achieving a given accuracy level. The PI also showed that a wide range of well-studied types of probability distributions (such as k-modal distributions, monotone hazard rate distributions, various types of mixture distributions, etc.) can be approximated by these piecewise polynomial functions, so the new general method has broad applicability.
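As an editorial aside (not part of the PI's submitted report), the following sketch shows the simplest special case of the piecewise viewpoint described above: fitting a piecewise-constant ("histogram") density, i.e. a degree-0 piecewise polynomial, to samples from an unknown univariate distribution. Unlike the algorithm summarized above, the interval breakpoints here are fixed in advance rather than chosen to fit the data near-optimally; all names and parameters are illustrative.

    import numpy as np

    def learn_histogram_density(samples, breakpoints):
        # Fit a density that is constant on each interval [b_i, b_{i+1});
        # this is the degree-0 case of a piecewise-polynomial hypothesis.
        samples = np.asarray(samples, dtype=float)
        breakpoints = np.asarray(breakpoints, dtype=float)
        counts, _ = np.histogram(samples, bins=breakpoints)
        widths = np.diff(breakpoints)
        densities = counts / (len(samples) * widths)   # empirical mass / width

        def pdf(x):
            idx = np.searchsorted(breakpoints, x, side="right") - 1
            idx = np.clip(idx, 0, len(densities) - 1)
            return densities[idx]

        return pdf

    # Usage sketch: a 20-piece histogram hypothesis for a two-bump distribution.
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(-2.0, 0.5, 5000), rng.normal(1.0, 1.0, 5000)])
    bps = np.linspace(data.min(), data.max(), 21)      # 20 equal-width intervals
    pdf_hat = learn_histogram_density(data, bps)
    print(pdf_hat(np.array([-2.0, 0.0, 1.0])))         # estimated density values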
Another significant result was for learning probability distributions that are obtained by aggregating many simple independent distributions. As a concrete example, suppose that each of N people (where N is a very large number) separately rolls his or her own k-sided die, where each die is labeled with the numbers 1,...,k but may be skewed in an unknown and arbitrary way (one die may output 2 one-quarter of the time and 4 three-quarters of the time; another may be fair and output all values 1,...,k equiprobably; etc.). The total of all the die rolls is added up, and this sum constitutes a single draw from the aggregate distribution. Is it possible to learn this aggregate distribution efficiently? The PI gave an efficient algorithm that learns the aggregate distribution to high accuracy from an essentially optimal (minimal) number of samples; perhaps surprisingly, the running time and the amount of data required by the algorithm are completely independent of the number N of component independent distributions. Ongoing work is underway to extend this result to even more general types of aggregated distributions.

The efficient algorithms described above for learning different types of probability distributions may serve as a useful "computational substrate" for systems that perform large-scale analysis of probabilistic data. In terms of broader impacts, besides the scientific impact of the work, this award contributed to the development of human resources for the STEM and academic workforce: a number of Ph.D. students at Columbia University received partial support and research training as a result of this grant award.
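To make the die-aggregation example above concrete, here is a small editorial simulation sketch (again, not the algorithm developed under this award): it draws samples from the aggregate distribution formed by summing N independent, arbitrarily skewed k-sided dice, and then fits a naive hypothesis, namely a Gaussian with the sample mean and standard deviation evaluated at integer outcomes. This crude fit is only reasonable when the aggregate really is close to a discretized Gaussian; it is meant solely to illustrate the setting, and every name and parameter below is an illustrative choice.

    import numpy as np

    def sample_aggregate(dice_probs, num_samples, rng):
        # Each row of dice_probs is one die's probability vector over faces 1..k.
        # A single draw from the aggregate distribution is the sum of one roll
        # of every die; we draw num_samples such sums.
        num_faces = dice_probs.shape[1]
        faces = np.arange(1, num_faces + 1)
        totals = np.zeros(num_samples, dtype=np.int64)
        for probs in dice_probs:
            totals += rng.choice(faces, size=num_samples, p=probs)
        return totals

    def naive_gaussian_fit(samples):
        # Illustrative stand-in for a structured hypothesis: a Gaussian with the
        # sample mean and standard deviation, evaluated at integer outcomes.
        mu, sigma = samples.mean(), samples.std()
        return lambda t: (np.exp(-0.5 * ((np.asarray(t, float) - mu) / sigma) ** 2)
                          / (sigma * np.sqrt(2.0 * np.pi)))

    # Usage sketch: N = 5,000 arbitrarily skewed six-sided dice.
    rng = np.random.default_rng(1)
    dice_probs = rng.dirichlet(np.ones(6), size=5000)   # one arbitrary skew per die
    samples = sample_aggregate(dice_probs, num_samples=1000, rng=rng)
    pmf_hat = naive_gaussian_fit(samples)
    center = int(samples.mean())
    print(center, pmf_hat([center - 100, center, center + 100]))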
Last Modified: 08/23/2016
Modified by: Rocco A Servedio