Award Abstract # 1319788
AF: Small: Learning and Testing Classes of Distributions

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: THE TRUSTEES OF COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK
Initial Amendment Date: June 3, 2013
Latest Amendment Date: June 3, 2013
Award Number: 1319788
Award Instrument: Standard Grant
Program Manager: Tracy Kimbrel
CCF - Division of Computing and Communication Foundations
CSE - Directorate for Computer and Information Science and Engineering
Start Date: June 1, 2013
End Date: May 31, 2016 (Estimated)
Total Intended Award Amount: $471,875.00
Total Awarded Amount to Date: $471,875.00
Funds Obligated to Date: FY 2013 = $471,875.00
History of Investigator:
  • Rocco Servedio (Principal Investigator)
    rocco@cs.columbia.edu
Recipient Sponsored Research Office: Columbia University
615 W 131ST ST
NEW YORK
NY  US  10027-7922
(212)854-6851
Sponsor Congressional District: 13
Primary Place of Performance: Columbia University
2960 Broadway
New York, NY  US  10027-6902
Primary Place of Performance Congressional District: 13
Unique Entity Identifier (UEI): F4N1QNPB95M4
Parent UEI:
NSF Program(s): ALGORITHMS
Primary Program Source: 01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7926
Program Element Code(s): 792600
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

A long and successful line of research in machine learning deals with algorithms that learn from "labeled" data, where a target function is assumed to provide a label for each data point. A major focus of theoretical work has been to develop efficient algorithms for learning different classes of target functions. Recent years have witnessed a data explosion across many domains of science and society, but much of this newly available data consists simply of example points (DNA sequences, sensor readings, smartphone user locations, etc.) without any labels. A natural model of such scenarios is that data points are generated according to some unknown probability distribution (typically over an extremely large domain). The goal of the proposed work is to study the learnability of different classes of probability distributions given access to samples drawn from the distributions. This is closely analogous to the framework of learning from labeled data sketched above, but with probability distributions playing the role of functions as the objects to be learned.

In this project, the PI will perform theoretical research on developing computationally efficient algorithms for learning and testing various natural types of probability distributions over extremely large domains. (Testing algorithms are algorithms which, instead of trying to accurately model an unknown distribution, have the more modest goal of testing whether or not the distribution has some property of interest.) Specific problems the PI will address include: (1) Developing efficient algorithms to learn and test univariate probability distributions that satisfy various natural kinds of "shape constraints" on the underlying probability density function. Preliminary results suggest that dramatic improvements in efficiency may be possible for algorithms that are designed to exploit this type of structure. (2) Developing efficient algorithms for learning and testing complex distributions that result from the aggregation of many independent simple sources of randomness.
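To make the notion of a distribution testing algorithm concrete, below is a minimal sketch of a classical collision-based uniformity tester; it illustrates the genre rather than any algorithm developed under this project, and the function name, slack factor, and sample sizes are illustrative assumptions. It accepts when the rate of pairwise collisions among the samples is close to 1/n, the rate attained by the uniform distribution over a domain of size n, and it uses far fewer samples than the domain size.

```python
# A minimal sketch of a distribution "testing" algorithm: a classical
# collision-based uniformity test, NOT an algorithm from this project.
# It accepts when the empirical pairwise-collision rate of the samples is
# close to 1/domain_size, the rate achieved by the uniform distribution.
# The slack factor and sample sizes below are illustrative choices.
import random
from collections import Counter

def collision_uniformity_test(samples, domain_size, slack=1.5):
    """Return True if the samples look consistent with uniformity."""
    m = len(samples)
    counts = Counter(samples)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    total_pairs = m * (m - 1) // 2
    return collisions / total_pairs <= slack / domain_size

n = 10**6                                                     # very large domain
uniform = [random.randrange(n) for _ in range(20000)]
skewed = [random.randrange(n // 100) for _ in range(20000)]   # lives on 1% of the domain
print(collision_uniformity_test(uniform, n))                  # typically True
print(collision_uniformity_test(skewed, n))                   # typically False
```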

The algorithms that the PI will work to develop can provide useful modeling tools in data-rich environments and may serve as a "computational substrate" on which large-scale machine learning applications can be developed for real-world problems spanning a broad range of application areas. Other important focuses of the grant are to train graduate students through research collaboration, to disseminate research results through seminar talks, survey articles, and other publications, and to continue ongoing outreach activities aimed at increasing elementary school students' interest in theoretical computer science.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

C. Canonne, T. Gouleakis, and R. Rubinfeld. "Sampling Correctors." 7th Innovations in Theoretical Computer Science (ITCS), 2016.
C. Canonne. "Are Few Bins Enough: Testing Histogram Distributions." 35th ACM Symposium on Principles of Database Systems (PODS), 2016.
C. Canonne, I. Diakonikolas, T. Gouleakis, and R. Rubinfeld. "Testing Shape Restrictions of Discrete Distributions." 33rd International Symposium on Theoretical Aspects of Computer Science (STACS), 2016.
C. Daskalakis, I. Diakonikolas, and R. A. Servedio. "Learning Poisson Binomial Distributions." Algorithmica (Special Issue on Computational Learning Theory, by invitation), v.72, 2015.
C. Daskalakis, I. Diakonikolas, and R. A. Servedio. "Efficient algorithms for density estimation of univariate k-modal distributions via property testing." Theory of Computing, v.10, 2014.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

A well-known class of problems in statistics, usually referred to as 
"density estimation" problems, involve inferring an unknown probability 
distribution on the basis of samples drawn from the distribution.  The 
main goal of this project was to study problems of this sort from the 
vantage point of theoretical computer science.  The statistics 
perspective typically focuses on how much data is required to perform 
the relevant inference tasks; in theoretical computer science, the 
emphasis is not only on data requirements but also on the computational
efficiency of algorithms to solve the relevant problems.  This project 
aimed to develop provably correct and efficient algorithms for inferring 
various types of probability distributions.  Below we describe two of 
the main results achieved in the course of this project.

One significant result was a new and general approach for learning 
"univariate" probability distributions (these are simply distributions 
where each draw from the distribution is a single number; a very simple 
example of such a distribution is the outcome obtained from rolling a 
die, since the possible distinct outcomes are simply the numbers one 
through six).  The PI developed a general algorithm that can efficiently 
learn any univariate distribution whose probability density function can 
be approximated by a collection of polynomial curves over different 
portions of the domain (i.e. any distribution that can be approximated 
by a piecewise polynomial function).  The algorithm obtained is provably 
best possible in terms of the amount of data that it requires in order 
to come up with a hypothesis distribution achieving a given accuracy 
level.  The PI also showed that a wide range of different well-studied 
types of probability distributions (such as k-modal distributions, 
monotone hazard rate distributions, various types of mixture 
distributions, etc.) can be approximated by these piecewise polynomial
functions, so the new general method has broad applicability.
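The following is a minimal sketch of the piecewise polynomial idea, not the PI's algorithm (which in particular chooses the pieces adaptively and achieves optimal sample complexity): it splits the observed range into equal-width pieces and least-squares fits a low-degree polynomial to a fine histogram within each piece. The function names, piece count, polynomial degree, and bin counts are illustrative assumptions.

```python
# Illustrative sketch only: fit a piecewise polynomial density estimate by
# splitting the observed range into equal-width pieces and fitting a
# low-degree polynomial to a fine histogram within each piece.
import numpy as np

def piecewise_poly_density(samples, pieces=8, degree=2, bins_per_piece=20):
    lo, hi = samples.min(), samples.max()
    edges = np.linspace(lo, hi, pieces + 1)
    fits = []
    for left, right in zip(edges[:-1], edges[1:]):
        hist, bin_edges = np.histogram(samples, bins=bins_per_piece,
                                       range=(left, right), density=True)
        centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
        # Rescale so each piece integrates to its empirical probability mass.
        mass = np.mean((samples >= left) & (samples < right))
        coeffs = np.polyfit(centers, hist * mass, degree)
        fits.append((left, right, coeffs))
    return fits

def evaluate(fits, x):
    for left, right, coeffs in fits:
        if left <= x <= right:
            return max(np.polyval(coeffs, x), 0.0)   # clip negative estimates
    return 0.0

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=50000)   # stand-in for an unknown distribution
fits = piecewise_poly_density(samples)
print(evaluate(fits, 0.0))   # should be near the true density value, about 0.40
```

Even this crude version conveys why piecewise structure helps: each piece only needs enough data to pin down a handful of polynomial coefficients, rather than a separate probability for every point of a huge domain.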

Another significant result was for learning probability distributions 
that are obtained by aggregating many simple independent distributions.  
As a concrete example, suppose that N people (where N is a very large number)
each separately roll their own k-sided die, where each die is labeled 
with the numbers 1,...,k but may be skewed in an unknown and arbitrary 
way (one die may output 2 one-quarter of the time and 4 three-quarters 
of the time; another may be fair and output all values 1,...,k 
equiprobably; etc.).  The total of all die rolls is added up, and this
comprises a single draw from the aggregate distribution.  Is it possible 
to learn this aggregate distribution efficiently?  The PI gave an 
efficient algorithm to learn this aggregate distribution to high 
accuracy from an essentially optimal (minimal) number of samples --- 
perhaps surprisingly, the running time and amount of data required by
the algorithm are completely independent of the number N of component
independent distributions!  Work is underway to extend this
result to even more general types of aggregated distributions.  
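The following sketch simulates the aggregation setting just described; it does not reproduce the PI's learning algorithm or its guarantee, and all names and parameter values in it are illustrative assumptions. It builds N arbitrarily skewed k-sided dice, draws samples of their summed rolls using a sample budget chosen independently of N, and forms a crude empirical estimate of the aggregate distribution.

```python
# Illustrative simulation of the aggregation setting (not the PI's algorithm):
# N arbitrarily skewed k-sided dice are rolled and their totals summed.
import numpy as np

rng = np.random.default_rng(1)
N, k = 10000, 6                     # many independent, arbitrarily skewed k-sided dice

# Each row is one die's (unknown, arbitrary) probability vector over faces 1..k.
dice = rng.dirichlet(np.ones(k), size=N)
faces = np.arange(1, k + 1)

def draw_aggregate(num_draws):
    """Each draw is the total of one roll of every die."""
    totals = np.zeros(num_draws, dtype=np.int64)
    for p in dice:
        totals += rng.choice(faces, size=num_draws, p=p)
    return totals

samples = draw_aggregate(1000)      # the sample budget here does not grow with N
# A crude "learner": the empirical distribution over the observed values.
values, counts = np.unique(samples, return_counts=True)
estimate = dict(zip(values.tolist(), (counts / counts.sum()).tolist()))
# The draws concentrate in a window far narrower than the full support [N, k*N],
# which hints at why a small, N-independent sample can suffice.
print(values.min(), values.max())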

The efficient algorithms described above for learning different types of 
probability distributions may serve as a useful "computational 
substrate" for systems that do large-scale analysis of probabilistic 
data.  In terms of broader impacts, besides the scientific impact of the 
work, this award contributed to the development of human resources for 
the STEM and academic workforce.  A number of Ph.D. students at Columbia 
University received partial support and research training as a result of 
this grant award.

 


Last Modified: 08/23/2016
Modified by: Rocco A Servedio
