Award Abstract # 1502780
BIGDATA: Mid-Scale: ESCE: Collaborative Research: Discovery and Social Analytics for Large-Scale Scientific Literature

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: THE TRUSTEES OF COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK
Initial Amendment Date: March 9, 2015
Latest Amendment Date: November 15, 2016
Award Number: 1502780
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2014
End Date: December 31, 2017 (Estimated)
Total Intended Award Amount: $643,545.00
Total Awarded Amount to Date: $643,545.00
Funds Obligated to Date: FY 2012 = $643,545.00
History of Investigator:
  • David Blei (Principal Investigator)
    david.blei@columbia.edu
Recipient Sponsored Research Office: Columbia University
615 W 131ST ST
NEW YORK
NY  US  10027-7922
(212)854-6851
Sponsor Congressional District: 13
Primary Place of Performance: Columbia University
NY  US  10027-6902
Primary Place of Performance Congressional District: 13
Unique Entity Identifier (UEI): F4N1QNPB95M4
Parent UEI:
NSF Program(s): Big Data Science & Engineering
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 7924, 8083
Program Element Code(s): 808300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Big data analytics is, fundamentally, the problem of bringing the massive amounts of data produced today down to human scale. In particular, scientists, engineers, physicians, and many others in knowledge-intensive professions face data that is beyond human scale. This data resides in the repositories that collect the data, reports, and results in their fields. This project will address the problem of bringing all this knowledge under control by using even more data, namely the individual and social patterns of how these repositories are accessed and used, and user-specific judgments (valuations) of the data. The proposed research will develop novel algorithms and an open-source infrastructure for improving discovery within and access to data repositories. These algorithms will aggregate and analyze the social analytic data gathered from professional communities of data users, and will motivate those users to participate by providing them with recommendations.

The transformative goal is to develop methods for organizing and operationalizing the access and preference patterns of users of large repositories, and for integrating those valuations to accelerate discovery within the collections. Diverse human minds interacting with data collections, as they carry out their own research or operational activities, provide a powerful source of information about the value of the data itself. Those data items may be textual documents, numerical datasets, or other kinds of media content. The novel methods for representing, aggregating, organizing, and valuating interactions between the users and the items can reveal structures within data collections that were previously invisible to any individual. This discovery of interrelations within data, driven by the capture of human intelligence, will accelerate the processes of scientific discovery. Users who are permitted to valuate data, and who are motivated by receiving valuable recommendations in return, reveal more about their own interests. This makes it possible to discover relations among the data items and among the users themselves. The educational goals are to: (a) contribute to the education of specific graduate students supported by the project, and of undergraduates via the REU mechanism; (b) generate new educational materials related to algorithmic innovations and to research findings; and (c) improve access to and discovery within specific collections of materials. Research findings will be included in courses at all three collaborating universities.

Additional information about the project (including publications, software, and data sets) will be made available through the project web site: http://arxiv_xs.rutgers.edu/.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 14)
A. B. Dieng, C. Wang, J. Gao, and J. Paisley. "TopicRNN: A Recurrent Neural Network With Long-Range Semantic Dependency." ICLR - International Conference on Learning Representations, 2017
A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. M. Blei. "Variational Inference via $\chi$ Upper Bound Minimization." NIPS - Neural Information Processing Systems, 2017
A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. Blei. "Automatic differentiation variational inference." Journal of Machine Learning Research, 2017
C. Naesseth, F. Ruiz, S. W. Linderman, and D. M. Blei. "Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms." AISTATS - International Conference on Artificial Intelligence and Statistics, 2017
D. Blei, A. Kucukelbir, and J. McAuliffe. "Variational inference: A review for statisticians." Journal of the American Statistical Association, v.112, 2017
D. M. Blei and P. Smyth. "Science and data science." Proceedings of the National Academy of Sciences, v.114, 2017
D. Tran, M. D. Hoffman, R. A. Saurous, E. Brevdo, K. Murphy, and D. M. Blei. "Deep probabilistic programming." ICLR - International Conference on Learning Representations, 2017
D. Tran, R. Ranganath, and D. M. Blei. "Deep and hierarchical implicit models." NIPS - Neural Information Processing Systems, 2017
L. Liu, F. J. R. Ruiz, and D. M. Blei. "Context selection for embedding models." NIPS - Neural Information Processing Systems, 2017
M. Rudolph, F. J. R. Ruiz, and D. M. Blei. "Structured embedding models for grouped data." NIPS - Neural Information Processing Systems, 2017
P. Gopalan, W. Hao, D. Blei, and J. Storey. "Scaling probabilistic models of genetic variation to millions of humans." Nature Genetics, 2016. DOI: 10.1038/ng.3710

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

In our BIGDATA grant, we pushed the state of the art in modern recommendation systems. Recommendation systems take in data about how people consume items (movies, products, or, in this grant, scientific articles) and build a predictor of each user’s future actions, i.e., which movie they will choose to see next or which article they will be most likely to read.

Classically, recommendation systems were built on matrix factorization methods. These are machine learning methods that learn “representations” of each user and item which, loosely, can be interpreted as the preferences of the users and the attributes of the items. A fitted matrix factorization predicts that a user will like an item when there is an affinity between their representations, that is, when the user’s preferences match the item’s attributes. However, although powerful, matrix factorization is also limited, and overcoming its inherent limitations was one of our major goals over the life of this project. To this end, we developed several new recommendation algorithms that perform significantly better than the classical approach.
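To make the classical approach concrete, here is a minimal sketch of matrix factorization in plain Python. It is an illustration only, not the project's code: the toy ratings, the function names (`dot`, `fit_mf`, `predict`), and all hyperparameters are assumptions chosen for the example. Each user and item gets a short vector, and the predicted affinity is their dot product, fitted by stochastic gradient descent on the observed entries.

```python
import random


def dot(u, v):
    """Affinity between a user vector and an item vector."""
    return sum(a * b for a, b in zip(u, v))


def fit_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit user and item vectors by SGD on squared error (hypothetical settings)."""
    rng = random.Random(seed)
    users = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    items = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - dot(users[u], items[i])
            for f in range(k):
                pu, qi = users[u][f], items[i][f]
                # Gradient step with a small L2 penalty on both vectors.
                users[u][f] += lr * (err * qi - reg * pu)
                items[i][f] += lr * (err * pu - reg * qi)
    return users, items


# Toy data: (user, item, rating), where 1.0 means "liked" and 0.0 means "did not like".
ratings = [
    (0, 0, 1.0), (0, 1, 1.0), (0, 2, 0.0),
    (1, 0, 1.0), (1, 1, 1.0), (1, 3, 0.0),
    (2, 2, 1.0), (2, 3, 1.0), (2, 0, 0.0),
]
users, items = fit_mf(ratings, n_users=3, n_items=4)


def predict(u, i):
    """Predicted affinity of user u for item i."""
    return dot(users[u], items[i])


# Mean squared error on the observed entries.
mse = sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)
```

Calling `predict(0, 3)` scores a user-item pair that was never observed, which is how such a model is used to rank candidate recommendations.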

In one line of work, we developed methods that capture both the content of the items and the various ways in which users interact with them. This was initially motivated by a desire to solve the "cold start problem," i.e., the situation in which a new item enters the collection but there is not yet any information about who has liked it. Our methods do solve this problem, but they do more as well. They help form interpretable representations of users, defining their preferences in terms of dimensions that can be visualized and explained, as well as representations of items, a capability with interdisciplinary impact. As one example of its broader impact, this work has been used by sociologists studying the historical preferences of library users in the 1800s.
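One common way to combine content with interactions, sketched below under stated assumptions, is to represent each item as a content-derived vector plus an offset learned from user behavior. A brand-new item has no offset yet, so it is scored from its content alone. The report does not specify the project's actual model; the topic proportions, names, and numbers here are hypothetical.

```python
def item_vector(content_vec, offset=None):
    """Item representation: content-derived vector plus an optional learned offset.

    For a cold-start item no offset has been learned, so content is used alone.
    """
    if offset is None:
        return list(content_vec)
    return [c + o for c, o in zip(content_vec, offset)]


def score(user_vec, item_vec):
    """Affinity as a dot product between user preferences and item attributes."""
    return sum(a * b for a, b in zip(user_vec, item_vec))


# Hypothetical two-topic space: (machine learning, genetics).
user_prefs = [0.9, 0.1]        # this reader mostly reads machine learning papers
new_ml_paper = [0.8, 0.2]      # topic proportions estimated from the text alone
new_genetics_paper = [0.1, 0.9]

# Neither article has any readers yet, but both can still be scored.
ml_score = score(user_prefs, item_vector(new_ml_paper))
genetics_score = score(user_prefs, item_vector(new_genetics_paper))

# Once readers accumulate, a learned offset can adjust the representation.
adjusted = item_vector(new_ml_paper, offset=[-0.1, 0.3])
```

Because each dimension corresponds to a topic, the user vector can be read directly as a profile of interests, which is the kind of interpretability the paragraph above describes.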

In another line of work, we modeled the discovery process of users.  Matrix factorization tacitly biases its estimation of user preferences because it does not account for how people discover items to consume.  As an example, consider watching a movie that your discovery process does not easily find.  Intuitively, a recommendation system should "upweight" your vote on this movie because it was rare, relative to the movies that you typically watch.  We turned this intuition into a mathematical framework for recommendation, significantly improving on classical methods.
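The "upweighting" intuition can be illustrated with a simple inverse-exposure weighting. This is a stand-in chosen for the example, not the mathematical framework developed in the project: the probability that a user is exposed to an item is crudely estimated as the fraction of users who consumed it, and each vote is weighted by the inverse of that estimate. All names and data are hypothetical.

```python
from collections import Counter


def exposure_weights(interactions, n_users):
    """Weight each item by the inverse of its estimated exposure probability.

    Assumption: exposure is approximated by the share of users who consumed
    the item, so widely seen items get weight near 1 and rare items get more.
    """
    counts = Counter(item for _, item in interactions)
    return {item: n_users / count for item, count in counts.items()}


def genre_profile(user, interactions, genres, weights=None):
    """Share of a user's (optionally weighted) votes that falls in each genre."""
    totals = Counter()
    for u, item in interactions:
        if u == user:
            totals[genres[item]] += 1.0 if weights is None else weights[item]
    norm = sum(totals.values())
    return {genre: total / norm for genre, total in totals.items()}


# Four users all watch three popular films; only user 0 finds the documentary.
interactions = [(u, m) for u in range(4) for m in ("hit1", "hit2", "hit3")]
interactions.append((0, "rare_doc"))
genres = {"hit1": "action", "hit2": "action", "hit3": "action",
          "rare_doc": "documentary"}

weights = exposure_weights(interactions, n_users=4)
plain = genre_profile(0, interactions, genres)
weighted = genre_profile(0, interactions, genres, weights)
```

In the unweighted profile the documentary is one vote out of four, while the weighted profile treats the hard-to-find film as a much stronger signal of user 0's taste.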

We have described just two results from our project. We advanced recommendation systems in several other ways as well. For example, we built a method to capture how user preferences change over time, a method to model user preferences in the context of complex baskets of goods, a method to involve social networks in building recommendation systems, and a method to scale recommendation systems up to massive sparse datasets.

In its broader impacts, our work has significantly pushed the state of the art in recommendation systems and models of user behavior data.  These models have become core to many technological problems across science, industry, and government.  In its intellectual merit, this work provided new methods in modern machine learning and data science. We and others continue to build on and research the mathematical and algorithmic properties of these methods.

 


Last Modified: 03/22/2018
Modified by: David Blei

Please report errors in award information by writing to: awardsearch@nsf.gov.
