
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | March 9, 2015 |
Latest Amendment Date: | November 15, 2016 |
Award Number: | 1502780 |
Award Instrument: | Standard Grant |
Program Manager: |
Sylvia Spengler
sspengle@nsf.gov (703)292-7347 IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | July 1, 2014 |
End Date: | December 31, 2017 (Estimated) |
Total Intended Award Amount: | $643,545.00 |
Total Awarded Amount to Date: | $643,545.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
615 W 131ST ST NEW YORK NY US 10027-7922 (212)854-6851 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
NY US 10027-6902 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Big Data Science & Engineering |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Big data analytics is, fundamentally, the problem of bringing the massive amounts of data produced today down to human scale. In particular, scientists, engineers, physicians, and many others in knowledge-intensive professions face data that is beyond human scale. This data resides in the repositories that collect the data, reports, and results of their fields. This project will address the problem of bringing all this knowledge under control by using even more data, namely the individual and social patterns of how these repositories are accessed and used, and user-specific judgments (valuations) of the data. The proposed research will develop novel algorithms and an open-source infrastructure for improving discovery within and access to data repositories. These algorithms will aggregate and analyze the social analytic data gathered from professional communities of data users, and will motivate those users to participate by providing recommendations.
The transformative goal is to develop methods for organizing and operationalizing the access and preference patterns of users of large repositories, and for integrating those valuations to accelerate discovery within the collections. Diverse human minds interacting with data collections, as they carry out their own research or operational activities, provide a powerful source of information about the value of the data itself. Those data items may be textual documents, numerical datasets, or other kinds of media content. The novel methods for representing, aggregating, organizing, and valuating interactions between the users and the items can reveal structures within data collections that were previously invisible to any individual. This discovery of interrelations within data, driven by the capture of human intelligence, will accelerate the processes of scientific discovery. Users who are permitted to valuate data, and who are motivated by receiving valuable recommendations in return, reveal more about their own interests. This makes it possible to discover relations among the data items and among the users themselves. The educational goals are to: (a) contribute to the education of specific graduate students supported by the project, and of undergraduates via the REU mechanism; (b) generate new educational materials related to algorithmic innovations and to research findings; and (c) improve access to and discovery within specific collections of materials. Research findings will be included in courses at all three collaborating universities.
Additional information about the project (including publications, software, and data sets) will be made available through the project web site: http://arxiv_xs.rutgers.edu/.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
In our BIGDATA grant, we pushed the state of the art in modern recommendation systems. Recommendation systems take in data about how people consume items (movies, products, or, in this grant, scientific articles) and build a predictor of each user’s future actions, i.e., which movie they will choose to see next or which article they will most likely read.
Classically, recommendation systems were built on matrix factorization methods. These are machine learning methods that learn “representations” of each user and item which, loosely, can be interpreted as preferences of the users and attributes of the items. A fitted matrix factorization predicts that a user will like an item when there is an affinity between their representations, that is, when the user’s preferences match the item’s attributes. However, although powerful, matrix factorization is also limited, and overcoming its inherent limitations was one of our major goals over the life of this project. To this end, we developed several new recommendation algorithms that perform significantly better than the classical approach.
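To make the classical baseline concrete, here is a minimal sketch of matrix factorization fitted by stochastic gradient descent on a toy set of like/dislike ratings. It is an illustration only, not the project's code: the function names, the toy data, and the hyperparameters (k, lr, reg, epochs) are all assumptions chosen for the example.

```python
import random

def dot(a, b):
    """Affinity between a user representation and an item representation."""
    return sum(x * y for x, y in zip(a, b))

def fit_matrix_factorization(ratings, n_users, n_items, k=2,
                             lr=0.05, reg=0.01, epochs=500, seed=0):
    """Learn k-dimensional user and item vectors from observed ratings.

    ratings maps (user, item) -> observed value. Hyperparameters are
    illustrative assumptions, not values taken from the project.
    """
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            err = r - dot(U[u], V[i])
            for d in range(k):
                uu, vv = U[u][d], V[i][d]
                # Gradient step on squared error with L2 regularization.
                U[u][d] += lr * (err * vv - reg * uu)
                V[i][d] += lr * (err * uu - reg * vv)
    return U, V

def mean_squared_error(ratings, U, V):
    """Average squared error over the observed entries only."""
    return sum((r - dot(U[u], V[i])) ** 2
               for (u, i), r in ratings.items()) / len(ratings)

# Toy data (hypothetical): users 0-1 like items 0-1, users 2-3 like items 2-3.
# The pairs (0, 1) and (3, 3) are deliberately left unobserved.
ratings = {
    (0, 0): 1.0, (0, 2): 0.0, (0, 3): 0.0,
    (1, 0): 1.0, (1, 1): 1.0, (1, 2): 0.0, (1, 3): 0.0,
    (2, 0): 0.0, (2, 1): 0.0, (2, 2): 1.0, (2, 3): 1.0,
    (3, 0): 0.0, (3, 1): 0.0, (3, 2): 1.0,
}
U, V = fit_matrix_factorization(ratings, n_users=4, n_items=4)
print("predicted affinity of user 0 for unseen item 1:", dot(U[0], V[1]))
```

The prediction for an unobserved pair is simply the dot product of the two learned vectors, which is what "affinity between their representations" means in this setting.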
In one line of work, we developed methods that capture both the content of the items and the various ways in which users interact with them. This was initially motivated by a desire to solve the "cold start problem," i.e., the situation where a new item enters the collection but there is no information yet about who has liked it. Our methods do solve this problem, but they do more as well. They help form interpretable representations of users (defining their preferences in terms of dimensions that can be visualized and explained) as well as representations of items, a function with interdisciplinary impact. As one example of its broader impact, this work has been used by sociologists studying the historical preferences of library users in the 1800s.
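One simple way to picture how content can address the cold start problem is sketched below. It assumes, purely for illustration, that an item's representation is a content-derived vector plus an offset learned from user interactions, so that a brand-new item falls back on content alone. The vectors, dimension meanings, and function names are hypothetical and are not taken from the project's models.

```python
def item_vector(content, offset=None):
    """Combine a content-derived vector with a learned interaction offset.

    A new item has no interaction history, so its offset defaults to zero
    and the representation comes from content alone (an assumed design).
    """
    if offset is None:
        offset = [0.0] * len(content)
    return [c + o for c, o in zip(content, offset)]

def score(user_pref, item_vec):
    """Affinity between a user's preferences and an item's representation."""
    return sum(p * v for p, v in zip(user_pref, item_vec))

# Hypothetical two-dimensional space: (machine learning, sociology).
user_pref = [0.9, 0.1]

# An established article: content vector plus an offset learned from readers.
old_item = item_vector([0.5, 0.5], offset=[0.2, -0.1])

# A newly added article: only its content is known.
new_item = item_vector([0.8, 0.2])

print("established article score:", score(user_pref, old_item))
print("new article score:", score(user_pref, new_item))
```

Because each dimension is tied to content, the same vectors that drive the scores can also be inspected and explained, which is the interpretability benefit described above.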
In another line of work, we modeled the discovery process of users. Matrix factorization tacitly biases its estimation of user preferences because it does not account for how people discover items to consume. As an example, consider watching a movie that your usual discovery process would not easily surface. Intuitively, a recommendation system should "upweight" your vote on this movie because it was rare, relative to the movies that you typically watch. We turned this intuition into a mathematical framework for recommendation, significantly improving on classical methods.
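The upweighting intuition can be illustrated with a deliberately simplified stand-in: inverse-propensity weighting, where each observed interaction is weighted by the reciprocal of an estimated probability that the user was exposed to the item. This is not the mathematical framework developed in the project. The popularity-based exposure estimate, the `floor` parameter, and the function name are assumptions made for this sketch.

```python
from collections import Counter

def exposure_weights(interactions, n_users, floor=0.05):
    """Weight each (user, item) interaction by 1 / estimated exposure.

    Exposure is approximated here by item popularity (the fraction of users
    who consumed the item), which is an assumption of this sketch. The floor
    keeps weights bounded for very rare items.
    """
    counts = Counter(item for _, item in interactions)
    weights = {}
    for user, item in interactions:
        p_exposed = max(counts[item] / n_users, floor)
        weights[(user, item)] = 1.0 / p_exposed
    return weights

# Hypothetical log: everyone watched "hit"; only user 0 found "obscure".
interactions = [(0, "hit"), (1, "hit"), (2, "hit"), (3, "hit"), (0, "obscure")]
weights = exposure_weights(interactions, n_users=4)
print(weights)
```

These weights could then multiply the per-interaction error terms when fitting a factorization, so that a choice made despite low exposure counts as stronger evidence of preference than a choice of something everyone sees.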
We have described just two results from our project. We advanced recommendation systems in several other ways as well. For example, we built a method to capture how user preferences change over time, a method to model user preferences in the context of complex baskets of goods, a method to incorporate social networks into recommendation systems, and a method to scale recommendation systems up to massive sparse datasets.
In its broader impacts, our work has significantly pushed the state of the art in recommendation systems and models of user behavior data. These models have become core to many technological problems across science, industry, and government. In its intellectual merit, this work provided new methods in modern machine learning and data science. We and others continue to build on and research the mathematical and algorithmic properties of these methods.
Last Modified: 03/22/2018
Modified by: David Blei