
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | September 20, 2012 |
Latest Amendment Date: | September 20, 2012 |
Award Number: | 1247637 |
Award Instrument: | Standard Grant |
Program Manager: | Sylvia Spengler, sspengle@nsf.gov, (703)292-7347, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | January 1, 2013 |
End Date: | December 31, 2016 (Estimated) |
Total Intended Award Amount: | $1,294,450.00 |
Total Awarded Amount to Date: | $1,294,450.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 341 PINE TREE RD ITHACA NY US 14850-2820 (607)255-5014 |
Sponsor Congressional District: |
|
Primary Place of Performance: | 4130 Upson Hall Ithaca NY US 14853-7501 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Big Data Science & Engineering |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Big data analytics is, fundamentally, the problem of bringing the massive amounts of data produced today down to human scale. In particular, scientists, engineers, physicians, and many others in knowledge-intensive professions face data that is beyond human scale. This data resides in the repositories that collect it and in the reports and results published in their fields. This project will address the problem of bringing all this knowledge under control by using even more data, namely the individual and social patterns of how these repositories are accessed and used, and user-specific judgments (valuations) of the data. The proposed research will develop novel algorithms and an open-source infrastructure for improving discovery within and access to data repositories. These algorithms will aggregate and analyze the social analytic data gathered from professional communities of data users, and will motivate those users to participate by providing recommendations in return.
The transformative goal is to develop methods for organizing and operationalizing the access and preference patterns of users of large repositories, and for integrating those valuations to accelerate discovery within the collections. Diverse human minds interacting with data collections, as they carry out their own research or operational activities, provide a powerful source of information about the value of the data itself. The data items may be textual documents, numerical datasets, or other kinds of media content. The novel methods for representing, aggregating, organizing, and valuating interactions between the users and the items can reveal structures within data collections that were previously invisible to any individual. This discovery of interrelations within data, driven by the capture of human intelligence, will accelerate the processes of scientific discovery. Users who are permitted to valuate data, and who are motivated by receiving valuable recommendations in return, reveal more about their own interests. This makes it possible to discover relations among the data items and among the users themselves. The educational goals are to: (a) contribute to the education of specific graduate students supported by the project, and of undergraduates via the REU mechanism; (b) generate new educational materials related to algorithmic innovations and to research findings; and (c) improve access to and discovery within specific collections of materials. Research findings will be included in courses at all three collaborating universities.
Additional information about the project (including publications, software, and data sets) will be made available through the project web site: http://arxiv_xs.rutgers.edu/.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Recommender systems have become a key tool of everyday life, and we routinely use them for tasks ranging from browsing entertainment options to researching products we may want to purchase. In this project, we developed new methods for training recommender systems based on the feedback that users provide, both explicitly and implicitly through their actions. In particular, the project focused on designing recommendation systems for scientific literature. This application not only provided a testbed for the general methods we developed, but also let us explore the design of the next generation of information systems that will further enable the dissemination of scientific results.
The project was a collaborative effort of researchers at Cornell, Princeton, and Rutgers. On the Cornell side, the project developed new machine learning algorithms for several aspects of the recommendation problem. The project made many contributions to the design of such learning algorithms, but for brevity this report focuses on the following two areas of research.
First, recommendation systems need to strike the right balance between exploiting what they already know about the user and exploring aspects of the user's interests that the system is not yet confident about. Making the right trade-off between exploration and exploitation is important: too much exploration makes the recommendation system look like it does not know the user's tastes, while too much exploitation may mean that the system never discovers all the interests a user may have. Approaching this trade-off between exploration and exploitation as a multi-armed bandit problem, we designed new algorithms, along with their underlying theory, for solving this trade-off optimally under various conditions.
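To make the bandit framing concrete, here is a minimal sketch of UCB1, a standard textbook bandit algorithm. It is not one of the algorithms developed in this project, which the report does not name. The sketch assumes each "arm" is a topic to recommend and each reward is a simulated click; the function name `ucb1` and the click probabilities are hypothetical.

```python
import math
import random


def ucb1(click_probs, rounds, seed=0):
    """Run UCB1 on simulated Bernoulli arms and return pull counts per arm.

    click_probs: hypothetical true click probability of each arm
                 (unknown to the algorithm, used only to simulate feedback).
    """
    rng = random.Random(seed)
    n_arms = len(click_probs)
    pulls = [0] * n_arms          # how often each arm was recommended
    reward_sums = [0.0] * n_arms  # total clicks observed per arm

    for t in range(1, rounds + 1):
        if t <= n_arms:
            # Initialization: try every arm once.
            arm = t - 1
        else:
            # Exploit the empirical mean, plus an exploration bonus that
            # shrinks as an arm is pulled more often.
            arm = max(
                range(n_arms),
                key=lambda a: reward_sums[a] / pulls[a]
                + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        reward = 1.0 if rng.random() < click_probs[arm] else 0.0
        pulls[arm] += 1
        reward_sums[arm] += reward
    return pulls


if __name__ == "__main__":
    print(ucb1([0.1, 0.5, 0.9], rounds=5000))
```

The exploration bonus is what keeps the system from locking onto its first good guess: rarely tried arms keep a large bonus until enough feedback has accumulated to rule them out.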
Second, we asked how to reuse data that the recommendation system collected in the past. The problem lies in dealing with the biases introduced by the historic version of the recommendation system, as well as the biases inherent in how humans provide feedback and make choices. For example, if we want to use the set of papers that a user read while using our historic recommendation system as a feedback signal for learning, then it is important to know which papers the recommender system recommended and how visible each recommendation was to the user. Clearly, a paper that the user never discovered cannot make it into the set of papers the user read, even if it was very relevant to the user's interests. To deal with such biases in a principled and provably correct way, we designed learning methods that explicitly correct for selection biases and that scale to large datasets.
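One common way to correct for this kind of selection bias is inverse propensity scoring (IPS), sketched below. This is a generic illustration, not necessarily the project's method. It assumes the historic system logged, for every recommendation, the probability (propensity) with which it showed that paper; the function names, the log format, and the toy data are all hypothetical.

```python
def naive_estimate(logs):
    """Average logged reward; biased toward whatever the old system showed."""
    return sum(reward for _, _, reward in logs) / len(logs)


def ips_estimate(logs, target_policy):
    """Inverse-propensity-scored estimate of a new policy's expected reward.

    logs: list of (item, propensity, reward) tuples, where propensity is the
          probability with which the historic system showed the item.
    target_policy: function item -> probability that the new policy shows it.
    """
    total = 0.0
    for item, propensity, reward in logs:
        # Reweight each logged reward by how much more (or less) often the
        # new policy would have shown this item than the old one did.
        total += target_policy(item) / propensity * reward
    return total / len(logs)


# Toy log: the old system showed paper "A" 80% of the time and "B" 20%.
# "A" was read in 2 of 8 impressions, "B" in 1 of 2 impressions.
LOGS = [("A", 0.8, 1.0)] * 2 + [("A", 0.8, 0.0)] * 6 \
     + [("B", 0.2, 1.0)] + [("B", 0.2, 0.0)]


def always_b(item):
    """Hypothetical new policy that always recommends paper "B"."""
    return 1.0 if item == "B" else 0.0


if __name__ == "__main__":
    print(naive_estimate(LOGS))          # read rate under the old system
    print(ips_estimate(LOGS, always_b))  # estimated read rate if always showing B
```

In this toy log the naive average reflects mostly paper "A", simply because the old system showed it more often. The IPS estimate instead recovers the read rate of "B" observed in the log, which is what a policy that always shows "B" would be expected to achieve.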
Beyond these research contributions in machine learning, the project developed the my.arxiv.org system as a prototype for the next generation of systems that help researchers discover relevant scientific literature. It allowed us to explore different interfaces and how these interfaces interact with the recommendation algorithms. The lessons learned will be incorporated into the design of the next generation of arXiv.org, which is the main repository of scientific papers for a wide range of disciplines in science and engineering.
Last Modified: 05/10/2017
Modified by: Thorsten Joachims
Please report errors in award information by writing to: awardsearch@nsf.gov.