Award Abstract # 1247637
BIGDATA: Mid-Scale: ESCE: Collaborative Research: Discovery and Social Analytics for Large-Scale Scientific Literature

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: CORNELL UNIVERSITY
Initial Amendment Date: September 20, 2012
Latest Amendment Date: September 20, 2012
Award Number: 1247637
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS Division of Information & Intelligent Systems
CSE Directorate for Computer and Information Science and Engineering
Start Date: January 1, 2013
End Date: December 31, 2016 (Estimated)
Total Intended Award Amount: $1,294,450.00
Total Awarded Amount to Date: $1,294,450.00
Funds Obligated to Date: FY 2012 = $1,294,450.00
History of Investigator:
  • Thorsten Joachims (Principal Investigator)
    tj@cs.cornell.edu
  • Paul Ginsparg (Co-Principal Investigator)
  • Peter Frazier (Co-Principal Investigator)
Recipient Sponsored Research Office: Cornell University
341 PINE TREE RD
ITHACA
NY  US  14850-2820
(607)255-5014
Sponsor Congressional District: 19
Primary Place of Performance: Cornell University
4130 Upson Hall
Ithaca
NY  US  14853-7501
Primary Place of Performance Congressional District: 19
Unique Entity Identifier (UEI): G56PUALJ3KT5
Parent UEI:
NSF Program(s): Big Data Science & Engineering
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 7924, 8083
Program Element Code(s): 808300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Big data analytics is, fundamentally, the problem of bringing the massive amounts of data produced today down to human scale. In particular, scientists, engineers, physicians, and many others in knowledge-intensive professions face data that is beyond human scale. This data resides in the repositories that collect the datasets, reports, and results of their fields. This project will address the problem of bringing all this knowledge under control by using even more data, namely the individual and social patterns of how these repositories are accessed and used, and user-specific judgments (valuations) of the data. The proposed research will develop novel algorithms and an open-source infrastructure for improving discovery within and access to data repositories. These algorithms will aggregate and analyze the social analytic data gathered from professional communities of data users, and will motivate those users to participate by providing recommendations.

The transformative goal is to develop methods for organizing and operationalizing the access and preference patterns of users of large repositories, and for integrating those valuations to accelerate discovery within the collections. Diverse human minds interacting with data collections, as they carry out their own research or operational activities, provide a powerful source of information about the value of the data itself. Those data items may be textual documents, numerical datasets, or other kinds of media content. The novel methods for representing, aggregating, organizing, and valuating interactions between the users and the items can reveal structures within data collections that were previously invisible to any individual. This discovery of interrelations within data, driven by the capture of human intelligence, will accelerate the processes of scientific discovery. Users who are permitted to valuate data, and who are motivated by receiving valuable recommendations in return, reveal more about their own interests. This makes it possible to discover relations among the data items and among the users themselves. The educational goals are to: (a) contribute to the education of specific graduate students supported by the project, and undergraduates via the REU mechanism; (b) generate new educational materials related to algorithmic innovations and to research findings; and (c) improve access to and discovery within specific collections of materials. Research findings will be included in courses at all three collaborating universities.

Additional information about the project (including publications, software, and data sets) will be made available through the project web site: http://arxiv_xs.rutgers.edu/.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 34)
A.J. Meltzer, A. Graham, P.H. Connolly, J.K. Karwowski, H.L. Bush, P.I. Frazier and D.B. Schneider "Risk Factors for Early Failure after Peripheral Endovascular Intervention: Application of a Reliability Engineering Approach" Annals of Vascular Surgery, v.27, 2013, p.53, doi:10.1016/j.avsg.2012.05.002
A. Swaminathan, T. Joachims "Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization" JMLR Special Issue in Memory of Alexey Chervonenkis, 2015
A. Swaminathan, T. Joachims "Counterfactual Risk Minimization: Learning from Logged Bandit Feedback" International Conference on Machine Learning (ICML), 2015
A. Swaminathan, T. Joachims "The Self-Normalized Estimator for Counterfactual Learning" Neural Information Processing Systems (NIPS), 2015
B. Chen and P.I. Frazier "The Bayesian Linear Information Filtering Problem" IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2016
Bishan Yang, Claire Cardie and Peter Frazier "A Hierarchical Distance-dependent Bayesian Model for Event Coreference Resolution" Transactions of the Association for Computational Linguistics, v.3, 2015, p.517-528, ISSN 2307-387X
D. Blei "Probabilistic topic models" Communications of the ACM, v.55, 2012, p.4
D. Blei "Topic Modeling and Digital Humanities" Journal of Digital Humanities, v.2, 2013
D. Singhvi, S. Singhvi, P.I. Frazier, S.G. Henderson, E. O'Mahony, D.B. Shmoys and D.B. Woodard "Predicting Bike Usage for New York City's Bike Sharing System" AAAI-15 Workshop on Computational Sustainability, 2015
I.O. Ryzhov, P.I. Frazier and W.B. Powell "A New Optimal Stepsize Rule for Approximate Dynamic Programming" IEEE Transactions on Automatic Control, v.60, 2015, p.743
J. Paisley, C. Wang and D. Blei "The discrete infinite logistic normal distribution" Bayesian Analysis, 2012

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Recommender systems have become a key tool of everyday life, and we routinely use them for tasks ranging from browsing entertainment options to researching products we may want to purchase. In this project, we developed new methods for training recommender systems based on the feedback the users provided both explicitly and implicitly through their actions. In particular, the project focused on designing recommendation systems for scientific literature, where this particular application provided not only a testbed for the general methods we developed, but also explored the design of the next generation of information systems that will further enable the dissemination of scientific results.

The project was a collaborative effort of researchers at Cornell, Princeton, and Rutgers. Focusing on the results on the Cornell side, the project developed new machine learning algorithms for several aspects of the recommendation problem. The project made many contributions to the design of such learning algorithms, but for conciseness of this report we focus on the following two areas of research.

First, recommendation systems need to strike the right balance between exploiting what they already know about the user and exploring aspects of the user's interests that the system is not yet confident about. Making the right trade-offs between exploration and exploitation is important, since too much exploration makes the recommendation system look like it does not know the user's tastes, and too much exploitation may lead to never discovering all the interests a user may have. Approaching this trade-off between exploration and exploitation as a multi-armed bandit problem, we have designed new algorithms and their underlying theory for solving this trade-off optimally under various conditions.
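
To make the exploration/exploitation trade-off concrete, the following is a minimal sketch of UCB1, a standard textbook multi-armed bandit algorithm. It is offered only as an illustration and is not one of the algorithms developed in this project. The three "arms" (candidate papers), their click probabilities, and the function names are hypothetical.

```python
import math
import random

def ucb1(pull, n_arms, n_rounds):
    """Run UCB1 for n_rounds; pull(arm) must return a reward in [0, 1].

    Returns the number of times each arm was played.
    """
    counts = [0] * n_arms    # how often each arm was played
    sums = [0.0] * n_arms    # total reward collected per arm
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1      # play every arm once to initialize
        else:
            # Empirical mean (exploitation) plus a confidence bonus that
            # shrinks as an arm is played more often (exploration).
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Hypothetical click probabilities for three candidate papers.
click_prob = [0.2, 0.5, 0.8]
rng = random.Random(0)

def simulated_click(arm):
    """Simulated user feedback: 1.0 if the user clicks, else 0.0."""
    return 1.0 if rng.random() < click_prob[arm] else 0.0

counts = ucb1(simulated_click, n_arms=3, n_rounds=5000)
print(counts)
```

In this sketch the confidence bonus keeps every arm from being abandoned too early, while most plays concentrate on the arm with the highest observed click rate.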

Second, we asked the question of how to reuse data that was collected by the recommendation system in the past. The problem here lies in dealing with the biases that were introduced by the historic version of the recommendation system, as well as the biases that are inherent in how humans provide feedback and make choices. For example, if we want to use the set of papers that the user read while using our historic recommendation system as a feedback signal for learning, then it is important to know which papers the recommender system recommended and how visible they were to the user. Clearly, a paper that was never discovered by the user cannot make it into the set of papers the user read, even if it was very relevant to the user's interests. To deal with such biases in a principled and provably correct way, we designed learning methods that explicitly correct for selection biases and that scale to large datasets.
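
One common way to correct for this kind of selection bias is inverse propensity scoring (IPS): each logged observation is reweighted by how likely the new policy is to show the item relative to how likely the historic system was to show it. The report does not spell out the estimators used, so the sketch below is only an illustration under that assumption. The log format, the propensities, and the function names are hypothetical.

```python
def ips_estimate(logs, target_prob):
    """Inverse propensity scoring estimate of a new policy's average reward.

    logs: list of (context, action, reward, propensity) tuples, where
          propensity is the probability with which the historic system
          showed that action in that context.
    target_prob(context, action): probability that the new policy would
          show the same action.
    """
    total = sum(r * target_prob(x, a) / p for x, a, r, p in logs)
    return total / len(logs)

def snips_estimate(logs, target_prob):
    """Self-normalized variant: divide by the sum of the importance weights
    instead of the number of log entries, which typically lowers variance."""
    weights = [target_prob(x, a) / p for x, a, _, p in logs]
    weighted = sum(w * r for w, (_, _, r, _) in zip(weights, logs))
    return weighted / sum(weights)

# Hypothetical log: the historic system showed paper "a" with probability
# 0.8 and paper "b" with probability 0.2; reward 1.0 means the user read it.
logs = [
    ("user1", "a", 1.0, 0.8),
    ("user2", "a", 0.0, 0.8),
    ("user3", "b", 1.0, 0.2),
    ("user4", "a", 1.0, 0.8),
]

def uniform_policy(context, action):
    """Hypothetical new policy: show "a" or "b" with equal probability."""
    return 0.5

ips = ips_estimate(logs, uniform_policy)
snips = snips_estimate(logs, uniform_policy)
print(ips, snips)
```

Because paper "b" was rarely shown by the historic system, its single logged read receives a larger weight than each read of paper "a", which is how the estimate compensates for the historic system's exposure pattern.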


Beyond these research contributions in machine learning, the project developed the my.arxiv.org system as a prototype for the next generation of systems that help researchers discover relevant scientific literature. It allowed us to explore different interfaces and how these interfaces interact with the recommendation algorithms. The lessons learned will be incorporated into the design of the next generation of arXiv.org, which is the main repository of scientific papers for a wide range of disciplines in science and engineering.

Last Modified: 05/10/2017
Modified by: Thorsten Joachims

Please report errors in award information by writing to: awardsearch@nsf.gov.
