Award Abstract # 1247696
BIGDATA: Mid-Scale: ESCE: Collaborative Research: Discovery and Social Analytics for Large-Scale Scientific Literature.

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: RUTGERS, THE STATE UNIVERSITY
Initial Amendment Date: September 20, 2012
Latest Amendment Date: November 22, 2017
Award Number: 1247696
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: January 1, 2013
End Date: December 31, 2018 (Estimated)
Total Intended Award Amount: $996,792.00
Total Awarded Amount to Date: $1,004,784.00
Funds Obligated to Date: FY 2012 = $996,792.00
FY 2017 = $7,992.00
History of Investigator:
  • Rebecca Wright (Principal Investigator)
    rwright@barnard.edu
  • Paul Kantor (Co-Principal Investigator)
  • Paul Kantor (Former Principal Investigator)
Recipient Sponsored Research Office: Rutgers University New Brunswick
3 RUTGERS PLZ
NEW BRUNSWICK
NJ  US  08901-8559
(848)932-0150
Sponsor Congressional District: 12
Primary Place of Performance: Rutgers University New Brunswick
NJ  US  08901-8559
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): M1LVPE5GLSD9
Parent UEI:
NSF Program(s): Big Data Science & Engineering
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
01001718DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 7924, 8083, 9251
Program Element Code(s): 808300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Big data analytics is, fundamentally, the problem of bringing the massive amounts of data produced today down to human scale. In particular, scientists, engineers, physicians, and many others in knowledge-intensive professions face data that is beyond human scale. This data resides in the repositories that collect the data, reports, and results of their fields. This project will address the problem of bringing all this knowledge under control by using even more data, namely the individual and social patterns of how these repositories are accessed and used, and user-specific judgments (valuations) of the data. The proposed research will develop novel algorithms and an open-source infrastructure for improving discovery within and access to data repositories. These algorithms will aggregate and analyze social analytic data gathered from professional communities of data users, and will motivate those users to participate by providing recommendations.

The transformative goal is to develop methods for organizing and operationalizing the access and preference patterns of users of large repositories, and for integrating those valuations to accelerate discovery within the collections. Diverse human minds interacting with data collections, as they carry out their own research or operational activities, provide a powerful source of information about the value of the data itself. Those data items may be textual documents, numerical datasets, or other kinds of media content. The novel methods for representing, aggregating, organizing and valuating interactions between the users and the items can reveal structures within data collections that were previously invisible to any individual. This discovery of interrelations within data, driven by the capture of human intelligence, will accelerate the processes of scientific discovery. Users who are permitted to valuate data, and who are motivated by receiving valuable recommendations in return, reveal more about their own interests. This makes it possible to discover relations among the data items and among the users themselves. The educational goals are to: (a) contribute to the education of specific graduate students supported by the project, and undergraduates via the REU mechanism; (b) generate new educational materials related to algorithmic innovations and to research findings; and (c) improve access to and discovery within specific collections of materials. Research findings will be included in courses at all three collaborating universities.

Additional information about the project (including publications, software, and data sets) will be made available through the project web site: http://arxiv_xs.rutgers.edu/.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 18)
AJ Meltzer, A Graham, PH Connolly, JK Karwowski, HL Bush, PI Frazier, DB Schneider "Risk Factors for Early Failure after Peripheral Endovascular Intervention: Application of a Reliability Engineering Approach" Annals of Vascular Surgery , v.27 , 2013 , p.53
C Wang, D Blei "Variational inference in nonconjugate models" Journal of Machine Learning Research , v.NA , 2013
D Blei "Probabilistic topic models" Communications of the ACM , v.55 , 2012 , p.77
D Blei "Topic Modeling and Digital Humanities" Journal of Digital Humanities , v.2 , 2012 , p.NA
D. Blei. "Build, Compute, Critique, Repeat: Data Analysis with Latent" Annual Review of Statistics and Its Application , v.1 , 2014 , p.203-232
Govindan, P., Monemizadeh, M., & Muthukrishnan, S. "Streaming Algorithms for Measuring H-Impact." Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems , 2017 , p.337-346
Han, W., Rajan, P., Frazier, P. I., & Jedynak, B. M. "Bayesian Group Testing Under Sum Observations: A Parallelizable Two-Approximation for Entropy Loss." IEEE Transactions on Information Theory , v.63 , 2017 , p.915-933
I.O. Ryzhov, P.I. Frazier, and W.B. Powell "A New Optimal Stepsize Rule for Approximate Dynamic Programming," IEEE Transactions on Automatic Control, to appear, published online Sep 12, 2014. , 2015
J Paisley, C Wang, D Blei "The discrete infinite logistic normal distribution" Bayesian Analysis , v.7 , 2012 , p.235
J. Paisley, C. Wang, D. Blei, and M. Jordan. "A nested HDP for hierarchical topic modeling." IEEE Transactions on Pattern Analysis and Machine Intelligence, in press. , 2015
M Hoffman, D Blei, J Paisley, C Wang "Stochastic Variational Inference" Journal of Machine Learning Research , v.NA , 2013

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project developed and implemented a research system, called my.arXiv, to improve scientific collaboration. We worked with the arXiv resource at Cornell University to develop and debug the system. In use, the system shows scientific papers matching a search and suggests additional scientific preprints (items that have not yet gone through the lengthy publication process) so that researchers can quickly find the "missing link" for their own research. We worked with outstanding computer scientists, who developed a number of ways to link one item to another. In addition to the keywords and subject headings of those items, my.arXiv can use the text of the abstract itself to find related materials.
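
As a rough illustration of linking items through abstract text, the sketch below ranks abstracts by the cosine similarity of their word counts. This is a generic technique chosen for brevity, not the method my.arXiv uses; the function names and the sample abstracts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_abstract, abstracts):
    """Return (item_id, score) pairs sorted from most to least similar."""
    scored = [(item_id, cosine_similarity(query_abstract, text))
              for item_id, text in abstracts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical abstracts, for illustration only.
abstracts = {
    "item-1": "Topic models for large text collections",
    "item-2": "Risk factors after vascular surgery",
}
ranking = most_similar("Probabilistic topic models", abstracts)
```

A production system would more likely weight rare words more heavily or use a learned representation; raw counts are used here only to keep the sketch short.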

Researchers at Cornell examined the possibility that our system may learn more about what a user is looking for by suggesting some items that do not seem to be the best match. These items are chosen instead because they help the system resolve some ambiguity about what the researcher is looking for; in other words, they explore what is sought rather than simply exploiting what is already known. Furthermore, if a researcher was never shown a specific item, there is no record of the fact that she would have found it useful. Other Cornell research developed ways to improve recommendations, to correct for this "items never seen" problem, and to scale up to very large collections.
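
The explore/exploit trade-off described above can be sketched as follows. The report says that exploratory items are chosen to resolve ambiguity; this sketch substitutes uniformly random exploration, a much simpler stand-in, and every name and score in it is hypothetical.

```python
import random


def build_slate(scores, slate_size, explore_slots, rng):
    """Fill most slots with top-scored items and the rest with random others.

    scores: dict mapping item id -> estimated relevance (hypothetical values).
    explore_slots: number of slots reserved for exploration.
    rng: a random.Random instance, passed in so results can be reproduced.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    exploit_count = max(slate_size - explore_slots, 0)
    exploit = ranked[:exploit_count]          # best-known matches
    remaining = ranked[exploit_count:]        # everything not yet chosen
    explore = rng.sample(remaining, min(explore_slots, len(remaining)))
    return exploit + explore


# Hypothetical relevance estimates for four preprints.
scores = {"p1": 0.9, "p2": 0.8, "p3": 0.2, "p4": 0.1}
slate = build_slate(scores, slate_size=3, explore_slots=1, rng=random.Random(0))
```

Showing an occasional low-scored item is also what creates a record for items that would otherwise never be seen, which is the gap the paragraph above describes.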

The research at Princeton and then Columbia examined many innovative ways to identify new or emerging subjects, which may not even have standard names yet, much as biochemistry was born nearly 100 years ago at the interface of biology and chemistry. For an individual user, these methods can adapt to and track changing interests, and profile those interests in a meaningful way.

The Rutgers research also explored new ways to combine several possible recommendations, and to learn from users' active responses to those recommendations. In an example window, preprints are shown together with an estimate of their relevance (the length of the red bar). The user may click the active title link to learn more about the item, indicate that, whatever its value, it has already been seen, or advise us that it is not what the user needs. Finally, a mouse-drag can move any item up or down the screen. This tells the system to consider it more (or less) useful than any item that it moves past.
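
One way to read the drag gesture is as a set of pairwise preferences between the dragged item and each item it passes. The sketch below is a minimal interpretation of that description, not the project's code; the function name and the list-of-ids representation are assumptions.

```python
def preferences_from_drag(displayed, item, new_index):
    """Return (preferred, less_preferred) pairs implied by one drag.

    displayed: list of item ids in the order shown, top first.
    item: the id that the user dragged.
    new_index: the position of the item in the list after the drag.
    """
    old_index = displayed.index(item)
    if new_index < old_index:
        # Moved up: the item is preferred over everything it passed.
        passed = displayed[new_index:old_index]
        return [(item, other) for other in passed]
    # Moved down (or not moved): every item it passed is preferred over it.
    passed = displayed[old_index + 1:new_index + 1]
    return [(other, item) for other in passed]


# Dragging "d" from the bottom to the second position passes "b" and "c".
pairs = preferences_from_drag(["a", "b", "c", "d"], "d", 1)
```

Items that the dragged item does not pass produce no pairs, which matches the statement that the drag only compares the item with those it moves past.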

The system has been developed with an extensive "back room" that can be used to design and conduct experiments. These experiments will lead to new knowledge about which algorithms best match the real thought processes of human researchers. For long-term study, the my.arXiv system supports user accounts. A researcher with an account can choose to have her preferences and feedback saved from one session to the next, as the system learns more and more about how to help.

In addition to the search for better ways to make recommendations, this project looked into growing concerns about privacy. A researcher on the trail of a scientific breakthrough wants to be sure that using a collaborative resource will not make it easy for other researchers to steal her ideas. Extensive experiments using years of data from the arXiv history have shown a kind of "bad news / good news" result. The bad news is that an adversary might be able to figure out what a researcher has been examining or recommending; all he must do is submit the very same search query, just before and just after the researcher does. The good news is that for a system as large as arXiv, with over one million items and more than half a million users per week, the chance of leaking a research interest to this kind of attack is almost astronomically small.
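
The before-and-after probe can be pictured as a comparison of two ranked result lists for the same query. The sketch below shows only that comparison step, under the assumption that results arrive as ordered lists of item ids; it does not model arXiv or my.arXiv, and the function name is hypothetical.

```python
def changed_items(before, after):
    """Items that are new in `after` or rank higher than they did in `before`.

    before, after: ranked lists of item ids returned for the same query,
    top result first.
    """
    old_rank = {item: rank for rank, item in enumerate(before)}
    return [item for rank, item in enumerate(after)
            if item not in old_rank or rank < old_rank[item]]


# "c" rises from third to second place between the two probes.
suspects = changed_items(["a", "b", "c"], ["a", "c", "b"])
```

On a busy system many other users act between the two probes, so any change the adversary sees is mixed with unrelated activity, which is consistent with the low leakage risk the report describes.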


Last Modified: 02/26/2019
Modified by: Rebecca N Wright
