
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | September 6, 2012 |
Latest Amendment Date: | May 21, 2014 |
Award Number: | 1216282 |
Award Instrument: | Standard Grant |
Program Manager: |
Maria Zemankova
IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | October 1, 2012 |
End Date: | September 30, 2016 (Estimated) |
Total Intended Award Amount: | $499,182.00 |
Total Awarded Amount to Date: | $515,182.00 |
Funds Obligated to Date: |
FY 2014 = $16,000.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
5000 FORBES AVE PITTSBURGH PA US 15213-3815 (412)268-8746 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
PA US 15213-3890 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: |
01001415DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
This project addresses the open challenge of Multi-field Hierarchical Discovery and Tracking (mf-HDT) of emerging topics at different granularity levels, based on combined evidence in heterogeneous data. The technical approach consists of a new Bayesian framework with powerful inference algorithms, namely multi-field Hierarchical Correlated Topic Modeling, for discovering multi-field hierarchies of latent topics, capturing inter-topic and cross-hierarchy correlations, and enabling query-driven threading of topics over a Markov chain of hierarchies. These innovations and capabilities go beyond existing Topic Detection and Tracking (TDT) methods and the graphical models used to represent relationships between topics, citations, etc. Significant improvements in both effectiveness and scalability are expected over existing methods, especially in detecting newly emerging topics and tracking time-sensitive impact. The proposed approach will be evaluated on four large datasets of scientific literature spanning a broad range of fields (Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics) as well as news stories, with human-produced queries and relevance judgments and human-assigned topic labels to support task-oriented evaluations.
The productivity of researchers, educational practitioners and students, government agencies supporting research, and industries depends heavily on the availability of up-to-date big pictures of scientific emergence and co-emergence within and across many fields, along with evidence of the impact of new technologies and of research or development funding. The proposed techniques, if successful, will provide principled and effective solutions with a broad future impact in the applications above and beyond. The project web site (http://nyc.lti.cs.cmu.edu/mfhdt/) will provide access to open-source software, datasets, results and publications in order to enable comparative evaluations and further studies by related research communities. The students involved in the project benefit from direct experience with using and evaluating cutting-edge IT technologies in real-world applications. This complements classroom teaching, since the students can observe first-hand the direct implications of choosing various strategies for categorization, active learning and distributed computing.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
A. Scientific Goals
Our goal is to address the open challenge of Multi-factorial Topic Detection and Tracking over evolving topics, such as scientific trends or political developments, at different levels of granularity. We accomplish this goal via a multi-factorial representation of co-occurring entities, features (words), links, impact indicators and other meta-level features in publication records or news-story documents. Specifically, we developed a family of novel Bayesian graphical models, including temporal and hierarchical versions of von Mises-Fisher (vMF) models, with highly scalable algorithms for large-scale inference.
B. Major Outcomes
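For readers unfamiliar with vMF models, the following is a minimal sketch of the standard vMF log-density for a unit-length vector, as might be used for a length-normalized document vector. It is not the project's implementation: the function names, the power-series evaluation of the Bessel function, and the fixed number of series terms are all assumptions made for illustration, and the series is only suitable for small to moderate concentration values.

```python
import math


def log_bessel_i(order, x, terms=200):
    """Log of the modified Bessel function of the first kind, I_order(x), for x > 0.

    Evaluated from its power series in log space (illustrative only; the fixed
    number of terms limits this to small or moderate x).
    """
    log_half_x = math.log(x / 2.0)
    log_terms = [
        (2 * m + order) * log_half_x - math.lgamma(m + 1) - math.lgamma(m + order + 1)
        for m in range(terms)
    ]
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def vmf_log_density(x, mean_direction, kappa):
    """Log-density of a unit vector x under vMF(mean_direction, kappa), dimension >= 2."""
    d = len(x)
    order = d / 2.0 - 1.0
    # Normalizer: C_d(kappa) = kappa^(d/2 - 1) / ((2*pi)^(d/2) * I_{d/2 - 1}(kappa))
    log_norm = (
        order * math.log(kappa)
        - (d / 2.0) * math.log(2.0 * math.pi)
        - log_bessel_i(order, kappa)
    )
    dot = sum(a * b for a, b in zip(x, mean_direction))
    return log_norm + kappa * dot
```

The density is highest when x points along the mean direction and falls off as the angle between them grows; the concentration parameter kappa controls how quickly.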
1. Temporal vMF modeling: For temporal data streams, it is natural to analyze how latent clusters in the data evolve over time. Figure 1 compares our temporal vMF with the flat vMF on a subset of the TDT5 corpus of news stories, plotting the likelihood of the data at the next time step when the models are trained on the data from the previous time steps.
2. Multi-field hierarchical & temporal vMF (MFHT-vMF): We extended our temporal vMF models from single-field clustering to multi-field clustering, and from a flat level to multiple levels of granularity for hierarchically nested topic models. Figure 2 shows that MFHT-vMF outperforms the other vMF models (flat, temporal alone or hierarchical alone) on CiteSeer from 1994 to 2004; the average likelihood of the next time step is used as the metric, and the data from previous time steps are used for training. A live demo is available at the project website.
3. vMF clustering with limited supervision: A common problem with clustering is that the generated clusters often do not match user expectations. We extended our vMF model to leverage (limited) supervised information for better clustering of unlabeled data. That is, given a small subset of the ground-truth clusters, our system learns to discover many unknown (new) clusters in a way that is consistent with the ground-truth clusters. Extensive evaluations on popular benchmark data sets (Gopal & Yang, UAI 2014) showed advantageous performance of our proposed model over several state-of-the-art supervised and unsupervised methods (Figure 3).
4. Graph-based Transductive Learning: We further developed a novel method, namely Transductive Learning over Product Graph (TOP), which extracts multi-type associations from different sources of data, maps heterogeneous types of objects and relations onto a unified product graph, and performs joint inference about topic labels of documents via transductive label propagation over the product graph. This approach is particularly effective in transductive learning scenarios where labeled documents are very sparse and unlabeled documents are massively available, and where the manifold structures are highly informative but vary across different fields of co-occurrence data. In our experiments with a subset of DBLP publication records (34K users, 11K papers and 22 venues) and an Enzyme multi-source dataset (445 compounds, 664 proteins), TOP successfully scaled to the large cross-graph inference problem and significantly outperformed other representative approaches (Figure 4).
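The idea of propagating labels over a product of graphs (item 4) can be illustrated with a toy sketch. This is not the TOP formulation: the Kronecker-style (tensor) product, the row normalization, the damping parameter alpha, and all function names are assumptions chosen for illustration. The sketch scores pairs (i, j), with i from one graph and j from another, and relies on the fact that multiplying by the product-graph adjacency is equivalent to computing A F B^T, so the product graph never has to be built explicitly.

```python
def row_normalize(adj):
    """Scale each row to sum to 1 (rows with no edges stay all-zero)."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def matmul(x, y):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def propagate_on_product_graph(adj_g, adj_h, seeds, alpha=0.8, iters=50):
    """Iterate F <- alpha * A F B^T + (1 - alpha) * Y over pair scores F.

    adj_g and adj_h are adjacency matrices of the two graphs; seeds (Y) holds
    the known pair scores, with 0.0 for unlabeled pairs.
    """
    a = row_normalize(adj_g)
    b_t = transpose(row_normalize(adj_h))
    scores = [row[:] for row in seeds]
    for _ in range(iters):
        spread = matmul(matmul(a, scores), b_t)
        scores = [
            [alpha * s + (1.0 - alpha) * y for s, y in zip(s_row, y_row)]
            for s_row, y_row in zip(spread, seeds)
        ]
    return scores


# Toy data: graph G has nodes 0 and 1 linked and node 2 isolated; graph H has two linked nodes.
adj_g = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
adj_h = [[0, 1], [1, 0]]
seeds = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # only the pair (0, 0) is labeled
scores = propagate_on_product_graph(adj_g, adj_h, seeds)
```

In this toy example the unlabeled pair (1, 1) receives a positive score because it neighbors the labeled pair (0, 0) in the product graph, while pairs involving the isolated node 2 stay at zero.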
C. Participants and Publications
We have published our work in ICML 2016, AISTATS 2016, TKDD 2015, UAI 2014, ICML 2014, KDD 2013, ICML 2013 and NIPS 2012.
D. Intellectual Impact
Our work has benefited the machine learning and information retrieval disciplines in multiple ways. Firstly, our preprocessed CiteSeer and arXiv datasets directly help the related research community by giving immediate access to clean multifaceted data with many fields and metadata. Secondly, our MFHT-vMF model provides comprehensive views of big data with hierarchically structured organization, temporal profiling and multi-aspect latent topics. Our models are generally applicable to a wide variety of data, including time-evolving news stories, research publications, book contents, and so on, and thereby have an immediate and direct impact on other related fields.
E. Broader Impact
Topic detection and tracking based on multi-factorial evidence in scientific and technical literature is extremely important and not yet addressed by the machine learning and information retrieval communities. The productivity of researchers highly depends on the availability of up-to-date information about related work and a global picture about what’s going on in related fields. Strategic plans and funding decisions by government agencies (such as NSF, NIH, DARPA and IARPA) also depend on informative overviews of scientific emergence and co-emergence within and across many fields of research, along with evidence of their impact. Industries (both large ones such as Google, Microsoft and Yahoo! and small ones such as many start-ups) desperately want mf-TDT techniques in order to effectively assess and predict the impact of new technologies and to dynamically adjust their investment strategies. Furthermore, education in all universities requires instructors and students to have comprehensive and up-to-date understanding about how science and technologies are evolving over time, how multiple fields relate to each other, and which technologies trigger rapid developments of other technologies. The techniques developed in this project provide principled and effective solutions for mf-TDT with a broad future impact in the applications listed above and beyond.
Last Modified: 10/30/2016
Modified by: Yiming Yang