NSF Award Search: Award # 1216282

Award Abstract # 1216282

III: Small: Multi-field Hierarchical Discovery and Tracking (mf-HDT) of Emerging Topics

NSF Org:	IIS Division of Information & Intelligent Systems
Recipient:	CARNEGIE MELLON UNIVERSITY
Initial Amendment Date:	September 6, 2012
Latest Amendment Date:	May 21, 2014
Award Number:	1216282
Award Instrument:	Standard Grant
Program Manager:	Maria Zemankova IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering
Start Date:	October 1, 2012
End Date:	September 30, 2016 (Estimated)
Total Intended Award Amount:	$499,182.00
Total Awarded Amount to Date:	$515,182.00
Funds Obligated to Date:	FY 2012 = $499,182.00 FY 2014 = $16,000.00
History of Investigator:	Yiming Yang (Principal Investigator) yiming@cs.cmu.edu
Recipient Sponsored Research Office:	Carnegie-Mellon University 5000 FORBES AVE PITTSBURGH PA US 15213-3815 (412)268-8746
Sponsor Congressional District:	12
Primary Place of Performance:	Carnegie-Mellon University PA US 15213-3890
Primary Place of Performance Congressional District:	12
Unique Entity Identifier (UEI):	U3NKNFLNQ613
Parent UEI:	U3NKNFLNQ613
NSF Program(s):	Info Integration & Informatics
Primary Program Source:	01001213DB NSF RESEARCH & RELATED ACTIVIT 01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7364, 7923, 9251
Program Element Code(s):	736400
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

This project addresses the open challenge of Multi-field Hierarchical Discovery and Tracking (mf-HDT) of emerging topics at different levels of granularity, based on combined evidence in heterogeneous data. The technical approach consists of a new Bayesian framework with powerful inference algorithms, namely multi-field Hierarchical Correlated Topic Modeling, for discovering multi-field hierarchies of latent topics, capturing inter-topic and cross-hierarchy correlations, and enabling query-driven threading of topics over a Markov chain of hierarchies. These technical innovations and capabilities go beyond existing Topic Detection and Tracking (TDT) methods and graphical models used to represent relationships between topics, citations, etc. Significant improvements in both effectiveness and scalability are expected over existing methods, especially in detecting newly emerging topics and tracking time-sensitive impact. The proposed approach will be evaluated on four large datasets of scientific literature in a broad range of fields (Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics) as well as news stories, with human-produced queries and relevance judgments and human-assigned topic labels to support task-oriented evaluations.

The productivity of researchers, educational practitioners and students, government agencies supporting research, and industries depends heavily on the availability of up-to-date big pictures of scientific emergence and co-emergence within and across many fields, along with evidence of the impact of new technologies and of research or development funding. The proposed techniques, if successful, will provide principled and effective solutions with a broad future impact in the applications above and beyond. The project website (http://nyc.lti.cs.cmu.edu/mfhdt/) will provide access to open-source software, datasets, results and publications in order to enable comparative evaluations and further studies by related research communities. The students involved in the project benefit from direct experience with using and evaluating cutting-edge information technologies in real-world applications. This complements classroom teaching, as the students can observe first-hand the direct implications of choosing various strategies for categorization, active learning and distributed computing.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Hanxiao Liu and Yiming Yang "Bipartite Edge Prediction via Transductive Learning over Product Graphs" International Conference on Machine Learning, 2015

Hanxiao Liu and Yiming Yang "Cross-Graph Learning of Multi-Relational Associations" International Conference on Machine Learning, 2016

Hanxiao Liu and Yiming Yang "Semi-supervised Learning with Adaptive Spectral Transform" Artificial Intelligence and Statistics (AISTATS), 2016

Siddharth Gopal and Yiming Yang "Hierarchical Bayesian Inference and Recursive Regularization for Large-scale Classification" Transactions in Knowledge Discovery and Datamining, 2015

Siddharth Gopal and Yiming Yang "Recursive Regularization for Large-scale Classification with Hierarchical and Graphical Dependencies" Special Interest Group in Knowledge Discovery & Datamining, 2013

Siddharth Gopal and Yiming Yang "Distributed Training for Large-scale Logistic Models" International Conference on Machine Learning, 2013

Siddharth Gopal and Yiming Yang "Transformation-based Probabilistic Clustering with Supervision" Uncertainty in Artificial Intelligence, 2014

Siddharth Gopal and Yiming Yang "Von Mises-Fisher Clustering Models" International Conference on Machine Learning, 2014

Siddharth Gopal, Yiming Yang, Bing Bai and Alexandru Niculescu-Mizil "Bayesian Models for Large-scale Hierarchical Classification" Neural Information Processing Systems, 2012

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

A. Scientific Goals

Our goal is to address the open challenge of multi-factorial Topic Detection and Tracking over evolving topics, such as scientific trends or political developments, at different levels of granularity. We accomplish this goal via a multi-factorial representation of co-occurring entities, features (words), links, impact indicators and other meta-level features in publication records or news-story documents. Specifically, we developed a family of novel Bayesian graphical models, including temporal and hierarchical versions of von Mises-Fisher (vMF) models, with highly scalable algorithms for large-scale inference.
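
For readers unfamiliar with the building block of these models: a vMF distribution places probability mass on unit vectors (for example, length-normalized document vectors) around a mean direction mu, with a concentration parameter kappa controlling how tightly the mass clusters. The following is a minimal sketch of the standard vMF log-density only, not the project's temporal or hierarchical models. The function names, the power-series evaluation of the Bessel function, and the `terms` cutoff are illustrative assumptions that are adequate only for moderate kappa.

```python
import math


def log_bessel_i(order, kappa, terms=200):
    """Log of the modified Bessel function I_order(kappa), kappa > 0.

    Uses the power series sum_m (kappa/2)^(2m+order) / (m! * Gamma(m+order+1)),
    summed in log space. The fixed number of terms is an assumption that
    suits moderate kappa only.
    """
    logs = [
        (2 * m + order) * math.log(kappa / 2.0)
        - math.lgamma(m + 1)
        - math.lgamma(m + order + 1)
        for m in range(terms)
    ]
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))


def vmf_log_density(x, mu, kappa):
    """Log-density of a vMF distribution at unit vector x.

    x and mu are equal-length sequences assumed to have unit L2 norm
    (dimension d >= 2); kappa > 0 is the concentration.
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    d = len(x)
    order = d / 2.0 - 1.0
    # Normalizer: kappa^(d/2-1) / ((2*pi)^(d/2) * I_{d/2-1}(kappa))
    log_norm = (
        order * math.log(kappa)
        - (d / 2.0) * math.log(2.0 * math.pi)
        - log_bessel_i(order, kappa)
    )
    dot = sum(a * b for a, b in zip(x, mu))
    return log_norm + kappa * dot
```

In a clustering setting, each cluster would have its own mean direction and concentration, and a document would be scored against every cluster with `vmf_log_density`; how those parameters are tied across time steps and hierarchy levels is specific to the models described in this report and is not shown here.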

B. Major Outcomes

1. Temporal vMF modeling: For temporal data streams, it is natural to analyze how latent clusters in the data evolve over time. Figure 1 compares our temporal vMF model against the flat vMF model on a subset of the TDT5 corpus of news stories, plotting the likelihood of the data at the next time step when the data from the previous time steps are used for training.

2. Multi-field hierarchical & temporal vMF (MFHT-vMF): We extended our temporal vMF models from single-field clustering to multi-field clustering, and from a flat level to multiple levels of granularity for hierarchically nested topic models. Figure 2 shows that MFHT-vMF outperforms the other vMF models (flat, temporal alone or hierarchical alone) on CiteSeer data from 1994 to 2004; the metric is the average likelihood of the next time step, with the data from previous time steps used for training. A live demo is available at the project website.

3. vMF clustering with limited supervision: A common problem with clustering is that the generated clusters often do not match user expectations. We extended our vMF model to leverage (limited) supervised information for better clustering of unlabeled data. That is, given a small subset of the ground-truth clusters, our system learns to discover many unknown (new) clusters in a way that is consistent with the ground-truth clusters. Extensive evaluations on popular benchmark data sets (Gopal & Yang, UAI 2014) showed advantageous performance of our proposed model over several state-of-the-art supervised and unsupervised methods (Figure 3).

4. Graph-based Transductive Learning: We further developed a novel method, namely Transductive Learning over Product Graph (TOP), which extracts multi-type associations from different sources of data, maps heterogeneous types of objects and relations onto a unified product graph, and performs joint inference about topic labels of documents via transductive label propagation over the product graph. This approach is particularly effective in transductive learning scenarios where labeled documents are very sparse and unlabeled documents are massively available, and when the manifold structures are highly informative but vary across different fields of co-occurrence data. In our experiments with a subset of DBLP publication records (34K users, 11K papers and 22 venues) and an Enzyme multi-source dataset (445 compounds, 664 proteins), TOP successfully scaled to the large cross-graph inference problem and significantly outperformed other representative approaches (Figure 4).
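
The idea of label propagation over a product graph in item 4 can be illustrated with a small sketch. It is a generic illustration under stated assumptions, not the TOP algorithm itself: it assumes a Kronecker (tensor) product of two graphs, row-normalized (random-walk) propagation, and a fixed damping factor `alpha`, and all function names are hypothetical. The nodes of the product graph are pairs (i, j), and the sketch uses the identity that multiplying by the Kronecker product of P and Q equals computing P F Q^T on the score matrix F, so the product graph is never materialized.

```python
def row_normalize(adj):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def propagate_on_product_graph(adj_g, adj_h, seeds, alpha=0.5, iters=50):
    """Propagate seed scores over the Kronecker product of graphs G and H.

    adj_g is the m x m adjacency of G, adj_h the n x n adjacency of H, and
    seeds is an m x n matrix of known scores for pairs (i, j), zero where
    unknown. Iterates F <- alpha * P F Q^T + (1 - alpha) * seeds, where P
    and Q are the row-normalized adjacencies.
    """
    p = row_normalize(adj_g)
    q_t = transpose(row_normalize(adj_h))
    scores = [row[:] for row in seeds]
    for _ in range(iters):
        spread = matmul(matmul(p, scores), q_t)
        scores = [
            [alpha * spread[i][j] + (1 - alpha) * seeds[i][j]
             for j in range(len(seeds[0]))]
            for i in range(len(seeds))
        ]
    return scores


# Toy example (made-up graphs): G is a 3-node path, H is a single edge.
path_g = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
edge_h = [[0, 1], [1, 0]]
seed_scores = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # only pair (0, 0) is labeled
result = propagate_on_product_graph(path_g, edge_h, seed_scores)
```

In the toy example, the score seeded at pair (0, 0) reaches pair (1, 1) and then pair (2, 0), because in a Kronecker product two pairs are adjacent only when both of their components are adjacent in the respective graphs. Avoiding the explicit product matters at the scale quoted above, since the product of an m-node and an n-node graph has m times n nodes.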

C. Participants and Publications

We have published our work in ICML 2016, AISTATS 2016, TKDD 2015, UAI 2014, ICML 2014, KDD 2013, ICML 2013 and NIPS 2012.

D. Intellectual Impact

Our work has benefited the machine learning and information retrieval disciplines in multiple ways. Firstly, our preprocessed CiteSeer and arXiv datasets directly help the related research community with immediate access to clean multifaceted data with many fields and metadata. Secondly, our MFHT-vMF model provides comprehensive views of big data with hierarchically structured organization, temporal profiling and multi-aspect latent topics. Our models are generally applicable to a wide variety of data, including time-evolving news stories, research publications, book contents, and so on, and thereby have an immediate and direct impact on other related fields.

E. Broader Impact

Topic detection and tracking based on multi-factorial evidence in scientific and technical literature is extremely important and not yet addressed by the machine learning and information retrieval communities. The productivity of researchers highly depends on the availability of up-to-date information about related work and a global picture about what’s going on in related fields. Strategic plans and funding decisions by government agencies (such as NSF, NIH, DARPA and IARPA) also depend on informative overviews of scientific emergence and co-emergence within and across many fields of research, along with evidence of their impact. Industries (both large ones such as Google, Microsoft and Yahoo! and small ones such as many start-ups) desperately want mf-TDT techniques in order to effectively assess and predict the impact of new technologies and to dynamically adjust their investment strategies. Furthermore, education in all universities requires instructors and students to have comprehensive and up-to-date understanding about how science and technologies are evolving over time, how multiple fields relate to each other, and which technologies trigger rapid developments of other technologies. The techniques developed in this project provide principled and effective solutions for mf-TDT with a broad future impact in the applications listed above and beyond.

Last Modified: 10/30/2016
Modified by: Yiming Yang
