
NSF Org: IIS Division of Information & Intelligent Systems
Recipient:
Initial Amendment Date: January 5, 2012
Latest Amendment Date: June 10, 2016
Award Number: 1149789
Award Instrument: Continuing Grant
Program Manager: Sylvia Spengler, sspengle@nsf.gov, (703) 292-7347, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date: January 1, 2012
End Date: December 31, 2017 (Estimated)
Total Intended Award Amount: $496,640.00
Total Awarded Amount to Date: $496,640.00
Funds Obligated to Date: FY 2013 = $95,167.00; FY 2014 = $99,087.00; FY 2015 = $103,247.00; FY 2016 = $107,668.00
History of Investigator:
Recipient Sponsored Research Office: 2550 NORTHWESTERN AVE # 1100, WEST LAFAYETTE, IN 47906-1332, US, (765) 494-1055
Sponsor Congressional District:
Primary Place of Performance: 305 N. University Street, IN 47907-2107, US
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001314DB NSF RESEARCH & RELATED ACTIVIT; 01001415DB NSF RESEARCH & RELATED ACTIVIT; 01001516DB NSF RESEARCH & RELATED ACTIVIT; 01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
CAREER: Machine Learning Methods and Statistical Analysis Tools for Single Network Domains
Machine learning researchers focus on two distinct learning scenarios for structured network data (i.e., where there are statistical dependencies among the attributes of linked nodes). In the first scenario, the domain consists of a population of structured examples (e.g., chemical compounds), and we can reason about learning algorithms asymptotically, as the number of structured examples increases. In the second scenario, the domain consists of a single, potentially infinite-sized network (e.g., the World Wide Web). In these "single network" domains, an increase in data corresponds to acquiring a larger portion of the underlying network. Even when a set of network samples is available for learning and prediction, the samples correspond to subnetworks drawn from the same underlying network and thus may be dependent.
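To make that dependence concrete, here is a minimal sketch (not drawn from the award itself) in which two subnetwork samples are extracted from one hypothetical underlying graph; because they share nodes and edges, they are not independent draws the way two separate chemical compounds would be. The graph generator, seed nodes, and sampling radius are illustrative assumptions.

```python
# Illustrative sketch: two "samples" taken from the same underlying network
# overlap, so they are not independent observations.
import networkx as nx

# Hypothetical underlying network standing in for a large single-network domain.
G = nx.barabasi_albert_graph(n=2000, m=3, seed=0)

# Two subnetwork samples: 2-hop ego networks around different seed nodes.
sample_a = nx.ego_graph(G, 0, radius=2)
sample_b = nx.ego_graph(G, 1, radius=2)

shared_nodes = set(sample_a.nodes()) & set(sample_b.nodes())
print(f"sample A: {sample_a.number_of_nodes()} nodes, "
      f"sample B: {sample_b.number_of_nodes()} nodes, "
      f"shared: {len(shared_nodes)}")
```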
Although estimation and inference methods from the field of statistical relational learning have been successfully applied in single-network domains, the algorithms were initially developed for populations of networks, and thus the theoretical foundation for learning and inference in single networks is scant. This work focuses on the development of robust statistical methods for single network domains, since large network datasets about complex systems rarely have more than a few subnetworks available for model estimation and evaluation. Specifically, the aims of the project include (1) strengthening the theoretical foundation for learning in single network domains, (2) creating accurate methods for determining the significance of discovered patterns and features, (3) formulating novel model selection and evaluation methods, and (4) developing improved approaches for network learning and prediction based on the unique characteristics of single network domains.
The research will enhance our understanding of the mechanisms that influence the performance of network analysis methods and drive the development of novel methods for complex network domains. Expanding the applicability of machine learning techniques to single network domains could have a transformational impact across a broad range of areas (e.g., psychology, communications, education, political science) where current methods limit research to the investigation of processes in dyad or small-group settings. The project results will also serve as an example application of computer science in the broader network science context, which will attract and retain students who might not otherwise be engaged by conventional CS topics. For more details see:
http://www.cs.purdue.edu/homes/neville/research-nsf-career.html
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Although estimation and inference methods from the field of statistical relational learning have been successfully applied in single-network domains, the theoretical foundation for learning in single network domains is scant. In addition, the development of predictive models has mainly focused on increasing classification accuracy and relatively little attention has been paid to the issues of feature selection and hypothesis testing. Lastly, the structure of a network dataset is affected by both the sampling process and the availability of node/link information. In single network domains, due to the complex dependencies among instances, these factors may have a large impact on the quality of learning/inference. This project aimed to explore these issues to strengthen the foundation for statistical relational learning.
Intellectual merit: We investigated these issues from several angles (analytical, empirical, and algorithmic) while trying to bridge the gap between structured machine learning methods and statistical network analysis. We examined the underlying assumptions, problem formulations, and methodology used for learning over networks, and used this understanding to develop theoretical characterizations of network learning and inference methods. This analysis informed the development of diverse algorithms spanning the full machine learning spectrum, including sampling, learning, inference, hypothesis testing, model selection, descriptive modeling, and evaluation. We discuss two illustrative efforts from the project below.
Much of the work in relational learning has been based on an (implicit) assumption that a fully labeled network is available for learning. In this project, we focused on semi-supervised learning in partially labeled networks. Although relational EM methods can significantly improve predictive performance in densely labeled networks, they could not achieve the same gains in sparsely labeled networks and often performed worse than methods that ignore the relational information altogether. We showed that the fixed-point methods that relational EM uses for approximate learning and inference produce errors that prevent convergence in sparsely labeled networks. A central finding is the tie between the stability of relational inference and the maximal eigenvalue of the inference solution, which can be used to determine whether a particular network is prone to convergence issues based on the structure of label availability. As a solution, we developed a relational data augmentation method that represents uncertainty over label predictions and parameter estimates to ensure better convergence in all types of networks. To obtain a more practical, scalable solution for massive networks, our key idea was to augment the inference step of the fixed-point EM algorithm with a maximum entropy constraint, in which the predicted label proportions (i.e., percentage predicted positive vs. negative) are forced to align with the proportions observed in the training set. The method adjusts the label predictions at every step of collective inference so they adhere to the maximum entropy constraint.
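As a rough illustration of the proportion-matching idea described above, the sketch below runs a simple relational-neighbor style collective inference loop and, at each iteration, shifts the scores of the unlabeled nodes so that the predicted positive rate approximately matches the rate observed among the labeled training nodes. This is a minimal approximation of the concept, not the project's actual algorithm; the propagation rule, the threshold adjustment, and the function name are assumptions made for illustration.

```python
# Minimal sketch of proportion-constrained collective inference; NOT the
# project's exact method. Propagation and adjustment rules are illustrative.
import numpy as np

def collective_inference_with_proportion_constraint(adj, labels, train_mask,
                                                    n_iters=20):
    """adj: (n, n) symmetric 0/1 adjacency matrix.
    labels: 0/1 array, meaningful only where train_mask is True.
    Returns a score in [0, 1] for every node (higher = more likely positive)."""
    target_rate = labels[train_mask].mean()          # training label proportion
    scores = np.full(adj.shape[0], target_rate, dtype=float)
    scores[train_mask] = labels[train_mask]          # clamp observed labels
    deg = adj.sum(axis=1).clip(min=1)

    for _ in range(n_iters):
        scores = adj @ scores / deg                  # relational neighbor average
        scores[train_mask] = labels[train_mask]      # re-clamp observed labels

        # Proportion adjustment: shift unlabeled scores so the fraction above
        # 0.5 approximately matches the positive rate seen in the training set.
        unl = ~train_mask
        thresh = np.quantile(scores[unl], 1.0 - target_rate)
        scores[unl] = np.clip(scores[unl] - thresh + 0.5, 0.0, 1.0)
    return scores
```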
Another important task in network analysis is the detection of anomalous events in a network time series. Previous work has focused on detecting anomalies through the use of network statistics that summarize the behavior of the network. However, we showed that many common network statistics depend on confounding factors such as the total number of nodes or edges, and that using these statistics leads to incorrect conclusions (i.e., both false positives and false negatives). We formalized the use of network statistics in graphs of varying sizes and showed that detection errors are due to the use of size-inconsistent or size-divergent statistics. To remedy the issue, we introduced the statistical property of size consistency for network statistics and showed theoretically how the property ensures accurate detection of changes or anomalies in dynamic networks. We then proposed three provably size-consistent network statistics: mass shift, degree shift, and triangle probability. With extensive synthetic and semi-synthetic experiments, we demonstrated that anomaly detectors using the new size-consistent statistics exhibit superior detection performance in dynamic networks where the message and edge counts vary.
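To illustrate why size matters, the sketch below contrasts a raw edge count, which grows with the number of observed nodes even when behavior is unchanged, against edge density, a size-adjusted alternative. This is a generic example, not the mass shift, degree shift, or triangle probability statistics themselves; the graph model, parameters, and statistic are assumptions chosen only to show the contrast.

```python
# Illustrative contrast between a size-dependent statistic (raw edge count)
# and a size-adjusted one (edge density); NOT the statistics from the project.
import numpy as np
import networkx as nx

def edge_density(G):
    """Observed edges divided by possible edges."""
    n = G.number_of_nodes()
    return 0.0 if n < 2 else G.number_of_edges() / (n * (n - 1) / 2)

# Hypothetical dynamic network: the same edge probability every week, but the
# number of active nodes fluctuates, which inflates or deflates raw counts.
rng = np.random.default_rng(0)
snapshots = [nx.gnp_random_graph(int(n), 0.05, seed=int(n))
             for n in rng.integers(200, 800, size=30)]

raw_counts = np.array([G.number_of_edges() for G in snapshots])
densities = np.array([edge_density(G) for G in snapshots])

# The raw count swings by an order of magnitude purely from size changes,
# while the density stays close to the true edge probability of 0.05.
print("raw edge count: min=%d max=%d" % (raw_counts.min(), raw_counts.max()))
print("edge density:   min=%.3f max=%.3f" % (densities.min(), densities.max()))
```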
Broader impact: The work described above, while published in data mining and machine learning venues, is likely to have a broad impact across network-centric fields where researchers study the properties of complex systems by modeling the data as a network or graph. In these fields, a deeper understanding of the statistical and theoretical properties of analytic methods, as well as improved algorithms, will increase the accuracy and robustness of these investigations. Moreover, this project has had a small, but important, impact on the education and training of several students from underrepresented minority (URM) groups. As these individuals graduate and move into academic and industrial STEM settings, they will serve as role models in the areas of machine learning, graph mining, and network science more broadly.
Last Modified: 06/19/2019
Modified by: Jennifer Neville