Award Abstract # 1149789
CAREER: Machine Learning Methods and Statistical Analysis Tools for Single Network Domains

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: PURDUE UNIVERSITY
Initial Amendment Date: January 5, 2012
Latest Amendment Date: June 10, 2016
Award Number: 1149789
Award Instrument: Continuing Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: January 1, 2012
End Date: December 31, 2017 (Estimated)
Total Intended Award Amount: $496,640.00
Total Awarded Amount to Date: $496,640.00
Funds Obligated to Date: FY 2012 = $91,471.00
FY 2013 = $95,167.00
FY 2014 = $99,087.00
FY 2015 = $103,247.00
FY 2016 = $107,668.00
History of Investigator:
  • Jennifer Neville (Principal Investigator)
Recipient Sponsored Research Office: Purdue University
2550 NORTHWESTERN AVE # 1100
WEST LAFAYETTE
IN  US  47906-1332
(765)494-1055
Sponsor Congressional District: 04
Primary Place of Performance: Purdue University
305 N. University Street
IN  US  47907-2107
Primary Place of Performance Congressional District: 04
Unique Entity Identifier (UEI): YRXVL4JYCEF5
Parent UEI: YRXVL4JYCEF5
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
01001314DB NSF RESEARCH & RELATED ACTIVIT
01001415DB NSF RESEARCH & RELATED ACTIVIT
01001516DB NSF RESEARCH & RELATED ACTIVIT
01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1045
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

CAREER: Machine Learning Methods and Statistical Analysis Tools for Single Network Domains

Machine learning researchers focus on two distinct learning scenarios for structured network data (i.e., where there are statistical dependencies among the attributes of linked nodes). In the first scenario, the domain consists of a population of structured examples (e.g., chemical compounds) and we can reason about learning algorithms asymptotically, as the number of structured examples increases. In the second scenario, the domain consists of a single, potentially infinite-sized network (e.g., the World Wide Web). In these "single network" domains, an increase in data corresponds to acquiring a larger portion of the underlying network. Even when a set of network samples is available for learning and prediction, the samples correspond to subnetworks drawn from the same underlying network and thus may be dependent.

Although estimation and inference methods from the field of statistical relational learning have been successfully applied in single-network domains, the algorithms were initially developed for populations of networks, and thus the theoretical foundation for learning and inference in single networks is scant. This work focuses on the development of robust statistical methods for single network domains, since large network datasets about complex systems rarely have more than a few subnetworks available for model estimation and evaluation. Specifically, the aims of the project include (1) strengthening the theoretical foundation for learning in single network domains, (2) creating accurate methods for determining the significance of discovered patterns and features, (3) formulating novel model selection and evaluation methods, and (4) developing improved approaches for network learning and prediction based on the unique characteristics of single network domains.

The research will enhance our understanding of the mechanisms that influence the performance of network analysis methods and drive the development of novel methods for complex network domains. Expanding the applicability of machine learning techniques for single network domains could have a transformational impact across a broad range of areas (e.g., psychology, communications, education, political science) where current methods limit research to the investigation of processes in dyad or small group settings. Also, the project results will serve as an example application of computer science in the broader network science context, which will attract and retain students who might not otherwise be engaged by conventional CS topics. For more details see:
http://www.cs.purdue.edu/homes/neville/research-nsf-career.html

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


C. Meng, C. Mouli, B. Ribeiro, and J. Neville. "Subgraph Pattern Neural Networks for High-Order Graph Evolution Prediction." Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.
J. Moore and J. Neville. "Deep Collective Inference." Proceedings of the 31st AAAI Conference on Artificial Intelligence, 2017.
J. Moore, S. Mussmann, J. Pfeiffer III, and J. Neville. "Incorporating Assortativity and Degree Dependence into Scalable Network Models." Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.
J. Pfeiffer III, J. Neville, and P. Bennett. "Overcoming Relational Learning Biases to Accurately Predict Preferences in Large Scale Networks." Proceedings of the 24th International World Wide Web Conference (WWW), 2015.
J. Yang, B. Ribeiro, and J. Neville. "Should We Be Confident in Peer Effects Estimated From Partial Crawls of Social Networks?" Proceedings of the 11th International AAAI Conference on Weblogs and Social Media, 2017.
J. Yang, V. Rao, and J. Neville. "Decoupling Homophily and Reciprocity with Latent Space Network Models." Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, 2017.
N. Ahmed, J. Neville, R. Rossi, and N. Duffield. "Efficient Graphlet Counting for Large Networks." Proceedings of the 15th IEEE International Conference on Data Mining, 2015.
N. Ahmed, J. Neville, and R. Kompella. "Network Sampling: From Static to Streaming Graphs." Transactions on Knowledge Discovery and Data Mining, 2014.
N. Ahmed, J. Neville, R. Rossi, N. Duffield, and T. Willke. "Graphlet Decomposition: Framework, Algorithms, and Applications." Knowledge and Information Systems, v.50, 2017.
P. Robles, S. Moreno, and J. Neville. "Sampling of Attributed Networks from Hierarchical Generative Models." Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2016.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Although estimation and inference methods from the field of statistical relational learning have been successfully applied in single-network domains, the theoretical foundation for learning in single network domains is scant. In addition, the development of predictive models has mainly focused on increasing classification accuracy and relatively little attention has been paid to the issues of feature selection and hypothesis testing. Lastly, the structure of a network dataset is affected by both the sampling process and the availability of node/link information. In single network domains, due to the complex dependencies among instances, these factors may have a large impact on the quality of learning/inference. This project aimed to explore these issues to strengthen the foundation for statistical relational learning.

Intellectual merit: We investigated these issues from several angles---analytical, empirical, and algorithmic---while trying to bridge the gap between structured machine learning methods and statistical network analysis. We thought carefully about the underlying assumptions, problem formulations, and methodology used for learning over networks, and used that understanding to develop theoretical characterizations of network learning and inference methods. This analysis informed the development of diverse algorithms spanning the full machine learning spectrum, including sampling, learning, inference, hypothesis testing, model selection, descriptive modeling, and evaluation. We discuss two illustrative efforts from the project below.

Much of the work in relational learning has been based on an (implicit) assumption that a fully labeled network is available for learning. In this project, we focused on semi-supervised learning in partially labeled networks. Although relational EM methods can significantly improve predictive performance in densely labeled networks, they cannot achieve the same gains in sparsely labeled networks and often perform worse than methods that ignore the relational information altogether. We showed that the fixed-point methods relational EM uses for approximate learning and inference produce errors that prevent convergence in sparsely labeled networks. A central finding is the tie between the stability of relational inference and the maximal eigenvalue of the inference solution, which can be used to determine whether a particular network is prone to convergence issues based on the structure of label availability. As a solution, we developed a relational data augmentation method that represents uncertainty over label predictions and parameter estimates to ensure better convergence in all types of networks. To obtain a more practical, scalable solution for massive networks, our key idea was to augment the inference step of the fixed-point EM algorithm with a maximum entropy constraint, where the predicted label proportions (i.e., the percentage predicted positive vs. negative) are forced to align with the proportions observed in the training set. The method adjusts the label predictions at every step of collective inference so they adhere to this constraint.
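The proportion-matching idea can be sketched minimally, assuming a binary classification task: a single bias term, found by bisection in log-odds space, shifts every node's predicted probability so that the mean prediction matches the training-set positive rate. The function name and interface here are hypothetical illustrations, not the project's actual implementation, which applies such a constraint within each collective inference step.

```python
import numpy as np

def maxent_adjust(probs, target_pos_rate, tol=1e-6):
    """Shift predictions in log-odds space by a single bias (found via
    bisection) so the mean predicted positive probability matches the
    target proportion. A monotone shift preserves the ranking of nodes.
    Hypothetical sketch, not the project's actual method."""
    logits = np.log(probs) - np.log1p(-probs)
    lo, hi = -20.0, 20.0            # search range for the bias term
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        adjusted = 1.0 / (1.0 + np.exp(-(logits + mid)))
        if adjusted.mean() < target_pos_rate:
            lo = mid                # mean too low: raise the bias
        else:
            hi = mid                # mean too high: lower the bias
    return 1.0 / (1.0 + np.exp(-(logits + (lo + hi) / 2.0)))
```

Because the adjustment is a constant shift in log-odds, the relative ordering of node predictions is unchanged; only the calibration of the proportions moves toward the training set's label distribution.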

Another important task in network analysis is the detection of anomalous events in a network time series. Previous work has focused on detecting anomalies through the use of network statistics that summarize the behavior of the network. However, we showed that many common network statistics depend on confounding factors such as the total number of nodes or edges, and that use of these statistics leads to incorrect conclusions (i.e., both false positives and false negatives). We formalized the use of network statistics in graphs of varying sizes and showed that detection errors are due to the use of size-inconsistent or size-divergent statistics. To remedy the issue, we introduced the statistical property of size consistency for network statistics and showed theoretically how the property ensures accurate detection of changes or anomalies in dynamic networks. We then proposed three provably size-consistent network statistics: mass shift, degree shift, and triangle probability. With extensive synthetic and semi-synthetic experiments, we demonstrated that anomaly detectors using the new size-consistent statistics exhibit superior detection performance in dynamic networks where the message and edge counts vary.
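The formal definitions of mass shift, degree shift, and triangle probability appear in the project's publications; as a rough illustration of the normalization idea behind size consistency, a closure-style statistic can be computed as a ratio rather than a raw count. The sketch below (names hypothetical) measures the fraction of wedges, i.e., length-2 paths, that close into triangles, which does not grow mechanically with graph size the way a raw triangle count does.

```python
from itertools import combinations

def triangle_probability(adj):
    """Fraction of wedges (length-2 paths) that close into triangles.
    `adj` maps each node to the set of its neighbors. Unlike a raw
    triangle count, this ratio is not mechanically inflated as the
    number of nodes or edges grows. Illustrative sketch only; the
    project's formal statistic is defined in its publications."""
    closed = 0   # wedges whose two endpoints are also connected
    wedges = 0   # all length-2 paths, counted once at the center vertex
    for v, nbrs in adj.items():
        wedges += len(nbrs) * (len(nbrs) - 1) // 2
        for u, w in combinations(nbrs, 2):
            if w in adj[u]:
                closed += 1
    return closed / wedges if wedges else 0.0
```

On a complete graph every wedge closes (ratio 1.0), and on a tree no wedge closes (ratio 0.0), regardless of how many nodes either graph has; a raw triangle count, by contrast, would scale with size even when the structure is unchanged.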

Broader impact: The work described above, while published in data mining and machine learning venues, is likely to have a broad impact across network-centric fields where researchers study the properties of complex systems by modeling the data as a network or graph. In these fields, a deeper understanding of the statistical and theoretical properties of analytic methods, as well as improved algorithms, will increase the accuracy and robustness of these investigations. Moreover, this project has had a small, but important, impact on the education and training of several URM students. As these individuals graduate and move to academic and industrial STEM settings, they will serve as role models in the areas of machine learning, graph mining, and network science more broadly.

Last Modified: 06/19/2019
Modified by: Jennifer Neville

