Award Abstract # 1916378
Scalable Statistical Inference in Small-World Networks

NSF Org: DMS
Division Of Mathematical Sciences
Recipient: UNIVERSITY OF WISCONSIN SYSTEM
Initial Amendment Date: August 9, 2019
Latest Amendment Date: August 9, 2019
Award Number: 1916378
Award Instrument: Standard Grant
Program Manager: Yong Zeng
yzeng@nsf.gov
 (703)292-7299
DMS
 Division Of Mathematical Sciences
MPS
 Directorate for Mathematical and Physical Sciences
Start Date: August 15, 2019
End Date: July 31, 2023 (Estimated)
Total Intended Award Amount: $299,999.00
Total Awarded Amount to Date: $299,999.00
Funds Obligated to Date: FY 2019 = $299,999.00
History of Investigator:
  • Sebastien Roch (Principal Investigator)
    roch@math.wisc.edu
  • Karl Rohe (Co-Principal Investigator)
Recipient Sponsored Research Office: University of Wisconsin-Madison
21 N PARK ST STE 6301
MADISON
WI  US  53715-1218
(608)262-3822
Sponsor Congressional District: 02
Primary Place of Performance: University of Wisconsin-Madison
21 North Park ST,STE 6401
Madison
WI  US  53715-1218
Primary Place of Performance
Congressional District:
02
Unique Entity Identifier (UEI): LCLSJAGTNZQ7
Parent UEI:
NSF Program(s): STATISTICS
Primary Program Source: 01001920DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):
Program Element Code(s): 126900
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.049

ABSTRACT

Social networks often exhibit "small-world features". For instance, friends typically share many common friends, but most individuals have a limited number of close acquaintances irrespective of the size of the network. Another well-documented feature of such networks is the "six degrees of separation property", whereby most people are a small number of social connections away from one another. Despite their ubiquity, common statistical models of complex networks do not typically generate graphs with these properties. Therefore, the main goal of this project is to address the general lack of plausible and tractable statistical models of small-world networks. Specifically, the PIs will develop a novel framework for the inference of these networks, including statistical models, fast and scalable algorithms, as well as supporting theory. These models and methods will be empirically validated through the development and deployment of techniques that sample large graphs in ways that helps assess them. This new understanding will contribute to ongoing interdisciplinary collaborations in journalism, health care, and law.

Existing probabilistic constructions of small-world networks, i.e., random graphs exhibiting low diameter, sparsity and transitivity, tend to be ad-hoc and, hence, often not suitable for statistical inference. In this project, the PIs will formulate and analyze interpretable statistical models of small-world networks; and develop scalable statistical inference for such models based on both spectral techniques and local sampling. For this purpose, the PIs will develop and explore a family of network models with high-dimensional latent features. The PIs will analyze how traditional algorithms perform in this regime, and will develop and analyze local sampling algorithms, such as respondent-driven sampling. The information-theoretic limit of community detection will also be studied. This grant will support the development of two courses aimed at intermediate undergraduates in UW- Madison's new undergraduate data science degree. These courses will aim to broaden engagement in both data science and social network analysis. This grant will also support the training of PhD students in both Statistics and Mathematics.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Chen, Fan and Rohe, Karl "A New Basis for Sparse Principal Component Analysis" Journal of Computational and Graphical Statistics , 2023 https://doi.org/10.1080/10618600.2023.2256502 Citation Details
Chen, Fan and Zhang, Yini and Rohe, Karl "Targeted sampling from massive block model graphs with personalized PageRank" Journal of the Royal Statistical Society: Series B (Statistical Methodology) , v.82 , 2020 10.1111/rssb.12349 Citation Details
Fan, Wai-Tong Louis and Legried, Brandon and Roch, Sebastien "An impossibility result for phylogeny reconstruction from k-mer counts" The Annals of Applied Probability , v.32 , 2022 https://doi.org/10.1214/22-AAP1805 Citation Details
Fan, Wai-Tong Louis and Legried, Brandon and Roch, Sebastien "Impossibility of Consistent Distance Estimation from Sequence Lengths Under the TKF91 Model" Bulletin of Mathematical Biology , v.82 , 2020 https://doi.org/10.1007/s11538-020-00801-3 Citation Details
Legried, Brandon and Roch, Sebastien "Pairwise sequence alignment at arbitrarily large evolutionary distance" The Annals of Applied Probability , v.34 , 2024 https://doi.org/10.1214/23-AAP2009 Citation Details
Roch, Sebastien "Expanding the Class of Global Objective Functions for Dissimilarity-Based Hierarchical Clustering" Journal of Classification , 2023 https://doi.org/10.1007/s00357-023-09447-x Citation Details
Roch, Sebastien and Wang, Kun-Chieh "Sufficient condition for root reconstruction by parsimony on binary trees with general weights" Electronic Communications in Probability , v.26 , 2021 https://doi.org/10.1214/21-ECP423 Citation Details
Rohe, Karl and Zeng, Muzhe "Vintage factor analysis with Varimax performs statistical inference" Journal of the Royal Statistical Society Series B: Statistical Methodology , v.85 , 2023 https://doi.org/10.1093/jrsssb/qkad029 Citation Details
Tabatabaee, Yasamin and Roch, Sebastien and Warnow, Tandy "QR-STAR: A Polynomial-Time Statistically Consistent Method for Rooting Species Trees Under the Coalescent" Journal of computational biology , v.30 , 2023 Citation Details
Yan, Yuling and Hanlon, Bret and Roch, Sebastien and Rohe, Karl "Asymptotic seed bias in respondent-driven sampling" Electronic Journal of Statistics , v.14 , 2020 10.1214/20-EJS1698 Citation Details

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

In applied multivariate statistics, estimating the number of intrinsic dimensions of a dataset, for example the number of communities in a network, is a hard and fundamental problem. One common diagnostic is known as the scree plot. However, due to overfitting of the data, it is difficult to use as a diagnostic. A key outcome of this collaborative project, involving several graduate students across statistics and mathematics, has been the development of an actionable and inferential notion of model dimension. In particular, it was demonstrated that a simple link resampling procedure for networks (as well as some generalizations) provides a more reliable estimation of the dimension, overcoming the issue of overfitting. Theoretical guarantees were also established; the new approach was tested extensively in simulated and real networks, showing significant improvement over previous approaches.

A complementary line of work seeking to understand high dimensional models involved the study of hierarchical models. Hierarchical social network models have been previously proposed and studied. While these previous models often appear as a very rich class of models, they are often in fact a small subset of the well-known stochastic blockmodels. Moreover, these previous models all make overly restrictive assumptions on the topology of the hierarchy. Such models create community structures in the social network that are too “balanced” and “homogeneous” to represent the heterogeneous structures seen in empirical investigations. Another key outcome was the introduction of a broader class of hierarchical models that bridge the gap between theory and practice - rich enough to represent empirical structures and simple enough to estimate and interpret. In fact, six alternative hierarchical social network models were shown to be equivalent. What emerges is a synthesized view of a model from six different angles. A fast fitting algorithm inspired by a standard phylogenetic inference method was developed and shown to work well in practice. 

Building further on the cross-fertilization with phylogenetics, the problem of hierarchical clustering using measures of dissimilarities was studied from the point of view of an optimization problem. A broad new class of objective functions for this problem were introduced and shown to have desirable theoretical properties. Many common agglomerative and divisive clustering methods were shown to reduce to myopic methods for these objectives, which are inspired by related concepts in phylogenetics.

The project supported the interdisciplinary training of four undergraduate students (in Statistics and Computer Sciences) and nine graduate students (six in Statistics, two in Mathematics, and one in Journalism and Mass Communications) in mathematics, statistics, and software development. Several Ph.D. theses also resulted from the work. Results were presented to professional scientists in multiple peer-reviewed publications, research talks and online resources on computational methods. New course materials on relevant mathematical techniques and their applications were developed and are freely available.

 


Last Modified: 11/29/2023
Modified by: Sebastien Roch

Please report errors in award information by writing to: awardsearch@nsf.gov.

Print this page

Back to Top of page