Extension of Cross-Validation with Confidence to
Determining the Number of Communities in
Stochastic Block Models
Jining Qin
Feb 2025
Jining Qin (CMU Stat & DS) JQ Research Talk Feb 2025 1 / 12
Stochastic block models
Stochastic block models: first proposed in Fienberg and Wasserman (1981).
n nodes with adjacency matrix A ∈ {0, 1}^{n×n}.
Node i is assigned a community label g_i ∈ {1, 2, ..., K}.
Community interaction matrix B ∈ [0, 1]^{K×K}.
A_ij ~ Bernoulli(B_{g_i g_j})
Takes into account 'structural equivalence' in networks; allows for
individual variability.
Problem: where does the K come from?
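The generative model above can be illustrated with a minimal simulation sketch (not the speaker's code; `simulate_sbm` is an illustrative name):

```python
import numpy as np

# Minimal sketch: simulate an undirected SBM with n nodes and
# interaction matrix B (K x K), no self-loops.
def simulate_sbm(n, B, rng=None):
    rng = np.random.default_rng(rng)
    K = B.shape[0]
    g = rng.integers(0, K, size=n)            # community labels g_i
    P = B[np.ix_(g, g)]                       # P_ij = B_{g_i g_j}
    A = (rng.random((n, n)) < P).astype(int)  # A_ij ~ Bernoulli(P_ij)
    A = np.triu(A, 1)                         # keep strict upper triangle,
    A = A + A.T                               # symmetrize; zero diagonal
    return A, g

# Two assortative communities: dense within, sparse between
B = np.array([[0.5, 0.1],
              [0.1, 0.5]])
A, g = simulate_sbm(200, B, rng=0)
```

The community labels `g` and the matrix `B` are exactly the latent quantities whose dimension K the rest of the talk tries to select.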
Model Selection for Stochastic Block Models:
Cross-validation and CVC
Cross-validation for SBMs in the literature: Hoff (2008), Li, Levina and Zhu
(2017), Chen and Lei (2017).
Limitation of classical cross-validation:
Shao (1993) and Zhang (1993)
Classical cross-validation cannot achieve model selection consistency unless
the test set dominates the training set in size, which neither
leave-one-out CV nor k-fold cross-validation satisfies.
Lei (2017)
Cross-validation with confidence (CVC) accounts for the randomness in the test
set to avoid a bias towards over-fitting. It was developed in the linear
regression context, but is extendable to other settings.
CVC on SBM: big picture
Split: split the node pairs into a training set and a test set.
Estimate: obtain community labels ĝ^(K) and interaction matrix B̂^(K)
from the training set.
Evaluate: calculate L(B̂^(K)_{ĝ_i ĝ_j}, A_ij) for (i, j) ∈ N_te and each K ∈ K, the set of
candidate K's.
Test: test whether there exists a K̃ ≠ K that significantly outperforms K.
Selection: the confidence set consists of the K's not rejected; pick the smallest K in
the confidence set.
1. Splitting & 2. Estimation
Figure: Block-wise node-pair splitting [Chen and Lei 2017].
Block-wise splitting: nodes are treated as the sampled entities, giving a
rectangular training set.
Rectangular spectral clustering for community label estimation: take the SVD
of the adjacency matrix, A = UΣVᵀ, and run k-means on the row vectors of the
first d columns of U.
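The estimation step above can be sketched as follows (an illustrative implementation, not the talk's code; d = K columns and farthest-point k-means initialization are my choices):

```python
import numpy as np

# Sketch: rectangular spectral clustering on a block-wise training split.
# A_tr is the rectangular block of the adjacency matrix for training nodes.
def rect_spectral_cluster(A_tr, K, n_iter=30, seed=0):
    # SVD of the rectangular training block: A_tr = U S V^T
    U, _, _ = np.linalg.svd(A_tr, full_matrices=False)
    X = U[:, :K]  # embed each training node as a row (d = K columns)
    rng = np.random.default_rng(seed)
    # Plain Lloyd's k-means on the rows of X, farthest-point initialization
    idx = [int(rng.integers(len(X)))]
    for _ in range(1, K):
        d = ((X[:, None, :] - X[idx][None]) ** 2).sum(-1).min(1)
        idx.append(int(d.argmax()))
    centers = X[idx].copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(0)
    return labels
```

On a clean two-block training matrix this recovers the planted communities exactly; in practice a library k-means (e.g. scikit-learn's) would replace the hand-rolled loop.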
3. Evaluation
For each (i, j) in the test set, evaluate the cross-validation loss:
\[
L\big(A_{ij}, \hat B^{(K)}_{\hat g_i^{(K)} \hat g_j^{(K)}}\big)
\]
Options for loss function
Squared error loss: L(A, P) = (A − P)2
Negative log-likelihood: L(A, P) = −A log(P) − (1 − A) log(1 − P)
4. Testing
For each K ∈ {K₁, K₂, ..., K_s}, we have the cross-validated losses
\[
L\big(A_{ij}, \hat B^{(K)}_{\hat g_i^{(K)} \hat g_j^{(K)}}\big)
\]
over the test set.
Test hypotheses:
\[
H_{0,K}:\; \mathbb{E}_{\mathrm{test}}\, L\big(A_{ij}, \hat B^{(K)}_{\hat g_i^{(K)} \hat g_j^{(K)}}\big) \le \mathbb{E}_{\mathrm{test}}\, L\big(A_{ij}, \hat B^{(\tilde K)}_{\hat g_i^{(\tilde K)} \hat g_j^{(\tilde K)}}\big), \quad \forall \tilde K \neq K
\]
\[
H_{1,K}:\; \exists \tilde K \neq K, \quad \mathbb{E}_{\mathrm{test}}\, L\big(A_{ij}, \hat B^{(K)}_{\hat g_i^{(K)} \hat g_j^{(K)}}\big) > \mathbb{E}_{\mathrm{test}}\, L\big(A_{ij}, \hat B^{(\tilde K)}_{\hat g_i^{(\tilde K)} \hat g_j^{(\tilde K)}}\big)
\]
The confidence set consists of the K's for which H_{0,K} is not rejected in the test.
4. Testing
We want to test whether
\[
\xi_{ij}^{K,\tilde K} = L\big(A_{ij}, \hat B^{(K)}_{\hat g_i^{(K)} \hat g_j^{(K)}}\big) - L\big(A_{ij}, \hat B^{(\tilde K)}_{\hat g_i^{(\tilde K)} \hat g_j^{(\tilde K)}}\big)
\]
has a positive mean for at least one K̃ ∈ {1, 2, ..., K_s}, K̃ ≠ K.
Let μ̂^{K,K̃} and μ̂₂^{K,K̃} denote the empirical first and second moments of ξ_ij^{K,K̃}
over all (i, j) pairs.
Test statistic:
\[
T = \max_{\tilde K \neq K} \sqrt{\frac{n^2}{V_2}} \cdot \frac{\hat\mu^{K,\tilde K}}{\sqrt{\hat\mu_2^{K,\tilde K}}}
\]
Obtain the reference distribution of T using the Gaussian multiplier
bootstrap:
\[
T_b^* = \max_{\tilde K \neq K} \sqrt{\frac{n^2}{V_2}} \cdot \frac{1}{V_2} \sum_{(i,j) \in \bar N_v} \frac{\xi_{ij}^{K,\tilde K} - \hat\mu^{K,\tilde K}}{\sqrt{\hat\mu_2^{K,\tilde K}}}\, \zeta_{ij}^{(b)}; \quad b = 1, \dots, B; \quad \zeta_{ij}^{(b)} \sim N(0, 1)
\]
(Alternatively) obtain the reference distribution of T using a
parametric bootstrap: generate bootstrap samples of the test-set
adjacency matrix from the fitted model, assuming the null hypothesis is true.
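The multiplier-bootstrap test above can be sketched as follows (an illustrative implementation, not the talk's code; treating V₂ as the number of test node pairs is my assumption, and `cvc_test` is a hypothetical name):

```python
import numpy as np

# Sketch of the testing step for one candidate K. xi is a list with one
# array per competing K~, holding the loss differences xi_ij^{K,K~} over
# the test pairs. Assumption: V2 = number of test node pairs.
def cvc_test(xi, n, B=1000, rng=0):
    rng = np.random.default_rng(rng)
    stats, boot = [], []
    for x in xi:                        # one array per competing K~
        V2 = x.size
        mu = x.mean()                   # empirical first moment
        mu2 = (x ** 2).mean()           # empirical second moment
        stats.append(np.sqrt(n ** 2 / V2) * mu / np.sqrt(mu2))
        z = rng.standard_normal((B, V2))           # Gaussian multipliers
        boot.append(np.sqrt(n ** 2 / V2)
                    * (z @ ((x - mu) / np.sqrt(mu2))) / V2)
    T = max(stats)                                 # max over competing K~
    T_star = np.max(np.column_stack(boot), axis=1) # bootstrap references
    p_value = np.mean(T_star >= T)
    return T, p_value
```

If some K̃ consistently incurs smaller loss than K, the differences ξ have a positive mean, T is large relative to the bootstrap draws, and H_{0,K} is rejected.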
5. Selection
The confidence set consists of the models (here, values of K) for which H_{0,K} is not
rejected in the test.
Select the smallest K in the confidence set as the output.
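The selection step is a one-liner given per-K p-values from the testing step (names here are illustrative, not the talk's code):

```python
# Sketch: keep the K's whose H_{0,K} is not rejected at level alpha,
# then return the smallest one.
def select_k(p_values, alpha=0.05):
    # p_values: dict mapping candidate K -> p-value of the test of H_{0,K}
    confidence_set = [K for K, p in p_values.items() if p > alpha]
    # Fall back to the smallest candidate if everything is rejected
    return min(confidence_set) if confidence_set else min(p_values)

# e.g. K = 2 rejected, K = 3 and K = 4 retained -> pick 3
# select_k({2: 0.01, 3: 0.40, 4: 0.35}) returns 3
```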
Real-world application: Political books data set
Data set: 105 American political books. An edge indicates that two books were
frequently bought together.
Figure: Left plot: variance explained by each principal component of the political
books adjacency matrix. Right plot: projections of all books onto the first two
principal components of their adjacency matrix, with points labeled by party
affiliation (conservative, liberal, independent).
Real-world application: Political books data set
Selected models Frequency
SBM: 3, DCBM: 2 78
SBM: 3, DCBM: 3 43
SBM: 4, DCBM: 2 15
SBM: 4, DCBM: 3 15
SBM: 2, DCBM: 2 12
SBM: 5, DCBM: 2 8
SBM: 3, DCBM: 4 7
DCBM: 3 5
SBM: 5, DCBM: 3 5
Table: Most frequent model selection results when selecting among both standard
stochastic block models (SBM) and degree-corrected stochastic block models
(DCBM), using spectral weighting of the singular vector matrix. We select the
most parsimonious model within the retained K's in each category. Results that
appear fewer than 5 times out of 200 runs are omitted.
Theoretical Properties
We want to show the validity of our method under modest assumptions:
Under-fitting models are guaranteed to be eliminated with high
probability (covered in the proposal): an under-fitting model incurs a fixed
overhead in loss compared to the correct model.
Over-fitting models will not significantly outperform the optimal model.
This is challenging because over-fitting models' behavior can be arbitrary;
we make a relatively modest community size assumption.
Main result of theoretical studies
Theorem
Under assumptions A.1–A.6, with the squared error loss and block-wise
node-pair splitting, we have the following:
1. For K̃ < K*, we have
\[
P(T_{\tilde K, K^*} > Z_{\alpha_n}) \to 1
\]
where T_{K̃,K*} is the test statistic in our hypothesis test and Z_{α_n} is the
upper α_n quantile of the maximum of a Gaussian vector with covariance equal to
the empirical correlation of the ξ's; α_n ∈ (1/n, 1) and α_n → 0.
2. For K̃ > K*, we have
\[
\sup_{\tilde K > K^*} T_{K^*, \tilde K} = O_P(1)
\]
Since Z_{α_n} → ∞ as n → ∞ and α_n → 0, it follows that for every K̃ > K*,
\[
P(T_{K^*, \tilde K} < Z_{\alpha_n}) \to 1
\]