Extension of Cross-Validation with Confidence to
Determining the Number of Communities in
Stochastic Block Models
Jining Qin
Feb 2025
Jining Qin (CMU Stat & DS) JQ Research Talk Feb 2025 1 / 12
Stochastic block models
Stochastic block models: first proposed in Fienberg and Wasserman (1981).
n nodes with adjacency matrix A ∈ {0, 1}^{n×n}.
Node i is assigned a community label g_i ∈ {1, 2, ..., K}.
Community interaction matrix B ∈ [0, 1]^{K×K}.
A_ij ~ Bernoulli(B_{g_i g_j})
Takes into account 'structural equivalence' in networks; allows for
individual variability.
Problem: where does the K come from?
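The generative model above can be illustrated with a minimal simulation sketch (not the speaker's code; `simulate_sbm` is an illustrative name):

```python
import numpy as np

# Minimal sketch: simulate an undirected SBM with n nodes and
# interaction matrix B (K x K), no self-loops.
def simulate_sbm(n, B, rng=None):
    rng = np.random.default_rng(rng)
    K = B.shape[0]
    g = rng.integers(0, K, size=n)            # community labels g_i
    P = B[np.ix_(g, g)]                       # P_ij = B_{g_i g_j}
    A = (rng.random((n, n)) < P).astype(int)  # A_ij ~ Bernoulli(P_ij)
    A = np.triu(A, 1)                         # keep strict upper triangle,
    A = A + A.T                               # symmetrize; zero diagonal
    return A, g

# Two assortative communities: dense within, sparse between
B = np.array([[0.5, 0.1],
              [0.1, 0.5]])
A, g = simulate_sbm(200, B, rng=0)
```

The community labels `g` and the matrix `B` are exactly the latent quantities whose dimension K the rest of the talk tries to select.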
Model Selection for Stochastic Block Models:
Cross-validation and CVC
Cross-validation for SBMs in the literature: Hoff (2008), Li, Levina and Zhu
(2017), Chen and Lei (2017).
Limitation of classical cross-validation:
Shao (1993) and Zhang (1993)
Classical cross-validation cannot achieve model selection consistency unless
the test set dominates the training set in size, which neither
leave-one-out CV nor k-fold cross-validation satisfies.
Lei (2017)
Cross-validation with confidence (CVC) accounts for the randomness in the test
set to avoid a bias towards over-fitting. It was developed in the linear
regression context, but is extendable to other settings.
CVC on SBM: big picture
Split: split the node pairs into a training set and a test set.
Estimate: obtain community labels ĝ^(K) and interaction matrix B̂^(K)
from the training set.
Evaluate: calculate L(B̂^(K)_{ĝ_i ĝ_j}, A_ij) for (i, j) ∈ N_te and each K ∈ K, the set of
candidate K's.
Test: test whether there exists a K̃ ≠ K that significantly outperforms K.
Selection: the confidence set consists of the K's not rejected; pick the smallest K in
the confidence set.
1. Splitting & 2. Estimation
Figure: Block-wise node-pair splitting [Chen and Lei 2017].
Block-wise splitting: nodes are treated as the sampled entities, giving a
rectangular training set.
Rectangular spectral clustering for community label estimation: take the SVD
of the adjacency matrix, A = UΣVᵀ, and run k-means on the row vectors of the
first d columns of U.
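The estimation step above can be sketched as follows (an illustrative implementation, not the talk's code; d = K columns and farthest-point k-means initialization are my choices):

```python
import numpy as np

# Sketch: rectangular spectral clustering on a block-wise training split.
# A_tr is the rectangular block of the adjacency matrix for training nodes.
def rect_spectral_cluster(A_tr, K, n_iter=30, seed=0):
    # SVD of the rectangular training block: A_tr = U S V^T
    U, _, _ = np.linalg.svd(A_tr, full_matrices=False)
    X = U[:, :K]  # embed each training node as a row (d = K columns)
    rng = np.random.default_rng(seed)
    # Plain Lloyd's k-means on the rows of X, farthest-point initialization
    idx = [int(rng.integers(len(X)))]
    for _ in range(1, K):
        d = ((X[:, None, :] - X[idx][None]) ** 2).sum(-1).min(1)
        idx.append(int(d.argmax()))
    centers = X[idx].copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(0)
    return labels
```

On a clean two-block training matrix this recovers the planted communities exactly; in practice a library k-means (e.g. scikit-learn's) would replace the hand-rolled loop.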
3. Evaluation
For each (i, j) in the test set, evaluate the cross-validation loss:
\[
L\big(A_{ij}, \hat B^{(K)}_{\hat g_i^{(K)} \hat g_j^{(K)}}\big)
\]
Options for loss function
Squared error loss: L(A, P) = (A − P)2
Negative log-likelihood: L(A, P) = −A log(P) − (1 − A) log(1 − P)
4. Testing
For each K ∈ {K₁, K₂, ..., K_s}, we have the cross-validated losses
\[
L\big(A_{ij}, \hat B^{(K)}_{\hat g_i^{(K)} \hat g_j^{(K)}}\big)
\]
over the test set.
Test hypotheses:
\[
H_{0,K}:\; \mathbb{E}_{\mathrm{test}}\, L\big(A_{ij}, \hat B^{(K)}_{\hat g_i^{(K)} \hat g_j^{(K)}}\big) \le \mathbb{E}_{\mathrm{test}}\, L\big(A_{ij}, \hat B^{(\tilde K)}_{\hat g_i^{(\tilde K)} \hat g_j^{(\tilde K)}}\big), \quad \forall \tilde K \neq K
\]
\[
H_{1,K}:\; \exists \tilde K \neq K, \quad \mathbb{E}_{\mathrm{test}}\, L\big(A_{ij}, \hat B^{(K)}_{\hat g_i^{(K)} \hat g_j^{(K)}}\big) > \mathbb{E}_{\mathrm{test}}\, L\big(A_{ij}, \hat B^{(\tilde K)}_{\hat g_i^{(\tilde K)} \hat g_j^{(\tilde K)}}\big)
\]
The confidence set consists of the K's for which H_{0,K} is not rejected in the test.
4. Testing
We want to test whether
\[
\xi_{ij}^{K,\tilde K} = L\big(A_{ij}, \hat B^{(K)}_{\hat g_i^{(K)} \hat g_j^{(K)}}\big) - L\big(A_{ij}, \hat B^{(\tilde K)}_{\hat g_i^{(\tilde K)} \hat g_j^{(\tilde K)}}\big)
\]
has a positive mean for at least one K̃ ∈ {1, 2, ..., K_s}, K̃ ≠ K.
Let μ̂^{K,K̃} and μ̂₂^{K,K̃} denote the empirical first and second moments of ξ_ij^{K,K̃}
over all (i, j) pairs.
Test statistic:
\[
T = \max_{\tilde K \neq K} \sqrt{\frac{n^2}{V_2}} \cdot \frac{\hat\mu^{K,\tilde K}}{\sqrt{\hat\mu_2^{K,\tilde K}}}
\]
Obtain the reference distribution of T using the Gaussian multiplier
bootstrap:
\[
T_b^* = \max_{\tilde K \neq K} \sqrt{\frac{n^2}{V_2}} \cdot \frac{1}{V_2} \sum_{(i,j) \in \bar N_v} \frac{\xi_{ij}^{K,\tilde K} - \hat\mu^{K,\tilde K}}{\sqrt{\hat\mu_2^{K,\tilde K}}}\, \zeta_{ij}^{(b)}; \quad b = 1, \dots, B; \quad \zeta_{ij}^{(b)} \sim N(0, 1)
\]
(Alternatively) obtain the reference distribution of T using a
parametric bootstrap: generate bootstrap samples of the test-set
adjacency matrix from the fitted model, assuming the null hypothesis is true.
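The multiplier-bootstrap test above can be sketched as follows (an illustrative implementation, not the talk's code; treating V₂ as the number of test node pairs is my assumption, and `cvc_test` is a hypothetical name):

```python
import numpy as np

# Sketch of the testing step for one candidate K. xi is a list with one
# array per competing K~, holding the loss differences xi_ij^{K,K~} over
# the test pairs. Assumption: V2 = number of test node pairs.
def cvc_test(xi, n, B=1000, rng=0):
    rng = np.random.default_rng(rng)
    stats, boot = [], []
    for x in xi:                        # one array per competing K~
        V2 = x.size
        mu = x.mean()                   # empirical first moment
        mu2 = (x ** 2).mean()           # empirical second moment
        stats.append(np.sqrt(n ** 2 / V2) * mu / np.sqrt(mu2))
        z = rng.standard_normal((B, V2))           # Gaussian multipliers
        boot.append(np.sqrt(n ** 2 / V2)
                    * (z @ ((x - mu) / np.sqrt(mu2))) / V2)
    T = max(stats)                                 # max over competing K~
    T_star = np.max(np.column_stack(boot), axis=1) # bootstrap references
    p_value = np.mean(T_star >= T)
    return T, p_value
```

If some K̃ consistently incurs smaller loss than K, the differences ξ have a positive mean, T is large relative to the bootstrap draws, and H_{0,K} is rejected.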
5. Selection
The confidence set consists of the models (here, values of K) for which H_{0,K} is not
rejected in the test.
Select the smallest K in the confidence set as the output.
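The selection step is a one-liner given per-K p-values from the testing step (names here are illustrative, not the talk's code):

```python
# Sketch: keep the K's whose H_{0,K} is not rejected at level alpha,
# then return the smallest one.
def select_k(p_values, alpha=0.05):
    # p_values: dict mapping candidate K -> p-value of the test of H_{0,K}
    confidence_set = [K for K, p in p_values.items() if p > alpha]
    # Fall back to the smallest candidate if everything is rejected
    return min(confidence_set) if confidence_set else min(p_values)

# e.g. K = 2 rejected, K = 3 and K = 4 retained -> pick 3
# select_k({2: 0.01, 3: 0.40, 4: 0.35}) returns 3
```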
Real-world application: Political books data set
Data set: 105 American political books. An edge indicates that two books were
frequently bought together.
Figure: Left plot: variance explained by each principal component of the political
books adjacency matrix. Right plot: projections of all books onto the first two
principal components of their adjacency matrix, with points labeled by party
affiliation (conservative, liberal, independent).
Real-world application: Political books data set
Selected models Frequency
SBM: 3, DCBM: 2 78
SBM: 3, DCBM: 3 43
SBM: 4, DCBM: 2 15
SBM: 4, DCBM: 3 15
SBM: 2, DCBM: 2 12
SBM: 5, DCBM: 2 8
SBM: 3, DCBM: 4 7
DCBM: 3 5
SBM: 5, DCBM: 3 5
Table: Most frequent model selection results when selecting among both standard
stochastic block models (SBM) and degree-corrected stochastic block models
(DCBM), using spectral weighting of the singular vector matrix. We select the
most parsimonious model within the retained K's in each category. Results that
appear fewer than 5 times out of 200 runs are omitted.
Theoretical Properties
We want to show the validity of our method under modest assumptions:
Under-fitting models are guaranteed to be eliminated with high
probability (covered in the proposal): an under-fitting model incurs a fixed
overhead in loss compared to the correct model.
Over-fitting models will not significantly outperform the optimal model.
This is challenging because over-fitting models' behavior can be arbitrary;
we make a relatively modest community size assumption.
Main result of theoretical studies
Theorem
Under assumptions A.1–A.6, with the squared error loss and block-wise
node-pair splitting, we have the following:
1. For K̃ < K*, we have
\[
P(T_{\tilde K, K^*} > Z_{\alpha_n}) \to 1
\]
where T_{K̃,K*} is the test statistic in our hypothesis test and Z_{α_n} is the
upper α_n quantile of the maximum of a Gaussian vector with covariance equal to
the empirical correlation of the ξ's; α_n ∈ (1/n, 1) and α_n → 0.
2. For K̃ > K*, we have
\[
\sup_{\tilde K > K^*} T_{K^*, \tilde K} = O_P(1)
\]
Since Z_{α_n} → ∞ as n → ∞ and α_n → 0, it follows that for every K̃ > K*,
\[
P(T_{K^*, \tilde K} < Z_{\alpha_n}) \to 1
\]