
Extension of cross-validation with confidence to determining the number of communities in Stochastic Block Models

Jining Qin

Feb 2025

Stochastic block models

Stochastic block models: first proposed in Fienberg and Wasserman (1981).

$n$ nodes with adjacency matrix $A_{n \times n}$.
Node $i$ is assigned a community label $g_i \in \{1, 2, \ldots, K\}$.
Community interaction matrix $B_{K \times K}$.
$A_{ij} \sim \mathrm{Bernoulli}(B_{g_i g_j})$.
Takes into account the 'structural equivalence' in networks; allows for individual variability.
Problem: where does $K$ come from?
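For concreteness, here is a minimal sketch of how an adjacency matrix can be simulated from this model; the community sizes and the entries of B below are arbitrary toy values, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy parameters: K = 3 communities, n = 120 nodes.
n, K = 120, 3
g = rng.integers(0, K, size=n)                 # community labels g_i in {0, ..., K-1}
B = np.full((K, K), 0.05) + 0.25 * np.eye(K)   # assortative interaction matrix (toy values)

# A_ij ~ Bernoulli(B[g_i, g_j]), symmetrized with an empty diagonal.
P = B[g][:, g]
A = (rng.random((n, n)) < P).astype(int)
A = np.triu(A, 1)
A = A + A.T
```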

Model Selection for Stochastic Block Models: Cross-validation and CVC

Cross-validation for SBM in the literature: Hoff (2008), Li, Levina and Zhu (2017), Chen and Lei (2017).

Limitation of classical cross-validation:

Shao (1993) and Zhang (1993): classical cross-validation cannot achieve model selection consistency unless the testing set dominates the training set in size, which neither leave-one-out CV nor k-fold cross-validation satisfies.

Lei (2017): cross-validation with confidence (CVC) accounts for the randomness in the test set to avoid bias towards over-fitting. Developed in the linear regression context, it is extendable to other settings.

CVC on SBM: big picture

Split: split the data set into training and testing parts.
Estimate: obtain community labels $\hat{g}^{(K)}$ and interaction matrix $B^{(K)}$ from the training part.
Evaluate: calculate $L\big(A_{ij}, B^{(K)}_{\hat{g}^{(K)}_i \hat{g}^{(K)}_j}\big)$ for $(i, j) \in N_{\text{te}}$ and $K \in \mathcal{K}$, the set of candidate $K$'s.
Test: test whether there exists $\tilde{K} \ne K$ that significantly outperforms $K$.
Selection: the confidence set consists of the $K$'s not rejected; pick the smallest $K$ in the confidence set.
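These steps can be read as a single loop over candidate $K$'s. Below is a schematic sketch of that loop; the split, fit, loss, and test steps are passed in as callables because they are detailed on the following slides, and all function and parameter names here are illustrative, not from the talk.

```python
import numpy as np

def cvc_select_K(A, candidate_Ks, split_fn, fit_fn, loss_fn, test_fn):
    """Schematic CVC loop for choosing K in an SBM (all callables are placeholders).

    split_fn(A)          -> (train_index, test_pairs)   # 1. Split
    fit_fn(A, train, K)  -> (g_hat, B_hat)              # 2. Estimate
    loss_fn(a, p)        -> scalar loss                 # 3. Evaluate
    test_fn(K, losses)   -> True if some other K significantly outperforms K  # 4. Test
    """
    train_index, test_pairs = split_fn(A)
    losses = {}
    for K in candidate_Ks:
        g_hat, B_hat = fit_fn(A, train_index, K)
        losses[K] = np.array([loss_fn(A[i, j], B_hat[g_hat[i], g_hat[j]])
                              for (i, j) in test_pairs])
    # 5. Selection: keep the K's whose null hypothesis is not rejected, return the smallest.
    confidence_set = [K for K in candidate_Ks if not test_fn(K, losses)]
    return min(confidence_set)
```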

1. Splitting & 2. Estimation

Figure: Block-wise node-pair splitting [Chen and Lei 2017].

Block-wise splitting: treating nodes as entities gives a rectangular training set.
Rectangular spectral clustering for community label estimation: compute the SVD of the adjacency matrix, $A = U \Sigma V^T$, and run k-means on the row vectors of the first $d$ columns of $U$.
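A minimal sketch of this estimation step, using NumPy's SVD and scikit-learn's KMeans; taking d = K by default and the plug-in estimator of B are assumptions of this sketch rather than details stated on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_labels(A_train, K, d=None, seed=0):
    """SVD of the (possibly rectangular) training block; k-means on the first d columns of U."""
    d = K if d is None else d                       # assumption: use d = K singular vectors
    U, _, _ = np.linalg.svd(A_train, full_matrices=False)
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(U[:, :d])

def estimate_B(A, g, K):
    """Plug-in estimate of the K x K interaction matrix from an integer label array g (added helper)."""
    B = np.zeros((K, K))
    for a in range(K):
        for b in range(K):
            block = A[np.ix_(g == a, g == b)]       # diagonal entries included for simplicity
            B[a, b] = block.mean() if block.size else 0.0
    return B
```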

3. Evaluation

For each $(i, j)$ in the test set, evaluate the cross-validation loss $L\big(A_{ij}, B^{(K)}_{\hat{g}^{(K)}_i \hat{g}^{(K)}_j}\big)$.

Options for loss function:
Squared error loss: $L(A, P) = (A - P)^2$
Negative log-likelihood: $L(A, P) = -A \log(P) - (1 - A) \log(1 - P)$
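Both options translate directly into NumPy; the clipping of P away from 0 and 1 in the second function is an added safeguard against log(0), not something stated on the slide.

```python
import numpy as np

def squared_error_loss(A, P):
    """L(A, P) = (A - P)^2, elementwise."""
    return (A - P) ** 2

def neg_log_likelihood_loss(A, P, eps=1e-12):
    """L(A, P) = -A log(P) - (1 - A) log(1 - P), with eps-clipping to avoid log(0)."""
    P = np.clip(P, eps, 1 - eps)
    return -A * np.log(P) - (1 - A) * np.log(1 - P)
```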

4. Testing

For each $K \in \{K_1, K_2, \ldots, K_s\}$, we have the cross-validated loss $L\big(A_{ij}, B^{(K)}_{\hat{g}^{(K)}_i \hat{g}^{(K)}_j}\big)$ over the test set.

Test hypothesis:

$H_{0,K}$: $E_{\text{test}}\Big[ L\big(A_{ij}, B^{(K)}_{\hat{g}^{(K)}_i \hat{g}^{(K)}_j}\big) \Big] \le E_{\text{test}}\Big[ L\big(A_{ij}, B^{(\tilde{K})}_{\hat{g}^{(\tilde{K})}_i \hat{g}^{(\tilde{K})}_j}\big) \Big]$ for all $\tilde{K} \ne K$

$H_{1,K}$: there exists $\tilde{K} \ne K$ such that $E_{\text{test}}\Big[ L\big(A_{ij}, B^{(K)}_{\hat{g}^{(K)}_i \hat{g}^{(K)}_j}\big) \Big] > E_{\text{test}}\Big[ L\big(A_{ij}, B^{(\tilde{K})}_{\hat{g}^{(\tilde{K})}_i \hat{g}^{(\tilde{K})}_j}\big) \Big]$

The confidence set consists of the $K$'s for which $H_{0,K}$ is not rejected in the test.

4. Testing

Want to test whether $\xi^{K,\tilde{K}}_{ij} = L\big(A_{ij}, B^{(K)}_{\hat{g}^{(K)}_i \hat{g}^{(K)}_j}\big) - L\big(A_{ij}, B^{(\tilde{K})}_{\hat{g}^{(\tilde{K})}_i \hat{g}^{(\tilde{K})}_j}\big)$ has a positive mean for at least one $\tilde{K} \in \{K_1, K_2, \ldots, K_s\}$, $\tilde{K} \ne K$.

Let $\hat{\mu}^{K,\tilde{K}}$ and $\hat{\mu}_2^{K,\tilde{K}}$ denote the empirical first and second moments of $\xi^{K,\tilde{K}}_{ij}$ over all $(i, j)$ pairs.

Test statistic:
$$T = \max_{\tilde{K} \ne K} \sqrt{\frac{n^2}{V^2}} \cdot \frac{\hat{\mu}^{K,\tilde{K}}}{\sqrt{\hat{\mu}_2^{K,\tilde{K}}}}$$

Obtain the reference distribution of $T$ using the Gaussian multiplier bootstrap:
$$T_b^* = \max_{\tilde{K} \ne K} \sqrt{\frac{n^2}{V^2}} \sum_{(i,j) \in \bar{N}_v} \frac{\xi^{K,\tilde{K}}_{ij} - \hat{\mu}^{K,\tilde{K}}}{\sqrt{\hat{\mu}_2^{K,\tilde{K}}}}\, \zeta^{(b)}_{ij}, \qquad b = 1, \ldots, B, \quad \zeta^{(b)}_{ij} \sim N(0, 1)$$

Alternatively, obtain the reference distribution of $T$ using a parametric bootstrap: obtain bootstrap samples of the test-set adjacency matrix generatively, assuming the null hypothesis is true.
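To make the test concrete, here is a sketch of the max-type statistic and its Gaussian multiplier bootstrap for a single candidate K. It assumes the loss differences xi have already been computed over the test pairs, and it uses the standard sqrt(N) multiplier-bootstrap scaling (with N the number of test pairs) in place of the slide's normalization factor, which is an assumption of this sketch.

```python
import numpy as np

def cvc_max_test(xi, n_boot=1000, seed=0):
    """Max statistic over competitors and its Gaussian multiplier bootstrap reference.

    xi: array of shape (m, N); xi[k] holds the loss differences xi^{K,Ktilde}_{ij}
        over the N test pairs for the k-th competitor Ktilde.
    """
    rng = np.random.default_rng(seed)
    m, N = xi.shape
    mu = xi.mean(axis=1)                               # empirical first moments
    mu2 = (xi ** 2).mean(axis=1)                       # empirical second moments
    T = np.max(np.sqrt(N) * mu / np.sqrt(mu2))         # observed max statistic

    # Multiplier bootstrap: reweight centered, standardized differences by N(0, 1) draws.
    zeta = rng.standard_normal((n_boot, N))
    centered = (xi - mu[:, None]) / np.sqrt(mu2)[:, None]
    T_boot = (centered @ zeta.T / np.sqrt(N)).max(axis=0)
    p_value = float(np.mean(T_boot >= T))
    return T, T_boot, p_value
```

A candidate K stays in the confidence set when the resulting p-value is not below the chosen significance level.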

5. Selection

The confidence set consists of the models (here, mainly values of $K$) for which $H_{0,K}$ is not rejected in the test.
Select the smallest $K$ in the confidence set as the output.

Real-world application: Political books data set

Data set: 105 American political books. Edges indicate 'bought together with'. Books are labeled by party affiliation (Conservative, Liberal, Independent).

Figure: Left plot: variance explained by each principal component of the political books adjacency matrix. Right plot: projections of all political books onto the first two principal components of their adjacency matrix.

Real-world application: Political books data set

Selected models      Frequency
SBM: 3, DCBM: 2             78
SBM: 3, DCBM: 3             43
SBM: 4, DCBM: 2             15
SBM: 4, DCBM: 3             15
SBM: 2, DCBM: 2             12
SBM: 5, DCBM: 2              8
SBM: 3, DCBM: 4              7
DCBM: 3                      5
SBM: 5, DCBM: 3              5

Table: Most frequent model selection results when selecting among both standard stochastic block models (SBM) and degree-corrected stochastic block models (DCBM), using spectral weighting of the singular vector matrix. We select the most parsimonious model within the retained K's in each category. Results that appeared fewer than 5 times out of 200 runs are omitted.

Theoretical Properties

We want to show the validity of our method under modest assumptions:

Under-fitting models are guaranteed to be eliminated with high probability. Covered in the proposal: an under-fitting model incurs a fixed overhead compared to the correct model.
Over-fitting models will not significantly outperform the optimal model. This is challenging because the behavior of over-fitting models can be arbitrary; we make a relatively modest community-size assumption.

Main result of theoretical studies

Theorem
Under assumptions A.1 - A.6, with the squared error loss and block-wise node-pair splitting, we have the following:

1. For $\tilde{K} < K^*$, we have
$$P(T_{\tilde{K}, K^*} > Z_{\alpha_n}) \to 1,$$
where $T_{\tilde{K}, K^*}$ is the test statistic in our hypothesis test, and $Z_{\alpha_n}$ is the upper $\alpha_n$ quantile of the maximum of a Gaussian vector with covariance equal to the empirical correlation of the $\xi$'s, with $\alpha_n \in (\frac{1}{n}, 1)$ and $\alpha_n \to 0$.

2. For $\tilde{K} > K^*$, we have $\sup_{\tilde{K} > K^*} T_{K^*, \tilde{K}} = O_P(1)$.

Since $Z_{\alpha_n} \to \infty$ as $n \to \infty$ and $\alpha_n \to 0$, it follows that for all $\tilde{K} > K^*$,
$$P(T_{K^*, \tilde{K}} < Z_{\alpha_n}) \to 1.$$

