Lecture 14. HGR Maximal Correlation (After-class)
Notes: The HGR correlation is a correlation metric in statistics. Compared with commonly used correlation measures, it has the advantage of capturing non-linear statistical dependence. The HGR maximal correlation can be computed by the ACE (Alternating Conditional Expectations) algorithm.
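As a concrete illustration, here is a minimal sketch of ACE specialized to finite alphabets, assuming the joint PMF is given as a 2-D array with full-support marginals (the function name `ace_maximal_correlation` and the fixed iteration count are illustrative choices, not from the notes):

```python
import numpy as np

def ace_maximal_correlation(P, n_iters=200):
    """Estimate the HGR maximal correlation of a finite joint PMF P[x, y]
    by alternating conditional expectations (power-iteration style).
    Assumes both marginals of P have full support."""
    Px, Py = P.sum(axis=1), P.sum(axis=0)                   # marginals P_X, P_Y
    f = np.random.default_rng(0).standard_normal(len(Px))   # arbitrary initial f
    for _ in range(n_iters):
        g = (P * f[:, None]).sum(axis=0) / Py   # g(y) <- E[f(X) | Y = y]
        g -= Py @ g                             # center: E[g(Y)] = 0
        g /= np.sqrt(Py @ g**2)                 # scale:  E[g^2(Y)] = 1
        f = (P * g[None, :]).sum(axis=1) / Px   # f(x) <- E[g(Y) | X = x]
        f -= Px @ f                             # center: E[f(X)] = 0
        f /= np.sqrt(Px @ f**2)                 # scale:  E[f^2(X)] = 1
    return (P * np.outer(f, g)).sum(), f, g     # rho ~ E[f(X) g(Y)], features

# e.g. for the binary example worked out later in these notes:
# ace_maximal_correlation(np.array([[1/3, 1/6], [1/6, 1/3]]))[0]  ->  ~1/3
```

Each pass applies the conditional-expectation operator and re-standardizes, so the iteration behaves like power iteration on that operator.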
HGR: Hirschfeld-Gebelein-Rényi
Setup
Given two discrete random variables X, Y, we want to measure the correlation between X and Y. How can we do this?
Example (Pearson Correlation Coefficient):

$$\gamma(X, Y) \triangleq \frac{\mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]}{\sqrt{\operatorname{var}(X)\operatorname{var}(Y)}}$$
However, γ(X, Y) = 0 ⇏ X, Y are independent: zero Pearson correlation does not rule out non-linear dependence.
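For instance (an illustrative numerical check, not part of the notes), take X uniform on {−1, 0, 1} and Y = X²: Y is a deterministic function of X, yet the Pearson coefficient vanishes.

```python
import numpy as np

# X uniform on {-1, 0, 1}, Y = X^2: fully dependent, yet Pearson-uncorrelated,
# since cov(X, Y) = E[X^3] - E[X] E[X^2] = 0 by symmetry.
x = np.array([-1.0, 0.0, 1.0])
y = x ** 2
print(np.corrcoef(x, y)[0, 1])  # 0.0
```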
A good correlation measure ρ(X, Y) should satisfy:
1. symmetry: ρ(X, Y) = ρ(Y, X)
2. 0 ≤ ρ(X, Y) ≤ 1
3. ρ(X, Y) = 0 if and only if X, Y are independent
4. ρ(X, Y) = 1 if Y = ξ(X) or X = η(Y), for some deterministic functions ξ, η
5. for one-to-one functions ξ, η: ρ(ξ(X), η(Y)) = ρ(X, Y)
Example (Pearson Correlation Coefficient):

$$\gamma(X, Y) \triangleq \frac{\mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]}{\sqrt{\operatorname{var}(X)\operatorname{var}(Y)}}, \qquad -1 \le \gamma(X, Y) \le 1$$

|γ(X, Y)| satisfies (1), (2), but not (3), (4), (5).

|γ(X, Y)| = 1 iff X = aY + b or Y = cX + d, i.e., only for linear relationships.
Example (Mutual Information):
$$I(X; Y) \triangleq \sum_{x,y} P_{XY}(x, y) \log \frac{P_{XY}(x, y)}{P_X(x)\, P_Y(y)}$$
I(X; Y) satisfies (1), (3), (5), but not (2), (4).
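A quick check of why (2) fails (an illustrative computation, not from the notes): if X = Y is uniform on a 4-letter alphabet, then I(X; Y) = H(X) = 2 bits > 1. The same example shows why (4) fails: Y = X is a deterministic function of X, yet I(X; Y) ≠ 1 in general.

```python
import numpy as np

# X = Y uniform on {0, 1, 2, 3}: I(X; Y) = H(X) = 2 bits, so 0 <= I <= 1 fails
P = np.eye(4) / 4                          # joint PMF P(x, y)
Px, Py = P.sum(axis=1), P.sum(axis=0)
mask = P > 0                               # only sum over the support
print((P[mask] * np.log2((P / np.outer(Px, Py))[mask])).sum())  # 2.0
```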
Can we find some ρ(X, Y) satisfying (1)-(5)?
Definition: The HGR maximal correlation ρ(X, Y) is defined as

$$\rho(X, Y) \triangleq \max_{\substack{f : \mathcal{X} \to \mathbb{R},\ g : \mathcal{Y} \to \mathbb{R} \\ \mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0 \\ \mathbb{E}[f^2(X)] = \mathbb{E}[g^2(Y)] = 1}} \mathbb{E}[f(X)\, g(Y)]$$

Equivalently, $\rho(X, Y) = \max_{f, g} \gamma(f(X), g(Y))$.
Given data X, Y, we want to find features of X and of Y that are maximally correlated; the optimizing f, g are called the HGR maximal correlation functions.
ρ(X, Y) satisfies (1)-(5).

Check (5): for one-to-one functions ξ, η,

$$\rho(\xi(X), \eta(Y)) = \max_{f, g} \gamma(f(\xi(X)), g(\eta(Y))) = \max_{f', g'} \gamma(f'(X), g'(Y)) = \rho(X, Y),$$

where the middle step substitutes f′ = f ∘ ξ and g′ = g ∘ η; since ξ, η are one-to-one, f′, g′ range over all functions exactly as f, g do.
Check (4): If Y = ξ(X), let K : Y → R be a one-to-one function such that E[K(Y)] = 0 and E[K²(Y)] = 1, and take f(X) = K(ξ(X)), g(Y) = K(Y). Then

$$\mathbb{E}[f(X)\, g(Y)] = \mathbb{E}[K(\xi(X))\, K(Y)] = \mathbb{E}[K^2(Y)] = 1 \ \Rightarrow\ \rho(X, Y) = 1.$$

(The case X = η(Y) is symmetric.)
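Revisiting the earlier X uniform on {−1, 0, 1}, Y = X² example (an illustrative check, not from the notes): Pearson gave γ = 0, but the construction above achieves E[f(X) g(Y)] = 1, so ρ(X, X²) = 1.

```python
import numpy as np

# X uniform on {-1, 0, 1}, Y = xi(X) = X^2, so P_Y = [1/3, 2/3] on {0, 1}
x = np.array([-1.0, 0.0, 1.0])
y = (x ** 2).astype(int)
Py = np.array([1/3, 2/3])
K = np.array([0.0, 1.0])        # any non-constant (hence one-to-one) K on {0, 1}
K = K - Py @ K                  # center: E[K(Y)] = 0
K = K / np.sqrt(Py @ K**2)      # scale:  E[K^2(Y)] = 1
f, g = K[y], K                  # f = K o xi (a function of x), g = K
print((f * g[y]).mean())        # E[f(X) g(Y)] = E[K^2(Y)] = 1.0
```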
Definition: The canonical dependence matrix $\tilde{B} \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{X}|}$ is defined as

$$\tilde{B}(y, x) \triangleq \frac{P_{XY}(x, y) - P_X(x)\, P_Y(y)}{\sqrt{P_X(x)\, P_Y(y)}}, \qquad \forall x \in \mathcal{X},\ y \in \mathcal{Y}.$$
Example: X, Y are binary, with

$$P_{XY}(x, y) = \begin{cases} 1/3 & \text{if } (x, y) = (0, 0) \\ 1/6 & \text{if } (x, y) = (0, 1) \\ 1/6 & \text{if } (x, y) = (1, 0) \\ 1/3 & \text{if } (x, y) = (1, 1) \end{cases}$$

so that $P_X(0) = P_X(1) = P_Y(0) = P_Y(1) = \frac{1}{2}$.
$$\tilde{B}(0, 0) = \tilde{B}(1, 1) = \frac{\frac{1}{3} - \frac{1}{2} \cdot \frac{1}{2}}{\sqrt{\frac{1}{2} \cdot \frac{1}{2}}} = \frac{1}{6}, \qquad \tilde{B}(0, 1) = \tilde{B}(1, 0) = \frac{\frac{1}{6} - \frac{1}{2} \cdot \frac{1}{2}}{\sqrt{\frac{1}{2} \cdot \frac{1}{2}}} = -\frac{1}{6}$$
$$\tilde{B} = \begin{bmatrix} \frac{1}{6} & -\frac{1}{6} \\ -\frac{1}{6} & \frac{1}{6} \end{bmatrix}$$
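These entries can be reproduced mechanically (a sketch, not from the notes; the indexing B[y, x] follows the definition, with the joint stored as P[x, y]):

```python
import numpy as np

P = np.array([[1/3, 1/6],
              [1/6, 1/3]])             # joint PMF, indexed P[x, y]
Px, Py = P.sum(axis=1), P.sum(axis=0)
# B[y, x] = (P(x, y) - Px(x) Py(y)) / sqrt(Px(x) Py(y))
B = (P.T - np.outer(Py, Px)) / np.sqrt(np.outer(Py, Px))
print(B)  # [[ 1/6, -1/6], [-1/6, 1/6]], matching the matrix above
```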
Definition: The information vectors φ, ψ associated to functions f, g are defined as

$$\phi(x) \triangleq \sqrt{P_X(x)}\, f(x), \qquad \psi(y) \triangleq \sqrt{P_Y(y)}\, g(y).$$

Whenever functions f, g are given, we have the corresponding φ, ψ.

Properties (writing $\sqrt{P_X}$ for the vector $(\sqrt{P_X(x)})_{x \in \mathcal{X}}$, and similarly $\sqrt{P_Y}$):

Norm-square:

$$\|\phi\|^2 = \sum_x \phi^2(x) = \sum_x P_X(x)\, f^2(x) = \mathbb{E}[f^2(X)], \qquad \|\psi\|^2 = \mathbb{E}[g^2(Y)]$$

Inner product:

$$\langle \phi, \sqrt{P_X} \rangle = \sum_x \phi(x)\, \sqrt{P_X(x)} = \sum_x P_X(x)\, f(x) = \mathbb{E}[f(X)]$$

$$\langle \psi, \sqrt{P_Y} \rangle = \sum_y \psi(y)\, \sqrt{P_Y(y)} = \sum_y P_Y(y)\, g(y) = \mathbb{E}[g(Y)]$$
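A quick numerical sanity check of these identities (illustrative; the PMF and f here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
Px = np.array([0.2, 0.3, 0.5])     # an arbitrary PMF on X
f = rng.standard_normal(3)         # an arbitrary feature f: X -> R
phi = np.sqrt(Px) * f              # information vector phi(x) = sqrt(Px(x)) f(x)
assert np.isclose(phi @ phi, Px @ f**2)        # ||phi||^2       = E[f^2(X)]
assert np.isclose(phi @ np.sqrt(Px), Px @ f)   # <phi, sqrt(Px)> = E[f(X)]
```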
Theorem: The HGR maximal correlation ρ(X, Y) is the largest singular value of $\tilde{B}$.
Proof: Note that

$$\mathbb{E}[f(X)\, g(Y)] = \sum_{x,y} P_{XY}(x, y)\, f(x)\, g(y) = \sum_{x,y} \left( \frac{P_{XY}(x, y)}{\sqrt{P_X(x) P_Y(y)}} - \frac{P_X(x) P_Y(y)}{\sqrt{P_X(x) P_Y(y)}} \right) \Big( \sqrt{P_X(x)}\, f(x) \Big) \Big( \sqrt{P_Y(y)}\, g(y) \Big),$$

where the subtracted term contributes zero because

$$\sum_{x,y} \frac{P_X(x) P_Y(y)}{\sqrt{P_X(x) P_Y(y)}} \Big( \sqrt{P_X(x)}\, f(x) \Big) \Big( \sqrt{P_Y(y)}\, g(y) \Big) = \sum_{x,y} P_X(x) f(x)\, P_Y(y) g(y) = \mathbb{E}[f(X)]\, \mathbb{E}[g(Y)] = 0.$$

Therefore

$$\mathbb{E}[f(X)\, g(Y)] = \sum_{x,y} \underbrace{\frac{P_{XY}(x, y) - P_X(x) P_Y(y)}{\sqrt{P_X(x) P_Y(y)}}}_{\tilde{B}(y, x)} \underbrace{\Big( \sqrt{P_X(x)}\, f(x) \Big)}_{\phi(x)} \underbrace{\Big( \sqrt{P_Y(y)}\, g(y) \Big)}_{\psi(y)} = \sum_{x,y} \tilde{B}(y, x)\, \phi(x)\, \psi(y) = \psi^T \tilde{B} \phi.$$
$$\Rightarrow\ \rho(X, Y) = \max_{\phi, \psi}\ \psi^T \tilde{B} \phi$$

subject to: $\|\phi\|^2 = \|\psi\|^2 = 1$ ($\leftrightarrow \mathbb{E}[f^2(X)] = \mathbb{E}[g^2(Y)] = 1$) and $\langle \phi, \sqrt{P_X} \rangle = \langle \psi, \sqrt{P_Y} \rangle = 0$ ($\leftrightarrow \mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0$).
Fact 1: $\arg\max_{\|\phi\|^2 = \|\psi\|^2 = 1} \psi^T \tilde{B} \phi$ is attained at the largest right/left singular vectors of $\tilde{B}$.
Fact 2: singular vectors are orthogonal to each other.
Check:

$$[\tilde{B} \cdot \sqrt{P_X}](y) = \sum_x \tilde{B}(y, x)\, \sqrt{P_X(x)} = \sum_x \frac{P_{XY}(x, y) - P_X(x) P_Y(y)}{\sqrt{P_Y(y)}} = \frac{P_Y(y) - P_Y(y)}{\sqrt{P_Y(y)}} = 0$$

⇒ $\sqrt{P_X}$ is a right singular vector of $\tilde{B}$ with singular value 0, since $\tilde{B} \cdot \sqrt{P_X} = 0$. Similarly, $\sqrt{P_Y}$ is a left singular vector of $\tilde{B}$ with singular value 0.
⇒ ρ(X, Y) = $\sigma_1$ (the largest singular value of $\tilde{B}$), with "=" attained iff φ and ψ are the largest right/left singular vectors of $\tilde{B}$. (By Fact 2, these are orthogonal to the null singular vectors $\sqrt{P_X}$, $\sqrt{P_Y}$, so they satisfy the constraints.)

⇒ Let $\phi_1$ and $\psi_1$ be the largest right/left singular vectors of $\tilde{B}$; then the optimal features are

$$f^*(x) = \frac{1}{\sqrt{P_X(x)}}\, \phi_1(x), \qquad g^*(y) = \frac{1}{\sqrt{P_Y(y)}}\, \psi_1(y).$$
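Putting the theorem to work on the binary example above (a sketch, not from the notes; `numpy.linalg.svd` returns singular values in decreasing order, and the recovered f*, g* are unique only up to a common sign):

```python
import numpy as np

P = np.array([[1/3, 1/6],
              [1/6, 1/3]])                   # joint PMF, indexed P[x, y]
Px, Py = P.sum(axis=1), P.sum(axis=0)
B = (P.T - np.outer(Py, Px)) / np.sqrt(np.outer(Py, Px))   # B[y, x]

U, s, Vt = np.linalg.svd(B)
print(s[0])                     # rho(X, Y) = sigma_1 = 1/3
print(B @ np.sqrt(Px))          # ~[0, 0]: sqrt(Px) is a null right singular vector
f_star = Vt[0] / np.sqrt(Px)    # f*(x) = phi_1(x) / sqrt(Px(x))
g_star = U[:, 0] / np.sqrt(Py)  # g*(y) = psi_1(y) / sqrt(Py(y))
print(f_star, g_star)           # here: [1, -1] and [1, -1], up to a common sign
print((P * np.outer(f_star, g_star)).sum())  # E[f*(X) g*(Y)] = 1/3
```

The second print confirms numerically that $\sqrt{P_X}$ lies in the null space of $\tilde{B}$, as checked above.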
Check (3): ρ(X, Y) = 0 iff $\sigma_1 = 0$ iff $\tilde{B} = 0$ iff $P_{XY}(x, y) = P_X(x) P_Y(y)$ for all x, y iff X, Y are independent.