
Machine Learning B (2025)

Home Assignment 7
Benjamin Baadsager (NPR151)

Contents
1 XGBoost (30 points)

2 A simple version of Empirical Bernstein’s inequality (30 points)

3 PAC-Bayes-Unexpected-Bernstein (40 points)

1 XGBoost (30 points)
We aim to predict quasar redshift y ∈ R from ten photometric features x ∈ R10 using
an XGBoost regression model. The dataset quasars.csv contains n observations, each row
giving ten input variables and the corresponding redshift.

1.
First, we load the CSV file and separate the first ten columns as our feature matrix X
and the last column as our target vector y. We then split the data into training (80%)
and hold-out test (20%) sets with scikit-learn's train_test_split(X, y, test_size=0.2,
random_state=2025), where random_state=2025 ensures reproducibility.

2.
Next, we carve out 10% of the training data to form a validation set (again via train_test_split).
Using the remaining 90% for training, we initialize an XGBRegressor with:

• objective='reg:squarederror'

• colsample_bytree = 0.5

• learning_rate = 0.1

• max_depth = 4

• reg_lambda = 1

• n_estimators = 500

We pass our training and validation splits into the fit method via the eval_set argument
so XGBoost can record RMSE at each boosting round. After fitting, we extract the
RMSE curves from model.evals_result() and plot training vs. validation RMSE against
boosting iterations. This is shown in the figure below.

As expected, both curves drop initially, with the validation error leveling off.
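
A sketch of this training step with the hyperparameters listed above (continuing from the snippet above; depending on the xgboost version, eval_metric may need to be passed to fit rather than to the constructor):

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Carve a 10% validation set out of the 80% training split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=2025
)

model = XGBRegressor(
    objective="reg:squarederror",
    colsample_bytree=0.5,
    learning_rate=0.1,
    max_depth=4,
    reg_lambda=1,
    n_estimators=500,
    eval_metric="rmse",
)

# eval_set makes XGBoost record train/validation RMSE at every boosting round.
model.fit(X_tr, y_tr, eval_set=[(X_tr, y_tr), (X_val, y_val)], verbose=False)

history = model.evals_result()
plt.plot(history["validation_0"]["rmse"], label="training RMSE")
plt.plot(history["validation_1"]["rmse"], label="validation RMSE")
plt.xlabel("boosting round")
plt.ylabel("RMSE")
plt.legend()
plt.show()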

Finally, we use the fitted model to predict on the hold-out test set. The initial performance
is:

• Test RMSE: 0.4994

• Test R²: 0.3778

3.
To improve performance, we perform a grid search over a broad set of hyperparameters,
varying colsample_bytree, learning_rate, max_depth, reg_lambda and n_estimators.
Specifically, we search over the grid

• colsample_bytree: [0.3, 0.5, 0.7, 0.9]

• learning_rate: [0.01, 0.1, 0.05]

• max_depth: [3, 4, 5, 6]

• reg_lambda: [0.1, 1, 2]

• n_estimators: [200, 300, 500, 700, 900]

We use GridSearchCV with 3-fold cross-validation on the 80% training data, optimizing
for RMSE by specifying scoring='neg_root_mean_squared_error'. This returns the best
combination:
• colsample_bytree: 0.7
• learning_rate: 0.01
• max_depth: 6
• reg_lambda: 2
• n_estimators: 500
With these optimal hyperparameters in hand, we refit an XGBoost model on the full 80%
training set and evaluate on the 20% hold-out test set. As a simple baseline, we also train
a 5-NN regressor (KNeighborsRegressor(n_neighbors=5)). The test results are:

Model                 Test RMSE   Test R²

Grid-Search XGBoost   0.4754      0.4361
5-NN Baseline         0.4778      0.4306

We see that the tuned XGBoost model slightly outperforms the 5-NN baseline in both
RMSE and R².
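
A sketch of the tuning and baseline comparison, reusing the 80/20 split from the snippets above and the grid listed earlier:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

param_grid = {
    "colsample_bytree": [0.3, 0.5, 0.7, 0.9],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 4, 5, 6],
    "reg_lambda": [0.1, 1, 2],
    "n_estimators": [200, 300, 500, 700, 900],
}

# 3-fold CV on the training split, scored by (negative) RMSE.
search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the full training split by default (refit=True).
best = search.best_estimator_
print("XGBoost  RMSE:", np.sqrt(mean_squared_error(y_test, best.predict(X_test))),
      "R²:", r2_score(y_test, best.predict(X_test)))

# 5-NN baseline for comparison.
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
print("5-NN     RMSE:", np.sqrt(mean_squared_error(y_test, knn.predict(X_test))),
      "R²:", r2_score(y_test, knn.predict(X_test)))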

2 A simple version of Empirical Bernstein’s inequality (30 points)
1. Let X and X′ be two independent identically distributed random variables with mean
µ = E[X] and variance Var(X) = σ². We wish to prove that

\[ \mathbb{E}\big[(X - X')^2\big] = 2\,\mathrm{Var}(X). \]

Observe first that

\[ (X - X')^2 = X^2 + X'^2 - 2XX'. \]

Taking expectations on both sides yields

\[ \mathbb{E}\big[(X - X')^2\big]
   = \mathbb{E}[X^2] + \mathbb{E}[X'^2] - 2\,\mathbb{E}[XX']
   = \mathbb{E}[X^2] + \mathbb{E}[X'^2] - 2\,\mathbb{E}[X]\,\mathbb{E}[X']
   = 2\,\mathbb{E}[X^2] - 2\mu^2, \]

where we used that X′ is an independent copy of X, so that E[XX′] = E[X]E[X′], E[X′] = µ
and E[X′²] = E[X²].

Finally, since Var(X) = E[X²] − µ², we get

\[ 2\,\mathbb{E}[X^2] - 2\mu^2 = 2\big(\mathbb{E}[X^2] - \mu^2\big) = 2\,\mathrm{Var}(X), \]

and hence

\[ \mathbb{E}\big[(X - X')^2\big] = 2\,\mathrm{Var}(X), \]

as required.
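
A quick numerical sanity check of this identity (a sketch; the Beta distribution and sample size are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
# Two independent samples from the same distribution.
x = rng.beta(2.0, 5.0, size=1_000_000)
x_prime = rng.beta(2.0, 5.0, size=1_000_000)

print(np.mean((x - x_prime) ** 2))  # ≈ 2 * Var(X)
print(2 * np.var(x))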

2. Let X1, X2, . . . , Xn be i.i.d. random variables taking values in [0, 1], and assume n is
even. Define

\[ \hat\nu_n = \frac{1}{n}\sum_{i=1}^{n/2} \big( X_{2i} - X_{2i-1} \big)^2 \qquad\text{and}\qquad \nu = \mathrm{Var}(X_1). \]

We then wish to prove that

\[ \mathbb{P}\!\left( \nu \ge \hat\nu_n + \sqrt{\frac{\ln\frac{1}{\delta}}{n}} \right) \le \delta. \]

We let

\[ Z_i := \big( X_{2i} - X_{2i-1} \big)^2, \qquad i = 1, \dots, \tfrac{n}{2}. \]

Since each Xj ∈ [0, 1], we have Zi ∈ [0, 1]. Independence of the Xj implies that the Zi are
independent as well. Moreover, by part 1,

\[ \mathbb{E}[Z_i] = \mathbb{E}\big[ ( X_{2i} - X_{2i-1} )^2 \big] = 2\,\mathrm{Var}(X_1) = 2\nu. \]

Next we observe that

\[ \frac{2}{n}\sum_{i=1}^{n/2} Z_i = 2\hat\nu_n
   \qquad\text{and}\qquad
   \mathbb{E}\!\left[ \frac{2}{n}\sum_{i=1}^{n/2} Z_i \right] = 2\nu. \]

Next we define

\[ \varepsilon = \sqrt{\frac{\ln\frac{1}{\delta}}{n}} \]

and observe the equivalences

\[ \nu \ge \hat\nu_n + \varepsilon
   \;\Longleftrightarrow\; 2\nu - 2\hat\nu_n \ge 2\varepsilon
   \;\Longleftrightarrow\; \mathbb{E}\!\left[ \frac{2}{n}\sum_{i=1}^{n/2} Z_i \right] - \frac{2}{n}\sum_{i=1}^{n/2} Z_i \ge 2\varepsilon. \]

Now apply Hoeffding's inequality to the n/2 independent [0, 1]-valued Zi:

\[ \mathbb{P}\!\left( \mathbb{E}\!\left[ \frac{2}{n}\sum_{i=1}^{n/2} Z_i \right] - \frac{2}{n}\sum_{i=1}^{n/2} Z_i \ge 2\varepsilon \right)
   \le \exp\!\left( -2\,\frac{n}{2}\,(2\varepsilon)^2 \right) = \exp\!\left( -4n\varepsilon^2 \right). \]

Inserting the expression for ε, we obtain

\[ \exp\!\left( -4n \left( \sqrt{\frac{\ln\frac{1}{\delta}}{n}} \right)^{\!2} \right)
   = \exp\!\left( -4\ln\tfrac{1}{\delta} \right) = \delta^4. \]

Since δ ∈ (0, 1), we have δ⁴ ≤ δ, and therefore

\[ \mathbb{P}\!\left( 2\nu - 2\hat\nu_n \ge 2\sqrt{\frac{\ln\frac{1}{\delta}}{n}} \right) \le \delta. \]

Thus

\[ \mathbb{P}\!\left( \nu - \hat\nu_n \ge \sqrt{\frac{\ln\frac{1}{\delta}}{n}} \right) \le \delta, \]

and hence we get

\[ \mathbb{P}\!\left( \nu \ge \hat\nu_n + \sqrt{\frac{\ln\frac{1}{\delta}}{n}} \right) \le \delta, \]

as required.
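
A small simulation illustrating the bound of part 2 (a sketch; the Bernoulli(0.3) distribution, n = 100, δ = 0.05 and the number of repetitions are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
n, delta, trials = 100, 0.05, 20_000

# i.i.d. samples in [0, 1]; the true variance of Bernoulli(0.3) is 0.3 * 0.7.
X = rng.binomial(1, 0.3, size=(trials, n)).astype(float)
nu = 0.3 * 0.7

# Empirical variance estimate from part 2: sum of n/2 squared pair differences, divided by n.
nu_hat = np.sum((X[:, 1::2] - X[:, 0::2]) ** 2, axis=1) / n

eps = np.sqrt(np.log(1 / delta) / n)
print("empirical failure probability:", np.mean(nu >= nu_hat + eps), "  delta:", delta)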

3. Let X1, . . . , Xn, n, ν and ν̂n be as before, and let µ = E[X1]. We now wish to prove that

\[ \mathbb{P}\!\left( \mu \ge \frac{1}{n}\sum_{i=1}^{n} X_i
   + \sqrt{\frac{2\hat\nu_n \ln\frac{2}{\delta}}{n}}
   + \sqrt{2}\left(\frac{\ln\frac{2}{\delta}}{n}\right)^{\!3/4}
   + \frac{\ln\frac{2}{\delta}}{3n} \right) \le \delta. \]

We start by fixing δ ∈ (0, 1) and set

\[ \varepsilon = \sqrt{\frac{\ln\frac{2}{\delta}}{n}}. \]

Next define the event

\[ B = \{ \nu \le \hat\nu_n + \varepsilon \}. \]

By part 2 (applied with confidence parameter δ/2),

\[ \mathbb{P}(\bar B) = \mathbb{P}\!\left( \nu > \hat\nu_n + \sqrt{\frac{\ln\frac{2}{\delta}}{n}} \right) \le \frac{\delta}{2}. \]

On B we have

\[ \sqrt{\frac{2\nu \ln\frac{2}{\delta}}{n}}
   \le \sqrt{\frac{2(\hat\nu_n + \varepsilon)\ln\frac{2}{\delta}}{n}}
   = \sqrt{a + b} \le \sqrt{a} + \sqrt{b}, \]

with

\[ a = \frac{2\hat\nu_n \ln\frac{2}{\delta}}{n}, \qquad b = \frac{2\varepsilon \ln\frac{2}{\delta}}{n}. \]

Since ε = √(ln(2/δ)/n), we obtain

\[ \sqrt{b} = \sqrt{\frac{2\varepsilon \ln\frac{2}{\delta}}{n}}
   = \sqrt{2}\left(\frac{\ln\frac{2}{\delta}}{n}\right)^{\!3/4}. \]

Hence on B,

\[ \sqrt{\frac{2\nu \ln\frac{2}{\delta}}{n}}
   \le \sqrt{\frac{2\hat\nu_n \ln\frac{2}{\delta}}{n}}
   + \sqrt{2}\left(\frac{\ln\frac{2}{\delta}}{n}\right)^{\!3/4}. \]

Now let

\[ A = \left\{ \mu \ge \frac{1}{n}\sum_{i=1}^{n} X_i
   + \sqrt{\frac{2\hat\nu_n \ln\frac{2}{\delta}}{n}}
   + \sqrt{2}\left(\frac{\ln\frac{2}{\delta}}{n}\right)^{\!3/4}
   + \frac{\ln\frac{2}{\delta}}{3n} \right\}. \]

On B, the bound above shows

\[ A \cap B \subset \left\{ \mu \ge \frac{1}{n}\sum_{i=1}^{n} X_i
   + \sqrt{\frac{2\nu \ln\frac{2}{\delta}}{n}}
   + \frac{\ln\frac{2}{\delta}}{3n} \right\}. \]

But by Bernstein's inequality (applied with variance ν and confidence parameter δ/2),

\[ \mathbb{P}\!\left( \mu \ge \frac{1}{n}\sum_{i=1}^{n} X_i
   + \sqrt{\frac{2\nu \ln\frac{2}{\delta}}{n}}
   + \frac{\ln\frac{2}{\delta}}{3n} \right) \le \frac{\delta}{2}. \]

Hence P(A ∩ B) ≤ δ/2, and thus

\[ \mathbb{P}(A) \le \mathbb{P}(A \cap B) + \mathbb{P}(\bar B) \le \frac{\delta}{2} + \frac{\delta}{2} = \delta, \]

which gives us the desired result:

\[ \mathbb{P}\!\left( \mu \ge \frac{1}{n}\sum_{i=1}^{n} X_i
   + \sqrt{\frac{2\hat\nu_n \ln\frac{2}{\delta}}{n}}
   + \sqrt{2}\left(\frac{\ln\frac{2}{\delta}}{n}\right)^{\!3/4}
   + \frac{\ln\frac{2}{\delta}}{3n} \right) \le \delta. \]

3 PAC-Bayes-Unexpected-Bernstein (40 points)
1.

Let Z ≤ 1 be a random variable. We wish to show that for any λ ∈ [0, 1/2]:

\[ \mathbb{E}\big[ e^{-\lambda Z - \lambda^2 Z^2} \big] \le e^{-\lambda \mathbb{E}[Z]}. \]

We start by letting z = −λZ. Since 0 ≤ λ ≤ 1/2 and Z ≤ 1, we have

\[ -\lambda Z \ge -\lambda \ge -\tfrac{1}{2}, \]

so the hinted inequality

\[ z - z^2 \le \ln(1 + z) \qquad (\forall z \ge -\tfrac{1}{2}) \]

applies. Noting that

\[ z - z^2 = -\lambda Z - \lambda^2 Z^2, \]

we obtain

\[ -\lambda Z - \lambda^2 Z^2 \le \ln(1 - \lambda Z). \]

Exponentiating both sides gives

\[ e^{-\lambda Z - \lambda^2 Z^2} \le 1 - \lambda Z. \]

Taking expectations,

\[ \mathbb{E}\big[ e^{-\lambda Z - \lambda^2 Z^2} \big] \le \mathbb{E}[1 - \lambda Z] = 1 - \lambda \mathbb{E}[Z]. \]

Finally, since 1 + x ≤ e^x for all x, with x = −λE[Z] we get

\[ 1 - \lambda \mathbb{E}[Z] \le e^{-\lambda \mathbb{E}[Z]}. \]

Combining these,

\[ \mathbb{E}\big[ e^{-\lambda Z - \lambda^2 Z^2} \big] \le 1 - \lambda \mathbb{E}[Z] \le e^{-\lambda \mathbb{E}[Z]}, \]

as required.

Here the assumptions Z ≤ 1 and λ ≤ 1/2 ensure −λZ ≥ −λ ≥ −1/2, so we may use z − z² ≤ ln(1 + z).
The assumption λ ∈ [0, 1/2] together with Z ≤ 1 also guarantees 1 − λZ > 0, making ln(1 − λZ) well-defined.

2.

In this exercise we aim to prove that for Z ≤ 1 and λ ∈ [0, 1/2] we have

\[ \mathbb{E}\big[ e^{\lambda(\mathbb{E}[Z] - Z) - \lambda^2 Z^2} \big] \le 1. \]

We start by rewriting

\[ e^{\lambda(\mathbb{E}[Z] - Z) - \lambda^2 Z^2} = e^{\lambda \mathbb{E}[Z]} \, e^{-\lambda Z - \lambda^2 Z^2}. \]

Taking expectations (e^{λE[Z]} is a constant) gives

\[ \mathbb{E}\big[ e^{\lambda(\mathbb{E}[Z] - Z) - \lambda^2 Z^2} \big]
   = e^{\lambda \mathbb{E}[Z]} \, \mathbb{E}\big[ e^{-\lambda Z - \lambda^2 Z^2} \big]. \]

By part 1, for Z ≤ 1 and λ ∈ [0, 1/2] we have

\[ \mathbb{E}\big[ e^{-\lambda Z - \lambda^2 Z^2} \big] \le e^{-\lambda \mathbb{E}[Z]}. \]

Hence

\[ \mathbb{E}\big[ e^{\lambda(\mathbb{E}[Z] - Z) - \lambda^2 Z^2} \big]
   \le e^{\lambda \mathbb{E}[Z]} \cdot e^{-\lambda \mathbb{E}[Z]} = 1, \]

as claimed.

3.

Let Z1, . . . , Zn be independent random variables with Zi ≤ 1 for all i, and fix λ ∈ [0, 1/2]. We then
aim to show

\[ \mathbb{E}\!\left[ \exp\!\left( \lambda \sum_{i=1}^{n} \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 \sum_{i=1}^{n} Z_i^2 \right) \right] \le 1. \]

First write

\[ S = \lambda \sum_{i=1}^{n} \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 \sum_{i=1}^{n} Z_i^2
     = \sum_{i=1}^{n} \Big[ \lambda \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 Z_i^2 \Big]. \]

Then

\[ \exp(S) = \prod_{i=1}^{n} \exp\!\Big( \lambda \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 Z_i^2 \Big). \]

By independence,

\[ \mathbb{E}\big[ \exp(S) \big] = \prod_{i=1}^{n} \mathbb{E}\Big[ \exp\!\Big( \lambda \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 Z_i^2 \Big) \Big]. \]

But by part 2 each factor satisfies

\[ \mathbb{E}\Big[ \exp\!\Big( \lambda \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 Z_i^2 \Big) \Big] \le 1. \]

Hence

\[ \prod_{i=1}^{n} \mathbb{E}\Big[ \exp\!\Big( \lambda \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 Z_i^2 \Big) \Big] \le \prod_{i=1}^{n} 1 = 1. \]

Thus we get the desired result:

\[ \mathbb{E}\!\left[ \exp\!\left( \lambda \sum_{i=1}^{n} \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 \sum_{i=1}^{n} Z_i^2 \right) \right] \le 1. \]

4.

Let Z1, . . . , Zn be independent random variables upper bounded by 1 and fix λ ∈ (0, 1/2].
We wish to show:

\[ \mathbb{P}\!\left( \mathbb{E}\!\left[ \frac{1}{n}\sum_{i=1}^{n} Z_i \right]
   \ge \frac{1}{n}\sum_{i=1}^{n} Z_i + \frac{\lambda}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln\frac{1}{\delta}}{\lambda n} \right) \le \delta. \]

From part 3 we have the following:

\[ \mathbb{E}\!\left[ \exp\!\left( \lambda \sum_{i=1}^{n} \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 \sum_{i=1}^{n} Z_i^2 \right) \right] \le 1. \tag{*} \]

Define the event

\[ A = \left\{ \sum_{i=1}^{n} \mathbb{E}[Z_i] \ge \sum_{i=1}^{n} Z_i + \lambda \sum_{i=1}^{n} Z_i^2 + \frac{1}{\lambda}\ln\frac{1}{\delta} \right\}. \]

On A we then have

\[ \sum_{i=1}^{n} \mathbb{E}[Z_i] - \sum_{i=1}^{n} Z_i \ge \lambda \sum_{i=1}^{n} Z_i^2 + \frac{1}{\lambda}\ln\frac{1}{\delta}. \]

Multiplying by λ > 0 and exponentiating (since e^x is increasing) gives, on A,

\[ \exp\!\left( \lambda \sum_{i=1}^{n} \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 \sum_{i=1}^{n} Z_i^2 \right)
   \ge \exp\!\left( \ln\frac{1}{\delta} \right) = \frac{1}{\delta}. \]

Now let

\[ X = \lambda \sum_{i=1}^{n} \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 \sum_{i=1}^{n} Z_i^2. \]

We then apply Markov's inequality:

\[ \mathbb{P}(A) \le \mathbb{P}\!\left( \exp(X) \ge \frac{1}{\delta} \right) \le \delta\,\mathbb{E}\big[ \exp(X) \big]. \]

By (*) we have E[exp(X)] ≤ 1, so

\[ \mathbb{P}\!\left( \sum_{i=1}^{n} \mathbb{E}[Z_i] \ge \sum_{i=1}^{n} Z_i + \lambda \sum_{i=1}^{n} Z_i^2 + \frac{1}{\lambda}\ln\frac{1}{\delta} \right) \le \delta. \]

Dividing all sums by n yields exactly the stated inequality:

\[ \mathbb{P}\!\left( \mathbb{E}\!\left[ \frac{1}{n}\sum_{i=1}^{n} Z_i \right]
   \ge \frac{1}{n}\sum_{i=1}^{n} Z_i + \frac{\lambda}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln\frac{1}{\delta}}{\lambda n} \right) \le \delta. \]

5.

Let Λ = {λ1, . . . , λk} be a grid of k values such that λi ∈ (0, 1/2] for all i. We wish to prove that

\[ \mathbb{P}\!\left( \mathbb{E}\!\left[ \frac{1}{n}\sum_{i=1}^{n} Z_i \right]
   \ge \frac{1}{n}\sum_{i=1}^{n} Z_i + \min_{\lambda \in \Lambda}\left( \frac{\lambda}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln(k/\delta)}{\lambda n} \right) \right) \le \delta. \]

For each fixed λi ∈ Λ, part 4 gives

\[ \mathbb{P}\!\left( \mathbb{E}\!\left[ \frac{1}{n}\sum_{i=1}^{n} Z_i \right]
   \ge \frac{1}{n}\sum_{i=1}^{n} Z_i + \frac{\lambda_i}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln\frac{1}{\delta'}}{\lambda_i n} \right) \le \delta' \]

for any δ′ > 0. Now choose δ′ = δ/k. Then for each λi ∈ Λ,

\[ \mathbb{P}\!\left( \mathbb{E}\!\left[ \frac{1}{n}\sum_{i=1}^{n} Z_i \right]
   \ge \frac{1}{n}\sum_{i=1}^{n} Z_i + \frac{\lambda_i}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln\frac{k}{\delta}}{\lambda_i n} \right) \le \frac{\delta}{k}. \]

By the union bound over the k grid points, the probability that any λi fails its bound is at most

\[ \mathbb{P}\!\left( \exists \lambda \in \Lambda :\;
   \mathbb{E}\!\left[ \frac{1}{n}\sum_{i=1}^{n} Z_i \right]
   \ge \frac{1}{n}\sum_{i=1}^{n} Z_i + \frac{\lambda}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln(k/\delta)}{\lambda n} \right)
   \le \sum_{i=1}^{k} \frac{\delta}{k} = \delta. \]

Equivalently, with probability at least 1 − δ we have

\[ \mathbb{E}\!\left[ \frac{1}{n}\sum_{i=1}^{n} Z_i \right]
   \le \frac{1}{n}\sum_{i=1}^{n} Z_i + \frac{\lambda}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln(k/\delta)}{\lambda n}
   \qquad \text{for all } \lambda \in \Lambda. \]

Since this holds for all λ ∈ Λ, it also holds for the λ giving the smallest right-hand side, so

\[ \mathbb{E}\!\left[ \frac{1}{n}\sum_{i=1}^{n} Z_i \right]
   \le \frac{1}{n}\sum_{i=1}^{n} Z_i + \min_{\lambda \in \Lambda}\left( \frac{\lambda}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln(k/\delta)}{\lambda n} \right), \]

and hence we end up with the desired result:

\[ \mathbb{P}\!\left( \mathbb{E}\!\left[ \frac{1}{n}\sum_{i=1}^{n} Z_i \right]
   \ge \frac{1}{n}\sum_{i=1}^{n} Z_i + \min_{\lambda \in \Lambda}\left( \frac{\lambda}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln(k/\delta)}{\lambda n} \right) \right) \le \delta. \]

6.

We now wish to compare two high-probability upper bounds on p − p̂n when sampling
n = 100 i.i.d. draws from the three-point distribution

\[ \mathbb{P}(Z = 0) = \mathbb{P}(Z = 1) = \frac{1 - p_{1/2}}{2}, \qquad \mathbb{P}\big(Z = \tfrac{1}{2}\big) = p_{1/2}, \]

so that the true mean is p = 1/2 for all p_{1/2} ∈ [0, 1]. For each p_{1/2} on a fine grid, we simulate
Z1, . . . , Z100 and compute

\[ \hat p_n = \frac{1}{n}\sum_{i=1}^{n} Z_i, \qquad \hat v_n = \frac{1}{n}\sum_{i=1}^{n} Z_i^2, \]

and plot:

• kl bound:

\[ p - \hat p_n \le \mathrm{kl}^{-1,+}\!\left( \hat p_n, \frac{\ln\frac{1}{\delta}}{n} \right) - \hat p_n \]

• Unexpected-Bernstein bound: Choose

\[ k = \log_2\!\left( \sqrt{\frac{n}{\ln(1/\delta)}} \,\Big/\, 2 \right),
   \qquad \Lambda = \left\{ \tfrac{1}{2}, \tfrac{1}{2^2}, \dots, \tfrac{1}{2^k} \right\}. \]

Then with probability at least 1 − δ,

\[ p - \hat p_n \le \min_{\lambda \in \Lambda}\left( \lambda \hat v_n + \frac{\ln(k/\delta)}{\lambda n} \right). \]

The figure below shows the two bounds as a function of p_{1/2} with a fixed confidence
parameter δ = 0.05.

The kl bound is tight at p_{1/2} = 0 (the Bernoulli case), but as p_{1/2} grows and the variance shrinks, the
Unexpected-Bernstein bound performs better than the kl bound.
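
A sketch of the simulation behind this comparison (assuming δ = 0.05 and n = 100 as above; kl_upper_inverse is a helper introduced here that computes the upper inverse of the binary kl divergence by bisection):

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2025)
n, delta = 100, 0.05


def kl(p, q):
    # Binary KL divergence kl(p || q), with clipping for numerical safety.
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))


def kl_upper_inverse(p_hat, eps):
    # Largest q >= p_hat with kl(p_hat || q) <= eps, found by bisection.
    lo, hi = p_hat, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if kl(p_hat, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo


# Grid of lambda values for the Unexpected-Bernstein bound (choice from above).
k = int(np.log2(np.sqrt(n / np.log(1 / delta)) / 2))
Lambda = [2.0 ** -(i + 1) for i in range(k)]

p_half_grid = np.linspace(0.0, 1.0, 51)
kl_bounds, ub_bounds = [], []
for p_half in p_half_grid:
    # Sample n draws from the three-point distribution on {0, 1/2, 1}.
    Z = rng.choice([0.0, 0.5, 1.0], size=n,
                   p=[(1 - p_half) / 2, p_half, (1 - p_half) / 2])
    p_hat, v_hat = Z.mean(), np.mean(Z ** 2)
    kl_bounds.append(kl_upper_inverse(p_hat, np.log(1 / delta) / n) - p_hat)
    ub_bounds.append(min(lam * v_hat + np.log(k / delta) / (lam * n) for lam in Lambda))

plt.plot(p_half_grid, kl_bounds, label="kl bound")
plt.plot(p_half_grid, ub_bounds, label="Unexpected-Bernstein bound")
plt.xlabel("p_1/2")
plt.ylabel("bound on p - p_hat_n")
plt.legend()
plt.show()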

7.

Define Zi = ℓ(h(Xi), Yi), so that each Zi ∈ [0, 1]. Then

\[ L(h) = \mathbb{E}\big[ \ell(h(X), Y) \big] = \mathbb{E}[Z_i], \qquad
   \hat L(h, S) = \frac{1}{n}\sum_{i=1}^{n} Z_i, \qquad
   \hat V(h, S) = \frac{1}{n}\sum_{i=1}^{n} Z_i^2. \]

Our goal is then to show that for any λ ∈ [0, 1/2]:

\[ \mathbb{E}\Big[ \exp\!\Big( n\big( \lambda ( L(h) - \hat L(h, S) ) - \lambda^2 \hat V(h, S) \big) \Big) \Big] \le 1. \]

Start by observing that

\[ L(h) - \hat L(h, S) = \frac{1}{n}\sum_{i=1}^{n} \big( \mathbb{E}[Z_i] - Z_i \big)
   \qquad\text{and}\qquad
   \hat V(h, S) = \frac{1}{n}\sum_{i=1}^{n} Z_i^2. \]

Hence

\[ n\big( \lambda ( L(h) - \hat L(h, S) ) - \lambda^2 \hat V(h, S) \big)
   = n\left( \lambda\,\frac{1}{n}\sum_{i=1}^{n} \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2\,\frac{1}{n}\sum_{i=1}^{n} Z_i^2 \right)
   = \lambda \sum_{i=1}^{n} \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 \sum_{i=1}^{n} Z_i^2. \]

Therefore

\[ \exp\!\Big( n\big( \lambda ( L(h) - \hat L(h, S) ) - \lambda^2 \hat V(h, S) \big) \Big)
   = \exp\!\left( \lambda \sum_{i=1}^{n} \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 \sum_{i=1}^{n} Z_i^2 \right). \]

Next, using the independence of the Zi ≤ 1 together with part 3 gives, for any λ ∈ [0, 1/2],

\[ \mathbb{E}\!\left[ \exp\!\left( \lambda \sum_{i=1}^{n} \big( \mathbb{E}[Z_i] - Z_i \big) - \lambda^2 \sum_{i=1}^{n} Z_i^2 \right) \right] \le 1. \]

Thus we conclude that for any λ ∈ [0, 1/2] we have

\[ \mathbb{E}\Big[ \exp\!\Big( n\big( \lambda ( L(h) - \hat L(h, S) ) - \lambda^2 \hat V(h, S) \big) \Big) \Big] \le 1, \]

as required.

8.

As the hint suggests, we define the function

\[ f(h, S) := n\big( \lambda ( L(h) - \hat L(h, S) ) - \lambda^2 \hat V(h, S) \big), \]

with λ ∈ (0, 1/2] and

\[ \hat V(h, S) = \frac{1}{n}\sum_{i=1}^{n} \ell\big( h(X_i), Y_i \big)^2. \]

From part 7 we know that, for each fixed h,

\[ \mathbb{E}_S\big[ \exp( f(h, S) ) \big] \le 1. \]

Hence also

\[ \mathbb{E}_{h \sim \pi}\big[ \mathbb{E}_S\big[ \exp( f(h, S) ) \big] \big] \le 1 \]

for any prior π independent of S.

Now apply the PAC-Bayes lemma (Lemma 3.28):

\[ \mathbb{P}\!\left( \exists \rho :\; \mathbb{E}_{h \sim \rho}\big[ f(h, S) \big]
   \ge \mathrm{KL}(\rho \,\|\, \pi) + \ln\!\left( \frac{\mathbb{E}_{h \sim \pi}\big[ \mathbb{E}_S\big[ e^{f(h, S)} \big] \big]}{\delta} \right) \right) \le \delta. \]

Since we have already shown that E_{h∼π}[E_S[exp(f(h, S))]] ≤ 1, we get

\[ \mathbb{P}\!\left( \exists \rho :\; \mathbb{E}_\rho\big[ f(h, S) \big]
   \ge \mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{1}{\delta} \right) \le \delta. \]

Substituting back f and rearranging gives

\[ \mathbb{P}\!\left( \exists \rho :\; \mathbb{E}_\rho\Big[ n\big( \lambda ( L(h) - \hat L(h, S) ) - \lambda^2 \hat V(h, S) \big) \Big]
   \ge \mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{1}{\delta} \right) \le \delta. \]

Thus, with probability at least 1 − δ, for all ρ

\[ \mathbb{E}_\rho\Big[ n\big( \lambda ( L(h) - \hat L(h, S) ) - \lambda^2 \hat V(h, S) \big) \Big]
   \le \mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{1}{\delta}, \]

which can be written as

\[ n\lambda\Big( \mathbb{E}_\rho[ L(h) ] - \mathbb{E}_\rho[ \hat L(h, S) ] \Big)
   \le n\lambda^2\, \mathbb{E}_\rho[ \hat V(h, S) ] + \mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{1}{\delta}. \]

Dividing by nλ > 0 yields, with probability at least 1 − δ,

\[ \mathbb{E}_\rho[ L(h) ] \le \mathbb{E}_\rho[ \hat L(h, S) ] + \lambda\, \mathbb{E}_\rho[ \hat V(h, S) ]
   + \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{1}{\delta}}{n\lambda}. \]

Thus we get the desired result:

\[ \mathbb{P}\!\left( \exists \rho :\; \mathbb{E}_\rho[ L(h) ] \ge \mathbb{E}_\rho[ \hat L(h, S) ] + \lambda\, \mathbb{E}_\rho[ \hat V(h, S) ]
   + \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{1}{\delta}}{n\lambda} \right) \le \delta. \]

9.

Let S be an i.i.d. sample, ℓ ≤ 1 a loss function, and π any prior on H. Define

\[ \hat L(h, S) = \frac{1}{n}\sum_{i=1}^{n} \ell\big( h(X_i), Y_i \big), \qquad
   \hat V(h, S) = \frac{1}{n}\sum_{i=1}^{n} \ell\big( h(X_i), Y_i \big)^2, \]

and denote the true loss L(h) = E[ℓ(h(X), Y)].

From part 8 we have: for every λ ∈ (0, 1/2],

\[ \mathbb{P}\!\left( \exists \rho :\; \mathbb{E}_\rho[ L(h) ] \ge \mathbb{E}_\rho[ \hat L(h, S) ]
   + \lambda\, \mathbb{E}_\rho[ \hat V(h, S) ]
   + \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{1}{\delta}}{n\lambda} \right) \le \delta. \]

Now let Λ = {λ1, λ2, . . . , λk} ⊂ (0, 1/2]. For each λi ∈ Λ, replace δ by δ/k in the bound above. Then

\[ \mathbb{P}\!\left( \exists \rho :\; \mathbb{E}_\rho[ L(h) ] \ge \mathbb{E}_\rho[ \hat L(h, S) ]
   + \lambda_i\, \mathbb{E}_\rho[ \hat V(h, S) ]
   + \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{k}{\delta}}{n\lambda_i} \right) \le \frac{\delta}{k}. \]

A union bound over i = 1, . . . , k shows that with probability at least 1 − δ, all k inequalities
hold simultaneously. Hence for every posterior ρ and every λi ∈ Λ we get

\[ \mathbb{E}_\rho[ L(h) ] \le \mathbb{E}_\rho[ \hat L(h, S) ]
   + \lambda_i\, \mathbb{E}_\rho[ \hat V(h, S) ]
   + \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{k}{\delta}}{n\lambda_i}. \]

Taking the minimum over i then yields, for every ρ, with probability at least 1 − δ:

\[ \mathbb{E}_\rho[ L(h) ] \le \mathbb{E}_\rho[ \hat L(h, S) ]
   + \min_{\lambda \in \Lambda}\left\{ \lambda\, \mathbb{E}_\rho[ \hat V(h, S) ]
   + \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{k}{\delta}}{n\lambda} \right\}, \]

which can be written as

\[ \mathbb{P}\!\left( \exists \rho :\; \mathbb{E}_\rho[ L(h) ] \ge \mathbb{E}_\rho[ \hat L(h, S) ]
   + \min_{\lambda \in \Lambda}\left\{ \lambda\, \mathbb{E}_\rho[ \hat V(h, S) ]
   + \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{k}{\delta}}{n\lambda} \right\} \right) \le \delta. \]

This is exactly the desired statement.
