MLB Assignment 7 Final
Home Assignment 7
Benjamin Baadsager (NPR151)
1 XGBoost (30 points)
We aim to predict quasar redshift y ∈ R from ten photometric features x ∈ R10 using
an XGBoost regression model. The dataset quasars.csv contains n observations, each row
giving ten input variables and the corresponding redshift.
1.
First, we load the CSV file and take the first ten columns as the feature matrix X and the last column as the target vector y. We then split the data into training (80%) and hold-out test (20%) sets with scikit-learn's train_test_split(X, y, test_size=0.2, random_state=2025), fixing random_state=2025 for reproducibility.
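A minimal sketch of this step, assuming quasars.csv stores the ten photometric features in its first ten columns and the redshift in the last column (exact column names are not relied upon):

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumption: first ten columns are the features, last column is the redshift target.
data = pd.read_csv("quasars.csv")
X = data.iloc[:, :10].to_numpy()
y = data.iloc[:, -1].to_numpy()

# 80% training / 20% hold-out test split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2025
)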
2.
Next, we carve out 10% of the training data to form a validation set (again via train_test_split). Using the remaining 90% for training, we initialize an XGBRegressor with:
• objective='reg:squarederror'
• max_depth = 4
• reg_lambda = 1
• n_estimators = 500
We pass the training and validation splits to the fit method via the eval_set argument so that XGBoost records the RMSE at each boosting round. After fitting, we extract the RMSE curves from model.evals_result() and plot training vs. validation RMSE against the boosting iteration. This is shown in the figure below:
As expected, both curves drop initially, with the validation error leveling off.
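A sketch of this fitting step, assuming the splits from the previous sketch and a reasonably recent xgboost version; the plotting details are illustrative:

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Carve 10% of the training data out as a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=2025
)

model = XGBRegressor(
    objective="reg:squarederror",
    max_depth=4,
    reg_lambda=1,
    n_estimators=500,
    eval_metric="rmse",  # passing eval_metric in the constructor assumes xgboost >= 1.6
)
# Record RMSE on both splits at every boosting round.
model.fit(X_tr, y_tr, eval_set=[(X_tr, y_tr), (X_val, y_val)], verbose=False)

history = model.evals_result()
plt.plot(history["validation_0"]["rmse"], label="training RMSE")
plt.plot(history["validation_1"]["rmse"], label="validation RMSE")
plt.xlabel("boosting iteration")
plt.ylabel("RMSE")
plt.legend()
plt.show()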
Finally, we use the fitted model to predict on the hold-out test set and record the resulting RMSE and R² as the initial test performance.
3.
To improve performance, we perform a grid search over a broad set of hyperparameters, varying colsample_bytree, learning_rate, max_depth, reg_lambda and n_estimators. In particular, we extend the grid to a wider range of candidate values than those used above.
We use GridSearchCV with 3-fold cross-validation on the 80% training data, optimizing RMSE by specifying scoring='neg_root_mean_squared_error' (a code sketch of the search follows the list below). This returns the best combination:
• colsample_bytree: 0.7
• learning_rate: 0.01
• max_depth: 6
• reg_lambda: 2
• n_estimators: 500
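A sketch of the grid search, continuing from the splits above; the candidate values in param_grid are illustrative assumptions (our exact grid is not reproduced here), chosen so that each list contains the selected setting:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Illustrative candidate grid (assumption); only the selected values above come from the text.
param_grid = {
    "colsample_bytree": [0.7, 1.0],
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [4, 6, 8],
    "reg_lambda": [1, 2],
    "n_estimators": [500],
}

search = GridSearchCV(
    estimator=XGBRegressor(objective="reg:squarederror"),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)  # 3-fold cross-validation on the 80% training data only
print(search.best_params_)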
With these optimal hyperparameters in hand, we refit an XGBoost model on the full 80% training set and evaluate on the 20% hold-out test set. As a simple baseline, we also train a 5-NN regressor (KNeighborsRegressor(n_neighbors=5)). The test results show that the tuned XGBoost model slightly outperforms the 5-NN baseline in both RMSE and R².
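A sketch of this final comparison, assuming the search object from the previous sketch:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

# Refit XGBoost on the full 80% training set with the selected hyperparameters.
best_xgb = XGBRegressor(objective="reg:squarederror", **search.best_params_)
best_xgb.fit(X_train, y_train)

# Simple 5-nearest-neighbours baseline.
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate both models on the hold-out test set.
for name, mdl in [("tuned XGBoost", best_xgb), ("5-NN", knn)]:
    pred = mdl.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: RMSE = {rmse:.4f}, R^2 = {r2_score(y_test, pred):.4f}")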
and hence
\[
\mathbb{E}\big[(X - X')^2\big] = 2\,\mathrm{Var}(X),
\]
as required.
2. Let X1, X2, ..., Xn be i.i.d. random variables taking values in [0, 1], and assume n is even. Define $\hat{\nu}_n = \frac{1}{n}\sum_{i=1}^{n/2}(X_{2i} - X_{2i-1})^2$ and ν = Var(X1). We then wish to prove that
\[
\mathbb{P}\left(\nu \ge \hat{\nu}_n + \sqrt{\frac{\ln\frac{1}{\delta}}{n}}\right) \le \delta.
\]
We let
\[
Z_i := (X_{2i} - X_{2i-1})^2, \qquad i = 1, \ldots, \tfrac{n}{2}.
\]
Since each Xj ∈ [0, 1], we have Zi ∈ [0, 1]. Independence of the Xj implies that the Zi are independent as well. Moreover, by part 1, $\mathbb{E}[Z_i] = \mathbb{E}\big[(X_{2i} - X_{2i-1})^2\big] = 2\,\mathrm{Var}(X_1) = 2\nu$.
Next we define
\[
\varepsilon = \sqrt{\frac{\ln\frac{1}{\delta}}{n}}
\]
and observe the equivalences
\[
\nu \ge \hat{\nu}_n + \varepsilon
\;\Longleftrightarrow\;
2\nu - 2\hat{\nu}_n \ge 2\varepsilon
\;\Longleftrightarrow\;
\mathbb{E}\left[\frac{2}{n}\sum_{i=1}^{n/2} Z_i\right] - \frac{2}{n}\sum_{i=1}^{n/2} Z_i \ge 2\varepsilon,
\]
where we used that $\mathbb{E}\big[\tfrac{2}{n}\sum_{i=1}^{n/2} Z_i\big] = 2\nu$ and $\tfrac{2}{n}\sum_{i=1}^{n/2} Z_i = 2\hat{\nu}_n$.
Applying Hoeffding's inequality to the mean of the n/2 independent Zi ∈ [0, 1] (with deviation 2ε) and inserting the expression for ε, we obtain
\[
\mathbb{P}\left(\mathbb{E}\left[\frac{2}{n}\sum_{i=1}^{n/2} Z_i\right] - \frac{2}{n}\sum_{i=1}^{n/2} Z_i \ge 2\varepsilon\right)
\le \exp\left(-4n\left(\sqrt{\frac{\ln\frac{1}{\delta}}{n}}\right)^{2}\right) = \delta^4.
\]
Since δ⁴ ≤ δ for δ ∈ (0, 1], it follows that
\[
\mathbb{P}\left(\nu - \hat{\nu}_n \ge \sqrt{\frac{\ln\frac{1}{\delta}}{n}}\right) \le \delta,
\]
as required.
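As an informal sanity check of this bound (not part of the proof), one can estimate the failure probability by simulation; the Beta(2, 5) distribution below is just an arbitrary choice of a [0, 1]-valued X:

import numpy as np

rng = np.random.default_rng(0)
n, delta, reps = 100, 0.05, 20000
eps = np.sqrt(np.log(1.0 / delta) / n)

# Var of Beta(a, b) is a*b / ((a+b)^2 * (a+b+1)); here a = 2, b = 5 (arbitrary choice).
true_var = (2.0 * 5.0) / ((2.0 + 5.0) ** 2 * (2.0 + 5.0 + 1.0))

failures = 0
for _ in range(reps):
    x = rng.beta(2.0, 5.0, size=n)        # arbitrary [0, 1]-valued sample
    z = (x[1::2] - x[0::2]) ** 2          # Z_i = (X_{2i} - X_{2i-1})^2
    nu_hat = z.sum() / n                  # (1/n) * sum over the n/2 pairs
    failures += true_var >= nu_hat + eps

print(f"empirical failure rate: {failures / reps:.4f} (should be at most {delta})")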
3. Let X1, ..., Xn, n, ν and ν̂n be as before, and let µ = E[X1]. We now wish to prove that
\[
\mathbb{P}\left(\mu \ge \frac{1}{n}\sum_{i=1}^{n} X_i + \sqrt{\frac{2\hat{\nu}_n \ln\frac{2}{\delta}}{n}} + \sqrt{2}\left(\frac{\ln\frac{2}{\delta}}{n}\right)^{3/4} + \frac{\ln\frac{2}{\delta}}{3n}\right) \le \delta.
\]
Let B denote the event $\{\nu < \hat{\nu}_n + \varepsilon\}$ with $\varepsilon = \sqrt{\ln\frac{2}{\delta}/n}$; by part 2 applied with confidence parameter δ/2, $\mathbb{P}(B^c) \le \delta/2$. On B we have
\[
\sqrt{\frac{2\nu \ln\frac{2}{\delta}}{n}}
\le \sqrt{\frac{2(\hat{\nu}_n + \varepsilon)\ln\frac{2}{\delta}}{n}}
= \sqrt{a + b}
\le \sqrt{a} + \sqrt{b}
\]
with
\[
a = \frac{2\hat{\nu}_n \ln\frac{2}{\delta}}{n}, \qquad b = \frac{2\varepsilon \ln\frac{2}{\delta}}{n}.
\]
Since $\varepsilon = \sqrt{\frac{\ln\frac{2}{\delta}}{n}}$ we obtain:
\[
\sqrt{b} = \sqrt{\frac{2\varepsilon \ln\frac{2}{\delta}}{n}} = \sqrt{2}\left(\frac{\ln\frac{2}{\delta}}{n}\right)^{3/4}.
\]
Hence on B,
\[
\sqrt{\frac{2\nu \ln\frac{2}{\delta}}{n}}
\le \sqrt{\frac{2\hat{\nu}_n \ln\frac{2}{\delta}}{n}} + \sqrt{2}\left(\frac{\ln\frac{2}{\delta}}{n}\right)^{3/4}.
\]
Now let
\[
A = \left\{\mu \ge \frac{1}{n}\sum_{i=1}^{n} X_i + \sqrt{\frac{2\hat{\nu}_n \ln\frac{2}{\delta}}{n}} + \sqrt{2}\left(\frac{\ln\frac{2}{\delta}}{n}\right)^{3/4} + \frac{\ln\frac{2}{\delta}}{3n}\right\}.
\]
But by Bernstein's inequality (applied with variance ν and confidence parameter δ/2),
\[
\mathbb{P}\left(\mu \ge \frac{1}{n}\sum_{i=1}^{n} X_i + \sqrt{\frac{2\nu \ln\frac{2}{\delta}}{n}} + \frac{\ln\frac{2}{\delta}}{3n}\right) \le \frac{\delta}{2}.
\]
On B the inequality above shows that A implies the Bernstein event, so $\mathbb{P}(A \cap B) \le \frac{\delta}{2}$, and thus
\[
\mathbb{P}(A) \le \mathbb{P}(A \cap B) + \mathbb{P}(B^c) \le \frac{\delta}{2} + \frac{\delta}{2} = \delta,
\]
which gives us the desired result:
\[
\mathbb{P}\left(\mu \ge \frac{1}{n}\sum_{i=1}^{n} X_i + \sqrt{\frac{2\hat{\nu}_n \ln\frac{2}{\delta}}{n}} + \sqrt{2}\left(\frac{\ln\frac{2}{\delta}}{n}\right)^{3/4} + \frac{\ln\frac{2}{\delta}}{3n}\right) \le \delta.
\]
3 PAC-Bayes-Unexpected-Bernstein (40 points)
1.
Let Z ≤ 1 be a random variable. We then wish to show that for any λ ∈ [0, 1/2]:
\[
\mathbb{E}\left[e^{-\lambda Z - \lambda^2 Z^2}\right] \le e^{-\lambda\mathbb{E}[Z]}.
\]
We start by letting z = −λZ. Then, since 0 ≤ λ ≤ 1/2 and Z ≤ 1, we have
\[
-\lambda Z \ge -\lambda \ge -\tfrac{1}{2}.
\]
For z ≥ −1/2 we have $z - z^2 \le \ln(1+z)$, and hence $e^{z - z^2} \le 1 + z$; with z = −λZ this reads
\[
e^{-\lambda Z - \lambda^2 Z^2} \le 1 - \lambda Z.
\]
Taking expectations,
\[
\mathbb{E}\left[e^{-\lambda Z - \lambda^2 Z^2}\right] \le \mathbb{E}[1 - \lambda Z] = 1 - \lambda\mathbb{E}[Z].
\]
Moreover, since $1 + x \le e^x$ for all x,
\[
1 - \lambda\mathbb{E}[Z] \le e^{-\lambda\mathbb{E}[Z]}.
\]
Combining these,
\[
\mathbb{E}\left[e^{-\lambda Z - \lambda^2 Z^2}\right] \le 1 - \lambda\mathbb{E}[Z] \le e^{-\lambda\mathbb{E}[Z]},
\]
as required. Here the assumptions Z ≤ 1 and λ ≤ 1/2 ensure −λZ ≥ −λ ≥ −1/2, so we may use $z - z^2 \le \ln(1+z)$; they also guarantee 1 − λZ > 0, making ln(1 − λZ) well-defined.
2.
We start by rewriting
\[
e^{\lambda(\mathbb{E}[Z] - Z) - \lambda^2 Z^2} = e^{\lambda\mathbb{E}[Z]}\, e^{-\lambda Z - \lambda^2 Z^2}.
\]
Hence, by part 1,
\[
\mathbb{E}\left[e^{\lambda(\mathbb{E}[Z] - Z) - \lambda^2 Z^2}\right] \le e^{\lambda\mathbb{E}[Z]} \cdot e^{-\lambda\mathbb{E}[Z]} = 1,
\]
as claimed.
3.
First write
\[
S = \lambda\sum_{i=1}^{n}\big(\mathbb{E}[Z_i] - Z_i\big) - \lambda^2\sum_{i=1}^{n} Z_i^2
= \sum_{i=1}^{n}\Big[\lambda\big(\mathbb{E}[Z_i] - Z_i\big) - \lambda^2 Z_i^2\Big].
\]
Then
\[
\exp(S) = \prod_{i=1}^{n}\exp\Big(\lambda\big(\mathbb{E}[Z_i] - Z_i\big) - \lambda^2 Z_i^2\Big).
\]
By independence,
\[
\mathbb{E}\big[\exp(S)\big] = \prod_{i=1}^{n}\mathbb{E}\Big[\exp\Big(\lambda\big(\mathbb{E}[Z_i] - Z_i\big) - \lambda^2 Z_i^2\Big)\Big].
\]
Hence, applying part 2 to each factor,
\[
\mathbb{E}\big[\exp(S)\big] \le \prod_{i=1}^{n} 1 = 1.
\]
Thus we get the desired result:
\[
\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^{n}\big(\mathbb{E}[Z_i] - Z_i\big) - \lambda^2\sum_{i=1}^{n} Z_i^2\right)\right] \le 1.
\]
4.
We let Z1, ..., Zn be independent random variables upper bounded by 1 and fix λ ∈ (0, 1/2]. We wish to show:
\[
\mathbb{P}\left(\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} Z_i\right] \ge \frac{1}{n}\sum_{i=1}^{n} Z_i + \frac{\lambda}{n}\sum_{i=1}^{n} Z_i^2 + \frac{1}{\lambda n}\ln\frac{1}{\delta}\right) \le \delta.
\]
Let A denote the event inside the probability above. Multiplying through by n, on A we have
\[
\sum_{i=1}^{n}\mathbb{E}[Z_i] - \sum_{i=1}^{n} Z_i \ge \lambda\sum_{i=1}^{n} Z_i^2 + \frac{1}{\lambda}\ln\frac{1}{\delta}.
\]
Now let
\[
X = \lambda\sum_{i=1}^{n}\big(\mathbb{E}[Z_i] - Z_i\big) - \lambda^2\sum_{i=1}^{n} Z_i^2,
\]
so that on A (multiplying the previous inequality by λ > 0) we have X ≥ ln(1/δ). By part 3 we have E[exp(X)] ≤ 1, so by Markov's inequality
\[
\mathbb{P}(A) = \mathbb{P}\left(e^{X} \ge \tfrac{1}{\delta}\right) \le \delta\,\mathbb{E}\big[e^{X}\big] \le \delta,
\]
that is,
\[
\mathbb{P}\left(\sum_{i=1}^{n}\mathbb{E}[Z_i] \ge \sum_{i=1}^{n} Z_i + \lambda\sum_{i=1}^{n} Z_i^2 + \frac{1}{\lambda}\ln\frac{1}{\delta}\right) \le \delta,
\]
which is the claimed statement after dividing through by n inside the event.
5.
Let Λ = {λ1, ..., λk} be a grid of k values, such that λi ∈ (0, 1/2] for all i. Prove that:
\[
\mathbb{P}\left(\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} Z_i\right] \ge \frac{1}{n}\sum_{i=1}^{n} Z_i + \min_{\lambda\in\Lambda}\left(\lambda\cdot\frac{1}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln(k/\delta)}{\lambda n}\right)\right) \le \delta.
\]
Applying part 4 with confidence parameter δ/k to each λi, and taking the union bound over the k grid points, the probability that any λi fails its bound is at most
\[
\mathbb{P}\left(\exists\lambda\in\Lambda : \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} Z_i\right] \ge \frac{1}{n}\sum_{i=1}^{n} Z_i + \lambda\cdot\frac{1}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln(k/\delta)}{\lambda n}\right) \le \sum_{i=1}^{k}\frac{\delta}{k} = \delta.
\]
Hence, with probability at least 1 − δ, the bound holds simultaneously for all λ ∈ Λ, and in particular for the λ corresponding to the smallest right-hand side, so
\[
\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} Z_i\right] \le \frac{1}{n}\sum_{i=1}^{n} Z_i + \min_{\lambda\in\Lambda}\left(\lambda\cdot\frac{1}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln(k/\delta)}{\lambda n}\right),
\]
and hence we end up with the desired result:
\[
\mathbb{P}\left(\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} Z_i\right] \ge \frac{1}{n}\sum_{i=1}^{n} Z_i + \min_{\lambda\in\Lambda}\left(\lambda\cdot\frac{1}{n}\sum_{i=1}^{n} Z_i^2 + \frac{\ln(k/\delta)}{\lambda n}\right)\right) \le \delta.
\]
6.
We now wish to compare two high-probability upper bounds on p − p̂n when sampling n = 100 i.i.d. draws from the three-point distribution
\[
\mathbb{P}(Z = 0) = \mathbb{P}(Z = 1) = \frac{1 - p_{1/2}}{2}, \qquad \mathbb{P}\big(Z = \tfrac{1}{2}\big) = p_{1/2},
\]
so that the true mean is p = 1/2 for all $p_{1/2} \in [0, 1]$. For each $p_{1/2}$ on a fine grid, we simulate Z1, ..., Z100 and compute
\[
\hat{p}_n = \frac{1}{n}\sum_{i=1}^{n} Z_i, \qquad \hat{v}_n = \frac{1}{n}\sum_{i=1}^{n} Z_i^2,
\]
and plot:
• kl bound:
\[
p - \hat{p}_n \le \mathrm{kl}^{-1^{+}}\!\left(\hat{p}_n, \frac{\ln\frac{1}{\delta}}{n}\right) - \hat{p}_n
\]
• Unexpected-Bernstein bound (from part 5, with a grid Λ of k values in (0, 1/2]):
\[
p - \hat{p}_n \le \min_{\lambda\in\Lambda}\left(\lambda\hat{v}_n + \frac{\ln(k/\delta)}{\lambda n}\right)
\]
The figure below shows the two bounds as a function of $p_{1/2}$ with a fixed confidence parameter δ = 0.05.
The kl bound is tighter at $p_{1/2} = 0$ (the Bernoulli case), but as $p_{1/2}$ grows and the variance shrinks, the Unexpected-Bernstein bound becomes tighter than the kl bound.
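A sketch of this simulation; the grid Λ of seven powers of two, the bisection used to invert the kl bound, and the plotting details are implementation assumptions rather than a record of our exact script:

import numpy as np
import matplotlib.pyplot as plt

def kl_bin(p, q):
    # Binary KL divergence kl(p || q), clipped away from 0 and 1.
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_upper_inverse(p_hat, budget):
    # Largest q >= p_hat with kl(p_hat || q) <= budget, found by bisection.
    lo, hi = p_hat, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if kl_bin(p_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

rng = np.random.default_rng(2025)
n, delta = 100, 0.05
lambdas = np.array([2.0 ** (-i) for i in range(1, 8)])  # assumed grid in (0, 1/2]
k = len(lambdas)

p_half_grid = np.linspace(0.0, 1.0, 51)
kl_bounds, ub_bounds = [], []
for p_half in p_half_grid:
    # One sample of n draws from the three-point distribution for this p_1/2.
    z = rng.choice([0.0, 0.5, 1.0], size=n,
                   p=[(1 - p_half) / 2, p_half, (1 - p_half) / 2])
    p_hat = z.mean()
    v_hat = np.mean(z ** 2)
    kl_bounds.append(kl_upper_inverse(p_hat, np.log(1 / delta) / n) - p_hat)
    ub_bounds.append(np.min(lambdas * v_hat + np.log(k / delta) / (lambdas * n)))

plt.plot(p_half_grid, kl_bounds, label="kl bound")
plt.plot(p_half_grid, ub_bounds, label="Unexpected-Bernstein bound")
plt.xlabel("p_1/2")
plt.ylabel("bound on p - p_hat")
plt.legend()
plt.show()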
7.
For a fixed hypothesis h we let Zi := ℓ(h(Xi), Yi), so that L(h) = E[Zi],
\[
\hat{L}(h, S) = \frac{1}{n}\sum_{i=1}^{n} Z_i
\qquad\text{and}\qquad
\hat{V}(h, S) = \frac{1}{n}\sum_{i=1}^{n} Z_i^2.
\]
Hence
\[
n\Big(\lambda\big(L(h) - \hat{L}(h, S)\big) - \lambda^2 \hat{V}(h, S)\Big)
= n\left(\lambda\,\frac{1}{n}\sum_{i=1}^{n}\big(\mathbb{E}[Z_i] - Z_i\big) - \lambda^2\,\frac{1}{n}\sum_{i=1}^{n} Z_i^2\right)
= \lambda\sum_{i=1}^{n}\big(\mathbb{E}[Z_i] - Z_i\big) - \lambda^2\sum_{i=1}^{n} Z_i^2,
\]
and hence
\[
\exp\Big(n\Big(\lambda\big(L(h) - \hat{L}(h, S)\big) - \lambda^2 \hat{V}(h, S)\Big)\Big)
= \exp\left(\lambda\sum_{i=1}^{n}\big(\mathbb{E}[Z_i] - Z_i\big) - \lambda^2\sum_{i=1}^{n} Z_i^2\right).
\]
Next, using that the Zi are independent and bounded above by 1, part 3 gives, for any λ ∈ [0, 1/2],
\[
\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^{n}\big(\mathbb{E}[Z_i] - Z_i\big) - \lambda^2\sum_{i=1}^{n} Z_i^2\right)\right] \le 1,
\]
as required.
8.
Define $f(h, S) := n\big(\lambda(L(h) - \hat{L}(h, S)) - \lambda^2 \hat{V}(h, S)\big)$. By part 7, $\mathbb{E}_S[\exp(f(h, S))] \le 1$ for every fixed h, and hence also
\[
\mathbb{E}_{h\sim\pi}\big[\mathbb{E}_S[\exp(f(h, S))]\big] \le 1
\]
for any prior π independent of S.
Now apply the PAC-Bayes lemma (Lemma 3.28): for any prior π independent of S,
\[
\mathbb{P}\left(\exists\rho : \mathbb{E}_{h\sim\rho}[f(h, S)] \ge \mathrm{KL}(\rho\|\pi) + \ln\left(\frac{\mathbb{E}_{h\sim\pi}\big[\mathbb{E}_S\big[e^{f(h,S)}\big]\big]}{\delta}\right)\right) \le \delta.
\]
Since we have already shown that $\mathbb{E}_{h\sim\pi}\big[\mathbb{E}_S[\exp(f(h, S))]\big] \le 1$, we get the following:
\[
\mathbb{P}\left(\exists\rho : \mathbb{E}_{\rho}[f(h, S)] \ge \mathrm{KL}(\rho\|\pi) + \ln\frac{1}{\delta}\right) \le \delta.
\]
Substituting back f gives:
\[
\mathbb{P}\left(\exists\rho : \mathbb{E}_{\rho}\Big[n\big(\lambda(L(h) - \hat{L}(h, S)) - \lambda^2 \hat{V}(h, S)\big)\Big] \ge \mathrm{KL}(\rho\|\pi) + \ln\frac{1}{\delta}\right) \le \delta.
\]
9.
Let Λ = {λ1, ..., λk} be a grid of values in (0, 1/2] and denote the true loss L(h) = E[ℓ(h(X), Y)]. Applying part 8 with each λi ∈ Λ at confidence level δ/k and rearranging (dividing by nλi), we get that with probability at least 1 − δ/k every posterior ρ satisfies
\[
\mathbb{E}_{\rho}[L(h)] \le \mathbb{E}_{\rho}[\hat{L}(h, S)] + \lambda_i\,\mathbb{E}_{\rho}[\hat{V}(h, S)] + \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{k}{\delta}}{n\lambda_i}.
\]
By a union bound over the k grid values, with probability at least 1 − δ this holds simultaneously for all i.
Taking the minimum over i then yields, with probability at least 1 − δ, for every ρ:
\[
\mathbb{E}_{\rho}[L(h)] \le \mathbb{E}_{\rho}[\hat{L}(h, S)] + \min_{\lambda\in\Lambda}\left(\lambda\,\mathbb{E}_{\rho}[\hat{V}(h, S)] + \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{k}{\delta}}{n\lambda}\right),
\]
which can be written as
\[
\mathbb{P}\left(\exists\rho : \mathbb{E}_{\rho}[L(h)] \ge \mathbb{E}_{\rho}[\hat{L}(h, S)] + \min_{\lambda\in\Lambda}\left(\lambda\,\mathbb{E}_{\rho}[\hat{V}(h, S)] + \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{k}{\delta}}{n\lambda}\right)\right) \le \delta.
\]
This is exactly the desired statement.