Machine Learning (Fall 2020)
Assignment 1
Answer no 1: Linear regression
We have learned about linear regression in the class. Here we explore what we mean by "linear".
Specifically, we will see how linear regression can be used to fit non-linear functions of the data using
feature maps, and explore some of its limitations.
Learning degree-3 polynomials of the input
The least-squares objective is

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{N} \left(\theta^{T} x^{(i)} - y^{(i)}\right)^{2}$$
Differentiating this objective, we get the gradient

$$\nabla_{\theta} J(\theta) = \sum_{i=1}^{N} \left(\theta^{T} x^{(i)} - y^{(i)}\right) x^{(i)},$$

so each gradient step changes $\theta$ by an amount proportional to $\gamma \sum_{i=1}^{N} \left(\theta^{T} x^{(i)} - y^{(i)}\right) x^{(i)}$, where $\gamma$ is the learning rate. The update rule is

$$\theta := \theta - \gamma \sum_{i=1}^{N} \left(\theta^{T} x^{(i)} - y^{(i)}\right) x^{(i)}.$$
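As a concrete illustration, here is a minimal NumPy sketch of this batch gradient-descent update; the array names `X`, `y` and the step size `gamma` are illustrative assumptions, not the assignment's starter code:

```python
import numpy as np

def gradient_descent(X, y, gamma=1e-3, num_iters=10000):
    """Batch gradient descent for the least-squares objective J(theta).

    X: (N, d) design matrix whose rows are the feature vectors x^(i).
    y: (N,) vector of targets.
    gamma: learning rate.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        # Gradient of J: sum_i (theta^T x^(i) - y^(i)) x^(i)
        grad = X.T @ (X @ theta - y)
        theta -= gamma * grad
    return theta
```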
Answer no 2: Coding question: degree-3 polynomial regression
Code
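Since the code itself is not reproduced here, the following is a minimal sketch of one way to solve this part, assuming the feature map $\phi(x) = (1, x, x^2, x^3)$ and a normal-equations solve; the function and variable names are hypothetical, not the assignment's starter code:

```python
import numpy as np

def degree3_features(x):
    """Map scalar inputs x of shape (N,) to rows (1, x, x^2, x^3)."""
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

def fit_normal_equations(x, y):
    """Minimize 1/2 ||Phi theta - y||^2 via the normal equations."""
    Phi = degree3_features(x)
    # theta = (Phi^T Phi)^{-1} Phi^T y, using a linear solver for stability
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```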
Output graph
Answer no 3: Coding question: degree-3 polynomial GD and SGD
Output
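For reference, a hedged sketch of the stochastic variant: where batch GD sums the gradient over all N examples before each step, SGD updates on one randomly chosen example at a time (names are illustrative):

```python
import numpy as np

def sgd(X, y, gamma=1e-2, num_epochs=100, seed=0):
    """Stochastic gradient descent: one single-example update at a time."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in rng.permutation(len(y)):
            # Single-example gradient: (theta^T x^(i) - y^(i)) x^(i)
            theta -= gamma * (X[i] @ theta - y[i]) * X[i]
    return theta
```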
Answer no 4: Coding question: degree-k polynomial regression
Output
Generally, higher-degree polynomials tend to fit the data better, though very high-degree polynomials can be numerically unstable.
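To make the instability concrete, a small sketch (hypothetical, not the assignment's code): with the degree-k feature map the columns $1, x, \dots, x^k$ span very different scales, so the Gram matrix $\Phi^T \Phi$ used by the normal equations becomes badly ill-conditioned as k grows:

```python
import numpy as np

def degree_k_features(x, k):
    """Map scalar inputs x of shape (N,) to rows (1, x, ..., x^k)."""
    return np.stack([x**j for j in range(k + 1)], axis=1)

x = np.linspace(-1, 1, 50)
for k in (3, 10, 20):
    Phi = degree_k_features(x, k)
    # Condition number of the Gram matrix grows rapidly with k
    print(k, np.linalg.cond(Phi.T @ Phi))
```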
Answer no 5: Coding question: other feature maps
Output
In the presence of the sin(x) feature, the models seem to fit the data more robustly. However, the numerical instability that comes with high-degree polynomials remains.
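A sketch of the augmented feature map used here, assuming sin(x) is simply appended to the polynomial features (illustrative names):

```python
import numpy as np

def degree_k_sin_features(x, k):
    """Map scalar inputs x of shape (N,) to rows (1, x, ..., x^k, sin(x))."""
    cols = [x**j for j in range(k + 1)] + [np.sin(x)]
    return np.stack(cols, axis=1)
```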
Answer no 6: Overfitting with expressive models and small data
We see that when the dataset is small, higher-degree polynomials tend to pass through all the points but qualitatively seem like a poor fit. Numerical instability with high-degree polynomials remains a problem even with small data, with or without sin(x).
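The interpolation effect can be reproduced in a few lines; this is a hypothetical demonstration on made-up data, not the assignment's dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=5))          # 5 training points
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=5)

# Degree-4 polynomial: 5 parameters for 5 points, so the fit
# interpolates the data exactly (zero training error)...
Phi = np.stack([x**j for j in range(5)], axis=1)
theta = np.linalg.solve(Phi, y)
print(np.allclose(Phi @ theta, y))               # True

# ...yet it can oscillate wildly between the training points,
# which is the qualitative overfitting observed above.
```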
2. Incomplete, Positive-Only Labels
2.1 Coding problem: ideal (fully observed) case
Output
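A minimal sketch of the ideal case, assuming a plain logistic regression trained directly on the true labels $t^{(i)}$ (the assignment's starter code may differ):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_regression(X, labels, lr=0.1, num_iters=5000):
    """Fit p(label=1|x) = sigmoid(theta^T x) by gradient ascent
    on the average log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = X.T @ (labels - sigmoid(X @ theta)) / len(labels)
        theta += lr * grad
    return theta

# Ideal (fully observed) case: train on the true labels t
# theta_ideal = logistic_regression(X_train, t_train)
```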
2.2 Coding problem: The naive method on partial labels
Output
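The naive method reuses the same training procedure, but on the observed positive-only labels $y^{(i)}$ instead of $t^{(i)}$. A sketch, reusing the hypothetical `logistic_regression` from the previous block and assuming arrays `X_train`, `y_train` exist:

```python
# Naive method: identical training procedure, but on the partial labels
# y (where y=1 implies t=1, yet many true positives have y=0).
theta_naive = logistic_regression(X_train, y_train)

# h(x) now estimates p(y=1|x) = alpha * p(t=1|x), so it systematically
# underestimates p(t=1|x), which is the bias visible in the output.
```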
3. Warm-up with Bayes rule
4. CP8318 Only Question
We have

$$p(y^{(i)}=1 \mid x^{(i)}) = p(y^{(i)}=1 \wedge t^{(i)}=1 \mid x^{(i)}) \quad \text{(from the previous sub-question)}$$
$$= p(y^{(i)}=1 \mid t^{(i)}=1, x^{(i)})\, p(t^{(i)}=1 \mid x^{(i)})$$
$$= p(t^{(i)}=1 \mid x^{(i)})\, \alpha.$$

Dividing both sides by $\alpha$, we have the desired result.
5. CP8318 Only Question: Estimating α
The first solution below is easier to understand, but it relies strongly on the assumption made here. Solution 2 is less intuitive but works in more general settings than this sub-question's.
Solution 1: We fix $i$ and let $A = \{x : p(t^{(i)}=1 \mid x^{(i)}=x) = 1\}$ be the set of all positive examples and $B = \{x : p(t^{(i)}=0 \mid x^{(i)}=x) = 1\}$ be the set of all negative examples. By the assumption $p(t^{(i)}=1 \mid x^{(i)}) \in \{0,1\}$, we know that $A$ and $B$ cover the whole input space. We first claim that $p(x^{(i)} \notin A \mid y^{(i)}=1) = 0$. This is because when $x^{(i)} \notin A$, we have $x^{(i)} \in B$, which implies that $p(t^{(i)}=0 \mid x^{(i)}) = 1$, which implies that $p(t^{(i)}=1 \mid x^{(i)}) = 0$. Therefore the event $x^{(i)} \notin A$ implies the event $t^{(i)}=0$, which implies that $y^{(i)}=0$. Thus $p(x^{(i)} \in A \mid y^{(i)}=1) = 1$.

By the second sub-question, for $x \in A$ we have that $h(x) = \alpha\, p(t^{(i)}=1 \mid x^{(i)}=x) = \alpha$; that is, $p(h(x^{(i)})=\alpha \mid x^{(i)}=x) = 1$. This implies that $p(h(x^{(i)})=\alpha \mid x^{(i)}=x, y^{(i)}=1) = 1$. Note that for any event $E$ and any event $F$ with positive probability, $p(E)=1$ implies that $p(E \mid F)=1$. (This is because $p(E^{c})=0$, which implies $p(E^{c} \mid F) = p(E^{c} \cap F)/p(F) = 0$.) Then

$$p(h(x^{(i)})=\alpha \mid y^{(i)}=1) = p(h(x^{(i)})=\alpha \mid y^{(i)}=1, x^{(i)} \in A)\, p(x^{(i)} \in A \mid y^{(i)}=1) \quad (4)$$

(the complementary term vanishes because $p(x^{(i)} \notin A \mid y^{(i)}=1)=0$)

$$= 1 \cdot p(x^{(i)} \in A \mid y^{(i)}=1) \quad \text{(by } p(h(x^{(i)})=\alpha \mid x^{(i)}=x, y^{(i)}=1)=1)$$
$$= 1 \cdot 1 \quad \text{(because } p(x^{(i)} \in A \mid y^{(i)}=1)=1) \quad (5)$$
$$= 1.$$

Therefore $E[h(x^{(i)}) \mid y^{(i)}=1] = \alpha$.
Solution 2: Let $g(x^{(i)}) = p(t^{(i)}=1 \mid x^{(i)})$. By (d) we have that $h(x^{(i)}) = \alpha\, g(x^{(i)})$. Then we have that

$$E[h(x^{(i)}) \mid y^{(i)}=1] = \frac{E\big[h(x^{(i)})\, 1\{y^{(i)}=1\}\big]}{p(y^{(i)}=1)}$$
$$= \frac{E\big[E[1\{y^{(i)}=1\} \mid x^{(i)}]\, h(x^{(i)})\big]}{E\big[E[1\{y^{(i)}=1\} \mid x^{(i)}]\big]} \quad \text{(by the law of total expectation)}$$
$$= \frac{E\big[\alpha g(x^{(i)}) \cdot \alpha g(x^{(i)})\big]}{E\big[\alpha g(x^{(i)})\big]} \quad \text{(since } E[1\{y^{(i)}=1\} \mid x^{(i)}] = p(y^{(i)}=1 \mid x^{(i)}) = \alpha g(x^{(i)}))$$
$$= \alpha\, \frac{E[g(x^{(i)})^{2}]}{E[g(x^{(i)})]}$$
$$= \alpha \quad \text{(since } g(x^{(i)}) \in \{0,1\} \text{ implies } g(x^{(i)})^{2} = g(x^{(i)})).$$
6. Coding problem.
Using the validation set, estimate the constant α by averaging your classifier’s
predictions over all labeled examples in the validation set:
Output
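A hedged sketch of that estimate, assuming a logistic classifier with parameters `theta` so that $h(x) = \text{sigmoid}(\theta^{T} x)$; the array names are illustrative:

```python
import numpy as np

def estimate_alpha(X_valid, y_valid, theta):
    """Estimate alpha = E[h(x) | y=1] by averaging the classifier's
    predictions over the labeled (y=1) validation examples."""
    probs = 1 / (1 + np.exp(-(X_valid @ theta)))  # h(x)
    return probs[y_valid == 1].mean()

# Corrected predictions for the true label then follow from the
# result of question 4: p(t=1|x) = h(x) / alpha.
```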