MA3K1 Mathematics of Machine Learning April 10, 2021
Solutions
Solution (1) We assume that there is an input space $\mathcal{X}$, an output space $\mathcal{Y}$, and a
probability distribution on $\mathcal{X} \times \mathcal{Y}$. Given a map $h \colon \mathcal{X} \to \mathcal{Y}$ and a loss function $L$
defined on $\mathcal{Y} \times \mathcal{Y}$, the generalization risk of $h$ is the expected value
\[
R(h) = \mathbb{E}[L(h(X), Y)],
\]
where $(X, Y)$ is distributed on $\mathcal{X} \times \mathcal{Y}$ according to the given distribution. The generalization risk gives an indication of how $h$ “performs”, on average, on unseen data.
If we observe $n$ samples $(x_1, y_1), \ldots, (x_n, y_n)$ drawn from this distribution, we can
construct a classifier $\hat{h}$ by minimizing the empirical risk
\[
\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i)
\]
over a class of functions $\mathcal{H}$. When the class $\mathcal{H}$ is very large (for example, if it consists
of all possible functions from $\mathcal{X}$ to $\mathcal{Y}$), then we can find a function $\hat{h}$ for which the
empirical risk, $\hat{R}(\hat{h})$, is very small or even zero (if we simply “memorise” the observed
data). Such a function $\hat{h}$ can have a large generalization risk: it has been adapted too
closely to the observed data, at the expense of not generalizing well to unobserved data.
This is called overfitting.
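A small numerical illustration of this phenomenon (my own sketch, with synthetic data in which the labels are pure noise; the function and variable names below are my own choices, not from the question): a classifier that simply memorises the training set achieves zero empirical risk, while its generalization risk stays near $1/2$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Training data: X uniform on [0, 1], labels are fair coin flips (pure noise),
# so no classifier can have generalization risk below 1/2.
X_train = rng.random(n)
y_train = rng.integers(0, 2, size=n)

# "Memorising" classifier: look up the label of a training point if seen,
# otherwise predict 0.  It drives the empirical risk to zero.
lookup = dict(zip(X_train.tolist(), y_train.tolist()))
def h_memorise(x):
    return lookup.get(x, 0)

emp_risk = np.mean([h_memorise(x) != y for x, y in zip(X_train, y_train)])

# Fresh data from the same distribution: the memoriser has nothing to exploit.
X_test = rng.random(10_000)
y_test = rng.integers(0, 2, size=10_000)
gen_risk = np.mean([h_memorise(x) != y for x, y in zip(X_test, y_test)])

print(f"empirical risk: {emp_risk:.3f}")   # 0.000
print(f"test-set risk:  {gen_risk:.3f}")   # approximately 0.5
\end{verbatim}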
Solution (2) The Markov inequality states that, given $t > 0$ and a non-negative
random variable $X$, we have
\[
\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}.
\]
Given any $t \in \mathbb{R}$ and $\lambda \geq 0$, the map $x \mapsto e^{\lambda x}$ is non-decreasing and $e^{\lambda t} > 0$, and hence
\[
\mathbb{P}(X \geq t) \leq \mathbb{P}(e^{\lambda X} \geq e^{\lambda t}) \leq \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda t}},
\]
where the last inequality follows by applying Markov's inequality to the non-negative random variable $e^{\lambda X}$.
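As a remark, this is the first step of the Chernoff bounding method: since the bound holds for every $\lambda \geq 0$, we may optimize over $\lambda$ to obtain
\[
\mathbb{P}(X \geq t) \leq \inf_{\lambda \geq 0} e^{-\lambda t}\, \mathbb{E}\big[e^{\lambda X}\big].
\]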
Solution (3) (a) The generalization risk is the expected value
\[
R(h) = \mathbb{E}[\mathbf{1}\{h(X) \neq Y\}] = \mathbb{P}(h(X) \neq Y).
\]
The empirical risk is
\[
\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{h(X_i) \neq Y_i\}.
\]
(b) Assume that we have $n$ random samples $(X_1, Y_1), \ldots, (X_n, Y_n)$ that are independent and identically distributed. For each fixed $h \in \mathcal{H}$, we have the bound
\[
|\hat{R}(h) - R(h)| \leq C(n, \delta),
\]
with probability at least $1 - \delta$, by assumption. Now if $\hat{h}$ is the minimizer of $\hat{R}(h)$, that is,
\[
\hat{h} = \operatorname*{arg\,min}_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{h(X_i) \neq Y_i\},
\]
then $\hat{h}$ is also a random variable: it depends on the random data $(X_i, Y_i)$. Hence, in the
expression $|\hat{R}(\hat{h}) - R(\hat{h})|$, the classifier $\hat{h}$ is not fixed and varies with the data $(X_i, Y_i)$.
The bound for a fixed $h$ therefore does not apply directly, and the deviation may be larger.
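The following simulation (my own construction, using a finite class of threshold classifiers and labels that are independent of the inputs, so that $R(h) = 1/2$ for every $h$) illustrates the point: the deviation $|\hat{R}(\hat{h}) - R(\hat{h})|$ for the empirical minimizer is systematically larger than the deviation for a fixed $h$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, K, trials = 100, 200, 500

# Finite class of threshold classifiers h_t(x) = 1{x > t} on [0, 1].
thresholds = np.linspace(0.0, 1.0, K)

gap_fixed, gap_erm = [], []
for _ in range(trials):
    X = rng.random(n)
    Y = rng.integers(0, 2, size=n)      # labels independent of X
    # Since Y is a fair coin independent of X, R(h) = 1/2 for every h in H.
    preds = (X[None, :] > thresholds[:, None]).astype(int)   # K x n predictions
    emp_risks = (preds != Y[None, :]).mean(axis=1)           # R_hat(h) for each h
    gap_fixed.append(abs(emp_risks[0] - 0.5))                # one fixed h
    gap_erm.append(abs(emp_risks.min() - 0.5))               # h chosen by ERM

print(f"average |R_hat(h) - R(h)| for a fixed h:   {np.mean(gap_fixed):.3f}")
print(f"average |R_hat(h_hat) - R(h_hat)| for ERM: {np.mean(gap_erm):.3f}")
\end{verbatim}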
Solution (4) (a) The regression function for binary classification is
\[
f(x) = \mathbb{E}[Y \mid X = x] = \mathbb{P}(Y = 1 \mid X = x).
\]
The Bayes classifier can be defined as
\[
h^*(x) = \begin{cases} 1 & \text{if } f(x) > 1/2, \\ 0 & \text{otherwise.} \end{cases}
\]
(b) To compute the Bayes classifier, we use Bayes' rule:
\[
\mathbb{P}(Y = 1 \mid X = x) = \frac{\rho_1(x)\,\mathbb{P}(Y = 1)}{\rho(x)} = \frac{p\,\rho_1(x)}{(1 - p)\rho_0(x) + p\,\rho_1(x)}.
\]
We can rearrange:
\[
\frac{p\,\rho_1(x)}{(1 - p)\rho_0(x) + p\,\rho_1(x)} > \frac{1}{2} \iff \frac{\rho_1(x)}{\rho_0(x)} > \frac{1 - p}{p}.
\]
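As a concrete instance of this rule (the densities and prior below are my own illustrative choices, not given in the question): if $\rho_0$ is the density of $N(0, 1)$ and $\rho_1$ that of $N(1, 1)$, the likelihood ratio is $\rho_1(x)/\rho_0(x) = e^{x - 1/2}$, so the Bayes classifier is a threshold rule. The sketch below checks this numerically.
\begin{verbatim}
import numpy as np

# Hypothetical class-conditional densities rho_0 = N(0, 1), rho_1 = N(1, 1),
# and prior p = P(Y = 1).  These numbers are illustrative, not from the question.
p = 0.3

def rho(x, mean):
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

def bayes_classifier(x):
    # h*(x) = 1  iff  rho_1(x) / rho_0(x) > (1 - p) / p
    return (rho(x, 1.0) / rho(x, 0.0) > (1 - p) / p).astype(int)

# For these Gaussians the likelihood ratio is exp(x - 1/2), so the rule is
# equivalent to x > 1/2 + log((1 - p) / p); we can check this on a grid.
xs = np.linspace(-3, 4, 1001)
threshold = 0.5 + np.log((1 - p) / p)
assert np.array_equal(bayes_classifier(xs), (xs > threshold).astype(int))
print(f"Bayes rule: predict 1 iff x > {threshold:.3f}")
\end{verbatim}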
Solution (5) A hypothesis class $\mathcal{H}$ is called PAC-learnable if there exists a classifier
$\hat{h} \in \mathcal{H}$ depending on $n$ random samples $(X_i, Y_i)$, $i \in \{1, \ldots, n\}$, and a polynomial
function $p(x, y, z, w)$, such that for any $\epsilon > 0$ and $\delta \in (0, 1)$, and for all distributions on $\mathcal{X} \times \mathcal{Y}$,
\[
\mathbb{P}\left( R(\hat{h}) \leq \inf_{h \in \mathcal{H}} R(h) + \epsilon \right) \geq 1 - \delta
\]
holds whenever $n \geq p(1/\epsilon, 1/\delta, \mathrm{size}(\mathcal{X}), \mathrm{size}(\mathcal{H}))$.
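For illustration (this is a standard fact and not required here): for the $0$-$1$ loss, any finite class $\mathcal{H}$ is PAC-learnable by empirical risk minimization. A Hoeffding inequality combined with a union bound over $\mathcal{H}$ shows that the condition above holds whenever
\[
n \geq \frac{2}{\epsilon^2}\left( \log|\mathcal{H}| + \log\frac{2}{\delta} \right),
\]
which is in particular polynomial in $1/\epsilon$, $1/\delta$ and $\log|\mathcal{H}|$.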
Solution (6) (a) A function $h \in \mathcal{H}$ that implements XOR would correspond to a line
in $\mathbb{R}^2$ that separates the points $\{(0, 0), (1, 1)\}$ from the points $\{(0, 1), (1, 0)\}$. Any line
$h(x) = w^T x + b$ that has $(0, 0)$ on one side, say $h(0, 0) < 0$, has to have $b < 0$. Assume
that $h$ separates $(0, 0)$ from $(0, 1)$ and $(1, 0)$. Then $h(1, 0) > 0$ implies $w_1 + b > 0$, and
$h(0, 1) > 0$ implies that $w_2 + b > 0$. In particular, if we sum these two expressions, we
get $w_1 + w_2 + 2b > 0$. But since $b < 0$, this implies that $h(1, 1) = w_1 + w_2 + b > 0$.
This shows that any line separating $(0, 0)$ from $(0, 1)$ and $(1, 0)$ also separates $(0, 0)$
from $(1, 1)$, and hence XOR is not possible via linear classifiers.
(b) The problem amounts to finding the ways in which the corners of a square can
be separated by lines. There are $16 = 2^4$ possible binary functions on the four corners:
x1 x2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
The task is to figure out which of these can be implemented by a linear classifier.
Clearly, for every such function $f$ that can be implemented by linear separation, the
inverted function $1 - f$ can also be implemented (if $f(x) = \mathbf{1}\{h(x) > 0\}$, then $1 - f(x) = \mathbf{1}\{-h(x) > 0\}$, after perturbing the line slightly if necessary so that no corner lies exactly on it). Cases 1 and 16 can clearly be implemented (take any line that has
all the points on one side). We also get all the cases in the above table with exactly one
1 or exactly one 0 by a line that has one point on one side and all the others on the other
side (like $x_2 - x_1 + 1/2$, for example). This makes another 8 cases. By considering
conditions of the form
\[
x_i - 1/2 > 0, \qquad x_i - 1/2 < 0, \qquad i \in \{1, 2\},
\]
we get four additional cases, taking the total number of functions that can be implemented using $\mathcal{H}$ to 14. The remaining two cases are XOR and $1 - \mathrm{XOR}$, which cannot be
implemented using $\mathcal{H}$, as seen in Part (a).
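The count of 14 can also be verified by brute force. The following sketch (my own check; the weight grid is an ad-hoc choice that happens to be fine enough to realize every achievable labelling of the four corners) enumerates the label patterns produced by lines on the corners and confirms that only XOR and its complement are missing.
\begin{verbatim}
import itertools
import numpy as np

corners = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])

# Small grid of candidate lines h(x) = w1*x1 + w2*x2 + b.  The grid is an
# ad-hoc choice, but it is fine enough to realise every linearly separable
# labelling of the four corners.
weights = [-1, 0, 1]
biases = [-1.5, -0.5, 0.5, 1.5]

patterns = set()
for w1, w2, b in itertools.product(weights, weights, biases):
    labels = tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in corners)
    patterns.add(labels)

print(f"linearly separable labellings found: {len(patterns)}")   # 14
xor = (0, 1, 1, 0)
print(f"XOR realisable: {xor in patterns}")                               # False
print(f"1 - XOR realisable: {tuple(1 - v for v in xor) in patterns}")     # False
\end{verbatim}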
Solution (7) Let $\mathcal{H}$ be a set of classifiers $h \colon \mathcal{X} \to \mathcal{Y}$. The growth function $\Pi_{\mathcal{H}}$ is
defined as
\[
\Pi_{\mathcal{H}}(n) = \max_{x \in \mathcal{X}^n} |\{h(x) : h \in \mathcal{H}\}|,
\]
where $h(x) = (h(x_1), \ldots, h(x_n))$ if $x = (x_1, \ldots, x_n)$. The VC dimension of $\mathcal{H}$ is
\[
\mathrm{VC}(\mathcal{H}) = \max\{n \in \mathbb{N} : \Pi_{\mathcal{H}}(n) = 2^n\}.
\]
We can take the set $x = (e_1, \ldots, e_d)$, where $e_i$ is the $i$-th unit vector. Then
\[
|\{h(x) : h \in \mathcal{H}\}| = 2^d,
\]
since for any $0$-$1$ vector $\delta$ of length $d$, taking the function $h_I$, where $I$ consists of
those indices $i$ for which $\delta_i = 1$, gives the sign pattern $h_I(x) = \delta$. This shows that the
VC dimension is at least $d$.
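The growth function can be computed directly for small finite classes. The sketch below uses a toy class of threshold classifiers on the real line (a hypothetical example, not the class $\mathcal{H}$ of the question) to illustrate the definition.
\begin{verbatim}
import numpy as np

def num_labelings(classifiers, sample):
    """Size of {(h(x_1), ..., h(x_n)) : h in H} for a finite class H."""
    return len({tuple(h(x) for x in sample) for h in classifiers})

# Toy class: threshold classifiers h_t(x) = 1{x > t} on the real line.
thresholds = np.linspace(-2, 2, 401)
H = [lambda x, t=t: int(x > t) for t in thresholds]

# On n distinct points, threshold classifiers realise at most n + 1 labellings,
# so they shatter a single point (2 labellings) but never two points (3 < 4).
print(num_labelings(H, [0.0]))        # 2  -> one point is shattered
print(num_labelings(H, [-1.0, 1.0]))  # 3  -> two points are not shattered
\end{verbatim}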
Solution (8) Set $K = |\mathcal{H}|$. For $z = (z_1, \ldots, z_n)$, with $z_i = (x_i, y_i)$, we can define
the empirical Rademacher complexity as
\[
\hat{R}_z(L \circ \mathcal{H}) = \mathbb{E}_{\sigma}\left[ \sup_{g \in L \circ \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, g(z_i) \right],
\]
where $\sigma_1, \ldots, \sigma_n$ are independent Rademacher variables (uniform on $\{-1, +1\}$) and $g(z_i) = L(h(x_i), y_i)$, and the Rademacher complexity as
\[
R_n(L \circ \mathcal{H}) = \mathbb{E}_{Z}\big[ \hat{R}_Z(L \circ \mathcal{H}) \big].
\]
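The empirical Rademacher complexity can be estimated by Monte Carlo directly from this definition. The sketch below uses a small, hypothetical finite class of threshold classifiers with the $0$-$1$ loss and synthetic data; the sign vectors $\sigma$ are drawn uniformly from $\{-1, +1\}^n$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Synthetic sample z_i = (x_i, y_i) and a small finite class of threshold
# classifiers; g(z_i) = L(h(x_i), y_i) with the 0-1 loss.
x = rng.random(n)
y = rng.integers(0, 2, size=n)
thresholds = np.array([0.2, 0.4, 0.6, 0.8])
losses = ((x[None, :] > thresholds[:, None]).astype(int)
          != y[None, :]).astype(float)               # K x n matrix of g(z_i)

def empirical_rademacher(losses, num_draws=10_000):
    # Average over random sign vectors sigma of sup_g (1/n) sum_i sigma_i g(z_i).
    k, n = losses.shape
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))
    correlations = sigma @ losses.T / n               # num_draws x K
    return correlations.max(axis=1).mean()            # E_sigma of the supremum

print(f"estimated empirical Rademacher complexity: {empirical_rademacher(losses):.3f}")
\end{verbatim}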