0% found this document useful (0 votes)

296 views96 pages

Deep Learning From Scratch

The document provides an overview of deep learning and machine learning concepts. It includes definitions of key terms like neurons, neural networks, and perceptrons. Code examples are shown for implementing perceptrons using forward propagation. Linear regression and optimization problems in machine learning are also briefly explained. References and recommended books are listed for readers interested in learning more.

Uploaded by

keyblade00789

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

296 views96 pages

Deep Learning From Scratch

Uploaded by

keyblade00789

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Deep Learning for Everyone

[Link]/newsletter
[Link]
References [Link]

[Link] [Link] [Link]

Neural Networks for Machine Learning Machine Learning

Geoff Hinton Andrew Ng

@bgoncalves [Link]
Requirements

[Link]

@bgoncalves [Link]
@bgoncalves [Link]
Machine Learning

@bgoncalves [Link]
What about Neurons?
Biological Neuron

@bgoncalves [Link]
How the Brain “Works” (Cartoon version)

@bgoncalves [Link]
How the Brain “Works” (Cartoon version)
• Each neuron receives input from other neurons
• 1011 neurons, each with with 104 weights
• Weights can be positive or negative
• Weights adapt during the learning process
• “neurons that fire together wire together” (Hebb)
• Different areas perform different functions using same structure (Modularity)

@bgoncalves [Link]
How the Brain “Works” (Cartoon version)

Inputs f(Inputs) Output

@bgoncalves [Link]
@bgoncalves [Link]
Perceptron
Bias
1
w
x1 0j
w
1j
x2 w2j
zj
w 3j wT x
x3
j
N
w

Inputs Weights Output

@bgoncalves [Link]
Perceptron - Forward Propagation
• The output of a perceptron is determined by a sequence of steps:

• obtain the inputs

• multiply the inputs by the respective weights

1
w
x1 0j
w
1j
x2 w2j
zj
T
w 3j w x
x3
j
N
w

xN
@bgoncalves [Link]
Perceptron - Forward Propagation

def forward(Theta, X, active):

N = [Link][0]

# Add the bias column

X_ = [Link](([Link]((N, 1)), X), 1)

# Multiply by the weights

z = [Link](X_, Theta.T)

return a

@bgoncalves [Link]
Perceptron - Training
• Training Procedure:

• If correct, do nothing

• If output incorrectly outputs 0, add input to weight

vector
1
• if output incorrectly outputs 1, subtract input to
weight vector w
x1 0j

• Guaranteed to converge, if a correct set of weights

w
1j
exists x2 w2j
zj
• Given enough features, perceptrons can learn almost T
w 3j w x
anything x3
j
• Specific Features used limit what is possible to learn w N

xN
@bgoncalves [Link]
1
Linear Boundaries 0

AND OR NOR

@bgoncalves [Link]
1
Linear Boundaries 0
• Perceptrons rely on hyperplanes to separate the data points. Unfortunately, this is not always
possible:
AND OR NOR

Possible Possible Possible

XOR

Impossible
@bgoncalves [Link]
Linear Boundaries

@bgoncalves [Link]
Code - Perceptron / Forward
Propagation
[Link]
Perceptron
1
w
x1 0j
w
1j
x2 w2j
zj
T
w 3j w x
x3
j
N
w

@bgoncalves [Link]
Perceptron
1
w
x1 0j
w
1j
x2 w2j
zj
T
w 3j w x
x3
j
N
w

• This is just a graphical representation of:

z = wT x
which is just linear regression!

@bgoncalves [Link]
Linear Regression
⃗
x ) x1
13 f( w1
y ≈ x 0+
Each point is w0 Add x0 ≡ 1
xi ⃗ = (x0, x1, ⋯, xn)
T
represented = to account
y
by a vector for intercept
9.75

6.5
y

3.25

0
0 5 10 15 20
x1
@bgoncalves [Link]
Optimization Problem
• (Machine) Learning can be thought of as an optimization problem.

• Optimization Problems have 3 distinct pieces:

• The constraints

• The function to optimize

• The optimization algorithm.

@bgoncalves [Link]
Optimization Problem
• (Machine) Learning can be thought of as an optimization problem.

• Optimization Problems have 3 distinct pieces:

• The constraints Problem Representation

• The function to optimize Prediction Error

• The optimization algorithm. Gradient Descent

@bgoncalves [Link]
Linear Regression
• We are assuming that our functional dependence is of the form:

f ( x )⃗ = w0 + w1x1 + ⋯ + wn xn ≡ X w ⃗

• In other words, at each step, our hypothesis is:

hw (X) = X w ⃗ ≡ ŷ

Feature N
Feature 1
Feature 2
Feature 3
…

value
and it imposes a Constraint on the solutions that can be found.
Sample 1
Sample 2
• We quantify our far our hypothesis is from the correct value using Sample 3
Sample 4
an Error Function: Sample 5

1 Sample 6

[ ]
2
Jw (X, y )⃗ = (
(i)
)
(i) X y
2m ∑
h w x − y .

i
or, vectorially:
1
Jw (X, y )⃗ = [X w ⃗ − y ]⃗
2
2m Sample M

@bgoncalves [Link]
Linear Regression

@bgoncalves [Link]
Geometric Interpretation
13
1
Jw (X, y )⃗ = [ ⃗ ⃗
]
2
X w − y
2m

9.75

6.5

Quadratic error
means that an error
3.25 twice as large is
penalized four times
as much.

0
0 5 10 15 20

@bgoncalves [Link]
Gradient Descent
• Goal: Find the minimum of Jw (X, y )⃗ by varying the components of w ⃗

• Intuition: Follow the slope of the error function until convergence

δ
− Jw (X, y )⃗
δ w⃗
Jw (X, y )⃗

δ
− Jw (X, y )⃗
δ w⃗

Jw (X, y ⃗)
• Algorithm:

⃗ (initial values of the parameters)

• Guess w (0)
step size
• Update until “convergence”:
δ
Jw (X, y )⃗
δ 1 T
wj = wj − α Jw (X, y )⃗ = X ⋅ (hw (X ) − y )⃗
δwj δwj m
@bgoncalves [Link]
`
Geometric Interpretation

2D 3D nD
13

9.75

6.5

3.25

0
0 5 10 15 20

y = w0 + w1x1 y = w0 + w1x1 + w2 x2 y = X w⃗

Finds the hyperplane that Add

to account x0 ≡ 1
splits the points in two for intercept
such that the errors on
each side balance out

@bgoncalves [Link]
Code - Linear Regression
[Link]
Learning Procedure

Constraint

input predicted observed

hypothesis
output output
XT hw (X) ŷ ⃗ Jw (X, y )⃗

Learning Error
Algorithm Function

@bgoncalves [Link]
Learning Procedure

Constraint

input predicted observed

hypothesis
output output
XT hw (X) = ϕ (X T w) ŷ ⃗ Jw (X, y )⃗

Which we
can redefine

Learning Error
Algorithm Function

@bgoncalves [Link]
Learning Procedure

Constraint

T predicted observed
input z=X w hypothesis
output output
XT ϕ (z) ŷ ⃗ Jw (X, y )⃗
And
rewrite

Learning Error
Algorithm Function

@bgoncalves [Link]
Logistic Regression (Classification)
• Not actually regression, but rather Classification

• Predict the probability of instance belonging to the given class:

hw (X) ∈ [0,1]
1 - part of the class
• Use sigmoid/logistic function to map weighted inputs to[0,1]
0 - otherwise
hw (X) = ϕ (X w )⃗

z encapsulates all
the parameters and
input values

@bgoncalves
maximize the value
of z for members of
Geometric Interpretation the class

1
ϕ (z) ≥
2

@bgoncalves [Link]
Logistic Regression
• Error Function - Cross Entropy
1 T
Jw (X, y )⃗ = − [y log (hw (X)) + (1 − y) log (1 − hw (X))]
T
m
measures the “distance” between two probability distributions
1
hw (X) =
1 + e −X w ⃗
• Effectively treating the labels as probabilities (an instance with label=1 has Probability 1 of
belonging to the class).

• Gradient - same as Logistic Regression

δ
wj = wj − α Jw (X, y )⃗
δwj
δ 1 T
Jw (X, y )⃗ = X ⋅ (hw (X ) − y )⃗
δwj m

@bgoncalves [Link]
Iris dataset

@bgoncalves [Link]
Code - Logistic Regression
[Link]
Logistic Regression

@bgoncalves [Link]
Logistic Regression

@bgoncalves [Link]
Learning Procedure

Constraint

T predicted observed
input z=X w hypothesis
output output
XT ϕ (z) ŷ ⃗ Jw (X, y )⃗

Learning Error
Algorithm Function

@bgoncalves [Link]
Comparison
• Linear Regression • Logistic Regression

z = X w⃗ z = X w⃗
Map features to a
continuous variable

hw (X) = ϕ (Z ) hw (X) = ϕ (Z ) Compare

prediction with
reality
1 Predict based on
ϕ (Z ) = Z ϕ (Z ) =
1 + e −Z continuous variable

1 1 T
⃗
Jw (X, y ) = ⃗
[hw (X) − y ] Jw (X, y ) = − [y log (hw (X)) + (1 − y) log (1 − hw (X))]
⃗
2 T
2m m

δ 1 T δ 1 T
Jw (X, y )⃗ = X ⋅ (hw (X ) − y )⃗ Jw (X, y )⃗ = X ⋅ (hw (X ) − y )⃗
δwj m δwj m

@bgoncalves [Link]
Learning Procedure
Bias
1
w
x1 0j
w
1j
x2 w2j
zj
w 3j wT x (z)
x3
j
N
w

xN
Activation
Inputs Weights function

@bgoncalves [Link]
Generalized Perceptron
Bias
1
By changing the
w activation function,
x1 0j we change the
w underlying algorithm
1j
x2 w2j
zj
w 3j wT x (z)
x3
j
N
w

xN
Activation
Inputs Weights function

@bgoncalves [Link]
Activation Function
• Non-Linear function

• Differentiable

• non-decreasing

• Compute new sets of features

• Each layer builds up a more abstract representation of the data

@bgoncalves [Link]
Activation Function - Linear
• Non-Linear function

• Differentiable

• non-decreasing

ϕ (z) = z
• Compute new sets of features

• Each layer builds up a more abstract

representation of the data

• The simplest

@bgoncalves [Link]
Activation Function - Linear
• Non-Linear function

• Differentiable Linear Regression

• non-decreasing

ϕ (z) = z
• Compute new sets of features

• Each layer builds up a more abstract

representation of the data

• The simplest

@bgoncalves [Link]
Activation Function - Sigmoid
• Non-Linear function

• Differentiable

• non-decreasing

1
• Compute new sets of features ϕ (z) =
1 + e −z
• Each layer builds up a more abstract
representation of the data

• Perhaps the most common

@bgoncalves [Link]
Activation Function - Sigmoid
• Non-Linear function

• Differentiable Logistic Regression

• non-decreasing

1
• Compute new sets of features ϕ (z) =
1 + e −z
• Each layer builds up a more abstract
representation of the data

• Perhaps the most common

@bgoncalves [Link]
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:

• obtain the inputs

• multiply the inputs by the respective weights

• calculate output using the activation function

@bgoncalves [Link]
Code - Forward Propagation
[Link]
Activation Function - ReLu
• Non-Linear function

• Differentiable

• non-decreasing

ϕ (z) = z, z > 0
• Compute new sets of features

• Each layer builds up a more abstract

representation of the data

• Results in faster learning than with

sigmoid

@bgoncalves [Link]
Activation Function - ReLu
• Non-Linear function

• Differentiable Stepwise Regression

• non-decreasing

ϕ (z) = z, z > 0
• Compute new sets of features

• Each layer builds up a more abstract

representation of the data

• Results in faster learning than with

sigmoid

@bgoncalves [Link]
Stepwise Regression [Link]

• Multivariate Adaptive Regression Spline (MARS) is the best known example

• Fit curves using a linear combination of:

f ̂ (x) =
∑
ci Bi (x)
i
• The basis functions can be:
Bi (x)

• Constant

• “Hinge” functions of the form: and

max(0, x − b) max(0, b − x)

• Products of hinges

@bgoncalves [Link]
Stepwise Regression [Link]

• Multivariate Adaptive Regression Spline (MARS) is the best known example

• Fit curves using a linear combination of: f ̂ (x) = ∑ ci Bi (x)

y (x) = 1.013
+ 1.198 max (0, x 0.485)
1.803 max (0, 0.485 x)
1.321 max (0, x 0.283)
1.609 max (0, x 0.640)
<latexit sha1_base64="8sdMuX6MlC9h4jWMaRP7SJkiDEU=">AAAC9XicbZLPb9MwFMedjB8jA9axIxeLimr8WGS3+5EekCZx4Tgkuk1qqspxndaa4wTbYY2i7u/gwgGEuPK/cOO/wWlzKFmfZOmr9z7v+b1nR5ng2iD013G37t1/8HD7kbfz+MnT3dbeswud5oqyAU1Fqq4iopngkg0MN4JdZYqRJBLsMrp+X8UvvzCleSo/mSJjo4RMJY85Jca6xnvOThixKZcl+yyJUqR4vfCKULDYHMxDxacz86rzroN9hHth6HXeWIn7AbyFCZmvMPQWzuEhRP5RcFxnVOShJQPUa5BLytLzBtnr4o01u0GvQZ6g/kby5AitkVWfx/3NNfvotCa9kMnJ2uTjVhv5aGnwrsC1aIPazsetP+EkpXnCpKGCaD3EKDOjkijDqWALL8w1ywi9JlM2tFKShOlRuXy1BXxpPRMYp8oeaeDSu55RkkTrIoksmRAz081Y5dwUG+YmDkYll1lumKSri+JcQJPC6gvACVeMGlFYQajitldIZ0QRauxH8ewScHPku+Ki62Pk44/d9llQr2MbPAcvwAHA4BScgQ/gHAwAdZTz1fnu/HBv3G/uT/fXCnWdOmcf/Gfu739e5dqD</latexit>
+ 1.591 max (0, x 0.907)

@bgoncalves [Link]
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate output using the activation function
• To create a multi-layer perceptron, you can simply use the output of one layer as the input to
the next one.
1
1
w
w a1 0k
x1 0j w
1k
w
1j a2 w2k
x2 w2j ak
w 3k wT a
w 3j wT x aj
x3
k
j wN
wN
aN
xN

• But how can we propagate back the errors and update the weights?
@bgoncalves [Link]
Stepwise Regression [Link]

f ̂ (x) =
∑
ci Bi (x)
i

1
1

x ReLu
1

x ReLu Linear

x ReLu

@bgoncalves [Link]
Loss Functions
• For learning to occur, we must quantify how far off we are from the desired output. There are
two common ways of doing this:

• Quadratic error function:

1 X 2
E= |yn an |
N n
• Cross Entropy
1 Xh T T
i
J= yn log an + (1 yn ) log (1 an )
N n

The Cross Entropy is complementary to sigmoid

activation in the output layer and improves its stability.

@bgoncalves [Link]
Regularization
• Helps keep weights relatively small by adding a penalization to the cost function.

• Two common choices:

Jŵ (X) = Jw (X) + λ

∑
wij “Lasso”
ij

Jŵ (X) = Jw (X) + λ wij2

∑ L2
ij

• Lasso helps with feature selection by driving less important weights to zero

@bgoncalves [Link]
Backward Propagation of Errors (BackProp)
• BackProp operates in two phases:

• Forward propagate the inputs and calculate the deltas

• Update the weights

• The error at the output is a weighted average difference between predicted output and the
observed one.

• For inner layers there is no “real output”!

@bgoncalves [Link]
BackProp
• Let δ (l) be the error at each of the total L layers:

• Then:
δ (L) = hw (X) − y

• And for every other layer, in reverse order:

δ (l) = W (l)T δ (l+1). * ϕ † (z (l))

• Until:
δ (1) ≡ 0
as there’s no error on the input layer.

• And finally:

Δ(l)
ij
= Δ (l)
ij
+ a (l) (l+1)
j i
δ
∂ 1 (l)
⃗
(l) w ( )
(l)
J X, y = Δ + λw
∂wij m ij ij

@bgoncalves [Link]
A practical example - MNIST
8

Feature M
Feature 1
Feature 2
Feature 3
…

Label
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
.

Sample N

[Link]/exdb/mnist/
@bgoncalves
A practical example - MNIST

Feature N
Feature 1
Feature 2
Feature 3
…

Label
Sample 1
Sample 2
Sample 3
3 layers: Sample 4
1 input layer

arg max
Sample 5
Sample 6
1 hidden layer X ⇥1 ⇥2 .
1 output layer X y

Sample M

[Link]/exdb/mnist/
@bgoncalves
A practical example - MNIST
def forward(Theta, X, active):
N = [Link][0]
5000 examples

# Add the bias column

X_ = [Link](([Link]((N, 1)), X), 1)

arg max
# Multiply by the weights
X ⇥1 ⇥2 z = [Link](X_, Theta.T)

# Apply the activation function

a = active(z)

return a

Vectors 784 50 10 1
Matrices 50 ⇥ 785 10 ⇥ 51 Forward Propagation
Matrices 50 ⇥ 784 10 ⇥ 50 Backward Propagation

def predict(Theta1, Theta2, X):

h1 = forward(Theta1, X, sigmoid)
h2 = forward(Theta2, h1, sigmoid)

return [Link](h2, 1)

@bgoncalves [Link]
Code - Simple Network
[Link]
Practical Considerations
• So far we have looked at very idealized cases. Reality is never this simple!

• In practice, many details have to be considered:

• Data normalization

• Overfitting

• Hyperparameters

• Bias, Variance tradeoffs

• etc…

@bgoncalves [Link]
Data Normalization
• The range of raw data values can vary widely.
• Using feature with very different ranges in the same analysis can cause numerical problems.
Many algorithms are linear or use euclidean distances that are heavily influenced by the
numerical values used (cm vs km, for example)
• To avoid difficulties, it’s common to rescale the range of all features in such a way that each
feature follows within the same range.
• Several possibilities:
x xmin
• Rescaling - x̂ =
xmax xmin
x µx
• Standardization - x̂ =
x
x
• Normalization - x̂ =
||x||

• In the rest of the discussion we will assume that the data has been normalized in some

@bgoncalves [Link]
Supervised Learning - Overfitting

Feature M
Feature 1
Feature 2
Feature 3
…

value
Sample 1
• “Learning the noise” Sample 2
Sample 3

Training
• “Memorization” instead of “generalization” Sample 4
Sample 5
Sample 6
• How can we prevent it? .

• Split dataset into two subsets: Training and Testing

Testing
• Train model using only the Training dataset and evaluate
results in the previously unseen Testing dataset. Sample N

• Different heuristics on how to split:

• Single split

• k-fold cross validation: split dataset in k parts, train in

k-1 and evaluate in 1, repeat k times and average
results.

@bgoncalves [Link]
Bias-Variance Tradeoff

@bgoncalves [Link]
Bias-Variance Tradeoff
High Bias Low Bias
Low Variance High Variance

Training
Error

Testing

Variance
Bias

Model Complexity

@bgoncalves [Link]
Learning Rate
δ
wij = wij − α Jw (X, y )⃗
δwij
δ
↵ defines size of step in direction of Jw (X, y )⃗
δwij

Very High Learning Rate

Loss

High Learning Rate

Low Learning Rate
Best Learning Rate

Epoch

@bgoncalves [Link]
Tips
• online learning - update weights after each case
- might be useful to update model as new data is obtained
- subject to fluctuations

• mini-batch - update weights after a “small” number of cases

- batches should be balanced
- if dataset is redundant, the gradient estimated using only a fraction of the data
is a good approximation to the full gradient.

• momentum - let gradient change the velocity of weight change instead of the value directly

• rmsprop - divide learning rate for each weight by a running average of “recent” gradients

• learning rate - vary over the course of the training procedure and use different learning rates
for each weight

@bgoncalves [Link]
Generalization
• Neural Networks are extremely modular in their design with

• Fortunately, we can write code that is also modular and can class Activation(object):
def f(z):
easily handle arbitrary numbers of layers pass

def df(z):
pass
• Let’s describe the structure of our network as a list of weight
matrices and activation functions class Linear(Activation):
def f(z):
return z

• We also need to keep track of the gradients of the activation def df(z):
return [Link]([Link])
functions so let us define a simple class:
class Sigmoid(Activation):
def f(z):
return 1./(1+[Link](-z))

def df(z):
h = Sigmoid.f(z)
return h*(1-h)

@bgoncalves [Link]
Generalization
• Now we can describe our simple MNIST model with:

Thetas = []
[Link](init_weights(input_layer_size, hidden_layer_size))
[Link](init_weights(hidden_layer_size, num_labels))

model = []

[Link](Thetas[0])
[Link](Sigmoid)
[Link](Thetas[1])
[Link](Sigmoid)

• Where Sigmoid is an object that contains both the sigmoid function and its gradient as was
defined in the previous slide.

@bgoncalves [Link]
Generalization - Forward propagation

def forward(Theta, X, active):

N = [Link][0]

# Add the bias column

X_ = [Link](([Link]((N, 1)), X), 1)

# Multiply by the weights

z = [Link](X_, Theta.T)

# Apply the activation function

a = active.f(z)

return a

def predict(model, X):

h = [Link]()

for i in range(0, len(model), 2):

theta = model[i]
activation = model[i+1]

h = forward(theta, h, activation)

return [Link](h, 1)

@bgoncalves [Link]
def backprop(model, X, y):
M = [Link][0]

Thetas = model[0::2]
activations = model[1::2]
layers = len(Thetas)

K = Thetas[-1].shape[0]
J = 0
Deltas = []

for i in range(layers):
[Link]([Link](Thetas[i].shape))

deltas = [0, 0, 0, 0]

for i in range(M):
As = []
Zs = [0]
Hs = [X[i]]
# Forward propagation, saving intermediate results
[Link]([Link](([1], Hs[0]))) # Input layer

for l in range(1, layers+1):

[Link]([Link](Thetas[l-1], As[l-1]))
[Link](activations[l-1].f(Zs[l]))
[Link]([Link](([1], Hs[l])))

y0 = one_hot(K, y[i])

# Cross entropy
J -= [Link](y0.T, [Link](Hs[2]))+[Link]((1-y0).T, [Link](1-Hs[2]))

# Calculate the weight deltas

deltas[layers] = Hs[layers]-y0

for l in range(layers-1, 1, -1):

deltas[l] = [Link](Thetas[l-1].T, deltas[l+1])[1:]*activations[l-1].df(Zs[l-1])
Deltas[l] += [Link](deltas[l+1], As[l])

J /= M
grads = []
[Link](Deltas[0]/M)
[Link](Deltas[1]/M)

@bgoncalves return [J, grads]

Code - Modular Network
[Link]
Neural Network Architectures

@bgoncalves [Link]
word2vec Mikolov 2013

Skipgram Continuous Bag of Words

max p (C|w) max p (w|C)

1
wj

wj
⇥2 ⇥1 word embeddings ⇥2

⇥2 context embeddings
wj

wj
⇥1 ⇥1
wj one hot vector
⇥2 ⇥2
activation function
wj+1

wj+1

Word Context Context Word

@bgoncalves [Link]
Visualization

[Link]
“You shall know a word by the company it keeps”
(J. R. Firth)
Analogies
• The embedding of each word is a function of the context it appears in:
(red) = f (context (red))
• words that appear in similar contexts will have similar embeddings:

context (red) ⇡ context (blue) =) (red) ⇡ (blue)

• “Distributional hypotesis” in linguistics

France
Italy Portugal Country context
Geometrical relations Paris
USA

between contexts imply Capital context Rome Lisbon

Washington DC
semantic relations
between words!

(F rance) (P aris) + (Rome) = (Italy)

~b ~a + ~c = d~
@bgoncalves [Link]
Analogies [Link]
What is the word d that is most similar to
~b ~a + ~c = d~
b and c and most dissimilar to a?
⇣ ⌘T
~b ~a + ~c
d† = argmax ~x
x ~b ~a + ~c
⇣ ⌘
d† ⇠ argmax ~bT ~x ~aT ~x + ~cT ~x
x

@bgoncalves [Link]
Feed Forward Networks

ht Output

xt Input

ht = f (xt)

@bgoncalves [Link]
Feed Forward Networks

ht Output

Information
Flow

xt Input

ht = f (xt)

@bgoncalves [Link]
Information
Recurrent Neural Network (RNN) Flow

ht Output

Previous ht−1
Output
xt Input

ht = f (xt, ht−1)

@bgoncalves [Link]
Recurrent Neural Network (RNN)
• Each output depends (implicitly) on all previous outputs.

• Input sequences generate output sequences (seq2seq)

ht−1 ht ht+1
ht−2 ht−1 ht ht+1

xt−1 xt xt+1

@bgoncalves [Link]
Long-Short Term Memory (LSTM)
• What if we want to keep explicit information about previous states (memory)?

• How much information is kept, can be controlled through gates.

• LSTMs were first introduced in 1997 by Hochreiter and Schmidhuber

ht−1 ht ht+1
ct−2 ct−1 ct ct+1
ht−2 ht−1 ht ht+1

xt−1 xt xt+1

@bgoncalves [Link]
Convolutional Neural Networks

@bgoncalves [Link]
@bgoncalves
Curve Fitting?

@bgoncalves [Link]
Interpretability?

@bgoncalves [Link]
“Deep” learning

@bgoncalves [Link]
Events
[Link]/newsletter
Time Series for Everyone
Dec 4, 2020 - 5am-9am (PST)

Advanced Time Series for Everyone

Dec 11, 2020 - 5am-9am (PST)

[Link]

Natural Language Processing (NLP) from Scratch

[Link] - On Demand

@bgoncalves
@bgoncalves [Link]
[Link]

Deep Learning From Scratch
No ratings yet
Deep Learning From Scratch
96 pages
Deep Learning for Data Experts
No ratings yet
Deep Learning for Data Experts
95 pages
2021 10 11 - Intro ML - Inserm
No ratings yet
2021 10 11 - Intro ML - Inserm
41 pages
NN Theory
No ratings yet
NN Theory
138 pages
Machine Learning (CSEN3203) 1-14
No ratings yet
Machine Learning (CSEN3203) 1-14
15 pages
GML Slides 2024 04 29
No ratings yet
GML Slides 2024 04 29
206 pages
05 Optimization Basics
No ratings yet
05 Optimization Basics
94 pages
Lecture 03 - Feedforward Networks - 4p
No ratings yet
Lecture 03 - Feedforward Networks - 4p
19 pages
Lecture 1
No ratings yet
Lecture 1
56 pages
Week3 LearningI
No ratings yet
Week3 LearningI
48 pages
6.036: Intro To Machine Learning: Lecturer: Professor Leslie Kaelbling Notes By: Andrew Lin Fall 2019
No ratings yet
6.036: Intro To Machine Learning: Lecturer: Professor Leslie Kaelbling Notes By: Andrew Lin Fall 2019
50 pages
NLP-NeuralNetworks Reading Notes
No ratings yet
NLP-NeuralNetworks Reading Notes
13 pages
Learning With Linear Neurons: Adapted From Lectures by Geoffrey Hinton and Others Updated by N. Intrator, May 2007
No ratings yet
Learning With Linear Neurons: Adapted From Lectures by Geoffrey Hinton and Others Updated by N. Intrator, May 2007
59 pages
Linear - Regression - SGD
No ratings yet
Linear - Regression - SGD
71 pages
Lecture 3 - Regression
No ratings yet
Lecture 3 - Regression
47 pages
ECE 449 Notes
No ratings yet
ECE 449 Notes
5 pages
02A DL2023 NN Basics
No ratings yet
02A DL2023 NN Basics
52 pages
6.86x Machine Learning With Python: Linear Classifiers
No ratings yet
6.86x Machine Learning With Python: Linear Classifiers
7 pages
Predicting Student Pass Rates
No ratings yet
Predicting Student Pass Rates
17 pages
Machine Learning Notes Cs229 1
No ratings yet
Machine Learning Notes Cs229 1
217 pages
Lecture 17&18 - Introduction To Machine Learning
No ratings yet
Lecture 17&18 - Introduction To Machine Learning
51 pages
1.1 Introduction
No ratings yet
1.1 Introduction
73 pages
Machine Learning Algorithms Explained
No ratings yet
Machine Learning Algorithms Explained
46 pages
Machine Learning - Classifiers and Boosting: Reading CH 18.6-18.12, 20.1-20.3.2
No ratings yet
Machine Learning - Classifiers and Boosting: Reading CH 18.6-18.12, 20.1-20.3.2
54 pages
BITS F464 ML Lecture Notes
No ratings yet
BITS F464 ML Lecture Notes
86 pages
ML Imp QB
No ratings yet
ML Imp QB
34 pages
Lecture 2
No ratings yet
Lecture 2
66 pages
Machine Learning Course Intro
No ratings yet
Machine Learning Course Intro
29 pages
Maths For ML
No ratings yet
Maths For ML
156 pages
Matematics and Machine Learning
No ratings yet
Matematics and Machine Learning
156 pages
Machine Learning Overview
No ratings yet
Machine Learning Overview
92 pages
Intro to Machine Learning Basics
No ratings yet
Intro to Machine Learning Basics
61 pages
S02 DNN Perceptron Wip
No ratings yet
S02 DNN Perceptron Wip
24 pages
Machine Learning Week 4
No ratings yet
Machine Learning Week 4
24 pages
Classification
No ratings yet
Classification
47 pages
Deep Learning Overview and Concepts
No ratings yet
Deep Learning Overview and Concepts
42 pages
18ai61-Model Question Paper Solutions
No ratings yet
18ai61-Model Question Paper Solutions
71 pages
lec22-ML III
No ratings yet
lec22-ML III
51 pages
Minsky y Papert
No ratings yet
Minsky y Papert
77 pages
1) Deep - Learning
No ratings yet
1) Deep - Learning
60 pages
Deep Feedforward Neural Networks Guide
No ratings yet
Deep Feedforward Neural Networks Guide
97 pages
Intro To Neural Networks Explained For Beginners: Sajjad Mustafa
No ratings yet
Intro To Neural Networks Explained For Beginners: Sajjad Mustafa
110 pages
Understanding Deep Learning Concepts
No ratings yet
Understanding Deep Learning Concepts
78 pages
Lec1 PerceptronPocket Recap
100% (1)
Lec1 PerceptronPocket Recap
61 pages
Aula 4 (L) - Oggi La Tua Lezione È in Presenza
No ratings yet
Aula 4 (L) - Oggi La Tua Lezione È in Presenza
11 pages
Machine Learning
No ratings yet
Machine Learning
28 pages
Deep Learning Tutorial for Business
No ratings yet
Deep Learning Tutorial for Business
58 pages
AI Course Overview for Students
No ratings yet
AI Course Overview for Students
86 pages
2021 Logistic Regression
No ratings yet
2021 Logistic Regression
33 pages
Ece18898g Neural Networks
No ratings yet
Ece18898g Neural Networks
47 pages
DNN - M2 - Deep Feedforward NN 23dec
No ratings yet
DNN - M2 - Deep Feedforward NN 23dec
97 pages
Deep Neural Networks - 2
No ratings yet
Deep Neural Networks - 2
55 pages
Gradient-Based Learning & Neural Networks
No ratings yet
Gradient-Based Learning & Neural Networks
72 pages
Lecture8 DeepLearning
No ratings yet
Lecture8 DeepLearning
94 pages
XOR Problem & Two-Layer Perceptron
No ratings yet
XOR Problem & Two-Layer Perceptron
74 pages
Week11 - Regularization and Optimization
No ratings yet
Week11 - Regularization and Optimization
75 pages
Machine Learning Concepts Explained
No ratings yet
Machine Learning Concepts Explained
17 pages
Artificial Neural Networks
No ratings yet
Artificial Neural Networks
100 pages
Rattle: Easy Data Mining with R
No ratings yet
Rattle: Easy Data Mining with R
1 page
Lab2.ipynb - Colaboratory
No ratings yet
Lab2.ipynb - Colaboratory
2 pages
Language Acquisition in Children
No ratings yet
Language Acquisition in Children
9 pages
Listening For Gist (The Overall Idea) : A. Understanding This Listening Sub-Skill
100% (2)
Listening For Gist (The Overall Idea) : A. Understanding This Listening Sub-Skill
2 pages
Robot Perception
100% (1)
Robot Perception
19 pages
Assignment - 10: Shyam Shankar H R EE15B127 November 9, 2017
No ratings yet
Assignment - 10: Shyam Shankar H R EE15B127 November 9, 2017
15 pages
Comparative Study of Classification Algorithms Based On Mapreduce Model
No ratings yet
Comparative Study of Classification Algorithms Based On Mapreduce Model
4 pages
Decision Tree Algorithm Guide
No ratings yet
Decision Tree Algorithm Guide
4 pages
Lec05 Classification DecisionTree
No ratings yet
Lec05 Classification DecisionTree
67 pages
4.2. Finite State Machines (FSMS)
No ratings yet
4.2. Finite State Machines (FSMS)
4 pages
AI Course for Engineering Students
No ratings yet
AI Course for Engineering Students
3 pages
Routh-Hurwitz Stability Analysis Guide
No ratings yet
Routh-Hurwitz Stability Analysis Guide
49 pages
Design Paper GID 38
No ratings yet
Design Paper GID 38
13 pages
Markov Chain Analysis of Google PageRank
No ratings yet
Markov Chain Analysis of Google PageRank
15 pages
Competitive Learning Neural Networks
100% (1)
Competitive Learning Neural Networks
10 pages
Feature Extraction Based Wavelet Transform in Breast Cancer Diagnosis - A Survey
100% (1)
Feature Extraction Based Wavelet Transform in Breast Cancer Diagnosis - A Survey
4 pages
Materi Second Language Acquisition
100% (3)
Materi Second Language Acquisition
16 pages
PLC Overview and IEC 61131 Standards
100% (1)
PLC Overview and IEC 61131 Standards
64 pages
Recurrent Neural Network Wiki
100% (1)
Recurrent Neural Network Wiki
7 pages
Review of Image Segmentation Techniques
No ratings yet
Review of Image Segmentation Techniques
6 pages
Data Visualization Cheatsheet 1702209209
100% (1)
Data Visualization Cheatsheet 1702209209
7 pages
S06 - Joos Et Al - 2013
No ratings yet
S06 - Joos Et Al - 2013
11 pages
Portfolio Optimization With Return Prediction Using Deep Learning and Machine Learning
No ratings yet
Portfolio Optimization With Return Prediction Using Deep Learning and Machine Learning
15 pages
Machine Theory of Mind: ToMnet AI
No ratings yet
Machine Theory of Mind: ToMnet AI
21 pages
Control System MCQ
No ratings yet
Control System MCQ
9 pages
Interaction Styles & Devices Guide
No ratings yet
Interaction Styles & Devices Guide
18 pages
AI-900 Notes: Describe Artificial Intelligence Workloads and Considerations
100% (1)
AI-900 Notes: Describe Artificial Intelligence Workloads and Considerations
16 pages
Logic Tensor Networks
No ratings yet
Logic Tensor Networks
61 pages
Nonverbal Communication and Discourse Analysis
No ratings yet
Nonverbal Communication and Discourse Analysis
14 pages
Write A Program For Image Filtering: 1 Above Average (03) Average (02) Below Average
No ratings yet
Write A Program For Image Filtering: 1 Above Average (03) Average (02) Below Average
6 pages