CSCI218: Foundations
of Artificial Intelligence
Classical stats/ML: Minimize loss function
§ Which hypothesis space H to choose?
§ E.g., linear combinations of features: h_w(x) = w^T x
§ How to measure degree of fit?
§ Loss function, e.g., squared error Σ_j (y_j − w^T x_j)^2
§ How to trade off degree of fit vs. complexity?
§ Regularization: complexity penalty, e.g., ||w||^2
§ How do we find a good h?
§ Optimization (closed-form, numerical); discrete search
§ How do we know if a good h will predict well?
§ Try it and see (cross-validation, bootstrap, etc.)
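The pieces above fit together in, for example, ridge regression: a linear hypothesis, squared-error loss, an ||w||^2 penalty, and a closed-form optimizer. Below is a minimal sketch assuming made-up data X, y and a placeholder regularization strength lam; none of these values come from the slides.

```python
import numpy as np

# Minimal sketch: linear hypothesis h_w(x) = w^T x, squared-error loss,
# and an L2 (||w||^2) complexity penalty, fit in closed form (ridge regression).
# X, y, and lam below are made-up placeholders, not data from the slides.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
lam = 0.1                              # regularization strength

# Minimize sum_j (y_j - w^T x_j)^2 + lam * ||w||^2 via the closed-form solution
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# "Try it and see": evaluate the fit (here, on the training data itself)
print("learned w:", w)
print("mean squared error:", np.mean((y - X @ w) ** 2))
```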
Deep Learning/Neural Network
Image Classification
Very loose inspiration: Human neurons
[Figure: a biological neuron, labeling the cell body (soma), nucleus, dendrites, axon, axonal arborization, synapses, and the axon from another cell]
Simple model of a neuron (McCulloch & Pitts, 1943)
[Figure: a unit j with input links carrying activations a_i and weights w_{i,j}, a fixed bias input a_0 = 1 with bias weight w_{0,j}, a summation producing in_j, an activation function g, and output a_j = g(in_j) on the output links]
§ Inputs ai come from the output of node i to this node j (or from “outside”)
§ Each input link has a weight wi,j
§ There is an additional fixed input a0 with bias weight w0,j
§ The total input is in_j = Σ_i w_{i,j} a_i
§ The output is a_j = g(in_j) = g(Σ_i w_{i,j} a_i) = g(w · a)
Activation functions g
[Figure: two common choices for g(in_i): (a) a hard threshold (step) function and (b) the sigmoid 1/(1 + e^{-x}); both saturate at +1 for large inputs]
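A minimal sketch of the McCulloch & Pitts unit with the two activation functions above; the weights and inputs are made-up illustrative values, not from the slides.

```python
import numpy as np

# Minimal sketch of the McCulloch & Pitts unit: output a_j = g(in_j)
# with in_j = sum_i w_{i,j} a_i (the bias is folded in as a_0 = 1 with weight w_{0,j}).
# The weights and inputs below are made-up illustrative values.

def threshold(x):
    return np.where(x >= 0, 1.0, 0.0)    # hard threshold activation (a)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # sigmoid activation (b)

def unit_output(weights, inputs, g):
    a = np.concatenate(([1.0], inputs))  # prepend the fixed bias input a_0 = 1
    in_j = weights @ a                   # total input in_j = w . a
    return g(in_j)

w = np.array([-1.5, 1.0, 1.0])           # bias weight w_{0,j}, then w_{1,j}, w_{2,j}
x = np.array([1.0, 1.0])
print(unit_output(w, x, threshold))      # acts like a logical AND gate here
print(unit_output(w, x, sigmoid))        # soft version of the same decision
```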
Reminder: Linear Classifiers
▪ Inputs are feature values
▪ Each feature has a weight
▪ Sum is the activation
▪ If the activation is:
▪ Positive, output +1
▪ Negative, output -1
[Figure: feature values f1, f2, f3 multiplied by weights w1, w2, w3 and summed (Σ); output +1 if the sum is > 0]
How to get probabilistic decisions?
If the activation z = w · f(x) is very positive, want probability going to 1
If it is very negative, want probability going to 0
Sigmoid function: φ(z) = 1 / (1 + e^{-z})
Best w?
Maximum likelihood estimation:
    max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
with:
    P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^{-w · f(x^(i))})
    P(y^(i) = -1 | x^(i); w) = 1 − 1 / (1 + e^{-w · f(x^(i))})
= Logistic Regression
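A minimal sketch of these probabilities and the log likelihood being maximized; the weight vector, feature vectors, and labels below are made-up placeholders.

```python
import numpy as np

# Minimal sketch of binary logistic regression probabilities; w and the
# feature vectors below are made-up placeholders.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_positive(w, f_x):
    """P(y = +1 | x; w) = 1 / (1 + exp(-w . f(x)))."""
    return sigmoid(w @ f_x)

w = np.array([2.0, -1.0, 0.5])
f_x = np.array([1.0, 0.0, 3.0])
p = prob_positive(w, f_x)
print("P(y=+1|x):", p, " P(y=-1|x):", 1 - p)

# Log likelihood of a small labeled set (y in {+1, -1}); maximizing this
# over w is the "best w?" objective above.
data = [(np.array([1.0, 0.0, 3.0]), +1), (np.array([0.5, 2.0, -1.0]), -1)]
ll = sum(np.log(p if y == +1 else 1 - p)
         for f, y in data
         for p in [prob_positive(w, f)])
print("log likelihood:", ll)
```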
Multiclass Logistic Regression
Multi-class linear classification
A weight vector for each class: w_y
Score (activation) of a class y: w_y · f(x)
Prediction w/ highest score wins: y = arg max_y w_y · f(x)
How to make the scores into probabilities?
z_1, z_2, z_3  →  e^{z_1}/(e^{z_1}+e^{z_2}+e^{z_3}),  e^{z_2}/(e^{z_1}+e^{z_2}+e^{z_3}),  e^{z_3}/(e^{z_1}+e^{z_2}+e^{z_3})
original activations → softmax activations
Best w?
Maximum likelihood estimation:
    max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
with:
    P(y^(i) | x^(i); w) = e^{w_{y^(i)} · f(x^(i))} / Σ_y e^{w_y · f(x^(i))}
= Multi-Class Logistic Regression
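A minimal sketch of the multi-class case: one weight vector per class, scores turned into probabilities by the softmax. The weight matrix W and feature vector are made-up placeholders.

```python
import numpy as np

# Minimal sketch of multi-class logistic regression: one weight vector per
# class, scores z_y = w_y . f(x), and softmax to turn scores into probabilities.
# The weight matrix W and feature vector below are made-up placeholders.

def softmax(z):
    z = z - np.max(z)               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

W = np.array([[ 1.0, -0.5,  0.2],   # w_class0
              [ 0.3,  0.8, -1.0],   # w_class1
              [-0.7,  0.1,  0.9]])  # w_class2
f_x = np.array([1.0, 2.0, 0.5])

scores = W @ f_x                    # original activations z_y = w_y . f(x)
probs = softmax(scores)             # softmax activations P(y | x; w)
print("scores:", scores)
print("probabilities:", probs)
print("prediction:", np.argmax(scores))   # highest score wins
```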
Optimization
Optimization
i.e., how do we solve: max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
Hill Climbing
A simple, general idea
Start wherever
Repeat: move to the best neighboring state
If no neighbors better than current, quit
What’s particularly tricky when hill-climbing for multiclass
logistic regression?
• Optimization over a continuous space
• Infinitely many neighbors!
• How to do this efficiently?
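A minimal sketch of the hill-climbing loop on a 1-D objective with a fixed step size; the objective g and the step size are made-up assumptions. The bullets above are exactly the problem: in a continuous weight space there is no finite neighbor list to enumerate.

```python
# Minimal sketch of hill climbing on a toy 1-D objective with a fixed step size.
# The objective g and step size are made-up assumptions; gradient-based methods
# (next slides) avoid having to enumerate neighbors in a continuous space.

def g(w):
    return -(w - 3.0) ** 2          # toy objective with its maximum at w = 3

w, step = 0.0, 0.1                  # start wherever
while True:
    neighbors = [w + step, w - step]
    best = max(neighbors, key=g)    # move to the best neighboring state
    if g(best) <= g(w):             # no neighbor better than current: quit
        break
    w = best
print("hill climbing found w ≈", round(w, 2))
```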
1-D Optimization
Could evaluate g(w_0 + h) and g(w_0 − h)
Then step in the best direction
Or, evaluate the derivative: ∂g(w_0)/∂w = lim_{h→0} (g(w_0 + h) − g(w_0 − h)) / (2h)
which tells which direction to step in
2-D Optimization
[Figure: contour plot of a 2-D objective. Source: offconvex.org]
Gradient Ascent
Perform update in uphill direction for each coordinate
The steeper the slope (i.e. the higher the derivative) the bigger the step
for that coordinate
E.g., consider: g(w_1, w_2)
Updates:
    w_1 ← w_1 + α ∂g/∂w_1 (w_1, w_2)
    w_2 ← w_2 + α ∂g/∂w_2 (w_1, w_2)
▪ Updates in vector notation:
    w ← w + α ∇_w g(w)
with: ∇_w g(w) = [∂g/∂w_1 (w), ∂g/∂w_2 (w)] = gradient
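A minimal sketch of this update rule on a made-up 2-D objective whose maximum is known, so the result can be checked; the objective, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of gradient ascent on a made-up 2-D objective
# g(w1, w2) = -(w1 - 1)^2 - (w2 + 2)^2, whose maximum is at (1, -2).
# The objective, learning rate, and iteration count are illustrative assumptions.

def grad_g(w):
    return np.array([-2 * (w[0] - 1.0), -2 * (w[1] + 2.0)])  # gradient of g

w = np.zeros(2)          # start somewhere
alpha = 0.1              # learning rate
for _ in range(100):
    w = w + alpha * grad_g(w)    # step uphill: w <- w + alpha * grad g(w)
print("gradient ascent found w ≈", np.round(w, 3))   # ≈ [1, -2]
```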
Steepest Descent
o Idea:
o Start somewhere
o Repeat: Take a step in the steepest descent direction
Figure source: Mathworks
Steepest Direction
o Steepest Direction = direction of the gradient
∇g = [∂g/∂w_1, ∂g/∂w_2, …, ∂g/∂w_n]^T
Optimization Procedure: Gradient Ascent
init w
for iter = 1, 2, …
    w ← w + α ∇_w g(w)
▪ α: learning rate, a hyperparameter that needs to be chosen carefully
Batch Gradient Ascent on the Log Likelihood Objective
init w
for iter = 1, 2, …
    w ← w + α Σ_i ∇_w log P(y^(i) | x^(i); w)
Stochastic Gradient Ascent on the Log Likelihood Objective
Observation: once the gradient on one training example has been computed, we might as well incorporate it before computing the next one
init w
for iter = 1, 2, …
    pick random j
    w ← w + α ∇_w log P(y^(j) | x^(j); w)
Mini-Batch Gradient Ascent on the Log Likelihood Objective
Observation: the gradient over a small set of training examples (= a mini-batch) can be computed in parallel, so we might as well do that instead of using a single example
init w
for iter = 1, 2, …
    pick a random subset of training examples J
    w ← w + α Σ_{j∈J} ∇_w log P(y^(j) | x^(j); w)
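A minimal sketch of mini-batch gradient ascent on the binary logistic regression log likelihood; the synthetic data, learning rate, batch size, and iteration count are all made-up assumptions for illustration.

```python
import numpy as np

# Minimal sketch of mini-batch gradient ascent on the binary logistic
# regression log likelihood. The synthetic data, learning rate, batch size,
# and iteration count are all made-up assumptions.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # features f(x) for 200 examples
true_w = np.array([1.5, -2.0, 0.5])
y = np.where(X @ true_w + 0.1 * rng.normal(size=200) > 0, 1.0, -1.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(3)                                 # init
alpha, batch_size = 0.1, 16                     # learning rate, mini-batch size
for it in range(500):
    J = rng.choice(len(X), size=batch_size, replace=False)  # random subset J
    # gradient of sum_{j in J} log P(y_j | x_j; w) for y in {+1, -1}:
    # grad = sum_j (1 - sigmoid(y_j * w.x_j)) * y_j * x_j
    margins = y[J] * (X[J] @ w)
    grad = ((1.0 - sigmoid(margins)) * y[J]) @ X[J]
    w = w + alpha * grad                        # ascend the log likelihood
print("learned w ≈", np.round(w, 2), " (true direction:", true_w, ")")
```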
Neural Networks
Multi-class Logistic Regression
= special case of neural network (single layer, no hidden layer)
[Figure: features f_1(x), …, f_K(x) are linearly combined into activations z_1, z_2, z_3 (one per class), which pass through a softmax to give class probabilities]
Multi-layer Perceptron
[Figure: inputs x_1, …, x_L feed into several hidden layers of units, each applying a nonlinear activation g to a weighted sum of the previous layer, ending in a softmax output layer]
g = nonlinear activation function
Multi-layer Perceptron
Common Activation Functions
[source: MIT 6.S191 introtodeeplearning.com]
Multi-layer Perceptron
Training the MLP neural network is just like logistic regression: w just tends to be a much larger vector
Just run gradient ascent → this is the back-propagation algorithm
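A minimal sketch of a two-layer MLP forward pass matching the figure above: a hidden layer with a nonlinear activation g followed by a softmax output. The layer sizes and random weights are made-up assumptions; in practice the weights would be trained by gradient ascent on the log likelihood, with back-propagation computing the needed gradients.

```python
import numpy as np

# Minimal sketch of a two-layer MLP forward pass: hidden layer with a
# nonlinear activation g, then a softmax output layer.
# Layer sizes and random weights below are made-up assumptions.

rng = np.random.default_rng(0)

def g(z):
    return np.maximum(z, 0.0)        # ReLU, one common nonlinear activation

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

L, H, C = 4, 8, 3                    # input size, hidden units, classes
W1, b1 = rng.normal(size=(H, L)), np.zeros(H)
W2, b2 = rng.normal(size=(C, H)), np.zeros(C)

x = rng.normal(size=L)               # one input example
hidden = g(W1 @ x + b1)              # hidden layer: nonlinearity of a weighted sum
probs = softmax(W2 @ hidden + b2)    # output layer: softmax class probabilities
print("class probabilities:", np.round(probs, 3))
```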
Neural Networks Properties
Theorem (Universal Function Approximators). A two-layer
neural network with a sufficient number of neurons can
approximate any continuous function to any desired accuracy.
Practical considerations
Can deal with more complex, nonlinear classification & regression
Large number of neurons and weights
Danger of overfitting
Deep Learning Model
Neural network as a general computation graph
Krizhevsky, Sutskever, Hinton, 2012
Deep Learning Model
Deep Learning Model
§ We need good features!
[Figure: the classic pipeline: image → hand-designed feature extraction (built from prior knowledge and experience) → classification → “Panda”?; challenges include pose, occlusion, multiple objects, and inter-class similarity. Image courtesy of M. Ranzato]
Deep Learning Model
§ Directly learn feature representations from data.
§ Jointly learn the feature representation and the classifier.
[Figure: image → low-level features → mid-level features → high-level features → classifier → “Panda”?; the representation becomes more abstract at higher levels]
Deep Learning: train layers of features so that the classifier works well.
Image courtesy of M. Ranzato
Deep Learning Model
Have we been here before?
➢ Yes.
• Basic ideas common to past neural networks research
• Standard machine learning strategies still relevant.
➢ No.
Today’s Deep Learning = Large-scale Data + Computational Power + New Algorithms
Deep Learning Model
Convolutional Neural Networks (CNNs)
§ A special multi-stage architecture inspired by the visual system
§ Higher stages compute more global, more invariant features
Deep Learning Model
https://www.datasciencecentral.com/lenet-5-a-classic-cnn-architecture/
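A minimal sketch of the two building blocks such a multi-stage architecture repeats: a small convolution (local, shared weights) followed by pooling, which is what makes higher stages more global and more invariant. The toy image and filter below are made-up values, not the LeNet-5 weights.

```python
import numpy as np

# Minimal sketch of one CNN stage: a 3x3 convolution (local, shared weights),
# a ReLU, and 2x2 max pooling. The image and filter are made-up toy values;
# real networks like LeNet-5 stack several such stages before a classifier.

def conv2d(image, kernel):
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

def max_pool(x, size=2):
    H, W = x.shape
    return x[:H - H % size, :W - W % size] \
        .reshape(H // size, size, W // size, size).max(axis=(1, 3))

image = np.random.default_rng(0).normal(size=(8, 8))
kernel = np.array([[ 1.0,  0.0, -1.0],      # toy edge-detecting filter
                   [ 1.0,  0.0, -1.0],
                   [ 1.0,  0.0, -1.0]])

feature_map = np.maximum(conv2d(image, kernel), 0.0)   # convolution + ReLU
pooled = max_pool(feature_map)                          # more global, more invariant
print("feature map:", feature_map.shape, "-> pooled:", pooled.shape)
```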
Different Neural Network Architectures
§ Exploration of different neural network architectures
§ ResNet: residual networks
§ Networks with attention
§ Transformer networks
§ Neural network architecture search
§ Really large models
§ GPT2, GPT3
§ CLIP
Acknowledgement
The lecture slides are based on materials from ai.berkeley.edu
Thank you. Questions?