AI/ML Basics for Tech Enthusiasts
Mark Crowley
Assistant Professor
Electrical and Computer Engineering
University of Waterloo
[email protected]
Outline
Introduction
What is AI?
Neural Networks
My Background
[Diagram: landscape of research areas spanning Artificial Intelligence, Machine Learning, Deep Learning (CNN, RNN, LSTM), Reinforcement Learning (DQN, A3C), Multi-agent Systems, Game Theory, Probabilistic Programming, Computer Vision, Natural Language Processing, Robotics, Big Data Tools, Constraint Programming (SAT, SMP), and ILP]
Major Types/Areas of AI
Deep Learning
Deep Learning: methods which perform machine learning through the use
of multilayer neural networks of some kind. Deep Learning can be applied
in any of the three main types of ML:
Supervised Learning: very common, enormous improvement in recent years
Unsupervised Learning: just beginning, lots of potential
Reinforcement Learning: recent; in the past 3 years this has exploded, especially for video games
Clustering (Unsupervised): uses unlabeled data; organizes patterns w.r.t. an optimization criterion; requires a definition of similarity; hard to evaluate. Examples: K-means, Fuzzy C-means, Hierarchical Clustering, DBScan.
Classification (Supervised): uses labeled data; requires a training phase; domain sensitive; easy to evaluate (you know the correct answer). Examples: Naive Bayes, KNN, SVM, Decision Trees, Random Forests.
So choose carefully...
See http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
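As a concrete illustration of the difference, here is a minimal Python sketch using scikit-learn; the toy blob data and the specific model choices (K-means, an SVM) are placeholder assumptions, not anything prescribed by the slides:

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X, y = make_blobs(n_samples=300, centers=3, random_state=0)

    # Unsupervised: K-means never sees the labels y, and evaluation is harder.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("cluster assignments:", kmeans.labels_[:10])

    # Supervised: the SVM trains on labeled data and is easy to evaluate on held-out data.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    svm = SVC().fit(X_tr, y_tr)
    print("test accuracy:", accuracy_score(y_te, svm.predict(X_te)))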
Outline
Introduction
What is AI?
Neural Networks
Building Upon Classic Machine Learning
History Of Neural Networks
Improving Performance
o(x_i) = σ(w^T x_i) = σ(w_0 + Σ_j w_j x_ij) = 1 / (1 + exp(−(w_0 + Σ_j w_j x_ij)))
NLL(w) = Σ_{i=1}^{N} log(1 + exp(−(w_0 + Σ_j w_j x_ij)))
g = ∂NLL/∂w = Σ_i (σ(w^T x_i) − y_i) x_i
θ_{k+1} = θ_k − η_k g_k
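To make these updates concrete, here is a minimal NumPy sketch (not from the slides) of batch gradient descent for the logistic regression model above; the toy data, the fixed learning rate, and the iteration count are assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy placeholder data: N examples, d features, labels in {0, 1}.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

    w = np.zeros(3)   # weights w
    w0 = 0.0          # bias w_0
    eta = 0.1         # learning rate eta_k (held constant here)

    for k in range(200):
        p = sigmoid(w0 + X @ w)              # sigma(w^T x_i) for every example i
        g_w = X.T @ (p - y) / len(y)         # averaged gradient w.r.t. w
        g_w0 = np.sum(p - y) / len(y)        # averaged gradient w.r.t. the bias w_0
        w -= eta * g_w                       # theta_{k+1} = theta_k - eta_k g_k
        w0 -= eta * g_w0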
[Diagram: a feedforward network with input layer X, hidden layer H, and output layer Y; the hidden units are nonlinear (sigmoid/ReLU/ELU).]
Input Layer
vector data, each input collects one feature/dimension of the data
and passes it on to the (first) hidden layer.
Hidden Layer
Each hidden unit computes a weighted sum of all the units from the
input layer (or any previous layer) and passes it through a nonlinear
activation function.
Output Layer
Each output unit computes a weighted sum of all the hidden units
and passes it through a (possibly nonlinear) threshold function.
y_j = f(net_j)
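The layer descriptions above map directly onto a few lines of code. Below is a minimal NumPy sketch of one forward pass through a single-hidden-layer network; the layer sizes, random weights, and choice of sigmoid output are illustrative assumptions, not anything prescribed by the slides:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def forward(x, W1, b1, W2, b2):
        """One forward pass: input layer -> hidden layer -> output layer."""
        h = relu(W1 @ x + b1)            # hidden units: weighted sum + nonlinear activation
        net = W2 @ h + b2                # output units: weighted sum of the hidden units
        y = 1.0 / (1.0 + np.exp(-net))   # y_j = f(net_j), here f = sigmoid
        return y

    # Tiny example: 4 inputs, 5 hidden units, 2 outputs, random weights.
    rng = np.random.default_rng(0)
    x = rng.normal(size=4)
    W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
    W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)
    print(forward(x, W1, b1, W2, b2))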
Activation Functions
g(z) = max{0, z}
[Figure 6.3 (Goodfellow 2016): the rectified linear activation function. It is the default activation recommended for most feedforward neural networks. Applying it to the output of a linear transformation yields a nonlinear transformation, yet the function remains very close to linear: it is piecewise linear with two pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods and many of the properties that make linear models generalize well. Much as a Turing machine's memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.]
Rectified Linear Units (ReLU) have become standard: max(0, net_j)
strong signals are always easy to distinguish
many values are zero, and the derivative is mostly zero
they do not saturate as easily as sigmoid
Newer Exponential Linear Units (ELU): there is evidence that they perform better than ReLU in some situations.
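For reference, a minimal NumPy sketch of the two activations discussed here; the ELU scale alpha = 1.0 is an assumed default:

    import numpy as np

    def relu(z):
        # g(z) = max{0, z}: zero for negative inputs, identity for positive inputs.
        return np.maximum(0.0, z)

    def elu(z, alpha=1.0):
        # Exponential Linear Unit: smooth and negative below zero, identity above zero.
        return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

    z = np.linspace(-3, 3, 7)
    print(relu(z))
    print(elu(z))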
Gradient Descent
Error Function E: Mean Squared Error, cross-entropy loss, etc.
Gradient: ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_d]
Training Update Rule: Δw_i = −η ∂E/∂w_i
where η is the training rate.
Note: For linear regression and similar models this error surface is convex; for neural networks E[w] is no longer convex, so we must solve iteratively.
(Slides from Tom Mitchell ML Course, CMU, 2010)
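As a small, hedged illustration of the update rule, the sketch below runs gradient descent on a mean squared error for a linear model; the toy data and fixed training rate are placeholders:

    import numpy as np

    # Toy data for a linear model y ~ X w (placeholders, not from the slides).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 2))
    y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=50)

    w = np.zeros(2)
    eta = 0.05  # training rate

    for step in range(500):
        err = X @ w - y                    # residuals
        grad = 2.0 * X.T @ err / len(y)    # dE/dw for E = mean squared error
        w = w - eta * grad                 # Delta w_i = -eta * dE/dw_i
    print(w)  # approaches [2, -1]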
Backpropagation Algorithm
We need an iterative algorithm for getting the gradient efficiently.
For each training example:
1 Forward propagation: input the training example to the network and compute the outputs
2 Compute the output unit errors by comparing the outputs to the target values
3 Backpropagate the errors to the hidden units
4 Update every weight by gradient descent using these errors
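A minimal NumPy sketch of these four steps for a one-hidden-layer network with sigmoid units and squared error loss (the layer sizes, single training example, and learning rate are made-up placeholders):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)          # one training example (3 features)
    t = np.array([1.0])             # its target output
    W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))
    eta = 0.5

    # 1. Forward propagation
    h = sigmoid(W1 @ x)             # hidden unit outputs
    o = sigmoid(W2 @ h)             # network outputs

    # 2. Output unit errors (squared error with sigmoid outputs)
    delta_o = (o - t) * o * (1 - o)

    # 3. Backpropagate the errors to the hidden units
    delta_h = (W2.T @ delta_o) * h * (1 - h)

    # 4. Gradient descent weight updates
    W2 -= eta * np.outer(delta_o, h)
    W1 -= eta * np.outer(delta_h, x)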
A Short History
1940s Early work on neural networks goes back to the 1940s, with a simple model of the neuron by McCulloch and Pitts as a summing and thresholding device.
1958 Rosenblatt introduced the Perceptron, a two-layer network (one input layer and one output node, with a bias in addition to the input features).
1969 Marvin Minsky: Perceptrons are 'just' linear; AI goes logical, beginning of the "AI Winter".
1980s Neural network resurgence: backpropagation (updating weights by gradient descent).
1990s SVMs! Kernels can do anything! (no, they can't)
A Short History
1993 LeNet-1 for digit recognition
2003 Deep Learning (convolutional nets, Dropout/RBMs, Deep Belief Networks)
1986, 2006 Restricted Boltzmann Machines
2006 Neural networks outperform the RBF SVM on the MNIST handwriting dataset (Hinton et al.)
2012 AlexNet wins the ImageNet challenge, beating the competition with an error rate of 16% vs. 26% for the next best. ImageNet contains 15 million annotated images in over 22,000 categories. The ZFNet paper (2013) extends this work and has a good description of the network structure.
2012-present Google's YouTube cat experiment, speech recognition, self-driving cars, a computer defeats a regional Go champion, ...
2014 GoogLeNet added many layers and introduced inception modules (allowing computation in parallel rather than serially)
Outline
Introduction
What is AI?
Neural Networks
Building Upon Classic Machine Learning
History Of Neural Networks
Improving Performance
Overfitting
Very inefficient for images, time series, and large numbers of inputs and outputs
Slow to train
Hard to interpret the resulting model
Overfitting
There are a number of heuristics for training neural networks that are useful in practice (maybe we'll learn more today):
Fewer hidden nodes: just enough complexity to work, not so much that it overfits.
Train multiple networks with different sizes and search for the best design.
Validation set: train on the training set until the error on the validation set starts to rise, then evaluate on the test set (a minimal early-stopping sketch follows after this list).
Try different activation functions: tanh, ReLU, ELU, ...?
Dropout (Hinton 2014): randomly ignore certain units during training, don't update them via gradient descent; this leads to hidden units that specialize.
Modify the learning rate over time (cooling schedule).
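As a hedged illustration of the validation-set heuristic, the sketch below uses scikit-learn's built-in early stopping; the synthetic dataset, network size, and patience settings are arbitrary assumptions:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # early_stopping=True holds out a validation fraction and stops training
    # once the validation score stops improving (the heuristic described above).
    clf = MLPClassifier(hidden_layer_sizes=(32,), early_stopping=True,
                        validation_fraction=0.2, n_iter_no_change=10,
                        max_iter=500, random_state=0)
    clf.fit(X_tr, y_tr)
    print("test accuracy:", clf.score(X_te, y_te))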
Dropout
Dropout (Hinton 2014) - randomly ignore certain units during
training, don’t update them via gradient descent, leads to hidden
units that specialize.
With probability p don’t include a weight in the gradient updates.
Reduces overfitting by encouraging robustness of weights in the
network.
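A minimal NumPy sketch of the idea using an inverted-dropout mask (an illustrative variant, not necessarily the exact formulation in the paper); the layer size and drop probability are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    h = rng.normal(size=10)       # activations of one hidden layer (placeholder)
    p = 0.5                       # probability of dropping a unit

    # During training: zero out each hidden unit independently with probability p,
    # so the dropped units (and their weights) receive no gradient this step.
    mask = rng.random(h.shape) >= p
    h_train = h * mask / (1.0 - p)   # inverted dropout: rescale the kept units

    # At test time: use all units (no mask needed with inverted dropout).
    h_test = h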
[Figure 6.7 (Goodfellow 2016): test accuracy (percent) vs. number of parameters (x10^8) for networks of depth 3 (convolutional and fully connected) and depth 11 (convolutional). Deeper models tend to perform better, and not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. Shallow models in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million, suggesting that using a deep model expresses a useful preference over the space of functions the model can learn.]
Convolutional Neural Networks
Outline
Introduction
What is AI?
Neural Networks
Parameter sharing
Convolution shares the same parameters across all spatial locations.
Traditional matrix multiplication does not share any parameters.
[Figure 9.5 (Goodfellow 2016): parameter sharing; black arrows indicate the connections that use a particular parameter, shown for a convolutional model and a fully connected model over inputs x1...x5 and outputs s1...s5.]
2D Convolution
Input (3x4):
a b c d
e f g h
i j k l
Kernel (2x2):
w x
y z
Output (2x3):
aw+bx+ey+fz   bw+cx+fy+gz   cw+dx+gy+hz
ew+fx+iy+jz   fw+gx+jy+kz   gw+hx+ky+lz
[Figure 9.1 (Goodfellow 2016): an example of 2-D convolution without kernel flipping. The output is restricted to positions where the kernel lies entirely within the image, called "valid" convolution in some contexts.]
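A minimal NumPy sketch of this "valid" 2-D convolution without kernel flipping (technically a cross-correlation, as in the figure); the numeric input and kernel below merely stand in for a..l and w, x, y, z:

    import numpy as np

    def conv2d_valid(image, kernel):
        """'Valid' 2-D convolution without kernel flipping (cross-correlation)."""
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # Weighted sum of the kernel-sized patch at (i, j), e.g. aw+bx+ey+fz.
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.arange(12, dtype=float).reshape(3, 4)   # stands in for a..l
    kernel = np.array([[1.0, 2.0], [3.0, 4.0]])        # stands in for w, x, y, z
    print(conv2d_valid(image, kernel))                 # 2 x 3 output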
A simple example: edge detection
[Figure 9.6 (Goodfellow 2016): efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall; the input image is 320 pixels wide while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 x 280 x 3 = 267,960 floating point operations (two multiplications and one addition per output pixel) to compute using convolution.]
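As a usage sketch of the same operation, the vertical-edge map can be computed by differencing neighboring pixels; the random "image" below is a placeholder for the photo in the figure:

    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.random((280, 320))        # placeholder grayscale image, 280 x 320

    # Subtract each pixel's left neighbor: equivalent to convolving each row with a
    # two-element kernel, yielding a 280 x 319 map of vertical edge strengths.
    edges = img[:, 1:] - img[:, :-1]
    print(edges.shape)                  # (280, 319)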
Outline
Introduction
What is AI?
Neural Networks
Language Choices
Any language can be used for implementing/using AI/ML algorithms, but
some make it much easier
C++: you can do it, may need to implement many things yourself
Java: many libraries for ML (Weka is a good open source one, Deeplearning4j)
Scala: leaner, functional language that compiles to JVM bytecode; good for prototyping, and can reuse Java libraries (Deeplearning4j)
R: focused on statistical methods; more and more machine learning libraries are being implemented for it
Matlab: good for all the calculations, if you have the right libraries
it’s great (not cheap or very portable beyond school)
Python: most commonly used right now for deep learning, we’re
gonna need another slide ...
Python
Cloud Services
There are several powerful, free services you can access via a student
account which you can request directly.
AWS: Amazon Web Services - very large, has accessible APIs to
connect to, many options for hardware to run on (but the
best ones will cost extra)
Azure: Microsoft - lots of visual tools for composing AI/ML
components.
Google Cloud ML Engine: uses all the latest tools and TensorFlow models
None of these provide GPU servers for free; that will cost extra. (It will still work, just be slower for deep learning.)
Summary
Introduction
What is AI?
Landscape of Big Data/AI/ML
Classification
Neural Networks
Building Upon Classic Machine Learning
History Of Neural Networks
Improving Performance
Useful Books