COMP GI22/MI22
Deep Learning Lecture 1
Thore Graepel & Guest Lecturers from DeepMind
[Link]@[Link]
Overview
● Team and Structure of the Course
● Guest Lectures and Lecturers
● DeepMind approach to AI
● Why Deep Learning?
● Deep Reinforcement Learning at work
○ Learning to Play Atari Games with Deep RL
○ AlphaGo - Learning to play Go at master level
● Extra revision material (supervised learning)
The DeepMind/UCL Team
Matteo Hessel, Diana Borsa (TA Lead)
Koray Kavukcuoglu (Co-Lead DL), Hado van Hasselt (Co-Lead RL)
Marie Mulville, Alex Davies (PgM)
Teaching Assistants:
● Zach Eaton-Rosen
● Lewis Moffat
● Michael Jones
● Raza Habib
● Thomas Gaudelet
Format and Coursework
● Format: Two streams, both streams mandatory
○ Tuesdays: Deep Learning taught by a selection of fantastic guest lecturers from DeepMind
○ Thursdays: Reinforcement Learning taught by Hado Van Hasselt (also DeepMind)
○ Some exceptions, check timetable at [Link] and on Moodle (for topics)
● Assessment: 100% through Coursework
○ There are four deep learning and four reinforcement learning assignments
○ Each of the eight assignments will be weighted equally, i.e., each counting for 12.5%
○ Coursework is a mixture of programming assignments and questions
○ Framework for coursework will be Colab, a Jupyter notebook environment that requires no
setup to use and runs entirely in the cloud.
○ Machine Learning algorithms will be implemented in TensorFlow through Colab (see the short example after this list).
○ You can find more information about the assessment on Moodle.
○ Todo: Set up Google account with address: "[Link]@[Link]",
where XXXXXXXX is your (numerical) student number
● Support: Use Moodle forum and Moodle direct messages
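As a taste of the framework (a minimal sketch, not taken from the actual assignments, which may use a different TensorFlow version or API), this is the kind of cell you can run directly in Colab:

import tensorflow as tf

# A tiny linear model y = w*x + b, and the gradient of a squared loss w.r.t. w and b.
w = tf.Variable(1.0)
b = tf.Variable(0.0)
x, y_target = tf.constant(3.0), tf.constant(7.0)

with tf.GradientTape() as tape:
    y_pred = w * x + b
    loss = (y_pred - y_target) ** 2

dw, db = tape.gradient(loss, [w, b])
print(loss.numpy(), dw.numpy(), db.numpy())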
TensorFlow - What is it?
Warning: Lots of work and prior knowledge required!
● Last year, many people complained that it was too much work!
● If you do not know how to code in Python, this course may not be right for you!
● A lot of preliminary knowledge required - see quiz!
● Deep Learning lectures are delivered by top researchers in the field and will
stretch towards the current research frontier → brace yourselves!
● Check out the Self-Assessment Quiz on Moodle
DeepMind
Guest Lecturers
Introduction to TensorFlow
● Lecture topics:
○ Introduction to TensorFlow principles
○ Practical worked examples in Colab
● Guest Lecturer: Matteo Hessel
○ Joined DeepMind in 2015.
○ Masters in Machine Learning from UCL
○ Master of Engineering from Politecnico di Milano
● Guest Lecturer: Alex Davies
○ Joined DeepMind in 2017
○ PhD in Machine Learning at Cambridge
○ Worked with a team of international scientists to build the world's first machine-learned musical.
Neural Nets, Backprop, Automatic Differentiation
● Lecture topics:
○ Neural nets
○ Multi-class classification and softmax loss
○ Modular backprop
○ Automatic differentiation
● Guest Lecturer: Simon Osindero
○ Joined DeepMind in 2016.
○ Undergrad/Masters in Natural Sciences/Physics at University of Cambridge.
○ PhD in Computational Neuroscience from UCL (2004). Supervisor: Peter Dayan.
○ Postdoc at University of Toronto with Geoff Hinton. (Deep belief nets, 2006).
○ Started an A.I. company, LookFlow, in 2009. Sold to Yahoo in 2013.
○ Current research topics: deep learning, RL agent architectures and algorithms,
memory, continual learning.
Convolutional Neural Networks
● Lecture topics:
○ Convolutional networks
○ Large-scale image recognition
○ ImageNet models
● Guest Lecturer: Karen Simonyan
○ Joined DeepMind in 2014
○ DPhil (2013) and Postdoc (2014) at the University of Oxford
with Andrew Zisserman
○ Research topics: deep learning, computer vision
■ VGGNets, two-stream ConvNets, ConvNet visualisation, etc.
■ [Link]
Temporal Hierarchies
Recurrent Nets and Sequence Generation
● Lecture topics:
○ Recurrent Neural Networks
○ Long Short-Term Memory (LSTM)
○ (Conditional) Sequence Generation
● Guest Lecturer: Oriol Vinyals
○ Joined DeepMind in 2016.
○ Worked in Google Brain from 2013 to 2016.
○ PhD in Artificial Intelligence from UC Berkeley (2009-13). Supervisor: Darrell / Morgan.
○ Current research topics: deep learning, sequence modeling, generative models,
distillation, RL/Starcraft, one shot learning.
[Figures: sequence prediction, seq2seq, recurrent architectures]
End-To-End and Energy-Based Learning
● Lecture topics:
○ End-to-end learning
○ Energy based learning
○ Ranking
○ Embeddings
○ Triplet loss
● Guest Lecturer: Raia Hadsell
○ PhD from NYU, postdoc at CMU’s Robotics Institute
○ Senior Scientist and Tech Manager at SRI International
○ Now leading a research team at DeepMind
○ Research in Deep Learning, Robotics, Navigation, Life-Long Learning
Optimisation
● Lecture topics:
○ First-order methods
○ Second-order methods
○ Stochastic methods
○ Some convergence theory
● Guest Lecturer: James Martens
○ Joined DeepMind in Sept 2016
○ PhD from University of Toronto under Geoff Hinton & Rich Zemel in
2015
○ Undergrad from Waterloo in Math and Computer Science
○ Working on: second-order optimization for neural nets,
characterizing expressive power/efficiency of neural nets, generative
models / unsupervised learning
Attention and Memory Models
● Lecture topics:
○ Neural attention models
○ Recurrent neural networks with external memory
○ Neural Turing Machines / Differentiable Neural Computers
● Guest Lecturer: Alex Graves
○ Joined DeepMind in 2013
○ Undergrad in Theoretical Physics, Univ. of Edinburgh
○ Masters in Mathematics and Theoretical Physics, Univ. of Cambridge
○ PhD in Artificial Intelligence, TU Munich; supervisor: Jürgen Schmidhuber
○ CIFAR Junior Fellow with Geoff Hinton, Univ. of Toronto
○ Research focuses on sequence learning with recurrent neural networks:
memory, attention, sequence generation, model compression
Deep Learning for Natural Language Processing
● Lecture topics:
○ Deep Learning for Natural Language Processing
○ Neural word embeddings
○ Neural machine translation
● Guest Lecturer: Ed Grefenstette
○ DPhil from Oxford
○ Co-Founder of Dark Blue Labs (acquired by DeepMind)
○ Research in Machine Learning, Computational Linguistics
Unsupervised Learning and Deep Generative Models
● Lecture topics:
○ Density estimation and unsupervised learning.
○ Deep Generative Models: latent variable and implicit models.
○ Approximate inference and variational inference.
○ Stochastic optimisation
● Guest Lecturer: Shakir Mohamed
○ Joined DeepMind in 2013.
○ PhD in Statistical Machine Learning, St John’s College, University of Cambridge. Supervisor: Zoubin
Ghahramani.
○ CIFAR Junior Research Fellow at the University of British Columbia with Nando de Freitas.
○ Topics in probabilistic thinking, approximate Bayesian inference, unsupervised learning and density
estimation, deep learning, reinforcement learning.
○ Undergrad in electrical engineering. From Johannesburg, South Africa.
Reinforcement Learning Stream (Hado)
● Introduction to Reinforcement Learning
● Markov Decision Processes
● Planning by Dynamic Programming
● Model-Free Prediction
● Model-Free Control
● Value Function Approximation (Deep RL)
● Policy Gradient Methods
● Integrating Learning and Planning
● Exploration and Exploitation
● Case Study: AlphaGo
Case Study: AlphaGo (TBC)
● Lecture topics:
○ The story behind AlphaGo
○ Deep RL applied to Classical Board Games
○ Combining Tree Search and Neural Networks
○ Evaluation against machines and humans
● Guest Lecturer: David Silver
○ Computer Science at Cambridge, PhD at Alberta
○ Co-Founder/CTO of Elixir Studios
○ Faculty member at UCL (on leave at DeepMind)
○ Joined DeepMind in 2013
○ Research in deep reinforcement learning, integration
of learning and planning, games
Case Study: Practical Deep RL (TBC)
● Lecture topics:
○ Learning to play Atari games: DQN in Detail
○ Faster Agents through parallel training
○ Better data efficiency through unsupervised RL
○ Some practical advice
● Guest Lecturer: Volodymyr Mnih
○ PhD in Machine Learning at the University of Toronto
○ Early DeepMind pioneer
○ Legendary work on Deep RL for playing Atari, published in Nature
DeepMind founded 2010 (joined Google 2014)
Mission: “Solve Intelligence”
An Apollo Programme for AI (150+ scientists)
A new approach to organizing science
General Artificial Intelligence
General-Purpose Learning Algorithms
Learn automatically from raw inputs - not pre-programmed
General - same system can operate across a wide range of tasks
Artificial ‘General’ Intelligence (AGI) – flexible, adaptive, inventive
‘Narrow’ AI – hand-crafted, special-cased, brittle
Reinforcement Learning
[Diagram: an agent interacts with its environment through actions and observations, in pursuit of a goal]
○ General Purpose Framework for AI
○ Agent interacts with the environment
○ Select actions to maximise long-term reward
○ Encompasses supervised and unsupervised learning as special cases
Deep Learning
What is intelligence?
Intelligence measures an agent’s ability to achieve
goals in a wide range of environments
Measure of intelligence: Υ(π) = Σ_μ 2^(-K(μ)) · V_μ^π
(sum over environments μ, complexity penalty 2^(-K(μ)), value achieved V_μ^π)
Universal Intelligence: A Definition of Machine Intelligence, Legg & Hutter 2007
Multi-Agent and AI
Grounded Cognition
A true thinking machine has to be grounded in a rich sensorimotor reality
Games are the perfect platform for developing and testing AI algorithms
Unlimited training data, no testing bias, parallel testing, measurable progress
‘End-to-end’ learning agents: from pixels to actions
Thanks to Koray for DL slides
Why Deep Learning?
● Enables End-To-End Training (sketched in the example after this list)
○ Optimise for the end loss
○ Don’t engineer your inputs
○ Learn good representations
● Versatile: Can be applied to images, text, audio, video
● Modular design of systems (modular backprop)
● Represent weak prior knowledge (e.g., convolutions)
● Now computationally feasible at scale (GPUs)
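To make the end-to-end point concrete, a minimal sketch (illustrative only, using the tf.keras API on MNIST; not the course's coursework code, which may target a different TensorFlow version) that maps raw pixels directly to class probabilities and optimises a single end loss:

import tensorflow as tf

# Raw pixels in, class probabilities out: no hand-engineered features.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# One end loss (cross entropy); all layers are trained jointly by backprop.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128)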
Deep Learning
Supervised Learning
○ Convolutional Networks on MNIST [LeCun et al.]
○ Convolutional Networks on ImageNet [Krizhevsky et al.]
Deep Learning
Supervised Learning
○ Convolutional Networks on Text [Zhang et al.; Collobert et al.]
○ Convolutional Networks on Video [Simonyan et al.]
Deep Learning
Supervised Learning
○ End-to-End Training
○ Optimize for the end loss
○ No engineered inputs
○ With enough data, learn a big non-linear function
○ Learn good representations of data
■ Sufficiently rich supervised labelling is enough to train transferable representations
■ Best feature extractor
■ Karpathy, Razavian et al., Yosinski et al., Donahue et al.
○ Large labeled dataset + big/deep neural network + GPUs
○ Ever more sophisticated modules → Differentiable Programming
Deep Learning
Supervised Learning
○ Innovation continues
■ Inception [Szegedy et al.]
■ Ladder Nets [Rasmus et al.]
■ Residual Connections [He et al.]
■ …
○ Performance is continuously improving
○ Architectures for easier optimization
■ Batchnorm
Deep Learning
Unsupervised Learning
○ Unsupervised Learning/Generative Models
■ RBM [Hinton et al.]
■ Auto-encoders
■ PCA, ICA, Sparse Coding
■ VAE
■ NADE - and all variants [Larochelle, Murray]
■ GANs
○ How to evaluate/rank different algorithms?
○ Quantitative approach or visual quality?
■ How can we trust visual quality if the input domain itself is not interpretable?
○ How can unsupervised learning help a task?
Deep Learning
Sequence Modeling
○ Almost all data are sequential
■ Text
■ Video
■ Audio
■ Image [NADE, PixelRNN]
■ Multi-modal (caption → image, image → caption)
[Hochreiter and Schmidhuber] [Vinyals et al.] [Sutskever et al.]
Deep Learning
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G.
Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig
Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran,
Daan Wierstra, Shane Legg, Demis Hassabis
Google DeepMind
(Mnih et al. Nature 2015)
ATARI Games
● Designed to be challenging and
interesting for humans
● Provides a good platform for sequential
decision making
● Widely adopted RL benchmark for
evaluating agents (Bellemare’13)
● Many different games emphasize
control, strategy, …
● Provide a rich visual domain
Deep Learning
End-to-End Reinforcement Learning
DeepMind Lab - Challenging RL Problems in 3D
General Artificial Intelligence
Mastering the game of Go with deep neural networks and tree search
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den
Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot,
Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy
Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel & Demis Hassabis
Google DeepMind
(Silver, Huang, et al 2016)
#3 most downloaded
academic paper this month
Why is Go hard for computers to play?
Game tree complexity = b^d (branching factor b, depth d; for Go, b ≈ 250 and d ≈ 150)
Brute-force search is intractable:
1. The search space is huge
2. It is “impossible” for computers to evaluate who is winning
Value network: position s → evaluation v(s)
Policy network: position s → move probabilities p(a|s)
Reducing depth with value network
Reducing breadth with policy network
Evaluating current AlphaGo against computers
[Chart: estimated Elo ratings, from roughly 0 to 4500, of Go programs (GnuGo, Fuego, Pachi, Zen, Crazy Stone, AlphaGo Nature v13, AlphaGo v18) calibrated against human ranks: beginner kyu (k), amateur dan (d), professional dan (p). Annotations: v13 scored 494/495 against computer opponents; v18 beats v13 with a 3-4 stone handicap. CAUTION: ratings based on self-play results.]
DeepMind challenge match (Mar 2016): AlphaGo beats Lee Sedol (9p), top player of the past decade, 4-1
Nature match (Oct 2015): AlphaGo beats Fan Hui (2p), 3-times reigning European Champion, 5-0
On KGS: Crazy Stone and Zen beat amateur humans
Extra revision material (Supervised Learning)
• Review of concepts from supervised learning
• Generalisation, overfitting, underfitting
• Learning curves
• Stochastic gradient descent
• Linear regression
• Cost function
• Gradients
• Logistic regression
• Cost function
• Gradients
Supervised Learning Problem
Given a set of input/output pairs (training set) we wish to compute the
functional relationship between the input and the output
Example 1: (people detection) given an image we wish to say if it depicts a
person or not. The output is one of two possible categories
Example 2: (pose estimation) we wish to predict the pose of a face image. The output is a continuous number (here a real number describing the face rotation angle).
In both problems the input is a high dimensional vector x representing pixel
intensity/colour
Example: People Detection
Example: People Detection (cont.)
Supervised Learning Model
Supervised Learning Problem: Compute a function which best describes the I/O relationship
Learning Algorithm
• Example Algorithms:
• Linear Regression
• Logistic Regression
• Neural Networks
• Decision Trees
• In this lecture, we will revise linear and logistic regression
Key Questions for the ML Practitioner
• How is the data collected? (need assumptions!)
• How do we represent the inputs? (may require pre-processing step)
• How accurate is the learned function on new data (study of
generalization error)?
• Many algorithms may exist for a task. How do we choose?
• How “complex” is a learning task? (computational complexity,
sample complexity)
Important Challenges for ML
• New inputs differ from the ones in the training set (look-up tables do not work!)
• Inputs are measured with noise
• Output is not deterministically obtained by the input
• Input is often high dimensional but some components/variables may
be irrelevant
• How can we incorporate prior knowledge?
Generalisation
Most important idea of machine learning:
Train models such that they correctly predict on unseen data
(from the same distribution)
• Empirical risk minimization: Minimise error on training sample
• Validation: Hold out data for testing to obtain unbiased estimator
• When data is scarce, can use cross-validation
Cross Validation
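A minimal sketch of k-fold cross-validation (illustrative only; plain NumPy, with made-up helper names fit and evaluate standing in for any training and scoring routine):

import numpy as np

def k_fold_cv(x, y, k, fit, evaluate):
    # Average validation score over k folds.
    # fit(x_train, y_train) returns a model; evaluate(model, x_val, y_val) returns a score.
    indices = np.random.permutation(len(x))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train_idx], y[train_idx])
        scores.append(evaluate(model, x[val_idx], y[val_idx]))
    return np.mean(scores)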
Underfitting and Overfitting
Underfitting
• Error driven by approximation
• High bias / low variance
• What to do?
• Use more features
• Use a more complex model
• Reduce regularization
• Train for longer
Overfitting
• Error driven by generalization
• Low bias / high variance
• What to do?
• Use fewer features
• Use a simpler model
• Increase regularization
• Stop training early
More Data versus Better Algorithm
• In high-variance, overfitting situations
more data helps
• Example: Confusion Set Disambiguation
• Banko and Brill 2001, “Scaling to Very
Very Large Corpora for Natural
Language Disambiguation”
• See also: “The Unreasonable
Effectiveness of Data”, Halevy, Norvig,
Pereira
Real-World Learning Curves: Underfitting
[Plot: training error and validation error curves over training]
Real-World Learning Curves: Overfitting
[Plot: training error and validation error curves over training, with the early stopping point marked]
Real-World Learning Curves: Just Right
[Plot: training error and validation error curves over training]
Generalisation in Deep Learning
• “Understanding Deep Learning requires rethinking generalization”, Zhang, S. Bengio, Hardt,
Recht, Vinyals
• Deep Neural Networks easily fit random labels
• Generalization error varies from 0 to 90% without changes in model
• Deep NNs can even (rote) learn to classify random images
(Stochastic) Gradient Descent
Generalisation from Stochastic Gradient Descent
Linear Regression
Linear Regression Cost Function
• Model:
• Example-wise loss function:
• Total loss function:
• Minimising the squared error is equivalent to maximum likelihood estimation under an assumption of Gaussian noise
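The model, per-example loss and total loss on this slide did not survive extraction; in standard notation (weights w, bias b, N training pairs (x_n, y_n)) they are:
f(x) = w^T x + b
ℓ_n = (f(x_n) - y_n)^2
L(w, b) = (1/N) Σ_n ℓ_n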
Stochastic gradient descent for regression
• Total loss gradient:
• Loss gradient:
• Model gradient:
• Put together:
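The gradient expressions on this slide are likewise missing; for the squared loss above, the chain rule gives (standard result, in the notation assumed above):
∂L/∂w = (1/N) Σ_n ∂ℓ_n/∂f(x_n) · ∂f(x_n)/∂w = (1/N) Σ_n 2 (f(x_n) - y_n) x_n
∂L/∂b = (1/N) Σ_n 2 (f(x_n) - y_n)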
Batch and stochastic gradient descent
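The slide's equations did not survive extraction; as an illustration of the distinction, a minimal NumPy sketch (illustrative only, assuming the linear model above) comparing one full-batch gradient step with a pass of per-example (stochastic) updates:

import numpy as np

def batch_gd_step(w, x, y, lr):
    # One step using the gradient averaged over the whole training set.
    grad = 2.0 * x.T @ (x @ w - y) / len(y)
    return w - lr * grad

def sgd_epoch(w, x, y, lr):
    # One pass over the data, updating after every single example.
    for i in np.random.permutation(len(y)):
        grad = 2.0 * (x[i] @ w - y[i]) * x[i]
        w = w - lr * grad
    return w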
Regularisation
Non-linear Basis Functions
Regression with polynomial basis functions
[Panels: polynomial fits of degree 0 through 5]
Polynomial Fit for different degrees
• Training error goes down with
increasing degree (better fit)
• Test error is optimal at degree 2,
and deteriorates for higher
degrees
• Note the similarity to learning
curves discussed earlier. The
effective hypothesis class of
neural networks becomes more
complex with longer training
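A small sketch of this effect (illustrative only, on synthetic data invented for the example, using numpy.polyfit):

import numpy as np

rng = np.random.default_rng(0)
x_train, x_test = rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 200)
f = lambda x: 1.0 - 2.0 * x + 3.0 * x**2          # true function is quadratic
y_train = f(x_train) + 0.3 * rng.normal(size=x_train.shape)
y_test = f(x_test) + 0.3 * rng.normal(size=x_test.shape)

for degree in range(6):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 3), round(test_err, 3))

Training error keeps falling as the degree grows, while test error is typically lowest near the true degree and then deteriorates.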
Logistic Regression for classification
• Generalized linear model for
binary classification
• Used, e.g., in click-through-rate
prediction for search engine
advertising
• Find linear hyperplane to
separate the data
• Predict probability of class
Logistic Regression Cost Function
• Linear model:
• (Inverse) Link function:
• Cross entropy loss:
• The regression loss is a composition of these three functions,
aggregated over training examples
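The formulas on this slide are missing from the extracted text; in standard notation for a binary label y ∈ {0, 1}:
Linear model: z = w^T x + b
(Inverse) link function: ŷ = σ(z) = 1 / (1 + e^(-z))
Cross entropy loss: ℓ = -[y log ŷ + (1 - y) log(1 - ŷ)]
The total cost is the average of ℓ over the training examples.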
Logistic (Inverse) Link Function
Image: Michaelg2015 (Own work), CC BY-SA 4.0 ([Link]), via Wikimedia Commons
Cross Entropy
Logistic Regression Cost Function
Modular Gradients for Logistic Regression
• Total Gradient:
• Loss gradient:
• Link gradient:
• Model gradient:
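Filling in the missing gradient factors in the same standard notation (ŷ = σ(w^T x + b)):
Loss gradient: ∂ℓ/∂ŷ = (ŷ - y) / (ŷ(1 - ŷ))
Link gradient: ∂ŷ/∂z = ŷ(1 - ŷ)
Model gradient: ∂z/∂w = x
Multiplying the factors: ∂ℓ/∂w = (ŷ - y) x, and ∂ℓ/∂b = ŷ - y.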
Putting the gradient back together
• Similarly, the backpropagation algorithm works through the layers of
deeper neural networks to calculate error gradients w.r.t. the weights
• Simon’s lecture will give more details
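To make the modular picture concrete, a toy NumPy sketch (illustrative only; not code from the course) that computes the logistic regression loss forward and then multiplies the local gradients module by module, exactly as backprop does for deeper networks:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward_backward(w, b, x, y):
    # Forward pass: linear model -> logistic link -> cross-entropy loss.
    z = w @ x + b
    y_hat = sigmoid(z)
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    # Backward pass: chain the local gradients module by module.
    dloss_dyhat = (y_hat - y) / (y_hat * (1 - y_hat))  # loss gradient
    dyhat_dz = y_hat * (1 - y_hat)                     # link gradient
    dz_dw, dz_db = x, 1.0                              # model gradients
    dloss_dw = dloss_dyhat * dyhat_dz * dz_dw
    dloss_db = dloss_dyhat * dyhat_dz * dz_db
    return loss, dloss_dw, dloss_db

w, b = np.zeros(3), 0.0
x, y = np.array([1.0, -2.0, 0.5]), 1.0
print(forward_backward(w, b, x, y))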