
Computing Science (CMPUT) 455

Search, Knowledge, and Simulations

Martin Müller

Department of Computing Science


University of Alberta
[email protected]

Fall 2024

1
455 Today - Lecture 23

• AlphaGo Zero
• Coursework:
• Reading: AlphaGo Zero paper
• Quiz 12
• Assignment 4

2
SPOT Surveys

• Student Perspectives of Teaching (SPOT)


• SPOT surveys were created to gather feedback on courses
to help instructors, departments and faculties shape
instruction and future curriculum.
• Feedback is anonymous and takes less than ten minutes
to complete
• You should have received an email
• SPOT surveys: https://ualberta.bluera.com
• Open November 25 - December 11
• 15 minutes in-class time: end of lecture, Thu Nov 28

3
AlphaGo Zero

• October 2017 article in Nature


• New simplified architecture
• Learns entirely from self-play
• No human knowledge beyond the basics, such as the rules of
the game
• Stronger than previous AlphaGo versions
• Skill level: far beyond humans
4
Main Technical Changes in AlphaGo Zero

• New training method tailored for improving MCTS


• Self-learning from predicting searched moves and
outcomes
• New network architecture: resnets
• New network architecture: combine policy and value nets
into one net with two “heads”
• No longer needs a large distributed system; strong
performance on “one machine”

5
Human Knowledge in Zero

• Rules of Go, legal moves


• Hard limit of 19 × 19 × 2 = 722 moves on game length
• Tromp-Taylor scoring
• Input has same 2-d grid structure as Go board
• Uses rotation and reflection invariance of Go rules for
training
• MCTS search parameters optimized

6
Human Knowledge Not Used in Zero

Some examples of knowledge used in many other Go


programs, but not in Zero
• Avoid eye-filling moves (as in Go1)
• Patterns
• Tactics, atari, self-atari
• Human game records
• Rules for a simulation policy
Zero has to learn all the basics of Go, and then it learns much,
much more...

7
AlphaGo Zero’s Search

• Search - still MCTS


• Used in two different ways:
• For learning (new)
• For playing
• The most impressive innovation in Zero is how
search is used to improve learning, and
learning in turn improves search

8
Main Components of AlphaGo Zero -
Knowledge

Knowledge
• All knowledge created by machine learning from self-play
• New network architecture
• Knowledge represented by deep residual neural net
• Combines policy and value nets
• New: one single net with two output “heads”
• Both move and position evaluation learned together
• No more simulations (rollouts) to end of game!
• Evaluation only from neural net knowledge and ...
• ... from real game results, when end-of-game states are
reached in the tree
• MCTS win rates computed only from these

9
Knowledge Representation

• Deep residual neural network (He et al 2015)


• Learns two types of knowledge simultaneously
• Policy head
• Learns good moves for the search
• Value head
• Learns evaluation function - probability of winning
• Most of network is shared between both (why?)

10
Deep Residual Neural Network (Resnet)

• Small change in network: add extra “bypass” connection
• Main idea: pass output x of previous “block” directly through
(identity connection)
• Each block learns a “delta” F(x), a small change to the
previous output x
• Output of a block is F(x) + x
• Learning small changes is easier
• Can train really deep nets efficiently
• >100 layers in image recognition
• In theory, resnets have no greater representational power
than DCNN
• In practice, they learn better

Figure: “Residual learning: a building block” - weight layer, relu,
weight layer, plus identity shortcut; output F(x) + x.
Image source: He et al 2015 article

11
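To make the F(x) + x idea concrete, here is a minimal numerical sketch in NumPy, not from the paper: the weight shapes and values are made up, but the block computes a correction F(x) with two small weight layers and adds the unchanged input back in.

```python
# A minimal numerical sketch (made-up weights) of the residual idea: the block
# computes a correction F(x) and adds the unchanged input x back in.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    f = relu(x @ W1) @ W2     # F(x): two weight layers with a ReLU in between
    return relu(f + x)        # output = F(x) + x (identity shortcut)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                 # toy activation vector
W1 = 0.1 * rng.standard_normal((8, 8))     # illustrative weights
W2 = 0.1 * rng.standard_normal((8, 8))
print(residual_block(x, W1, W2))
```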
Two-head Architecture

(p, v ) = fθ (s)

• Deep net
• Input Go position s
• Network weights θ
• Network computes function fθ (s)
• Two outputs: (p, v )
• p vector of move probabilities p(s, a) for each move a
• v value of s
Image source: All further images from the AlphaGo Zero paper,
unless stated otherwise

12
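As an illustration of the two-head idea, here is a small PyTorch sketch, not DeepMind's code: the 17 input planes, 64 channels, and 4 blocks are assumptions chosen to keep the example small (the paper's network is much larger), but it shows one residual trunk feeding a policy head and a value head.

```python
# A minimal dual-head residual network sketch in PyTorch (not DeepMind's code).
# Sizes are illustrative; the real network uses 20 or 40 blocks and many more channels.
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19
IN_PLANES = 17
CHANNELS = 64

class ResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c)

    def forward(self, x):
        # The block learns a "delta"; the input x is passed through unchanged.
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)

class DualHeadNet(nn.Module):
    def __init__(self, blocks=4, c=CHANNELS):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(IN_PLANES, c, 3, padding=1, bias=False),
            nn.BatchNorm2d(c), nn.ReLU())
        self.trunk = nn.Sequential(*[ResidualBlock(c) for _ in range(blocks)])
        # Policy head: one output per board point plus one for pass.
        self.policy = nn.Sequential(
            nn.Conv2d(c, 2, 1), nn.BatchNorm2d(2), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * BOARD * BOARD, BOARD * BOARD + 1))
        # Value head: a single scalar in [-1, 1].
        self.value = nn.Sequential(
            nn.Conv2d(c, 1, 1), nn.BatchNorm2d(1), nn.ReLU(), nn.Flatten(),
            nn.Linear(BOARD * BOARD, c), nn.ReLU(),
            nn.Linear(c, 1), nn.Tanh())

    def forward(self, s):
        x = self.trunk(self.stem(s))
        return self.policy(x), self.value(x)   # (p logits, v)

s = torch.zeros(1, IN_PLANES, BOARD, BOARD)    # one toy "position"
p_logits, v = DualHeadNet()(s)
print(p_logits.shape, v.shape)                 # (1, 362) move logits, (1, 1) value
```

Sharing the trunk means both heads are trained on the same features, which is part of why the combined network is cheaper and generalizes better than two separate nets.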
Output 1 of Neural Network: Policy Head

• New: learns to predict what the


search would do
• How frequently should each
move be tried in MCTS?
• Learning goal: minimize
cross-entropy between
• Predicted probability of move
• Frequency of move as
selected by MCTS
• Cross-entropy: measures how
well one probability distribution
can predict another
• We skip the math here...
• Details:
https://en.wikipedia.org/wiki/Cross-entropy

13
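A toy numerical illustration of the policy-head target (all numbers made up): the cross-entropy between the MCTS visit distribution and the predicted move probabilities is small when the network's probabilities match the search.

```python
# Cross-entropy between the MCTS visit distribution pi and the prediction p.
import numpy as np

def cross_entropy(pi, p, eps=1e-12):
    # H(pi, p) = -sum_a pi(a) * log p(a); smallest when p matches pi
    return -np.sum(pi * np.log(p + eps))

visits = np.array([80.0, 15.0, 4.0, 1.0])       # N(s, a) from a toy search
pi = visits / visits.sum()                      # target distribution
p_close = np.array([0.75, 0.18, 0.05, 0.02])    # network close to the search
p_uniform = np.array([0.25, 0.25, 0.25, 0.25])  # uninformed network

print(cross_entropy(pi, p_close))     # smaller loss
print(cross_entropy(pi, p_uniform))   # larger loss
```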
Output 2 of Neural Network: Value Head

• Given a Go position
• Estimates probability of winning
• Used as static evaluation
function in search
• Trained from self-play
• Learning goal: Minimize squared
error between:
• Predicted value v
• Final result z of game
• Same meaning as value net in
earlier AlphaGo

14
MCTS in AlphaGo Zero - Move Selection

• In tree move selection


• Same formulas as in previous
AlphaGo
• Exploitation term Q
• Exploration term u
• Meaning of Q is slightly changed

• Value of simulation ending in


in-tree state s = value head
evaluation of s
• No more simulation beyond
the tree, no more evaluation
component from rollouts

15
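A sketch of this in-tree selection rule, under the assumption of a PUCT-style exploration term weighted by the prior P(s, a) from the policy head; the constant C_PUCT and the statistics below are illustrative, not the paper's values.

```python
# In-tree selection sketch: maximize Q(s,a) + u(s,a).
import math

C_PUCT = 1.5   # illustrative exploration constant

def select_move(stats):
    # stats: move -> (N, W, P) = visit count, total value, prior probability
    total_n = sum(n for n, _, _ in stats.values())
    best_move, best_score = None, -math.inf
    for move, (n, w, p) in stats.items():
        q = w / n if n > 0 else 0.0                        # exploitation term Q
        u = C_PUCT * p * math.sqrt(total_n) / (1 + n)      # exploration term u
        if q + u > best_score:
            best_move, best_score = move, q + u
    return best_move

stats = {"A": (10, 6.0, 0.5), "B": (2, 1.5, 0.3), "C": (0, 0.0, 0.2)}
print(select_move(stats))   # -> "B" with these toy numbers
```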
MCTS in AlphaGo Zero - Evaluation

• Node s just expanded


• Single call to neural net
• (p, v ) = fθ (s)
• p = vector of move probabilities,
p(s, a) for all moves a from s
• v estimated value of s

16
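A minimal sketch of this expand-and-evaluate step: a single network call replaces the rollout. `Node` and `fake_net` are toy stand-ins; `fake_net` plays the role of fθ and returns made-up priors and a made-up value.

```python
# Expand a leaf with one network call; no simulation beyond the tree.
class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}        # move -> {"N": visits, "W": total value, "P": prior}

def expand_and_evaluate(node, net):
    priors, value = net(node.state)                      # (p, v) = f_theta(s)
    for move, prior in priors.items():
        node.children[move] = {"N": 0, "W": 0.0, "P": prior}
    return value      # this value is backed up the tree; no rollout to game end

def fake_net(state):
    return {"A": 0.6, "B": 0.4}, 0.1                     # made-up priors and value

leaf = Node(state="some position")
print(expand_and_evaluate(leaf, fake_net))
print(leaf.children)
```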
Training Pipeline
Diagram: training loop of Self-play → Optimization → Evaluation

• Self-play:
• Generate (very many) self-play game records
• New: use full AlphaGo Zero engine (MCTS using NN) for
both players
• Optimization:
sample from game records to update the NNs
• Evaluation:
• Play a match: AlphaGo Zero using updated NN vs using
previous NN
• If the new NN wins at least 55%:
• Replace previous by updated NN
• Continue loop, play more self-play games with (updated or
previous) engine
17
Network Optimization

• Error measured by loss function


• Combines three terms
• Error of policy head (cross entropy)
• Error of value head (as before)
• Regularization term to keep size of weights in check

18
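A toy sketch of a loss with these three terms (value squared error, policy cross-entropy, L2 regularization on the weights); the constant c and all numbers are illustrative, not the paper's settings.

```python
# Combined loss sketch: value error + policy error + regularization.
import numpy as np

def zero_style_loss(z, v, pi, p, theta, c=1e-4, eps=1e-12):
    value_error = (z - v) ** 2                     # value head: squared error
    policy_error = -np.sum(pi * np.log(p + eps))   # policy head: cross-entropy
    regularization = c * np.sum(theta ** 2)        # keeps weight sizes in check
    return value_error + policy_error + regularization

pi = np.array([0.7, 0.2, 0.1])        # MCTS visit distribution (target)
p = np.array([0.6, 0.3, 0.1])         # policy head output
theta = np.array([0.5, -0.2, 0.1])    # stand-in for the network weights
print(zero_style_loss(z=1.0, v=0.8, pi=pi, p=p, theta=theta))
```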
MCTS Visit Count and Policy π

Meaning of policy π:
• Run MCTS from some state s
• If move a was played N(s, a) times:
• πt(a) ∝ N(s, a)^(1/τ)
• Probability is proportional to its “exponentiated visit count”
• What does “proportional” ∝ mean?
• Compute values N(s, a)^(1/τ) for all actions a, then divide by
their sum to make them into probabilities

19
Temperature Parameter τ

• πt(a) ∝ N(s, a)^(1/τ)


• Temperature parameter τ controls exploration of
low-probability moves
• τ = 1 for early game only, small for rest of game
• Small τ : mostly greedy, play action a with highest N(s, a)

20
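A small sketch of turning visit counts into the distribution πt(a) ∝ N(s, a)^(1/τ); the counts below are made up.

```python
# Exponentiated visit counts, normalized into probabilities.
import numpy as np

def visit_policy(counts, tau):
    exponentiated = counts ** (1.0 / tau)
    return exponentiated / exponentiated.sum()   # normalize into probabilities

counts = np.array([80.0, 15.0, 4.0, 1.0])        # N(s, a) for four moves
print(visit_policy(counts, tau=1.0))   # proportional to visit counts
print(visit_policy(counts, tau=0.1))   # small tau: almost all mass on the top move
```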
Self-Play Games by Sampling Moves

• Play whole game


• For each state st in game:
• Run MCTS on st
• Sample move to play according to number of simulations it
received
• Note difference to regular MCTS: exploration of many
different games due to sampling
• Regular MCTS would always pick the same move after the
same search
• Finish game, get outcome z (win = +1 or loss = -1)
• Store tuples (st , πt , zt ) for learning after end of game
21
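A toy comparison of this difference (counts made up): regular MCTS would always play the most-visited move, while Zero's self-play samples moves in proportion to the visit counts, so repeated self-play explores many different games.

```python
import numpy as np

rng = np.random.default_rng(0)
moves = np.array(["A", "B", "C", "D"])
counts = np.array([80.0, 15.0, 4.0, 1.0])     # toy MCTS visit counts
pi = counts / counts.sum()

print(moves[np.argmax(counts)])                       # regular MCTS: always "A"
print([rng.choice(moves, p=pi) for _ in range(5)])    # sampling: mostly "A", sometimes others
```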
Meaning of Tuples

• (st , πt , zt )
• st = state at time step t
• Game = sequence of states s1 , s2 ,...
• πt = probability distribution derived from visit count of
moves in MCTS of st
• zt = end of game result from current player’s point of view
(z or −z, negamax)

22
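A small sketch of labelling the stored tuples from the current player's point of view (negamax); the states, distributions, and winner below are made up.

```python
# Assign z_t = +1 to states where the eventual winner is to move, -1 otherwise.
import numpy as np

def make_training_tuples(states, policies, to_move, winner):
    # to_move[t]: +1 if Black to move at time t, -1 if White; winner: +1 or -1
    return [(s, pi, 1.0 if player == winner else -1.0)
            for s, pi, player in zip(states, policies, to_move)]

states = ["s1", "s2", "s3"]
policies = [np.array([0.7, 0.3]), np.array([0.5, 0.5]), np.array([0.9, 0.1])]
to_move = [+1, -1, +1]
for t in make_training_tuples(states, policies, to_move, winner=+1):
    print(t)     # z_t = +1 for Black-to-move states, -1 for White-to-move states
```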
Learning from Self-Play Games

• After each game


• Randomly sample tuple (s, π, z)
from all tuples stored from the
game
• Adjust net weights θ by gradient
descent
• (p, v ) = fθ (s), improve θ:
• Make policy p better match π
• Make value v better match z

23
Training Process

• Most tests in paper with 20 block resnet


• 4.9 million self-play games
• 1600 simulations / move in MCTS
• Update net in minibatches of 2048 game positions

24
Zero After 3 Hours of Learning

• Net learned for 3 hours


• Quick game, MCTS with
1600 simulations/move
• Learned about capturing
stones
• Plays like human
beginner
• Clearly better than
random

25
Zero After 70 Hours of Learning

• 70 hours
• Quick game, MCTS with
1600 simulations/move
• Plays super-strong game
of Go
• Complex strategies
• Exact score estimates,
counting

26
Comparing Early Learning - RL vs SL and
AlphaGo Lee

• After 72 hours of training


• Slow games vs AlphaGo Lee
• 2 hours per game per player
• Zero won 100 - 0

27
Prediction of Human Moves

• Move prediction of human


professional moves
• Learning improves, then
plateaus
• Always stays below SL policy
predictions (why?)
• Despite lower stats, most moves
are very “human-like”

28
Predicting the Outcome of Human Games

• Predict winner of human


professional games
• MSE = mean square error
between value net and real
outcome
• Compare SL and RL value nets
• SL starts with big advantage
• RL becomes much better than
SL with more training

29
Compare Strength with AlphaGo Lee and
Master

• Strongest version - 40 residual blocks instead of 20


• Trained 29 million games, 40 days
• Learning compared to Elo strength of AlphaGo Lee and
AlphaGo Master
• Match Zero 40-block vs Master: 89 wins 11 losses in slow
games
30
Results for Different AlphaGo Versions

• Compares Elo for different


versions of AlphaGo
• Fast games, 5 seconds / move
• Also compares Zero’s “raw
network” vs full Zero
• Raw network:
• Evaluate current state s
• Play highest probability move
from policy head
• No search
• Almost as strong as: AlphaGo
Fan using search

31
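A minimal sketch of such a "raw network" player: evaluate the position once and play the highest-probability legal move from the policy head, with no search. The probabilities and the legal-move set below are made up.

```python
import numpy as np

def raw_network_move(policy_probs, legal_moves):
    legal = {m: policy_probs[m] for m in legal_moves}   # mask out illegal moves
    return max(legal, key=legal.get)                    # no search at all

probs = np.array([0.05, 0.40, 0.30, 0.25])   # toy policy head output
print(raw_network_move(probs, legal_moves=[0, 2, 3]))   # -> 2 (move 1 is illegal)
```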
The Importance of Search

• Raw network vs AlphaGo Zero,


5 seconds / move
• With search 2000 Elo stronger
• That’s 20 skill levels...
• Same gap as between top
human professional and weak
club player
• Stronger knowledge makes
search even stronger
• More time spent on searching
good move candidates deeply
• No “diminishing returns”
• Value of search remains very
high throughout
32
Comparing Network Architectures

• Evaluate Elo strength, move prediction, MSE of game outcomes


• sep (separate networks) vs dual (one net, 2 heads)
• conv (DCNN) vs res (residual net)
• Clearly better: dual architecture, sharing most of network
• Clearly better: residual net over DCNN
• Only exception: sep is better in move prediction
33
Some Limitations of AlphaGo

• AlphaGo only plays 19 × 19 Go with 7.5 komi


• AlphaGo is not perfect - no proofs, “mastering” vs solving
the game
• 5 × 5 is still the largest solved square board size, since 2002:
http://erikvanderwerf.tengen.nl/5x5/5x5solved.html
• AlphaGo is not open source - we do not know many of the
details
• AlphaGo required massive resources for training

34
Limitations of an AlphaGo-like Approach to
Problem-Solving

• AlphaGo relies on having a perfect model of the game


• Exact rules of game, perfect scoring of outcome, full state
of game known
• Model is used for creating many millions of self-play games
• Learning relies on having these games
• Big challenge: how to learn without perfect model
• MuZero (2019) can learn from games without knowing the
rules - see later lecture

35
Impact of AlphaGo

• Huge impact in media, outside of core AI community


• Often described as a major step towards “machine
intelligence”
• Remember main limitation - still needs an exact model to
work well
• Next:
• Impact on AI and machine learning in general
• Impact on heuristic search and computer game playing
• Impact on human Go community

36
Impact on AI and Machine Learning

• Prime example of how combination of search, simulation


and knowledge can achieve spectacular results
• Using deep learning:
• Search improves knowledge
• Knowledge improves search
• Virtuous cycle, positive feedback loop
• Simulation has changed dramatically
• In-tree only
• No more rollouts to end of game
• Search controlled completely by neural net evaluations
• AlphaGo (Zero) has dramatically shifted the landscape of
what knowledge can do as part of a larger search-based
system

37
Impact on AI and Machine Learning

• 10 years before, with MCTS there was a similar shift in


what search can do
• Main questions:
• Which other applications of deep learning can profit from
adding search and simulation?
• Which other applications of heuristic search can profit from
deep learning?

38
Impact on Heuristic Search and Computer
Game-playing

• Dramatic shift to much stronger knowledge


• With each major advance in one of the three areas -
search, knowledge, simulations -
• Need to rethink all heuristic search systems
• AlphaGo is a milestone
• Much work ahead to fully exploit the power of stronger
knowledge
• Can we learn stronger knowledge for other, harder
problems?
• Video games (e.g. Atari games, Starcraft)
• More open real-world problems with less well-defined rules

39
Impact on Human Professional Go Community

• Human professionals study AlphaGo and other AI games


intensely
• They all use strong open source programs such as KataGo
to study
• Many AlphaGo-inspired openings now popular
• Some pros were worried for their jobs
• In practice, interest in human tournaments remains strong
• Strong programs did not replace human teachers

40
Impact on Human Amateur Go Community

• Temporary boost in excitement and visibility for the game


of Go
• Having strong computer opponents (and online play) helps
individuals in small communities without a Go club
• Cheating is becoming a problem, as in chess
• Goals:
• Turn programs into tools for teaching Go
• Explain programs’ moves in human terms

41
Impact on Computer Go Community

• Mission accomplished? Game over?


• Taking chess as an example
• Public interest in programming Go has faded
• A core group of enthusiasts keeps going
• Everyone uses programs as study tools
• Level of humans is improving by studying with
programs
• It is now possible for a single person to write a
professional-level Go program in a year
• If they have enough computers...
• Many people modify KataGo - strongest open source
program

42
Beyond AlphaGo

• This lecture: Computer Go Today


• Future lectures:
• Applications to other games
• Other types of applications?
• Current research: improving and generalising the
techniques

43
Computer Go Today

• Many super-human level Go programs


• Many derived from KataGo, some independent
implementations
• Strongest: Tencent’s FineArt
• Can give top human professionals two stones handicap (!)
• Strongest open source program: KataGo

44
KataGo

KataGo: Strongest open source Go program


• Program by David Wu,
https://github.com/lightvector/KataGo
• Based on AlphaGo Zero
• Many improvements
• Can play different board sizes with same net
• Can play handicap games, different komi
• Most popular training tool for both amateurs and
professionals
• Superhuman even on a smartphone

45
KataGo Training
KataGo earlier training. b = blocks, c = channels, e.g. b6c96 =
6 blocks, 96 channels

Image source: https://github.com/lightvector/KataGo/blob/master/TrainingHistory.md


46
KataGo Recent Training

KataGo recent training

Image source: https://katagotraining.org

47
Summary

• Reviewed AlphaGo Zero in detail


• Discussed impact, state of computer Go after AlphaGo

48
