
Computing Science (CMPUT) 455

Search, Knowledge, and Simulations

Martin Müller

Department of Computing Science


University of Alberta
[email protected]

Fall 2024

1
455 Today - Lecture 23

• AlphaGo Zero
• Coursework:
• Reading: AlphaGo Zero paper
• Quiz 12
• Assignment 4

2
SPOT Surveys

• Student Perspectives of Teaching (SPOT)


• SPOT surveys were created to gather feedback on courses
to help instructors, departments and faculties shape
instruction and future curriculum.
• Feedback is anonymous and takes less than ten minutes
to complete
• You should have received an email
• SPOT surveys: https://ualberta.bluera.com
• Open November 25 - December 11
• 15 minutes in-class time: end of lecture, Thu Nov 28

3
AlphaGo Zero

• October 2017 article in Nature


• New simplified architecture
• Learns entirely from self-play
• No human knowledge beyond the basics, such as the rules of
the game
• Stronger than previous AlphaGo versions
• Skill level: far beyond humans
4
Main Technical Changes in AlphaGo Zero

• New training method tailored for improving MCTS


• Self-learning from predicting searched moves and
outcomes
• New network architecture: resnets
• New network architecture: combine policy and value nets
into one net with two “heads”
• No longer needs a large distributed system; strong
performance on “one machine”

5
Human Knowledge in Zero

• Rules of Go, legal moves


• Hard limit of 19 × 19 × 2 = 722 moves on game length
• Tromp-Taylor scoring
• Input has same 2-d grid structure as Go board
• Uses rotation and reflection invariance of Go rules for
training
• MCTS search parameters optimized

6
Human Knowledge Not Used in Zero

Some examples of knowledge used in many other Go


programs, but not in Zero
• Avoid eye-filling moves (as in Go1)
• Patterns
• Tactics, atari, self-atari
• Human game records
• Rules for a simulation policy
Zero has to learn all the basics of Go, and then it learns much,
much more...

7
AlphaGo Zero’s Search

• Search - still MCTS


• Used in two different ways:
• For learning (new)
• For playing
• The most impressive innovation in Zero is how
search is used to improve learning, and
learning in turn improves search

8
Main Components of AlphaGo Zero -
Knowledge

Knowledge
• All knowledge created by machine learning from self-play
• New network architecture
• Knowledge represented by deep residual neural net
• Combines policy and value nets
• New: one single net with two output “heads”
• Both move and position evaluation learned together
• No more simulations (rollouts) to end of game!
• Evaluation only from neural net knowledge and ...
• ... from real game results, when end-of-game states are
reached in the tree
• MCTS win rates computed only from these

9
Knowledge Representation

• Deep residual neural network (He et al 2015)


• Learns two types of knowledge simultaneously
• Policy head
• Learns good moves for the search
• Value head
• Learns evaluation function - probability of winning
• Most of network is shared between both (why?)

10
Deep Residual Neural Network (Resnet)

• Small change in network: add extra “bypass” connection
• Main idea: pass output x of previous “block” directly through
(identity connection)
• Each block learns a “delta” F(x), a small change to the
previous output x
• Output of a block is F(x) + x
• Learning small changes is easier
• Can train really deep nets efficiently
• >100 layers in image recognition
• In theory, resnets have no greater representational power
than DCNN
• In practice, they learn better

Figure: “Residual learning: a building block” - weight layer, relu,
weight layer, plus identity shortcut; output F(x) + x.
Image source: He et al 2015 article

11
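To make the F(x) + x idea concrete, here is a minimal numerical sketch in NumPy, not from the paper: the weight shapes and values are made up, but the block computes a correction F(x) with two small weight layers and adds the unchanged input back in.

```python
# A minimal numerical sketch (made-up weights) of the residual idea: the block
# computes a correction F(x) and adds the unchanged input x back in.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    f = relu(x @ W1) @ W2     # F(x): two weight layers with a ReLU in between
    return relu(f + x)        # output = F(x) + x (identity shortcut)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                 # toy activation vector
W1 = 0.1 * rng.standard_normal((8, 8))     # illustrative weights
W2 = 0.1 * rng.standard_normal((8, 8))
print(residual_block(x, W1, W2))
```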
Two-head Architecture

(p, v ) = fθ (s)

• Deep net
• Input Go position s
• Network weights θ
• Network computes function fθ (s)
• Two outputs: (p, v )
• p vector of move probabilities p(s, a) for each move a
• v value of s
Image source: All further images from the AlphaGo Zero paper,
unless stated otherwise

12
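As an illustration of the two-head idea, here is a small PyTorch sketch, not DeepMind's code: the 17 input planes, 64 channels, and 4 blocks are assumptions chosen to keep the example small (the paper's network is much larger), but it shows one residual trunk feeding a policy head and a value head.

```python
# A minimal dual-head residual network sketch in PyTorch (not DeepMind's code).
# Sizes are illustrative; the real network uses 20 or 40 blocks and many more channels.
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19
IN_PLANES = 17
CHANNELS = 64

class ResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c)

    def forward(self, x):
        # The block learns a "delta"; the input x is passed through unchanged.
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)

class DualHeadNet(nn.Module):
    def __init__(self, blocks=4, c=CHANNELS):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(IN_PLANES, c, 3, padding=1, bias=False),
            nn.BatchNorm2d(c), nn.ReLU())
        self.trunk = nn.Sequential(*[ResidualBlock(c) for _ in range(blocks)])
        # Policy head: one output per board point plus one for pass.
        self.policy = nn.Sequential(
            nn.Conv2d(c, 2, 1), nn.BatchNorm2d(2), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * BOARD * BOARD, BOARD * BOARD + 1))
        # Value head: a single scalar in [-1, 1].
        self.value = nn.Sequential(
            nn.Conv2d(c, 1, 1), nn.BatchNorm2d(1), nn.ReLU(), nn.Flatten(),
            nn.Linear(BOARD * BOARD, c), nn.ReLU(),
            nn.Linear(c, 1), nn.Tanh())

    def forward(self, s):
        x = self.trunk(self.stem(s))
        return self.policy(x), self.value(x)   # (p logits, v)

s = torch.zeros(1, IN_PLANES, BOARD, BOARD)    # one toy "position"
p_logits, v = DualHeadNet()(s)
print(p_logits.shape, v.shape)                 # (1, 362) move logits, (1, 1) value
```

Sharing the trunk means both heads are trained on the same features, which is part of why the combined network is cheaper and generalizes better than two separate nets.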
Output 1 of Neural Network: Policy Head

• New: learns to predict what the


search would do
• How frequently should each
move be tried in MCTS?
• Learning goal: minimize
cross-entropy between
• Predicted probability of move
• Frequency of move as
selected by MCTS
• Cross-entropy: measures how
well one probability distribution
can predict another
• We skip the math here...
• Details:
https://en.wikipedia.org/wiki/Cross-entropy

13
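A toy numerical illustration of the policy-head target (all numbers made up): the cross-entropy between the MCTS visit distribution and the predicted move probabilities is small when the network's probabilities match the search.

```python
# Cross-entropy between the MCTS visit distribution pi and the prediction p.
import numpy as np

def cross_entropy(pi, p, eps=1e-12):
    # H(pi, p) = -sum_a pi(a) * log p(a); smallest when p matches pi
    return -np.sum(pi * np.log(p + eps))

visits = np.array([80.0, 15.0, 4.0, 1.0])       # N(s, a) from a toy search
pi = visits / visits.sum()                      # target distribution
p_close = np.array([0.75, 0.18, 0.05, 0.02])    # network close to the search
p_uniform = np.array([0.25, 0.25, 0.25, 0.25])  # uninformed network

print(cross_entropy(pi, p_close))     # smaller loss
print(cross_entropy(pi, p_uniform))   # larger loss
```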
Output 2 of Neural Network: Value Head

• Given a Go position
• Estimates probability of winning
• Used as static evaluation
function in search
• Trained from self-play
• Learning goal: Minimize squared
error between:
• Predicted value v
• Final result z of game
• Same meaning as value net in
earlier AlphaGo

14
MCTS in AlphaGo Zero - Move Selection

• In tree move selection


• Same formulas as in previous
AlphaGo
• Exploitation term Q
• Exploration term u
• Meaning of Q is slightly changed

• Value of simulation ending in


in-tree state s = value head
evaluation of s
• No more simulation beyond
the tree, no more evaluation
component from rollouts

15
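A sketch of this in-tree selection rule, under the assumption of a PUCT-style exploration term weighted by the prior P(s, a) from the policy head; the constant C_PUCT and the statistics below are illustrative, not the paper's values.

```python
# In-tree selection sketch: maximize Q(s,a) + u(s,a).
import math

C_PUCT = 1.5   # illustrative exploration constant

def select_move(stats):
    # stats: move -> (N, W, P) = visit count, total value, prior probability
    total_n = sum(n for n, _, _ in stats.values())
    best_move, best_score = None, -math.inf
    for move, (n, w, p) in stats.items():
        q = w / n if n > 0 else 0.0                        # exploitation term Q
        u = C_PUCT * p * math.sqrt(total_n) / (1 + n)      # exploration term u
        if q + u > best_score:
            best_move, best_score = move, q + u
    return best_move

stats = {"A": (10, 6.0, 0.5), "B": (2, 1.5, 0.3), "C": (0, 0.0, 0.2)}
print(select_move(stats))   # -> "B" with these toy numbers
```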
MCTS in AlphaGo Zero - Evaluation

• Node s just expanded


• Single call to neural net
• (p, v ) = fθ (s)
• p = vector of move probabilities,
p(s, a) for all moves a from s
• v estimated value of s

16
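A minimal sketch of this expand-and-evaluate step: a single network call replaces the rollout. `Node` and `fake_net` are toy stand-ins; `fake_net` plays the role of fθ and returns made-up priors and a made-up value.

```python
# Expand a leaf with one network call; no simulation beyond the tree.
class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}        # move -> {"N": visits, "W": total value, "P": prior}

def expand_and_evaluate(node, net):
    priors, value = net(node.state)                      # (p, v) = f_theta(s)
    for move, prior in priors.items():
        node.children[move] = {"N": 0, "W": 0.0, "P": prior}
    return value      # this value is backed up the tree; no rollout to game end

def fake_net(state):
    return {"A": 0.6, "B": 0.4}, 0.1                     # made-up priors and value

leaf = Node(state="some position")
print(expand_and_evaluate(leaf, fake_net))
print(leaf.children)
```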
Training Pipeline
Diagram: training loop of Self-play → Optimization → Evaluation

• Self-play:
• Generate (very many) self-play game records
• New: use full AlphaGo Zero engine (MCTS using NN) for
both players
• Optimization:
sample from game records to update the NNs
• Evaluation:
• Play a match: AlphaGo Zero using updated NN vs using
previous NN
• If the new NN wins at least 55%:
• Replace previous by updated NN
• Continue loop, play more self-play games with (updated or
previous) engine
17
Network Optimization

• Error measured by loss function


• Combines three terms
• Error of policy head (cross entropy)
• Error of value head (as before)
• Regularization term to keep size of weights in check

18
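A toy sketch of a loss with these three terms (value squared error, policy cross-entropy, L2 regularization on the weights); the constant c and all numbers are illustrative, not the paper's settings.

```python
# Combined loss sketch: value error + policy error + regularization.
import numpy as np

def zero_style_loss(z, v, pi, p, theta, c=1e-4, eps=1e-12):
    value_error = (z - v) ** 2                     # value head: squared error
    policy_error = -np.sum(pi * np.log(p + eps))   # policy head: cross-entropy
    regularization = c * np.sum(theta ** 2)        # keeps weight sizes in check
    return value_error + policy_error + regularization

pi = np.array([0.7, 0.2, 0.1])        # MCTS visit distribution (target)
p = np.array([0.6, 0.3, 0.1])         # policy head output
theta = np.array([0.5, -0.2, 0.1])    # stand-in for the network weights
print(zero_style_loss(z=1.0, v=0.8, pi=pi, p=p, theta=theta))
```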
MCTS Visit Count and Policy π

Meaning of policy π:
• Run MCTS from some state s
• If move a was played N(s, a) times:
• πt(a) ∝ N(s, a)^(1/τ)
• Probability is proportional to its “exponentiated visit count”
• What does “proportional” ∝ mean?
• Compute values N(s, a)^(1/τ) for all actions a, then divide by
their sum to make them into probabilities

19
Temperature Parameter τ

• πt(a) ∝ N(s, a)^(1/τ)


• Temperature parameter τ controls exploration of
low-probability moves
• τ = 1 for early game only, small for rest of game
• Small τ : mostly greedy, play action a with highest N(s, a)

20
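A small sketch of turning visit counts into the distribution πt(a) ∝ N(s, a)^(1/τ); the counts below are made up.

```python
# Exponentiated visit counts, normalized into probabilities.
import numpy as np

def visit_policy(counts, tau):
    exponentiated = counts ** (1.0 / tau)
    return exponentiated / exponentiated.sum()   # normalize into probabilities

counts = np.array([80.0, 15.0, 4.0, 1.0])        # N(s, a) for four moves
print(visit_policy(counts, tau=1.0))   # proportional to visit counts
print(visit_policy(counts, tau=0.1))   # small tau: almost all mass on the top move
```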
Self-Play Games by Sampling Moves

• Play whole game


• For each state st in game:
• Run MCTS on st
• Sample move to play according to number of simulations it
received
• Note difference to regular MCTS: exploration of many
different games due to sampling
• Regular MCTS would always pick the same move after the
same search
• Finish game, get outcome z (win = +1 or loss = -1)
• Store tuples (st , πt , zt ) for learning after end of game
21
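A toy comparison of this difference (counts made up): regular MCTS would always play the most-visited move, while Zero's self-play samples moves in proportion to the visit counts, so repeated self-play explores many different games.

```python
import numpy as np

rng = np.random.default_rng(0)
moves = np.array(["A", "B", "C", "D"])
counts = np.array([80.0, 15.0, 4.0, 1.0])     # toy MCTS visit counts
pi = counts / counts.sum()

print(moves[np.argmax(counts)])                       # regular MCTS: always "A"
print([rng.choice(moves, p=pi) for _ in range(5)])    # sampling: mostly "A", sometimes others
```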
Meaning of Tuples

• (st , πt , zt )
• st = state at time step t
• Game = sequence of states s1 , s2 ,...
• πt = probability distribution derived from visit count of
moves in MCTS of st
• zt = end of game result from current player’s point of view
(z or −z, negamax)

22
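A small sketch of labelling the stored tuples from the current player's point of view (negamax); the states, distributions, and winner below are made up.

```python
# Assign z_t = +1 to states where the eventual winner is to move, -1 otherwise.
import numpy as np

def make_training_tuples(states, policies, to_move, winner):
    # to_move[t]: +1 if Black to move at time t, -1 if White; winner: +1 or -1
    return [(s, pi, 1.0 if player == winner else -1.0)
            for s, pi, player in zip(states, policies, to_move)]

states = ["s1", "s2", "s3"]
policies = [np.array([0.7, 0.3]), np.array([0.5, 0.5]), np.array([0.9, 0.1])]
to_move = [+1, -1, +1]
for t in make_training_tuples(states, policies, to_move, winner=+1):
    print(t)     # z_t = +1 for Black-to-move states, -1 for White-to-move states
```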
Learning from Self-Play Games

• After each game


• Randomly sample tuple (s, π, z)
from all tuples stored from the
game
• Adjust net weights θ by gradient
descent
• (p, v ) = fθ (s), improve θ:
• Make policy p better match π
• Make value v better match z

23
Training Process

• Most tests in paper with 20 block resnet


• 4.9 million self-play games
• 1600 simulations / move in MCTS
• Update net in minibatches of 2048 game positions

24
Zero After 3 Hours of Learning

• Net learned for 3 hours


• Quick game, MCTS with
1600 simulations/move
• Learned about capturing
stones
• Plays like human
beginner
• Clearly better than
random

25
Zero After 70 Hours of Learning

• 70 hours
• Quick game, MCTS with
1600 simulations/move
• Plays super-strong game
of Go
• Complex strategies
• Exact score estimates,
counting

26
Comparing Early Learning - RL vs SL and
AlphaGo Lee

• After 72 hours of training


• Slow games vs AlphaGo Lee
• 2 hours per game per player
• Zero won 100 - 0

27
Prediction of Human Moves

• Move prediction of human


professional moves
• Learning improves, then
plateaus
• Always stays below SL policy
predictions (why?)
• Despite lower stats, most moves
are very “human-like”

28
Predicting the Outcome of Human Games

• Predict winner of human


professional games
• MSE = mean square error
between value net and real
outcome
• Compare SL and RL value nets
• SL starts with big advantage
• RL becomes much better than
SL with more training

29
Compare Strength with AlphaGo Lee and
Master

• Strongest version - 40 residual blocks instead of 20


• Trained 29 million games, 40 days
• Learning compared to Elo strength of AlphaGo Lee and
AlphaGo Master
• Match Zero 40-block vs Master: 89 wins 11 losses in slow
games
30
Results for Different AlphaGo Versions

• Compares Elo for different


versions of AlphaGo
• Fast games, 5 seconds / move
• Also compares Zero’s “raw
network” vs full Zero
• Raw network:
• Evaluate current state s
• Play highest probability move
from policy head
• No search
• Almost as strong as: AlphaGo
Fan using search

31
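A minimal sketch of such a "raw network" player: evaluate the position once and play the highest-probability legal move from the policy head, with no search. The probabilities and the legal-move set below are made up.

```python
import numpy as np

def raw_network_move(policy_probs, legal_moves):
    legal = {m: policy_probs[m] for m in legal_moves}   # mask out illegal moves
    return max(legal, key=legal.get)                    # no search at all

probs = np.array([0.05, 0.40, 0.30, 0.25])   # toy policy head output
print(raw_network_move(probs, legal_moves=[0, 2, 3]))   # -> 2 (move 1 is illegal)
```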
The Importance of Search

• Raw network vs AlphaGo Zero,


5 seconds / move
• With search 2000 Elo stronger
• That’s 20 skill levels...
• Same gap as between top
human professional and weak
club player
• Stronger knowledge makes
search even stronger
• More time spent on searching
good move candidates deeply
• No “diminishing returns”
• Value of search remains very
high throughout
32
Comparing Network Architectures

• Evaluate Elo strength, move prediction, MSE of game outcomes


• sep (separate networks) vs dual (one net, 2 heads)
• conv (DCNN) vs res (residual net)
• Clearly better: dual architecture, sharing most of network
• Clearly better: residual net over DCNN
• Only exception: sep is better in move prediction
33
Some Limitations of AlphaGo

• AlphaGo only plays 19 × 19 Go with 7.5 komi


• AlphaGo is not perfect - no proofs, “mastering” vs solving
the game
• 5 × 5 is still the largest solved square board size, since 2002:
http://erikvanderwerf.tengen.nl/5x5/5x5solved.html
• AlphaGo is not open source - we do not know many of the
details
• AlphaGo required massive resources for training

34
Limitations of an AlphaGo-like Approach to
Problem-Solving

• AlphaGo relies on having a perfect model of the game


• Exact rules of game, perfect scoring of outcome, full state
of game known
• Model is used for creating many millions of self-play games
• Learning relies on having these games
• Big challenge: how to learn without perfect model
• MuZero (2019) can learn from games without knowing the
rules - see later lecture

35
Impact of AlphaGo

• Huge impact in media, outside of core AI community


• Often described as a major step towards “machine
intelligence”
• Remember main limitation - still needs an exact model to
work well
• Next:
• Impact on AI and machine learning in general
• Impact on heuristic search and computer game playing
• Impact on human Go community

36
Impact on AI and Machine Learning

• Prime example of how combination of search, simulation


and knowledge can achieve spectacular results
• Using deep learning:
• Search improves knowledge
• Knowledge improves search
• Virtuous cycle, positive feedback loop
• Simulation has changed dramatically
• In-tree only
• No more rollouts to end of game
• Search controlled completely by neural net evaluations
• AlphaGo (Zero) has dramatically shifted the landscape of
what knowledge can do as part of a larger search-based
system

37
Impact on AI and Machine Learning

• 10 years before, with MCTS there was a similar shift in


what search can do
• Main questions:
• Which other applications of deep learning can profit from
adding search and simulation?
• Which other applications of heuristic search can profit from
deep learning?

38
Impact on Heuristic Search and Computer
Game-playing

• Dramatic shift to much stronger knowledge


• With each major advance in one of the three areas -
search, knowledge, simulations -
• Need to rethink all heuristic search systems
• AlphaGo is a milestone
• Much work ahead to fully exploit the power of stronger
knowledge
• Can we learn stronger knowledge for other, harder
problems?
• Video games (e.g. Atari games, Starcraft)
• More open real-world problems with less well-defined rules

39
Impact on Human Professional Go Community

• Human professionals study AlphaGo and other AI games


intensely
• They all use strong open source programs such as KataGo
to study
• Many AlphaGo-inspired openings now popular
• Some pros were worried for their jobs
• In practice, interest in human tournaments remains strong
• Strong programs did not replace human teachers

40
Impact on Human Amateur Go Community

• Temporary boost in excitement and visibility for the game


of Go
• Having strong computer opponents (and online play) helps
individuals in small communities without a Go club
• Cheating is becoming a problem, as in chess
• Goals:
• Turn programs into tools for teaching Go
• Explain programs’ moves in human terms

41
Impact on Computer Go Community

• Mission accomplished? Game over?


• Taking chess as an example
• Public interest in programming Go has faded
• A core group of enthusiasts keeps going
• Everyone uses programs as study tools
• Level of humans is improving by studying with
programs
• It is now possible for a single person to write a
professional-level Go program in a year
• If they have enough computers...
• Many people modify KataGo - strongest open source
program

42
Beyond AlphaGo

• This lecture: Computer Go Today


• Future lectures:
• Applications to other games
• Other types of applications?
• Current research: improving and generalising the
techniques

43
Computer Go Today

• Many super-human level Go programs


• Many derived from KataGo, some independent
implementations
• Strongest: Tencent’s FineArt
• Can give top human professionals two stones handicap (!)
• Strongest open source program: KataGo

44
KataGo

KataGo: Strongest open source Go program


• Program by David Wu,
https://github.com/lightvector/KataGo
• Based on AlphaGo Zero
• Many improvements
• Can play different board sizes with same net
• Can play handicap games, different komi
• Most popular training tool for both amateurs and
professionals
• Superhuman even on a smartphone

45
KataGo Training
KataGo earlier training. b = blocks, c = channels, e.g. b6c96 =
6 blocks, 96 channels

Image source: https://github.com/lightvector/KataGo/blob/master/TrainingHistory.md


46
KataGo Recent Training

KataGo recent training

Image source: https://katagotraining.org

47
Summary

• Reviewed AlphaGo Zero in detail


• Discussed impact, state of computer Go after AlphaGo

48
