BAI701 - DLRL - Module 5 Notes

Deep Reinforcement Learning (DRL) mimics human learning through experience and rewards, utilizing trial-and-error to maximize long-term outcomes in complex environments. It encompasses various algorithms, including multi-armed bandits and ε-greedy strategies, to balance exploration and exploitation while addressing challenges like credit assignment and large state spaces. DRL applications span from video games to self-driving cars, where agents learn optimal actions based on state-dependent rewards.


Deep Learning and Reinforcement Learning

Module-5

Deep Reinforcement Learning


9.1 Introduction

• Humans learn through experience-driven, reward-guided trial and error instead of fixed
training data.
• Learning is shaped by both individual interactions with the environment and evolution over
generations.
• Herbert Simon’s ant hypothesis suggests that human behavior appears complex because of
the environment's complexity.
• Biological intelligence arises from simple reward-seeking behavior through environmental
interaction.
• Artificial intelligence aims to simulate this by adopting trial-and-error learning.
• Reinforcement learning is a reward-driven trial-and-error approach seeking to maximize
long-term rewards by interacting with the environment.
• RL is seen as a path toward general artificial intelligence.
• In video games (e.g., Atari 2600), deep learners use raw pixel input to predict actions,
improving through reward feedback to reach or exceed human-level performance.
• AlphaGo learned to play Go using human and self-play data, developing novel strategies,
defeating top players, and extending to games like chess.
• Self-driving cars can use RL by processing sensor feedback to make driving decisions,
with RL helping reduce errors compared to human drivers.
• Robots can learn locomotion through RL without being shown how to walk, instead
learning efficient ways to move from trial-and-error reward signals.
• RL is suitable for problems where outcomes are easy to evaluate but actions are hard to
specify in advance.
• The multi-armed bandit problem illustrates balancing exploration of new options and
exploiting known good choices.

• Unlike bandit problems with identical decisions, real-world tasks (like games or driving)
require state-sensitive decisions learned through RL.
• RL enables learning complex behaviours from simple reward feedback, with the
complexity emerging from interaction with the environment.

9.2 Stateless Algorithms: Multi-Armed Bandits

• A gambler repeatedly chooses between slot machines, trying to find which one has the
highest average reward.
• He must balance exploration (trying all machines to learn about them) with exploitation
(choosing the best-known machine for maximum reward).
• Rewards for each machine follow a fixed probability distribution and do not depend on any
state, making this a simplified form of reinforcement learning.
• Various multi-armed bandit strategies help manage the exploration–exploitation trade-
off effectively.
• These strategies are foundational for general reinforcement learning and are often used as
components in more complex systems.

9.2.1 Naive Algorithm

• In the naive approach, the gambler first runs an exploration phase where each machine is
tried a fixed number of times.
• After this exploration, the gambler permanently switches to the machine with the highest
observed average payoff.
• At first glance, this strategy appears reasonable because it separates learning from using
the best option.
• A major drawback is that it’s hard to choose the right number of exploratory trials to
confidently identify the best machine.
• If too few trials are used, the estimate of the best machine may be inaccurate.
• Estimating payoffs is especially hard when big rewards are rare, needing many trials for
reliable estimates.
• Using many exploratory trials wastes effort on poor machines that won’t be used later.
• If the wrong machine is chosen at the end of exploration, the gambler is stuck with it
forever.
• This fixed approach is unrealistic in real-world problems where conditions can be uncertain
and changing.

9.2.2 ε-Greedy Algorithm

• The ε-greedy algorithm interleaves exploration and exploitation throughout the game.
• With probability ε, the gambler explores by choosing a random machine.
• With probability (1 − ε), the gambler exploits by choosing the machine with the highest
average reward so far.
• This ensures that exploration continues indefinitely, preventing the gambler from getting
stuck in a suboptimal choice.
• The approach starts exploitation early, letting the gambler benefit from good choices
sooner.
• The parameter ε controls the trade-off between exploration and exploitation.
• Typical values might be ε = 0.1, but the best choice depends on the specific problem.
• A small ε value means more exploitation but slower learning about other machines.
• A large ε value means more exploration but potentially less reward overall.
• Tuning ε can be hard since the ideal balance depends on the problem’s dynamics.
• A common strategy is annealing, where ε starts large (more exploration) and gradually
decreases over time (more exploitation).
• Annealing helps the gambler explore thoroughly early on but focus on the best machine
later.
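A minimal sketch of the ε-greedy bandit strategy with annealing is shown below, assuming stationary reward distributions and a simple inverse-time decay for ε (the function names, the decay schedule, and the incremental-mean bookkeeping are illustrative choices, not prescribed by the source):

```python
import numpy as np

def epsilon_greedy_bandit(pull, n_arms, n_steps, eps_start=0.5, eps_min=0.05):
    """Play a stateless multi-armed bandit with an annealed epsilon-greedy policy.

    `pull(arm)` is assumed to return a stochastic reward for the chosen arm.
    """
    counts = np.zeros(n_arms)          # number of times each arm was tried
    means = np.zeros(n_arms)           # running average reward of each arm
    rng = np.random.default_rng(0)

    for t in range(1, n_steps + 1):
        eps = max(eps_min, eps_start / t)          # annealing: explore less over time
        if rng.random() < eps:
            arm = int(rng.integers(n_arms))        # explore: random machine
        else:
            arm = int(np.argmax(means))            # exploit: best average so far
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean update
    return means, counts

# Example usage with three Bernoulli arms of (assumed) success probabilities 0.2, 0.5, 0.8:
# means, counts = epsilon_greedy_bandit(lambda a: np.random.binomial(1, [0.2, 0.5, 0.8][a]), 3, 5000)
```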

9.2.3 Upper Bounding Methods

• Upper bounding strategies improve on ε-greedy by combining exploration and exploitation more naturally in every decision.
• In upper bounding strategies, the gambler takes an optimistic view of slot machines that have not been tried sufficiently, and therefore plays the slot machine with the best statistical upper bound on its payoff. One can consider the upper bound Ui of slot machine i as the sum of its expected reward Qi and a one-sided confidence-interval length Ci: Ui = Qi + Ci.
• The value Ci acts as a bonus for the gambler's increased uncertainty about that slot machine.
• The value Ci is proportional to the standard deviation of the estimated mean reward of the tries so far. According to the central limit theorem, this standard deviation is inversely proportional to the square root of the number of times slot machine i has been tried.
• One can estimate the mean μi and standard deviation σi of the ith slot machine and then set Ci = K · σi / √ni, where ni is the number of times the ith slot machine has been tried and K decides the level of the confidence interval. Rarely tested slot machines will therefore tend to have larger upper bounds (because of larger confidence intervals Ci) and will be tried more frequently.
• Unlike ε-greedy, this approach doesn’t separate trials into exploration and exploitation
stages.
• Every trial both explores (by considering uncertainty) and exploits (by favoring high
average rewards).
• The parameter K controls how much weight is given to the uncertainty bonus.
• A higher K encourages more exploration by inflating Ci, while a lower K favors
exploitation.
• For example, K = 3 corresponds to an upper bound about three standard deviations above the estimated mean, a very high-confidence bound when rewards are approximately Gaussian.
• Tuning K allows the gambler to balance exploration and exploitation flexibly.
• This approach is often more efficient at discovering the true best machine in fewer trials.
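A sketch of the upper-bounding selection rule Ui = Qi + Ci with Ci = K · σi / √ni follows, assuming every machine has already been tried at least once so that the estimates are defined (the initialization and the default K are illustrative):

```python
import numpy as np

def ucb_select(rewards_per_arm, K=3.0):
    """Pick the arm with the largest statistical upper bound U_i = Q_i + K * sigma_i / sqrt(n_i).

    `rewards_per_arm` is a list of lists: the rewards observed so far for each arm,
    assumed non-empty for every arm (e.g., after one forced initial pull per arm).
    """
    upper_bounds = []
    for observations in rewards_per_arm:
        n_i = len(observations)
        q_i = np.mean(observations)                     # estimated mean reward Q_i
        sigma_i = np.std(observations) if n_i > 1 else 1.0  # crude uncertainty for a single sample
        c_i = K * sigma_i / np.sqrt(n_i)                # confidence bonus shrinks as n_i grows
        upper_bounds.append(q_i + c_i)
    return int(np.argmax(upper_bounds))
```

Every call both exploits (through Qi) and explores (through Ci), which is why no separate exploration phase is needed.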

9.3 The Basic Framework of Reinforcement Learning


• Bandit algorithms are stateless, meaning the environment is the same at every time step,
and past actions only change the agent’s knowledge—not the environment itself.
• In general reinforcement learning (RL) settings like video games or self-driving cars,
there is a notion of state—the environment changes based on the agent’s actions.
• In these settings, a single action’s reward depends on the sequence of past actions. For
example, in a video game, the reward for a move depends on what moves were made before
it.
• In self-driving cars, the same action (e.g., swerving) has different rewards depending on
the current state (e.g., normal driving vs. imminent collision).
• Therefore, RL needs to assign credit for rewards in a way that accounts for the specific
system state in which the action is taken.
• In RL, there is an agent (e.g., the player in a video game) that takes actions (e.g., moving
a joystick) in an environment (e.g., the game itself).
• Each action leads to a new state in the environment. For instance, the player’s position in
the game changes.
• The environment gives the agent rewards based on how well it achieves its goals (e.g.,
scoring points in a game).
• Rewards may depend on combinations of past actions. For example, a move might pay
off because of a clever position achieved earlier.
• The reward for an action in a given state may also be stochastic (random), like the result
of pulling a slot machine lever.
• One of the primary goals of reinforcement learning is to identify the inherent values of
actions in different states, irrespective of the timing and stochasticity of the reward.

• The learning process helps the agent choose actions based on the inherent values of the
actions in different states.
• This general principle applies to all forms of reinforcement learning in biological
organisms, such as a mouse learning a path through a maze to earn a reward. The rewards
earned by the mouse depend on an entire sequence of actions, rather than on only the latest
action. When a reward is earned, the synaptic weights in the mouse’s brain adjust to reflect
how sensory inputs should be used to decide future actions in the maze.
• This is exactly the approach used in deep reinforcement learning, where a neural network
is used to predict actions from sensory inputs (e.g., the pixels of a video game). This relationship
between the agent and the environment is shown in Figure 9.1.
• The agent-environment interaction is modeled as a Markov Decision Process (MDP).
• An MDP has states, actions, rewards, and rules for state transitions. The key property is
that the current state contains all the information needed to predict what will happen next.
• Finite Markov decision processes (e.g., tic-tac-toe) terminate in a finite number of steps,
which is referred to as an episode. A particular episode of this process is a finite sequence
of actions, states, and rewards. An example of length (n + 1) is the following:

s₀ a₀ r₀ s₁ a₁ r₁ . . . sₜ aₜ rₜ . . . sₙ aₙ rₙ

where sₜ is the state before performing action aₜ, and performing the action aₜ causes a
reward of rₜ and a transition to state sₜ₊₁.
• Note a notational difference from Sutton and Barto’s book, which denotes the reward received after performing action aₜ in state sₜ as rₜ₊₁ (rather than rₜ).
• Infinite MDPs (e.g., continuous robot operation) don’t have a natural episode length and
are called non-episodic problems.
• In practice, a system state is often approximated rather than capturing the full environment. For example, in Atari games, a fixed number of recent frames may represent the state.

Examples:
• Tic-tac-toe, Chess, Go:


o State: Current board position.
o Actions: Legal moves.
o Reward: +1 (win), 0 (draw), −1 (loss), usually received at the end of the game
(delayed rewards).
• Robot Locomotion:
o State: Current joint angles and position of the robot.
o Actions: Torques applied to joints.
o Reward: Based on staying upright and making forward progress.


• Self-Driving Car:
o State: Sensor inputs (e.g., LIDAR, camera, GPS).
o Actions: Steering, acceleration, braking.
o Reward: Designed to encourage safe and efficient driving.
• Defining state representations and reward functions requires careful design effort.
• After these are defined, reinforcement learning can work as a complete end-to-end
learning system.
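The agent-environment interaction of an episodic MDP can be written as a simple loop that records the (state, action, reward) sequence s₀a₀r₀ . . . sₙaₙrₙ described above. The sketch below assumes a Gym-style environment interface with `reset` and `step` methods; these names and the returned tuple are assumptions for illustration only.

```python
def run_episode(env, policy, max_steps=1000):
    """Roll out one episode and return the sequence of (state, action, reward) triples."""
    trajectory = []
    state = env.reset()                              # s_0
    for t in range(max_steps):
        action = policy(state)                       # a_t chosen from the current state
        next_state, reward, done = env.step(action)  # environment returns r_t and s_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:                                     # terminal state ends the episode
            break
    return trajectory
```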

9.3.1 Challenges of Reinforcement Learning

• Credit Assignment Problem:

✓ When a reward is received (like winning a chess game), it's hard to know how much each
past action contributed.
✓ Rewards can also be probabilistic (like slot machines), making them hard to estimate
precisely.

• Large State Spaces:

✓ RL problems often have huge numbers of possible states (e.g., all chess positions).
✓ The system must generalize to make good decisions even in unseen states, which is where
deep learning helps.

• Exploration vs. Exploitation Trade-off:

✓ Choosing new, unexplored actions can help learn better strategies but may cost
performance in the short term.
✓ Only exploiting known actions can lead to suboptimal long-term results.

• Data Collection Challenges:

✓ In RL, learning and data collection are intertwined.


✓ Real-world systems (like robots or self-driving cars) must physically perform actions to
learn, which can be costly and dangerous.
✓ Early learning often involves many failures, and collecting enough real-world data is a
major challenge, limiting RL beyond simulations and games.

9.3.2 Simple Reinforcement Learning for Tic-Tac-Toe

• State Definition: In Tic-Tac-Toe, each board configuration is treated as a state, giving at most 3⁹ = 19,683 possible states (since each of the 9 cells on the 3×3 board can be ‘X’, ‘O’, or blank).
• Action Definition: An action is placing ‘X’ or ‘O’ in any valid empty position on the
board.
• State-Action Value Estimation: Instead of valuing actions globally (like in bandits), the
algorithm estimates values for each state-action pair (s, a) based on past outcomes against
a fixed opponent.
• Rewards with Discounting:

✓ Shorter wins are preferred by using a discount factor γ < 1: the unnormalized value of action a in state s is increased by γ^(r−1) in case of a win and decreased by γ^(r−1) in case of a loss occurring after r moves (including the current move). Draws are credited with 0. The discount also reflects the fact that the significance of an action decays with time in real-world settings.

• Table Update Strategy:

✓ Table Updates occur only after the entire game ends (offline update).
✓ Normalized values are computed by dividing the accumulated (unnormalized) value by the
number of times the (state, action) pair has been played.

• Exploration vs Exploitation:

✓ The policy is ε-greedy: with probability 1−ε, it picks the action with the highest normalized
value; with probability ε, it explores randomly.

• Learning Over Time: The table of state-action values improves as more games are played,
allowing the agent to adapt its strategy to the fixed opponent.


• Self-Play Option: The agent can train optimally by playing against itself. In self-play, rewards for updates are assigned as −γ^(r−1), 0, and +γ^(r−1) for loss/draw/win from the perspective of the player making each move.
• Inference: During actual play, the agent chooses moves that have the highest learned
normalized value for the current state.
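A compact sketch of the tabular approach described in this section follows: after each finished game, every (state, action) pair played is credited with the discounted outcome, and moves are then selected ε-greedily using the normalized (averaged) values. The dictionary layout and helper names are illustrative assumptions.

```python
from collections import defaultdict
import random

GAMMA, EPS = 0.9, 0.1
total_value = defaultdict(float)   # unnormalized accumulated value per (state, action)
visit_count = defaultdict(int)     # number of times each (state, action) pair was played

def update_after_game(episode, outcome):
    """episode: list of (state, action) pairs for one player, in order of play.
    outcome: +1 win, 0 draw, -1 loss, from that player's perspective (offline update)."""
    n = len(episode)
    for i, (state, action) in enumerate(episode):
        r = n - i                                         # moves until the end, including this one
        total_value[(state, action)] += outcome * (GAMMA ** (r - 1))
        visit_count[(state, action)] += 1

def choose_move(state, legal_actions):
    """Epsilon-greedy move selection over normalized state-action values."""
    if random.random() < EPS:
        return random.choice(legal_actions)               # explore
    def norm_value(a):
        c = visit_count[(state, a)]
        return total_value[(state, a)] / c if c > 0 else 0.0
    return max(legal_actions, key=norm_value)             # exploit
```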

9.3.3 Role of Deep Learning and a Straw-Man Algorithm

The overarching goal of the ε-greedy algorithm for tic-tac-toe was to learn the inherent long-
term value of each state-action pair, since the rewards are received long after valuable actions
are performed. The goal of the training process is to perform the value discovery task of
identifying which actions are truly beneficial in the long-term at a particular state. For example,
making a clever move in tic-tac-toe might set a trap, which eventually results in assured
victory. Examples of two such scenarios are shown in Figure 9.2(a) (although the trap on the
right is somewhat less obvious). Therefore, one needs to credit a strategically good move
favorably in the table of state-action pairs and not just the final winning move. The trial-and-
error technique based on the greedy method will indeed assign high values to clever traps.
Examples of typical values from such a table are shown in Figure 9.2(b). The less obvious trap
of Figure 9.2(a) has a slightly lower value because moves assuring wins after longer periods
are discounted by γ < 1, and greedy trial-and-error might have a harder time finding the win
after setting the trap. The main problem with this approach is that the number of states in many
reinforcement learning settings is too large to tabulate explicitly.


Figure 9.2(c): Positions from two different games between Alpha Zero (white) and Stockfish (black).

On the left, white sacrifices a pawn and concedes a passed pawn in order to trap black’s light-
square bishop behind black’s own pawns. This strategy eventually resulted in a victory for
white after many more moves than the horizon of a conventional chess-playing program like
Stockfish. In the second game on the right, white has sacrificed material to incrementally
cramp black to a position where all moves worsen the position. Incrementally improving
positional advantage is the hallmark of the very best human players rather than chess-playing
software like Stockfish, whose hand-crafted evaluations sometimes fail to accurately capture
subtle differences in positions. The neural network in reinforcement learning, which uses the
board state as input, evaluates positions in an integrated way without any prior assumptions.
The data generated by trial-and-error provides the only experience for training a very complex
evaluation function that is indirectly encoded within the parameters of the neural network. The
trained network can therefore generalize these learned experiences to new positions.

Monte Carlo simulations are used to refine and remember the long-term values of seen
states. One learns about the value of a trap in tic-tac-toe only because previous Monte Carlo
simulations have experienced victory many times from that exact board position. In most
challenging settings like chess, one must generalize knowledge learned from prior experiences
to a state that the learner has not seen before. All forms of learning (including reinforcement
learning) are most useful when they are used to generalize known experiences to unknown
situations. In such cases, the table-centric forms of reinforcement learning are woefully
inadequate. Deep learning models serve the role of function approximators. Instead of learning
and tabulating the values of all moves in all positions (using reward-driven trial and error), one
learns the value of each move as a function of the input state, based on a trained model using
the outcomes of prior positions. Without this approach, reinforcement learning cannot be used
beyond toy settings like tic-tac-toe.

Although the aforementioned approach is too naive, a sophisticated system with Monte Carlo tree
search, known as Alpha Zero, has recently been trained to play chess. Two examples of positions
from different games in the match between Alpha Zero and a conventional chess program,
Stockfish-8.0, are provided in Figure 9.2(c). In the chess position on the left, the reinforcement
learning system makes a strategically astute move of cramping the opponent’s bishop at the
expense of immediate material loss, which most hand-crafted computer evaluations would not
prefer. In the position on the right, Alpha Zero has sacrificed two pawns and a piece exchange in
order to incrementally constrict black to a point where all its pieces are completely paralyzed. Even
though Alpha Zero (probably) never encountered these specific positions during training, its deep
learner has the ability to extract relevant features and patterns from previous trial-and-error
experience in other board positions. In this particular case, the neural network seems to recognize
the primacy of spatial patterns representing subtle positional factors over tangible material factors
(much like a human’s neural network). In real-life settings, states are often described using sensory
inputs. The deep learner uses this input representation of the state to learn the values of specific
actions (e.g., making a move in a game) in lieu of the table of state-action pairs. Even when the
input representation of the state (e.g., pixels) is quite primitive, neural networks are masters at
squeezing out the relevant insights. This is similar to the approach used by humans to process
primitive sensory inputs to define the state of the world and make decisions about actions using
our biological neural network. We do not have a table of pre-memorized state-action pairs for
every possible real-life situation. The deep-learning paradigm converts the forbiddingly large table
of state-action values into a parameterized model mapping states-action pairs to values, which can
be trained easily with backpropagation.
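The central idea of this section is that the table of state-action values is replaced by a parameterized model that maps a state-action encoding to a value and is trained with backpropagation. A minimal PyTorch sketch is given below, assuming the state-action pair has already been encoded as a fixed-length feature vector; the layer sizes and feature dimensions are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class ValueApproximator(nn.Module):
    """Maps an encoded (state, action) pair to a scalar long-term value estimate."""
    def __init__(self, input_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Training on (encoded state-action, observed discounted return) pairs gathered by trial and error:
model = ValueApproximator(input_dim=18 + 9)      # e.g., board one-hot plus action one-hot (illustrative)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(features, returns):
    optimizer.zero_grad()
    loss = loss_fn(model(features), returns)     # regress value toward sampled returns
    loss.backward()                              # backpropagation replaces table updates
    optimizer.step()
    return loss.item()
```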

9.7 Case Studies

• Go is a two-person board game like chess. The complexity of a two-person board game
largely depends on the size of the board and the number of valid moves at each position.
• Go is far more complex than chess because it is played on a 19×19 board with a much
larger number of valid moves per position, making brute-force search infeasible.


• Players play with white or black stones, which are kept in bowls next to the Go board.
An example of a Go board is shown in Figure 9.7. The game starts with an empty board,
and it fills up as players put stones on the board. Black makes the first move and starts with
181 stones in her bowl, whereas white starts with 180 stones. The total number of junctions
is equal to the total number of stones in the bowls of the two players. A player places a
stone of her color in each move at a particular position (from the bowl), and does not move
it once it is placed. A stone of the opponent can be captured by encircling it. The objective
of the game is for the player to control a larger part of the board than her opponent by
encircling it with her stones.
• In chess, there are about 35 possible moves on average per position, while Go has
around 250, making its game tree exponentially larger and harder to search exhaustively.
• A typical game of Go is also deeper, averaging about 150 sequential moves compared to
around 80 for chess, further increasing the complexity of planning.
• Traditional chess engines use minimax search with pruning, evaluating positions using
heuristics about material and piece safety, a strategy that cannot scale to Go due to the vast
number of possible positions.
• The number of possible board states in Go at even modest depths (like 20 moves for
each player) exceeds the number of atoms in the observable universe, making brute-force
approaches impossible.
• Humans excel at Go by learning spatial patterns and using intuition rather than
exhaustively exploring move combinations, focusing on moves that are likely to increase
their advantage.
• AlphaGo mimics this human-like approach using reinforcement learning to learn
predictive patterns on the board, improving by playing both expert games and self-play.
• The board state in AlphaGo is encoded as multiple feature maps, including the current
positions, the number of moves since stones were placed, and other contextual information,
represented as 48 binary planes of 19×19 pixels.
• AlphaGo learns from its win-loss experience over repeated game playing. Its architecture includes a policy network that suggests good moves, a value network that evaluates board positions, and Monte Carlo Tree Search for final move selection, making it a multi-stage, highly effective system.

Policy Networks

• The policy network in AlphaGo takes a visual representation of the Go board as input and
produces the probability of placing a stone at each legal position, using a softmax activation
to generate these probabilities. Two separate policy networks were trained for this task:
one using supervised learning and the other using reinforcement learning.
• Both networks shared the same architecture, consisting of 13 convolutional layers with
ReLU activation. Most of these layers used 3×3 filters, while the first and last layers used
5×5 and 1×1 filters, respectively. Each layer had 192 filters with zero padding to maintain
spatial size, and no max pooling was used in order to preserve the spatial details of the
board.
• The supervised learning (SL-policy) network was trained on moves from expert human
players, treating these moves as always correct with a score of +1. The training optimized
the network using the log-likelihood of the chosen move’s probability, effectively imitating
expert strategies.
• The reinforcement learning (RL-policy) network was trained through self-play, where
the network played games against older versions of itself to create a pool of diverse
opponents. Each move was labeled with +1 for a win or −1 for a loss, allowing the network
to improve its strategy based on game outcomes.
• These policy networks became strong Go players on their own, and their performance was
further enhanced by combining them with Monte Carlo Tree Search to make even more
effective strategic decisions.
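A rough PyTorch sketch of the 13-layer convolutional policy network described above (48 input planes of 19×19, 192 filters per layer, a 5×5 first filter, 3×3 intermediate filters, a 1×1 final filter, no pooling, softmax over board positions). This is an approximation for illustration; details such as the exact output head differ from the published system.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Approximate sketch of the AlphaGo SL/RL policy network architecture."""
    def __init__(self, planes=48, filters=192, board=19):
        super().__init__()
        layers = [nn.Conv2d(planes, filters, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):                                 # eleven intermediate 3x3 layers
            layers += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, kernel_size=1)]    # 13th layer: 1x1 filter to one plane
        self.conv = nn.Sequential(*layers)                  # zero padding preserves the 19x19 size

    def forward(self, x):                                   # x: (batch, 48, 19, 19)
        logits = self.conv(x).flatten(1)                    # one logit per board position
        return torch.softmax(logits, dim=1)                 # probability of playing at each point
```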

Value Networks

• The value network is a convolutional neural network that takes the board state as input
and predicts a score between −1 and +1, where +1 indicates a certain win for the next player
to move.


• The output represents the expected outcome for the next player, so the input also includes
information about whether the next move is by the “player” or the “opponent” rather than
simply black or white.
• Its architecture is similar to the policy network, with early convolutional layers the
same, but with an extra convolutional layer at layer 12.
• After the final convolutional layer, the network includes a fully connected layer with 256
units and ReLU activation to process the features further.
• The final score is computed using a single tanh unit, ensuring the prediction lies within the
range [−1, +1].
• For training, the preferred approach was to generate data using self-play with the SL-
policy and RL-policy networks, playing full games to obtain reliable state-outcome pairs.
• To avoid overfitting, positions were sampled from different games rather than using many
positions from a single game, ensuring training examples were diverse and less correlated.

Monte Carlo Tree Search

• AlphaGo uses a modified version of Monte Carlo Tree Search (MCTS) with a
simplified exploration formula to balance exploration and exploitation during search.
• Unlike earlier versions that used only the RL-policy network for evaluating leaf nodes,
AlphaGo combines two evaluation methods for better accuracy.
• First, it performs fast Monte Carlo rollouts from leaf nodes to produce an evaluation e1.
• For these rollouts, instead of using the full policy network, AlphaGo uses a simplified
softmax classifier trained on a database of human games with additional hand-crafted
features to make rollouts faster.
• Second, the value network generates a separate evaluation e2 for each leaf node, providing
a learned estimate of position strength.
• The final evaluation at each leaf node is calculated as a convex combination of the two estimates, e = β·e1 + (1 − β)·e2, with β = 0.5 giving the best results.
• Interestingly, using only the value network for evaluation also produced similar
performance, showing it as a practical alternative.

• In the end, the move corresponding to the most visited branch in the Monte Carlo Tree
Search is selected as AlphaGo’s predicted move.

Alpha Zero: Enhancements to Zero Human Knowledge

• AlphaGo Zero improved on AlphaGo by removing the need for human expert moves and
eliminating the separate supervised-learning (SL) network entirely.
• Instead of using two separate networks, AlphaGo Zero employed a single neural network
that outputs both the policy (the probability distribution over moves) and the value
(predicted win probability for the position).
• The network is trained with a combined loss function, including cross-entropy loss for the
policy output and squared error loss for the value output, along with regularization.
• While the original AlphaGo used Monte Carlo Tree Search (MCTS) only for move
selection at play time, AlphaGo Zero integrated MCTS directly into training.
• During training, visit counts from MCTS act as improved targets for policy learning,
functioning as a policy improvement operator that refines the network’s suggested move
probabilities through lookahead search.
• The target policy π(s,a) is calculated from visit counts N(s,a) using a temperature-scaled
softmax, reflecting how often each move is chosen during tree search exploration.
• The prior probabilities p(s,a) from the neural network guide MCTS exploration, while
value estimates v(s) from the network evaluate newly expanded leaf nodes.
• MCTS repeatedly simulates games from a given state s, expanding the tree until new leaf
nodes or terminal states are reached, and updates Q-values and visit counts along the path
using the network’s evaluations.
• After many simulations from a position s, the resulting visit-count-based probabilities
π(s,a) are used to select moves in self-play games, producing new game data.
• Self-play games are played to completion, and the final game outcome (win or loss,
represented as z(s) in {−1, +1}) provides the ground-truth value for training.
• Each training example includes the board state s, the improved policy π(s,a) from MCTS,
and the ground-truth outcome z(s), forming the supervised target for network updates.

• This training instance is used to update the neural network parameters. Therefore, if the probability and value outputs of the neural network are p(s,a) and v(s), respectively, the loss for a neural network with weight vector W is as follows (a code sketch of this loss is given at the end of this subsection):

L = [v(s) − z(s)]² − Σₐ π(s,a)·log p(s,a) + λ·‖W‖²

Here, λ > 0 is the regularization parameter.


• This approach allows the system to bootstrap its own training targets purely from self-
play, without any human data or domain-specific knowledge.
• Alpha Zero extended this framework beyond Go, successfully mastering Go, chess, and
shogi using the same self-play reinforcement learning approach.
• Alpha Zero’s performance was exceptional—it defeated the best available software in these
games, such as Stockfish for chess and Elmo for shogi, surprising many who believed that
chess in particular required extensive human-crafted evaluation knowledge.
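A minimal sketch of the combined loss above, assuming the network returns move probabilities p(s,·) and a value v(s), and that π(s,·) are the MCTS visit-count targets and z(s) the final game outcome; the tensor shapes and the small epsilon inside the logarithm are implementation assumptions.

```python
import torch

def alphazero_loss(p, v, pi, z, params, lam=1e-4):
    """p: (batch, moves) predicted move probabilities, v: (batch,) predicted values,
    pi: (batch, moves) MCTS visit-count targets, z: (batch,) game outcomes in {-1, +1}."""
    value_loss = ((v - z) ** 2).mean()                            # squared error on the outcome
    policy_loss = -(pi * torch.log(p + 1e-8)).sum(dim=1).mean()   # cross-entropy against pi
    reg = lam * sum((w ** 2).sum() for w in params)               # L2 regularization on weights W
    return value_loss + policy_loss + reg
```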

Comments on Performance

• AlphaGo demonstrated exceptional performance against both computer and human


opponents, winning 494 out of 495 games against various computer programs.
• Even when handicapped by giving four free stones to its opponents, AlphaGo still achieved
impressive win rates of 77% against Crazy Stone, 86% against Zen, and 99% against
Pachi.
• It also defeated top human players, including the European champion, the World
champion, and the top-ranked Go player in the world.
• AlphaGo’s style of play was notable for its unconventional, creative moves that often
defied traditional Go strategies and only made sense in hindsight after its victories.
• Many of these surprising moves challenged conventional Go wisdom and revealed
innovative insights that AlphaGo developed through extensive self-play.
• As a result of playing against AlphaGo, some top human Go players reconsidered and
refined their own approach to the game, acknowledging the new ideas it introduced.

• Alpha Zero displayed similar impressive performance in chess, making strategic
material sacrifices to improve its position and restrict its opponent—behaviors often seen
in top human play.
• Unlike traditional chess engines that rely on hand-crafted evaluation functions, Alpha
Zero had no built-in assumptions about the value of pieces or king safety and learned all
strategies through self-play.
• Alpha Zero independently discovered well-known chess openings and developed its
own evaluations of which openings were more effective, showing an ability to generate
knowledge autonomously.
• A key difference of reinforcement learning from supervised learning is that it has the
ability to innovate beyond known knowledge through learning by reward-guided trial
and error.
• Such innovative behavior suggests promise for applying reinforcement learning
approaches to other complex problem domains beyond games.

9.7.2 Self-Learning Robots

• Self-learning robots use rewards to learn tasks: These robots rely on a reward-driven
approach to master activities like walking, repairs, or picking up objects without being
explicitly programmed with rules for every situation.
• Locomotion learning highlights the challenge: A robot designed to walk must figure out
the right movements to stay balanced and travel from point A to point B, something humans
do naturally but robots must learn through trial and error.
• Reinforcement learning suits these problems: Instead of giving the robot detailed
instructions, it receives rewards when it makes progress, letting it explore and discover
effective movement strategies on its own.
• Robots start without prior knowledge: The robot isn’t programmed with what walking
"should" look like; it only knows that moving successfully will earn rewards, making
learning adaptive and flexible.

Deep Learning of Locomotion Skills


• Virtual robots were trained in locomotion tasks using the MuJoCo physics engine, which
enables fast and accurate simulation without needing physical hardware.
• Both a humanoid and a quadruped robot were used. An example of the biped model is
shown in Figure 9.8. The advantage of this type of simulation is that a virtual environment is inexpensive to work with, and one avoids the safety and expense issues that arise from physical damage in an experimental setting that is likely to involve many mistakes and accidents. On the flip side, a physical model provides more realistic results. In general, a simulation can often be used for smaller-scale testing before building a physical model.

Figure 9.8: Example of the virtual humanoid robot.


• The humanoid model had 33 state dimensions and 10 actuated degrees of freedom, while
the quadruped model had 29 state dimensions and 8 actuated degrees of freedom.
• Rewards were given for forward progress, but episodes ended if the robot's center of mass
fell too low.
• Robot actions were controlled via joint torques.
• Input features included sensor data such as obstacle positions, joint positions, and angles,
which were fed into neural networks.
• Two neural networks were used: one for value estimation and another for policy

estimation, forming an actor-critic architecture.
• Both networks used a feed-forward design with three hidden layers of 100, 50, and 25
tanh units.
• The value network had a single output, while the policy network had outputs equal to the
number of actions, differing mainly in output layer and loss function.
• The approach combined Generalized Advantage Estimation (GAE) with Trust Region
Policy Optimization (TRPO).
• After 1000 reinforcement learning iterations, the robot learned to walk with a visually
pleasing gait.
• A video of the result is available, and Google DeepMind later released similar work with
added capabilities like obstacle avoidance.
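A sketch of the two feed-forward networks used for the locomotion tasks, each with hidden layers of 100, 50, and 25 tanh units as described above; the policy network outputs one value per action dimension and the value network a single scalar. The GAE/TRPO training loop itself is omitted, and the input/output dimensions below are the humanoid figures quoted in the notes, used here only as placeholders.

```python
import torch.nn as nn

def mlp(in_dim, out_dim):
    """Feed-forward network with 100-50-25 tanh hidden units."""
    return nn.Sequential(
        nn.Linear(in_dim, 100), nn.Tanh(),
        nn.Linear(100, 50), nn.Tanh(),
        nn.Linear(50, 25), nn.Tanh(),
        nn.Linear(25, out_dim),
    )

state_dim, action_dim = 33, 10           # humanoid: 33 state dimensions, 10 actuated DoF
policy_net = mlp(state_dim, action_dim)  # actor: outputs joint-torque actions
value_net = mlp(state_dim, 1)            # critic: outputs a single state-value estimate
```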

Deep Learning of Visuomotor Skills

• A robot was trained using reinforcement learning to perform household tasks such as
hanging a coat, inserting blocks into shapes, fitting a toy hammer under a nail, and screwing
a cap onto a bottle.
• The robot's actions were controlled through 7-dimensional joint torque commands,
requiring sequences of commands to complete tasks effectively.
• Training was done on an actual physical robot, which used a camera to detect and locate
objects for manipulation.
• The camera image acted as the robot’s "eyes," with a convolutional neural network (CNN)
processing visual input similarly to how the human visual cortex works.
• Although different from Atari video game environments, this setting was similar in using
CNNs on image frames to learn policy actions.
• Additional inputs such as robot and object positions were also used, making the problem
more complex.
• The tasks demanded advanced learning in visual perception, precise coordination, and
understanding contact dynamics, all of which had to be learned automatically.


Figure 9.9: Deep learning of visuomotor skills.

• A convolutional neural network (CNN) was used to map camera images to robot actions,
learning spatial features needed to achieve task-specific rewards (similar to Atari game
approaches).
• The CNN had 7 layers and about 92,000 parameters (86,000 in convolutional layers).
• The first three layers were convolutional, while the fourth was a spatial softmax layer
producing a probability distribution over spatial locations.
• The fifth layer transformed these distributions into 2D feature points via a soft argmax
mechanism, creating precise spatial representations suitable for control.
• These feature points were then concatenated with the robot’s configuration data (joint
angles, velocities, end-effector pose) after the convolution layers.
• The combined features were processed by two fully connected layers with 40 rectified units
each, followed by a linear output layer that predicted motor torques.
• Only camera images were input to the convolutional layers, while robot state data was
added at the first fully connected layer to better handle non-visual inputs.
• The architecture of the convolutional neural network is shown in Figure 9.9(b).
• Observations included RGB images, joint encoder readings, velocities, and end-effector
poses, with full robot states ranging from 14 to 32 dimensions (including joint angles,
object positions, and velocities).
• The network’s output represented the robot’s motor torques (actions) in the policy-based
learning framework.
• A guided policy search method was used to transform parts of the reinforcement learning
problem into supervised learning, simplifying training.
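A sketch of the spatial softmax and soft argmax steps described above: each convolutional feature map is converted into a probability distribution over image locations, and the expected (x, y) coordinate of that distribution becomes a 2D feature point suitable for control. This is a generic implementation of the idea, not the exact code used for the robot.

```python
import torch

def spatial_soft_argmax(features):
    """features: (batch, channels, H, W) conv activations -> (batch, channels, 2) feature points."""
    b, c, h, w = features.shape
    probs = torch.softmax(features.view(b, c, h * w), dim=-1).view(b, c, h, w)  # spatial softmax
    ys = torch.linspace(-1.0, 1.0, h).view(1, 1, h, 1)    # normalized pixel coordinates
    xs = torch.linspace(-1.0, 1.0, w).view(1, 1, 1, w)
    expected_x = (probs * xs).sum(dim=(2, 3))             # soft argmax: expected x per channel
    expected_y = (probs * ys).sum(dim=(2, 3))
    return torch.stack([expected_x, expected_y], dim=-1)  # 2D feature points fed to the FC layers
```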

9.7.3 Building Conversational Systems: Deep Learning for Chatbots

• Chatbots (conversational/dialog systems) aim for natural, human-like conversation across many topics.
• General-purpose systems like Apple’s Siri can handle broad topics but often fail on difficult
questions.
• Closed-domain systems focus on a specific task, making them easier to train reliably.
• Facebook built an end-to-end system for learning negotiation skills in a closed-domain setting.
• In the test-bed, two agents negotiate to split a collection of items (like books, hats, balls)
shown to both of them.
• Each agent values the items differently, but these values are private (they don’t know
each other’s assigned values), mirroring real-life negotiation uncertainty.
• The system ensures meaningful negotiation by imposing constraints:
o The total value of all items for each agent is fixed at 10.
o Every item has non-zero value for at least one agent.
o Some items have non-zero value for both agents.
• Because of these constraints, it is impossible for both agents to get the maximum score of 10 simultaneously, creating a competitive setting.
• Negotiations last up to 10 turns, after which agents may choose "no agreement," resulting
in 0 points for both.
• The reward function in the reinforcement learning framework is defined as the final value
of the items acquired by the agent.
• Initial attempts using pure supervised learning (maximizing likelihood of utterances with
recurrent networks) led to overly compromising agents.


• Therefore, the approach combines supervised learning and reinforcement learning to


balance linguistic fluency with negotiation effectiveness.
• Supervised learning helps the model stay close to human language patterns and avoid
divergence.
• A dialog roll-out technique (a planning method) was introduced to simulate and evaluate
possible future dialog sequences.
• The system uses an encoder-decoder recurrent neural network (based on sequence-to-
sequence learning) where the decoder is trained to maximize the negotiation reward
instead of merely mimicking human utterances.
• 5808 negotiation dialogs were collected via Amazon Mechanical Turk, across 2236
unique scenarios.
• A scenario is defined by specific item quantities and the value each agent assigns to
them.
• 252 scenarios (526 dialogs) were set aside for testing.
• Each scenario produces two training examples, one for each agent’s perspective.
• A concrete training example could be one in which the items to be divided among the two
agents correspond to 3 books, 2 hats, and 1 ball. These are part of the input to each agent.
The second input could be the value of each item to the agent, which are (i) Agent A:
book:1, hat:3, ball:1, and (ii) Agent B: book:2, hat:1, ball:2. Note that this means that agent
A should secretly try to get as many hats as possible in the negotiation, whereas agent B
should focus on books and balls. The training data also records the dialog exchanged by the two agents in each such scenario.
• In the example scenario, agent A ends up with 2 books and 2 hats, while agent B gets 1
book and 1 ball, reflecting their different goals and negotiation strategies.
• Each agent has its own inputs and outputs, so each scenario yields two training examples,
one for each agent’s perspective.
• Dialogs are represented as sequences of tokens that include speaker markers and a special
token signaling agreement.


• The supervised learning model uses four gated recurrent units (GRUs):
o One GRUg to encode input goals.
o One GRUq to generate dialog turns.
o A forward-output GRU and a backward-output GRU that together produce the final item choices as a bidirectional output.
• These GRUs are connected end-to-end and trained jointly.
• The supervised loss combines two parts: predicting dialog tokens and predicting final item
allocations.
• The same GRU architecture can be adapted for reinforcement learning by changing the loss
function.
• In reinforcement learning, the model acts as a policy network generating dialog roll-outs
using Monte Carlo sampling.
• Each sampled dialog (or action) is paired with its final reward, which is based on the
negotiated item values.
• Self-play is used so the agent negotiates with itself to improve its strategy, following the
REINFORCE algorithm.
• To avoid agents inventing unnatural language in self-play, one of the negotiating agents is
kept as a supervised model.
• For final prediction, rather than sampling directly, a two-stage approach is used:
o Multiple candidate utterances are sampled.
o The one with the highest expected reward (scaled by dialog probability) is selected.
• Observations showed that supervised models often conceded too quickly, while
reinforcement learning models negotiated more persistently.
• Reinforcement learning agents sometimes used human-like tactics, such as pretending to
value unimportant items to secure better deals elsewhere.
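A minimal sketch of the REINFORCE update used during self-play: a dialog is sampled from the policy, the final negotiation reward is observed, and the log-probabilities of the sampled tokens are scaled by the baseline-subtracted reward. The baseline and optimizer handling below are generic assumptions rather than the exact training details of the published system.

```python
import torch

def reinforce_update(log_probs, reward, baseline, optimizer):
    """log_probs: list of scalar log-probabilities of the tokens the agent actually produced
    in one sampled dialog roll-out; reward: final value of items obtained (0-10)."""
    advantage = reward - baseline                      # subtracting a baseline reduces variance
    loss = -advantage * torch.stack(log_probs).sum()   # REINFORCE: raise probability of rewarded dialogs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```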

9.7.4 Self-Driving Cars

• Like robot locomotion, the self-driving car’s goal in reinforcement learning is to travel
safely from point A to point B without accidents or incidents.


• The car uses various sensors (video, audio, proximity, motion) to observe its environment,
with the aim of driving safely in diverse conditions.
• Driving is difficult to specify with exact rules for every scenario, but it is easy to recognize
good driving — a setting well-suited to reinforcement learning.
• The text focuses on a simplified setup where a single front-facing camera is used, showing
that even limited sensing can enable significant progress with reinforcement learning.
• This approach was inspired by Pomerleau’s 1989 ALVINN system, with improvements
mainly due to increased data availability, greater computational power, and advances in
convolutional neural networks.
• Training data was collected on varied roads and conditions, primarily in central New
Jersey, and also on highways in Illinois, Michigan, Pennsylvania, and New York.
• Two additional front-mounted cameras were used during training (but not for final driving
decisions) to provide shifted and rotated views for data augmentation, helping the model
learn to recover from off-center positions.
• The neural network was trained to minimize the error between its predicted steering
commands and those given by a human driver.
• This method resembles supervised learning more than pure reinforcement learning, and is
specifically known as imitation learning, which is often used to overcome the "cold start"
problem in reinforcement learning systems.
• Imitation learning scenarios often look similar to reinforcement learning scenarios.
• A reinforcement learning approach could reward the car for autonomous progress and
penalize it for stalling or needing human intervention.
• A major challenge in applying reinforcement learning to self-driving cars is ensuring safety
during training.
• The convolutional neural network architecture is shown in Figure 9.10.
• The neural network architecture has 9 layers: 1 normalization layer, 5 convolutional layers,
and 3 fully connected layers.
• The first three convolutional layers use 5×5 filters with stride 2; the final two use 3×3 filters with stride 1.


• These convolutional layers feed into three fully connected layers, with the final output
being a control value representing the inverse turning radius.
• The network has approximately 27 million connections and 250,000 parameters.
• The system was tested in both simulation and on real roads, always with a human driver
ready to intervene.
• Human intervention was needed only 2% of the time during road tests, meaning the car
was autonomous for 98% of the time.
• A video demonstration of this autonomous driving is available in reference [611].
• Visualization of the trained network’s activation maps showed it learned to focus on image
features critical for driving.
• For unpaved roads, the activation maps highlighted the road outlines effectively.
• In forest environments, the activation maps were noisy because the network was not trained
to recognize irrelevant details like trees and leaves.
• Unlike networks trained on general datasets like ImageNet, which learn to recognize many
object types, this driving-focused network learns only features relevant to its goal of safe
driving.

Figure 9.10: The neural network architecture of the control system in the self-driving car.
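A rough PyTorch sketch of a nine-layer network of the kind described above (a normalization layer, five convolutional layers, and three fully connected layers ending in a single control output for the inverse turning radius), trained by minimizing the squared error against the human driver's steering command. The filter counts, layer widths, and use of batch normalization here are illustrative assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.BatchNorm2d(3),                                      # stand-in normalization layer
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),   # 5x5 convolutions, stride 2
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3), nn.ReLU(),            # 3x3 convolutions, stride 1
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100), nn.ReLU(),      # three fully connected layers
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1),                   # control output: inverse turning radius
        )

    def forward(self, image):
        return self.head(self.features(image))

# Imitation learning: regress toward the human driver's steering command for each camera frame.
model, loss_fn = SteeringNet(), nn.MSELoss()
```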


9.7.5 Inferring Neural Architectures with Reinforcement Learning

• Reinforcement learning can be used to automatically design the architecture of a neural network for a specific task, such as building a convolutional neural network to classify a dataset like CIFAR-10.
• The architecture of such a network depends on many hyperparameters, including:
o Number of filters
o Filter height and width
o Stride height and width
• These hyperparameters are interdependent, and the configuration of later layers depends
on the choices made in earlier layers.

Figure 9.11: The controller network for learning the convolutional architecture of the child
network. The controller network is trained with the REINFORCE algorithm.

• The reinforcement learning method uses a recurrent network as the controller to decide the
parameters of the convolutional network, which is also referred to as the child network.
• The overall architecture of the recurrent network is illustrated in Figure 9.11.
• The choice of a recurrent network is motivated by the sequential dependence between
different architectural parameters.
• The softmax classifier is used to predict each output as a token rather than a numerical
value.
• This token is then used as an input into the next layer, which is shown by the dashed lines
in Figure 9.11. The generation of the parameter as a token results in a discrete action space,
which is generally more common in reinforcement learning as compared to a continuous
action space.
• The performance of the child network on a validation set from CIFAR-10 is used to
produce the reward signal.
• To evaluate accuracy, the child network must be fully trained on the CIFAR-10 dataset,
making the process computationally expensive.
• The reward signal is used with the REINFORCE algorithm to train the controller
network, which acts as the policy network that generates a sequence of interdependent
architectural parameters.
• The number of layers in the child network is not fixed; it follows a schedule during
training.
• Early in training, the architectures are shallower with fewer layers, while the number of
layers increases gradually as training progresses.
• In this policy-gradient approach, the network trained with the reward signal is thus a recurrent controller network rather than a feed-forward network.
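A simplified sketch of the controller idea: a recurrent network emits one architectural token at a time (with each sampled token fed back as the next input), the sampled child architecture is trained and evaluated to obtain a reward, and the controller is updated with REINFORCE. The vocabulary, hidden size, start token, and baseline handling are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    """RNN controller that samples a sequence of architecture tokens (filter counts, sizes, strides, ...)."""
    def __init__(self, vocab_size, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)        # softmax over discrete parameter choices

    def sample(self, n_tokens):
        h = torch.zeros(1, self.rnn.hidden_size)
        token = torch.zeros(1, dtype=torch.long)        # assumed start token
        tokens, log_probs = [], []
        for _ in range(n_tokens):
            h = self.rnn(self.embed(token), h)
            dist = torch.distributions.Categorical(logits=self.out(h))
            token = dist.sample()                       # e.g., "64 filters", "3x3", "stride 1"
            tokens.append(token.item())
            log_probs.append(dist.log_prob(token))
        return tokens, torch.stack(log_probs).sum()

def controller_update(log_prob_sum, reward, baseline, optimizer):
    """REINFORCE step; reward = validation accuracy of the fully trained child network."""
    loss = -(reward - baseline) * log_prob_sum
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```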
