BAI701 - DLRL - Module 5 Notes
Module-5
• Humans learn through experience-driven, reward-guided trial and error instead of fixed
training data.
• Learning is shaped by both individual interactions with the environment and evolution over
generations.
• Herbert Simon’s ant hypothesis suggests that human behavior appears complex because of
the environment's complexity.
• Biological intelligence arises from simple reward-seeking behavior through environmental
interaction.
• Artificial intelligence aims to simulate this by adopting trial-and-error learning.
• Reinforcement learning is a reward-driven trial-and-error approach seeking to maximize
long-term rewards by interacting with the environment.
• RL is seen as a path toward general artificial intelligence.
• In video games (e.g., Atari 2600), deep learners use raw pixel input to predict actions,
improving through reward feedback to reach or exceed human-level performance.
• AlphaGo learned to play Go using human and self-play data, developing novel strategies,
defeating top players, and extending to games like chess.
• Self-driving cars can use RL by processing sensor feedback to make driving decisions,
with RL helping reduce errors compared to human drivers.
• Robots can learn locomotion through RL without being shown how to walk, instead
learning efficient ways to move from trial-and-error reward signals.
• RL is suitable for problems where outcomes are easy to evaluate but actions are hard to
specify in advance.
• The multi-armed bandit problem illustrates balancing exploration of new options and
exploiting known good choices.
• Unlike bandit problems with identical decisions, real-world tasks (like games or driving)
require state-sensitive decisions learned through RL.
• RL enables learning complex behaviours from simple reward feedback, with the
complexity emerging from interaction with the environment.
• A gambler repeatedly chooses between slot machines, trying to find which one has the
highest average reward.
• He must balance exploration (trying all machines to learn about them) with exploitation
(choosing the best-known machine for maximum reward).
• Rewards for each machine follow a fixed probability distribution and do not depend on any
state, making this a simplified form of reinforcement learning.
• Various multi-armed bandit strategies help manage the exploration–exploitation trade-
off effectively.
• These strategies are foundational for general reinforcement learning and are often used as
components in more complex systems.
• In the naive approach, the gambler first runs an exploration phase where each machine is
tried a fixed number of times.
• After this exploration, the gambler permanently switches to the machine with the highest
observed average payoff.
• At first glance, this strategy appears reasonable because it separates learning from using
the best option.
• A major drawback is that it’s hard to choose the right number of exploratory trials to
confidently identify the best machine.
• If too few trials are used, the estimate of the best machine may be inaccurate.
• Estimating payoffs is especially hard when big rewards are rare, needing many trials for
reliable estimates.
• Using many exploratory trials wastes effort on poor machines that won’t be used later.
Nishita, AI&ML, BGSCET 2
Deep Learning and Reinforcement Learning
• If the wrong machine is chosen at the end of exploration, the gambler is stuck with it
forever.
• This fixed approach is unrealistic in real-world problems where conditions can be uncertain
and changing.
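A minimal sketch of this naive explore-then-commit strategy, assuming a simple Bernoulli payoff simulator (the machine probabilities, the trial counts, and the pull() helper are illustrative assumptions, not part of the original text):

import random

# Hypothetical 3-armed Bernoulli bandit: each machine pays 1 with a fixed,
# unknown probability. The probabilities below are purely illustrative.
TRUE_PROBS = [0.2, 0.5, 0.3]

def pull(machine):
    return 1.0 if random.random() < TRUE_PROBS[machine] else 0.0

def explore_then_commit(num_machines=3, n_explore=50, n_total=10000):
    # Exploration phase: try every machine a fixed number of times.
    totals = [0.0] * num_machines
    for m in range(num_machines):
        for _ in range(n_explore):
            totals[m] += pull(m)
    # Commit: switch permanently to the machine with the best observed average.
    best = max(range(num_machines), key=lambda m: totals[m] / n_explore)
    reward = sum(totals)
    for _ in range(n_total - n_explore * num_machines):
        reward += pull(best)
    return best, reward

If n_explore is too small, the committed machine may not be the true best one; if it is too large, many plays are wasted on poor machines, which is exactly the drawback described above.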
• The ε-greedy algorithm interleaves exploration and exploitation throughout the game.
• With probability ε, the gambler explores by choosing a random machine.
• With probability (1 − ε), the gambler exploits by choosing the machine with the highest
average reward so far.
• This ensures that exploration continues indefinitely, preventing the gambler from getting
stuck in a suboptimal choice.
• The approach starts exploitation early, letting the gambler benefit from good choices
sooner.
• The parameter ε controls the trade-off between exploration and exploitation.
• Typical values might be ε = 0.1, but the best choice depends on the specific problem.
• A small ε value means more exploitation but slower learning about other machines.
• A large ε value means more exploration but potentially less reward overall.
• Tuning ε can be hard since the ideal balance depends on the problem’s dynamics.
• A common strategy is annealing, where ε starts large (more exploration) and gradually
decreases over time (more exploitation).
• Annealing helps the gambler explore thoroughly early on but focus on the best machine
later.
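A minimal ε-greedy sketch with linear annealing of ε; the pull function is the same kind of payoff simulator as in the previous sketch, and the schedule parameters are illustrative assumptions:

import random

def epsilon_greedy(pull, num_machines, n_rounds=10000,
                   eps_start=1.0, eps_end=0.05):
    counts = [0] * num_machines        # times each machine has been played
    means = [0.0] * num_machines       # running average payoff of each machine
    total = 0.0
    for t in range(n_rounds):
        # Anneal epsilon from eps_start down to eps_end over the run.
        eps = eps_start + (eps_end - eps_start) * t / max(1, n_rounds - 1)
        if random.random() < eps:
            m = random.randrange(num_machines)                     # explore
        else:
            m = max(range(num_machines), key=lambda i: means[i])   # exploit
        r = pull(m)
        counts[m] += 1
        means[m] += (r - means[m]) / counts[m]    # incremental average update
        total += r
    return means, total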
• The learning process helps the agent choose actions based on the inherent values of the
actions in different states.
• This general principle applies to all forms of reinforcement learning in biological
organisms, such as a mouse learning a path through a maze to earn a reward. The rewards
earned by the mouse depend on an entire sequence of actions, rather than on only the latest action.
• The interaction is recorded as a sequence of the form s₀ a₀ r₀ s₁ a₁ r₁ … sₜ aₜ rₜ …,
where sₜ is the state before performing action aₜ, and performing the action aₜ yields a
reward of rₜ and a transition to state sₜ₊₁.
• There is a notation difference in Sutton and Barto’s book, which uses rₜ₊₁ to denote the reward received after performing action aₜ in state sₜ.
• Infinite MDPs (e.g., continuous robot operation) don’t have a natural episode length and
are called non-episodic problems.
Examples:
In practice, a system state is often approximated rather than capturing the full environment.
For example, in Atari games, a fixed number of recent frames may represent the state.
• Self-Driving Car:
o State: Sensor inputs (e.g., LIDAR, camera, GPS).
o Actions: Steering, acceleration, braking.
o Reward: Designed to encourage safe and efficient driving.
• Defining state representations and reward functions requires careful design effort.
• After these are defined, reinforcement learning can work as a complete end-to-end
learning system.
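A generic interaction loop that produces the (state, action, reward, next state) tuples described above; env and agent are placeholders for any simulator and policy, assumed here purely for illustration:

def run_episode(env, agent, max_steps=1000):
    """Collect one episode as a list of (s_t, a_t, r_t, s_{t+1}) tuples."""
    trajectory = []
    state = env.reset()                              # s_0
    for _ in range(max_steps):
        action = agent.act(state)                    # a_t chosen in state s_t
        next_state, reward, done = env.step(action)  # environment returns r_t, s_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:                                     # episodic tasks terminate
            break
    return trajectory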
✓ When a reward is received (like winning a chess game), it's hard to know how much each
past action contributed.
✓ Rewards can also be probabilistic (like slot machines), making them hard to estimate
precisely.
✓ RL problems often have huge numbers of possible states (e.g., all chess positions).
✓ The system must generalize to make good decisions even in unseen states, which is where
deep learning helps.
✓ Choosing new, unexplored actions can help learn better strategies but may cost
performance in the short term.
✓ Only exploiting known actions can lead to suboptimal long-term results.
✓ As a concrete example, consider learning tic-tac-toe against a fixed opponent by maintaining a
table of (state, action) values that is updated by trial and error.
✓ Shorter wins are preferred when the discount factor γ < 1. Therefore, the unnormalized value of
action a in state s is increased by γ^(r−1) in case of a win and credited −γ^(r−1) in case of a loss after
r moves (including the current move). Draws are credited with 0. The discount also reflects
the fact that the significance of an action decays with time in real-world settings.
✓ Table Updates occur only after the entire game ends (offline update).
✓ Normalized values are computed by dividing the accumulated (unnormalized) value by the
number of times the (state, action) pair has been played.
• Exploration vs Exploitation:
✓ The policy is ε-greedy: with probability 1−ε, it picks the action with the highest normalized
value; with probability ε, it explores randomly.
• Learning Over Time: The table of state-action values improves as more games are played,
allowing the agent to adapt its strategy to the fixed opponent.
• Self-Play Option: The agent can train optimally by playing against itself. In self-play,
rewards for updates are assigned as −γ^(r−1) / 0 / +γ^(r−1) for loss/draw/win from the perspective of
the player making each move.
• Inference: During actual play, the agent chooses moves that have the highest learned
normalized value for the current state.
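A sketch of the table-based learner described above: ε-greedy selection over normalized values, and offline updates that credit ±γ^(r−1) after the game ends. The data structures and game interface are illustrative assumptions:

import random
from collections import defaultdict

GAMMA, EPS = 0.9, 0.1
value_sum = defaultdict(float)   # accumulated (unnormalized) value per (state, action)
visits = defaultdict(int)        # number of times each (state, action) was played

def choose_action(state, legal_actions):
    """Epsilon-greedy choice over normalized state-action values."""
    if random.random() < EPS:
        return random.choice(legal_actions)          # explore
    def normalized(a):
        n = visits[(state, a)]
        return value_sum[(state, a)] / n if n > 0 else 0.0
    return max(legal_actions, key=normalized)        # exploit

def update_after_game(moves, outcome):
    """Offline update once the game ends.
    moves: list of (state, action) pairs for one player, in order of play.
    outcome: +1 for a win, 0 for a draw, -1 for a loss (that player's view)."""
    for i, (state, action) in enumerate(moves):
        r = len(moves) - i                           # moves until game end, incl. this one
        value_sum[(state, action)] += outcome * GAMMA ** (r - 1)
        visits[(state, action)] += 1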
The overarching goal of the ε-greedy algorithm for tic-tac-toe was to learn the inherent long-
term value of each state-action pair, since the rewards are received long after valuable actions
are performed. The goal of the training process is to perform the value discovery task of
identifying which actions are truly beneficial in the long-term at a particular state. For example,
making a clever move in tic-tac-toe might set a trap, which eventually results in assured
victory. Examples of two such scenarios are shown in Figure 9.2(a) (although the trap on the
right is somewhat less obvious). Therefore, one needs to credit a strategically good move
favorably in the table of state-action pairs and not just the final winning move. The trial-and-error
technique based on the ε-greedy method will indeed assign high values to clever traps.
Examples of typical values from such a table are shown in Figure 9.2(b). The less obvious trap
of Figure 9.2(a) has a slightly lower value because moves assuring wins after longer periods
are discounted by γ < 1, and ε-greedy trial and error might have a harder time finding the win
after setting the trap. The main problem with this approach is that the number of states in many
reinforcement learning settings is too large to tabulate explicitly.
Figure 9.2(c): Positions from two different games between Alpha Zero (white) and Stockfish (black).
On the left, white sacrifices a pawn and concedes a passed pawn in order to trap black’s light-
square bishop behind black’s own pawns. This strategy eventually resulted in a victory for
white after many more moves than the horizon of a conventional chess-playing program like
Stockfish. In the second game on the right, white has sacrificed material to incrementally
cramp black to a position where all moves worsen the position. Incrementally improving
positional advantage is the hallmark of the very best human players rather than chess-playing
software like Stockfish, whose hand-crafted evaluations sometimes fail to accurately capture
subtle differences in positions. The neural network in reinforcement learning plays the role of the
table of state-action values in tic-tac-toe; in the table-centric approach, Monte Carlo
simulations are used to refine and remember the long-term values of seen
states. One learns about the value of a trap in tic-tac-toe only because previous Monte Carlo
simulations have experienced victory many times from that exact board position. In most
challenging settings like chess, one must generalize knowledge learned from prior experiences
to a state that the learner has not seen before. All forms of learning (including reinforcement
learning) are most useful when they are used to generalize known experiences to unknown
situations. In such cases, the table-centric forms of reinforcement learning are woefully
inadequate. Deep learning models serve the role of function approximators. Instead of learning
and tabulating the values of all moves in all positions (using reward-driven trial and error), one
learns the value of each move as a function of the input state, based on a trained model using
the outcomes of prior positions. Without this approach, reinforcement learning cannot be used
beyond toy settings like tic-tac-toe.
Although the aforementioned approach is too naive, a sophisticated system with Monte Carlo tree
search, known as Alpha Zero, has recently been trained to play chess. Two examples of positions
from different games in the match between Alpha Zero and a conventional chess program,
Stockfish-8.0, are provided in Figure 9.2(c) and discussed above.
• Go is a two-person board game like chess. The complexity of a two-person board game
largely depends on the size of the board and the number of valid moves at each position.
• Go is far more complex than chess because it is played on a 19×19 board with a much
larger number of valid moves per position, making brute-force search infeasible.
• Players play with white or black stones, which are kept in bowls next to the Go board.
An example of a Go board is shown in Figure 9.7. The game starts with an empty board,
and it fills up as players put stones on the board. Black makes the first move and starts with
181 stones in her bowl, whereas white starts with 180 stones. The total number of junctions
is equal to the total number of stones in the bowls of the two players. A player places a
stone of her color in each move at a particular position (from the bowl), and does not move
it once it is placed. A stone of the opponent can be captured by encircling it. The objective
of the game is for the player to control a larger part of the board than her opponent by
encircling it with her stones.
• In chess, there are about 35 possible moves on average per position, while Go has
around 250, making its game tree exponentially larger and harder to search exhaustively.
• A typical game of Go is also deeper, averaging about 150 sequential moves compared to
around 80 for chess, further increasing the complexity of planning.
• Traditional chess engines use minimax search with pruning, evaluating positions using
heuristics about material and piece safety, a strategy that cannot scale to Go due to the vast
number of possible positions.
• The number of possible board states in Go at even modest depths (like 20 moves for
each player) exceeds the number of atoms in the observable universe, making brute-force
approaches impossible.
• Humans excel at Go by learning spatial patterns and using intuition rather than
exhaustively exploring move combinations, focusing on moves that are likely to increase
their advantage.
• AlphaGo mimics this human-like approach using reinforcement learning to learn
predictive patterns on the board, improving by playing both expert games and self-play.
• The board state in AlphaGo is encoded as multiple feature maps, including the current
positions, the number of moves since stones were placed, and other contextual information,
represented as 48 binary planes of 19×19 pixels.
• AlphaGo learns from its win-loss experience over repeated game playing. Its architecture
includes a policy network that suggests good moves, a value network that evaluates board
positions, and a Monte Carlo tree search procedure that combines the two.
Policy Networks
• The policy network in AlphaGo takes a visual representation of the Go board as input and
produces the probability of placing a stone at each legal position, using a softmax activation
to generate these probabilities. Two separate policy networks were trained for this task:
one using supervised learning and the other using reinforcement learning.
• Both networks shared the same architecture, consisting of 13 convolutional layers with
ReLU activation. Most of these layers used 3×3 filters, while the first and last layers used
5×5 and 1×1 filters, respectively. Each layer had 192 filters with zero padding to maintain
spatial size, and no max pooling was used in order to preserve the spatial details of the
board.
• The supervised learning (SL-policy) network was trained on moves from expert human
players, treating these moves as always correct with a score of +1. The training optimized
the network using the log-likelihood of the chosen move’s probability, effectively imitating
expert strategies.
• The reinforcement learning (RL-policy) network was trained through self-play, where
the network played games against older versions of itself to create a pool of diverse
opponents. Each move was labeled with +1 for a win or −1 for a loss, allowing the network
to improve its strategy based on game outcomes.
• These policy networks became strong Go players on their own, and their performance was
further enhanced by combining them with Monte Carlo Tree Search to make even more
effective strategic decisions.
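A rough PyTorch sketch of the convolutional policy network described above (48 input planes, 13 convolutional layers with 192 filters, a 5×5 first layer, a 1×1 last layer, and a softmax over board points); any detail not stated in the notes is an assumption rather than the published architecture:

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, in_planes=48, filters=192):
        super().__init__()
        layers = [nn.Conv2d(in_planes, filters, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):   # layers 2-12: 3x3 filters, zero padded, no pooling
            layers += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, kernel_size=1)]   # layer 13: 1x1 filters
        self.net = nn.Sequential(*layers)

    def forward(self, x):                              # x: (batch, 48, 19, 19)
        logits = self.net(x).flatten(1)                # one logit per board point
        return torch.softmax(logits, dim=1)            # move probabilities

The SL-policy network would maximize the log-probability of expert moves; the RL-policy network reweights the same log-probability gradient by the game outcome (+1 or −1), as described above.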
Value Networks
• The value network is a convolutional neural network that takes the board state as input
and predicts a score between −1 and +1, where +1 indicates a certain win for the next player
to move.
• The output represents the expected outcome for the next player, so the input also includes
information about whether the next move is by the “player” or the “opponent” rather than
simply black or white.
• Its architecture is similar to the policy network, with early convolutional layers the
same, but with an extra convolutional layer at layer 12.
• After the final convolutional layer, the network includes a fully connected layer with 256
units and ReLU activation to process the features further.
• The final score is computed using a single tanh unit, ensuring the prediction lies within the
range [−1, +1].
• For training, the preferred approach was to generate data using self-play with the SL-
policy and RL-policy networks, playing full games to obtain reliable state-outcome pairs.
• To avoid overfitting, positions were sampled from different games rather than using many
positions from a single game, ensuring training examples were diverse and less correlated.
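A sketch of the value-network head described above (an extra convolutional layer, a 256-unit ReLU layer, and a single tanh output); the shape of the shared convolutional trunk follows the policy network sketch and is otherwise an assumption:

import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Maps board features to a predicted outcome in [-1, +1]."""
    def __init__(self, filters=192, board=19):
        super().__init__()
        self.extra_conv = nn.Conv2d(filters, 1, kernel_size=1)  # extra conv layer at depth 12
        self.fc = nn.Linear(board * board, 256)                 # 256 ReLU units
        self.out = nn.Linear(256, 1)                            # single tanh unit

    def forward(self, features):                # features: (batch, 192, 19, 19)
        x = torch.relu(self.extra_conv(features))
        x = torch.relu(self.fc(x.flatten(1)))
        return torch.tanh(self.out(x)).squeeze(1)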
• AlphaGo uses a modified version of Monte Carlo Tree Search (MCTS) with a
simplified exploration formula to balance exploration and exploitation during search.
• Unlike earlier versions that used only the RL-policy network for evaluating leaf nodes,
AlphaGo combines two evaluation methods for better accuracy.
• First, it performs fast Monte Carlo rollouts from leaf nodes to produce an evaluation e1.
• For these rollouts, instead of using the full policy network, AlphaGo uses a simplified
softmax classifier trained on a database of human games with additional hand-crafted
features to make rollouts faster.
• Second, the value network generates a separate evaluation e2 for each leaf node, providing
a learned estimate of position strength.
• The final evaluation at each leaf node is calculated as a convex combination of the two
estimates, e = β·e1 + (1 − β)·e2, with β = 0.5 giving the best results.
• Interestingly, using only the value network for evaluation also produced similar
performance, showing it as a practical alternative.
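The leaf evaluation described above reduces to a one-line convex combination of the rollout estimate e1 and the value-network estimate e2:

def leaf_evaluation(e1, e2, beta=0.5):
    """Combine fast-rollout estimate e1 and value-network estimate e2 (beta = 0.5)."""
    return beta * e1 + (1.0 - beta) * e2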
• AlphaGo Zero improved on AlphaGo by removing the need for human expert moves and
eliminating the separate supervised-learning (SL) network entirely.
• Instead of using two separate networks, AlphaGo Zero employed a single neural network
that outputs both the policy (the probability distribution over moves) and the value
(predicted win probability for the position).
• The network is trained with a combined loss function, including cross-entropy loss for the
policy output and squared error loss for the value output, along with regularization.
• While the original AlphaGo used Monte Carlo Tree Search (MCTS) only for move
selection at play time, AlphaGo Zero integrated MCTS directly into training.
• During training, visit counts from MCTS act as improved targets for policy learning,
functioning as a policy improvement operator that refines the network’s suggested move
probabilities through lookahead search.
• The target policy π(s,a) is calculated from visit counts N(s,a) using a temperature-scaled
softmax, reflecting how often each move is chosen during tree search exploration.
• The prior probabilities p(s,a) from the neural network guide MCTS exploration, while
value estimates v(s) from the network evaluate newly expanded leaf nodes.
• MCTS repeatedly simulates games from a given state s, expanding the tree until new leaf
nodes or terminal states are reached, and updates Q-values and visit counts along the path
using the network’s evaluations.
• After many simulations from a position s, the resulting visit-count-based probabilities
π(s,a) are used to select moves in self-play games, producing new game data.
• Self-play games are played to completion, and the final game outcome (win or loss,
represented as z(s) in {−1, +1}) provides the ground-truth value for training.
• Each training example includes the board state s, the improved policy π(s,a) from MCTS,
and the ground-truth outcome z(s), forming the supervised target for network updates.
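A sketch of the AlphaGo Zero training targets and combined loss described above: π(s, ·) from temperature-scaled visit counts, plus a cross-entropy term for the policy and a squared-error term for the value. The temperature τ and the weight-decay coefficient are illustrative assumptions:

import torch

def policy_target_from_visits(visit_counts, tau=1.0):
    """pi(s, a) proportional to N(s, a)^(1/tau), computed from MCTS visit counts."""
    counts = torch.as_tensor(visit_counts, dtype=torch.float32)
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()

def combined_loss(policy_logits, value_pred, pi_target, z, model, weight_decay=1e-4):
    # Cross-entropy between the MCTS-improved policy and the network's policy.
    log_probs = torch.log_softmax(policy_logits, dim=-1)
    policy_loss = -(pi_target * log_probs).sum(dim=-1).mean()
    # Squared error between the game outcome z in {-1, +1} and the predicted value.
    value_loss = ((value_pred - z) ** 2).mean()
    # L2 regularization over the network parameters.
    reg = weight_decay * sum((p ** 2).sum() for p in model.parameters())
    return policy_loss + value_loss + reg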
Comments on Performance
• Self-learning robots use rewards to learn tasks: These robots rely on a reward-driven
approach to master activities like walking, repairs, or picking up objects without being
explicitly programmed with rules for every situation.
• Locomotion learning highlights the challenge: A robot designed to walk must figure out
the right movements to stay balanced and travel from point A to point B, something humans
do naturally but robots must learn through trial and error.
• Reinforcement learning suits these problems: Instead of giving the robot detailed
instructions, it receives rewards when it makes progress, letting it explore and discover
effective movement strategies on its own.
• Robots start without prior knowledge: The robot isn’t programmed with what walking
"should" look like; it only knows that moving successfully will earn rewards, making
learning adaptive and flexible.
• Virtual robots were trained in locomotion tasks using the MuJoCo physics engine, which
enables fast and accurate simulation without needing physical hardware.
• Both a humanoid and a quadruped robot were used. An example of the biped model is
shown in Figure 9.8. The advantage of this type of simulation is that it is inexpensive, and it
avoids the safety risks and repair costs that arise when an error-prone learning process damages
a physical robot. On the flip side, a physical model provides more realistic results. In general, a
simulation can often be used for smaller-scale testing before building a physical model.
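A minimal random-policy rollout in a simulated MuJoCo locomotion task, using the Gymnasium interface; the environment name and the use of Gymnasium are assumptions for illustration, not part of the original experiments:

import gymnasium as gym

# Assumes the gymnasium and mujoco packages are installed; "Humanoid-v4"
# is one of Gymnasium's MuJoCo locomotion environments.
env = gym.make("Humanoid-v4")
obs, info = env.reset(seed=0)
episode_return = 0.0
for _ in range(1000):
    action = env.action_space.sample()        # placeholder for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward                  # reward encourages forward progress
    if terminated or truncated:               # e.g., the humanoid falls over
        obs, info = env.reset()
print("return of the random policy:", episode_return)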
• A robot was trained using reinforcement learning to perform household tasks such as
hanging a coat, inserting blocks into shapes, fitting a toy hammer under a nail, and screwing
a cap onto a bottle.
• The robot's actions were controlled through 7-dimensional joint torque commands,
requiring sequences of commands to complete tasks effectively.
• Training was done on an actual physical robot, which used a camera to detect and locate
objects for manipulation.
• The camera image acted as the robot’s "eyes," with a convolutional neural network (CNN)
processing visual input similarly to how the human visual cortex works.
• Although different from Atari video game environments, this setting was similar in using
CNNs on image frames to learn policy actions.
• Additional inputs such as robot and object positions were also used, making the problem
more complex.
• The tasks demanded advanced learning in visual perception, precise coordination, and
understanding contact dynamics, all of which had to be learned automatically.
• A convolutional neural network (CNN) was used to map camera images to robot actions,
learning spatial features needed to achieve task-specific rewards (similar to Atari game
approaches).
• The CNN had 7 layers and about 92,000 parameters (86,000 in convolutional layers).
• The first three layers were convolutional, while the fourth was a spatial softmax layer
producing a probability distribution over spatial locations.
• The fifth layer transformed these distributions into 2D feature points via a soft argmax
mechanism, creating precise spatial representations suitable for control.
• These feature points were then concatenated with the robot’s configuration data (joint
angles, velocities, end-effector pose) after the convolution layers.
• The combined features were processed by two fully connected layers with 40 rectified units
each, followed by a linear output layer that predicted motor torques.
• Only camera images were input to the convolutional layers, while robot state data was
added at the first fully connected layer to better handle non-visual inputs.
• The architecture of the convolutional neural network is shown in Figure 9.9(b).
• Observations included RGB images, joint encoder readings, velocities, and end-effector
poses, with full robot states ranging from 14 to 32 dimensions (including joint angles,
object positions, and velocities).
• The network’s output represented the robot’s motor torques (actions) in the policy-based
learning framework.
• A guided policy search method was used to transform parts of the reinforcement learning
problem into supervised learning, simplifying training.
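A sketch of the spatial softmax and soft argmax stages described above, followed by the concatenation with robot state and the small fully connected torque head; the feature-map sizes and input resolution are illustrative assumptions:

import torch
import torch.nn as nn

class SpatialSoftArgmax(nn.Module):
    """Convert each feature map into an expected (x, y) feature point."""
    def forward(self, features):               # features: (batch, C, H, W)
        b, c, h, w = features.shape
        probs = torch.softmax(features.view(b, c, h * w), dim=-1)   # spatial softmax
        ys, xs = torch.meshgrid(torch.linspace(-1.0, 1.0, h),
                                torch.linspace(-1.0, 1.0, w), indexing="ij")
        grid = torch.stack([xs.flatten(), ys.flatten()], dim=-1)    # (H*W, 2)
        points = probs @ grid                   # expected (x, y) per channel
        return points.flatten(1)                # 2D feature points, flattened

# Feature points are concatenated with the robot configuration (joint angles,
# velocities, end-effector pose) and mapped to 7 joint torques by two
# 40-unit fully connected layers, as described in the notes.
feature_points = SpatialSoftArgmax()(torch.randn(1, 32, 109, 109))
robot_state = torch.randn(1, 14)               # illustrative 14-dimensional state
torque_head = nn.Sequential(
    nn.Linear(feature_points.shape[1] + robot_state.shape[1], 40), nn.ReLU(),
    nn.Linear(40, 40), nn.ReLU(),
    nn.Linear(40, 7),                          # 7-dimensional joint torque command
)
torques = torque_head(torch.cat([feature_points, robot_state], dim=1))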
• In the negotiation task, two agents converse to divide a collection of items (such as books,
hats, and balls), with each agent assigning its own private values to the items.
• In the example scenario, agent A ends up with 2 books and 2 hats, while agent B gets 1
book and 1 ball, reflecting their different goals and negotiation strategies.
• Each agent has its own inputs and outputs, so each scenario yields two training examples,
one for each agent’s perspective.
• Dialogs are represented as sequences of tokens that include speaker markers and a special
token signaling agreement.
• The supervised learning model uses four gated recurrent units (GRUs):
o GRU_g to encode the input goals.
o GRU_q to generate the dialog turns.
o A forward output GRU and a backward output GRU that together produce the final
choices as a bidirectional output.
• These GRUs are connected end-to-end and trained jointly.
• The supervised loss combines two parts: predicting dialog tokens and predicting final item
allocations.
• The same GRU architecture can be adapted for reinforcement learning by changing the loss
function.
• In reinforcement learning, the model acts as a policy network generating dialog roll-outs
using Monte Carlo sampling.
• Each sampled dialog (or action) is paired with its final reward, which is based on the
negotiated item values.
• Self-play is used so the agent negotiates with itself to improve its strategy, following the
REINFORCE algorithm.
• To avoid agents inventing unnatural language in self-play, one of the negotiating agents is
kept as a supervised model.
• For final prediction, rather than sampling directly, a two-stage approach is used:
o Multiple candidate utterances are sampled.
o The one with the highest expected reward (scaled by dialog probability) is selected.
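A sketch of the two-stage decoding described above; sample_utterance and expected_reward are hypothetical helpers standing in for the GRU policy network and the reward estimator:

import math

def choose_utterance(model, dialog_state, num_candidates=16):
    """Sample candidates and keep the one with the best probability-scaled reward."""
    best, best_score = None, float("-inf")
    for _ in range(num_candidates):
        # Hypothetical helpers: sample_utterance returns (tokens, log_prob);
        # expected_reward estimates the negotiated value of continuing this way.
        utterance, log_prob = model.sample_utterance(dialog_state)
        score = model.expected_reward(dialog_state, utterance) * math.exp(log_prob)
        if score > best_score:
            best, best_score = utterance, score
    return best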
• Observations showed that supervised models often conceded too quickly, while
reinforcement learning models negotiated more persistently.
• Reinforcement learning agents sometimes used human-like tactics, such as pretending to
value unimportant items to secure better deals elsewhere.
• Like robot locomotion, the self-driving car’s goal in reinforcement learning is to travel
safely from point A to point B without accidents or incidents.
• The car uses various sensors (video, audio, proximity, motion) to observe its environment,
with the aim of driving safely in diverse conditions.
• Driving is difficult to specify with exact rules for every scenario, but it is easy to recognize
good driving — a setting well-suited to reinforcement learning.
• The text focuses on a simplified setup where a single front-facing camera is used, showing
that even limited sensing can enable significant progress with reinforcement learning.
• This approach was inspired by Pomerleau’s 1989 ALVINN system, with improvements
mainly due to increased data availability, greater computational power, and advances in
convolutional neural networks.
• Training data was collected on varied roads and conditions, primarily in central New
Jersey, and also on highways in Illinois, Michigan, Pennsylvania, and New York.
• Two additional front-mounted cameras were used during training (but not for final driving
decisions) to provide shifted and rotated views for data augmentation, helping the model
learn to recover from off-center positions.
• The neural network was trained to minimize the error between its predicted steering
commands and those given by a human driver.
• This method resembles supervised learning more than pure reinforcement learning, and is
specifically known as imitation learning, which is often used to overcome the "cold start"
problem in reinforcement learning systems.
• Imitation learning scenarios often look similar to reinforcement learning scenarios.
• A reinforcement learning approach could reward the car for autonomous progress and
penalize it for stalling or needing human intervention.
• A major challenge in applying reinforcement learning to self-driving cars is ensuring safety
during training.
• The convolutional neural network architecture is shown in Figure 9.10.
• The neural network architecture has 9 layers: 1 normalization layer, 5 convolutional layers,
and 3 fully connected layers.
• The first three convolutional layers use 5×5 filters with stride 2; the remaining two use
3×3 filters without stride.
• These convolutional layers feed into three fully connected layers, with the final output
being a control value representing the inverse turning radius.
• The network has approximately 27 million connections and 250,000 parameters.
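A rough PyTorch sketch of a network with this shape (a normalization step, five convolutional layers, three fully connected layers, and a single control output); the input resolution and layer widths are assumptions where the notes do not specify them:

import torch
import torch.nn as nn

class DrivingNet(nn.Module):
    """Maps a front-camera image to a single control value (inverse turning radius)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100), nn.ReLU(),   # flattened feature size inferred at first call
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1),                # inverse turning radius
        )

    def forward(self, image):                # image: (batch, 3, H, W), values in [0, 255]
        x = image / 255.0 - 0.5              # simple normalization layer
        return self.fc(self.conv(x))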
• The system was tested in both simulation and on real roads, always with a human driver
ready to intervene.
• Human intervention was needed only 2% of the time during road tests, meaning the car
was autonomous for 98% of the time.
• A video demonstration of this autonomous driving is available in reference [611].
• Visualization of the trained network’s activation maps showed it learned to focus on image
features critical for driving.
• For unpaved roads, the activation maps highlighted the road outlines effectively.
• In forest environments, the activation maps were noisy because the network was not trained
to recognize irrelevant details like trees and leaves.
• Unlike networks trained on general datasets like ImageNet, which learn to recognize many
object types, this driving-focused network learns only features relevant to its goal of safe
driving.
Figure 9.10: The neural network architecture of the control system in the self-driving car.
Figure 9.11: The controller network for learning the convolutional architecture of the child
network. The controller network is trained with the REINFORCE algorithm.
• The reinforcement learning method uses a recurrent network as the controller to decide the
parameters of the convolutional network, which is also referred to as the child network.
• The overall architecture of the recurrent network is illustrated in Figure 9.11.
• The choice of a recurrent network is motivated by the sequential dependence between
different architectural parameters.
• The softmax classifier is used to predict each output as a token rather than a numerical
value.
• This token is then used as an input into the next layer, which is shown by the dashed lines
in Figure 9.11. Generating each parameter as a token results in a discrete action space, which
allows the controller to be trained with the REINFORCE algorithm, using the child network’s
performance as the reward signal.
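A sketch of a recurrent controller that emits architectural parameters one token at a time, feeding each sampled token back in as the next input; the token vocabulary, hidden size, and number of steps are illustrative assumptions:

import torch
import torch.nn as nn

class ArchitectureController(nn.Module):
    """Recurrent controller that emits architecture parameters as discrete tokens."""
    def __init__(self, num_tokens=8, hidden=64, steps=12):
        super().__init__()
        self.embed = nn.Embedding(num_tokens, hidden)
        self.cell = nn.LSTMCell(hidden, hidden)
        self.head = nn.Linear(hidden, num_tokens)   # softmax over the discrete choices
        self.steps = steps

    def sample(self):
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        token = torch.zeros(1, dtype=torch.long)    # start token
        tokens, log_probs = [], []
        for _ in range(self.steps):                 # e.g., filter sizes, strides, filter counts
            h, c = self.cell(self.embed(token), (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            token = dist.sample()                   # sampled token is fed back as next input
            tokens.append(token.item())
            log_probs.append(dist.log_prob(token))
        return tokens, torch.stack(log_probs).sum()

Training would multiply the summed log-probability by the reward obtained from the child network built from the sampled tokens and follow the REINFORCE gradient.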