Introduction to Reinforcement Learning
Recap
- Generative AI Models and Applications:
- Text-to-Text and Text-to-Image models
- Architecture of Large Language Models (LLMs)
- Pre-training and fine-tuning of LLMs
- Reinforcement Learning with Human Feedback (RLHF) in fine-tuning LLMs
- Connection to this document’s focus:
- This document’s focus on Reinforcement Learning (RL) is broader than RLHF
- Will delve into underlying theory and diverse applications of RL
- How do you obtain your reward function in RLHF?
- Collect human feedback through comparisons of model outputs
- Human reviewers rank or compare outputs
- Preferences are used to train a reward model
- What do you use the reward function for?
- Fine-tune the LLM to generate preferred outputs
- Adjust model parameters to maximize the reward
- Encourage outputs that align with human preferences
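How a reward model is trained from ranked comparisons can be made concrete with a small sketch. The following is illustrative rather than from the original material: it shows the pairwise Bradley-Terry-style loss commonly used in RLHF, with hypothetical tensors standing in for a real reward model's outputs.
# Illustrative sketch: a pairwise preference loss for reward-model training.
# score_preferred / score_rejected are reward-model scores for the response a
# human reviewer preferred and the one they rejected (shape: [batch]).
import torch
import torch.nn.functional as F

def preference_loss(score_preferred, score_rejected):
    # Bradley-Terry style objective: maximize the probability that the
    # preferred response receives the higher reward-model score
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Hypothetical batch of reward-model scores
score_preferred = torch.tensor([1.2, 0.3, 2.0])
score_rejected = torch.tensor([0.8, 0.9, 1.5])
print(preference_loss(score_preferred, score_rejected))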
Agenda Overview
- Introduction to Reinforcement Learning
- Motivation and key concepts
- Tools for experimenting with RL (e.g., Gymnasium)
- Key ingredients: States, Actions, Rewards
- Running Example: Blackjack
- Introduction and rules of the game
- Teaching an agent to play using RL
- Markov Decision Processes (MDPs)
- Modeling sequential decision-making problems
- The fundamental modeling framework, presented before any learning algorithms
- Applying MDPs to Blackjack
- Deriving optimal strategies
- Challenges of explicit modeling
- Q-Learning
- Introduction to value-based RL methods
- Understanding the Q-Learning algorithm
- Applying Q-Learning to the Blackjack example
- Deep Q-Learning
- Extending Q-Learning to handle larger state spaces
- Conceptual differences from tabular Q-Learning
- Applications in complex environments (e.g., video games)
- Implementation details and code examples
Learning Objectives
- Understand Reinforcement Learning (RL):
- How RL differs from supervised and unsupervised learning
- The sequential and interactive nature of RL
- Comprehend Markov Decision Processes (MDPs):
- The modeling framework underlying RL methods
- Key components: states, actions, rewards, transition probabilities
- Learn Value-Based Methods:
- Grasp the concepts of Q-Learning
- Implement Q-Learning algorithms in code
- Apply Q-Learning to practical problems like Blackjack
- Explore Deep Q-Learning:
- Understand how Deep Q-Learning extends Q-Learning
- Address problems with large or continuous state spaces
- See real-world applications and code implementations
Reinforcement Learning Introduction
Thought Exercise: Dice Game
The considerations involved in devising a strategy for this game are essentially what Markov Decision Processes model.
- Game Rules:
- Roll a six-sided die
- If it lands on 1: The game ends and you receive nothing
- If it lands on 2-6:
- Choose to stop and receive N dollars (N = die number)
- Or choose to roll again
- Objective: Maximize winnings by deciding when to stop
- Strategic Considerations:
- Risk vs. Reward:
- Higher numbers offer better immediate rewards
- Rolling again risks landing on 1 and losing all winnings
- Decision-Making Over Time:
- Multiple opportunities to make choices
- Uncertainty about future die rolls
- Questions to Ponder:
- When is it optimal to stop?
- How does the probability of future outcomes affect current decisions?
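To make these questions concrete, here is a small sketch (not part of the original game description) that solves for the optimal stopping rule by iterating on the expected value of choosing to roll:
# Sketch: compute the expected value of choosing to roll in the dice game.
# A roll of 1 ends the game with nothing; on a face n in 2-6 we take the
# better of stopping (collect n dollars) or rolling again.
def value_of_rolling(iterations=100):
    v_roll = 0.0  # initial guess
    for _ in range(iterations):
        v_roll = sum(max(n, v_roll) for n in range(2, 7)) / 6.0
    return v_roll

v = value_of_rolling()
print(f"Value of rolling: {v:.2f}")  # converges to 3.75
print("Stop on:", [n for n in range(2, 7) if n >= v])  # [4, 5, 6]
Under these rules the value of rolling converges to 3.75, so the optimal policy is to roll again on a 2 or 3 and to stop on a 4 or higher.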
Applications of Reinforcement Learning
- Robotics:
- Teaching robots to perform tasks through interaction
- Example: Quadruped robot learning to walk
- Starts with no predefined walking strategy
- Learns by trial and error, receiving rewards for desirable behaviors
- Adapts to disturbances (e.g., being pushed)
- Games:
- Learning to play board games (e.g., Backgammon)
- RL agents can surpass human expertise
- Influence on human strategies and game understanding
- Video games and complex environments
- Other Fields:
- Finance: Portfolio management, trading strategies
- Public Policy: Potential applications with simulation models
- Challenges due to delayed feedback and complexity
Key Concepts of Reinforcement Learning
- Agent-Environment Interaction:
- Agent: Learns and makes decisions
- Environment: The system the agent interacts with
- Feedback Loop:
- Agent takes an action
- Environment provides state and reward
- Agent updates its strategy based on feedback
- Components:
- State: Observation of the environment at a given time
- Action: Decision made by the agent
- Reward: Feedback signal indicating success or failure
- Goal:
- Learn a policy that maximizes cumulative rewards over time
Differences from Other Machine Learning Paradigms
- Supervised Learning:
- Learning from labeled data
- Predicting outputs from inputs
- Unsupervised Learning:
- Finding patterns in unlabeled data
- Clustering, dimensionality reduction
- Reinforcement Learning:
- Learning from interactions with the environment
- No explicit labeled data
- Focus on sequential decision-making and long-term rewards
Types of Reinforcement Learning Methods
- Model-Free Methods:
- Do not require a model of the environment
- Value-Based Methods: (Covered in this document)
- Estimate the value of actions
- Example: Q-Learning
- Policy-Based Methods: (Covered in RL Part 2)
- Directly optimize the policy function
- Model-Based Methods:
- Build a model of the environment’s dynamics
- Plan actions using the model
- Multi-Agent RL:
- Multiple agents learning and interacting
- Coordination and competition dynamics
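As a preview of the value-based methods covered later in this document, the tabular Q-Learning update can be sketched as follows (the function name and the example transition are illustrative):
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, n_actions, alpha=0.1, gamma=0.99):
    # Q-Learning temporal-difference update:
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

Q = defaultdict(float)  # action-value table, keyed by (state, action)
# Hypothetical Blackjack-style transition: hit on 14 vs. dealer 10, reach 18
q_update(Q, state=(14, 10, 0), action=1, reward=0.0, next_state=(18, 10, 0), n_actions=2)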
Gymnasium
- Overview:
- A toolkit for developing and comparing RL algorithms
- Provides a variety of environments with a consistent interface
- Successor to OpenAI Gym, now maintained by the Farama Foundation
- Key Features:
- Common Interface:
- Easy to switch between different environments
- Simplifies testing algorithms across multiple settings
- Environment Categories:
- Classic Control: Simple physical systems (e.g., CartPole)
- Robotics: Complex simulations with multiple degrees of freedom
- Games: Simple games like Blackjack, Atari 2600 games
- Custom Environments: Users can create their own
- Example Environments:
- CartPole: Balance a pole on a moving cart
- Lunar Lander: Control a lander to touch down safely
- Blackjack: Card game simulation
- Atari Games: Classic games for complex RL tasks
- Basic Usage:
import gymnasium as gym
env = gym.make("LunarLander-v2", render_mode="human")
observation, info = env.reset(seed=42)
for _ in range(1000):
    action = env.action_space.sample()  # this is where you would insert your policy
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()
env.close()
Running Example: Blackjack Environment
Blackjack Rules Recap
- Objective: Get a hand total as close to 21 as possible without exceeding it
- Gameplay:
- Player and Dealer: Each starts with two cards
- Card Values:
- Number cards: Face value (2-10)
- Face cards (J, Q, K): 10
- Ace: 1 or 11 (player’s choice)
- Player Actions:
- Hit: Take another card
- Stand: End turn with current hand
- Dealer Actions:
- Reveals one card initially
- Plays after the player stands
- Must hit until reaching a certain total (usually 17)
- Winning Conditions:
- Player’s hand total is higher than dealer’s without exceeding 21
- Dealer busts (exceeds 21), player doesn’t
Blackjack Environment Code
Gymnasium (formerly OpenAI Gym) is a popular toolkit for testing reinforcement learning algorithms. It provides simulation environments for a variety of RL tasks and a simple common interface for interacting with the environments. In this notebook we will work with the Blackjack environment, which simulates the popular casino game Blackjack. We will introduce the basic mechanics of the Gymnasium Blackjack environment by manually playing a hand.
# First we will install the Gymnasium package
# !pip install gymnasium
import gymnasium as gym
import torch
import torch.nn.functional as F
import random
import numpy as np
from IPython import display
from collections import deque, OrderedDict
import matplotlib.pyplot as plt
"""
Here we interact directly with the Blackjack environment to get
a feel for how it works
"""
# Create the environment
env = gym.make("Blackjack-v1", render_mode="rgb_array")
# Deal the cards / sample an initial state
obs = env.reset()[0]
# Render a visualization in the notebook
plt.imshow(env.render())
plt.show()
print(obs)
# Loop as long as the hand has not finished
done = False
while not done:
    # Choose an action: 1 is hit, 0 is stand
    action = int(input("Hit (1) / Stand (0): "))
    # Provide the action to the environment and update the game state.
    # env.step returns five values; the three we care about here are:
    # - obs: The current state (or "observation", equivalent in this case)
    # - reward: The reward earned in the current step
    # - done: A boolean indicating whether the hand is done or in-progress
    obs, reward, done, truncated, info = env.step(action)
    # Render the updated state in the notebook
    display.clear_output(wait=True)
    plt.imshow(env.render())
    plt.show()
    print(obs, reward, done, truncated, info)
env.close()
A few notes to take away:
- We first created our environment with gym.make.
- We initialize the environment (deal the cards, in this case) with env.reset().
- Initializing the environment returns a game state (which we assign to the variable obs). The state is a tuple containing the information (player_current_total, dealer_card, usable_ace).
- We iterate over turns until the game terminates. In each turn we choose an action ("hit" or "stand").
- When we provide our selected action to the environment, env.step updates the state of the environment. env.step also provides a reward in each step. For this environment, the reward is 1.0 if we win the hand, -1.0 if we lose the hand, and 0.0 otherwise.
Interacting with the Blackjack Environment
- State Representation:
- Player’s Current Hand Total
- Dealer’s Visible Card
- Usable Ace Indicator: Whether the player has an ace counted as 11
- Actions:
- Hit (1): Take another card
- Stand (0): Keep current hand
- Rewards:
- +1: Player wins
- 0: Draw
- -1: Player loses
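The environment's declared spaces match this description. A quick sketch to inspect them (the printed values reflect Gymnasium's Blackjack-v1):
# Quick sketch: inspect the Blackjack-v1 state and action spaces
import gymnasium as gym

env = gym.make("Blackjack-v1")
print(env.observation_space)  # Tuple(Discrete(32), Discrete(11), Discrete(2))
print(env.action_space)       # Discrete(2): 0 = stand, 1 = hit
env.close()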
Implementing a Simple Policy
The aim of Reinforcement Learning is to learn effective strategies for automatically selecting an action at each time step based on the current state of the environment. A rule for selecting an action based on the current state is known as a policy. Here, we will manually create a simple heuristic policy and demonstrate how it controls the environment. We will evaluate this policy by playing 50,000 hands of Blackjack and counting the fraction of hands won.
def simple_policy(state):
    """
    This simple policy always hits (draws another card) if the total value of
    the player's hand is less than 17, and stands if the value of the player's
    hand is greater than or equal to 17.
    """
    # The first component of the state is the player's hand total.
    # If that is less than 17, hit. Otherwise stand.
    if state[0] < 17:
        return 1
    else:
        return 0
Running Simulations
- Purpose: Evaluate the effectiveness of the policy
- Simulation Steps:
- Initialize Environment:
- Loop Over Episodes:
- Apply policy to decide actions
- Collect rewards and track outcomes
- Collect Statistics:
- Calculate win rate
- Analyze policy performance
- Sample Results:
- Simple policy might achieve a win rate around 41%
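A sketch of this simulation loop, using the simple_policy defined above (exact win rates vary from run to run):
# Evaluate simple_policy over many hands and report the win rate
import gymnasium as gym

env = gym.make("Blackjack-v1")
num_episodes = 50_000
wins = 0
for _ in range(num_episodes):
    obs, info = env.reset()
    done = False
    while not done:
        obs, reward, terminated, truncated, info = env.step(simple_policy(obs))
        done = terminated or truncated
    if reward > 0:
        wins += 1
env.close()
print(f"Win rate: {wins / num_episodes:.3f}")  # typically around 0.41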
Key Elements in Reinforcement Learning
States
- Definition: Information that captures the current situation
- Characteristics:
- Must include all relevant details to make optimal decisions
- Should summarize past and present information necessary for future predictions
- In Blackjack:
- Hand Total: Sum of card values
- Dealer’s Visible Card: Provides context on dealer’s potential hand
- Usable Ace: Flexibility in counting an ace as 1 or 11
Actions
- Definition: Choices available to the agent at each decision point
- In Blackjack:
- Hit (1): Request another card
- Stand (0): End turn
Rewards
- Definition: Feedback signal indicating the result of an action
- Purpose: Guides the agent toward desirable outcomes
- In Blackjack:
- +1: Winning the hand
- -1: Losing the hand
- 0: Game in progress or a draw
Connecting Back to the Dice Game
Identifying States, Actions, and Rewards
- States:
- The current number showing on the die
- Potentially a special state indicating the game is over
- Actions:
- Roll Again: Choose to roll the die one more time
- Stop: Choose to stop and collect the current payout
- Rewards:
- Immediate Reward: Amount received when choosing to stop
- Zero Reward: If choosing to roll again or if the game continues
- Game Over Without Reward: If a 1 is rolled
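Putting these pieces together, a short Monte Carlo sketch (not from the original text) compares threshold policies for the dice game directly:
# Simulate the dice game under "stop when the die shows >= threshold" policies
import random

def play_hand(threshold):
    while True:
        face = random.randint(1, 6)
        if face == 1:
            return 0     # rolled a 1: game over, no payout
        if face >= threshold:
            return face  # stop and collect the face value

n = 100_000
for threshold in range(2, 7):
    avg = sum(play_hand(threshold) for _ in range(n)) / n
    print(f"stop on >= {threshold}: average payout {avg:.2f}")
The threshold of 4 should come out best, with an average payout near 3.75, matching the expected-value reasoning from the earlier dice-game sketch.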