Policy Gradient Methods: Learning Actions Directly
Value-based RL asks "what's this state worth?" Policy gradients ask "what should I do here?"
The Two Paradigms
Value-Based Methods (DQN, etc.)
- Learn Q(s, a): expected return from taking action a in state s (and following the policy afterward)
- Policy is implicit: pick argmax Q(s, a)
- Works well for discrete actions
Policy-Based Methods
- Learn pi(a|s) directly: probability of action a in state s
- Optimize the policy parameters to maximize expected return
- Works for continuous actions, stochastic policies
# Value-based: implicit policy
def value_based_policy(state, q_network):
    q_values = q_network(state)       # [batch, num_actions]
    return q_values.argmax(dim=-1)    # Pick best action

# Policy-based: explicit policy
def policy_based_action(state, policy_network):
    action_probs = policy_network(state)  # [batch, num_actions]
    dist = torch.distributions.Categorical(action_probs)
    return dist.sample()                  # Sample from distribution
Why Policy Gradients?
Advantages over value-based:
- Continuous actions: Q-learning needs an argmax over actions, which is intractable in a continuous action space (see the Gaussian-policy sketch below)
- Stochastic policies: Sometimes the optimal strategy is random (rock-paper-scissors)
- Smoother optimization: Small parameter changes = small behavior changes
- Better convergence guarantees: Gradient ascent on a well-defined objective, converging at least to a local optimum
The tradeoff:
- Higher variance (sampling actions introduces noise)
- Often less sample-efficient than value-based
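To make the continuous-action point concrete, here is a minimal sketch (the GaussianPolicy class and its names are illustrative, not part of this lesson's code): the network outputs the mean of a Gaussian over actions with a learned standard deviation, so sampling and log-probabilities work just like the discrete case, with no argmax anywhere.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Illustrative continuous-action policy: outputs a Normal distribution over actions."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),             # mean of each action dimension
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned, state-independent std

    def forward(self, state):
        mean = self.net(state)
        return torch.distributions.Normal(mean, self.log_std.exp())

# Usage: sample an action and get its log-probability, exactly as in the discrete case
# dist = policy(state); action = dist.sample(); log_prob = dist.log_prob(action).sum(-1)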
The Objective Function
We want to maximize expected return:
# The RL objective
J(theta) = E[sum of rewards when following policy pi_theta]
# In code:
def rl_objective(policy, env, num_episodes=100):
    """Estimate expected return by sampling episodes."""
    total_return = 0
    for _ in range(num_episodes):
        state = env.reset()
        episode_return = 0
        done = False
        while not done:
            action = policy.sample_action(state)
            state, reward, done, _ = env.step(action)
            episode_return += reward
        total_return += episode_return
    return total_return / num_episodes
Goal: Find parameters theta that maximize J(theta).
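In other words, we want to do gradient ascent on J. Here is a minimal conceptual sketch (the grad_J_estimate callable is a hypothetical stand-in for whatever estimator we build next):

def gradient_ascent(theta, grad_J_estimate, lr=1e-3, num_steps=1000):
    """Generic gradient ascent: theta is a parameter tensor, grad_J_estimate(theta)
    returns an estimate of nabla J(theta). The policy gradient theorem (next section)
    is what makes such an estimate computable from sampled episodes."""
    for _ in range(num_steps):
        theta = theta + lr * grad_J_estimate(theta)  # ascend: we are maximizing J
    return theta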
The Policy Gradient Theorem
The key insight: we can compute gradients of expected return.
The theorem:
nabla J(theta) = E[ sum_t nabla log pi(a_t|s_t) * G_t ]
Where G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... is the (discounted) return from timestep t onward.
In plain English:
- Increase probability of actions that led to high returns
- Decrease probability of actions that led to low returns
- Weight by how good/bad the outcome was
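Before the derivation, the theorem can be checked numerically on a one-step "bandit" problem, where the exact gradient is easy to compute. This toy sketch (illustrative, not from the lesson) compares the Monte Carlo estimate E[nabla log pi(a) * r(a)] with the exact gradient of the expected reward:

import torch

# One-step bandit: 3 actions with fixed rewards
rewards = torch.tensor([1.0, 2.0, 3.0])
theta = torch.zeros(3, requires_grad=True)   # policy logits

# Exact gradient of J(theta) = sum_a pi_theta(a) * r(a), via autograd
probs = torch.softmax(theta, dim=0)
J = (probs * rewards).sum()
exact_grad, = torch.autograd.grad(J, theta)

# Score-function (policy gradient) estimate: average of nabla log pi(a) * r(a)
dist = torch.distributions.Categorical(logits=theta)
actions = dist.sample((50_000,))
surrogate = (dist.log_prob(actions) * rewards[actions]).mean()
mc_grad, = torch.autograd.grad(surrogate, theta)

print(exact_grad)  # analytic gradient of expected reward
print(mc_grad)     # Monte Carlo estimate - should be close to exact_grad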
Deriving the Gradient (Tyla's Deep Dive)
The derivation uses the "log derivative trick":
# We want: nabla E[R] where R depends on actions sampled from pi
#
# Step 1: Write expectation as integral
# E[R] = sum over trajectories: P(trajectory) * R(trajectory)
#
# Step 2: Take gradient
# nabla E[R] = sum: nabla P(tau) * R(tau)
#
# Step 3: Log trick: nabla P = P * nabla log P
# nabla E[R] = sum: P(tau) * nabla log P(tau) * R(tau)
# = E[ nabla log P(tau) * R(tau) ]
#
# Step 4: log P(trajectory) = sum of log pi(a_t|s_t)
# (transitions don't depend on theta)
def policy_gradient_estimate(policy, trajectory):
    """
    Estimate gradient from a single trajectory.
    trajectory: list of (state, action, reward) tuples
    """
    log_probs = []
    returns = []
    # Compute log probabilities
    for state, action, _ in trajectory:
        log_prob = policy.log_prob(state, action)
        log_probs.append(log_prob)
    # Compute returns (reward-to-go)
    G = 0
    for _, _, reward in reversed(trajectory):
        G = reward + 0.99 * G  # gamma = 0.99
        returns.insert(0, G)
    returns = torch.tensor(returns)
    log_probs = torch.stack(log_probs)
    # Policy gradient: sum of log_prob * return
    loss = -(log_probs * returns).sum()  # Negative for gradient ascent
    return loss
REINFORCE Algorithm
The simplest policy gradient algorithm:
class REINFORCE:
    def __init__(self, policy_network, lr=1e-3, gamma=0.99):
        self.policy = policy_network
        self.optimizer = torch.optim.Adam(policy_network.parameters(), lr=lr)
        self.gamma = gamma

    def select_action(self, state):
        """Sample action from policy."""
        state = torch.FloatTensor(state).unsqueeze(0)
        probs = self.policy(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)

    def compute_returns(self, rewards):
        """Compute discounted returns for each timestep."""
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        return torch.tensor(returns)

    def update(self, log_probs, rewards):
        """Update policy using collected episode."""
        returns = self.compute_returns(rewards)
        # Normalize returns (helps with stability)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Policy gradient loss
        loss = []
        for log_prob, G in zip(log_probs, returns):
            loss.append(-log_prob * G)
        loss = torch.stack(loss).sum()
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
Training Loop
def train_reinforce(env, agent, num_episodes=1000):
    """Train REINFORCE agent."""
    episode_rewards = []
    for episode in range(num_episodes):
        state = env.reset()
        log_probs = []
        rewards = []
        done = False
        # Collect episode
        while not done:
            action, log_prob = agent.select_action(state)
            next_state, reward, done, _ = env.step(action)
            log_probs.append(log_prob)
            rewards.append(reward)
            state = next_state
        # Update policy
        loss = agent.update(log_probs, rewards)
        episode_rewards.append(sum(rewards))
        if episode % 100 == 0:
            avg_reward = sum(episode_rewards[-100:]) / min(100, len(episode_rewards))
            print(f"Episode {episode}, Avg Reward: {avg_reward:.2f}")
    return episode_rewards
The Variance Problem
REINFORCE has high variance. Why?
# The gradient estimate:
# nabla J = E[ log_prob * return ]
#
# Problem 1: Returns can be huge or tiny
# Episode 1: Return = 200
# Episode 2: Return = 50
# Episode 3: Return = 300
#
# The gradient bounces around wildly!
# Problem 2: Credit assignment
# In a 100-step episode, did step 1's action matter for the final reward?
# REINFORCE (with the full-episode return) doesn't know - it assigns the same credit to every action
High variance = slow learning, unstable training.
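To make this concrete, here is a toy check (illustrative, reusing the one-step bandit from the earlier sketch): single-sample gradient estimates are unbiased, but their spread is far larger than their mean.

import torch

# Single-sample REINFORCE estimates of the same gradient, on the 3-armed bandit
rewards = torch.tensor([1.0, 2.0, 3.0])
theta = torch.zeros(3, requires_grad=True)
dist = torch.distributions.Categorical(logits=theta)

grads = []
for _ in range(2000):
    a = dist.sample()
    surrogate = dist.log_prob(a) * rewards[a]   # one-sample gradient estimate
    g, = torch.autograd.grad(surrogate, theta, retain_graph=True)
    grads.append(g)
grads = torch.stack(grads)

print(grads.mean(dim=0))  # close to the true gradient, roughly (-0.33, 0.00, 0.33)
print(grads.std(dim=0))   # far larger than the mean: single estimates are very noisy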
Baseline Subtraction
Key insight: we can subtract any function of state without changing the gradient.
# Original: nabla J = E[ nabla log pi(a|s) * G ]
# With baseline: nabla J = E[ nabla log pi(a|s) * (G - b(s)) ]
#
# Why does this work?
# E[ nabla log pi(a|s) * b(s) ] = 0
# (because sum_a pi(a|s) * nabla log pi(a|s) = sum_a nabla pi(a|s) = nabla sum_a pi(a|s) = nabla 1 = 0,
#  and b(s) just scales this zero)
class REINFORCEWithBaseline:
    def __init__(self, policy_network, value_network, lr=1e-3, gamma=0.99):
        self.policy = policy_network
        self.value = value_network  # The baseline
        self.policy_optimizer = torch.optim.Adam(policy_network.parameters(), lr=lr)
        self.value_optimizer = torch.optim.Adam(value_network.parameters(), lr=lr)
        self.gamma = gamma

    # select_action is identical to REINFORCE.select_action above

    def compute_returns(self, rewards):
        """Compute discounted returns (same as in REINFORCE)."""
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        return torch.tensor(returns)

    def update(self, states, log_probs, rewards):
        """Update with baseline subtraction."""
        returns = self.compute_returns(rewards)
        states = torch.FloatTensor(states)
        # Baseline: learned value function
        values = self.value(states).squeeze(-1)
        # Advantage = Return - Baseline
        advantages = returns - values.detach()
        # Policy loss (gradient ascent on advantage-weighted log probs)
        # cat (not stack) gives shape [T], matching advantages
        policy_loss = -(torch.cat(log_probs) * advantages).sum()
        # Value loss (MSE between predicted and actual returns)
        value_loss = F.mse_loss(values, returns)
        # Update policy
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()
        # Update baseline
        self.value_optimizer.zero_grad()
        value_loss.backward()
        self.value_optimizer.step()
        return policy_loss.item(), value_loss.item()
Why Baseline Helps
# Without baseline:
# Action A led to return 100 -> increase probability a lot
# Action B led to return 95 -> increase probability almost as much
#
# But maybe average return is 90!
# A was great (+10 above average)
# B was okay (+5 above average)
# With baseline (average return):
# Action A: advantage = 100 - 90 = +10 -> increase a lot
# Action B: advantage = 95 - 90 = +5 -> increase a little
# Action C: advantage = 80 - 90 = -10 -> DECREASE probability
# The baseline centers the signal, reducing variance
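This claim can also be checked numerically. In the toy sketch below (illustrative, same 3-armed bandit as before), subtracting the expected return as a baseline leaves the mean gradient unchanged but shrinks its standard deviation considerably:

import torch

rewards = torch.tensor([1.0, 2.0, 3.0])
theta = torch.zeros(3, requires_grad=True)
dist = torch.distributions.Categorical(logits=theta)
baseline = (dist.probs * rewards).sum().detach()   # expected return under the current policy

def one_sample_grad(weight_fn):
    a = dist.sample()
    surrogate = dist.log_prob(a) * weight_fn(a)
    g, = torch.autograd.grad(surrogate, theta, retain_graph=True)
    return g

plain    = torch.stack([one_sample_grad(lambda a: rewards[a]) for _ in range(2000)])
centered = torch.stack([one_sample_grad(lambda a: rewards[a] - baseline) for _ in range(2000)])

print(plain.mean(0), centered.mean(0))  # means agree: the baseline does not bias the gradient
print(plain.std(0), centered.std(0))    # centered estimates have noticeably smaller std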
Other Variance Reduction Techniques
# 1. Reward normalization
returns = (returns - returns.mean()) / (returns.std() + 1e-8)
# 2. Reward-to-go (only future rewards matter)
def reward_to_go(rewards, gamma):
    """Use only future rewards, not past."""
    rtg = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        rtg.insert(0, G)
    return rtg
# 3. Discount factor (earlier actions matter more)
# Already built into return computation
# 4. Multiple parallel environments
def parallel_collect(envs, policy, steps):
    """Collect from many envs at once - averages out variance."""
    pass
# 5. Entropy bonus (encourages exploration, smooths gradients)
def entropy_bonus(probs, coefficient=0.01):
    # Clamp to avoid log(0) when the policy puts zero mass on an action
    entropy = -(probs * probs.clamp(min=1e-8).log()).sum(dim=-1)
    return coefficient * entropy.mean()
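As a usage note (this combination is an assumption about how the pieces fit together, not code from the lesson): the bonus is subtracted from the policy loss, since the loss is the negative of the objective, so higher entropy lowers the loss and encourages exploration.

import torch

probs = torch.tensor([[0.7, 0.3]])        # example action probabilities from the policy
policy_loss = torch.tensor(1.5)           # stand-in for the REINFORCE loss computed above
total_loss = policy_loss - entropy_bonus(probs)   # higher entropy -> lower loss -> more exploration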
Putting It Together: Complete REINFORCE
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.softmax(self.fc3(x), dim=-1)
        return x

class ValueNetwork(nn.Module):
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
# Usage
policy = PolicyNetwork(state_dim=4, action_dim=2)
value = ValueNetwork(state_dim=4)
agent = REINFORCEWithBaseline(policy, value)

# Train on CartPole
import gym
env = gym.make('CartPole-v1')
# Note: train_reinforce above collects only log_probs and rewards; extend it to also
# record the visited states and call agent.update(states, log_probs, rewards)
# when using the baseline agent.
rewards = train_reinforce(env, agent, num_episodes=1000)
Capstone Connection
Policy gradients are the foundation of RLHF.
In RLHF for language models:
- State = conversation history + current tokens
- Action = next token to generate
- Policy = the language model itself!
- Reward = human preference signal (via reward model)
# RLHF uses policy gradients!
#
# The LM is a policy: pi(next_token | previous_tokens)
# We optimize: E[ reward_model(full_response) ]
#
# Using policy gradients:
# nabla J = E[ nabla log pi(response) * reward(response) ]
#
# At its core, this is REINFORCE applied to text generation!
def rlhf_gradient_sketch(lm, reward_model, prompt, response):
    """
    Sketch of RLHF gradient computation.
    """
    # Get log probability of the response
    log_prob = lm.log_prob(response, given=prompt)
    # Get reward from reward model
    reward = reward_model(prompt + response)
    # Policy gradient
    loss = -log_prob * reward
    return loss
Sycophancy connection:
- If the reward model was trained on human preferences that reward agreement...
- Policy gradient will increase probability of agreeable responses
- Even when those responses are wrong!
Understanding policy gradients = understanding how RLHF can go wrong.
🎓 Tyla's Exercise
1. Derive the policy gradient theorem from first principles. Start with J(theta) = E[R] and show how to get nabla J = E[nabla log pi * R].
2. Prove that baseline subtraction doesn't change the expected gradient. Show that E[nabla log pi(a|s) * b(s)] = 0 for any function b that doesn't depend on a.
3. Analyze the variance of the REINFORCE estimator. How does variance scale with episode length? With return magnitude?
4. Optimal baseline: Derive the variance-minimizing baseline, and explain why the value function b(s) = E[G | s] is the standard (approximately optimal) choice in practice.
💻 Aaliyah's Exercise
Implement REINFORCE from scratch and train it on CartPole:
def build_reinforce_agent(state_dim, action_dim):
    """
    Build a complete REINFORCE agent with:
    1. Policy network (2 hidden layers, softmax output)
    2. Value network for baseline
    3. Training loop with logging
    4. Return normalization
    5. Entropy bonus
    """
    # Your implementation here
    pass

def train_and_plot(env_name='CartPole-v1', num_episodes=1000):
    """
    1. Create environment and agent
    2. Train for num_episodes
    3. Plot learning curve
    4. Compare with and without baseline
    5. Print final average reward
    """
    pass
# Bonus: Implement and compare these variance reduction techniques:
# - No baseline
# - Mean return baseline (constant)
# - Learned value function baseline
# - Reward-to-go vs full episode return
📚 Maneesha's Reflection
Credit assignment: REINFORCE treats all actions in an episode as equally responsible for the outcome. How is this similar to/different from how humans learn from experience? What does this suggest about learning design?
Variance and learning: High variance gradients make learning slow and unstable. What's the analog in human learning? When does "noisy feedback" hurt learning, and how do good teachers reduce that noise?
The exploration-exploitation tradeoff: Stochastic policies naturally explore. How do human learners balance trying new things vs. sticking with what works? What can RL teach us about designing learning experiences that encourage appropriate risk-taking?
RLHF implications: If AI systems learn from human feedback via policy gradients, what happens when the feedback signal is biased? How might this inform how we design feedback systems for human learners?