Policy Gradient Methods: Learning Actions Directly

Value-based RL asks "what's this state worth?" Policy gradients ask "what should I do here?"


The Two Paradigms

Value-Based Methods (DQN, etc.)

Learn action values Q(s, a); the policy is implicit - act by taking the argmax.

Policy-Based Methods

Parameterize pi_theta(a|s) directly and sample actions from it.

# Value-based: implicit policy
def value_based_policy(state, q_network):
    q_values = q_network(state)  # [batch, num_actions]
    return q_values.argmax(dim=-1)  # Pick best action

# Policy-based: explicit policy
def policy_based_action(state, policy_network):
    action_probs = policy_network(state)  # [batch, num_actions]
    dist = torch.distributions.Categorical(action_probs)
    return dist.sample()  # Sample from distribution

Why Policy Gradients?

Advantages over value-based:

  1. Continuous actions: Q-learning needs an argmax over actions, which is intractable in a continuous action space (see the Gaussian-policy sketch after this list)
  2. Stochastic policies: Sometimes optimal strategy is random (rock-paper-scissors)
  3. Smoother optimization: Small parameter changes = small behavior changes
  4. Convergence guarantees: gradient ascent on a well-defined objective converges to at least a local optimum under standard conditions
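
To make point 1 concrete, here is a minimal sketch of a Gaussian policy head for continuous actions (the GaussianPolicy class and its dimensions are illustrative, not from any particular library):

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Sketch: a continuous-action policy outputs distribution parameters
    (mean, std) instead of one value per discrete action."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        # Log-std as a free parameter, independent of state (a common choice)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.backbone(state)
        return torch.distributions.Normal(self.mean_head(h), self.log_std.exp())

# Sampling an action and its log-probability - no argmax needed
policy = GaussianPolicy(state_dim=3, action_dim=1)
dist = policy(torch.randn(1, 3))
action = dist.sample()
log_prob = dist.log_prob(action).sum(dim=-1)  # sum over action dimensions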

The tradeoff: policy gradient estimates are noisy (high variance) and typically on-policy, so they tend to be less sample-efficient than value-based methods. Most of the rest of this section is about taming that variance.


The Objective Function

We want to maximize expected return:

# The RL objective
J(theta) = E[sum of rewards when following policy pi_theta]

# In code:
def rl_objective(policy, env, num_episodes=100):
    """Estimate expected return by sampling episodes."""
    total_return = 0

    for _ in range(num_episodes):
        state = env.reset()
        episode_return = 0
        done = False

        while not done:
            action = policy.sample_action(state)
            state, reward, done, _ = env.step(action)
            episode_return += reward

        total_return += episode_return

    return total_return / num_episodes

Goal: Find parameters theta that maximize J(theta).


The Policy Gradient Theorem

The key insight: we can compute gradients of expected return.

The theorem:

nabla J(theta) = E[ sum_t nabla log pi(a_t|s_t) * G_t ]

Where G_t is the return from timestep t onward.

In plain English: nudge the policy so that actions followed by high returns become more probable and actions followed by low returns become less probable, with the size of each nudge scaled by the return.
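
A tiny worked example (toy numbers, a softmax policy over two actions) makes the update direction concrete:

import torch

# Softmax policy over 2 actions, parameterized by logits
logits = torch.tensor([0.0, 0.0], requires_grad=True)
dist = torch.distributions.Categorical(logits=logits)

action = torch.tensor(0)  # suppose we sampled action 0...
G = 5.0                   # ...and the return that followed was +5

# REINFORCE term, negated so that minimizing it performs gradient ascent on J
loss = -dist.log_prob(action) * G
loss.backward()

# grad w.r.t. logits = [-G * (1 - 0.5), G * 0.5] = [-2.5, +2.5]
# A gradient-descent step on the loss therefore raises logit 0:
# the action that preceded a high return becomes more probable.
print(logits.grad)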


Deriving the Gradient (Tyla's Deep Dive)

The derivation uses the "log derivative trick":

# We want: nabla E[R] where R depends on actions sampled from pi
#
# Step 1: Write the expectation as a sum (or integral) over trajectories
# E[R] = sum over trajectories tau: P(tau) * R(tau)
#
# Step 2: Take gradient
# nabla E[R] = sum: nabla P(tau) * R(tau)
#
# Step 3: Log trick: nabla P = P * nabla log P
# nabla E[R] = sum: P(tau) * nabla log P(tau) * R(tau)
#            = E[ nabla log P(tau) * R(tau) ]
#
# Step 4: log P(trajectory) = sum of log pi(a_t|s_t)
# (transitions don't depend on theta)

def policy_gradient_estimate(policy, trajectory, gamma=0.99):
    """
    Estimate the policy gradient loss from a single trajectory.

    trajectory: list of (state, action, reward) tuples
    gamma: discount factor
    """
    log_probs = []
    returns = []

    # Compute log probabilities
    for state, action, _ in trajectory:
        log_prob = policy.log_prob(state, action)
        log_probs.append(log_prob)

    # Compute discounted returns (reward-to-go)
    G = 0
    for _, _, reward in reversed(trajectory):
        G = reward + gamma * G
        returns.insert(0, G)

    returns = torch.tensor(returns)
    log_probs = torch.stack(log_probs)

    # Policy gradient: sum of log_prob * return
    loss = -(log_probs * returns).sum()  # Negative for gradient ascent

    return loss
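
For context, this is roughly how the returned loss feeds an update step. It assumes a policy object with the sample_action and log_prob methods used elsewhere in this section, an env with the classic gym step API, and a hypothetical collect_episode helper defined here:

# Hypothetical glue code: roll out one episode, then take one gradient step
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def collect_episode(env, policy):
    trajectory, state, done = [], env.reset(), False
    while not done:
        action = policy.sample_action(state)
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

loss = policy_gradient_estimate(policy, collect_episode(env, policy))
optimizer.zero_grad()
loss.backward()   # autograd produces the REINFORCE gradient
optimizer.step()  # one gradient-ascent step on expected return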

REINFORCE Algorithm

The simplest policy gradient algorithm:

class REINFORCE:
    def __init__(self, policy_network, lr=1e-3, gamma=0.99):
        self.policy = policy_network
        self.optimizer = torch.optim.Adam(policy_network.parameters(), lr=lr)
        self.gamma = gamma

    def select_action(self, state):
        """Sample action from policy."""
        state = torch.FloatTensor(state).unsqueeze(0)
        probs = self.policy(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)

    def compute_returns(self, rewards):
        """Compute discounted returns for each timestep."""
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        return torch.tensor(returns)

    def update(self, log_probs, rewards):
        """Update policy using collected episode."""
        returns = self.compute_returns(rewards)

        # Normalize returns (helps with stability)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        # Policy gradient loss
        loss = []
        for log_prob, G in zip(log_probs, returns):
            loss.append(-log_prob * G)

        loss = torch.stack(loss).sum()

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()

Training Loop

def train_reinforce(env, agent, num_episodes=1000):
    """Train REINFORCE agent."""
    episode_rewards = []

    for episode in range(num_episodes):
        state = env.reset()
        log_probs = []
        rewards = []
        done = False

        # Collect episode
        while not done:
            action, log_prob = agent.select_action(state)
            next_state, reward, done, _ = env.step(action)

            log_probs.append(log_prob)
            rewards.append(reward)
            state = next_state

        # Update policy
        loss = agent.update(log_probs, rewards)
        episode_rewards.append(sum(rewards))

        if episode % 100 == 0:
            avg_reward = sum(episode_rewards[-100:]) / min(100, len(episode_rewards))
            print(f"Episode {episode}, Avg Reward: {avg_reward:.2f}")

    return episode_rewards

The Variance Problem

REINFORCE has high variance. Why?

# The gradient estimate:
# nabla J = E[ log_prob * return ]
#
# Problem 1: Returns can be huge or tiny
# Episode 1: Return = 200
# Episode 2: Return = 50
# Episode 3: Return = 300
#
# The gradient bounces around wildly!

# Problem 2: Credit assignment
# In a 100-step episode, did step 1's action matter for the final reward?
# Vanilla REINFORCE doesn't know - scaling every log-prob by the whole episode's
# return gives every action the same credit (reward-to-go helps, but only partly)

High variance = slow learning, unstable training.


Baseline Subtraction

Key insight: we can subtract any function of state without changing the gradient.

# Original: nabla J = E[ log pi(a|s) * G ]
# With baseline: nabla J = E[ log pi(a|s) * (G - b(s)) ]
#
# Why does this work?
# E[ nabla log pi(a|s) * b(s) ] = 0 for any b(s) that doesn't depend on a
# (because sum_a pi(a|s) * nabla log pi(a|s) = sum_a nabla pi(a|s) = nabla 1 = 0)
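
A quick Monte Carlo sanity check of that identity, using a fixed 3-action softmax policy and simulated returns (toy numbers, not from an environment):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
probs = torch.tensor([0.2, 0.5, 0.3])   # a fixed softmax policy over 3 actions
num_actions = probs.numel()

# For a softmax policy, nabla_logits log pi(a) = onehot(a) - probs
def grad_log_pi(actions):
    return F.one_hot(actions, num_actions).float() - probs

b = 7.0  # an arbitrary constant baseline
actions = torch.distributions.Categorical(probs=probs).sample((100_000,))

# 1) The baseline term averages to ~0, so subtracting it adds no bias
print(grad_log_pi(actions).mul(b).mean(dim=0))     # ~[0., 0., 0.]

# 2) It does shrink the spread. Simulate noisy returns centered at b:
G = b + torch.randn(100_000, 1)
print(grad_log_pi(actions).mul(G).std(dim=0))      # large per-sample spread
print(grad_log_pi(actions).mul(G - b).std(dim=0))  # much smaller spread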

class REINFORCEWithBaseline(REINFORCE):
    """Inherits select_action and compute_returns from REINFORCE."""

    def __init__(self, policy_network, value_network, lr=1e-3, gamma=0.99):
        self.policy = policy_network
        self.value = value_network  # The baseline
        self.policy_optimizer = torch.optim.Adam(policy_network.parameters(), lr=lr)
        self.value_optimizer = torch.optim.Adam(value_network.parameters(), lr=lr)
        self.gamma = gamma

    def update(self, states, log_probs, rewards):
        """Update with baseline subtraction."""
        returns = self.compute_returns(rewards)
        states = torch.FloatTensor(states)

        # Baseline: learned value function
        values = self.value(states).squeeze()

        # Advantage = Return - Baseline
        advantages = returns - values.detach()

        # Policy loss (gradient ascent on advantage-weighted log probs)
        policy_loss = -(torch.stack(log_probs) * advantages).sum()

        # Value loss (MSE between predicted and actual returns)
        value_loss = F.mse_loss(values, returns)

        # Update policy
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()

        # Update baseline
        self.value_optimizer.zero_grad()
        value_loss.backward()
        self.value_optimizer.step()

        return policy_loss.item(), value_loss.item()

Why Baseline Helps

# Without baseline:
# Action A led to return 100 -> increase probability a lot
# Action B led to return 95 -> increase probability almost as much
#
# But maybe average return is 90!
# A was great (+10 above average)
# B was okay (+5 above average)

# With baseline (average return):
# Action A: advantage = 100 - 90 = +10 -> increase a lot
# Action B: advantage = 95 - 90 = +5 -> increase a little
# Action C: advantage = 80 - 90 = -10 -> DECREASE probability

# The baseline centers the signal, reducing variance

Other Variance Reduction Techniques

# 1. Reward normalization
returns = (returns - returns.mean()) / (returns.std() + 1e-8)

# 2. Reward-to-go (only future rewards matter)
def reward_to_go(rewards, gamma):
    """Use only future rewards, not past."""
    rtg = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        rtg.insert(0, G)
    return rtg

# 3. Discount factor (earlier actions matter more)
# Already built into return computation

# 4. Multiple parallel environments
def parallel_collect(envs, policy, steps):
    """Collect from many envs at once - averaging over more episodes
    per update reduces the variance of the gradient estimate."""
    pass  # sketch only: step each env, store transitions, update on the batch

# 5. Entropy bonus (encourages exploration, smooths gradients)
def entropy_bonus(probs, coefficient=0.01):
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
    return coefficient * entropy.mean()
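
A minimal sketch of how the entropy bonus (technique 5) folds into the REINFORCE loss. The tensors below are toy stand-ins for one collected episode, just to show the shapes and the sign convention:

import torch

# Toy stand-ins for one 10-step episode with 2 actions
logits = torch.randn(10, 2, requires_grad=True)   # policy logits per step
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()                           # the actions taken
returns = torch.randn(10)                         # normalized returns G_t

policy_loss = -(dist.log_prob(actions) * returns).sum()
entropy = dist.entropy().mean()   # same -(p * log p).sum() as entropy_bonus above

# Subtracting the bonus rewards higher-entropy (more exploratory) policies
total_loss = policy_loss - 0.01 * entropy
total_loss.backward()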

Putting It Together: Complete REINFORCE

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.softmax(self.fc3(x), dim=-1)
        return x

class ValueNetwork(nn.Module):
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
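
The baseline agent's update needs the visited states as well, so its training loop has to track them. A minimal variant of train_reinforce (the name train_reinforce_baseline is ours; it assumes the same classic-gym step API used throughout this section):

def train_reinforce_baseline(env, agent, num_episodes=1000):
    """Like train_reinforce, but also records states for the value baseline."""
    episode_rewards = []

    for episode in range(num_episodes):
        state = env.reset()
        states, log_probs, rewards = [], [], []
        done = False

        # Collect episode, keeping the states this time
        while not done:
            action, log_prob = agent.select_action(state)
            next_state, reward, done, _ = env.step(action)

            states.append(state)
            log_probs.append(log_prob)
            rewards.append(reward)
            state = next_state

        # Update both the policy and the value baseline
        policy_loss, value_loss = agent.update(states, log_probs, rewards)
        episode_rewards.append(sum(rewards))

        if episode % 100 == 0:
            avg_reward = sum(episode_rewards[-100:]) / min(100, len(episode_rewards))
            print(f"Episode {episode}, Avg Reward: {avg_reward:.2f}")

    return episode_rewards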

# Usage
policy = PolicyNetwork(state_dim=4, action_dim=2)
value = ValueNetwork(state_dim=4)
agent = REINFORCEWithBaseline(policy, value)

# Train on CartPole. Note: train_reinforce works with the plain REINFORCE agent;
# the baseline agent needs the loop above, which also passes states to update().
import gym
env = gym.make('CartPole-v1')
rewards = train_reinforce_baseline(env, agent, num_episodes=1000)

Capstone Connection

Policy gradients are the foundation of RLHF.

In RLHF for language models:

# RLHF uses policy gradients!
#
# The LM is a policy: pi(next_token | previous_tokens)
# We optimize: E[ reward_model(full_response) ]
#
# Using policy gradients:
# nabla J = E[ nabla log pi(response) * reward(response) ]
#
# At its core, this is REINFORCE on text generation (practical RLHF adds
# refinements such as a KL penalty, but the gradient has exactly this shape)

def rlhf_gradient_sketch(lm, reward_model, prompt, response):
    """
    Sketch of RLHF gradient computation.
    """
    # Get log probability of the response
    log_prob = lm.log_prob(response, given=prompt)

    # Get reward from reward model
    reward = reward_model(prompt + response)

    # Policy gradient
    loss = -log_prob * reward

    return loss

Sycophancy connection:

The gradient reinforces whatever the reward model scores highly. If raters (and therefore the reward model) prefer agreeable, flattering answers, the policy is pushed toward agreement whether or not it is accurate. Understanding policy gradients = understanding how RLHF can go wrong.


🎓 Tyla's Exercise

  1. Derive the policy gradient theorem from first principles. Start with J(theta) = E[R] and show how to get nabla J = E[nabla log pi * R].

  2. Prove that baseline subtraction doesn't change the expected gradient. Show that E[nabla log pi(a|s) * b(s)] = 0 for any function b that doesn't depend on a.

  3. Analyze the variance of the REINFORCE estimator. How does variance scale with episode length? With return magnitude?

  4. Optimal baseline: Prove that the variance-minimizing baseline is b(s) = E[G | s], the value function.


💻 Aaliyah's Exercise

Implement REINFORCE from scratch and train it on CartPole:

def build_reinforce_agent(state_dim, action_dim):
    """
    Build a complete REINFORCE agent with:
    1. Policy network (2 hidden layers, softmax output)
    2. Value network for baseline
    3. Training loop with logging
    4. Return normalization
    5. Entropy bonus
    """
    # Your implementation here
    pass

def train_and_plot(env_name='CartPole-v1', num_episodes=1000):
    """
    1. Create environment and agent
    2. Train for num_episodes
    3. Plot learning curve
    4. Compare with and without baseline
    5. Print final average reward
    """
    pass

# Bonus: Implement and compare these variance reduction techniques:
# - No baseline
# - Mean return baseline (constant)
# - Learned value function baseline
# - Reward-to-go vs full episode return

📚 Maneesha's Reflection

  1. Credit assignment: REINFORCE treats all actions in an episode as equally responsible for the outcome. How is this similar to/different from how humans learn from experience? What does this suggest about learning design?

  2. Variance and learning: High variance gradients make learning slow and unstable. What's the analog in human learning? When does "noisy feedback" hurt learning, and how do good teachers reduce that noise?

  3. The exploration-exploitation tradeoff: Stochastic policies naturally explore. How do human learners balance trying new things vs. sticking with what works? What can RL teach us about designing learning experiences that encourage appropriate risk-taking?

  4. RLHF implications: If AI systems learn from human feedback via policy gradients, what happens when the feedback signal is biased? How might this inform how we design feedback systems for human learners?