Deep Q-Networks: Foundations

From Q-tables to neural networks: scaling reinforcement learning to complex environments.


Why Deep Q-Networks?

Remember Q-learning? We learned optimal action-values $Q^*(s, a)$ by storing them in a table. But what happens when your state space is continuous, or astronomically large?

The CartPole Problem: CartPole's observation is a vector of four continuous values (cart position, cart velocity, pole angle, pole angular velocity). A continuous state space has infinitely many states, so a lookup table is impossible, and even a fine discretization would explode the table size while preventing generalization between nearby states.

The Solution: Replace the table with a neural network that learns the Q-function:

$$s \to \left( Q^*(s, a_1), Q^*(s, a_2), \ldots, Q^*(s, a_n) \right)$$

The network takes a state as input and outputs Q-values for all possible actions.
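To make that contract concrete, here is a minimal sketch using a throwaway two-layer network (purely illustrative; the actual QNetwork for this section is defined later):

import torch
from torch import nn

# Illustrative stand-in: any module mapping a 4-dim state to one Q-value per action
toy_q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

state = torch.tensor([0.01, -0.02, 0.03, 0.04])  # one CartPole-style observation
q_values = toy_q_net(state)                      # shape (2,): a Q-value for each action
action = q_values.argmax().item()                # greedy action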


The Bellman Target Problem

In tabular Q-learning, we updated:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

With a neural network, we want to minimize the temporal difference (TD) error:

$$L(\theta) = \mathbb{E} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right)^2 \right]$$

The Problem: Both the prediction $Q(s, a; \theta)$ and target $r + \gamma \max_{a'} Q(s', a'; \theta)$ depend on the same parameters $\theta$. This creates a moving target that causes training instability.


Experience Replay

Problem: Sequential experiences are highly correlated. If the agent keeps seeing similar states in a row, it overfits to recent experience and forgets older lessons.

Solution: Store experiences in a replay buffer and sample randomly:

from dataclasses import dataclass

import numpy as np
import torch
from torch import Tensor
from jaxtyping import Bool, Float, Int  # shape annotations (assuming jaxtyping is used here)


@dataclass
class ReplayBufferSamples:
    obs: Float[Tensor, "batch_size *obs_shape"]
    actions: Int[Tensor, "batch_size *action_shape"]  # discrete action indices
    rewards: Float[Tensor, "batch_size"]
    terminated: Bool[Tensor, "batch_size"]
    next_obs: Float[Tensor, "batch_size *obs_shape"]


class ReplayBuffer:
    def __init__(self, buffer_size: int, obs_shape: tuple, seed: int):
        self.buffer_size = buffer_size
        self.obs = np.empty((0, *obs_shape), dtype=np.float32)
        self.actions = np.empty(0, dtype=np.int32)
        self.rewards = np.empty(0, dtype=np.float32)
        self.terminated = np.empty(0, dtype=bool)
        self.next_obs = np.empty((0, *obs_shape), dtype=np.float32)
        self.rng = np.random.default_rng(seed)

    def add(self, obs, action, reward, terminated, next_obs):
        """Add experience, keeping buffer at max size."""
        self.obs = np.concatenate([self.obs, [obs]])[-self.buffer_size:]
        self.actions = np.concatenate([self.actions, [action]])[-self.buffer_size:]
        self.rewards = np.concatenate([self.rewards, [reward]])[-self.buffer_size:]
        self.terminated = np.concatenate([self.terminated, [terminated]])[-self.buffer_size:]
        self.next_obs = np.concatenate([self.next_obs, [next_obs]])[-self.buffer_size:]

    def sample(self, batch_size: int) -> ReplayBufferSamples:
        """Sample random batch for training."""
        indices = self.rng.integers(0, len(self.obs), batch_size)
        return ReplayBufferSamples(
            obs=torch.tensor(self.obs[indices]),
            actions=torch.tensor(self.actions[indices]),
            rewards=torch.tensor(self.rewards[indices]),
            terminated=torch.tensor(self.terminated[indices]),
            next_obs=torch.tensor(self.next_obs[indices]),
        )
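A quick usage sketch, reusing the classes above (the transitions here are random placeholders just to show the shapes):

buffer = ReplayBuffer(buffer_size=10_000, obs_shape=(4,), seed=0)

# Store a few dummy CartPole-style transitions
for _ in range(256):
    obs = np.random.randn(4).astype(np.float32)
    next_obs = np.random.randn(4).astype(np.float32)
    buffer.add(obs, action=0, reward=1.0, terminated=False, next_obs=next_obs)

batch = buffer.sample(batch_size=128)
print(batch.obs.shape)      # torch.Size([128, 4])
print(batch.rewards.shape)  # torch.Size([128])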

Benefits:

  1. Breaks correlation between consecutive samples
  2. Reuses experiences multiple times (data efficiency)
  3. Smooths learning over many past behaviors

Target Networks

Problem: The moving target issue. When we update $\theta$, both our predictions AND our targets change, causing oscillation.

Solution: Maintain a separate target network with frozen parameters $\theta^-$:

$$L(\theta) = \mathbb{E} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]$$

Periodically copy $\theta \to \theta^-$ (every $C$ steps):

class DQNTrainer:
    def __init__(self, obs_shape, num_actions):
        self.q_network = QNetwork(obs_shape, num_actions)
        self.target_network = QNetwork(obs_shape, num_actions)
        # Initialize target with same weights
        self.target_network.load_state_dict(self.q_network.state_dict())

    def update_target(self):
        """Copy Q-network weights to target network."""
        self.target_network.load_state_dict(self.q_network.state_dict())

Why This Works: The target $r + \gamma \max_{a'} Q(s', a'; \theta^-)$ no longer shifts every time we take a gradient step on $\theta$. For $C$ steps the regression target is effectively fixed, which breaks the feedback loop between predictions and targets and damps the oscillations described above.


The Q-Network Architecture

For CartPole (4 observations, 2 actions):

from torch import Tensor, nn


class QNetwork(nn.Module):
    def __init__(
        self,
        obs_shape: tuple[int],
        num_actions: int,
        hidden_sizes: list[int] = [120, 84]
    ):
        super().__init__()
        assert len(obs_shape) == 1, "Expecting vector observations"

        # Build layers: Linear-ReLU-Linear-ReLU-Linear
        in_features_list = [obs_shape[0]] + hidden_sizes
        out_features_list = hidden_sizes + [num_actions]

        layers = []
        for i, (in_f, out_f) in enumerate(zip(in_features_list, out_features_list)):
            layers.append(nn.Linear(in_f, out_f))
            if i < len(in_features_list) - 1:  # No ReLU after last layer
                layers.append(nn.ReLU())

        self.layers = nn.Sequential(*layers)

    def forward(self, x: Tensor) -> Tensor:
        return self.layers(x)


# For CartPole: 4 -> 120 -> ReLU -> 84 -> ReLU -> 2
net = QNetwork(obs_shape=(4,), num_actions=2)
print(f"Parameters: {sum(p.numel() for p in net.parameters())}")  # 10,934

Key Design Decisions:

  1. A small MLP is enough: CartPole's state is only 4 numbers, so two hidden layers ([120, 84]) already give roughly 11k parameters.
  2. No activation after the output layer: Q-values are unbounded and can be negative, so the final Linear is left raw.
  3. One output per action: a single forward pass scores every action, so greedy action selection is just an argmax over the outputs.


Epsilon-Greedy Exploration

Balance exploitation (use learned Q-values) with exploration (try random actions):

def epsilon_greedy_policy(
    q_network: QNetwork,
    obs: np.ndarray,
    epsilon: float,
    num_actions: int,
    rng: np.random.Generator
) -> int:
    """Select action using epsilon-greedy policy."""
    if rng.random() < epsilon:
        return int(rng.integers(0, num_actions))
    else:
        obs_tensor = torch.tensor(obs, dtype=torch.float32)
        with torch.no_grad():  # action selection needs no gradients
            q_values = q_network(obs_tensor)
        return q_values.argmax().item()


def linear_schedule(
    current_step: int,
    start_e: float,
    end_e: float,
    exploration_fraction: float,
    total_timesteps: int
) -> float:
    """Linearly decay epsilon from start_e to end_e."""
    return start_e + (end_e - start_e) * min(
        current_step / (exploration_fraction * total_timesteps), 1.0
    )


# Typical schedule: 1.0 -> 0.1 over first 20% of training
# Step 0: epsilon = 1.0 (100% random)
# Step 100,000: epsilon = 0.1 (10% random)
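A few spot checks of the schedule, using the hyperparameters from the comment above (pure arithmetic, so these asserts should pass as written):

# Decay runs over the first 20% of 500k steps = 100k steps
assert linear_schedule(0, 1.0, 0.1, 0.2, 500_000) == 1.0
assert abs(linear_schedule(50_000, 1.0, 0.1, 0.2, 500_000) - 0.55) < 1e-6   # halfway through decay
assert abs(linear_schedule(100_000, 1.0, 0.1, 0.2, 500_000) - 0.1) < 1e-6   # decay finished
assert abs(linear_schedule(400_000, 1.0, 0.1, 0.2, 500_000) - 0.1) < 1e-6   # clamped afterwards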

The Complete DQN Algorithm

Putting it all together:

Initialize Q-network Q(s, a; theta)
Initialize target network Q(s, a; theta-) with theta- = theta
Initialize replay buffer D with capacity N
Initialize epsilon = 1.0

For episode = 1 to M:
    Initialize state s

    For t = 1 to T:
        # Select action
        With probability epsilon: action a = random
        Otherwise: action a = argmax_a Q(s, a; theta)

        # Execute action
        Take action a, observe reward r and next state s'
        Store (s, a, r, done, s') in D

        # Sample and train
        Sample random minibatch of (s_j, a_j, r_j, d_j, s'_j) from D

        # Compute targets
        y_j = r_j                          if d_j = True (terminal)
        y_j = r_j + gamma * max_a' Q(s'_j, a'; theta-)   otherwise

        # Gradient descent on (y_j - Q(s_j, a_j; theta))^2

        # Update target network every C steps
        Every C steps: theta- <- theta

        # Decay epsilon
        epsilon = linear_schedule(step)

        s = s'

The Training Step in Code

def training_step(self, step: int):
    """One gradient update from replay buffer."""
    # Sample batch
    data = self.buffer.sample(self.batch_size)

    # Compute target Q-values (no gradient!)
    with torch.no_grad():
        target_max = self.target_network(data.next_obs).max(dim=-1).values

    # Current Q-value predictions
    q_values = self.q_network(data.obs)
    predicted_q = q_values[range(len(data.actions)), data.actions]

    # TD target: r + gamma * max Q(s', a') * (1 - done)
    td_target = data.rewards + self.gamma * target_max * (1 - data.terminated.float())

    # TD error and loss
    td_error = td_target - predicted_q
    loss = td_error.pow(2).mean()

    # Gradient descent
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

    # Periodically update target network
    if step % self.target_update_freq == 0:
        self.target_network.load_state_dict(self.q_network.state_dict())

    return loss.item()
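For completeness, here is a hedged sketch of the trainer setup that training_step assumes. The attribute names match the method above; the optimizer choice (Adam) and the hyperparameter defaults mirror Aaliyah's exercise below and are illustrative, not necessarily the original configuration:

class DQNTrainer:
    def __init__(self, obs_shape, num_actions, buffer_size=10_000, batch_size=128,
                 gamma=0.99, learning_rate=2.5e-4, target_update_freq=1000, seed=0):
        # Online and target networks start with identical weights
        self.q_network = QNetwork(obs_shape, num_actions)
        self.target_network = QNetwork(obs_shape, num_actions)
        self.target_network.load_state_dict(self.q_network.state_dict())

        # Replay buffer and optimizer used by training_step
        self.buffer = ReplayBuffer(buffer_size, obs_shape, seed)
        self.optimizer = torch.optim.Adam(self.q_network.parameters(), lr=learning_rate)
        self.batch_size = batch_size
        self.gamma = gamma
        self.target_update_freq = target_update_freq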

Terminated vs Truncated

A subtle but important distinction:

# Environment returns both terminated and truncated
next_obs, reward, terminated, truncated, info = env.step(action)

# terminated: Episode ended due to environment rules
#   - CartPole fell over (bad)
#   - Game over
#   - Q(terminal state) = 0 (no future rewards)

# truncated: Episode ended due to time limit
#   - Reached 500 steps
#   - Artificial cutoff for training
#   - Agent could have continued!

# In TD target, use ONLY terminated:
td_target = reward + gamma * max_q * (1 - terminated)

# Why? If we used truncated, agent learns:
# "Step 499 has no future value" -> No incentive to keep balancing!

Probe Environments for Debugging

Before tackling CartPole, test on simple environments:

import gymnasium as gym  # assuming the Gymnasium API (5-tuple step, (obs, info) reset)
import numpy as np
from random import random


class Probe1(gym.Env):
    """Constant observation [0.0], reward +1. Test: Can it learn value = 1?"""
    def step(self, action):
        return np.array([0.0]), 1.0, True, True, {}


class Probe2(gym.Env):
    """Observation [-1] or [+1], reward = observation. Test: Can it differentiate?"""
    def reset(self, seed=None, options=None):
        self.obs = 1.0 if random() < 0.5 else -1.0
        return np.array([self.obs]), {}

    def step(self, action):
        return np.array([self.obs]), self.obs, True, True, {}


class Probe3(gym.Env):
    """Two timesteps, reward at end. Test: Does it understand discounting?"""
    # obs [0] -> obs [1] -> reward +1
    # Q([0]) should = gamma, Q([1]) should = 1


class Probe4(gym.Env):
    """Two actions, action 0 = -1 reward, action 1 = +1 reward. Test: Policy learning."""


class Probe5(gym.Env):
    """Match action to observation. Test: Observation-conditioned policy."""

If a probe fails, you know exactly what's broken:

  1. Probe1 fails: the loss, TD target, or optimizer step is wrong (the network can't even fit a constant value).
  2. Probe2 fails: the value estimate isn't conditioning on the observation.
  3. Probe3 fails: discounting or the bootstrapped target is wrong.
  4. Probe4 fails: action selection, or the indexing of Q(s, a) for the taken action, is wrong.
  5. Probe5 fails: the policy isn't conditioning its action choice on the observation.
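For example, a minimal post-training check for Probe1 (assuming q_network has just been trained on Probe1; the tolerance is arbitrary):

# Probe1's reward is +1 and the episode ends immediately, so Q(s, a) should be ~1 for every action
obs = torch.tensor([[0.0]])        # Probe1's constant observation, batch of 1
with torch.no_grad():
    q = q_network(obs)
assert torch.allclose(q, torch.ones_like(q), atol=0.05), f"Probe1 failed: {q}"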


Capstone Connection

DQN provides the foundation for alignment research:

  1. Reward Hacking: DQN will optimize exactly what you reward. In Montezuma's Revenge, researchers added small reward for picking up keys. The agent learned to get close to keys, trigger the reward, back away, and repeat - never actually using them.

  2. Sycophancy Detection: We can use DQN-style value estimation to detect when models are optimizing for user approval rather than accuracy:

# Sycophancy probe: Does the model's value function
# overweight responses that agree with user?
sycophantic_response = "Yes, you're absolutely right!"
honest_response = "Actually, 2+2=4, not 5."

# If Q(agree) >> Q(correct), we have a problem

  3. Corrigibility: The target network provides a template for "frozen" value systems that don't update during deployment - a key property for safe AI systems.

🎓 Tyla's Exercise

  1. Prove convergence: Under what conditions does DQN converge? The original paper claims it doesn't diverge, but doesn't prove convergence. What assumptions would you need?

  2. Maximization bias: The $\max$ operator in $\max_{a'} Q(s', a')$ introduces upward bias. If $Q(s', a_1) = Q(s', a_2) = 0$ but our estimates are noisy, then $\mathbb{E}\left[\max\left(\hat{Q}(s', a_1), \hat{Q}(s', a_2)\right)\right] > 0$. How does this compound over episodes?

  3. Buffer distribution shift: As the policy improves, old experiences in the buffer become off-policy. Derive how the effective discount factor changes when training on experiences from an older policy.


💻 Aaliyah's Exercise

Implement the complete DQN training loop:

def train_dqn(
    env_id: str = "CartPole-v1",
    total_timesteps: int = 500_000,
    buffer_size: int = 10_000,
    batch_size: int = 128,
    gamma: float = 0.99,
    learning_rate: float = 2.5e-4,
    target_update_freq: int = 1000,
    start_epsilon: float = 1.0,
    end_epsilon: float = 0.1,
    exploration_fraction: float = 0.2,
):
    """
    Your tasks:
    1. Initialize environment, networks, buffer, optimizer
    2. Implement the main training loop
    3. Fill buffer before training starts (buffer_size steps)
    4. Implement epsilon decay schedule
    5. Log episode rewards, TD loss, Q-values, epsilon
    6. Test on Probe1-5 before CartPole

    Success criteria:
    - Pass all 5 probe environments
    - Solve CartPole (avg reward > 475 over 100 episodes)
    - Training completes in < 10 minutes
    """
    pass


# Debugging checklist:
# [ ] Buffer stores (s, a, r, terminated, s') correctly
# [ ] Terminated vs truncated handled properly
# [ ] Target network updated at right frequency
# [ ] Epsilon decays as expected
# [ ] Q-values trend toward 1/(1-gamma) for good states

📚 Maneesha's Reflection

  1. The Table-to-Network Transition: When we moved from Q-tables to neural networks, we gained generalization but lost interpretability. A Q-table tells you exactly what the agent values. A neural network is a black box. What are the instructional design implications of teaching with interpretable vs. black-box systems?

  2. Experience Replay as Spaced Repetition: The replay buffer randomly resurfaces old experiences, similar to spaced repetition in human learning. What can DQN's success with replay tell us about optimal learning schedules for humans?

  3. The Target Network Metaphor: The target network provides a stable goal while the main network learns. In education, we might call this "scaffolding" - providing fixed structure while learners develop. When should educational scaffolding be updated vs. kept fixed?

  4. Probe Environments as Diagnostic Assessment: The probe environments isolate specific capabilities (value learning, differentiation, discounting, policy learning). How could you design similar diagnostic assessments for human learners mastering a complex skill?