Tabular RL Methods

From theory to computation: algorithms that find optimal policies when we can enumerate all states.


What Are Tabular Methods?

Tabular methods store a value explicitly for every state or state-action pair. They work when the state and action spaces are discrete and small enough that the whole table fits in memory and each entry can be visited often enough to estimate it accurately.

The "table" is literally an array:

import numpy as np

num_states, num_actions = 16, 4  # e.g., a 4x4 gridworld with 4 possible moves

# Q-table: value of each (state, action) pair
Q = np.zeros((num_states, num_actions))

# V-table: value of each state
V = np.zeros(num_states)

# Policy table (deterministic): action for each state
pi = np.zeros(num_states, dtype=int)

Two Paradigms: Planning vs Learning

Planning (Dynamic Programming): the transition model $T$ and reward function $R$ are known, so we can compute value functions and optimal policies without interacting with the environment at all.

Learning (Model-Free): $T$ and $R$ are unknown, so we estimate values and policies from sampled transitions gathered by acting in the environment.

Today we cover both.


Dynamic Programming: The Foundation

When we know the environment dynamics, we can solve MDPs exactly.

Key insight: for a fixed policy, the Bellman equation is a system of linear equations in the state values. We can solve it iteratively or directly.

# The Bellman equation for policy pi:
# V_pi(s) = sum_s' T(s'|s, pi(s)) * [R(s, pi(s), s') + gamma * V_pi(s')]

# In matrix form, with P_pi[s, s'] = T(s'|s, pi(s)) and r_pi the expected one-step reward:
# V = r_pi + gamma * P_pi @ V
# V = (I - gamma * P_pi)^(-1) @ r_pi
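Before the general implementations below, a tiny worked example (a 2-state chain with made-up numbers) shows that iterating the Bellman update and solving the linear system directly give the same answer:

import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])   # P_pi[s, s'] = T(s'|s, pi(s))
r_pi = np.array([1.0, 0.0])     # expected one-step reward under pi

# Direct solve: (I - gamma * P_pi) V = r_pi
V_direct = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Iterative solve: repeat V <- r_pi + gamma * P_pi @ V
V_iter = np.zeros(2)
for _ in range(1000):
    V_iter = r_pi + gamma * P_pi @ V_iter

print(V_direct, V_iter)  # the two agree to numerical precision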

Policy Evaluation: Computing V for a Fixed Policy

Given a policy $\pi$, we want to find $V_\pi$.

Iterative approach (simpler, always works):

$$ V_{k+1}(s) = \sum_{s'} T(s'|s, \pi(s)) \left[R(s, \pi(s), s') + \gamma V_k(s')\right] $$

def policy_evaluation_iterative(env, policy, gamma=0.99, theta=1e-8):
    """
    Compute V_pi by iterating the Bellman equation.

    Args:
        env: Environment with env.T and env.R
        policy: Array of shape (num_states,) with action for each state
        gamma: Discount factor
        theta: Convergence threshold
    """
    num_states = env.num_states
    V = np.zeros(num_states)

    while True:
        V_new = np.zeros(num_states)

        for s in range(num_states):
            a = policy[s]
            for s_next in range(num_states):
                prob = env.T[s, a, s_next]
                reward = env.R[s, a, s_next]
                V_new[s] += prob * (reward + gamma * V[s_next])

        # Check convergence
        delta = np.max(np.abs(V_new - V))
        V = V_new

        if delta < theta:
            break

    return V

Policy Evaluation: The Vectorized Version

For efficiency, avoid explicit loops:

def policy_evaluation_vectorized(env, policy, gamma=0.99, theta=1e-8):
    """
    Vectorized policy evaluation using matrix operations.
    """
    num_states = env.num_states
    V = np.zeros(num_states)

    # Get transition and reward matrices for this policy
    # T_pi[s, s'] = T[s, pi(s), s']
    state_indices = np.arange(num_states)
    T_pi = env.T[state_indices, policy, :]  # Shape: (num_states, num_states)
    R_pi = env.R[state_indices, policy, :]  # Shape: (num_states, num_states)

    # Expected reward for each state: r_pi[s] = sum_s' T_pi[s,s'] * R_pi[s,s']
    r_pi = np.sum(T_pi * R_pi, axis=1)

    while True:
        # Bellman update: V = r_pi + gamma * T_pi @ V
        V_new = r_pi + gamma * (T_pi @ V)

        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

Policy Evaluation: Exact Solution

We can also solve the linear system directly:

$$ V_\pi = (I - \gamma P_\pi)^{-1} r_\pi $$

def policy_evaluation_exact(env, policy, gamma=0.99):
    """
    Solve the Bellman equation directly via matrix inversion.
    """
    num_states = env.num_states
    state_indices = np.arange(num_states)

    T_pi = env.T[state_indices, policy, :]
    R_pi = env.R[state_indices, policy, :]
    r_pi = np.sum(T_pi * R_pi, axis=1)

    # Solve: (I - gamma * T_pi) @ V = r_pi
    A = np.eye(num_states) - gamma * T_pi
    V = np.linalg.solve(A, r_pi)

    return V

Policy Improvement: Finding a Better Policy

Given $V_\pi$, we can find a better policy by being greedy with respect to Q:

$$ \pi'(s) = \arg\max_a Q_\pi(s, a) = \arg\max_a \sum_{s'} T(s'|s,a)\left[R(s,a,s') + \gamma V_\pi(s')\right] $$

def policy_improvement(env, V, gamma=0.99):
    """
    Compute the greedy policy with respect to V.

    Returns:
        policy: Array of shape (num_states,) with best action for each state
    """
    num_states = env.num_states
    num_actions = env.num_actions

    # Compute Q(s, a) for all state-action pairs
    # Q[s, a] = sum_s' T[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
    Q = np.zeros((num_states, num_actions))

    for s in range(num_states):
        for a in range(num_actions):
            for s_next in range(num_states):
                prob = env.T[s, a, s_next]
                reward = env.R[s, a, s_next]
                Q[s, a] += prob * (reward + gamma * V[s_next])

    # Greedy policy: take action with highest Q-value
    policy = np.argmax(Q, axis=1)

    return policy

Policy Improvement: Vectorized

def policy_improvement_vectorized(env, V, gamma=0.99):
    """
    Vectorized policy improvement.
    """
    # Q[s, a] = sum_s' T[s,a,s'] * (R[s,a,s'] + gamma * V[s'])
    # Use a distinct einsum index for the next state so it gets summed out
    Q = np.einsum('san,san->sa', env.T, env.R + gamma * V)

    return np.argmax(Q, axis=1)

Policy Iteration: The Full Algorithm

Alternate between evaluation and improvement until convergence:

def policy_iteration(env, gamma=0.99):
    """
    Find optimal policy via policy iteration.

    Returns:
        policy: Optimal policy
        V: Value function for optimal policy
    """
    num_states = env.num_states

    # Initialize with arbitrary policy
    policy = np.zeros(num_states, dtype=int)

    while True:
        # Policy Evaluation
        V = policy_evaluation_exact(env, policy, gamma)

        # Policy Improvement
        new_policy = policy_improvement(env, V, gamma)

        # Check for convergence
        if np.array_equal(new_policy, policy):
            return policy, V

        policy = new_policy

Why does this converge? Each improvement step produces a policy at least as good as the previous one (the policy improvement theorem), and there are only finitely many deterministic policies, so the sequence cannot keep improving forever. Once improvement no longer changes the policy, the value function satisfies the Bellman optimality equation, so the policy is optimal.


Value Iteration: A Shortcut

Value iteration combines evaluation and improvement in one update:

$$ V_{k+1}(s) = \max_a \sum_{s'} T(s'|s,a)\left[R(s,a,s') + \gamma V_k(s')\right] $$

def value_iteration(env, gamma=0.99, theta=1e-8):
    """
    Find optimal value function and policy via value iteration.
    """
    num_states = env.num_states
    num_actions = env.num_actions
    V = np.zeros(num_states)

    while True:
        V_new = np.zeros(num_states)

        for s in range(num_states):
            # Compute Q(s, a) for all actions
            q_values = np.zeros(num_actions)
            for a in range(num_actions):
                for s_next in range(num_states):
                    prob = env.T[s, a, s_next]
                    reward = env.R[s, a, s_next]
                    q_values[a] += prob * (reward + gamma * V[s_next])

            # Take max over actions
            V_new[s] = np.max(q_values)

        # Check convergence
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new

    # Extract policy from converged V
    policy = policy_improvement(env, V, gamma)

    return policy, V

Value Iteration: Vectorized

def value_iteration_vectorized(env, gamma=0.99, theta=1e-8):
    """
    Vectorized value iteration.
    """
    num_states = env.num_states
    V = np.zeros(num_states)

    while True:
        # Compute Q[s, a] for all state-action pairs at once
        # (distinct einsum index 'n' for the next state, which is summed out)
        Q = np.einsum('san,san->sa', env.T, env.R + gamma * V)

        # V = max over actions
        V_new = np.max(Q, axis=1)

        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new

    policy = np.argmax(Q, axis=1)
    return policy, V

Comparing Policy Iteration and Value Iteration

| Aspect | Policy Iteration | Value Iteration |
|--------|------------------|-----------------|
| Per-iteration cost | High (full policy evaluation) | Low (single backup sweep) |
| Iterations to converge | Few | Many |
| Best when | The state space is small enough for exact evaluation | Each sweep must stay cheap (large state spaces) |

In practice: value iteration is simpler to implement and is usually the default; policy iteration pays off when exact policy evaluation is cheap relative to the many sweeps value iteration would need. The sketch below compares the two on a small random MDP and confirms they recover the same optimal policy.
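A minimal comparison sketch, assuming a throwaway RandomMDP class of our own (not from any library) that exposes the env.T, env.R, num_states, and num_actions interface used throughout this section:

import numpy as np

class RandomMDP:
    """A small random MDP for testing; each T[s, a, :] is a valid distribution."""
    def __init__(self, num_states=50, num_actions=4, seed=0):
        rng = np.random.default_rng(seed)
        self.num_states, self.num_actions = num_states, num_actions
        self.T = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
        self.R = rng.normal(size=(num_states, num_actions, num_states))

env = RandomMDP()
pi_policy, V_pi = policy_iteration(env, gamma=0.95)
vi_policy, V_vi = value_iteration_vectorized(env, gamma=0.95)

print(np.array_equal(pi_policy, vi_policy))  # same optimal policy
print(np.max(np.abs(V_pi - V_vi)))           # value functions agree up to a small tolerance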


When DP Fails: Model-Free Methods

What if we don't know $T$ and $R$? We learn from experience.

Key idea: Replace expectations with samples.

$$ \mathbb{E}[X] \approx \frac{1}{N}\sum_{i=1}^N x_i $$

Instead of computing $\sum_{s'} T(s'|s,a) [\ldots]$, we collect actual transitions.
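As a quick self-contained illustration (all numbers made up, no environment involved), the sample average converges to the same backup that dynamic programming computes from the model:

import numpy as np

rng = np.random.default_rng(0)
num_states, gamma = 5, 0.99

T_sa = rng.dirichlet(np.ones(num_states))  # T(s'|s,a) for one fixed (s, a) -- hidden from a learner
R_sa = rng.normal(size=num_states)         # R(s,a,s') for the same (s, a)
V = rng.normal(size=num_states)            # some current value estimate

exact = np.sum(T_sa * (R_sa + gamma * V))  # the model-based (DP) backup

# A model-free learner only sees sampled next states
samples = rng.choice(num_states, size=10_000, p=T_sa)
estimate = np.mean(R_sa[samples] + gamma * V[samples])

print(exact, estimate)  # close, and the gap shrinks as the number of samples grows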


Monte Carlo Methods

The simplest model-free approach: run full episodes, compute returns.

def monte_carlo_prediction(env, policy, num_episodes=1000, gamma=0.99):
    """
    Estimate V_pi using Monte Carlo sampling.

    Idea: V(s) = average return starting from s
    """
    returns = {s: [] for s in range(env.num_states)}

    for episode in range(num_episodes):
        # Generate an episode following policy
        trajectory = []
        state = env.reset()
        done = False

        while not done:
            action = policy[state]
            next_state, reward, done, _ = env.step(action)
            trajectory.append((state, reward))
            state = next_state

        # Compute returns for each state visited
        G = 0
        for state, reward in reversed(trajectory):
            G = reward + gamma * G
            returns[state].append(G)

    # Average returns
    V = np.zeros(env.num_states)
    for s in range(env.num_states):
        if returns[s]:
            V[s] = np.mean(returns[s])

    return V

Monte Carlo Control: Finding Optimal Policies

For control, we need Q-values (so we can improve without knowing $T$):

def monte_carlo_control(env, num_episodes=10000, gamma=0.99, epsilon=0.1):
    """
    Monte Carlo control with epsilon-greedy exploration.
    """
    num_states = env.num_states
    num_actions = env.num_actions

    Q = np.zeros((num_states, num_actions))
    returns = {(s, a): [] for s in range(num_states) for a in range(num_actions)}

    for episode in range(num_episodes):
        # Generate episode with epsilon-greedy policy
        trajectory = []
        state = env.reset()
        done = False

        while not done:
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(num_actions)
            else:
                action = np.argmax(Q[state])

            next_state, reward, done, _ = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state

        # Update Q-values
        G = 0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G
            returns[(state, action)].append(G)
            Q[state, action] = np.mean(returns[(state, action)])

    # Final policy
    policy = np.argmax(Q, axis=1)
    return policy, Q

The Problem with Monte Carlo

Monte Carlo requires complete episodes: updates happen only after an episode terminates, so it cannot learn online within an episode, it does not apply to continuing (non-terminating) tasks, and long episodes make the return estimates very high-variance.

Solution: Temporal Difference (TD) Learning


Temporal Difference Learning

TD methods bootstrap: update using estimated future values.

TD(0) update for V:

$$ V(s) \leftarrow V(s) + \alpha \left[r + \gamma V(s') - V(s)\right] $$

The term $\delta = r + \gamma V(s') - V(s)$ is the TD error.

def td_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """
    TD(0) prediction: estimate V_pi.
    """
    V = np.zeros(env.num_states)

    for episode in range(num_episodes):
        state = env.reset()
        done = False

        while not done:
            action = policy[state]
            next_state, reward, done, _ = env.step(action)

            # TD update
            td_error = reward + gamma * V[next_state] - V[state]
            V[state] += alpha * td_error

            state = next_state

    return V

TD vs Monte Carlo: The Tradeoff

| Aspect | Monte Carlo | TD(0) |
|--------|-------------|-------|
| Bias | Unbiased | Biased (bootstraps) |
| Variance | High | Lower |
| Learns from | Complete episodes | Single transitions |
| Works for | Episodic tasks | All tasks |

TD(0) is biased because it uses $V(s')$ which is itself an estimate. But this bias decreases as learning progresses.


SARSA: On-Policy TD Control

SARSA learns Q-values using the transition $(s, a, r, s', a')$.

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma Q(s', a') - Q(s, a)\right] $$

def sarsa(env, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """
    SARSA: On-policy TD control.

    The name comes from: S(tate), A(ction), R(eward), S'(next state), A'(next action)
    """
    Q = np.zeros((env.num_states, env.num_actions))

    for episode in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, epsilon)
        done = False

        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(Q, next_state, epsilon)

            # SARSA update
            td_target = reward + gamma * Q[next_state, next_action]
            td_error = td_target - Q[state, action]
            Q[state, action] += alpha * td_error

            state = next_state
            action = next_action

    return np.argmax(Q, axis=1), Q


def epsilon_greedy(Q, state, epsilon):
    """Select action using epsilon-greedy policy."""
    if np.random.random() < epsilon:
        return np.random.randint(Q.shape[1])
    return np.argmax(Q[state])

Q-Learning: Off-Policy TD Control

Q-learning bootstraps from the best next action, regardless of the action the agent actually takes next:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right] $$

def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """
    Q-Learning: Off-policy TD control.

    Key difference from SARSA: uses max_a' Q(s', a') instead of Q(s', a')
    """
    Q = np.zeros((env.num_states, env.num_actions))

    for episode in range(num_episodes):
        state = env.reset()
        done = False

        while not done:
            # Choose action (behavior policy)
            action = epsilon_greedy(Q, state, epsilon)

            next_state, reward, done, _ = env.step(action)

            # Q-learning update (target policy is greedy)
            td_target = reward + gamma * np.max(Q[next_state])
            td_error = td_target - Q[state, action]
            Q[state, action] += alpha * td_error

            state = next_state

    return np.argmax(Q, axis=1), Q

SARSA vs Q-Learning: On-Policy vs Off-Policy

On-policy (SARSA): learns the value of the policy it is actually following, exploration included, so it converges to the best epsilon-greedy policy and behaves more cautiously while exploring.

Off-policy (Q-Learning): follows an exploratory behavior policy but learns the value of the greedy target policy, so it estimates the optimal Q-values directly.

# The key difference in one line:
# SARSA:      td_target = r + gamma * Q[s', a']           # a' is the action we'll actually take
# Q-Learning: td_target = r + gamma * max_a' Q[s', a']    # a' is whatever is best

The Cliff Walking Example

Classic example showing SARSA vs Q-Learning behavior:

Start: S                     Goal: G
[S][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][G]
[C][C][C][C][C][C][C][C][C][C][C][C]  <- Cliff (reward = -100, reset)

Regular step: reward = -1. Q-Learning learns the optimal path that hugs the cliff edge (shortest, but risky while exploring with epsilon-greedy), whereas SARSA's updates account for its own exploratory actions, so it learns a safer path further from the cliff and earns more reward during training. A runnable comparison sketch follows.
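To reproduce this comparison, here is a minimal sketch (the SimpleEnv adapter is our own assumption, not a gymnasium API) that wraps gymnasium's CliffWalking-v0 to match the env interface used by sarsa() and q_learning() above:

import gymnasium as gym

class SimpleEnv:
    """Adapt a gymnasium env to the interface assumed above:
    reset() -> state, step(a) -> (state, reward, done, info)."""
    def __init__(self, gym_env):
        self.gym_env = gym_env
        self.num_states = gym_env.observation_space.n
        self.num_actions = gym_env.action_space.n

    def reset(self):
        state, _info = self.gym_env.reset()
        return state

    def step(self, action):
        state, reward, terminated, truncated, info = self.gym_env.step(action)
        return state, reward, terminated or truncated, info


env = SimpleEnv(gym.make('CliffWalking-v0'))
policy_sarsa, _ = sarsa(env, num_episodes=500)
policy_ql, _ = q_learning(env, num_episodes=500)
# Inspect the greedy actions along the row next to the cliff: Q-Learning's policy
# typically hugs the cliff, while SARSA's steps away from it first.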

Eligibility Traces: TD(lambda)

TD(lambda) blends TD and Monte Carlo: $\lambda = 0$ recovers TD(0), $\lambda = 1$ with accumulating traces approximates Monte Carlo, and intermediate values trade off bias against variance. An eligibility trace records how recently each state was visited, so each TD error can be credited backwards to the states that led to it:

def td_lambda(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99, lambd=0.8):
    """
    TD(lambda) with eligibility traces.
    """
    V = np.zeros(env.num_states)

    for episode in range(num_episodes):
        state = env.reset()
        done = False

        # Eligibility traces
        E = np.zeros(env.num_states)

        while not done:
            action = policy[state]
            next_state, reward, done, _ = env.step(action)

            # TD error
            td_error = reward + gamma * V[next_state] - V[state]

            # Update eligibility
            E[state] += 1  # accumulating traces

            # Update all states proportional to their eligibility
            V += alpha * td_error * E

            # Decay eligibility
            E *= gamma * lambd

            state = next_state

    return V

Capstone Connection: RL in AI Alignment

Understanding tabular RL is crucial for alignment work: reward modeling, safe exploration, value alignment, and scalable oversight all build on the value functions, Bellman backups, and exploration strategies introduced here.


🎓 Tyla's Exercise

Theoretical Challenges

  1. Convergence of Value Iteration: Prove that value iteration converges to $V^*$ by showing that the Bellman optimality operator is a contraction with factor $\gamma$.

  2. Policy Iteration Correctness: Prove that policy improvement never decreases the value function, i.e., $V_{\pi'} \geq V_\pi$ for all states.

  3. TD(0) Convergence: Under what conditions does TD(0) converge to the true value function? What role does the learning rate schedule play?

  4. On-Policy vs Off-Policy: Formally characterize the difference between SARSA and Q-Learning in terms of the Bellman equations they're approximating.

  5. Bias-Variance in TD(lambda): Derive how the bias and variance of TD(lambda) estimates change as a function of lambda.


💻 Aaliyah's Exercise

Build a Complete RL Library

class TabularRL:
    """Complete tabular RL implementation."""

    def __init__(self, env):
        self.env = env
        self.num_states = env.observation_space.n
        self.num_actions = env.action_space.n

    # === Dynamic Programming ===

    def policy_evaluation(self, policy, gamma=0.99, method='iterative'):
        """
        Evaluate a policy.

        Args:
            policy: Array of shape (num_states,)
            gamma: Discount factor
            method: 'iterative' or 'exact'
        """
        # YOUR CODE HERE
        pass

    def policy_improvement(self, V, gamma=0.99):
        """
        Compute greedy policy with respect to V.
        """
        # YOUR CODE HERE
        pass

    def policy_iteration(self, gamma=0.99):
        """
        Find optimal policy via policy iteration.
        """
        # YOUR CODE HERE
        pass

    def value_iteration(self, gamma=0.99, theta=1e-8):
        """
        Find optimal policy via value iteration.
        """
        # YOUR CODE HERE
        pass

    # === Model-Free Learning ===

    def monte_carlo(self, num_episodes=1000, gamma=0.99, epsilon=0.1):
        """
        Monte Carlo control with epsilon-greedy.
        """
        # YOUR CODE HERE
        pass

    def sarsa(self, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        """
        SARSA: On-policy TD control.
        """
        # YOUR CODE HERE
        pass

    def q_learning(self, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        """
        Q-Learning: Off-policy TD control.
        """
        # YOUR CODE HERE
        pass

    # === Utilities ===

    def evaluate_policy(self, policy, num_episodes=100, gamma=0.99):
        """
        Empirically evaluate a policy by running episodes.
        """
        # YOUR CODE HERE
        pass

    def visualize_policy(self, policy):
        """
        Visualize policy (environment-specific).
        """
        # YOUR CODE HERE
        pass


# Test on CliffWalking-v0
import gymnasium as gym

env = gym.make('CliffWalking-v0')
rl = TabularRL(env)

# Compare algorithms
policy_sarsa, Q_sarsa = rl.sarsa(num_episodes=500)
policy_ql, Q_ql = rl.q_learning(num_episodes=500)

print("SARSA policy performance:", rl.evaluate_policy(policy_sarsa))
print("Q-Learning policy performance:", rl.evaluate_policy(policy_ql))

📚 Maneesha's Reflection

Pedagogical Analysis

  1. Dynamic Programming as Curriculum Design: Policy iteration alternates between evaluation (assessment) and improvement (adaptation). How does this mirror effective instructional design? What would "value iteration" look like in curriculum development?

  2. The Exploration-Exploitation Tradeoff in Education: Students must balance practicing what they know (exploitation) vs. trying new approaches (exploration). How do epsilon-greedy strategies inform teaching practice? What's the optimal "epsilon" for a learner?

  3. Bootstrapping and Scaffolding: TD learning "bootstraps" from incomplete information. How does this relate to scaffolded instruction, where learners build on partial understanding?

  4. On-Policy vs Off-Policy Learning: SARSA learns from its own behavior; Q-Learning can learn from others' experiences. What are the educational analogs? When should students learn from their own mistakes vs. observing others?

  5. The Credit Assignment Problem: When learning fails, which "step" was wrong? This is challenging in both RL and education. How do effective teachers help students identify where their reasoning went wrong?

  6. Tabular Limits as Expertise Boundaries: Tabular methods fail when state spaces are too large. Similarly, expert knowledge must generalize beyond memorized cases. How do experts "generalize" their tabular knowledge? What can RL teach us about expertise development?


Summary: The Tabular RL Toolkit

Planning (Known MDP):

| Algorithm | Use Case |
|-----------|----------|
| Policy Evaluation | Compute V for a given policy |
| Policy Iteration | Find optimal policy (few iterations) |
| Value Iteration | Find optimal policy (simple) |

Learning (Unknown MDP):

| Algorithm | Type | Key Feature |
|-----------|------|-------------|
| Monte Carlo | Episode-based | Unbiased but high variance |
| TD(0) | One-step | Bootstraps, lower variance |
| SARSA | On-policy TD | Safe, learns epsilon-greedy optimal |
| Q-Learning | Off-policy TD | Learns true optimal, can diverge |


What's Next?

Tabular methods don't scale. Real problems have enormous or continuous state spaces (images, sensor readings, text), far too many states to enumerate, let alone visit each one often enough to learn its value.

Solution: Function Approximation, representing $V$ or $Q$ with a parameterized function (linear features, neural networks) instead of a table.

Key insight: Everything we learned today remains foundational. Deep RL is just tabular RL with neural network approximators.
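As a tiny preview (a sketch under our own simplifying assumptions, not code from this lecture): with a linear approximator and one-hot state-action features, the semi-gradient Q-learning update reduces exactly to the tabular update above, which is the sense in which today's material carries over directly.

import numpy as np

# Linear Q with one-hot (s, a) features: the weight matrix *is* the Q-table.
def q_hat(w, s, a):
    return w[s, a]

def semi_gradient_q_update(w, s, a, r, s_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * np.max(w[s_next])
    td_error = td_target - q_hat(w, s, a)
    w[s, a] += alpha * td_error  # gradient of q_hat w.r.t. w[s, a] is 1
    return w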