Tabular RL Methods
From theory to computation: algorithms that find optimal policies when we can enumerate all states.
What Are Tabular Methods?
Tabular methods store values explicitly for every state-action pair. They work when:
- State space is small enough to enumerate
- Action space is discrete and manageable
- We can visit states multiple times
The "table" is literally an array:
import numpy as np

# Q-table: value of each (state, action) pair
Q = np.zeros((num_states, num_actions))
# V-table: value of each state
V = np.zeros(num_states)
# Policy table (deterministic): action for each state
pi = np.zeros(num_states, dtype=int)
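For a sense of scale (illustrative numbers, not taken from anything above): memory grows as num_states * num_actions, which is exactly what stops working for large problems.

num_states, num_actions = 48, 4    # e.g. a 4x12 gridworld with 4 moves
Q = np.zeros((num_states, num_actions))
print(Q.nbytes)    # 48 * 4 * 8 bytes = 1536 bytes of float64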
Two Paradigms: Planning vs Learning
Planning (Dynamic Programming):
- We know the MDP ($T$ and $R$)
- We can compute values directly
- Algorithms: Policy Evaluation, Policy Iteration, Value Iteration
Learning (Model-Free):
- We don't know $T$ and $R$
- We learn from experience (samples)
- Algorithms: Monte Carlo, TD Learning, Q-Learning, SARSA
Today we cover both.
Dynamic Programming: The Foundation
When we know the environment dynamics, we can solve MDPs exactly.
Key insight: For a fixed policy, the Bellman equation is a system of linear equations. We can solve it iteratively or directly.
# The Bellman equation for policy pi:
# V_pi(s) = sum_s' T(s'|s, pi(s)) * [R(s, pi(s), s') + gamma * V_pi(s')]
# In matrix form:
# V = R_pi + gamma * P_pi @ V
# V = (I - gamma * P_pi)^(-1) @ R_pi
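The dynamic-programming code below assumes an environment object that exposes num_states, num_actions, a transition tensor env.T of shape (num_states, num_actions, num_states), and a reward tensor env.R of the same shape. As a minimal sketch (the ToyMDP class and its numbers are invented here purely for illustration), such an object can be written by hand:

class ToyMDP:
    """Hypothetical 2-state, 2-action MDP exposing the T/R interface assumed below."""
    def __init__(self):
        self.num_states = 2
        self.num_actions = 2
        # T[s, a, s'] = probability of landing in s' after taking action a in state s
        self.T = np.zeros((2, 2, 2))
        self.T[0, 0] = [0.9, 0.1]    # action 0 in state 0: mostly stay put
        self.T[0, 1] = [0.2, 0.8]    # action 1 in state 0: mostly move to state 1
        self.T[1, 0] = [0.8, 0.2]
        self.T[1, 1] = [0.1, 0.9]
        # R[s, a, s'] = reward for the transition; landing in state 1 pays +1
        self.R = np.zeros((2, 2, 2))
        self.R[:, :, 1] = 1.0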
Policy Evaluation: Computing V for a Fixed Policy
Given a policy $\pi$, we want to find $V_\pi$.
Iterative approach (simpler, always works):
$$ V_{k+1}(s) = \sum_{s'} T(s'|s, \pi(s)) \left[R(s, \pi(s), s') + \gamma V_k(s')\right] $$
def policy_evaluation_iterative(env, policy, gamma=0.99, theta=1e-8):
"""
Compute V_pi by iterating the Bellman equation.
Args:
env: Environment with env.T and env.R
policy: Array of shape (num_states,) with action for each state
gamma: Discount factor
theta: Convergence threshold
"""
num_states = env.num_states
V = np.zeros(num_states)
while True:
V_new = np.zeros(num_states)
for s in range(num_states):
a = policy[s]
for s_next in range(num_states):
prob = env.T[s, a, s_next]
reward = env.R[s, a, s_next]
V_new[s] += prob * (reward + gamma * V[s_next])
# Check convergence
delta = np.max(np.abs(V_new - V))
V = V_new
if delta < theta:
break
return V
Policy Evaluation: The Vectorized Version
For efficiency, avoid explicit loops:
def policy_evaluation_vectorized(env, policy, gamma=0.99, theta=1e-8):
"""
Vectorized policy evaluation using matrix operations.
"""
num_states = env.num_states
V = np.zeros(num_states)
# Get transition and reward matrices for this policy
# T_pi[s, s'] = T[s, pi(s), s']
state_indices = np.arange(num_states)
T_pi = env.T[state_indices, policy, :] # Shape: (num_states, num_states)
R_pi = env.R[state_indices, policy, :] # Shape: (num_states, num_states)
# Expected reward for each state: r_pi[s] = sum_s' T_pi[s,s'] * R_pi[s,s']
r_pi = np.sum(T_pi * R_pi, axis=1)
while True:
# Bellman update: V = r_pi + gamma * T_pi @ V
V_new = r_pi + gamma * (T_pi @ V)
if np.max(np.abs(V_new - V)) < theta:
return V_new
V = V_new
Policy Evaluation: Exact Solution
We can also solve the linear system directly:
$$ V_\pi = (I - \gamma P_\pi)^{-1} r_\pi $$
def policy_evaluation_exact(env, policy, gamma=0.99):
"""
Solve the Bellman equation directly via matrix inversion.
"""
num_states = env.num_states
state_indices = np.arange(num_states)
T_pi = env.T[state_indices, policy, :]
R_pi = env.R[state_indices, policy, :]
r_pi = np.sum(T_pi * R_pi, axis=1)
# Solve: (I - gamma * T_pi) @ V = r_pi
A = np.eye(num_states) - gamma * T_pi
V = np.linalg.solve(A, r_pi)
return V
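As a quick sanity check on the hypothetical ToyMDP sketched earlier, all three evaluators should agree up to the convergence threshold:

env = ToyMDP()
policy = np.array([1, 0])    # an arbitrary fixed policy to evaluate
V_iter = policy_evaluation_iterative(env, policy)
V_vec = policy_evaluation_vectorized(env, policy)
V_exact = policy_evaluation_exact(env, policy)
print(np.max(np.abs(V_iter - V_exact)), np.max(np.abs(V_vec - V_exact)))    # both tiny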
Policy Improvement: Finding a Better Policy
Given $V_\pi$, we can find a better policy by being greedy with respect to Q:
$$ \pi'(s) = \arg\max_a Q_\pi(s, a) = \arg\max_a \sum_{s'} T(s'|s,a)\left[R(s,a,s') + \gamma V_\pi(s')\right] $$
def policy_improvement(env, V, gamma=0.99):
"""
Compute the greedy policy with respect to V.
Returns:
policy: Array of shape (num_states,) with best action for each state
"""
num_states = env.num_states
num_actions = env.num_actions
# Compute Q(s, a) for all state-action pairs
# Q[s, a] = sum_s' T[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
Q = np.zeros((num_states, num_actions))
for s in range(num_states):
for a in range(num_actions):
for s_next in range(num_states):
prob = env.T[s, a, s_next]
reward = env.R[s, a, s_next]
Q[s, a] += prob * (reward + gamma * V[s_next])
# Greedy policy: take action with highest Q-value
policy = np.argmax(Q, axis=1)
return policy
Policy Improvement: Vectorized
def policy_improvement_vectorized(env, V, gamma=0.99):
"""
Vectorized policy improvement.
"""
# Q[s, a] = sum_s' T[s,a,s'] * (R[s,a,s'] + gamma * V[s'])
# Using einsum for clarity
    Q = np.einsum('sak,sak->sa', env.T, env.R + gamma * V)  # k indexes the next state; summing over it gives Q[s, a]
return np.argmax(Q, axis=1)
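A quick consistency check, again on the hypothetical ToyMDP, is that the loop and vectorized versions return the same greedy policy:

env = ToyMDP()
V = policy_evaluation_exact(env, np.zeros(env.num_states, dtype=int))
print(policy_improvement(env, V))               # greedy policy from the loop version
print(policy_improvement_vectorized(env, V))    # should be identical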
Policy Iteration: The Full Algorithm
Alternate between evaluation and improvement until convergence:
def policy_iteration(env, gamma=0.99):
"""
Find optimal policy via policy iteration.
Returns:
policy: Optimal policy
V: Value function for optimal policy
"""
num_states = env.num_states
# Initialize with arbitrary policy
policy = np.zeros(num_states, dtype=int)
while True:
# Policy Evaluation
V = policy_evaluation_exact(env, policy, gamma)
# Policy Improvement
new_policy = policy_improvement(env, V, gamma)
# Check for convergence
if np.array_equal(new_policy, policy):
return policy, V
policy = new_policy
Why does this converge?
- Each improvement step produces a strictly better policy unless the current policy is already optimal (see the chain of inequalities below)
- Finite number of policies means we must converge
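The first bullet is the policy improvement theorem. The usual argument (sketched here, not a full proof) starts from the greedy choice and then repeatedly unrolls the Bellman equation, each expectation being over transitions generated while following $\pi'$:
$$ V_\pi(s) \le Q_\pi(s, \pi'(s)) = \mathbb{E}\left[r + \gamma V_\pi(s')\right] \le \mathbb{E}\left[r + \gamma Q_\pi(s', \pi'(s'))\right] \le \cdots \le V_{\pi'}(s) $$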
Value Iteration: A Shortcut
Value iteration combines evaluation and improvement in one update:
$$ V_{k+1}(s) = \max_a \sum_{s'} T(s'|s,a)\left[R(s,a,s') + \gamma V_k(s')\right] $$
def value_iteration(env, gamma=0.99, theta=1e-8):
"""
Find optimal value function and policy via value iteration.
"""
num_states = env.num_states
num_actions = env.num_actions
V = np.zeros(num_states)
while True:
V_new = np.zeros(num_states)
for s in range(num_states):
# Compute Q(s, a) for all actions
q_values = np.zeros(num_actions)
for a in range(num_actions):
for s_next in range(num_states):
prob = env.T[s, a, s_next]
reward = env.R[s, a, s_next]
q_values[a] += prob * (reward + gamma * V[s_next])
# Take max over actions
V_new[s] = np.max(q_values)
# Check convergence
if np.max(np.abs(V_new - V)) < theta:
break
V = V_new
# Extract policy from converged V
policy = policy_improvement(env, V, gamma)
return policy, V
Value Iteration: Vectorized
def value_iteration_vectorized(env, gamma=0.99, theta=1e-8):
"""
Vectorized value iteration.
"""
num_states = env.num_states
V = np.zeros(num_states)
while True:
# Compute Q[s, a] for all state-action pairs at once
        Q = np.einsum('sak,sak->sa', env.T, env.R + gamma * V)  # k indexes the next state
# V = max over actions
V_new = np.max(Q, axis=1)
if np.max(np.abs(V_new - V)) < theta:
break
V = V_new
policy = np.argmax(Q, axis=1)
return policy, V
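On the hypothetical ToyMDP, a minimal check is that policy iteration and value iteration agree on the policy and (up to the stopping threshold) on the values:

env = ToyMDP()
pi_policy, V_pi = policy_iteration(env)
vi_policy, V_vi = value_iteration_vectorized(env)
print(pi_policy, vi_policy)            # expected to match
print(np.max(np.abs(V_pi - V_vi)))     # expected to be very small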
Comparing Policy Iteration and Value Iteration
| Aspect | Policy Iteration | Value Iteration |
|---|---|---|
| Per-iteration cost | High (full policy eval) | Low (single backup) |
| Iterations to converge | Few | Many |
| Best when | Exact evaluation is affordable (smaller state spaces) | Each sweep must be cheap (larger state spaces) |
In practice:
- Value iteration is simpler and usually preferred
- Policy iteration can be faster for some problems
- Both find the same optimal policy
When DP Fails: Model-Free Methods
What if we don't know $T$ and $R$? We learn from experience.
Key idea: Replace expectations with samples.
$$ \mathbb{E}[X] \approx \frac{1}{N}\sum_{i=1}^N x_i $$
Instead of computing $\sum_{s'} T(s'|s,a) [\ldots]$, we collect actual transitions.
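A tiny standalone illustration (made-up sample values): the sample average can even be maintained incrementally, which is the update pattern every method below reuses.

mean, n = 0.0, 0
for x in [1.0, 0.0, 1.0, 1.0]:    # pretend these are sampled returns for one state
    n += 1
    mean += (x - mean) / n        # incremental sample average
print(mean)    # 0.75, the same as np.mean([1.0, 0.0, 1.0, 1.0])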
Monte Carlo Methods
The simplest model-free approach: run full episodes, compute returns.
def monte_carlo_prediction(env, policy, num_episodes=1000, gamma=0.99):
"""
Estimate V_pi using Monte Carlo sampling.
Idea: V(s) = average return starting from s
"""
returns = {s: [] for s in range(env.num_states)}
for episode in range(num_episodes):
# Generate an episode following policy
trajectory = []
state = env.reset()
done = False
while not done:
action = policy[state]
next_state, reward, done, _ = env.step(action)
trajectory.append((state, reward))
state = next_state
# Compute returns for each state visited
G = 0
for state, reward in reversed(trajectory):
G = reward + gamma * G
returns[state].append(G)
# Average returns
V = np.zeros(env.num_states)
for s in range(env.num_states):
if returns[s]:
V[s] = np.mean(returns[s])
return V
Monte Carlo Control: Finding Optimal Policies
For control, we need Q-values (so we can improve without knowing $T$):
def monte_carlo_control(env, num_episodes=10000, gamma=0.99, epsilon=0.1):
"""
Monte Carlo control with epsilon-greedy exploration.
"""
num_states = env.num_states
num_actions = env.num_actions
Q = np.zeros((num_states, num_actions))
returns = {(s, a): [] for s in range(num_states) for a in range(num_actions)}
for episode in range(num_episodes):
# Generate episode with epsilon-greedy policy
trajectory = []
state = env.reset()
done = False
while not done:
# Epsilon-greedy action selection
if np.random.random() < epsilon:
action = np.random.randint(num_actions)
else:
action = np.argmax(Q[state])
next_state, reward, done, _ = env.step(action)
trajectory.append((state, action, reward))
state = next_state
# Update Q-values
G = 0
for state, action, reward in reversed(trajectory):
G = reward + gamma * G
returns[(state, action)].append(G)
Q[state, action] = np.mean(returns[(state, action)])
# Final policy
policy = np.argmax(Q, axis=1)
return policy, Q
The Problem with Monte Carlo
Monte Carlo requires complete episodes:
- Must wait until episode ends to learn
- High variance in return estimates
- Doesn't work for continuing (non-episodic) tasks
Solution: Temporal Difference (TD) Learning
Temporal Difference Learning
TD methods bootstrap: update using estimated future values.
TD(0) update for V:
$$ V(s) \leftarrow V(s) + \alpha \left[r + \gamma V(s') - V(s)\right] $$
The term $\delta = r + \gamma V(s') - V(s)$ is the TD error.
def td_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
"""
TD(0) prediction: estimate V_pi.
"""
V = np.zeros(env.num_states)
for episode in range(num_episodes):
state = env.reset()
done = False
while not done:
action = policy[state]
next_state, reward, done, _ = env.step(action)
# TD update
td_error = reward + gamma * V[next_state] - V[state]
V[state] += alpha * td_error
state = next_state
return V
TD vs Monte Carlo: The Tradeoff
| Aspect | Monte Carlo | TD(0) |
|---|---|---|
| Bias | Unbiased | Biased (bootstraps) |
| Variance | High | Lower |
| Learns from | Complete episodes | Single transitions |
| Works for | Episodic tasks | All tasks |
TD(0) is biased because it uses $V(s')$ which is itself an estimate. But this bias decreases as learning progresses.
SARSA: On-Policy TD Control
SARSA learns Q-values using the transition $(s, a, r, s', a')$.
$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma Q(s', a') - Q(s, a)\right] $$
def sarsa(env, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
"""
SARSA: On-policy TD control.
The name comes from: S(tate), A(ction), R(eward), S'(next state), A'(next action)
"""
Q = np.zeros((env.num_states, env.num_actions))
for episode in range(num_episodes):
state = env.reset()
action = epsilon_greedy(Q, state, epsilon)
done = False
while not done:
next_state, reward, done, _ = env.step(action)
next_action = epsilon_greedy(Q, next_state, epsilon)
# SARSA update
td_target = reward + gamma * Q[next_state, next_action]
td_error = td_target - Q[state, action]
Q[state, action] += alpha * td_error
state = next_state
action = next_action
return np.argmax(Q, axis=1), Q
def epsilon_greedy(Q, state, epsilon):
"""Select action using epsilon-greedy policy."""
if np.random.random() < epsilon:
return np.random.randint(Q.shape[1])
return np.argmax(Q[state])
Q-Learning: Off-Policy TD Control
Q-learning uses the best next action, regardless of what we actually take:
$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right] $$
def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
"""
Q-Learning: Off-policy TD control.
Key difference from SARSA: uses max_a' Q(s', a') instead of Q(s', a')
"""
Q = np.zeros((env.num_states, env.num_actions))
for episode in range(num_episodes):
state = env.reset()
done = False
while not done:
# Choose action (behavior policy)
action = epsilon_greedy(Q, state, epsilon)
next_state, reward, done, _ = env.step(action)
# Q-learning update (target policy is greedy)
td_target = reward + gamma * np.max(Q[next_state])
td_error = td_target - Q[state, action]
Q[state, action] += alpha * td_error
state = next_state
return np.argmax(Q, axis=1), Q
SARSA vs Q-Learning: On-Policy vs Off-Policy
On-policy (SARSA):
- Learns about the policy being followed (including exploration)
- Safer behavior during learning
- Converges to optimal epsilon-greedy policy
Off-policy (Q-Learning):
- Learns optimal policy regardless of behavior
- Can learn from any data (replay buffers)
- Converges to optimal greedy policy
# The key difference in one line:
# SARSA:      td_target = r + gamma * Q[s', a']          # a' is the action we will actually take
# Q-Learning: td_target = r + gamma * max_a' Q[s', a']   # the best next action, whether or not we take it
The Cliff Walking Example
Classic example showing SARSA vs Q-Learning behavior:
Start: S    Goal: G    Cliff: C (stepping in gives reward = -100 and a reset to the start)
[ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
[ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
[ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
[S][C][C][C][C][C][C][C][C][C][C][G]
Regular step: reward = -1
- Q-Learning finds the optimal path (along the cliff edge)
- SARSA finds a safer path (away from the cliff edge) because it accounts for exploration mistakes; a sketch for reproducing this comparison follows below
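To reproduce this with the functions above, note that they assume the old-style interface where env.reset() returns a state and env.step() returns a 4-tuple, while Gymnasium's reset() returns (state, info) and step() returns (state, reward, terminated, truncated, info). A small adapter (a sketch; OldGymAdapter is a name invented here, not a library class) bridges the two:

import gymnasium as gym

class OldGymAdapter:
    """Sketch: wrap a Gymnasium env to match the (state, reward, done, info) interface used above."""
    def __init__(self, env):
        self.env = env
        self.num_states = env.observation_space.n
        self.num_actions = env.action_space.n
    def reset(self):
        state, _info = self.env.reset()
        return state
    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        return state, reward, terminated or truncated, info

env = OldGymAdapter(gym.make('CliffWalking-v0'))
policy_sarsa, _ = sarsa(env, num_episodes=500)
policy_ql, _ = q_learning(env, num_episodes=500)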
Eligibility Traces: TD(lambda)
TD(lambda) blends TD and Monte Carlo:
- $\lambda = 0$: Pure TD (one-step)
- $\lambda = 1$: Monte Carlo (full return)
- $0 < \lambda < 1$: Weighted combination
def td_lambda(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99, lambd=0.8):
"""
TD(lambda) with eligibility traces.
"""
V = np.zeros(env.num_states)
for episode in range(num_episodes):
state = env.reset()
done = False
# Eligibility traces
E = np.zeros(env.num_states)
while not done:
action = policy[state]
next_state, reward, done, _ = env.step(action)
# TD error
td_error = reward + gamma * V[next_state] - V[state]
# Update eligibility
E[state] += 1 # accumulating traces
# Update all states proportional to their eligibility
V += alpha * td_error * E
# Decay eligibility
E *= gamma * lambd
state = next_state
return V
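To see how traces spread credit backward in time, here is a tiny standalone sketch (made-up visit order, using the same gamma and lambda as the defaults above) of what the eligibility vector looks like after each decay:

E = np.zeros(4)
for s in [0, 1, 2, 3]:    # visit states 0, 1, 2, 3 in order
    E[s] += 1.0           # accumulating trace for the visited state
    E *= 0.99 * 0.8       # decay every trace by gamma * lambda
print(np.round(E, 3))     # earlier states carry exponentially smaller traces

A common variant, "replacing traces", sets E[state] = 1.0 instead of accumulating, which keeps the effective step size bounded for states revisited within an episode.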
Capstone Connection: RL in AI Alignment
Understanding tabular RL is crucial for alignment work:
Reward Modeling
- RLHF uses reward models trained on human preferences
- The reward model is the specification of human values
- Understanding how policies optimize rewards reveals alignment risks
Safe Exploration
- SARSA's conservative behavior vs Q-learning's optimism
- How do we ensure AI systems explore safely?
- On-policy vs off-policy has safety implications
Value Alignment
- Value functions in RL $\approx$ values in ethics
- Inverse RL: inferring rewards from behavior
- The alignment problem: whose values? Which rewards?
Scalable Oversight
- How do we evaluate AI behavior we can't fully understand?
- RL provides tools for reasoning about long-term consequences
- Value iteration as a model for iterated amplification
🎓 Tyla's Exercise
Theoretical Challenges
Convergence of Value Iteration: Prove that value iteration converges to $V^*$ by showing that the Bellman optimality operator is a contraction with factor $\gamma$.
Policy Iteration Correctness: Prove that policy improvement never decreases the value function, i.e., $V_{\pi'} \geq V_\pi$ for all states.
TD(0) Convergence: Under what conditions does TD(0) converge to the true value function? What role does the learning rate schedule play?
On-Policy vs Off-Policy: Formally characterize the difference between SARSA and Q-Learning in terms of the Bellman equations they're approximating.
Bias-Variance in TD(lambda): Derive how the bias and variance of TD(lambda) estimates change as a function of lambda.
💻 Aaliyah's Exercise
Build a Complete RL Library
class TabularRL:
"""Complete tabular RL implementation."""
def __init__(self, env):
self.env = env
self.num_states = env.observation_space.n
self.num_actions = env.action_space.n
# === Dynamic Programming ===
def policy_evaluation(self, policy, gamma=0.99, method='iterative'):
"""
Evaluate a policy.
Args:
policy: Array of shape (num_states,)
gamma: Discount factor
method: 'iterative' or 'exact'
"""
# YOUR CODE HERE
pass
def policy_improvement(self, V, gamma=0.99):
"""
Compute greedy policy with respect to V.
"""
# YOUR CODE HERE
pass
def policy_iteration(self, gamma=0.99):
"""
Find optimal policy via policy iteration.
"""
# YOUR CODE HERE
pass
def value_iteration(self, gamma=0.99, theta=1e-8):
"""
Find optimal policy via value iteration.
"""
# YOUR CODE HERE
pass
# === Model-Free Learning ===
def monte_carlo(self, num_episodes=1000, gamma=0.99, epsilon=0.1):
"""
Monte Carlo control with epsilon-greedy.
"""
# YOUR CODE HERE
pass
def sarsa(self, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
"""
SARSA: On-policy TD control.
"""
# YOUR CODE HERE
pass
def q_learning(self, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
"""
Q-Learning: Off-policy TD control.
"""
# YOUR CODE HERE
pass
# === Utilities ===
def evaluate_policy(self, policy, num_episodes=100, gamma=0.99):
"""
Empirically evaluate a policy by running episodes.
"""
# YOUR CODE HERE
pass
def visualize_policy(self, policy):
"""
Visualize policy (environment-specific).
"""
# YOUR CODE HERE
pass
# Test on CliffWalking-v0
import gymnasium as gym
env = gym.make('CliffWalking-v0')
rl = TabularRL(env)
# Compare algorithms
policy_sarsa, Q_sarsa = rl.sarsa(num_episodes=500)
policy_ql, Q_ql = rl.q_learning(num_episodes=500)
print("SARSA policy performance:", rl.evaluate_policy(policy_sarsa))
print("Q-Learning policy performance:", rl.evaluate_policy(policy_ql))
📚 Maneesha's Reflection
Pedagogical Analysis
Dynamic Programming as Curriculum Design: Policy iteration alternates between evaluation (assessment) and improvement (adaptation). How does this mirror effective instructional design? What would "value iteration" look like in curriculum development?
The Exploration-Exploitation Tradeoff in Education: Students must balance practicing what they know (exploitation) vs. trying new approaches (exploration). How do epsilon-greedy strategies inform teaching practice? What's the optimal "epsilon" for a learner?
Bootstrapping and Scaffolding: TD learning "bootstraps" from incomplete information. How does this relate to scaffolded instruction, where learners build on partial understanding?
On-Policy vs Off-Policy Learning: SARSA learns from its own behavior; Q-Learning can learn from others' experiences. What are the educational analogs? When should students learn from their own mistakes vs. observing others?
The Credit Assignment Problem: When learning fails, which "step" was wrong? This is challenging in both RL and education. How do effective teachers help students identify where their reasoning went wrong?
Tabular Limits as Expertise Boundaries: Tabular methods fail when state spaces are too large. Similarly, expert knowledge must generalize beyond memorized cases. How do experts "generalize" their tabular knowledge? What can RL teach us about expertise development?
Summary: The Tabular RL Toolkit
Planning (Known MDP):

| Algorithm | Use Case |
|-----------|----------|
| Policy Evaluation | Compute V for a given policy |
| Policy Iteration | Find the optimal policy (few iterations) |
| Value Iteration | Find the optimal policy (simple) |
Learning (Unknown MDP):

| Algorithm | Type | Key Feature |
|-----------|------|-------------|
| Monte Carlo | Episode-based | Unbiased but high variance |
| TD(0) | One-step | Bootstraps, lower variance |
| SARSA | On-policy TD | Safe, learns the optimal epsilon-greedy policy |
| Q-Learning | Off-policy TD | Learns the optimal greedy policy; divergence becomes a concern only with function approximation |
What's Next?
Tabular methods don't scale. Real problems have:
- Continuous states (positions, velocities)
- High-dimensional observations (images)
- Combinatorially large state spaces
Solution: Function Approximation
- Replace tables with neural networks
- Generalize across similar states
- This leads to Deep RL: DQN, Policy Gradients, PPO
Key insight: Everything we learned today remains foundational. Deep RL is just tabular RL with neural network approximators.