Tabular RL Methods
From theory to computation: algorithms that find optimal policies when we can enumerate all states.
What Are Tabular Methods?
Tabular methods store values explicitly for every state-action pair. They work when:
- State space is small enough to enumerate
- Action space is discrete and manageable
- We can visit states multiple times
The "table" is literally an array:
import numpy as np

# Q-table: value of each (state, action) pair
Q = np.zeros((num_states, num_actions))
# V-table: value of each state
V = np.zeros(num_states)
# Policy table (deterministic): action for each state
pi = np.zeros(num_states, dtype=int)
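For a sense of scale (illustrative numbers, not taken from anything above): memory grows as num_states * num_actions, which is exactly what stops working for large problems.

num_states, num_actions = 48, 4    # e.g. a 4x12 gridworld with 4 moves
Q = np.zeros((num_states, num_actions))
print(Q.nbytes)    # 48 * 4 * 8 bytes = 1536 bytes of float64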
Two Paradigms: Planning vs Learning
Planning (Dynamic Programming):
- We know the MDP ($T$ and $R$)
- We can compute values directly
- Algorithms: Policy Evaluation, Policy Iteration, Value Iteration
Learning (Model-Free):
- We don't know $T$ and $R$
- We learn from experience (samples)
- Algorithms: Monte Carlo, TD Learning, Q-Learning, SARSA
Today we cover both.
Dynamic Programming: The Foundation
When we know the environment dynamics, we can solve MDPs exactly.
Key insight: For a fixed policy, the Bellman equation is a system of linear equations. We can solve it iteratively or directly.
# The Bellman equation for policy pi:
# V_pi(s) = sum_s' T(s'|s, pi(s)) * [R(s, pi(s), s') + gamma * V_pi(s')]
# In matrix form:
# V = R_pi + gamma * P_pi @ V
# V = (I - gamma * P_pi)^(-1) @ R_pi
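The dynamic-programming code below assumes an environment object that exposes num_states, num_actions, a transition tensor env.T of shape (num_states, num_actions, num_states), and a reward tensor env.R of the same shape. As a minimal sketch (the ToyMDP class and its numbers are invented here purely for illustration), such an object can be written by hand:

class ToyMDP:
    """Hypothetical 2-state, 2-action MDP exposing the T/R interface assumed below."""
    def __init__(self):
        self.num_states = 2
        self.num_actions = 2
        # T[s, a, s'] = probability of landing in s' after taking action a in state s
        self.T = np.zeros((2, 2, 2))
        self.T[0, 0] = [0.9, 0.1]    # action 0 in state 0: mostly stay put
        self.T[0, 1] = [0.2, 0.8]    # action 1 in state 0: mostly move to state 1
        self.T[1, 0] = [0.8, 0.2]
        self.T[1, 1] = [0.1, 0.9]
        # R[s, a, s'] = reward for the transition; landing in state 1 pays +1
        self.R = np.zeros((2, 2, 2))
        self.R[:, :, 1] = 1.0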
Policy Evaluation: Computing V for a Fixed Policy
Given a policy $\pi$, we want to find $V_\pi$.
Iterative approach (simpler, always works):
$$ V_{k+1}(s) = \sum_{s'} T(s'|s, \pi(s)) \left[R(s, \pi(s), s') + \gamma V_k(s')\right] $$
def policy_evaluation_iterative(env, policy, gamma=0.99, theta=1e-8):
"""
Compute V_pi by iterating the Bellman equation.
Args:
env: Environment with env.T and env.R
policy: Array of shape (num_states,) with action for each state
gamma: Discount factor
theta: Convergence threshold
"""
num_states = env.num_states
V = np.zeros(num_states)
while True:
V_new = np.zeros(num_states)
for s in range(num_states):
a = policy[s]
for s_next in range(num_states):
prob = env.T[s, a, s_next]
reward = env.R[s, a, s_next]
V_new[s] += prob * (reward + gamma * V[s_next])
# Check convergence
delta = np.max(np.abs(V_new - V))
V = V_new
if delta < theta:
break
return V
Policy Evaluation: The Vectorized Version
For efficiency, avoid explicit loops:
def policy_evaluation_vectorized(env, policy, gamma=0.99, theta=1e-8):
"""
Vectorized policy evaluation using matrix operations.
"""
num_states = env.num_states
V = np.zeros(num_states)
# Get transition and reward matrices for this policy
# T_pi[s, s'] = T[s, pi(s), s']
state_indices = np.arange(num_states)
T_pi = env.T[state_indices, policy, :] # Shape: (num_states, num_states)
R_pi = env.R[state_indices, policy, :] # Shape: (num_states, num_states)
# Expected reward for each state: r_pi[s] = sum_s' T_pi[s,s'] * R_pi[s,s']
r_pi = np.sum(T_pi * R_pi, axis=1)
while True:
# Bellman update: V = r_pi + gamma * T_pi @ V
V_new = r_pi + gamma * (T_pi @ V)
if np.max(np.abs(V_new - V)) < theta:
return V_new
V = V_new
Policy Evaluation: Exact Solution
We can also solve the linear system directly:
$$ V_\pi = (I - \gamma P_\pi)^{-1} r_\pi $$
def policy_evaluation_exact(env, policy, gamma=0.99):
"""
Solve the Bellman equation directly via matrix inversion.
"""
num_states = env.num_states
state_indices = np.arange(num_states)
T_pi = env.T[state_indices, policy, :]
R_pi = env.R[state_indices, policy, :]
r_pi = np.sum(T_pi * R_pi, axis=1)
# Solve: (I - gamma * T_pi) @ V = r_pi
A = np.eye(num_states) - gamma * T_pi
V = np.linalg.solve(A, r_pi)
return V
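As a quick sanity check on the hypothetical ToyMDP sketched earlier, all three evaluators should agree up to the convergence threshold:

env = ToyMDP()
policy = np.array([1, 0])    # an arbitrary fixed policy to evaluate
V_iter = policy_evaluation_iterative(env, policy)
V_vec = policy_evaluation_vectorized(env, policy)
V_exact = policy_evaluation_exact(env, policy)
print(np.max(np.abs(V_iter - V_exact)), np.max(np.abs(V_vec - V_exact)))    # both tiny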
Policy Improvement: Finding a Better Policy
Given $V_\pi$, we can find a better policy by being greedy with respect to Q:
$$ \pi'(s) = \arg\max_a Q_\pi(s, a) = \arg\max_a \sum_{s'} T(s'|s,a)\left[R(s,a,s') + \gamma V_\pi(s')\right] $$
def policy_improvement(env, V, gamma=0.99):
"""
Compute the greedy policy with respect to V.
Returns:
policy: Array of shape (num_states,) with best action for each state
"""
num_states = env.num_states
num_actions = env.num_actions
# Compute Q(s, a) for all state-action pairs
# Q[s, a] = sum_s' T[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
Q = np.zeros((num_states, num_actions))
for s in range(num_states):
for a in range(num_actions):
for s_next in range(num_states):
prob = env.T[s, a, s_next]
reward = env.R[s, a, s_next]
Q[s, a] += prob * (reward + gamma * V[s_next])
# Greedy policy: take action with highest Q-value
policy = np.argmax(Q, axis=1)
return policy
Policy Improvement: Vectorized
def policy_improvement_vectorized(env, V, gamma=0.99):
"""
Vectorized policy improvement.
"""
# Q[s, a] = sum_s' T[s,a,s'] * (R[s,a,s'] + gamma * V[s'])
# Using einsum for clarity
    Q = np.einsum('sak,sak->sa', env.T, env.R + gamma * V)  # k indexes the next state; summing over it gives Q[s, a]
return np.argmax(Q, axis=1)
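A quick consistency check, again on the hypothetical ToyMDP, is that the loop and vectorized versions return the same greedy policy:

env = ToyMDP()
V = policy_evaluation_exact(env, np.zeros(env.num_states, dtype=int))
print(policy_improvement(env, V))               # greedy policy from the loop version
print(policy_improvement_vectorized(env, V))    # should be identical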
Policy Iteration: The Full Algorithm
Alternate between evaluation and improvement until convergence:
def policy_iteration(env, gamma=0.99):
"""
Find optimal policy via policy iteration.
Returns:
policy: Optimal policy
V: Value function for optimal policy
"""
num_states = env.num_states
# Initialize with arbitrary policy
policy = np.zeros(num_states, dtype=int)
while True:
# Policy Evaluation
V = policy_evaluation_exact(env, policy, gamma)
# Policy Improvement
new_policy = policy_improvement(env, V, gamma)
# Check for convergence
if np.array_equal(new_policy, policy):
return policy, V
policy = new_policy
Why does this converge?
- Each improvement step produces a strictly better policy unless the current policy is already optimal (see the chain of inequalities below)
- Finite number of policies means we must converge
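The first bullet is the policy improvement theorem. The usual argument (sketched here, not a full proof) starts from the greedy choice and then repeatedly unrolls the Bellman equation, each expectation being over transitions generated while following $\pi'$:
$$ V_\pi(s) \le Q_\pi(s, \pi'(s)) = \mathbb{E}\left[r + \gamma V_\pi(s')\right] \le \mathbb{E}\left[r + \gamma Q_\pi(s', \pi'(s'))\right] \le \cdots \le V_{\pi'}(s) $$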
Value Iteration: A Shortcut
Value iteration combines evaluation and improvement in one update:
$$ V_{k+1}(s) = \max_a \sum_{s'} T(s'|s,a)\left[R(s,a,s') + \gamma V_k(s')\right] $$
def value_iteration(env, gamma=0.99, theta=1e-8):
"""
Find optimal value function and policy via value iteration.
"""
num_states = env.num_states
num_actions = env.num_actions
V = np.zeros(num_states)
while True:
V_new = np.zeros(num_states)
for s in range(num_states):
# Compute Q(s, a) for all actions
q_values = np.zeros(num_actions)
for a in range(num_actions):
for s_next in range(num_states):
prob = env.T[s, a, s_next]
reward = env.R[s, a, s_next]
q_values[a] += prob * (reward + gamma * V[s_next])
# Take max over actions
V_new[s] = np.max(q_values)
# Check convergence
if np.max(np.abs(V_new - V)) < theta:
break
V = V_new
# Extract policy from converged V
policy = policy_improvement(env, V, gamma)
return policy, V
Value Iteration: Vectorized
def value_iteration_vectorized(env, gamma=0.99, theta=1e-8):
"""
Vectorized value iteration.
"""
num_states = env.num_states
V = np.zeros(num_states)
while True:
# Compute Q[s, a] for all state-action pairs at once
        Q = np.einsum('sak,sak->sa', env.T, env.R + gamma * V)  # k indexes the next state
# V = max over actions
V_new = np.max(Q, axis=1)
if np.max(np.abs(V_new - V)) < theta:
break
V = V_new
policy = np.argmax(Q, axis=1)
return policy, V
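On the hypothetical ToyMDP, a minimal check is that policy iteration and value iteration agree on the policy and (up to the stopping threshold) on the values:

env = ToyMDP()
pi_policy, V_pi = policy_iteration(env)
vi_policy, V_vi = value_iteration_vectorized(env)
print(pi_policy, vi_policy)            # expected to match
print(np.max(np.abs(V_pi - V_vi)))     # expected to be very small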
Comparing Policy Iteration and Value Iteration
| Aspect | Policy Iteration | Value Iteration |
|---|---|---|
| Per-iteration cost | High (full policy eval) | Low (single backup) |
| Iterations to converge | Few | Many |
| Best when | Exact evaluation is affordable (smaller state spaces) | Each sweep must be cheap (larger state spaces) |
In practice:
- Value iteration is simpler and usually preferred
- Policy iteration can be faster for some problems
- Both find the same optimal policy
When DP Fails: Model-Free Methods
What if we don't know $T$ and $R$? We learn from experience.
Key idea: Replace expectations with samples.
$$ \mathbb{E}[X] \approx \frac{1}{N}\sum_{i=1}^N x_i $$
Instead of computing $\sum_{s'} T(s'|s,a) [\ldots]$, we collect actual transitions.
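A tiny standalone illustration (made-up sample values): the sample average can even be maintained incrementally, which is the update pattern every method below reuses.

mean, n = 0.0, 0
for x in [1.0, 0.0, 1.0, 1.0]:    # pretend these are sampled returns for one state
    n += 1
    mean += (x - mean) / n        # incremental sample average
print(mean)    # 0.75, the same as np.mean([1.0, 0.0, 1.0, 1.0])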
Monte Carlo Methods
The simplest model-free approach: run full episodes, compute returns.
def monte_carlo_prediction(env, policy, num_episodes=1000, gamma=0.99):
"""
Estimate V_pi using Monte Carlo sampling.
Idea: V(s) = average return starting from s
"""
returns = {s: [] for s in range(env.num_states)}
for episode in range(num_episodes):
# Generate an episode following policy
trajectory = []
state = env.reset()
done = False
while not done:
action = policy[state]
next_state, reward, done, _ = env.step(action)
trajectory.append((state, reward))
state = next_state
# Compute returns for each state visited
G = 0
for state, reward in reversed(trajectory):
G = reward + gamma * G
returns[state].append(G)
# Average returns
V = np.zeros(env.num_states)
for s in range(env.num_states):
if returns[s]:
V[s] = np.mean(returns[s])
return V
Monte Carlo Control: Finding Optimal Policies
For control, we need Q-values (so we can improve without knowing $T$):
def monte_carlo_control(env, num_episodes=10000, gamma=0.99, epsilon=0.1):
"""
Monte Carlo control with epsilon-greedy exploration.
"""
num_states = env.num_states
num_actions = env.num_actions
Q = np.zeros((num_states, num_actions))
returns = {(s, a): [] for s in range(num_states) for a in range(num_actions)}
for episode in range(num_episodes):
# Generate episode with epsilon-greedy policy
trajectory = []
state = env.reset()
done = False
while not done:
# Epsilon-greedy action selection
if np.random.random() < epsilon:
action = np.random.randint(num_actions)
else:
action = np.argmax(Q[state])
next_state, reward, done, _ = env.step(action)
trajectory.append((state, action, reward))
state = next_state
# Update Q-values
G = 0
for state, action, reward in reversed(trajectory):
G = reward + gamma * G
returns[(state, action)].append(G)
Q[state, action] = np.mean(returns[(state, action)])
# Final policy
policy = np.argmax(Q, axis=1)
return policy, Q
The Problem with Monte Carlo
Monte Carlo requires complete episodes:
- Must wait until episode ends to learn
- High variance in return estimates
- Doesn't work for continuing (non-episodic) tasks
Solution: Temporal Difference (TD) Learning
Temporal Difference Learning
TD methods bootstrap: update using estimated future values.
TD(0) update for V:
$$ V(s) \leftarrow V(s) + \alpha \left[r + \gamma V(s') - V(s)\right] $$
The term $\delta = r + \gamma V(s') - V(s)$ is the TD error.
def td_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
"""
TD(0) prediction: estimate V_pi.
"""
V = np.zeros(env.num_states)
for episode in range(num_episodes):
state = env.reset()
done = False
while not done:
action = policy[state]
next_state, reward, done, _ = env.step(action)
# TD update
td_error = reward + gamma * V[next_state] - V[state]
V[state] += alpha * td_error
state = next_state
return V
TD vs Monte Carlo: The Tradeoff
| Aspect | Monte Carlo | TD(0) |
|---|---|---|
| Bias | Unbiased | Biased (bootstraps) |
| Variance | High | Lower |
| Learns from | Complete episodes | Single transitions |
| Works for | Episodic tasks | All tasks |
TD(0) is biased because it uses $V(s')$ which is itself an estimate. But this bias decreases as learning progresses.
SARSA: On-Policy TD Control
SARSA learns Q-values using the transition $(s, a, r, s', a')$.
$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma Q(s', a') - Q(s, a)\right] $$
def sarsa(env, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
"""
SARSA: On-policy TD control.
The name comes from: S(tate), A(ction), R(eward), S'(next state), A'(next action)
"""
Q = np.zeros((env.num_states, env.num_actions))
for episode in range(num_episodes):
state = env.reset()
action = epsilon_greedy(Q, state, epsilon)
done = False
while not done:
next_state, reward, done, _ = env.step(action)
next_action = epsilon_greedy(Q, next_state, epsilon)
# SARSA update
td_target = reward + gamma * Q[next_state, next_action]
td_error = td_target - Q[state, action]
Q[state, action] += alpha * td_error
state = next_state
action = next_action
return np.argmax(Q, axis=1), Q
def epsilon_greedy(Q, state, epsilon):
"""Select action using epsilon-greedy policy."""
if np.random.random() < epsilon:
return np.random.randint(Q.shape[1])
return np.argmax(Q[state])
Q-Learning: Off-Policy TD Control
Q-learning uses the best next action, regardless of what we actually take:
$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right] $$
def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
"""
Q-Learning: Off-policy TD control.
Key difference from SARSA: uses max_a' Q(s', a') instead of Q(s', a')
"""
Q = np.zeros((env.num_states, env.num_actions))
for episode in range(num_episodes):
state = env.reset()
done = False
while not done:
# Choose action (behavior policy)
action = epsilon_greedy(Q, state, epsilon)
next_state, reward, done, _ = env.step(action)
# Q-learning update (target policy is greedy)
td_target = reward + gamma * np.max(Q[next_state])
td_error = td_target - Q[state, action]
Q[state, action] += alpha * td_error
state = next_state
return np.argmax(Q, axis=1), Q
SARSA vs Q-Learning: On-Policy vs Off-Policy
On-policy (SARSA):
- Learns about the policy being followed (including exploration)
- Safer behavior during learning
- Converges to optimal epsilon-greedy policy
Off-policy (Q-Learning):
- Learns optimal policy regardless of behavior
- Can learn from any data (replay buffers)
- Converges to optimal greedy policy
# The key difference in one line:
# SARSA:      td_target = r + gamma * Q[s', a']          # a' is the action we will actually take
# Q-Learning: td_target = r + gamma * max_a' Q[s', a']   # the best next action, whether or not we take it
The Cliff Walking Example
Classic example showing SARSA vs Q-Learning behavior:
Start: S    Goal: G    Cliff: C (stepping in gives reward = -100 and a reset to the start)
[ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
[ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
[ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
[S][C][C][C][C][C][C][C][C][C][C][G]
Regular step: reward = -1
- Q-Learning finds the optimal path (along the cliff edge)
- SARSA finds a safer path (away from the cliff edge) because it accounts for exploration mistakes; a sketch for reproducing this comparison follows below
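To reproduce this with the functions above, note that they assume the old-style interface where env.reset() returns a state and env.step() returns a 4-tuple, while Gymnasium's reset() returns (state, info) and step() returns (state, reward, terminated, truncated, info). A small adapter (a sketch; OldGymAdapter is a name invented here, not a library class) bridges the two:

import gymnasium as gym

class OldGymAdapter:
    """Sketch: wrap a Gymnasium env to match the (state, reward, done, info) interface used above."""
    def __init__(self, env):
        self.env = env
        self.num_states = env.observation_space.n
        self.num_actions = env.action_space.n
    def reset(self):
        state, _info = self.env.reset()
        return state
    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        return state, reward, terminated or truncated, info

env = OldGymAdapter(gym.make('CliffWalking-v0'))
policy_sarsa, _ = sarsa(env, num_episodes=500)
policy_ql, _ = q_learning(env, num_episodes=500)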
Eligibility Traces: TD(lambda)
TD(lambda) blends TD and Monte Carlo:
- $\lambda = 0$: Pure TD (one-step)
- $\lambda = 1$: Monte Carlo (full return)
- $0 < \lambda < 1$: Weighted combination
def td_lambda(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99, lambd=0.8):
"""
TD(lambda) with eligibility traces.
"""
V = np.zeros(env.num_states)
for episode in range(num_episodes):
state = env.reset()
done = False
# Eligibility traces
E = np.zeros(env.num_states)
while not done:
action = policy[state]
next_state, reward, done, _ = env.step(action)
# TD error
td_error = reward + gamma * V[next_state] - V[state]
# Update eligibility
E[state] += 1 # accumulating traces
# Update all states proportional to their eligibility
V += alpha * td_error * E
# Decay eligibility
E *= gamma * lambd
state = next_state
return V
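To see how traces spread credit backward in time, here is a tiny standalone sketch (made-up visit order, using the same gamma and lambda as the defaults above) of what the eligibility vector looks like after each decay:

E = np.zeros(4)
for s in [0, 1, 2, 3]:    # visit states 0, 1, 2, 3 in order
    E[s] += 1.0           # accumulating trace for the visited state
    E *= 0.99 * 0.8       # decay every trace by gamma * lambda
print(np.round(E, 3))     # earlier states carry exponentially smaller traces

A common variant, "replacing traces", sets E[state] = 1.0 instead of accumulating, which keeps the effective step size bounded for states revisited within an episode.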
Capstone Connection: RL in AI Alignment
Understanding tabular RL is crucial for alignment work:
Reward Modeling
- RLHF uses reward models trained on human preferences
- The reward model is the specification of human values
- Understanding how policies optimize rewards reveals alignment risks
Safe Exploration
- SARSA's conservative behavior vs Q-learning's optimism
- How do we ensure AI systems explore safely?
- On-policy vs off-policy has safety implications
Value Alignment
- Value functions in RL $\approx$ values in ethics
- Inverse RL: inferring rewards from behavior
- The alignment problem: whose values? Which rewards?
Scalable Oversight
- How do we evaluate AI behavior we can't fully understand?
- RL provides tools for reasoning about long-term consequences
- Value iteration as a model for iterated amplification
🎓 Tyla's Exercise
Theoretical Challenges
Convergence of Value Iteration: Prove that value iteration converges to $V^*$ by showing that the Bellman optimality operator is a contraction with factor $\gamma$.
Policy Iteration Correctness: Prove that policy improvement never decreases the value function, i.e., $V_{\pi'} \geq V_\pi$ for all states.
TD(0) Convergence: Under what conditions does TD(0) converge to the true value function? What role does the learning rate schedule play?
On-Policy vs Off-Policy: Formally characterize the difference between SARSA and Q-Learning in terms of the Bellman equations they're approximating.
Bias-Variance in TD(lambda): Derive how the bias and variance of TD(lambda) estimates change as a function of lambda.
💻 Aaliyah's Exercise
Build a Complete RL Library
class TabularRL:
"""Complete tabular RL implementation."""
def __init__(self, env):
self.env = env
self.num_states = env.observation_space.n
self.num_actions = env.action_space.n
# === Dynamic Programming ===
def policy_evaluation(self, policy, gamma=0.99, method='iterative'):
"""
Evaluate a policy.
Args:
policy: Array of shape (num_states,)
gamma: Discount factor
method: 'iterative' or 'exact'
"""
# YOUR CODE HERE
pass
def policy_improvement(self, V, gamma=0.99):
"""
Compute greedy policy with respect to V.
"""
# YOUR CODE HERE
pass
def policy_iteration(self, gamma=0.99):
"""
Find optimal policy via policy iteration.
"""
# YOUR CODE HERE
pass
def value_iteration(self, gamma=0.99, theta=1e-8):
"""
Find optimal policy via value iteration.
"""
# YOUR CODE HERE
pass
# === Model-Free Learning ===
def monte_carlo(self, num_episodes=1000, gamma=0.99, epsilon=0.1):
"""
Monte Carlo control with epsilon-greedy.
"""
# YOUR CODE HERE
pass
def sarsa(self, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
"""
SARSA: On-policy TD control.
"""
# YOUR CODE HERE
pass
def q_learning(self, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
"""
Q-Learning: Off-policy TD control.
"""
# YOUR CODE HERE
pass
# === Utilities ===
def evaluate_policy(self, policy, num_episodes=100, gamma=0.99):
"""
Empirically evaluate a policy by running episodes.
"""
# YOUR CODE HERE
pass
def visualize_policy(self, policy):
"""
Visualize policy (environment-specific).
"""
# YOUR CODE HERE
pass
# Test on CliffWalking-v0
import gymnasium as gym
env = gym.make('CliffWalking-v0')
rl = TabularRL(env)
# Compare algorithms
policy_sarsa, Q_sarsa = rl.sarsa(num_episodes=500)
policy_ql, Q_ql = rl.q_learning(num_episodes=500)
print("SARSA policy performance:", rl.evaluate_policy(policy_sarsa))
print("Q-Learning policy performance:", rl.evaluate_policy(policy_ql))
📚 Maneesha's Reflection
Pedagogical Analysis
Dynamic Programming as Curriculum Design: Policy iteration alternates between evaluation (assessment) and improvement (adaptation). How does this mirror effective instructional design? What would "value iteration" look like in curriculum development?
The Exploration-Exploitation Tradeoff in Education: Students must balance practicing what they know (exploitation) vs. trying new approaches (exploration). How do epsilon-greedy strategies inform teaching practice? What's the optimal "epsilon" for a learner?
Bootstrapping and Scaffolding: TD learning "bootstraps" from incomplete information. How does this relate to scaffolded instruction, where learners build on partial understanding?
On-Policy vs Off-Policy Learning: SARSA learns from its own behavior; Q-Learning can learn from others' experiences. What are the educational analogs? When should students learn from their own mistakes vs. observing others?
The Credit Assignment Problem: When learning fails, which "step" was wrong? This is challenging in both RL and education. How do effective teachers help students identify where their reasoning went wrong?
Tabular Limits as Expertise Boundaries: Tabular methods fail when state spaces are too large. Similarly, expert knowledge must generalize beyond memorized cases. How do experts "generalize" their tabular knowledge? What can RL teach us about expertise development?
Summary: The Tabular RL Toolkit
Planning (Known MDP):

| Algorithm | Use Case |
|-----------|----------|
| Policy Evaluation | Compute V for a given policy |
| Policy Iteration | Find the optimal policy (few iterations) |
| Value Iteration | Find the optimal policy (simple) |
Learning (Unknown MDP):

| Algorithm | Type | Key Feature |
|-----------|------|-------------|
| Monte Carlo | Episode-based | Unbiased but high variance |
| TD(0) | One-step | Bootstraps, lower variance |
| SARSA | On-policy TD | Safe, learns the optimal epsilon-greedy policy |
| Q-Learning | Off-policy TD | Learns the optimal greedy policy; divergence becomes a concern only with function approximation |
What's Next?
Tabular methods don't scale. Real problems have:
- Continuous states (positions, velocities)
- High-dimensional observations (images)
- Combinatorially large state spaces
Solution: Function Approximation
- Replace tables with neural networks
- Generalize across similar states
- This leads to Deep RL: DQN, Policy Gradients, PPO
Key insight: Everything we learned today remains foundational. Deep RL is just tabular RL with neural network approximators.