Introduction to Reinforcement Learning
The foundation of all RL: agents, environments, and the mathematics of sequential decision making.
What is Reinforcement Learning?
Reinforcement learning is fundamentally different from supervised learning. Instead of learning from labeled examples, an agent learns by interacting with an environment and receiving rewards.
The core loop:
Agent observes state s
|
v
Agent chooses action a using policy pi
|
v
Environment returns new state s' and reward r
|
v
Agent updates its understanding
|
(repeat)
Key components:
- Agent: The decision-maker (what we're training)
- Environment: Everything outside the agent
- State: A description of the current situation
- Action: What the agent can do
- Reward: Scalar feedback signal
The Agent-Environment Interface
# The basic RL interaction loop
def rl_loop(agent, env, num_steps):
state = env.reset()
for t in range(num_steps):
# Agent chooses action based on current state
action = agent.select_action(state)
# Environment responds
next_state, reward, done, info = env.step(action)
# Agent learns from this experience
agent.learn(state, action, reward, next_state)
state = next_state
if done:
break
The trajectory (or rollout) is the full history:
$$ \tau = s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, \ldots $$
Note the indexing: action $a_t$ is taken at state $s_t$, which produces reward $r_{t+1}$ and next state $s_{t+1}$.
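A minimal sketch of recording a trajectory with this indexing convention, assuming the same Gym-style `env.step` interface as the loop above:
def collect_trajectory(agent, env, max_steps):
    """Roll out one episode, recording (s_t, a_t, r_{t+1}) transitions."""
    trajectory = []
    state = env.reset()
    for t in range(max_steps):
        action = agent.select_action(state)                 # a_t, chosen at s_t
        next_state, reward, done, info = env.step(action)   # produces r_{t+1} and s_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory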
Markov Decision Processes (MDPs)
An MDP formalizes RL problems with five components:
- State space $\mathcal{S}$: All possible states
- Action space $\mathcal{A}$: All possible actions
- Transition function $T(s' | s, a)$: Probability of reaching $s'$ from $s$ via $a$
- Reward function $R(s, a, s')$: Reward for transition
- Discount factor $\gamma \in [0, 1]$: How much to value future rewards (usually $\gamma < 1$; $\gamma = 1$ is only safe for finite-horizon episodes)
The Markov property is crucial: the future depends only on the current state and action, not on the rest of the history.
$$ P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} | s_t, a_t) $$
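A small illustrative sketch (the moving-object example is an assumption for illustration, not part of the formal definition): a raw position is not Markovian for a moving object, because the next position also depends on velocity; augmenting the state with a velocity estimate restores the Markov property.
import numpy as np

def augment_with_velocity(positions):
    """Turn raw positions into (position, velocity) states via finite differences."""
    positions = np.asarray(positions, dtype=float)
    velocities = np.diff(positions, prepend=positions[0])
    return np.stack([positions, velocities], axis=1)

# A ball moving right at constant speed: position alone hides the motion,
# the augmented (position, velocity) state does not.
print(augment_with_velocity([0.0, 1.0, 2.0, 3.0]))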
Understanding MDPs Through Code
import numpy as np

class SimpleMDP:
"""A minimal MDP implementation."""
def __init__(self, num_states, num_actions):
self.num_states = num_states
self.num_actions = num_actions
# Transition probabilities: T[s, a, s'] = P(s' | s, a)
self.T = np.zeros((num_states, num_actions, num_states))
# Rewards: R[s, a, s'] = reward for this transition
self.R = np.zeros((num_states, num_actions, num_states))
def step(self, state, action):
"""Sample next state and reward."""
# Sample next state from transition distribution
probs = self.T[state, action, :]
next_state = np.random.choice(self.num_states, p=probs)
# Get reward
reward = self.R[state, action, next_state]
return next_state, reward
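A quick usage sketch with a made-up two-state chain: action 0 stays put, action 1 flips the state, and flipping from state 0 to state 1 pays +1.
mdp = SimpleMDP(num_states=2, num_actions=2)
mdp.T[0, 0, 0] = 1.0   # action 0 stays put
mdp.T[1, 0, 1] = 1.0
mdp.T[0, 1, 1] = 1.0   # action 1 flips the state
mdp.T[1, 1, 0] = 1.0
mdp.R[0, 1, 1] = 1.0   # flipping 0 -> 1 is rewarded

state = 0
for _ in range(4):
    state, reward = mdp.step(state, action=1)
    print(state, reward)   # alternates (1, 1.0), (0, 0.0), ...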
Policies: Deterministic vs Stochastic
A policy $\pi$ tells the agent what to do in each state.
Deterministic policy: $a = \pi(s)$
- Maps each state to exactly one action
- Simple, but can be suboptimal in adversarial settings such as games (see rock-paper-scissors below)
def deterministic_policy(state):
"""Always returns the same action for a given state."""
action_map = {0: 1, 1: 0, 2: 1, 3: 2} # state -> action
return action_map[state]
Stochastic policy: $a \sim \pi(\cdot | s)$
- Returns a probability distribution over actions
- Essential for exploration and some optimal strategies
def stochastic_policy(state):
"""Returns action probabilities for a given state."""
# Example: mostly take action 0, sometimes explore
if state == 0:
return np.array([0.8, 0.1, 0.1]) # prob of each action
else:
return np.array([0.2, 0.6, 0.2])
def sample_action(state):
probs = stochastic_policy(state)
return np.random.choice(len(probs), p=probs)
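As a sanity check (an illustration, not part of the original example), sampling many actions should reproduce the stated probabilities:
import numpy as np

samples = [sample_action(state=0) for _ in range(10_000)]
frequencies = np.bincount(samples, minlength=3) / len(samples)
print(frequencies)   # roughly [0.8, 0.1, 0.1]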
Why Stochastic Policies?
Consider rock-paper-scissors:
- Any deterministic policy is exploitable by an opponent who best-responds (see the sketch after this list)
- The optimal policy is uniform random: $\pi(a|s) = 1/3$ for all actions
In RL:
- Stochastic policies enable exploration
- They're essential for policy gradient methods
- They can represent mixed strategies in games
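A small numerical sketch of the rock-paper-scissors claim, using the standard +1/0/-1 payoff matrix (the matrix and the specific policies below are illustrative assumptions, not from the text above): a deterministic policy loses outright to a best-responding opponent, while the uniform random policy breaks even in expectation.
import numpy as np

# Payoff matrix for the row player: rows/cols are (rock, paper, scissors),
# entries are +1 win, 0 draw, -1 loss.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def expected_payoff(our_probs, opponent_probs):
    """Expected payoff when both players mix according to the given distributions."""
    return our_probs @ PAYOFF @ opponent_probs

deterministic = np.array([1.0, 0.0, 0.0])   # always play rock
best_response = np.array([0.0, 1.0, 0.0])   # opponent always plays paper
uniform = np.ones(3) / 3

print(expected_payoff(deterministic, best_response))  # -1.0: fully exploited
print(expected_payoff(uniform, best_response))        #  0.0: unexploitable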
Value Functions
The state-value function $V_\pi(s)$ answers: "How good is it to be in state $s$ if I follow policy $\pi$?"
$$ V_\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s\right] $$
The action-value function $Q_\pi(s, a)$ answers: "How good is it to take action $a$ in state $s$, then follow $\pi$?"
$$ Q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\right] $$
The relationship:
$$ V_\pi(s) = \sum_a \pi(a|s) \cdot Q_\pi(s, a) $$
Value Functions in Code
def compute_return(rewards, gamma):
"""
Compute discounted return from a list of rewards.
G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
"""
G = 0
for r in reversed(rewards):
G = r + gamma * G
return G
# Example: rewards = [1, 1, 1, 1, 1], gamma = 0.9
# Return = 1 + 0.9 + 0.81 + 0.729 + 0.6561 = 4.0951
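A quick check of the hand-computed value above:
print(compute_return([1, 1, 1, 1, 1], gamma=0.9))   # ~4.0951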
For tabular environments, we can store V and Q as arrays:
# V[s] = value of state s
V = np.zeros(num_states)
# Q[s, a] = value of taking action a in state s
Q = np.zeros((num_states, num_actions))
The Bellman Equation
The key recursive insight: value today = reward now + discounted value tomorrow.
For V:
$$ V_\pi(s) = \sum_a \pi(a|s) \sum_{s'} T(s'|s,a) \left[R(s,a,s') + \gamma V_\pi(s')\right] $$
For Q:
$$ Q_\pi(s,a) = \sum_{s'} T(s'|s,a) \left[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q_\pi(s',a')\right] $$
The Bellman Equation Intuition
Think of it as "expected immediate reward + expected future value":
def bellman_expectation_V(state, policy, T, R, V, gamma):
    """
    Compute V(s) for a single state using the Bellman expectation equation:
    V(s) = sum over actions of:
             policy(a|s) * sum over next states of:
               T(s'|s,a) * [R(s,a,s') + gamma * V(s')]
    """
    num_states, num_actions, _ = T.shape  # infer sizes from the transition matrix
    value = 0.0
    for action in range(num_actions):
        action_prob = policy[state, action]
        for next_state in range(num_states):
            transition_prob = T[state, action, next_state]
            reward = R[state, action, next_state]
            future_value = gamma * V[next_state]
            value += action_prob * transition_prob * (reward + future_value)
    return value
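As a usage sketch (an addition, not from the text above): sweeping this backup over every state and repeating until the values stop changing is iterative policy evaluation, which converges to $V_\pi$ whenever $\gamma < 1$.
import numpy as np

def evaluate_policy(policy, T, R, gamma, tol=1e-8):
    """Iterative policy evaluation: repeat Bellman backups until convergence."""
    num_states = T.shape[0]
    V = np.zeros(num_states)
    while True:
        V_new = np.array([bellman_expectation_V(s, policy, T, R, V, gamma)
                          for s in range(num_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new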
Optimal Value Functions
The optimal value function $V^*(s)$ is the best possible value:
$$ V^*(s) = \max_\pi V_\pi(s) $$
The optimal Q-function $Q^*(s, a)$:
$$ Q^*(s, a) = \max_\pi Q_\pi(s, a) $$
The Bellman optimality equation:
$$ V^*(s) = \max_a \sum_{s'} T(s'|s,a) \left[R(s,a,s') + \gamma V^*(s')\right] $$
$$ Q^*(s,a) = \sum_{s'} T(s'|s,a) \left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right] $$
Discount Factors: Why and How
The discount factor $\gamma$ controls how much we value future rewards.
| $\gamma$ | Behavior | Use Case |
|---|---|---|
| 0.0 | Greedy, only immediate reward | Very short tasks |
| 0.9 | Balanced | Most tasks |
| 0.99 | Long-horizon planning | Complex games |
| 1.0 | No discounting | Finite episodes only |
# Effect of gamma on a stream of constant rewards
def discounted_sum(reward_per_step, gamma, horizon=100):
return sum(reward_per_step * (gamma ** t) for t in range(horizon))
# gamma=0.9:  reward=1 -> sum ~= 10 (essentially the infinite-horizon limit 1/(1-gamma))
# gamma=0.99: reward=1 -> sum ~= 63 at horizon=100 (limit is 100 as horizon -> infinity)
# gamma=1.0:  reward=1 -> sum = horizon (grows without bound as the horizon grows!)
Why Discount?
Three reasons:
- Mathematical: Without discounting, infinite sums can diverge
- Practical: Future rewards are uncertain; discounting encodes this
- Economic: Money now is worth more than money later (opportunity cost)
Geometric discounting is special:
$$ \frac{\gamma^{t+1}}{\gamma^t} = \gamma \quad \text{(constant ratio)} $$
This gives the Bellman equation its elegant recursive form.
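Concretely, peeling one step off the discounted return exposes the recursion that the Bellman equation exploits:
$$ G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} = r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} = r_{t+1} + \gamma G_{t+1} $$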
A Complete Example: Gridworld
class GridWorld:
"""
Simple 4x4 grid. Agent starts at (0,0), goal at (3,3).
Reward: -1 per step, +10 for reaching goal.
"""
def __init__(self):
self.size = 4
self.state = (0, 0)
self.goal = (3, 3)
def reset(self):
self.state = (0, 0)
return self.state
def step(self, action):
# Actions: 0=up, 1=right, 2=down, 3=left
moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
dy, dx = moves[action]
new_y = max(0, min(self.size-1, self.state[0] + dy))
new_x = max(0, min(self.size-1, self.state[1] + dx))
self.state = (new_y, new_x)
if self.state == self.goal:
return self.state, 10, True # state, reward, done
return self.state, -1, False
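A usage sketch: rolling out a uniformly random policy for a bounded number of steps (the step cap and the random policy are illustrative choices, not part of the class above).
import numpy as np

env = GridWorld()
state = env.reset()
total_reward, done, steps = 0, False, 0
while not done and steps < 50:
    action = np.random.randint(4)            # uniformly random direction
    state, reward, done = env.step(action)
    total_reward += reward
    steps += 1
print(state, total_reward, done)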
Capstone Connection: Why RL Matters for AI Safety
Reinforcement learning is the foundation of modern AI alignment work:
RLHF (RL from Human Feedback)
- Used to train ChatGPT, Claude, and other LLMs
- Human preferences become the reward signal
- Understanding RL is essential to understanding alignment techniques
Reward Hacking
- Agents optimize for the specified reward, not the intended goal
- Classic alignment problem: Goodhart's Law in action
- RL formalism helps us reason about this precisely
Value Learning
- How do we infer human values from behavior?
- Inverse RL tries to learn reward functions from demonstrations
- Critical for building AI systems that help rather than harm
Key insight: The reward function is the specification of what we want. Getting it right is the core alignment challenge.
🎓 Tyla's Exercise
Theoretical Understanding
Prove the Bellman equation: Starting from the definition of $V_\pi(s)$, derive the recursive form. Show each step and explain why the Markov property is essential.
Optimal policy existence: Given a finite MDP with $\gamma < 1$, prove that there always exists a deterministic optimal policy. (Hint: Consider the Bellman optimality operator as a contraction mapping.)
Discount factor analysis: For the 3-state MDP below, find the threshold value of $\gamma$ where the optimal policy changes:
- State 0: Action L goes to state 1 with reward +1; Action R goes to state 2 with reward 0
- State 1: Always returns to state 0 with reward 0
- State 2: Always returns to state 0 with reward +2
Value function uniqueness: Show that for any policy $\pi$ and discount factor $\gamma < 1$, the value function $V_\pi$ is unique.
💻 Aaliyah's Exercise
Implementation Challenge
Build a complete MDP toolkit:
class MDP:
def __init__(self, num_states, num_actions, T, R, gamma=0.99):
"""
Initialize an MDP.
Args:
num_states: Number of states
num_actions: Number of actions
T: Transition matrix, shape (num_states, num_actions, num_states)
T[s, a, s'] = P(s' | s, a)
R: Reward matrix, shape (num_states, num_actions, num_states)
R[s, a, s'] = reward for transition
gamma: Discount factor
"""
# YOUR CODE HERE
pass
def compute_V_from_Q(self, Q, policy):
"""
Given Q-values and a policy, compute V.
V(s) = sum_a policy(a|s) * Q(s, a)
"""
# YOUR CODE HERE
pass
def compute_Q_from_V(self, V):
"""
Given V, compute Q using the Bellman equation.
Q(s, a) = sum_s' T(s'|s,a) * [R(s,a,s') + gamma * V(s')]
"""
# YOUR CODE HERE
pass
def bellman_backup(self, V):
"""
Perform one Bellman backup: V_new(s) = max_a Q(s, a)
Returns: Updated V and the greedy policy
"""
# YOUR CODE HERE
pass
def test_mdp():
"""Test your implementation on the 3-state example."""
# Set up the 3-state MDP from the theory section
T = np.zeros((3, 2, 3)) # (states, actions, next_states)
R = np.zeros((3, 2, 3))
# State 0: L->1, R->2
T[0, 0, 1] = 1.0 # L goes to state 1
T[0, 1, 2] = 1.0 # R goes to state 2
R[0, 0, 1] = 1.0 # Reward for L
# State 1: returns to 0
T[1, :, 0] = 1.0
# State 2: returns to 0 with reward 2
T[2, :, 0] = 1.0
R[2, :, 0] = 2.0
mdp = MDP(3, 2, T, R, gamma=0.99)
# Test your methods here
# YOUR CODE HERE
pass
📚 Maneesha's Reflection
Pedagogical Analysis
The Credit Assignment Problem: When an agent receives a reward, how does it know which of its past actions was responsible? This is one of the deepest challenges in RL. How would you explain this to someone who has never heard of RL?
The Exploration-Exploitation Tradeoff: An agent must balance trying new things (exploration) vs. doing what it knows works (exploitation). What analogies from human learning capture this tension? How do students balance this in their own learning?
Why Markov?: The Markov assumption (future depends only on present, not history) is clearly false for many real-world situations. What information would you need to add to a "state" to make a non-Markovian problem Markovian? What are the costs of doing so?
Reward Design as Pedagogy: Designing reward functions is like designing assessment rubrics. What makes a good reward signal? What are the risks of poorly designed rewards in RL systems vs. poorly designed rubrics in education?
RL and Human Learning: How do concepts from RL (reward, policy, value function) map to theories of human learning and motivation? Where does the analogy break down?
Key Takeaways
- RL = Learning from interaction through a reward signal
- MDPs provide the mathematical foundation with $(S, A, T, R, \gamma)$
- Policies can be deterministic or stochastic
- Value functions quantify "how good" states and actions are
- Bellman equations give the recursive structure we need to compute values
- Discount factors handle infinite horizons and express time preference
Next up: How do we actually compute these values and find optimal policies?