Introduction to Reinforcement Learning

The foundation of all RL: agents, environments, and the mathematics of sequential decision making.


What is Reinforcement Learning?

Reinforcement learning is fundamentally different from supervised learning. Instead of learning from labeled examples, an agent learns by interacting with an environment and receiving rewards.

The core loop:

Agent observes state s
    |
    v
Agent chooses action a using policy pi
    |
    v
Environment returns new state s' and reward r
    |
    v
Agent updates its understanding
    |
    (repeat)

Key components:

  • Agent: the learner and decision maker
  • Environment: everything the agent interacts with
  • State: the agent's observation of the environment
  • Action: a choice the agent can make
  • Reward: a scalar signal indicating how well the agent is doing
  • Policy: the agent's rule for choosing actions

The Agent-Environment Interface

# The basic RL interaction loop
def rl_loop(agent, env, num_steps):
    state = env.reset()

    for t in range(num_steps):
        # Agent chooses action based on current state
        action = agent.select_action(state)

        # Environment responds
        next_state, reward, done, info = env.step(action)

        # Agent learns from this experience
        agent.learn(state, action, reward, next_state)

        state = next_state
        if done:
            break

The trajectory (or rollout) is the full history:

$$ \tau = s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, \ldots $$

Note the indexing: action $a_t$ is taken at state $s_t$, which produces reward $r_{t+1}$ and next state $s_{t+1}$.
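
For concreteness, here is a small sketch (reusing the hypothetical agent/env interface from rl_loop above) that records a trajectory as a list of (state, action, reward) tuples:

def collect_trajectory(agent, env, max_steps):
    """Roll out one episode and record (s_t, a_t, r_{t+1}) tuples."""
    trajectory = []
    state = env.reset()

    for t in range(max_steps):
        action = agent.select_action(state)
        next_state, reward, done, info = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break

    return trajectory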


Markov Decision Processes (MDPs)

An MDP formalizes RL problems with five components:

  1. State space $\mathcal{S}$: All possible states
  2. Action space $\mathcal{A}$: All possible actions
  3. Transition function $T(s' | s, a)$: Probability of reaching $s'$ from $s$ via $a$
  4. Reward function $R(s, a, s')$: Reward for transition
  5. Discount factor $\gamma \in [0, 1]$: How much to value future rewards ($\gamma = 1$ is only safe for finite-horizon episodes)

The Markov property is crucial: the future depends only on the current state, not the history.

$$ P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} | s_t, a_t) $$


Understanding MDPs Through Code

import numpy as np

class SimpleMDP:
    """A minimal MDP implementation."""

    def __init__(self, num_states, num_actions):
        self.num_states = num_states
        self.num_actions = num_actions

        # Transition probabilities: T[s, a, s'] = P(s' | s, a)
        self.T = np.zeros((num_states, num_actions, num_states))

        # Rewards: R[s, a, s'] = reward for this transition
        self.R = np.zeros((num_states, num_actions, num_states))

    def step(self, state, action):
        """Sample next state and reward."""
        # Sample next state from transition distribution
        probs = self.T[state, action, :]
        next_state = np.random.choice(self.num_states, p=probs)

        # Get reward
        reward = self.R[state, action, next_state]

        return next_state, reward

Policies: Deterministic vs Stochastic

A policy $\pi$ tells the agent what to do in each state.

Deterministic policy: $a = \pi(s)$

def deterministic_policy(state):
    """Always returns the same action for a given state."""
    action_map = {0: 1, 1: 0, 2: 1, 3: 2}  # state -> action
    return action_map[state]

Stochastic policy: $a \sim \pi(\cdot | s)$

def stochastic_policy(state):
    """Returns action probabilities for a given state."""
    # Example: mostly take action 0, sometimes explore
    if state == 0:
        return np.array([0.8, 0.1, 0.1])  # prob of each action
    else:
        return np.array([0.2, 0.6, 0.2])

def sample_action(state):
    probs = stochastic_policy(state)
    return np.random.choice(len(probs), p=probs)

Why Stochastic Policies?

Consider rock-paper-scissors: any deterministic policy can be exploited once your opponent figures it out, so the best you can do is randomize. The optimal policy is inherently stochastic.

In RL, stochastic policies also build in exploration (the agent keeps sampling actions it would otherwise never try), and in adversarial or partially observable settings the best available policy may genuinely need to randomize.

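A minimal sketch of this idea (with a hypothetical action encoding 0 = rock, 1 = paper, 2 = scissors): the unexploitable policy is simply uniform random.

import numpy as np

def rps_policy(state=None):
    """Uniform random play: no deterministic opponent model can exploit this."""
    return np.array([1/3, 1/3, 1/3])

def rps_action(state=None):
    probs = rps_policy(state)
    return np.random.choice(len(probs), p=probs)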

Value Functions

The state-value function $V_\pi(s)$ answers: "How good is it to be in state $s$ if I follow policy $\pi$?"

$$ V_\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s\right] $$

The action-value function $Q_\pi(s, a)$ answers: "How good is it to take action $a$ in state $s$, then follow $\pi$?"

$$ Q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\right] $$

The relationship:

$$ V_\pi(s) = \sum_a \pi(a|s) \cdot Q_\pi(s, a) $$
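
As a quick sketch of this identity for the tabular case (assuming arrays with policy[s, a] = $\pi(a|s)$ and Q of shape (num_states, num_actions)):

import numpy as np

def V_from_Q(Q, policy):
    """V(s) = sum_a pi(a|s) * Q(s, a), computed for all states at once."""
    # Element-wise product, then sum over the action axis
    return np.sum(policy * Q, axis=1)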


Value Functions in Code

def compute_return(rewards, gamma):
    """
    Compute discounted return from a list of rewards.

    G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
    """
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Example: rewards = [1, 1, 1, 1, 1], gamma = 0.9
# Return = 1 + 0.9 + 0.81 + 0.729 + 0.6561 = 4.0951

For tabular environments, we can store V and Q as arrays:

# V[s] = value of state s
V = np.zeros(num_states)

# Q[s, a] = value of taking action a in state s
Q = np.zeros((num_states, num_actions))

The Bellman Equation

The key recursive insight: value today = reward now + discounted value tomorrow.

For V:

$$ V_\pi(s) = \sum_a \pi(a|s) \sum_{s'} T(s'|s,a) \left[R(s,a,s') + \gamma V_\pi(s')\right] $$

For Q:

$$ Q_\pi(s,a) = \sum_{s'} T(s'|s,a) \left[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q_\pi(s',a')\right] $$


The Bellman Equation Intuition

Think of it as "expected immediate reward + expected future value":

def bellman_expectation_V(state, policy, T, R, V, gamma):
    """
    Compute V(s) using the Bellman expectation equation.

    V(s) = sum over actions of:
           policy(a|s) * sum over next states of:
               T(s'|s,a) * [R(s,a,s') + gamma * V(s')]
    """
    # Infer table sizes from the policy array: policy has shape (num_states, num_actions)
    num_states, num_actions = policy.shape
    value = 0.0
    for action in range(num_actions):
        action_prob = policy[state, action]

        for next_state in range(num_states):
            transition_prob = T[state, action, next_state]
            reward = R[state, action, next_state]
            future_value = gamma * V[next_state]

            value += action_prob * transition_prob * (reward + future_value)

    return value
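
The Q-function version follows the same pattern; here is a sketch under the same assumptions (tabular T, R, policy, and Q arrays):

def bellman_expectation_Q(state, action, policy, T, R, Q, gamma):
    """
    Compute Q(s, a) using the Bellman expectation equation.

    Q(s,a) = sum over next states of:
             T(s'|s,a) * [R(s,a,s') + gamma * sum_a' policy(a'|s') * Q(s', a')]
    """
    num_states, num_actions = policy.shape
    value = 0.0
    for next_state in range(num_states):
        transition_prob = T[state, action, next_state]
        reward = R[state, action, next_state]
        # Expected value of the next state under the policy
        next_value = sum(policy[next_state, a] * Q[next_state, a]
                         for a in range(num_actions))
        value += transition_prob * (reward + gamma * next_value)
    return value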

Optimal Value Functions

The optimal value function $V^*(s)$ is the best possible value:

$$ V^*(s) = \max_\pi V_\pi(s) $$

The optimal Q-function $Q^*(s, a)$:

$$ Q^*(s, a) = \max_\pi Q_\pi(s, a) $$

The Bellman optimality equation:

$$ V^*(s) = \max_a \sum_{s'} T(s'|s,a) \left[R(s,a,s') + \gamma V^*(s')\right] $$

$$ Q^*(s,a) = \sum_{s'} T(s'|s,a) \left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right] $$
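
As a sketch (assuming tabular T and R of shape (num_states, num_actions, num_states)), a single Bellman optimality backup over all states looks like this; repeating it until V stops changing is the basis of value iteration:

import numpy as np

def bellman_optimality_backup(V, T, R, gamma):
    """
    One synchronous backup: V_new(s) = max_a sum_s' T(s'|s,a) * [R(s,a,s') + gamma * V(s')].

    Returns the updated V and the greedy policy (one action index per state).
    """
    # Q[s, a] = sum_s' T[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
    Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)
    V_new = Q.max(axis=1)
    greedy_policy = Q.argmax(axis=1)
    return V_new, greedy_policy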


Discount Factors: Why and How

The discount factor $\gamma$ controls how much we value future rewards.

| $\gamma$ | Behavior | Use case |
| --- | --- | --- |
| 0.0 | Greedy, only immediate reward | Very short tasks |
| 0.9 | Balanced | Most tasks |
| 0.99 | Long-horizon planning | Complex games |
| 1.0 | No discounting | Finite episodes only |

# Effect of gamma on a stream of constant rewards
def discounted_sum(reward_per_step, gamma, horizon=100):
    return sum(reward_per_step * (gamma ** t) for t in range(horizon))

# In the limit of a long horizon, the sum approaches 1 / (1 - gamma):
# gamma=0.9:  reward=1 -> sum ~= 10
# gamma=0.99: reward=1 -> sum -> 100 (horizon=100 only reaches ~63)
# gamma=1.0:  reward=1 -> sum = horizon (diverges as the horizon grows)

Why Discount?

Three reasons:

  1. Mathematical: Without discounting, infinite sums can diverge
  2. Practical: Future rewards are uncertain; discounting encodes this
  3. Economic: Money now is worth more than money later (opportunity cost)

Geometric discounting is special:

$$ \frac{\gamma^{t+1}}{\gamma^t} = \gamma \quad \text{(constant ratio)} $$

This gives the Bellman equation its elegant recursive form.
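
Concretely, factoring $\gamma$ out of everything after the first reward yields the recursion the Bellman equation is built on:

$$ G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma \left(r_{t+2} + \gamma r_{t+3} + \cdots\right) = r_{t+1} + \gamma G_{t+1} $$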


A Complete Example: Gridworld

class GridWorld:
    """
    Simple 4x4 grid. Agent starts at (0,0), goal at (3,3).
    Reward: -1 per step, +10 for reaching goal.
    """
    def __init__(self):
        self.size = 4
        self.state = (0, 0)
        self.goal = (3, 3)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        # Actions: 0=up, 1=right, 2=down, 3=left
        moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
        dy, dx = moves[action]

        new_y = max(0, min(self.size-1, self.state[0] + dy))
        new_x = max(0, min(self.size-1, self.state[1] + dx))
        self.state = (new_y, new_x)

        if self.state == self.goal:
            return self.state, 10, True, {}  # state, reward, done, info
        return self.state, -1, False, {}
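
A minimal usage sketch, rolling out a random policy in this GridWorld:

import numpy as np

env = GridWorld()
state = env.reset()
total_reward = 0

for t in range(50):
    action = np.random.randint(4)                 # random move: up/right/down/left
    state, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        print(f"Reached the goal at step {t}, total reward = {total_reward}")
        break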

Capstone Connection: Why RL Matters for AI Safety

Reinforcement learning is the foundation of modern AI alignment work:

RLHF (RL from Human Feedback): today's language models are fine-tuned with RL against a reward model trained on human preference judgments.

Reward Hacking: agents can exploit flaws in a reward specification, scoring highly on the proxy while missing the intended goal.

Value Learning: rather than hand-coding rewards, value learning aims to infer what humans actually want from their behavior and feedback.

Key insight: The reward function is the specification of what we want. Getting it right is the core alignment challenge.


🎓 Tyla's Exercise

Theoretical Understanding

  1. Prove the Bellman equation: Starting from the definition of $V_\pi(s)$, derive the recursive form. Show each step and explain why the Markov property is essential.

  2. Optimal policy existence: Given an MDP with $\gamma < 1$, prove that there always exists a deterministic optimal policy. (Hint: Consider the Bellman optimality operator as a contraction mapping.)

  3. Discount factor analysis: For the 3-state MDP below, find the threshold value of $\gamma$ where the optimal policy changes:

    • State 0: Action L goes to state 1 with reward +1; Action R goes to state 2 with reward 0
    • State 1: Always returns to state 0 with reward 0
    • State 2: Always returns to state 0 with reward +2

  4. Value function uniqueness: Show that for any policy $\pi$ and discount factor $\gamma < 1$, the value function $V_\pi$ is unique.


💻 Aaliyah's Exercise

Implementation Challenge

Build a complete MDP toolkit:

class MDP:
    def __init__(self, num_states, num_actions, T, R, gamma=0.99):
        """
        Initialize an MDP.

        Args:
            num_states: Number of states
            num_actions: Number of actions
            T: Transition matrix, shape (num_states, num_actions, num_states)
               T[s, a, s'] = P(s' | s, a)
            R: Reward matrix, shape (num_states, num_actions, num_states)
               R[s, a, s'] = reward for transition
            gamma: Discount factor
        """
        # YOUR CODE HERE
        pass

    def compute_V_from_Q(self, Q, policy):
        """
        Given Q-values and a policy, compute V.

        V(s) = sum_a policy(a|s) * Q(s, a)
        """
        # YOUR CODE HERE
        pass

    def compute_Q_from_V(self, V):
        """
        Given V, compute Q using the Bellman equation.

        Q(s, a) = sum_s' T(s'|s,a) * [R(s,a,s') + gamma * V(s')]
        """
        # YOUR CODE HERE
        pass

    def bellman_backup(self, V):
        """
        Perform one Bellman backup: V_new(s) = max_a Q(s, a)

        Returns: Updated V and the greedy policy
        """
        # YOUR CODE HERE
        pass


def test_mdp():
    """Test your implementation on the 3-state example."""
    # Set up the 3-state MDP from the theory section
    T = np.zeros((3, 2, 3))  # (states, actions, next_states)
    R = np.zeros((3, 2, 3))

    # State 0: L->1, R->2
    T[0, 0, 1] = 1.0  # L goes to state 1
    T[0, 1, 2] = 1.0  # R goes to state 2
    R[0, 0, 1] = 1.0  # Reward for L

    # State 1: returns to 0
    T[1, :, 0] = 1.0

    # State 2: returns to 0 with reward 2
    T[2, :, 0] = 1.0
    R[2, :, 0] = 2.0

    mdp = MDP(3, 2, T, R, gamma=0.99)

    # Test your methods here
    # YOUR CODE HERE
    pass

📚 Maneesha's Reflection

Pedagogical Analysis

  1. The Credit Assignment Problem: When an agent receives a reward, how does it know which of its past actions was responsible? This is one of the deepest challenges in RL. How would you explain this to someone who has never heard of RL?

  2. The Exploration-Exploitation Tradeoff: An agent must balance trying new things (exploration) vs. doing what it knows works (exploitation). What analogies from human learning capture this tension? How do students balance this in their own learning?

  3. Why Markov?: The Markov assumption (future depends only on present, not history) is clearly false for many real-world situations. What information would you need to add to a "state" to make a non-Markovian problem Markovian? What are the costs of doing so?

  4. Reward Design as Pedagogy: Designing reward functions is like designing assessment rubrics. What makes a good reward signal? What are the risks of poorly designed rewards in RL systems vs. poorly designed rubrics in education?

  5. RL and Human Learning: How do concepts from RL (reward, policy, value function) map to theories of human learning and motivation? Where does the analogy break down?


Key Takeaways

  1. RL = Learning from interaction through a reward signal
  2. MDPs provide the mathematical foundation with $(S, A, T, R, \gamma)$
  3. Policies can be deterministic or stochastic
  4. Value functions quantify "how good" states and actions are
  5. Bellman equations give the recursive structure we need to compute values
  6. Discount factors handle infinite horizons and express time preference

Next up: How do we actually compute these values and find optimal policies?