Skip to content

Monte Carlo Methods — Learning from Complete Episodes

Dynamic programming needs a model. Monte Carlo needs nothing but episodes. Run the policy, watch the returns, average them. The simplest idea in RL — and the one that unlocks everything downstream.

Type: Build Languages: Python Prerequisites: Phase 9 · 01 (MDPs), Phase 9 · 02 (Dynamic Programming) Time: ~75 minutes

The Problem

Dynamic programming is elegant, but it assumes you can query P(s' | s, a) for every state and action. Almost nothing in the real world works that way. A robot cannot analytically compute the distribution over camera pixels after a joint torque. A pricing algorithm cannot integrate over every possible customer reaction. An LLM cannot enumerate all possible continuations after a token.

You need a method that only needs the ability to sample from the environment. Run the policy. Get a trajectory s_0, a_0, r_1, s_1, a_1, r_2, …, s_T. Use it to estimate values. That is Monte Carlo.

The shift from DP to MC is philosophically important: we move from known model + exact backup to sampled rollouts + averaged return. The variance jumps, but the applicability explodes. Every RL algorithm after this lesson — TD, Q-learning, REINFORCE, PPO, GRPO — is a Monte Carlo estimator at heart, sometimes with bootstrapping layered on top.

The Concept

Monte Carlo: rollout, compute returns, average; first-visit vs every-visit

The core idea, in one line: V^π(s) = E_π[G_t | s_t = s] ≈ (1/N) Σ_i G^{(i)}(s) where G^{(i)}(s) are observed returns following visits to s under policy π.

First-visit vs every-visit MC. Given an episode that visits state s multiple times, first-visit MC only counts the return from the first visit; every-visit MC counts all visits. Both are unbiased in the limit. First-visit is simpler to analyze (iid samples). Every-visit uses more data per episode and typically converges faster in practice.

Incremental mean. Instead of storing all returns, update the running average:

V_n(s) = V_{n-1}(s) + (1/n) [G_n - V_{n-1}(s)]

Reorganize: V_new = V_old + α · (target - V_old) with α = 1/n. Swap 1/n for a constant step-size α ∈ (0, 1) and you get a non-stationary MC estimator that tracks changes in π. That move is the entire jump from MC to TD to every modern RL algorithm.

Exploration is now a problem. DP touched every state by enumeration. MC only sees states the policy visits. If π is deterministic, whole regions of the state space never get sampled, and their value estimates stay at zero forever. Three fixes, in historical order:

  1. Exploring starts. Start each episode from a random (s, a) pair. Guarantees coverage; unrealistic in practice (you cannot "reset" a robot into an arbitrary state).
  2. ε-greedy. Act greedy w.r.t. current Q, but with probability ε pick a random action. All state-action pairs get sampled asymptotically.
  3. Off-policy MC. Collect data under a behavior policy μ, learn about target policy π via importance sampling. High variance, but it's the bridge to replay-buffer methods like DQN.

Monte Carlo Control. Evaluate → improve → evaluate, just like policy iteration, but evaluation is sampling-based:

  1. Run π, get an episode.
  2. Update Q(s, a) from observed returns.
  3. Make π ε-greedy w.r.t. Q.
  4. Repeat.

Converges to Q* and π* with probability 1 under mild conditions (every pair visited infinitely often, α satisfies Robbins-Monro).

Build It

Step 1: rollout → list of (s, a, r)

python
def rollout(env, policy, max_steps=200):
    trajectory = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)
        s_next, r, done = env.step(s, a)
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory

No model, only env.reset() and env.step(s, a). Same interface as a gym environment but stripped down.

Step 2: compute returns (reverse sweep)

python
def returns_from(trajectory, gamma):
    returns = []
    G = 0.0
    for _, _, r in reversed(trajectory):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

One pass, O(T). The backward recurrence G_t = r_{t+1} + γ G_{t+1} avoids re-summing.

Step 3: first-visit MC evaluation

python
def mc_policy_evaluation(env, policy, episodes, gamma=0.99):
    V = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(episodes):
        trajectory = rollout(env, policy)
        returns = returns_from(trajectory, gamma)
        seen = set()
        for t, ((s, _, _), G) in enumerate(zip(trajectory, returns)):
            if s in seen:
                continue
            seen.add(s)
            counts[s] += 1
            V[s] += (G - V[s]) / counts[s]
    return V

Three lines do the work: mark state as seen on first visit, increment count, update running mean.

Step 4: ε-greedy MC control (on-policy)

python
def mc_control(env, episodes, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
    counts = defaultdict(lambda: {a: 0 for a in ACTIONS})

    def policy(s):
        if random() < epsilon:
            return choice(ACTIONS)
        return max(Q[s], key=Q[s].get)

    for _ in range(episodes):
        trajectory = rollout(env, policy)
        returns = returns_from(trajectory, gamma)
        seen = set()
        for (s, a, _), G in zip(trajectory, returns):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            counts[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / counts[s][a]
    return Q, policy

Step 5: compare to DP gold standard

Your MC estimate of V^π should agree with the DP result from Lesson 02 as episodes → ∞. In practice: 50,000 episodes on 4×4 GridWorld gets you within ~0.1 of the DP answer.

Pitfalls

  • Infinite episodes. MC requires episodes to terminate. If your policy can loop forever, cap max_steps and treat the cap as implicit failure. GridWorld with a random policy routinely times out — that is normal, just make sure you count it correctly.
  • Variance. MC uses full returns. On long episodes, variance is huge — one unlucky reward at the end shifts V(s_0) by the same amount. TD methods (Lesson 04) cut this by bootstrapping.
  • State coverage. Greedy MC on a fresh Q with ties will only ever try one action. You must explore (ε-greedy, exploring starts, UCB).
  • Non-stationary policies. If π changes (as in MC control), old returns are from a different policy. Constant-α MC handles this; sample-average MC does not.
  • Off-policy importance sampling. The weights π(a|s)/μ(a|s) multiply across a trajectory. Variance explodes with horizon. Cap with per-decision weighted IS or switch to TD.

Use It

The 2026 role of Monte Carlo methods:

Use caseWhy MC
Short-horizon games (blackjack, poker)Episodes terminate naturally; returns are clean.
Offline evaluation of a logged policyAverage discounted returns over stored trajectories.
Monte Carlo Tree Search (AlphaZero)MC rollouts from tree leaves guide selection.
LLM RL evaluationCompute average reward over sampled completions for a given policy.
Baseline estimation in PPOThe advantage target A_t = G_t - V(s_t) uses an MC G_t.
Teaching RLSimplest algorithm that actually works — strip bootstrapping to see the core.

Modern deep-RL algorithms (PPO, SAC) interpolate between pure MC (full returns) and pure TD (one-step bootstrap) via n-step returns or GAE. Both endpoints are instances of the same estimator.

Ship It

Save as outputs/skill-mc-evaluator.md:

markdown
---
name: mc-evaluator
description: Evaluate a policy via Monte Carlo rollouts and produce a convergence report with DP-comparison if available.
version: 1.0.0
phase: 9
lesson: 3
tags: [rl, monte-carlo, evaluation]
---

Given an environment (episodic, with reset+step API) and a policy, output:

1. Method. First-visit vs every-visit MC. Reason.
2. Episode budget. Target number, variance diagnostic, expected standard error.
3. Exploration plan. ε schedule (if needed) or exploring starts.
4. Gold-standard comparison. DP-optimal V* if tabular; otherwise a bound from a Q-learning / PPO baseline.
5. Termination check. Max-step cap, timeouts, handling of non-terminating trajectories.

Refuse to run MC on non-episodic tasks without a finite horizon cap. Refuse to report V^π estimates from fewer than 100 episodes per state for tabular tasks. Flag any policy with zero-variance actions as an exploration risk.

Exercises

  1. Easy. Implement first-visit MC evaluation of the uniform-random policy on 4×4 GridWorld. Run 10,000 episodes. Plot V(0,0) as a function of episode count against the DP answer.
  2. Medium. Implement ε-greedy MC control with ε ∈ {0.01, 0.1, 0.3}. Compare mean return after 20,000 episodes. What does the curve look like? Where does the bias-variance tradeoff live?
  3. Hard. Implement off-policy MC with importance sampling: collect data under uniform-random policy μ, estimate V^π for the deterministic optimal policy π. Compare plain IS vs per-decision IS vs weighted IS. Which has lowest variance?

Key Terms

TermWhat people sayWhat it actually means
Monte Carlo"Random sampling"Estimate expectations by averaging over iid samples from the distribution.
Return G_t"Future reward"Sum of discounted rewards from step t to episode end: Σ_{k≥0} γ^k r_{t+k+1}.
First-visit MC"Count each state once"Only the first visit in an episode contributes to the value estimate.
Every-visit MC"Use all visits"Every visit contributes; slightly biased but more sample-efficient.
ε-greedy"Exploration noise"Pick greedy action with prob 1-ε; random action with prob ε.
Importance sampling"Correcting for sampling from the wrong distribution"Reweight returns by π(a|s)/μ(a|s) products to estimate V^π from μ data.
On-policy"Learn from my own data"Target policy = behavior policy. Vanilla MC, PPO, SARSA.
Off-policy"Learn from someone else's data"Target policy ≠ behavior policy. Importance-sampled MC, Q-learning, DQN.

Further Reading