Module 5, Week 2: Causal Reinforcement Learning

Article 11 of 13 · 18 min read

📊 Running Example: Dynamic Promotion Strategy

How do we learn optimal sequential decision policies from observational data? We'll explore the intersection of causal inference and reinforcement learning.

1. Introduction

Causal inference and reinforcement learning are deeply connected. Both reason about interventions and counterfactuals; the difference is that classical causal inference typically studies one-shot interventions, while RL optimizes sequences of them.

2. Off-Policy Evaluation

Off-policy evaluation (OPE) estimates the value of a new (target) policy using data collected by a different (logging) policy, without ever deploying the target policy.

Key Methods:

  • Importance Sampling: Reweight observed rewards by the ratio of target to logging propensities (sketched below)
  • Doubly Robust: Combine a model-based estimate with importance sampling (also sketched below)
  • Model-based: Learn a dynamics model and simulate the new policy
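
As a concrete sketch of the first two methods in the one-step (bandit) case: given logged rewards, the logging policy's propensities for the actions it took, and the target policy's probabilities for those same actions, the estimators below compute the target policy's value. The reward-model inputs q_logged and v_target are assumed to come from a separately fitted model.

# Off-policy value estimators for logged bandit data
import numpy as np

def importance_sampling_value(rewards, logged_probs, target_probs):
    # Reweight each logged reward by how much more (or less) likely
    # the target policy is to take the logged action
    weights = target_probs / logged_probs
    return np.mean(weights * rewards)

def doubly_robust_value(rewards, logged_probs, target_probs,
                        q_logged, v_target):
    # Model-based baseline (v_target) plus an importance-weighted
    # correction for the model's error on the logged actions
    weights = target_probs / logged_probs
    return np.mean(v_target + weights * (rewards - q_logged))

The doubly robust estimator remains consistent if either the propensities or the reward model is accurate, which is why it is often preferred when both are imperfect.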

3. Contextual Bandits

Contextual bandits are a simplified RL setting: one-step decision problems where we choose actions based on context to maximize rewards.

# Epsilon-greedy contextual bandit: one linear reward model per action
import numpy as np
from sklearn.linear_model import SGDRegressor

class ContextualBandit:
    def __init__(self, n_actions, epsilon=0.1):
        self.n_actions = n_actions
        self.epsilon = epsilon
        # SGDRegressor supports incremental fitting, so each update
        # refines the model instead of overwriting it with one sample
        self.models = [SGDRegressor() for _ in range(n_actions)]
        self.fitted = [False] * n_actions

    def select_action(self, context):
        # Explore with probability epsilon, or while any arm is untrained
        if np.random.rand() < self.epsilon or not all(self.fitted):
            return np.random.randint(self.n_actions)
        # Exploit: choose the action with the highest predicted reward
        q_values = [model.predict(context.reshape(1, -1))[0]
                    for model in self.models]
        return int(np.argmax(q_values))

    def update(self, context, action, reward):
        # Incremental update preserves information from earlier rounds
        self.models[action].partial_fit(context.reshape(1, -1), [reward])
        self.fitted[action] = True
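
A minimal usage sketch connecting this to the running promotion example; the three promotion arms, five customer features, and linear reward process below are synthetic assumptions for illustration:

# Hypothetical simulation: 5 customer features, 3 promotion arms
rng = np.random.default_rng(0)
bandit = ContextualBandit(n_actions=3, epsilon=0.1)
true_weights = rng.normal(size=(3, 5))  # unknown to the learner

for t in range(1000):
    context = rng.normal(size=5)
    action = bandit.select_action(context)
    reward = true_weights[action] @ context + rng.normal(scale=0.1)
    bandit.update(context, action, reward)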

4. Counterfactual Reasoning in RL

Counterfactual reasoning in RL asks: "What would have happened if the agent had taken a different action?" This enables learning from suboptimal historical policies.

  • Hindsight Experience Replay: Learn from failures by relabeling goals (sketched after this list)
  • Counterfactual Q-learning: Estimate Q-values for unobserved actions
  • Causal world models: Learn interventional dynamics
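
As a minimal sketch of the first idea, assuming goal-conditioned transitions with a binary reward of 1 when the achieved state equals the goal (the Transition fields below are illustrative, not from a specific library):

# Hindsight Experience Replay: relabel a failed episode with the
# goal it actually achieved, turning failure into useful data
from dataclasses import dataclass, replace

@dataclass
class Transition:
    state: tuple
    action: int
    goal: tuple      # goal the agent was pursuing
    achieved: tuple  # state the agent actually reached
    reward: float

def her_relabel(episode):
    # Pretend the episode's final achieved state was the goal all along
    final_achieved = episode[-1].achieved
    return [replace(t, goal=final_achieved,
                    reward=1.0 if t.achieved == final_achieved else 0.0)
            for t in episode]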

5. Key Takeaways

  • OPE enables safe policy evaluation from offline data
  • Contextual bandits bridge causal inference and RL
  • Counterfactual reasoning improves sample efficiency in RL

6. Next Week Preview

Module 6, Week 1: Real-World Applications

We'll explore practical applications of causal inference in A/B testing, tech platforms, healthcare, and policy evaluation.