1. Introduction
Causal inference and reinforcement learning are deeply connected: both reason about interventions ("what happens if we act differently?") and counterfactuals. RL adds the sequential element, where each action also changes the data the agent observes next.
2. Off-Policy Evaluation
Off-policy evaluation (OPE) estimates the value of a new policy using data from a different (logging) policy.
Key Methods:
- Importance Sampling: Reweight observed rewards by the ratio of target-policy to logging-policy action probabilities (propensity ratios); see the sketch after this list
- Doubly Robust: Combine a model-based reward estimate with importance sampling, so the estimate stays consistent if either the reward model or the propensities are correct; also sketched below
- Model-based: Learn a reward (or dynamics) model and evaluate the new policy under it
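The following is a minimal sketch of the first two estimators, assuming logged arrays of rewards, logging propensities, target-policy propensities for the logged actions, and (for doubly robust) a fitted reward model; the function names and array shapes are illustrative, not taken from any specific library.

# Sketch: importance sampling (IPS) and doubly robust (DR) OPE estimators.
# Assumed inputs (illustrative): rewards r, logging propensities p_log = pi_log(a|x),
# target propensities p_tgt = pi_tgt(a|x) for the logged actions, and for DR a
# reward model giving q_hat(x, a) at the logged action and
# v_hat(x) = sum_a pi_tgt(a|x) * q_hat(x, a).
import numpy as np

def ips_estimate(rewards, p_tgt, p_log):
    # V_IPS = mean over logged samples of (pi_tgt / pi_log) * r
    weights = p_tgt / p_log
    return np.mean(weights * rewards)

def dr_estimate(rewards, p_tgt, p_log, q_hat_logged, v_hat):
    # V_DR = mean of v_hat(x) + (pi_tgt / pi_log) * (r - q_hat(x, a))
    weights = p_tgt / p_log
    return np.mean(v_hat + weights * (rewards - q_hat_logged))

If the reward model is accurate, the correction term has low variance; if the propensities are correct, the model's bias cancels in expectation. That is the sense in which the combined estimator is "doubly robust."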
3. Contextual Bandits
Contextual bandits are a simplified RL setting: one-step decision problems where we choose actions based on context to maximize rewards.
# Epsilon-greedy contextual bandit
import numpy as np
from sklearn.linear_model import LinearRegression

class ContextualBandit:
    def __init__(self, n_actions, epsilon=0.1):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.models = [LinearRegression() for _ in range(n_actions)]
        # LinearRegression.fit is not incremental, so keep per-action history
        self.history = [[] for _ in range(n_actions)]

    def select_action(self, context):
        # Explore with probability epsilon, or while any action is still untried
        if np.random.rand() < self.epsilon or any(len(h) == 0 for h in self.history):
            return np.random.randint(self.n_actions)
        # Exploit: choose the action with the highest predicted reward
        q_values = [model.predict(context.reshape(1, -1))[0]
                    for model in self.models]
        return int(np.argmax(q_values))

    def update(self, context, action, reward):
        # Refit the chosen action's model on all data observed for it so far
        self.history[action].append((context, reward))
        X = np.vstack([c for c, _ in self.history[action]])
        y = [r for _, r in self.history[action]]
        self.models[action].fit(X, y)
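A hypothetical interaction loop, using a synthetic linear-reward environment (made up for illustration) in place of real logged traffic:

# Synthetic environment: 3 actions, 5-dimensional contexts, linear rewards
rng = np.random.default_rng(0)
true_weights = rng.normal(size=(3, 5))   # one hidden weight vector per action
bandit = ContextualBandit(n_actions=3, epsilon=0.1)

for step in range(1000):
    context = rng.normal(size=5)
    action = bandit.select_action(context)
    # Reward: linear in the context for the chosen action, plus noise
    reward = true_weights[action] @ context + 0.1 * rng.normal()
    bandit.update(context, action, reward)
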
4. Counterfactual Reasoning in RL
Counterfactual reasoning in RL asks: "What would have happened if the agent had taken a different action?" This enables learning from suboptimal historical policies.
- Hindsight Experience Replay: Learn from failures by relabeling goals (sketched after this list)
- Counterfactual Q-learning: Estimate Q-values for unobserved actions
- Causal world models: Learn interventional dynamics
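As a concrete illustration of the first idea, here is a minimal sketch of hindsight goal relabeling in the style of Hindsight Experience Replay, assuming goal-conditioned transitions stored as dictionaries and a "final"-goal relabeling strategy with a sparse 0/-1 reward; the field names and tolerance are illustrative assumptions.

# Sketch: hindsight goal relabeling (HER-style, "final" strategy).
# Assumes each transition is a dict with numpy-array fields
# "achieved_goal" and "goal", plus "reward"; names are illustrative.
import numpy as np

def relabel_with_final_goal(episode, tol=1e-3):
    """Return the original transitions plus copies relabeled with the
    goal the agent actually achieved at the end of the episode."""
    final_achieved = episode[-1]["achieved_goal"]
    relabeled = []
    for t in episode:
        new_t = dict(t)
        new_t["goal"] = final_achieved
        # Recompute the sparse reward under the relabeled goal
        success = np.linalg.norm(t["achieved_goal"] - final_achieved) < tol
        new_t["reward"] = 0.0 if success else -1.0
        relabeled.append(new_t)
    return episode + relabeled

The relabeled copies turn a "failed" episode into successful experience for the goal that was actually reached, which is where the sample-efficiency gain comes from.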
5. Key Takeaways
- ✓ OPE enables safe policy evaluation from offline data
- ✓ Contextual bandits bridge causal inference and RL
- ✓ Counterfactual reasoning improves sample efficiency in RL
6. Next Week Preview
Module 6, Week 1: Real-World Applications
We'll explore practical applications of causal inference in A/B testing, tech platforms, healthcare, and policy evaluation.