1. Introduction
Causal inference and reinforcement learning are deeply connected: both reason about interventions ("what happens if we act differently?") and counterfactuals. RL adds the sequential element, where each action also changes the data the agent observes next.
2. Off-Policy Evaluation
Off-policy evaluation (OPE) estimates the value of a new policy using data from a different (logging) policy.
Key Methods:
- Importance Sampling: Reweight observed rewards by the ratio of target-policy to logging-policy action probabilities (propensity ratios); see the sketch after this list
- Doubly Robust: Combine a model-based reward estimate with importance sampling, so the estimate stays consistent if either the reward model or the propensities are correct; also sketched below
- Model-based: Learn a reward (or dynamics) model and evaluate the new policy under it
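The following is a minimal sketch of the first two estimators, assuming logged arrays of rewards, logging propensities, target-policy propensities for the logged actions, and (for doubly robust) a fitted reward model; the function names and array shapes are illustrative, not taken from any specific library.

# Sketch: importance sampling (IPS) and doubly robust (DR) OPE estimators.
# Assumed inputs (illustrative): rewards r, logging propensities p_log = pi_log(a|x),
# target propensities p_tgt = pi_tgt(a|x) for the logged actions, and for DR a
# reward model giving q_hat(x, a) at the logged action and
# v_hat(x) = sum_a pi_tgt(a|x) * q_hat(x, a).
import numpy as np

def ips_estimate(rewards, p_tgt, p_log):
    # V_IPS = mean over logged samples of (pi_tgt / pi_log) * r
    weights = p_tgt / p_log
    return np.mean(weights * rewards)

def dr_estimate(rewards, p_tgt, p_log, q_hat_logged, v_hat):
    # V_DR = mean of v_hat(x) + (pi_tgt / pi_log) * (r - q_hat(x, a))
    weights = p_tgt / p_log
    return np.mean(v_hat + weights * (rewards - q_hat_logged))

If the reward model is accurate, the correction term has low variance; if the propensities are correct, the model's bias cancels in expectation. That is the sense in which the combined estimator is "doubly robust."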
3. Contextual Bandits
Contextual bandits are a simplified RL setting: one-step decision problems where we choose actions based on context to maximize rewards.
# Epsilon-greedy contextual bandit
import numpy as np
from sklearn.linear_model import LinearRegression

class ContextualBandit:
    def __init__(self, n_actions, epsilon=0.1):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.models = [LinearRegression() for _ in range(n_actions)]
        # LinearRegression.fit is not incremental, so keep per-action history
        self.history = [[] for _ in range(n_actions)]

    def select_action(self, context):
        # Explore with probability epsilon, or while any action is still untried
        if np.random.rand() < self.epsilon or any(len(h) == 0 for h in self.history):
            return np.random.randint(self.n_actions)
        # Exploit: choose the action with the highest predicted reward
        q_values = [model.predict(context.reshape(1, -1))[0]
                    for model in self.models]
        return int(np.argmax(q_values))

    def update(self, context, action, reward):
        # Refit the chosen action's model on all data observed for it so far
        self.history[action].append((context, reward))
        X = np.vstack([c for c, _ in self.history[action]])
        y = [r for _, r in self.history[action]]
        self.models[action].fit(X, y)
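A hypothetical interaction loop, using a synthetic linear-reward environment (made up for illustration) in place of real logged traffic:

# Synthetic environment: 3 actions, 5-dimensional contexts, linear rewards
rng = np.random.default_rng(0)
true_weights = rng.normal(size=(3, 5))   # one hidden weight vector per action
bandit = ContextualBandit(n_actions=3, epsilon=0.1)

for step in range(1000):
    context = rng.normal(size=5)
    action = bandit.select_action(context)
    # Reward: linear in the context for the chosen action, plus noise
    reward = true_weights[action] @ context + 0.1 * rng.normal()
    bandit.update(context, action, reward)
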
4. Counterfactual Reasoning in RL
Counterfactual reasoning in RL asks: "What would have happened if the agent had taken a different action?" This enables learning from suboptimal historical policies.
- Hindsight Experience Replay: Learn from failures by relabeling goals (sketched after this list)
- Counterfactual Q-learning: Estimate Q-values for unobserved actions
- Causal world models: Learn interventional dynamics
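As a concrete illustration of the first idea, here is a minimal sketch of hindsight goal relabeling in the style of Hindsight Experience Replay, assuming goal-conditioned transitions stored as dictionaries and a "final"-goal relabeling strategy with a sparse 0/-1 reward; the field names and tolerance are illustrative assumptions.

# Sketch: hindsight goal relabeling (HER-style, "final" strategy).
# Assumes each transition is a dict with numpy-array fields
# "achieved_goal" and "goal", plus "reward"; names are illustrative.
import numpy as np

def relabel_with_final_goal(episode, tol=1e-3):
    """Return the original transitions plus copies relabeled with the
    goal the agent actually achieved at the end of the episode."""
    final_achieved = episode[-1]["achieved_goal"]
    relabeled = []
    for t in episode:
        new_t = dict(t)
        new_t["goal"] = final_achieved
        # Recompute the sparse reward under the relabeled goal
        success = np.linalg.norm(t["achieved_goal"] - final_achieved) < tol
        new_t["reward"] = 0.0 if success else -1.0
        relabeled.append(new_t)
    return episode + relabeled

The relabeled copies turn a "failed" episode into successful experience for the goal that was actually reached, which is where the sample-efficiency gain comes from.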
5. Key Takeaways
- ✓ OPE enables safe policy evaluation from offline data
- ✓ Contextual bandits bridge causal inference and RL
- ✓ Counterfactual reasoning improves sample efficiency in RL
6. Next Week Preview
Module 6, Week 1: Real-World Applications
We'll explore practical applications of causal inference in A/B testing, tech platforms, healthcare, and policy evaluation.