1. Introduction: Beyond Matching
In the previous article, we covered matching and propensity score methods to handle confounding in observational data. These methods work well when we can measure and balance all confounders.
But what if:
- We have many confounders and want a more efficient approach?
- Treatment assignment depends on unobserved factors (endogeneity)?
- We have panel data with repeated observations over time?
- Treatment is assigned based on a threshold or cutoff?
This article covers regression-based methods that address these scenarios through different identification strategies.
2. Regression Adjustment
2.1 Linear Regression for Causal Effects
The simplest regression approach fits a linear model of the outcome on treatment and covariates:
Yi = β0 + τWi + β′Xi + εi
Where:
- τ is the estimated treatment effect (coefficient on Wi)
- Xi are confounding covariates (e.g., customer age, past purchases, browsing history)
- The regression "controls for" X by conditioning on these variables
Example: Promo Effect Controlling for Loyalty
import statsmodels.formula.api as smf
import pandas as pd
# Simulate data
data = pd.DataFrame({
'purchase': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
'promo': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
'past_purchases': [5, 2, 8, 6, 1, 7, 3, 4, 9, 2],
'age': [35, 22, 41, 28, 19, 45, 31, 27, 38, 24]
})
# Regression with controls
model = smf.ols('purchase ~ promo + past_purchases + age', data=data).fit()
print(model.summary())
# Treatment effect = coefficient on 'promo'
print(f"Estimated ATE: {model.params['promo']:.3f}")
print(f"95% CI: [{model.conf_int().loc['promo', 0]:.3f}, "
f"{model.conf_int().loc['promo', 1]:.3f}]")2.2 Conditional Ignorability
For regression to identify causal effects, we need the conditional ignorability (or unconfoundedness) assumption:
Conditional Ignorability:
(Y(1), Y(0)) ⊥ W | X
In words: conditional on observed covariates X, treatment assignment W is independent of potential outcomes. All confounders are observed and included in X.
This is a strong assumption. If there are unobserved confounders (e.g., customer motivation, which we can't measure), regression will give biased estimates.
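To see the bias concretely, here is a minimal simulation (hypothetical data-generating process; names and coefficients are illustrative) comparing a regression that omits the confounder with one that includes it:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# Hypothetical DGP: motivation drives both promo uptake and spending
np.random.seed(0)
n = 5000
motivation = np.random.normal(0, 1, n)
promo = (motivation + np.random.normal(0, 1, n) > 0).astype(int)
spending = 2.0 * promo + 3.0 * motivation + np.random.normal(0, 1, n)
df = pd.DataFrame({'spending': spending, 'promo': promo,
                   'motivation': motivation})
# Omitting the confounder biases the promo coefficient upward
naive = smf.ols('spending ~ promo', data=df).fit()
# Conditioning on the (here, observable) confounder recovers ~2.0
adjusted = smf.ols('spending ~ promo + motivation', data=df).fit()
print(f"Naive estimate: {naive.params['promo']:.2f}")
print(f"Adjusted estimate: {adjusted.params['promo']:.2f}  (true effect: 2.0)")
In real data, motivation is unobserved, so only the biased estimate is available; this is exactly the scenario instrumental variables (Section 4) address.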
2.3 The Bad Controls Problem
Not all variables should be controlled for. Including certain variables can introduce bias:
⚠️ Don't Control For:
- Mediators: Variables on the causal path from treatment to outcome (blocks the effect)
- Colliders: Variables caused by both treatment and outcome (induces spurious correlation)
- Post-treatment variables: Variables affected by treatment (endogenous controls)
Example: Bad Control in Promo Study
Suppose we control for "number of items viewed after receiving promo." This is a post-treatment variable—the promo may cause customers to browse more, which then leads to purchases.
Controlling for browsing behavior blocks part of the causal pathway: Promo → Browsing → Purchase. We'd underestimate the true effect.
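The same point can be checked by simulation. In this sketch (hypothetical coefficients), browsing is a mediator on the promo-to-purchase path, and conditioning on it strips out the indirect effect:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
np.random.seed(1)
n = 5000
promo = np.random.binomial(1, 0.5, n)
# Browsing is a mediator: Promo -> Browsing -> Purchases
browsing = 2.0 * promo + np.random.normal(0, 1, n)
purchases = 1.0 * promo + 0.5 * browsing + np.random.normal(0, 1, n)
df = pd.DataFrame({'promo': promo, 'browsing': browsing,
                   'purchases': purchases})
# Total effect = direct (1.0) + indirect (2.0 * 0.5) = 2.0
total = smf.ols('purchases ~ promo', data=df).fit()
# Conditioning on the mediator blocks the indirect path, leaving ~1.0
blocked = smf.ols('purchases ~ promo + browsing', data=df).fit()
print(f"Total effect: {total.params['promo']:.2f}")
print(f"Controlling for mediator: {blocked.params['promo']:.2f}")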
Rule of thumb: Only control for pre-treatment covariates that confound treatment assignment. Use DAGs (from Module 1, Week 2) to identify valid adjustment sets.
3. Difference-in-Differences (DiD)
Difference-in-Differences is a powerful method for panel data when treatment is introduced at a specific point in time.
3.1 Parallel Trends Assumption
DiD relies on the parallel trends assumption:
Parallel Trends:
In the absence of treatment, the average outcomes for treated and control groups would follow parallel trajectories over time.
Example: Promo Rollout by Region
Suppose your company rolls out the 20% promo in the Northeast region in March, but not in the Midwest (control region).
Pre-treatment (Jan-Feb): Both regions show similar purchase trends.
Post-treatment (Mar-Apr): Northeast purchases increase more than Midwest.
The DiD estimate compares the change in purchases in Northeast to the change in Midwest, removing time trends common to both regions.
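Before running DiD, it is standard to eyeball pre-treatment trends. A minimal sketch with hypothetical monthly averages:
import matplotlib.pyplot as plt
# Hypothetical monthly average purchases by region (promo launches in NE in March)
months = ['Jan', 'Feb', 'Mar', 'Apr']
ne = [100, 102, 150, 155]
mw = [95, 97, 105, 108]
plt.plot(months, ne, marker='o', label='Northeast (treated)')
plt.plot(months, mw, marker='o', label='Midwest (control)')
plt.axvline(1.5, color='red', linestyle='--', label='Promo launch')  # between Feb and Mar
plt.ylabel('Avg purchases')
plt.title('Pre-trends check')
plt.legend()
plt.show()
Roughly parallel pre-period lines support (but can never prove) the assumption; diverging pre-trends are a red flag.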
3.2 DiD Estimation
The DiD estimator is:
τ̂DiD = (Ȳtreated,post − Ȳtreated,pre) − (Ȳcontrol,post − Ȳcontrol,pre)
Or equivalently, via regression:
Yit = β0 + β1Treatedi + β2Postt + τ(Treatedi × Postt) + εit
Where:
- Treatedi: 1 if unit is in treated group, 0 otherwise
- Postt: 1 if time period is post-treatment, 0 otherwise
- τ (coefficient on interaction): DiD estimate of treatment effect
3.3 Python Implementation
import pandas as pd
import statsmodels.formula.api as smf
# Simulate panel data
data = pd.DataFrame({
'region': ['NE', 'NE', 'MW', 'MW', 'NE', 'NE', 'MW', 'MW'],
'time': ['Jan', 'Mar', 'Jan', 'Mar', 'Feb', 'Apr', 'Feb', 'Apr'],
'purchases': [100, 150, 95, 105, 102, 155, 97, 108]
})
data['treated'] = (data['region'] == 'NE').astype(int)
data['post'] = data['time'].isin(['Mar', 'Apr']).astype(int)
# DiD regression
model = smf.ols('purchases ~ treated + post + treated:post', data=data).fit()
print(model.summary())
# DiD estimate
tau_did = model.params['treated:post']
print(f"\nDiD Estimate: {tau_did:.2f}")
print(f"Interpretation: Promo increased purchases by {tau_did:.2f} on average")
# Manual calculation
ne_pre = data[(data['region']=='NE') & (data['post']==0)]['purchases'].mean()
ne_post = data[(data['region']=='NE') & (data['post']==1)]['purchases'].mean()
mw_pre = data[(data['region']=='MW') & (data['post']==0)]['purchases'].mean()
mw_post = data[(data['region']=='MW') & (data['post']==1)]['purchases'].mean()
tau_manual = (ne_post - ne_pre) - (mw_post - mw_pre)
print(f"Manual DiD: {tau_manual:.2f}")4. Instrumental Variables (IV)
When treatment is endogenous (correlated with unobserved confounders), standard regression is biased. Instrumental Variables provide a way to identify causal effects using an exogenous source of variation.
4.1 IV Assumptions
An instrument Z must satisfy three conditions:
- Relevance: Z is correlated with treatment W (Cov(Z, W) ≠ 0)
- Exclusion Restriction: Z affects Y only through W (not directly)
- Exogeneity: Z is uncorrelated with the error term (no confounding)
4.2 Two-Stage Least Squares (2SLS)
The standard IV estimator is Two-Stage Least Squares (2SLS):
Stage 1: First-Stage Regression
Wi = γ0 + γ1Zi + ui
Predict treatment Ŵi using the instrument Z. This gives us the exogenous variation in treatment.
Stage 2: Second-Stage Regression
Yi = β0 + τŴi + εi
Regress outcome on predicted treatment Ŵi. The coefficient τ is the causal effect.
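The two stages can be run by hand with plain OLS. The sketch below uses a hypothetical data-generating process (names and coefficients are illustrative); note that manually computed second-stage standard errors are wrong, which is why dedicated IV routines are preferred in practice:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# Hypothetical DGP: unobserved motivation confounds promo and spending;
# random email delivery shifts promo receipt but nothing else
np.random.seed(2)
n = 5000
motivation = np.random.normal(0, 1, n)
delivered = np.random.binomial(1, 0.7, n)  # instrument
promo = ((0.8 * delivered + 0.5 * motivation
          + np.random.normal(0, 1, n)) > 0.5).astype(int)
spending = 2.0 * promo + 1.5 * motivation + np.random.normal(0, 1, n)
df = pd.DataFrame({'spending': spending, 'promo': promo,
                   'delivered': delivered})
# Stage 1: regress treatment on the instrument, save fitted values
stage1 = smf.ols('promo ~ delivered', data=df).fit()
df['promo_hat'] = stage1.fittedvalues
# Stage 2: regress outcome on the predicted treatment
stage2 = smf.ols('spending ~ promo_hat', data=df).fit()
ols = smf.ols('spending ~ promo', data=df).fit()
print(f"OLS (biased): {ols.params['promo']:.2f}")
print(f"2SLS estimate: {stage2.params['promo_hat']:.2f}  (true effect: 2.0)")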
4.3 Example: Email Deliverability as IV
Scenario:
Suppose customers self-select into receiving promotional emails (those who opted in). These customers likely differ in unobserved ways (motivation, brand affinity).
Instrument: Random email server delays cause some customers to receive the promo email, while others don't (due to spam filters or delivery failures)—independent of customer characteristics.
- Relevance: Email delivery predicts promo receipt
- Exclusion: Delivery only affects purchases through promo receipt (not directly)
- Exogeneity: Delivery is random (uncorrelated with customer traits)
import pandas as pd
from linearmodels.iv import IV2SLS
# Simulate data with endogeneity
data = pd.DataFrame({
'purchase': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
'promo_received': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
'email_delivered': [1, 0, 1, 1, 0, 1, 0, 0, 1, 1], # Instrument
'motivation': [5, 2, 4, 4, 1, 5, 2, 3, 5, 2] # Unobserved confounder
})
# 2SLS estimation (add a constant so the model has an intercept)
data['const'] = 1
iv_model = IV2SLS(
    dependent=data['purchase'],
    exog=data[['const']],  # intercept term
    endog=data['promo_received'],  # endogenous treatment
    instruments=data['email_delivered']  # instrument
).fit()
print(iv_model.summary)
print(f"\nIV Estimate: {iv_model.params['promo_received']:.3f}")
4.4 Weak Instruments Problem
⚠️ Weak Instruments:
If Z is only weakly correlated with W (small F-statistic in first stage), IV estimates are biased and have large standard errors.
Rule of thumb: F-statistic > 10 in first stage. If F < 10, instrument is weak.
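A quick check for the email example above (reusing the data frame from Section 4.3; with a single instrument, the overall F-statistic of the first-stage regression is the relevant diagnostic):
import statsmodels.formula.api as smf
# First-stage regression: does the instrument actually move the treatment?
first_stage = smf.ols('promo_received ~ email_delivered', data=data).fit()
print(f"First-stage F-statistic: {first_stage.fvalue:.1f}")
if first_stage.fvalue < 10:
    print("Warning: instrument may be weak (F < 10)")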
5. Regression Discontinuity Design (RDD)
5.1 RDD Intuition
Regression Discontinuity exploits situations where treatment is assigned based on a threshold of a running variable.
Example: Purchase History Threshold
Suppose promos are automatically sent to customers with >5 past purchases, but not to those with ≤5.
Customers just above and below the threshold (e.g., 5 vs 6 purchases) are very similar, but one group gets the promo and the other doesn't. We can compare outcomes at the discontinuity.
The RDD estimate is the jump in average outcomes at the cutoff.
5.2 Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# Simulate RDD data
np.random.seed(42)
running_var = np.random.uniform(0, 10, 500)
cutoff = 5
treated = (running_var > cutoff).astype(int)
# Generate outcome with discontinuity at cutoff
baseline = 0.5 * running_var + np.random.normal(0, 1, 500)
treatment_effect = 3
outcome = baseline + treatment_effect * treated
data = pd.DataFrame({
'running_var': running_var,
'treated': treated,
'outcome': outcome
})
# Local linear regression around cutoff
bandwidth = 2
local_data = data[(data['running_var'] >= cutoff - bandwidth) &
(data['running_var'] <= cutoff + bandwidth)]
# Estimate RDD
model = smf.ols('outcome ~ running_var + treated + running_var:treated',
local_data).fit()
rdd_estimate = model.params['treated']
print(f"RDD Estimate: {rdd_estimate:.3f}")
print(f"True Effect: {treatment_effect}")
# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(data['running_var'], data['outcome'], alpha=0.3, s=10)
plt.axvline(cutoff, color='red', linestyle='--', label='Cutoff')
plt.xlabel('Running Variable (Past Purchases)')
plt.ylabel('Outcome (Purchase Rate)')
plt.title('Regression Discontinuity Design')
plt.legend()
plt.show()
6. Fixed Effects Models
6.1 Individual Fixed Effects
Fixed effects models control for time-invariant unobserved heterogeneity by including unit-specific intercepts:
Yit = αi + τWit + εit
Where αi is a customer-specific fixed effect capturing all time-invariant characteristics (observed and unobserved).
Example: Customer-Level Panel Data
Track the same customers over multiple months. Some receive promos in certain months, others don't.
Fixed effects remove the influence of time-invariant traits (loyalty, income, preferences) by comparing each customer to themselves over time.
from linearmodels.panel import PanelOLS
# Simulate panel data
data = pd.DataFrame({
'customer_id': [1, 1, 2, 2, 3, 3, 4, 4],
'month': [1, 2, 1, 2, 1, 2, 1, 2],
'promo': [0, 1, 0, 0, 1, 1, 0, 1],
'purchase': [0, 1, 0, 0, 1, 1, 0, 1]
})
data = data.set_index(['customer_id', 'month'])
# Fixed effects regression
fe_model = PanelOLS(data['purchase'], data[['promo']], entity_effects=True).fit()
print(fe_model.summary)
print(f"\nFE Estimate: {fe_model.params['promo']:.3f}")6.2 Fixed Effects with DiD
Combining fixed effects with DiD provides a robust approach for panel data:
Yit = αi + λt + τ(Treatedi × Postt) + εit
Where:
- αi: Unit fixed effects (control for time-invariant differences)
- λt: Time fixed effects (control for common time trends)
- τ: DiD estimate
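In linearmodels, the λt terms can be added with time_effects=True. A sketch reusing the panel from Section 6.1:
import pandas as pd
from linearmodels.panel import PanelOLS
data = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'month': [1, 2, 1, 2, 1, 2, 1, 2],
    'promo': [0, 1, 0, 0, 1, 1, 0, 1],
    'purchase': [0, 1, 0, 0, 1, 1, 0, 1]
}).set_index(['customer_id', 'month'])
# entity_effects absorbs alpha_i, time_effects absorbs lambda_t;
# the coefficient on promo is the DiD-style tau
twfe = PanelOLS(data['purchase'], data[['promo']],
                entity_effects=True, time_effects=True).fit()
print(f"Two-way FE estimate: {twfe.params['promo']:.3f}")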
7. Key Takeaways
- ✓ Regression adjustment controls for observed confounders but requires conditional ignorability
- ✓ DiD uses before-after comparisons to remove time-invariant confounding and common time trends
- ✓ IV/2SLS handles endogeneity using exogenous instruments, but requires strong assumptions
- ✓ RDD exploits threshold-based treatment assignment for local causal estimates
- ✓ Fixed effects remove time-invariant unobserved heterogeneity in panel data
- ✓ Each method has different identification assumptions; choose based on your setting
8. Next Week Preview
Module 3, Week 1: Double/Debiased Machine Learning
We'll enter the world of modern causal ML methods. Learn how to combine machine learning with causal inference using double/debiased ML (DML) to handle high-dimensional confounders and avoid regularization bias. We'll cover Neyman orthogonality, cross-fitting, and practical implementations.