1. Introduction: Beyond Matching
In the previous article, we covered matching and propensity score methods to handle confounding in observational data. These methods work well when we can measure and balance all confounders.
But what if:
- We have many confounders and want a more efficient approach?
- Treatment assignment depends on unobserved factors (endogeneity)?
- We have panel data with repeated observations over time?
- Treatment is assigned based on a threshold or cutoff?
This article covers regression-based methods that address these scenarios through different identification strategies.
2. Regression Adjustment
2.1 Linear Regression for Causal Effects
The simplest regression approach fits a linear model of the outcome on treatment and covariates:
Yi = β0 + τWi + β′Xi + εi
Where:
- τ is the estimated treatment effect (coefficient on Wi)
- Xi are confounding covariates (e.g., customer age, past purchases, browsing history)
- The regression "controls for" X by conditioning on these variables
Example: Promo Effect Controlling for Loyalty
import statsmodels.formula.api as smf
import pandas as pd
# Simulate data
data = pd.DataFrame({
'purchase': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
'promo': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
'past_purchases': [5, 2, 8, 6, 1, 7, 3, 4, 9, 2],
'age': [35, 22, 41, 28, 19, 45, 31, 27, 38, 24]
})
# Regression with controls
model = smf.ols('purchase ~ promo + past_purchases + age', data=data).fit()
print(model.summary())
# Treatment effect = coefficient on 'promo'
print(f"Estimated ATE: {model.params['promo']:.3f}")
print(f"95% CI: [{model.conf_int().loc['promo', 0]:.3f}, "
f"{model.conf_int().loc['promo', 1]:.3f}]")2.2 Conditional Ignorability
For regression to identify causal effects, we need the conditional ignorability (or unconfoundedness) assumption:
Conditional Ignorability:
(Y(1), Y(0)) ⊥ W | X
In words: conditional on observed covariates X, treatment assignment W is independent of potential outcomes. All confounders are observed and included in X.
This is a strong assumption. If there are unobserved confounders (e.g., customer motivation, which we can't measure), regression will give biased estimates.
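To see the bias concretely, here is a minimal simulation (hypothetical data-generating process; names and coefficients are illustrative) comparing a regression that omits the confounder with one that includes it:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# Hypothetical DGP: motivation drives both promo uptake and spending
np.random.seed(0)
n = 5000
motivation = np.random.normal(0, 1, n)
promo = (motivation + np.random.normal(0, 1, n) > 0).astype(int)
spending = 2.0 * promo + 3.0 * motivation + np.random.normal(0, 1, n)
df = pd.DataFrame({'spending': spending, 'promo': promo,
                   'motivation': motivation})
# Omitting the confounder biases the promo coefficient upward
naive = smf.ols('spending ~ promo', data=df).fit()
# Conditioning on the (here, observable) confounder recovers ~2.0
adjusted = smf.ols('spending ~ promo + motivation', data=df).fit()
print(f"Naive estimate: {naive.params['promo']:.2f}")
print(f"Adjusted estimate: {adjusted.params['promo']:.2f}  (true effect: 2.0)")
In real data, motivation is unobserved, so only the biased estimate is available; this is exactly the scenario instrumental variables (Section 4) address.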
2.3 The Bad Controls Problem
Not all variables should be controlled for. Including certain variables can introduce bias:
⚠️ Don't Control For:
- Mediators: Variables on the causal path from treatment to outcome (blocks the effect)
- Colliders: Variables caused by both treatment and outcome (induces spurious correlation)
- Post-treatment variables: Variables affected by treatment (endogenous controls)
Example: Bad Control in Promo Study
Suppose we control for "number of items viewed after receiving promo." This is a post-treatment variable—the promo may cause customers to browse more, which then leads to purchases.
Controlling for browsing behavior blocks part of the causal pathway: Promo → Browsing → Purchase. We'd underestimate the true effect.
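The same point can be checked by simulation. In this sketch (hypothetical coefficients), browsing is a mediator on the promo-to-purchase path, and conditioning on it strips out the indirect effect:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
np.random.seed(1)
n = 5000
promo = np.random.binomial(1, 0.5, n)
# Browsing is a mediator: Promo -> Browsing -> Purchases
browsing = 2.0 * promo + np.random.normal(0, 1, n)
purchases = 1.0 * promo + 0.5 * browsing + np.random.normal(0, 1, n)
df = pd.DataFrame({'promo': promo, 'browsing': browsing,
                   'purchases': purchases})
# Total effect = direct (1.0) + indirect (2.0 * 0.5) = 2.0
total = smf.ols('purchases ~ promo', data=df).fit()
# Conditioning on the mediator blocks the indirect path, leaving ~1.0
blocked = smf.ols('purchases ~ promo + browsing', data=df).fit()
print(f"Total effect: {total.params['promo']:.2f}")
print(f"Controlling for mediator: {blocked.params['promo']:.2f}")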
Rule of thumb: Only control for pre-treatment covariates that confound treatment assignment. Use DAGs (from Module 1, Week 2) to identify valid adjustment sets.
3. Difference-in-Differences (DiD)
Difference-in-Differences is a powerful method for panel data when treatment is introduced at a specific point in time.
3.1 Parallel Trends Assumption
DiD relies on the parallel trends assumption:
Parallel Trends:
In the absence of treatment, the average outcomes for treated and control groups would follow parallel trajectories over time.
Example: Promo Rollout by Region
Suppose your company rolls out the 20% promo in the Northeast region in March, but not in the Midwest (control region).
Pre-treatment (Jan-Feb): Both regions show similar purchase trends.
Post-treatment (Mar-Apr): Northeast purchases increase more than Midwest.
The DiD estimate compares the change in purchases in Northeast to the change in Midwest, removing time trends common to both regions.
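Before running DiD, it is standard to eyeball pre-treatment trends. A minimal sketch with hypothetical monthly averages:
import matplotlib.pyplot as plt
# Hypothetical monthly average purchases by region (promo launches in NE in March)
months = ['Jan', 'Feb', 'Mar', 'Apr']
ne = [100, 102, 150, 155]
mw = [95, 97, 105, 108]
plt.plot(months, ne, marker='o', label='Northeast (treated)')
plt.plot(months, mw, marker='o', label='Midwest (control)')
plt.axvline(1.5, color='red', linestyle='--', label='Promo launch')  # between Feb and Mar
plt.ylabel('Avg purchases')
plt.title('Pre-trends check')
plt.legend()
plt.show()
Roughly parallel pre-period lines support (but can never prove) the assumption; diverging pre-trends are a red flag.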
3.2 DiD Estimation
The DiD estimator is:
τ̂DiD = (Ȳtreated,post − Ȳtreated,pre) − (Ȳcontrol,post − Ȳcontrol,pre)
Or equivalently, via regression:
Yit = β0 + β1Treatedi + β2Postt + τ(Treatedi × Postt) + εit
Where:
- Treatedi: 1 if unit is in treated group, 0 otherwise
- Postt: 1 if time period is post-treatment, 0 otherwise
- τ (coefficient on interaction): DiD estimate of treatment effect
3.3 Python Implementation
import pandas as pd
import statsmodels.formula.api as smf
# Simulate panel data
data = pd.DataFrame({
'region': ['NE', 'NE', 'MW', 'MW', 'NE', 'NE', 'MW', 'MW'],
'time': ['Jan', 'Mar', 'Jan', 'Mar', 'Feb', 'Apr', 'Feb', 'Apr'],
'purchases': [100, 150, 95, 105, 102, 155, 97, 108]
})
data['treated'] = (data['region'] == 'NE').astype(int)
data['post'] = data['time'].isin(['Mar', 'Apr']).astype(int)
# DiD regression
model = smf.ols('purchases ~ treated + post + treated:post', data=data).fit()
print(model.summary())
# DiD estimate
tau_did = model.params['treated:post']
print(f"\nDiD Estimate: {tau_did:.2f}")
print(f"Interpretation: Promo increased purchases by {tau_did:.2f} on average")
# Manual calculation
ne_pre = data[(data['region']=='NE') & (data['post']==0)]['purchases'].mean()
ne_post = data[(data['region']=='NE') & (data['post']==1)]['purchases'].mean()
mw_pre = data[(data['region']=='MW') & (data['post']==0)]['purchases'].mean()
mw_post = data[(data['region']=='MW') & (data['post']==1)]['purchases'].mean()
tau_manual = (ne_post - ne_pre) - (mw_post - mw_pre)
print(f"Manual DiD: {tau_manual:.2f}")4. Instrumental Variables (IV)
When treatment is endogenous (correlated with unobserved confounders), standard regression is biased. Instrumental Variables provide a way to identify causal effects using an exogenous source of variation.
4.1 IV Assumptions
An instrument Z must satisfy three conditions:
- Relevance: Z is correlated with treatment W (Cov(Z, W) ≠ 0)
- Exclusion Restriction: Z affects Y only through W (not directly)
- Exogeneity: Z is uncorrelated with the error term (no confounding)
4.2 Two-Stage Least Squares (2SLS)
The standard IV estimator is Two-Stage Least Squares (2SLS):
Stage 1: First-Stage Regression
Wi = γ0 + γ1Zi + ui
Predict treatment Ŵi using the instrument Z. This gives us the exogenous variation in treatment.
Stage 2: Second-Stage Regression
Yi = β0 + τŴi + εi
Regress outcome on predicted treatment Ŵi. The coefficient τ is the causal effect.
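The two stages can be run by hand with plain OLS. The sketch below uses a hypothetical data-generating process (names and coefficients are illustrative); note that manually computed second-stage standard errors are wrong, which is why dedicated IV routines are preferred in practice:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# Hypothetical DGP: unobserved motivation confounds promo and spending;
# random email delivery shifts promo receipt but nothing else
np.random.seed(2)
n = 5000
motivation = np.random.normal(0, 1, n)
delivered = np.random.binomial(1, 0.7, n)  # instrument
promo = ((0.8 * delivered + 0.5 * motivation
          + np.random.normal(0, 1, n)) > 0.5).astype(int)
spending = 2.0 * promo + 1.5 * motivation + np.random.normal(0, 1, n)
df = pd.DataFrame({'spending': spending, 'promo': promo,
                   'delivered': delivered})
# Stage 1: regress treatment on the instrument, save fitted values
stage1 = smf.ols('promo ~ delivered', data=df).fit()
df['promo_hat'] = stage1.fittedvalues
# Stage 2: regress outcome on the predicted treatment
stage2 = smf.ols('spending ~ promo_hat', data=df).fit()
ols = smf.ols('spending ~ promo', data=df).fit()
print(f"OLS (biased): {ols.params['promo']:.2f}")
print(f"2SLS estimate: {stage2.params['promo_hat']:.2f}  (true effect: 2.0)")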
4.3 Example: Email Deliverability as IV
Scenario:
Suppose customers self-select into receiving promotional emails (those who opted in). These customers likely differ in unobserved ways (motivation, brand affinity).
Instrument: Random email server delays cause some customers to receive the promo email, while others don't (due to spam filters or delivery failures)—independent of customer characteristics.
- Relevance: Email delivery predicts promo receipt
- Exclusion: Delivery only affects purchases through promo receipt (not directly)
- Exogeneity: Delivery is random (uncorrelated with customer traits)
import pandas as pd
from linearmodels.iv import IV2SLS
# Simulate data with endogeneity
data = pd.DataFrame({
'purchase': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
'promo_received': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
'email_delivered': [1, 0, 1, 1, 0, 1, 0, 0, 1, 1], # Instrument
'motivation': [5, 2, 4, 4, 1, 5, 2, 3, 5, 2] # Unobserved confounder
})
# 2SLS estimation (add a constant so the model has an intercept)
data['const'] = 1
iv_model = IV2SLS(
    dependent=data['purchase'],
    exog=data[['const']],  # intercept term
    endog=data['promo_received'],  # endogenous treatment
    instruments=data['email_delivered']  # instrument
).fit()
print(iv_model.summary)
print(f"\nIV Estimate: {iv_model.params['promo_received']:.3f}")
4.4 Weak Instruments Problem
⚠️ Weak Instruments:
If Z is only weakly correlated with W (small F-statistic in first stage), IV estimates are biased and have large standard errors.
Rule of thumb: F-statistic > 10 in first stage. If F < 10, instrument is weak.
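A quick check for the email example above (reusing the data frame from Section 4.3; with a single instrument, the overall F-statistic of the first-stage regression is the relevant diagnostic):
import statsmodels.formula.api as smf
# First-stage regression: does the instrument actually move the treatment?
first_stage = smf.ols('promo_received ~ email_delivered', data=data).fit()
print(f"First-stage F-statistic: {first_stage.fvalue:.1f}")
if first_stage.fvalue < 10:
    print("Warning: instrument may be weak (F < 10)")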
5. Regression Discontinuity Design (RDD)
5.1 RDD Intuition
Regression Discontinuity exploits situations where treatment is assigned based on a threshold of a running variable.
Example: Purchase History Threshold
Suppose promos are automatically sent to customers with >5 past purchases, but not to those with ≤5.
Customers just above and below the threshold (e.g., 5 vs 6 purchases) are very similar, but one group gets the promo and the other doesn't. We can compare outcomes at the discontinuity.
The RDD estimate is the jump in average outcomes at the cutoff.
5.2 Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# Simulate RDD data
np.random.seed(42)
running_var = np.random.uniform(0, 10, 500)
cutoff = 5
treated = (running_var > cutoff).astype(int)
# Generate outcome with discontinuity at cutoff
baseline = 0.5 * running_var + np.random.normal(0, 1, 500)
treatment_effect = 3
outcome = baseline + treatment_effect * treated
data = pd.DataFrame({
'running_var': running_var,
'treated': treated,
'outcome': outcome
})
# Local linear regression around cutoff
bandwidth = 2
local_data = data[(data['running_var'] >= cutoff - bandwidth) &
(data['running_var'] <= cutoff + bandwidth)]
# Estimate RDD
model = smf.ols('outcome ~ running_var + treated + running_var:treated',
local_data).fit()
rdd_estimate = model.params['treated']
print(f"RDD Estimate: {rdd_estimate:.3f}")
print(f"True Effect: {treatment_effect}")
# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(data['running_var'], data['outcome'], alpha=0.3, s=10)
plt.axvline(cutoff, color='red', linestyle='--', label='Cutoff')
plt.xlabel('Running Variable (Past Purchases)')
plt.ylabel('Outcome (Purchase Rate)')
plt.title('Regression Discontinuity Design')
plt.legend()
plt.show()
6. Fixed Effects Models
6.1 Individual Fixed Effects
Fixed effects models control for time-invariant unobserved heterogeneity by including unit-specific intercepts:
Yit = αi + τWit + εit
Where αi is a customer-specific fixed effect capturing all time-invariant characteristics (observed and unobserved).
Example: Customer-Level Panel Data
Track the same customers over multiple months. Some receive promos in certain months, others don't.
Fixed effects remove the influence of time-invariant traits (loyalty, income, preferences) by comparing each customer to themselves over time.
from linearmodels.panel import PanelOLS
# Simulate panel data
data = pd.DataFrame({
'customer_id': [1, 1, 2, 2, 3, 3, 4, 4],
'month': [1, 2, 1, 2, 1, 2, 1, 2],
'promo': [0, 1, 0, 0, 1, 1, 0, 1],
'purchase': [0, 1, 0, 0, 1, 1, 0, 1]
})
data = data.set_index(['customer_id', 'month'])
# Fixed effects regression
fe_model = PanelOLS(data['purchase'], data[['promo']], entity_effects=True).fit()
print(fe_model.summary)
print(f"\nFE Estimate: {fe_model.params['promo']:.3f}")6.2 Fixed Effects with DiD
Combining fixed effects with DiD provides a robust approach for panel data:
Yit = αi + λt + τ(Treatedi × Postt) + εit
Where:
- αi: Unit fixed effects (control for time-invariant differences)
- λt: Time fixed effects (control for common time trends)
- τ: DiD estimate
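In linearmodels, the λt terms can be added with time_effects=True. A sketch reusing the panel from Section 6.1:
import pandas as pd
from linearmodels.panel import PanelOLS
data = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'month': [1, 2, 1, 2, 1, 2, 1, 2],
    'promo': [0, 1, 0, 0, 1, 1, 0, 1],
    'purchase': [0, 1, 0, 0, 1, 1, 0, 1]
}).set_index(['customer_id', 'month'])
# entity_effects absorbs alpha_i, time_effects absorbs lambda_t;
# the coefficient on promo is the DiD-style tau
twfe = PanelOLS(data['purchase'], data[['promo']],
                entity_effects=True, time_effects=True).fit()
print(f"Two-way FE estimate: {twfe.params['promo']:.3f}")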
7. Key Takeaways
- ✓ Regression adjustment controls for observed confounders but requires conditional ignorability
- ✓ DiD uses before-after comparisons to remove time-invariant confounding and common time trends
- ✓ IV/2SLS handles endogeneity using exogenous instruments, but requires strong assumptions
- ✓ RDD exploits threshold-based treatment assignment for local causal estimates
- ✓ Fixed effects remove time-invariant unobserved heterogeneity in panel data
- ✓ Each method has different identification assumptions; choose based on your setting
8. Next Week Preview
Module 3, Week 1: Double/Debiased Machine Learning
We'll enter the world of modern causal ML methods. Learn how to combine machine learning with causal inference using double/debiased ML (DML) to handle high-dimensional confounders and avoid regularization bias. We'll cover Neyman orthogonality, cross-fitting, and practical implementations.