1. The Observational Data Challenge
In Week 1, we saw that randomization solves selection bias by making treatment assignment independent of potential outcomes. But what if we can't randomize? What if our marketing team already sent promos to customers they chose—likely targeting loyal customers, recent browsers, or high-value accounts?
Our Observational Data Problem:
• 5,000 customers received the 20% off promo (Treatment group)
• 15,000 customers didn't receive it (Control group)
• Treatment assignment was not random—marketing targeted based on customer characteristics
Selection bias: Treated customers differ systematically from control customers. Simply comparing outcomes would conflate the treatment effect with pre-existing differences.
The solution? Matching and propensity scores—methods that attempt to create comparable groups by adjusting for observed confounders.
2. Identification Strategy: Conditional Independence
Before diving into methods, we need a key assumption that makes causal inference possible with observational data:
Conditional Independence Assumption (CIA)
Also called: Unconfoundedness, Ignorability, Selection on Observables
Formal statement: (Y(1), Y(0)) ⊥ D | X
In words: Conditional on observed covariates X, treatment assignment is independent of potential outcomes. This means we've measured all confounders—all variables that affect both treatment and outcomes.
What This Means for Our Example
From our DAG analysis (Week 2), suppose we identified these confounders:
- Past purchases: Number of orders in last 90 days
- Account age: Days since customer registered
- Cart value: Total value of items currently in cart
- Browsing activity: Page views in last 7 days
- Email engagement: Open rate for marketing emails
CIA assumes that within groups of customers with the same values for these variables, promo assignment is "as-if random." If two customers have identical values for all X, the one who received the promo was essentially randomly selected.
⚠️ When CIA Fails:
CIA is untestable—we can't verify that no unmeasured confounders exist. If marketing also targeted based on unobserved factors (e.g., customer sentiment from support interactions, predicted churn risk from a proprietary model), CIA is violated and our estimates will be biased. This is why domain knowledge and careful confounder selection are critical.
Positivity Assumption
Along with CIA, we need positivity (also called common support or overlap):
Positivity:
0 < P(D=1 | X=x) < 1 for all x in the support of X
In words: Every customer (or every type of customer characterized by X) must have a non-zero probability of receiving treatment AND control. We can't estimate treatment effects for customer types who always or never receive promos—there's no counterfactual comparison.
3. Exact Matching
The simplest approach: for each treated customer, find control customers with exactly the same values for all confounders X.
How Exact Matching Works
Example: Simple 3-Variable Matching
Suppose we simplify to 3 binary confounders:
- High past purchases (Yes/No)
- Active browser (Yes/No)
- Email subscriber (Yes/No)
This creates 2³ = 8 possible "cells" or strata:
| Cell | Past Purchases | Active Browser | Email Sub | Treated | Control |
|---|---|---|---|---|---|
| 1 | Yes | Yes | Yes | 800 | 1200 |
| 2 | Yes | Yes | No | 600 | 800 |
| ... | ... | ... | ... | ... | ... |
| 8 | No | No | No | 150 | 5000 |
Strategy: Within each cell, compare treated vs. control customers. They have identical X values, so under CIA, any difference in outcomes is the causal effect for that cell.
Computing ATE with Exact Matching
ATE = Σ_c w_c · (Ȳ_treated,c − Ȳ_control,c), where w_c = proportion of population in cell c
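A minimal pandas sketch of this computation (the column names high_past_purchases, active_browser, email_subscriber, received_promo, and purchased are hypothetical, mirroring the 3-variable example; cells lacking both groups are skipped, so the estimate covers only cells with overlap):
# Exact matching: weighted average of within-cell differences
import pandas as pd
cells = df.groupby(['high_past_purchases', 'active_browser', 'email_subscriber'])
ate, total_n = 0.0, len(df)
for _, cell in cells:
    treated = cell[cell['received_promo'] == 1]
    control = cell[cell['received_promo'] == 0]
    if len(treated) == 0 or len(control) == 0:
        continue  # cell violates positivity; no counterfactual comparison
    cell_effect = treated['purchased'].mean() - control['purchased'].mean()
    ate += (len(cell) / total_n) * cell_effect  # weight by the cell's population share
print(f"Exact-matching ATE: {ate:.4f}")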
The Curse of Dimensionality
Exact matching works great with a few discrete confounders. But what if we have:
- Many confounders: 10 binary variables = 2¹⁰ = 1,024 cells
- Continuous confounders: How do we match on exact age (23.4 years) or exact cart value ($47.23)?
As dimensionality increases, many cells will have zero treated or zero control customers (violating positivity), making exact matching infeasible. This is called the curse of dimensionality.
💡 The Solution:
Instead of matching on all confounders X, match on a single number that summarizes all confounders: the propensity score.
4. The Propensity Score
4.1 Definition & Intuition
Propensity Score Definition:
e(X) = P(D=1 | X) = probability of receiving treatment given covariates X
In our example: The propensity score is the probability that a customer receives the 20% off promo, given their past purchases, browsing history, account age, etc.
Intuition: The propensity score is a balancing score—it's a one-dimensional summary of multidimensional confounders. Two customers with the same propensity score have the same probability of treatment, even if their individual covariate values differ.
Example: Understanding Propensity Scores
Three Customers:
- Alice: 10 past purchases, 50 page views, $100 in cart → e(Alice) = 0.85 (high propensity: she's the type of customer marketing targets)
- Bob: 1 past purchase, 5 page views, $0 in cart → e(Bob) = 0.15 (low propensity: unlikely to receive promo)
- Carol: 5 past purchases, 25 page views, $50 in cart → e(Carol) = 0.50 (medium propensity: a 50-50 chance)
If we find a treated customer with e=0.50 and a control customer with e=0.50, they're comparable—even if their specific X values differ, they had the same likelihood of treatment.
4.2 The Propensity Score Theorem
Why can we reduce high-dimensional X to a single number e(X)? The answer is Rosenbaum and Rubin's (1983) fundamental theorem:
Propensity Score Theorem:
If (Y(1), Y(0)) ⊥ D | X (unconfoundedness given X)
Then (Y(1), Y(0)) ⊥ D | e(X) (unconfoundedness given propensity score)
In words: If treatment assignment is independent of potential outcomes conditional on X, it's also independent conditional on the propensity score e(X). The propensity score is a sufficient statistic for X with respect to balancing treatment and control groups.
Practical implication: Instead of matching on 20 confounders, we can match on just one number: e(X). This solves the curse of dimensionality.
4.3 Estimating Propensity Scores
The propensity score e(X) = P(D=1|X) is typically unknown and must be estimated from data. This is a standard classification/prediction problem.
Common Estimation Methods:
1. Logistic Regression (Most Common)
Fit a logistic regression with treatment D as the outcome and confounders X as predictors. Each customer's predicted probability ê(X) is their estimated propensity score.
2. Machine Learning Methods
- Random Forests: Can capture non-linear relationships and interactions
- Gradient Boosting (XGBoost, LightGBM): Often achieves best predictive performance
- Neural Networks: For complex, high-dimensional confounders
Trade-off: More flexible models fit propensity scores better but may be less stable. If propensity scores are poorly estimated, downstream causal estimates will be biased.
What Variables to Include?
Use your DAG from Week 2! Include:
- ✓ Confounders (affect both D and Y)
- ✓ Predictors of Y (even if they don't affect D—improves precision)
- ✗ Mediators (on causal path from D to Y—don't adjust for these)
- ✗ Colliders (caused by both D and Y—adjusting creates bias)
- ✗ Descendants of D (post-treatment variables—create bias)
5. Propensity Score Matching
Once we have estimated propensity scores ê(X) for all customers, we can match treated and control customers with similar scores.
Matching Algorithms
1. Nearest-Neighbor Matching (1:1)
For each treated unit, find the control unit with the closest propensity score.
- With replacement: Control units can be matched multiple times (reduces bias, increases variance)
- Without replacement: Each control used only once (increases bias, reduces variance)
2. K-Nearest Neighbors (1:K)
Match each treated unit to K control units (e.g., K=5), then average control outcomes. Reduces variance at the cost of some bias if matches are less exact.
3. Caliper Matching
Only match if |ê_treated − ê_control| < caliper (threshold). Discard treated units without good matches. Common caliper: 0.01, or 0.25 × SD(ê).
4. Radius Matching
Match each treated unit to all control units within a caliper. Balances bias-variance trade-off.
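As a complement to the 1:1 caliper matching implemented in Section 11, here is a minimal sketch of radius matching with scikit-learn, assuming a DataFrame df that already contains hypothetical propensity_score, received_promo, and purchased columns:
# Radius matching: average all controls within the caliper for each treated unit
import numpy as np
from sklearn.neighbors import NearestNeighbors
caliper = 0.01
treated = df[df['received_promo'] == 1].reset_index(drop=True)
control = df[df['received_promo'] == 0].reset_index(drop=True)
nn = NearestNeighbors(radius=caliper)
nn.fit(control[['propensity_score']])
_, neighbor_idx = nn.radius_neighbors(treated[['propensity_score']])
effects = []
for i, idx in enumerate(neighbor_idx):
    if len(idx) == 0:
        continue  # no control within the caliper; this treated unit is dropped
    effects.append(treated.loc[i, 'purchased'] - control.iloc[idx]['purchased'].mean())
att = float(np.mean(effects))
print(f"Radius-matching ATT: {att:.4f} ({len(effects)} of {len(treated)} treated matched)")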
Computing the Treatment Effect
After matching, estimate the Average Treatment Effect on the Treated (ATT):
ATT = (1/N_T) · Σ_{i: D_i=1} (Y_i − Ȳ_M(i)), where Ȳ_M(i) = average outcome for control units matched to treated unit i
Example Walkthrough
Matched Pairs (1:1 Nearest Neighbor):
| Treated Customer | ê(X) | Purchased? | Matched Control | ê(X) | Purchased? |
|---|---|---|---|---|---|
| Alice | 0.85 | Yes (1) | David | 0.84 | Yes (1) |
| Bob | 0.50 | Yes (1) | Emma | 0.51 | No (0) |
| Carol | 0.30 | No (0) | Frank | 0.32 | No (0) |
Individual treatment effects: Alice (1-1=0), Bob (1-0=1), Carol (0-0=0)
ATT = (0 + 1 + 0) / 3 = 0.33 → The promo increased purchase rate by 33 percentage points for treated customers
6. Inverse Probability Weighting (IPW)
Instead of matching, we can weight observations by the inverse of their propensity score to create balance.
IPW Intuition
Think of propensity scores as selection probabilities. Units with low e(X) are underrepresented in the treated group (unlikely to receive treatment). We up-weight these rare treated units to reflect the full population.
IPW Estimator for ATE:
ATE = E[D·Y / e(X)] − E[(1−D)·Y / (1−e(X))]
Weights:
• Treated units: w_i = 1/e(X_i)
• Control units: w_i = 1/(1−e(X_i))
Example: How IPW Works
Three Customers:
| Customer | e(X) | Treated? | Purchased? | Weight |
|---|---|---|---|---|
| Alice | 0.80 | Yes (1) | 1 | 1/0.80 = 1.25 |
| Bob | 0.20 | Yes (1) | 1 | 1/0.20 = 5.0 |
| Carol | 0.20 | No (0) | 0 | 1/(1-0.20) = 1.25 |
Key insight: Bob has low propensity (0.20) but still received treatment—he's a rare case, so we weight him heavily (5×). This up-weights underrepresented treatment combinations to mimic randomization.
Stabilized Weights (Common Improvement)
Extreme propensity scores (near 0 or 1) create extreme weights, increasing variance. Stabilized weights fix this:
SW_i = P(D_i=1) / e(X_i) for treated units; SW_i = P(D_i=0) / (1−e(X_i)) for control units
Stabilized weights reduce variance while maintaining consistency (they converge to the same ATE).
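A short sketch of stabilized IPW, again assuming a hypothetical df with propensity_score, received_promo, and purchased columns; the weighted means are normalized within each group (a Hájek-style estimator):
# Stabilized IPW: marginal treatment probability in the numerator
import numpy as np
p_treated = df['received_promo'].mean()  # P(D=1)
df['sw'] = np.where(
    df['received_promo'] == 1,
    p_treated / df['propensity_score'],
    (1 - p_treated) / (1 - df['propensity_score'])
)
t = df[df['received_promo'] == 1]
c = df[df['received_promo'] == 0]
ate_sw = (np.average(t['purchased'], weights=t['sw'])
          - np.average(c['purchased'], weights=c['sw']))
print(f"ATE (stabilized IPW): {ate_sw:.4f}")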
Advantages & Disadvantages
✓ Advantages:
- Uses all data (no discarding unmatched units)
- Straightforward to estimate ATE (not just ATT)
- Works well with large samples
✗ Disadvantages:
- Sensitive to extreme weights (propensity scores near 0/1)
- Requires accurate propensity score estimation
- Poor overlap can lead to high variance
7. Stratification & Subclassification
Instead of matching or weighting, we can stratify (group) customers into bins based on propensity scores, then compare treated vs. control within each bin.
How Stratification Works
Step 1: Create Propensity Score Bins
Divide propensity scores into K strata (typically 5-10). Common approach: quintiles (5 bins of equal size).
Step 2: Estimate Treatment Effect Within Each Stratum
Within stratum k: ATE_k = Ȳ_treated,k − Ȳ_control,k
Step 3: Aggregate Across Strata
ATE = Σ_k w_k · ATE_k, where w_k = proportion of population in stratum k (or N_k/N)
Example: 5-Quintile Stratification
| Stratum | e(X) Range | N_T | N_C | Ȳ_T | Ȳ_C | ATE_k |
|---|---|---|---|---|---|---|
| 1 (Low) | 0.0-0.2 | 200 | 3800 | 0.15 | 0.10 | +0.05 |
| 2 | 0.2-0.4 | 800 | 3200 | 0.25 | 0.18 | +0.07 |
| 3 | 0.4-0.6 | 1500 | 2500 | 0.35 | 0.28 | +0.07 |
| 4 | 0.6-0.8 | 1800 | 2200 | 0.48 | 0.40 | +0.08 |
| 5 (High) | 0.8-1.0 | 700 | 3300 | 0.55 | 0.45 | +0.10 |
ATE ≈ 0.20 × 0.05 + 0.20 × 0.07 + 0.20 × 0.07 + 0.20 × 0.08 + 0.20 × 0.10 = 0.074
The promo increased purchase probability by 7.4 percentage points.
Classic result (Cochran, 1968): Stratifying on propensity scores into just 5 groups removes ~90% of bias due to confounding. More strata improve balance but reduce sample size per stratum.
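A minimal sketch of quintile stratification in pandas, assuming the same hypothetical propensity_score, received_promo, and purchased columns (strata without both treated and control units are skipped):
# Stratify on propensity-score quintiles and aggregate within-stratum effects
import pandas as pd
df['stratum'] = pd.qcut(df['propensity_score'], q=5, labels=False)
ate, total_n = 0.0, len(df)
for _, stratum in df.groupby('stratum'):
    treated = stratum[stratum['received_promo'] == 1]['purchased']
    control = stratum[stratum['received_promo'] == 0]['purchased']
    if len(treated) == 0 or len(control) == 0:
        continue  # stratum lacks overlap
    ate += (len(stratum) / total_n) * (treated.mean() - control.mean())
print(f"Stratified ATE: {ate:.4f}")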
8. Assessing Covariate Balance
After matching/weighting/stratifying, we must check if we've achieved balance—whether treated and control groups have similar covariate distributions.
Why Balance Matters
Good balance is evidence that our method successfully removed confounding. If covariates are still imbalanced after adjustment, our causal estimates may be biased.
Standardized Mean Difference (SMD)
The most common balance metric:
SMD = (X̄_treated − X̄_control) / √((s²_treated + s²_control) / 2)
Rule of thumb: |SMD| < 0.1 indicates good balance; |SMD| < 0.25 is acceptable.
Example: Balance Table
Before and After Matching:
| Covariate | SMD (Before) | SMD (After) | Balanced? |
|---|---|---|---|
| Past purchases | 0.65 | 0.08 | ✓ Yes |
| Account age | 0.42 | 0.05 | ✓ Yes |
| Cart value | 0.58 | 0.12 | ~ Marginal |
| Page views | 0.51 | 0.07 | ✓ Yes |
| Email engagement | 0.39 | 0.09 | ✓ Yes |
Before matching, treated customers had substantially more past purchases (SMD=0.65). After matching, groups are well-balanced.
Love Plots (Visual Balance Assessment)
A Love plot shows SMD for each covariate before and after adjustment, making it easy to spot remaining imbalances. Points should cluster near zero after adjustment.
9. Common Support & Positivity
Common support (or overlap) means treated and control groups have overlapping propensity score distributions. Without overlap, causal inference is impossible—we're comparing apples to oranges.
Visualizing Overlap
Good Overlap Example:
Treated: propensity scores range from 0.15 to 0.90
Control: propensity scores range from 0.10 to 0.85
→ Substantial overlap in [0.15, 0.85]
Poor Overlap Example:
Treated: propensity scores range from 0.70 to 0.95
Control: propensity scores range from 0.05 to 0.40
→ Almost no overlap! These are fundamentally different groups.
What to Do When Overlap is Poor
- Trim the sample: Discard treated units with e(X) > 0.9 and control units with e(X) < 0.1. This changes the estimand (you're no longer estimating ATE for the full population, but for the overlap region).
- Caliper matching: Only match within a narrow propensity score range, discarding poor matches.
- Overlap weights: Weight treated units by 1−e(X) and control units by e(X) instead of 1/e and 1/(1−e). This emphasizes units with moderate propensity scores (near 0.5) and targets the effect in the overlap population (see the sketch at the end of this section).
- Reconsider identification: If overlap is fundamentally poor, you may need a different identification strategy (e.g., instrumental variables, RDD, DiD).
⚠️ Warning:
Poor overlap violates the positivity assumption. Trying to force causal inference without overlap leads to unstable estimates driven by extreme weights or extrapolation. Always check overlap before reporting results.
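A minimal sketch of the overlap-weight estimator mentioned above, again assuming hypothetical propensity_score, received_promo, and purchased columns; note it estimates the average effect in the overlap population, not the full-population ATE:
# Overlap weights: 1 - e(X) for treated units, e(X) for controls
import numpy as np
e = df['propensity_score'].values
d = df['received_promo'].values
y = df['purchased'].values
w = np.where(d == 1, 1 - e, e)
ato = (np.average(y[d == 1], weights=w[d == 1])
       - np.average(y[d == 0], weights=w[d == 0]))
print(f"Overlap-weighted effect (ATO): {ato:.4f}")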
10. Sensitivity Analysis
The biggest threat to propensity score methods is unmeasured confounding. Sensitivity analysis asks: "How strong would an unmeasured confounder need to be to change our conclusions?"
Rosenbaum's Sensitivity Analysis
The most common sensitivity analysis for matched studies. Assume an unmeasured confounder U affects treatment assignment. Rosenbaum bounds quantify how much hidden bias would overturn results.
Sensitivity Parameter Γ (Gamma):
Γ = max odds ratio of treatment due to unmeasured confounding
- Γ = 1: No unmeasured confounding (matched pairs have equal treatment odds)
- Γ = 2: Within a matched pair, one unit could have up to 2× the odds of receiving treatment due to unmeasured U
- Γ = 3: Up to 3× more likely, etc.
Interpreting Sensitivity Results
Example Results:
Estimated ATE = +0.10 (10 percentage point increase in purchases), p < 0.01
Sensitivity analysis: Result remains significant (p < 0.05) for Γ up to 1.8, but becomes non-significant at Γ = 2.0.
Interpretation: An unmeasured confounder would need to increase treatment odds by 2× (doubling the odds) to explain away our result. If such a strong confounder is plausible, our conclusions are fragile. If not, we have reasonable confidence.
Other Sensitivity Approaches
- VanderWeele-Ding E-value: "What's the minimum strength of association (with both treatment and outcome) that an unmeasured confounder would need to fully explain away the effect?" (A small calculation sketch follows this list.)
- Simulation-based sensitivity: Simulate various unmeasured confounders and see how estimates change.
- Placebo outcomes: Test your method on outcomes that shouldn't be affected by treatment. If you find spurious effects, your method may be biased.
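The E-value has a closed form for a risk ratio: E = RR + √(RR × (RR − 1)), after flipping protective effects so RR ≥ 1. A small sketch, using illustrative matched order rates:
# VanderWeele-Ding E-value for a risk-ratio estimate
import math
def e_value(rr: float) -> float:
    rr = max(rr, 1 / rr)  # flip protective effects so RR >= 1
    return rr + math.sqrt(rr * (rr - 1))
rr = 0.225 / 0.183  # illustrative: 22.5% treated vs 18.3% control order rates
print(f"Risk ratio: {rr:.2f}, E-value: {e_value(rr):.2f}")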
11. Practical Implementation in Python
Here's a complete workflow using Python libraries.
# Propensity Score Matching in Python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# 1. Load data
df = pd.read_csv('promo_data.csv')
# Columns: customer_id, received_promo, purchased, past_purchases,
# account_age, cart_value, page_views, email_engagement
# 2. Prepare confounders (X) and treatment (D)
confounders = ['past_purchases', 'account_age', 'cart_value',
'page_views', 'email_engagement']
X = df[confounders]
D = df['received_promo']
Y = df['purchased']
# 3. Estimate propensity scores
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(X_scaled, D)
propensity_scores = ps_model.predict_proba(X_scaled)[:, 1]
df['propensity_score'] = propensity_scores
# 4. Check overlap
plt.hist(df[df['received_promo']==1]['propensity_score'],
alpha=0.5, label='Treated', bins=30)
plt.hist(df[df['received_promo']==0]['propensity_score'],
alpha=0.5, label='Control', bins=30)
plt.xlabel('Propensity Score')
plt.legend()
plt.title('Propensity Score Overlap')
plt.show()
# 5. Propensity Score Matching (1:1 nearest neighbor with caliper)
from sklearn.neighbors import NearestNeighbors
treated = df[df['received_promo'] == 1].reset_index(drop=True)
control = df[df['received_promo'] == 0].reset_index(drop=True)
# Fit NN on control propensity scores
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])
# For each treated, find nearest control
distances, indices = nn.kneighbors(treated[['propensity_score']])
# Apply caliper (e.g., 0.01)
caliper = 0.01
valid_matches = distances.flatten() < caliper
matched_treated = treated[valid_matches]
matched_control = control.iloc[indices.flatten()[valid_matches]]
print(f"Matched {len(matched_treated)} out of {len(treated)} treated units")
# 6. Estimate ATT
att = (matched_treated['purchased'].mean() -
matched_control['purchased'].mean())
print(f"ATT: {att:.4f}")
# 7. Check balance (SMD)
def smd(treated, control, var):
mean_diff = treated[var].mean() - control[var].mean()
pooled_std = np.sqrt((treated[var].var() + control[var].var()) / 2)
return mean_diff / pooled_std
print("\nStandardized Mean Differences (After Matching):")
for var in confounders:
smd_val = smd(matched_treated, matched_control, var)
print(f"{var}: {smd_val:.3f}")
# 8. IPW (Alternative approach)
# Calculate weights
df['ipw_weight'] = np.where(
df['received_promo'] == 1,
1 / df['propensity_score'],
1 / (1 - df['propensity_score'])
)
# Trim extreme weights (optional)
df['ipw_weight'] = df['ipw_weight'].clip(upper=df['ipw_weight'].quantile(0.99))
# Estimate ATE with IPW
ate_ipw = (
(df[df['received_promo']==1]['purchased'] *
df[df['received_promo']==1]['ipw_weight']).sum() /
df[df['received_promo']==1]['ipw_weight'].sum()
-
(df[df['received_promo']==0]['purchased'] *
df[df['received_promo']==0]['ipw_weight']).sum() /
df[df['received_promo']==0]['ipw_weight'].sum()
)
print(f"\nATE (IPW): {ate_ipw:.4f}")
Useful Python Libraries
- causalml: Uplift modeling and meta-learners (we'll cover in Week 7)
- DoWhy: Microsoft's causal inference library with DAG support
- econml: Double ML, causal forests (Week 5-6)
- psmpy: Dedicated propensity score matching library
- statsmodels: IPW and regression-based methods
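As an illustration, here is a hedged sketch of running the same backdoor analysis with DoWhy; the interface shown (CausalModel, identify_effect, estimate_effect) follows the library's documented API, but verify exact method names and defaults against the current docs:
# Propensity score matching via DoWhy (sketch; check current DoWhy docs)
from dowhy import CausalModel
model = CausalModel(
    data=df,
    treatment='received_promo',
    outcome='purchased',
    common_causes=confounders,  # confounder list defined earlier
)
estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(
    estimand,
    method_name="backdoor.propensity_score_matching",
)
print(f"Estimated effect (DoWhy PSM): {estimate.value:.4f}")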
11.5 End-to-End Example: DoorDash Discount Campaign
Let's walk through a complete propensity score matching analysis for our promotional discount example.
🎯 Business Context
Scenario: DoorDash marketing team sent a "20% off your next order" promo code to 5,000 customers in March 2024. They targeted customers based on engagement signals (past orders, app opens, cart abandonment). 15,000 customers didn't receive the promo.
Business Question: Did the promo cause customers to order more? Or did we just give discounts to people who would have ordered anyway? We need the causal effect to decide if future promo campaigns are worth the margin hit.
Step 1: Data Structure & Variable Selection
Dataset: 20,000 customers observed over 30 days post-campaign
| Variable | Description | Type |
|---|---|---|
| customer_id | Unique customer identifier | String |
| received_promo | Treatment indicator (1 = received promo) | Binary |
| ordered | Outcome: ordered in next 30 days (1=Yes, 0=No) | Binary |
| orders_90d | Orders in past 90 days (pre-treatment) | Count |
| avg_order_value | Average historical order value | Continuous ($) |
| days_since_last_order | Recency (days since last order) | Continuous |
| app_opens_7d | App opens in past 7 days | Count |
| cart_abandons_30d | Cart abandonments in past 30 days | Count |
| account_age_days | Days since account creation | Continuous |
| email_opens_30d | Marketing email opens in past 30 days | Count |
| is_dashpass | DashPass subscriber (1=Yes, 0=No) | Binary |
The engagement and purchase-history variables (orders_90d through is_dashpass) are confounders—they affect both promo targeting (marketing used them) and order outcomes (more engaged customers order more).
Step 2: Complete Python Workflow
# ===================================================================
# END-TO-END PROPENSITY SCORE MATCHING: DOORDASH PROMO CAMPAIGN
# ===================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from scipy import stats
# Set random seed for reproducibility
np.random.seed(42)
# ===================================================================
# STEP 1: LOAD AND EXPLORE DATA
# ===================================================================
# Load data
df = pd.read_csv('doordash_promo_campaign.csv')
print(f"Total customers: {len(df)}")
print(f"Treated (received promo): {df['received_promo'].sum()}")
print(f"Control (no promo): {(~df['received_promo'].astype(bool)).sum()}")
print(f"\nOutcome prevalence:")
print(df.groupby('received_promo')['ordered'].agg(['mean', 'count']))
# Naive comparison (BIASED - don't trust this!)
naive_diff = (df[df['received_promo']==1]['ordered'].mean() -
df[df['received_promo']==0]['ordered'].mean())
print(f"\nNaive difference (BIASED): {naive_diff:.4f} ({naive_diff*100:.2f} pp)")
# ===================================================================
# STEP 2: CHECK PRE-TREATMENT IMBALANCE
# ===================================================================
confounders = ['orders_90d', 'avg_order_value', 'days_since_last_order',
'app_opens_7d', 'cart_abandons_30d', 'account_age_days',
'email_opens_30d', 'is_dashpass']
def standardized_mean_diff(treated, control, var):
"""Calculate standardized mean difference"""
mean_diff = treated[var].mean() - control[var].mean()
pooled_std = np.sqrt((treated[var].var() + control[var].var()) / 2)
return mean_diff / pooled_std if pooled_std > 0 else 0
treated = df[df['received_promo'] == 1]
control = df[df['received_promo'] == 0]
print("\n" + "="*60)
print("PRE-MATCHING COVARIATE BALANCE")
print("="*60)
print(f"{'Variable':<25} {'SMD':>10} {'Balanced?':>12}")
print("-"*60)
smd_before = {}
for var in confounders:
smd = standardized_mean_diff(treated, control, var)
smd_before[var] = smd
balanced = "✓ Yes" if abs(smd) < 0.1 else ("~ Marginal" if abs(smd) < 0.25 else "✗ No")
print(f"{var:<25} {smd:>10.3f} {balanced:>12}")
# ===================================================================
# STEP 3: ESTIMATE PROPENSITY SCORES
# ===================================================================
print("\n" + "="*60)
print("ESTIMATING PROPENSITY SCORES")
print("="*60)
# Prepare features
X = df[confounders].copy()
# Standardize continuous variables for better convergence
scaler = StandardScaler()
continuous_vars = ['avg_order_value', 'days_since_last_order',
'account_age_days']
X[continuous_vars] = scaler.fit_transform(X[continuous_vars])
D = df['received_promo'].values
Y = df['ordered'].values
# Method 1: Logistic Regression (interpretable, common)
print("\nFitting logistic regression...")
ps_model_lr = LogisticRegression(max_iter=1000, random_state=42)
ps_model_lr.fit(X, D)
propensity_scores_lr = ps_model_lr.predict_proba(X)[:, 1]
# Method 2: Gradient Boosting (better fit, more flexible)
print("Fitting gradient boosting...")
ps_model_gb = GradientBoostingClassifier(n_estimators=100,
max_depth=3,
random_state=42)
ps_model_gb.fit(X, D)
propensity_scores_gb = ps_model_gb.predict_proba(X)[:, 1]
# We'll use gradient boosting for better prediction
propensity_scores = propensity_scores_gb
df['propensity_score'] = propensity_scores
print(f"\nPropensity score summary:")
print(df.groupby('received_promo')['propensity_score'].describe())
# ===================================================================
# STEP 4: CHECK OVERLAP (COMMON SUPPORT)
# ===================================================================
print("\n" + "="*60)
print("ASSESSING OVERLAP")
print("="*60)
ps_treated = df[df['received_promo']==1]['propensity_score']
ps_control = df[df['received_promo']==0]['propensity_score']
print(f"\nTreated range: [{ps_treated.min():.3f}, {ps_treated.max():.3f}]")
print(f"Control range: [{ps_control.min():.3f}, {ps_control.max():.3f}]")
overlap_min = max(ps_treated.min(), ps_control.min())
overlap_max = min(ps_treated.max(), ps_control.max())
print(f"Overlap region: [{overlap_min:.3f}, {overlap_max:.3f}]")
# Count customers outside overlap
treated_outside = (ps_treated < overlap_min) | (ps_treated > overlap_max)
control_outside = (ps_control < overlap_min) | (ps_control > overlap_max)
print(f"\nTreated outside overlap: {treated_outside.sum()} ({treated_outside.mean()*100:.1f}%)")
print(f"Control outside overlap: {control_outside.sum()} ({control_outside.mean()*100:.1f}%)")
# Visualize overlap
plt.figure(figsize=(10, 5))
plt.hist(ps_control, bins=50, alpha=0.5, label='Control', density=True)
plt.hist(ps_treated, bins=50, alpha=0.5, label='Treated', density=True)
plt.axvline(overlap_min, color='red', linestyle='--', alpha=0.5, label='Overlap boundary')
plt.axvline(overlap_max, color='red', linestyle='--', alpha=0.5)
plt.xlabel('Propensity Score')
plt.ylabel('Density')
plt.title('Propensity Score Distribution: Treated vs Control')
plt.legend()
plt.tight_layout()
plt.savefig('propensity_overlap.png', dpi=300, bbox_inches='tight')
print("\n[Overlap plot saved to propensity_overlap.png]")
# ===================================================================
# STEP 5: PROPENSITY SCORE MATCHING (1:1 NEAREST NEIGHBOR)
# ===================================================================
print("\n" + "="*60)
print("PERFORMING PROPENSITY SCORE MATCHING")
print("="*60)
# Trim to overlap region (optional but recommended)
df_trimmed = df[(df['propensity_score'] >= overlap_min) &
(df['propensity_score'] <= overlap_max)].copy()
treated_trim = df_trimmed[df_trimmed['received_promo']==1].reset_index(drop=True)
control_trim = df_trimmed[df_trimmed['received_promo']==0].reset_index(drop=True)
print(f"\nAfter trimming to overlap:")
print(f" Treated: {len(treated_trim)}")
print(f" Control: {len(control_trim)}")
# 1:1 Nearest Neighbor Matching with Caliper
caliper = 0.01 # Maximum propensity score difference
print(f"\nUsing caliper: {caliper}")
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control_trim[['propensity_score']])
distances, indices = nn.kneighbors(treated_trim[['propensity_score']])
# Apply caliper
valid_matches = distances.flatten() < caliper
matched_treated = treated_trim[valid_matches].copy()
matched_control = control_trim.iloc[indices.flatten()[valid_matches]].copy()
print(f"\nMatching results:")
print(f" Matched pairs: {len(matched_treated)}")
print(f" Unmatched treated (poor matches): {(~valid_matches).sum()}")
print(f" Matching rate: {valid_matches.mean()*100:.1f}%")
# ===================================================================
# STEP 6: CHECK POST-MATCHING BALANCE
# ===================================================================
print("\n" + "="*60)
print("POST-MATCHING COVARIATE BALANCE")
print("="*60)
print(f"{'Variable':<25} {'SMD Before':>12} {'SMD After':>12} {'Improved?':>12}")
print("-"*60)
smd_after = {}
for var in confounders:
smd_before_val = smd_before[var]
smd_after_val = standardized_mean_diff(matched_treated, matched_control, var)
smd_after[var] = smd_after_val
improved = "✓" if abs(smd_after_val) < abs(smd_before_val) else "✗"
balanced = "✓" if abs(smd_after_val) < 0.1 else ("~" if abs(smd_after_val) < 0.25 else "✗")
print(f"{var:<25} {smd_before_val:>12.3f} {smd_after_val:>12.3f} {improved:>6} {balanced:>5}")
# Create Love plot
fig, ax = plt.subplots(figsize=(8, 6))
vars_sorted = sorted(confounders, key=lambda v: abs(smd_before[v]), reverse=True)
y_pos = np.arange(len(vars_sorted))
ax.scatter([smd_before[v] for v in vars_sorted], y_pos,
label='Before Matching', marker='o', s=100)
ax.scatter([smd_after[v] for v in vars_sorted], y_pos,
label='After Matching', marker='s', s=100)
ax.axvline(0, color='black', linestyle='-', linewidth=0.5)
ax.axvline(-0.1, color='gray', linestyle='--', linewidth=0.5, alpha=0.5)
ax.axvline(0.1, color='gray', linestyle='--', linewidth=0.5, alpha=0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(vars_sorted)
ax.set_xlabel('Standardized Mean Difference')
ax.set_title('Covariate Balance: Before vs After Matching (Love Plot)')
ax.legend()
plt.tight_layout()
plt.savefig('balance_love_plot.png', dpi=300, bbox_inches='tight')
print("\n[Love plot saved to balance_love_plot.png]")
# ===================================================================
# STEP 7: ESTIMATE TREATMENT EFFECT (ATT)
# ===================================================================
print("\n" + "="*60)
print("TREATMENT EFFECT ESTIMATION")
print("="*60)
# ATT from matched sample
y_treated = matched_treated['ordered'].values
y_control = matched_control['ordered'].values
att = y_treated.mean() - y_control.mean()
# Standard error and confidence interval (paired t-test)
differences = y_treated - y_control
se_att = differences.std() / np.sqrt(len(differences))
ci_lower = att - 1.96 * se_att
ci_upper = att + 1.96 * se_att
# T-test
t_stat, p_value = stats.ttest_rel(y_treated, y_control)
print(f"\nAverage Treatment Effect on the Treated (ATT):")
print(f" Point estimate: {att:.4f} ({att*100:.2f} percentage points)")
print(f" 95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f" Standard error: {se_att:.4f}")
print(f" t-statistic: {t_stat:.3f}")
print(f" p-value: {p_value:.4f}")
if p_value < 0.001:
sig = "***"
elif p_value < 0.01:
sig = "**"
elif p_value < 0.05:
sig = "*"
else:
sig = "ns"
print(f" Significance: {sig}")
# Comparison
print(f"\nComparison:")
print(f" Naive estimate (biased): {naive_diff:.4f} ({naive_diff*100:.2f} pp)")
print(f" PSM estimate (ATT): {att:.4f} ({att*100:.2f} pp)")
print(f" Difference: {abs(naive_diff - att):.4f} ({abs(naive_diff - att)*100:.2f} pp)")
# ===================================================================
# STEP 8: BUSINESS INTERPRETATION
# ===================================================================
print("\n" + "="*60)
print("BUSINESS IMPLICATIONS")
print("="*60)
# Calculate lift
baseline_rate = y_control.mean()
treated_rate = y_treated.mean()
lift = (treated_rate / baseline_rate - 1) * 100 if baseline_rate > 0 else 0
print(f"\nOrder Rates:")
print(f" Without promo (matched control): {baseline_rate:.1%}")
print(f" With promo (matched treated): {treated_rate:.1%}")
print(f" Relative lift: {lift:.1f}%")
# ROI calculation (example numbers)
avg_order_value = 35 # $35 average order
discount_cost = 0.20 * avg_order_value # 20% off
revenue_per_order = avg_order_value * 0.30 # 30% margin
incremental_orders = att # Per customer
net_value = incremental_orders * revenue_per_order - discount_cost
print(f"\nSimplified ROI (per customer who received promo):")
print(f" Incremental order probability: {att:.1%}")
print(f" Avg order value: ${avg_order_value}")
print(f" Discount cost per promo sent: ${discount_cost:.2f}")
print(f" Margin per incremental order: ${revenue_per_order:.2f}")
print(f" Net value per promo: ${net_value:.2f}")
if net_value > 0:
print(f" ✓ Campaign is PROFITABLE (positive ROI)")
else:
print(f" ✗ Campaign is UNPROFITABLE (negative ROI)")
print(f"\nScaled to campaign:")
campaign_size = 5000
total_incremental_orders = att * campaign_size
total_net_value = net_value * campaign_size
print(f" Total incremental orders: {total_incremental_orders:.0f}")
print(f" Total net value: ${total_net_value:,.2f}")
print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
Step 3: Interpreting Results
Example Output Interpretation:
Key Findings:
- Selection bias was substantial: The naive estimate (8.5 pp) overstated the effect by 2×. Marketing targeted engaged customers who were already more likely to order.
- True causal effect (ATT): +4.2 pp — The promo increased order probability by 4.2 percentage points for customers who received it. This is statistically significant (p < 0.01).
- 23% relative lift: Among matched customers, the promo increased order rates from 18.3% to 22.5%.
- Positive ROI: Even after accounting for the discount cost ($7), the incremental margin ($1.05 per customer) makes the campaign profitable. Total campaign value: $5,250.
Step 4: Critical Assumptions to Verify
Before presenting to stakeholders, check:
- Did we measure all confounders? Interview the marketing team. Did they use any targeting signals we don't have in our data? (e.g., predicted churn score, customer support interactions, referral status)
- Is overlap adequate? If >20% of treated customers are outside common support, results may not generalize to the full treated population.
- Are covariates balanced? If any |SMD| > 0.25 after matching, consider: adding interactions to propensity model, exact matching on that variable, or regression adjustment.
- Run sensitivity analysis: Use Rosenbaum bounds to quantify robustness to unmeasured confounding. Report the Γ value where results become non-significant.
- Placebo test: Estimate the "effect" on a pre-treatment outcome (e.g., orders 90-120 days before the campaign). It should be near zero—if not, the matching didn't remove confounding. (A minimal sketch follows this list.)
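A minimal placebo-test sketch, reusing the matched samples from the workflow above and a hypothetical pre-treatment outcome column ordered_pre_period:
# Placebo test: the "effect" on a pre-treatment outcome should be ~0
from scipy import stats
placebo = 'ordered_pre_period'  # hypothetical: ordered 90-120 days before the campaign
placebo_diff = (matched_treated[placebo].mean()
                - matched_control[placebo].mean())
t_stat, p_val = stats.ttest_rel(matched_treated[placebo].values,
                                matched_control[placebo].values)
print(f"Placebo 'effect': {placebo_diff:.4f} (p = {p_val:.3f})")
# Near-zero and non-significant supports the matching; a clear effect
# signals residual confounding on pre-treatment behavior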
Step 5: Communicating to Business Stakeholders
Translation for non-technical audience:
❌ Don't say:
"After estimating propensity scores via gradient boosting and performing 1:1 nearest neighbor matching with a 0.01 caliper, we achieved covariate balance (|SMD| < 0.1) and estimated the ATT using a paired t-test..."
✅ Do say:
"The 20% off promo increased order rates by 4.2 percentage points."
We compared customers who received the promo to very similar customers who didn't—matching them on past ordering behavior, app engagement, and other factors. This tells us the causal effect of the promo itself, not just correlation.
Bottom line: For every 100 promos we send to customers like those in our campaign, we generate about 4 extra orders. After accounting for the discount cost, we make $1.05 per promo sent. The campaign is profitable and should be scaled.
Caveat: This estimate assumes we've captured all major factors marketing used to target customers. If there were hidden targeting rules we don't have data on, the true effect could differ.
💡 Pro Tip:
Always validate your propensity score model. Check: (1) AUC > 0.6 (some predictive power), (2) balance improvement (Love plot), (3) overlap (histograms), and (4) sensitivity to model choice (compare logistic vs. ML methods). If results are highly sensitive to modeling choices, your estimates are fragile.
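For the AUC check, a short sketch reusing the variables from the workflow above (D, propensity_scores_lr, propensity_scores_gb):
# Compare discrimination of the two propensity models
from sklearn.metrics import roc_auc_score
print(f"AUC (logistic regression): {roc_auc_score(D, propensity_scores_lr):.3f}")
print(f"AUC (gradient boosting):   {roc_auc_score(D, propensity_scores_gb):.3f}")
# AUC near 0.5: covariates barely predict treatment; AUC above ~0.9: near-deterministic
# targeting, which usually means poor overlap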
12. Key Takeaways
✓ Propensity scores reduce high-dimensional confounders to a single balancing score e(X) = P(D=1|X)
✓ Matching: Compare treated and control units with similar propensity scores (estimates ATT)
✓ IPW: Weight observations by 1/e(X) and 1/(1-e(X)) to create pseudo-population (estimates ATE)
✓ Stratification: Divide into propensity score bins and compare within each bin
✓ Balance checking: Use SMD to verify covariates are balanced after adjustment
✓ Overlap: Treated and control must have overlapping propensity score distributions (positivity)
✓ Sensitivity analysis: Quantify robustness to unmeasured confounding (Rosenbaum bounds, E-values)
✓ Conditional independence: Methods assume all confounders are observed—untestable but critical
13. Next Week Preview
Propensity scores help when we observe all confounders. But what if we have:
- Unmeasured confounders that we can't observe?
- Panel data (same customers over time)?
- Natural experiments or quasi-random variation?
Next week, we'll learn Regression & Instrumental Variables:
- Regression adjustment and its limitations
- Difference-in-differences (DiD) for panel data
- Instrumental variables (IV) and Two-Stage Least Squares (2SLS)
- Regression discontinuity design (RDD)
- Fixed effects for time-invariant confounders
These methods relax the "observe all confounders" assumption and are crucial for real-world causal inference.