
Module 2, Week 3: Matching & Propensity Scores

Article 3 of 13 · 20 min read

📊 Running Example: Promotional Discount Campaign

Remember our e-commerce question: Does offering a 20% discount promo code increase customer purchases?

In Week 1, we learned about potential outcomes and how randomization solves the fundamental problem of causal inference. In Week 2, we used DAGs to identify confounders. Now in Week 3, we'll learn how to adjust for those confounders using matching and propensity scores when we can't run an RCT.

The marketing team already sent promos to selected customers (not random!). We need to use observational data to estimate the causal effect.

1. The Observational Data Challenge

In Week 1, we saw that randomization solves selection bias by making treatment assignment independent of potential outcomes. But what if we can't randomize? What if our marketing team already sent promos to customers they chose—likely targeting loyal customers, recent browsers, or high-value accounts?

Our Observational Data Problem:

• 5,000 customers received the 20% off promo (Treatment group)

• 15,000 customers didn't receive it (Control group)

• Treatment assignment was not random—marketing targeted based on customer characteristics

Selection bias: Treated customers differ systematically from control customers. Simply comparing outcomes would conflate the treatment effect with pre-existing differences.

The solution? Matching and propensity scores—methods that attempt to create comparable groups by adjusting for observed confounders.

2. Identification Strategy: Conditional Independence

Before diving into methods, we need a key assumption that makes causal inference possible with observational data:

Conditional Independence Assumption (CIA)

Also called: Unconfoundedness, Ignorability, Selection on Observables

Formal statement: (Y(1), Y(0)) ⊥ D | X

In words: Conditional on observed covariates X, treatment assignment is independent of potential outcomes. This means we've measured all confounders—all variables that affect both treatment and outcomes.

What This Means for Our Example

From our DAG analysis (Week 2), suppose we identified these confounders:

  • Past purchases: Number of orders in last 90 days
  • Account age: Days since customer registered
  • Cart value: Total value of items currently in cart
  • Browsing activity: Page views in last 7 days
  • Email engagement: Open rate for marketing emails

CIA assumes that within groups of customers with the same values for these variables, promo assignment is "as-if random." If two customers have identical values for all X, the one who received the promo was essentially randomly selected.

⚠️ When CIA Fails:

CIA is untestable—we can't verify if unmeasured confounders exist. If marketing also targeted based on unobserved factors (e.g., customer sentiment from support interactions, predicted churn risk from a proprietary model), CIA is violated and our estimates will be biased. This is why domain knowledge and careful confounder selection are critical.

Positivity Assumption

Along with CIA, we need positivity (also called common support or overlap):

Positivity:

0 < P(D=1 | X=x) < 1 for all x in the support of X

In words: Every customer (or every type of customer characterized by X) must have a non-zero probability of receiving treatment AND control. We can't estimate treatment effects for customer types who always or never receive promos—there's no counterfactual comparison.

3. Exact Matching

The simplest approach: for each treated customer, find control customers with exactly the same values for all confounders X.

How Exact Matching Works

Example: Simple 3-Variable Matching

Suppose we simplify to 3 binary confounders:

  • High past purchases (Yes/No)
  • Active browser (Yes/No)
  • Email subscriber (Yes/No)

This creates 2³ = 8 possible "cells" or strata:

Cell | Past Purchases | Active Browser | Email Sub | Treated | Control
-----|----------------|----------------|-----------|---------|--------
1 | Yes | Yes | Yes | 800 | 1,200
2 | Yes | Yes | No | 600 | 800
... | ... | ... | ... | ... | ...
8 | No | No | No | 150 | 5,000

Strategy: Within each cell, compare treated vs. control customers. They have identical X values, so under CIA, any difference in outcomes is the causal effect for that cell.

Computing ATE with Exact Matching

ATE = Σcells wc × (Ȳtreated,c - Ȳcontrol,c)

where wc = proportion of population in cell c
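
A minimal sketch of this cell-weighted estimator, assuming a pandas DataFrame df with the binary confounder columns plus a 0/1 treatment column D and an outcome column Y (these names are placeholders for illustration):

import pandas as pd

def exact_matching_ate(df, confounders, treatment='D', outcome='Y'):
    """Weight each cell's treated-control difference by its population share."""
    ate, n_total = 0.0, len(df)
    for _, cell in df.groupby(confounders):
        treated = cell.loc[cell[treatment] == 1, outcome]
        control = cell.loc[cell[treatment] == 0, outcome]
        if len(treated) == 0 or len(control) == 0:
            continue  # cell has no counterfactual comparison (positivity issue)
        ate += (len(cell) / n_total) * (treated.mean() - control.mean())
    return ate

Cells containing only treated or only control customers are skipped, which is exactly the positivity problem discussed next.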

The Curse of Dimensionality

Exact matching works great with a few discrete confounders. But what if we have:

  • Many confounders: 10 binary variables = 2¹⁰ = 1,024 cells
  • Continuous confounders: How do we match on exact age (23.4 years) or exact cart value ($47.23)?

As dimensionality increases, many cells will have zero treated or zero control customers (violating positivity), making exact matching infeasible. This is called the curse of dimensionality.

💡 The Solution:

Instead of matching on all confounders X, match on a single number that summarizes all confounders: the propensity score.

4. The Propensity Score

4.1 Definition & Intuition

Propensity Score Definition:

e(X) = P(D=1 | X) = probability of receiving treatment given covariates X

In our example: The propensity score is the probability that a customer receives the 20% off promo, given their past purchases, browsing history, account age, etc.

Intuition: The propensity score is a balancing score—it's a one-dimensional summary of multidimensional confounders. Two customers with the same propensity score have the same probability of treatment, even if their individual covariate values differ.

Example: Understanding Propensity Scores

Three Customers:

  • Alice: 10 past purchases, 50 page views, $100 in cart → e(Alice) = 0.85
    High propensity: she's the type of customer marketing targets
  • Bob: 1 past purchase, 5 page views, $0 in cart → e(Bob) = 0.15
    Low propensity: unlikely to receive promo
  • Carol: 5 past purchases, 25 page views, $50 in cart → e(Carol) = 0.50
    Medium propensity: 50-50 chance

If we find a treated customer with e=0.50 and a control customer with e=0.50, they're comparable—even if their specific X values differ, they had the same likelihood of treatment.

4.2 The Propensity Score Theorem

Why can we reduce high-dimensional X to a single number e(X)? The answer is Rosenbaum and Rubin's (1983) fundamental theorem:

Propensity Score Theorem:

If (Y(1), Y(0)) ⊥ D | X (unconfoundedness given X)

Then (Y(1), Y(0)) ⊥ D | e(X) (unconfoundedness given propensity score)

In words: If treatment assignment is independent of potential outcomes conditional on X, it's also independent conditional on the propensity score e(X). The propensity score is a sufficient statistic for X with respect to balancing treatment and control groups.

Practical implication: Instead of matching on 20 confounders, we can match on just one number: e(X). This solves the curse of dimensionality.

4.3 Estimating Propensity Scores

The propensity score e(X) = P(D=1|X) is typically unknown and must be estimated from data. This is a standard classification/prediction problem.

Common Estimation Methods:

1. Logistic Regression (Most Common)

logit(e(X)) = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ

Fit a logistic regression with treatment D as the outcome and confounders X as predictors. Each customer's predicted probability ê(X) is their estimated propensity score.

2. Machine Learning Methods

  • Random Forests: Can capture non-linear relationships and interactions
  • Gradient Boosting (XGBoost, LightGBM): Often achieves best predictive performance
  • Neural Networks: For complex, high-dimensional confounders

Trade-off: More flexible models fit propensity scores better but may be less stable. If propensity scores are poorly estimated, downstream causal estimates will be biased.

What Variables to Include?

Use your DAG from Week 2! Include:

  • ✓ Confounders (affect both D and Y)
  • ✓ Predictors of Y (even if they don't affect D—improves precision)
  • ✗ Mediators (on causal path from D to Y—don't adjust for these)
  • ✗ Colliders (caused by both D and Y—adjusting creates bias)
  • ✗ Descendants of D (post-treatment variables—create bias)

5. Propensity Score Matching

Once we have estimated propensity scores ê(X) for all customers, we can match treated and control customers with similar scores.

Matching Algorithms

1. Nearest-Neighbor Matching (1:1)

For each treated unit, find the control unit with the closest propensity score.

  • With replacement: Control units can be matched multiple times (reduces bias, increases variance)
  • Without replacement: Each control used only once (increases bias, reduces variance)

2. K-Nearest Neighbors (1:K)

Match each treated unit to K control units (e.g., K=5), then average control outcomes. Reduces variance at the cost of some bias if matches are less exact.

3. Caliper Matching

Only match if |êtreated - êcontrol| < caliper (threshold). Discard treated units without good matches. Common caliper: 0.01 or 0.25 × SD(ê).

4. Radius Matching

Match each treated unit to all control units within a caliper. Balances bias-variance trade-off.
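
As a concrete illustration of options 3 and 4, here is a minimal radius-matching sketch; it assumes DataFrames treated and control that each carry a propensity_score column and the purchased outcome from the running example:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def radius_matching_att(treated, control, radius=0.01,
                        ps_col='propensity_score', outcome_col='purchased'):
    """For each treated unit, average ALL controls whose propensity score lies
    within `radius`; treated units with no control inside the radius are dropped."""
    nn = NearestNeighbors(radius=radius).fit(control[[ps_col]])
    _, neighbor_idx = nn.radius_neighbors(treated[[ps_col]])
    effects = [y_t - control.iloc[idx][outcome_col].mean()
               for y_t, idx in zip(treated[outcome_col], neighbor_idx)
               if len(idx) > 0]  # skip treated units with no match in radius
    return float(np.mean(effects))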

Computing the Treatment Effect

After matching, estimate the Average Treatment Effect on the Treated (ATT):

ATT = (1/NT) Σi:Di=1 [Yi - ȲM(i)]

where ȲM(i) = average outcome for control units matched to treated unit i

Example Walkthrough

Matched Pairs (1:1 Nearest Neighbor):

Treated Customer | ê(X) | Purchased? | Matched Control | ê(X) | Purchased?
-----------------|------|------------|-----------------|------|-----------
Alice | 0.85 | Yes (1) | David | 0.84 | Yes (1)
Bob | 0.50 | Yes (1) | Emma | 0.51 | No (0)
Carol | 0.30 | No (0) | Frank | 0.32 | No (0)

Individual treatment effects: Alice (1-1=0), Bob (1-0=1), Carol (0-0=0)

ATT = (0 + 1 + 0) / 3 = 0.33 → The promo increased purchase rate by 33 percentage points for treated customers

6. Inverse Probability Weighting (IPW)

Instead of matching, we can weight observations by the inverse of their propensity score to create balance.

IPW Intuition

Think of propensity scores as selection probabilities. Units with low e(X) are underrepresented in the treated group (unlikely to receive treatment). We up-weight these rare treated units to reflect the full population.

IPW Estimator for ATE:

ATE = E[Y(1)] - E[Y(0)]
= E[D·Y / e(X)] - E[(1-D)·Y / (1-e(X))]

Weights:
• Treated units: wi = 1/e(Xi)
• Control units: wi = 1/(1-e(Xi))

Example: How IPW Works

Three Customers:

Customer | e(X) | Treated? | Purchased? | Weight
---------|------|----------|------------|-------
Alice | 0.80 | Yes (1) | 1 | 1/0.80 = 1.25
Bob | 0.20 | Yes (1) | 1 | 1/0.20 = 5.0
Carol | 0.20 | No (0) | 0 | 1/(1-0.20) = 1.25

Key insight: Bob has low propensity (0.20) but still received treatment—he's a rare case, so we weight him heavily (5×). This up-weights underrepresented treatment combinations to mimic randomization.

Stabilized Weights (Common Improvement)

Extreme propensity scores (near 0 or 1) create extreme weights, increasing variance. Stabilized weights fix this:

SWi = P(Di=1) / e(Xi) for treated
SWi = P(Di=0) / (1-e(Xi)) for control

Stabilized weights reduce variance while maintaining consistency (they converge to the same ATE).
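
A minimal sketch of the stabilized-weight estimator, assuming NumPy arrays D (treatment indicator), Y (outcome), and e (estimated propensity scores):

import numpy as np

def stabilized_ipw_ate(D, Y, e):
    """ATE with stabilized weights: P(D=1)/e(X) for treated, P(D=0)/(1-e(X)) for control."""
    D, Y, e = np.asarray(D), np.asarray(Y), np.asarray(e)
    p_treated = D.mean()                                  # marginal P(D=1)
    sw = np.where(D == 1, p_treated / e, (1 - p_treated) / (1 - e))
    y1 = np.sum(sw * D * Y) / np.sum(sw * D)              # weighted mean, treated
    y0 = np.sum(sw * (1 - D) * Y) / np.sum(sw * (1 - D))  # weighted mean, control
    return y1 - y0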

Advantages & Disadvantages

✓ Advantages:

  • Uses all data (no discarding unmatched units)
  • Straightforward to estimate ATE (not just ATT)
  • Works well with large samples

✗ Disadvantages:

  • Sensitive to extreme weights (propensity scores near 0/1)
  • Requires accurate propensity score estimation
  • Poor overlap can lead to high variance

7. Stratification & Subclassification

Instead of matching or weighting, we can stratify (group) customers into bins based on propensity scores, then compare treated vs. control within each bin.

How Stratification Works

Step 1: Create Propensity Score Bins

Divide propensity scores into K strata (typically 5-10). Common approach: quintiles (5 bins of equal size).

Step 2: Estimate Treatment Effect Within Each Stratum

Within stratum k: ATEk = Ȳtreated,k - Ȳcontrol,k

Step 3: Aggregate Across Strata

ATE = Σk wk × ATEk

where wk = proportion of population in stratum k (or Nk/N)

Example: 5-Quintile Stratification

Stratum | e(X) Range | NT | NC | ȲT | ȲC | ATEk
--------|------------|------|------|------|------|------
1 (Low) | 0.0-0.2 | 200 | 3,800 | 0.15 | 0.10 | +0.05
2 | 0.2-0.4 | 800 | 3,200 | 0.25 | 0.18 | +0.07
3 | 0.4-0.6 | 1,500 | 2,500 | 0.35 | 0.28 | +0.07
4 | 0.6-0.8 | 1,800 | 2,200 | 0.48 | 0.40 | +0.08
5 (High) | 0.8-1.0 | 700 | 3,300 | 0.55 | 0.45 | +0.10

ATE ≈ 0.20 × 0.05 + 0.20 × 0.07 + 0.20 × 0.07 + 0.20 × 0.08 + 0.20 × 0.10 = 0.074
The promo increased purchase probability by 7.4 percentage points.

Classic result (Cochran, 1968): Stratifying on propensity scores into just 5 groups removes ~90% of bias due to confounding. More strata improve balance but reduce sample size per stratum.
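
A minimal sketch of this stratified estimator, assuming a DataFrame df with the propensity_score, received_promo, and purchased columns used in the running example:

import pandas as pd

def stratified_ate(df, k=5, ps_col='propensity_score',
                   treat_col='received_promo', outcome_col='purchased'):
    """Bin on propensity-score quantiles, then aggregate within-stratum
    treated-control differences with population-share weights."""
    strata = pd.qcut(df[ps_col], q=k, labels=False, duplicates='drop')
    ate = 0.0
    for _, stratum in df.groupby(strata):
        treated = stratum.loc[stratum[treat_col] == 1, outcome_col]
        control = stratum.loc[stratum[treat_col] == 0, outcome_col]
        if len(treated) == 0 or len(control) == 0:
            continue  # stratum lacks overlap; skipped
        ate += (len(stratum) / len(df)) * (treated.mean() - control.mean())
    return ate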

8. Assessing Covariate Balance

After matching/weighting/stratifying, we must check if we've achieved balance—whether treated and control groups have similar covariate distributions.

Why Balance Matters

Good balance is evidence that our method successfully removed confounding. If covariates are still imbalanced after adjustment, our causal estimates may be biased.

Standardized Mean Difference (SMD)

The most common balance metric:

SMD = (X̄treated - X̄control) / √[(s²treated + s²control) / 2]

Rule of thumb: |SMD| < 0.1 indicates good balance; |SMD| < 0.25 is acceptable.

Example: Balance Table

Before and After Matching:

Covariate | SMD (Before) | SMD (After) | Balanced?
----------|--------------|-------------|----------
Past purchases | 0.65 | 0.08 | ✓ Yes
Account age | 0.42 | 0.05 | ✓ Yes
Cart value | 0.58 | 0.12 | ~ Marginal
Page views | 0.51 | 0.07 | ✓ Yes
Email engagement | 0.39 | 0.09 | ✓ Yes

Before matching, treated customers had substantially more past purchases (SMD=0.65). After matching, groups are well-balanced.

Love Plots (Visual Balance Assessment)

A Love plot shows SMD for each covariate before and after adjustment, making it easy to spot remaining imbalances. Points should cluster near zero after adjustment.

9. Common Support & Positivity

Common support (or overlap) means treated and control groups have overlapping propensity score distributions. Without overlap, causal inference is impossible—we're comparing apples to oranges.

Visualizing Overlap

Good Overlap Example:

Treated: propensity scores range from 0.15 to 0.90
Control: propensity scores range from 0.10 to 0.85
→ Substantial overlap in [0.15, 0.85]

Poor Overlap Example:

Treated: propensity scores range from 0.70 to 0.95
Control: propensity scores range from 0.05 to 0.40
→ Almost no overlap! These are fundamentally different groups.

What to Do When Overlap is Poor

  • Trim the sample: Discard treated units with e(X) > 0.9 and control units with e(X) < 0.1. This changes the estimand (you're no longer estimating ATE for the full population, but for the overlap region).
  • Caliper matching: Only match within a narrow propensity score range, discarding poor matches.
  • Overlap weights: Instead of 1/e and 1/(1-e), weight treated units by (1-e) and control units by e (equivalent to tilting by e×(1-e)). This emphasizes units with moderate propensity scores (near 0.5) and targets the effect in the overlap population; see the sketch after this list.
  • Reconsider identification: If overlap is fundamentally poor, you may need a different identification strategy (e.g., instrumental variables, RDD, DiD).
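
A minimal sketch of the overlap-weight estimator referenced above, assuming arrays D, Y, and e as in the IPW section; treated units get weight 1-e and controls get weight e:

import numpy as np

def overlap_weighted_effect(D, Y, e):
    """Average effect in the overlap population: weight treated by (1-e), control by e."""
    D, Y, e = np.asarray(D), np.asarray(Y), np.asarray(e)
    w = np.where(D == 1, 1 - e, e)   # overlap weights never explode near 0 or 1
    y1 = np.sum(w * D * Y) / np.sum(w * D)
    y0 = np.sum(w * (1 - D) * Y) / np.sum(w * (1 - D))
    return y1 - y0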

⚠️ Warning:

Poor overlap violates the positivity assumption. Trying to force causal inference without overlap leads to unstable estimates driven by extreme weights or extrapolation. Always check overlap before reporting results.

10. Sensitivity Analysis

The biggest threat to propensity score methods is unmeasured confounding. Sensitivity analysis asks: "How strong would an unmeasured confounder need to be to change our conclusions?"

Rosenbaum's Sensitivity Analysis

The most common sensitivity analysis for matched studies. Assume an unmeasured confounder U affects treatment assignment. Rosenbaum bounds quantify how much hidden bias would overturn results.

Sensitivity Parameter Γ (Gamma):

Γ = max odds ratio of treatment due to unmeasured confounding

  • Γ = 1: No unmeasured confounding (matched pairs have equal treatment odds)
  • Γ = 2: One matched unit could be 2× more likely to receive treatment due to unmeasured U
  • Γ = 3: Up to 3× more likely, etc.

Interpreting Sensitivity Results

Example Results:

Estimated ATE = +0.10 (10 percentage point increase in purchases), p < 0.01

Sensitivity analysis: Result remains significant (p < 0.05) for Γ up to 1.8, but becomes non-significant at Γ = 2.0.

Interpretation: An unmeasured confounder would need to increase treatment odds by 2× (doubling the odds) to explain away our result. If such a strong confounder is plausible, our conclusions are fragile. If not, we have reasonable confidence.

Other Sensitivity Approaches

  • VanderWeele-Ding E-value: "What's the minimum strength of association (both with treatment and outcome) an unmeasured confounder would need to have to fully explain away the effect?" A computation sketch follows this list.
  • Simulation-based sensitivity: Simulate various unmeasured confounders and see how estimates change.
  • Placebo outcomes: Test your method on outcomes that shouldn't be affected by treatment. If you find spurious effects, your method may be biased.
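
A minimal sketch of the E-value computation, using the matched purchase rates reported later in this article (22.5% treated vs. 18.3% control) purely as illustrative numbers:

import math

def e_value(risk_ratio):
    """Minimum risk-ratio association an unmeasured confounder would need with
    both treatment and outcome to fully explain away the observed effect."""
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))

rr_observed = 0.225 / 0.183                      # ≈ 1.23 relative risk of ordering
print(f"E-value: {e_value(rr_observed):.2f}")    # ≈ 1.76

An E-value of about 1.76 says an unmeasured confounder would need to be associated with both promo receipt and ordering by a risk ratio of roughly 1.8, beyond the measured covariates, to fully explain away the effect.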

11. Practical Implementation in Python

Here's a complete workflow using Python libraries.

# Propensity Score Matching in Python

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# 1. Load data
df = pd.read_csv('promo_data.csv')
# Columns: customer_id, received_promo, purchased, past_purchases,
#          account_age, cart_value, page_views, email_engagement

# 2. Prepare confounders (X) and treatment (D)
confounders = ['past_purchases', 'account_age', 'cart_value',
               'page_views', 'email_engagement']
X = df[confounders]
D = df['received_promo']
Y = df['purchased']

# 3. Estimate propensity scores
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(X_scaled, D)
propensity_scores = ps_model.predict_proba(X_scaled)[:, 1]

df['propensity_score'] = propensity_scores

# 4. Check overlap
plt.hist(df[df['received_promo']==1]['propensity_score'],
         alpha=0.5, label='Treated', bins=30)
plt.hist(df[df['received_promo']==0]['propensity_score'],
         alpha=0.5, label='Control', bins=30)
plt.xlabel('Propensity Score')
plt.legend()
plt.title('Propensity Score Overlap')
plt.show()

# 5. Propensity Score Matching (1:1 nearest neighbor with caliper)
from sklearn.neighbors import NearestNeighbors

treated = df[df['received_promo'] == 1].reset_index(drop=True)
control = df[df['received_promo'] == 0].reset_index(drop=True)

# Fit NN on control propensity scores
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])

# For each treated, find nearest control
distances, indices = nn.kneighbors(treated[['propensity_score']])

# Apply caliper (e.g., 0.01)
caliper = 0.01
valid_matches = distances.flatten() < caliper
matched_treated = treated[valid_matches]
matched_control = control.iloc[indices.flatten()[valid_matches]]

print(f"Matched {len(matched_treated)} out of {len(treated)} treated units")

# 6. Estimate ATT
att = (matched_treated['purchased'].mean() -
       matched_control['purchased'].mean())
print(f"ATT: {att:.4f}")

# 7. Check balance (SMD)
def smd(treated, control, var):
    mean_diff = treated[var].mean() - control[var].mean()
    pooled_std = np.sqrt((treated[var].var() + control[var].var()) / 2)
    return mean_diff / pooled_std

print("\nStandardized Mean Differences (After Matching):")
for var in confounders:
    smd_val = smd(matched_treated, matched_control, var)
    print(f"{var}: {smd_val:.3f}")

# 8. IPW (Alternative approach)
# Calculate weights
df['ipw_weight'] = np.where(
    df['received_promo'] == 1,
    1 / df['propensity_score'],
    1 / (1 - df['propensity_score'])
)

# Trim extreme weights (optional)
df['ipw_weight'] = df['ipw_weight'].clip(upper=df['ipw_weight'].quantile(0.99))

# Estimate ATE with IPW
ate_ipw = (
    (df[df['received_promo']==1]['purchased'] *
     df[df['received_promo']==1]['ipw_weight']).sum() /
    df[df['received_promo']==1]['ipw_weight'].sum()
    -
    (df[df['received_promo']==0]['purchased'] *
     df[df['received_promo']==0]['ipw_weight']).sum() /
    df[df['received_promo']==0]['ipw_weight'].sum()
)
print(f"\nATE (IPW): {ate_ipw:.4f}")

Useful Python Libraries

  • causalml: Uplift modeling and meta-learners (we'll cover in Week 7)
  • DoWhy: Microsoft's causal inference library with DAG support (see the usage sketch after this list)
  • econml: Double ML, causal forests (Week 5-6)
  • psmpy: Dedicated propensity score matching library
  • statsmodels: IPW and regression-based methods
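
As a reference point, here is a sketch of how the DoWhy workflow mentioned above is typically wired up for this problem, reusing df and confounders from the code in this section; treat the method name as indicative, since API details vary across library versions:

from dowhy import CausalModel

# Declare treatment, outcome, and the confounders identified by the DAG
model = CausalModel(data=df,
                    treatment='received_promo',
                    outcome='purchased',
                    common_causes=confounders)

identified_estimand = model.identify_effect()
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_matching")
print(estimate.value)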

11.5 End-to-End Example: DoorDash Discount Campaign

Let's walk through a complete propensity score matching analysis for our promotional discount example.

🎯 Business Context

Scenario: DoorDash marketing team sent a "20% off your next order" promo code to 5,000 customers in March 2024. They targeted customers based on engagement signals (past orders, app opens, cart abandonment). 15,000 customers didn't receive the promo.

Business Question: Did the promo cause customers to order more? Or did we just give discounts to people who would have ordered anyway? We need the causal effect to decide if future promo campaigns are worth the margin hit.

Step 1: Data Structure & Variable Selection

Dataset: 20,000 customers observed over 30 days post-campaign

Variable | Description | Type
---------|-------------|-----
customer_id | Unique customer identifier | String
received_promo | Treatment indicator (1 = received promo) | Binary
ordered | Outcome: ordered in next 30 days (1=Yes, 0=No) | Binary
orders_90d | Orders in past 90 days (pre-treatment) | Count
avg_order_value | Average historical order value | Continuous ($)
days_since_last_order | Recency (days since last order) | Continuous
app_opens_7d | App opens in past 7 days | Count
cart_abandons_30d | Cart abandonments in past 30 days | Count
account_age_days | Days since account creation | Continuous
email_opens_30d | Marketing email opens in past 30 days | Count
is_dashpass | DashPass subscriber (1=Yes, 0=No) | Binary

The pre-treatment variables (orders_90d through is_dashpass) are confounders—they affect both promo targeting (marketing used them) and purchase outcomes (more engaged customers order more).

Step 2: Complete Python Workflow

# ===================================================================
# END-TO-END PROPENSITY SCORE MATCHING: DOORDASH PROMO CAMPAIGN
# ===================================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# ===================================================================
# STEP 1: LOAD AND EXPLORE DATA
# ===================================================================

# Load data
df = pd.read_csv('doordash_promo_campaign.csv')

print(f"Total customers: {len(df)}")
print(f"Treated (received promo): {df['received_promo'].sum()}")
print(f"Control (no promo): {(~df['received_promo'].astype(bool)).sum()}")
print(f"\nOutcome prevalence:")
print(df.groupby('received_promo')['ordered'].agg(['mean', 'count']))

# Naive comparison (BIASED - don't trust this!)
naive_diff = (df[df['received_promo']==1]['ordered'].mean() -
              df[df['received_promo']==0]['ordered'].mean())
print(f"\nNaive difference (BIASED): {naive_diff:.4f} ({naive_diff*100:.2f} pp)")

# ===================================================================
# STEP 2: CHECK PRE-TREATMENT IMBALANCE
# ===================================================================

confounders = ['orders_90d', 'avg_order_value', 'days_since_last_order',
               'app_opens_7d', 'cart_abandons_30d', 'account_age_days',
               'email_opens_30d', 'is_dashpass']

def standardized_mean_diff(treated, control, var):
    """Calculate standardized mean difference"""
    mean_diff = treated[var].mean() - control[var].mean()
    pooled_std = np.sqrt((treated[var].var() + control[var].var()) / 2)
    return mean_diff / pooled_std if pooled_std > 0 else 0

treated = df[df['received_promo'] == 1]
control = df[df['received_promo'] == 0]

print("\n" + "="*60)
print("PRE-MATCHING COVARIATE BALANCE")
print("="*60)
print(f"{'Variable':<25} {'SMD':>10} {'Balanced?':>12}")
print("-"*60)

smd_before = {}
for var in confounders:
    smd = standardized_mean_diff(treated, control, var)
    smd_before[var] = smd
    balanced = "✓ Yes" if abs(smd) < 0.1 else ("~ Marginal" if abs(smd) < 0.25 else "✗ No")
    print(f"{var:<25} {smd:>10.3f} {balanced:>12}")

# ===================================================================
# STEP 3: ESTIMATE PROPENSITY SCORES
# ===================================================================

print("\n" + "="*60)
print("ESTIMATING PROPENSITY SCORES")
print("="*60)

# Prepare features
X = df[confounders].copy()

# Standardize continuous variables for better convergence
scaler = StandardScaler()
continuous_vars = ['avg_order_value', 'days_since_last_order',
                   'account_age_days']
X[continuous_vars] = scaler.fit_transform(X[continuous_vars])

D = df['received_promo'].values
Y = df['ordered'].values

# Method 1: Logistic Regression (interpretable, common)
print("\nFitting logistic regression...")
ps_model_lr = LogisticRegression(max_iter=1000, random_state=42)
ps_model_lr.fit(X, D)
propensity_scores_lr = ps_model_lr.predict_proba(X)[:, 1]

# Method 2: Gradient Boosting (better fit, more flexible)
print("Fitting gradient boosting...")
ps_model_gb = GradientBoostingClassifier(n_estimators=100,
                                         max_depth=3,
                                         random_state=42)
ps_model_gb.fit(X, D)
propensity_scores_gb = ps_model_gb.predict_proba(X)[:, 1]

# We'll use gradient boosting for better prediction
propensity_scores = propensity_scores_gb
df['propensity_score'] = propensity_scores

print(f"\nPropensity score summary:")
print(df.groupby('received_promo')['propensity_score'].describe())

# ===================================================================
# STEP 4: CHECK OVERLAP (COMMON SUPPORT)
# ===================================================================

print("\n" + "="*60)
print("ASSESSING OVERLAP")
print("="*60)

ps_treated = df[df['received_promo']==1]['propensity_score']
ps_control = df[df['received_promo']==0]['propensity_score']

print(f"\nTreated range: [{ps_treated.min():.3f}, {ps_treated.max():.3f}]")
print(f"Control range: [{ps_control.min():.3f}, {ps_control.max():.3f}]")

overlap_min = max(ps_treated.min(), ps_control.min())
overlap_max = min(ps_treated.max(), ps_control.max())
print(f"Overlap region: [{overlap_min:.3f}, {overlap_max:.3f}]")

# Count customers outside overlap
treated_outside = (ps_treated < overlap_min) | (ps_treated > overlap_max)
control_outside = (ps_control < overlap_min) | (ps_control > overlap_max)
print(f"\nTreated outside overlap: {treated_outside.sum()} ({treated_outside.mean()*100:.1f}%)")
print(f"Control outside overlap: {control_outside.sum()} ({control_outside.mean()*100:.1f}%)")

# Visualize overlap
plt.figure(figsize=(10, 5))
plt.hist(ps_control, bins=50, alpha=0.5, label='Control', density=True)
plt.hist(ps_treated, bins=50, alpha=0.5, label='Treated', density=True)
plt.axvline(overlap_min, color='red', linestyle='--', alpha=0.5, label='Overlap boundary')
plt.axvline(overlap_max, color='red', linestyle='--', alpha=0.5)
plt.xlabel('Propensity Score')
plt.ylabel('Density')
plt.title('Propensity Score Distribution: Treated vs Control')
plt.legend()
plt.tight_layout()
plt.savefig('propensity_overlap.png', dpi=300, bbox_inches='tight')
print("\n[Overlap plot saved to propensity_overlap.png]")

# ===================================================================
# STEP 5: PROPENSITY SCORE MATCHING (1:1 NEAREST NEIGHBOR)
# ===================================================================

print("\n" + "="*60)
print("PERFORMING PROPENSITY SCORE MATCHING")
print("="*60)

# Trim to overlap region (optional but recommended)
df_trimmed = df[(df['propensity_score'] >= overlap_min) &
                (df['propensity_score'] <= overlap_max)].copy()

treated_trim = df_trimmed[df_trimmed['received_promo']==1].reset_index(drop=True)
control_trim = df_trimmed[df_trimmed['received_promo']==0].reset_index(drop=True)

print(f"\nAfter trimming to overlap:")
print(f"  Treated: {len(treated_trim)}")
print(f"  Control: {len(control_trim)}")

# 1:1 Nearest Neighbor Matching with Caliper
caliper = 0.01  # Maximum propensity score difference
print(f"\nUsing caliper: {caliper}")

nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control_trim[['propensity_score']])

distances, indices = nn.kneighbors(treated_trim[['propensity_score']])

# Apply caliper
valid_matches = distances.flatten() < caliper
matched_treated = treated_trim[valid_matches].copy()
matched_control = control_trim.iloc[indices.flatten()[valid_matches]].copy()

print(f"\nMatching results:")
print(f"  Matched pairs: {len(matched_treated)}")
print(f"  Unmatched treated (poor matches): {(~valid_matches).sum()}")
print(f"  Matching rate: {valid_matches.mean()*100:.1f}%")

# ===================================================================
# STEP 6: CHECK POST-MATCHING BALANCE
# ===================================================================

print("\n" + "="*60)
print("POST-MATCHING COVARIATE BALANCE")
print("="*60)
print(f"{'Variable':<25} {'SMD Before':>12} {'SMD After':>12} {'Improved?':>12}")
print("-"*60)

smd_after = {}
for var in confounders:
    smd_before_val = smd_before[var]
    smd_after_val = standardized_mean_diff(matched_treated, matched_control, var)
    smd_after[var] = smd_after_val

    improved = "✓" if abs(smd_after_val) < abs(smd_before_val) else "✗"
    balanced = "✓" if abs(smd_after_val) < 0.1 else ("~" if abs(smd_after_val) < 0.25 else "✗")

    print(f"{var:<25} {smd_before_val:>12.3f} {smd_after_val:>12.3f} {improved:>6} {balanced:>5}")

# Create Love plot
fig, ax = plt.subplots(figsize=(8, 6))
vars_sorted = sorted(confounders, key=lambda v: abs(smd_before[v]), reverse=True)
y_pos = np.arange(len(vars_sorted))

ax.scatter([smd_before[v] for v in vars_sorted], y_pos,
           label='Before Matching', marker='o', s=100)
ax.scatter([smd_after[v] for v in vars_sorted], y_pos,
           label='After Matching', marker='s', s=100)
ax.axvline(0, color='black', linestyle='-', linewidth=0.5)
ax.axvline(-0.1, color='gray', linestyle='--', linewidth=0.5, alpha=0.5)
ax.axvline(0.1, color='gray', linestyle='--', linewidth=0.5, alpha=0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(vars_sorted)
ax.set_xlabel('Standardized Mean Difference')
ax.set_title('Covariate Balance: Before vs After Matching (Love Plot)')
ax.legend()
plt.tight_layout()
plt.savefig('balance_love_plot.png', dpi=300, bbox_inches='tight')
print("\n[Love plot saved to balance_love_plot.png]")

# ===================================================================
# STEP 7: ESTIMATE TREATMENT EFFECT (ATT)
# ===================================================================

print("\n" + "="*60)
print("TREATMENT EFFECT ESTIMATION")
print("="*60)

# ATT from matched sample
y_treated = matched_treated['ordered'].values
y_control = matched_control['ordered'].values

att = y_treated.mean() - y_control.mean()

# Standard error and confidence interval (paired t-test)
differences = y_treated - y_control
se_att = differences.std() / np.sqrt(len(differences))
ci_lower = att - 1.96 * se_att
ci_upper = att + 1.96 * se_att

# T-test
t_stat, p_value = stats.ttest_rel(y_treated, y_control)

print(f"\nAverage Treatment Effect on the Treated (ATT):")
print(f"  Point estimate: {att:.4f} ({att*100:.2f} percentage points)")
print(f"  95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f"  Standard error: {se_att:.4f}")
print(f"  t-statistic: {t_stat:.3f}")
print(f"  p-value: {p_value:.4f}")

if p_value < 0.001:
    sig = "***"
elif p_value < 0.01:
    sig = "**"
elif p_value < 0.05:
    sig = "*"
else:
    sig = "ns"
print(f"  Significance: {sig}")

# Comparison
print(f"\nComparison:")
print(f"  Naive estimate (biased): {naive_diff:.4f} ({naive_diff*100:.2f} pp)")
print(f"  PSM estimate (ATT): {att:.4f} ({att*100:.2f} pp)")
print(f"  Difference: {abs(naive_diff - att):.4f} ({abs(naive_diff - att)*100:.2f} pp)")

# ===================================================================
# STEP 8: BUSINESS INTERPRETATION
# ===================================================================

print("\n" + "="*60)
print("BUSINESS IMPLICATIONS")
print("="*60)

# Calculate lift
baseline_rate = y_control.mean()
treated_rate = y_treated.mean()
lift = (treated_rate / baseline_rate - 1) * 100 if baseline_rate > 0 else 0

print(f"\nOrder Rates:")
print(f"  Without promo (matched control): {baseline_rate:.1%}")
print(f"  With promo (matched treated): {treated_rate:.1%}")
print(f"  Relative lift: {lift:.1f}%")

# ROI calculation (example numbers)
avg_order_value = 35  # $35 average order
discount_cost = 0.20 * avg_order_value  # 20% off
revenue_per_order = avg_order_value * 0.30  # 30% margin

incremental_orders = att  # Per customer
net_value = incremental_orders * revenue_per_order - discount_cost

print(f"\nSimplified ROI (per customer who received promo):")
print(f"  Incremental order probability: {att:.1%}")
print(f"  Avg order value: ${avg_order_value}")
print(f"  Discount cost per promo sent: ${discount_cost:.2f}")
print(f"  Margin per incremental order: ${revenue_per_order:.2f}")
print(f"  Net value per promo: ${net_value:.2f}")

if net_value > 0:
    print(f"  ✓ Campaign is PROFITABLE (positive ROI)")
else:
    print(f"  ✗ Campaign is UNPROFITABLE (negative ROI)")

print(f"\nScaled to campaign:")
campaign_size = 5000
total_incremental_orders = att * campaign_size
total_net_value = net_value * campaign_size
print(f"  Total incremental orders: {total_incremental_orders:.0f}")
print(f"  Total net value: ${total_net_value:,.2f}")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)

Step 3: Interpreting Results

Example Output Interpretation:

Naive estimate (biased): +0.0850 (8.50 pp)
PSM estimate (ATT): +0.0420 (4.20 pp)
Difference: 0.0430 (4.30 pp)

Order Rates:
  Without promo: 18.3%
  With promo: 22.5%
  Relative lift: 23.0%

Net value per promo: $1.05 → ✓ Campaign is PROFITABLE
Total net value: $5,250

Key Findings:

  • Selection bias was substantial: The naive estimate (8.5 pp) overstated the effect by 2×. Marketing targeted engaged customers who were already more likely to order.
  • True causal effect (ATT): +4.2 pp — The promo increased order probability by 4.2 percentage points for customers who received it. This is statistically significant (p < 0.01).
  • 23% relative lift: Among matched customers, the promo increased order rates from 18.3% to 22.5%.
  • Positive ROI: Even after accounting for the discount cost ($7), the incremental margin ($1.05 per customer) makes the campaign profitable. Total campaign value: $5,250.

Step 4: Critical Assumptions to Verify

Before presenting to stakeholders, check:

  1. Did we measure all confounders? Interview the marketing team. Did they use any targeting signals we don't have in our data? (e.g., predicted churn score, customer support interactions, referral status)
  2. Is overlap adequate? If >20% of treated customers are outside common support, results may not generalize to the full treated population.
  3. Are covariates balanced? If any |SMD| > 0.25 after matching, consider: adding interactions to propensity model, exact matching on that variable, or regression adjustment.
  4. Run sensitivity analysis: Use Rosenbaum bounds to quantify robustness to unmeasured confounding. Report the Γ value where results become non-significant.
  5. Placebo test: Estimate the "effect" on a pre-treatment outcome (e.g., orders 90-120 days before campaign). Should be near zero—if not, matching didn't work.
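
A minimal sketch of the placebo check in item 5, reusing matched_treated and matched_control from the workflow above and assuming a hypothetical pre-campaign outcome column named ordered_pre_period:

from scipy import stats

# 'ordered_pre_period' is a hypothetical pre-treatment outcome (orders placed
# 90-120 days before the campaign); the promo cannot have affected it.
placebo_att = (matched_treated['ordered_pre_period'].mean() -
               matched_control['ordered_pre_period'].mean())
t_stat, p_value = stats.ttest_rel(matched_treated['ordered_pre_period'],
                                  matched_control['ordered_pre_period'])
print(f"Placebo ATT: {placebo_att:.4f} (p = {p_value:.3f})")
# A placebo ATT far from zero signals residual confounding in the matching.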

Step 5: Communicating to Business Stakeholders

Translation for non-technical audience:

❌ Don't say:

"After estimating propensity scores via gradient boosting and performing 1:1 nearest neighbor matching with a 0.01 caliper, we achieved covariate balance (|SMD| < 0.1) and estimated the ATT using a paired t-test..."

✅ Do say:

"The 20% off promo increased order rates by 4.2 percentage points."

We compared customers who received the promo to very similar customers who didn't—matching them on past ordering behavior, app engagement, and other factors. This tells us the causal effect of the promo itself, not just correlation.

Bottom line: For every 100 promos we send to customers like those in our campaign, we generate about 4 extra orders. After accounting for the discount cost, we make $1.05 per promo sent. The campaign is profitable and should be scaled.

Caveat: This estimate assumes we've captured all major factors marketing used to target customers. If there were hidden targeting rules we don't have data on, the true effect could differ.

💡 Pro Tip:

Always validate your propensity score model. Check: (1) AUC > 0.6 (some predictive power), (2) balance improvement (Love plot), (3) overlap (histograms), and (4) sensitivity to model choice (compare logistic vs. ML methods). If results are highly sensitive to modeling choices, your estimates are fragile.

12. Key Takeaways

Propensity scores reduce high-dimensional confounders to a single balancing score e(X) = P(D=1|X)

Matching: Compare treated and control units with similar propensity scores (estimates ATT)

IPW: Weight observations by 1/e(X) and 1/(1-e(X)) to create pseudo-population (estimates ATE)

Stratification: Divide into propensity score bins and compare within each bin

Balance checking: Use SMD to verify covariates are balanced after adjustment

Overlap: Treated and control must have overlapping propensity score distributions (positivity)

Sensitivity analysis: Quantify robustness to unmeasured confounding (Rosenbaum bounds, E-values)

Conditional independence: Methods assume all confounders are observed—untestable but critical

13. Next Week Preview

Propensity scores help when we observe all confounders. But what if we have:

  • Unmeasured confounders that we can't observe?
  • Panel data (same customers over time)?
  • Natural experiments or quasi-random variation?

Next week, we'll learn Regression & Instrumental Variables:

  • Regression adjustment and its limitations
  • Difference-in-differences (DiD) for panel data
  • Instrumental variables (IV) and Two-Stage Least Squares (2SLS)
  • Regression discontinuity design (RDD)
  • Fixed effects for time-invariant confounders

These methods relax the "observe all confounders" assumption and are crucial for real-world causal inference.

Practice Problems: Matching & Propensity Scores

Test your understanding of propensity score methods with these problems. A worked solution follows each one.

Question 1: Understanding Propensity Scores

Two customers have different characteristics but the same propensity score (e(X) = 0.6). Alice: 10 past purchases, 2 years account age. Bob: 2 past purchases, 8 years account age.

a) What does it mean that they have the same propensity score?
b) Can we match Alice and Bob for causal inference? Why or why not?
c) What assumption must hold for this matching to be valid?

Solution

a) Having the same propensity score means Alice and Bob have the same probability of receiving the promo treatment, even though their individual characteristics differ. The propensity score "balances" their different covariate profiles into a single comparable dimension.

b) Yes, we can match them! The propensity score theorem tells us that if unconfoundedness holds given X, it also holds given e(X). Even though Alice and Bob differ on individual covariates, they are comparable with respect to treatment assignment because they have the same propensity score.

c) Conditional independence must hold: (Y(1), Y(0)) ⊥ D | X. This means we've measured all confounders—all variables that affect both promo assignment and purchase outcomes. If there are unmeasured confounders, matching on propensity scores won't remove all bias.

Question 2: IPW Calculation

Consider three customers:

  • Alice: e(X) = 0.8, Treated (D=1), Purchased (Y=1)
  • Bob: e(X) = 0.2, Treated (D=1), Purchased (Y=1)
  • Carol: e(X) = 0.5, Control (D=0), Did not purchase (Y=0)

a) Calculate the IPW weight for each customer.
b) Why is Bob's weight so much larger than Alice's?
c) Explain intuitively what these weights are doing.

Solution

a) IPW weights:

  • Alice (treated): w = 1/e(X) = 1/0.8 = 1.25
  • Bob (treated): w = 1/e(X) = 1/0.2 = 5.0
  • Carol (control): w = 1/(1-e(X)) = 1/(1-0.5) = 2.0

b) Bob's weight is much larger because he has a low propensity score (0.2) but still received treatment. He represents a rare case—someone who had only a 20% chance of getting the promo but got it anyway. The IPW method up-weights these rare cases to correct for selection bias.

c) Intuition: IPW creates a "pseudo-population" where treatment is independent of covariates. By weighting inversely to the probability of receiving their actual treatment, we simulate what would have happened if everyone had an equal chance of treatment—mimicking randomization. Units in underrepresented treatment-covariate combinations get higher weights to balance out the sample.

Question 3: Overlap Problem

After estimating propensity scores, you find:
• Treated customers: propensity scores range from 0.65 to 0.95
• Control customers: propensity scores range from 0.05 to 0.45

a) What problem does this indicate?
b) What would happen if you applied IPW without addressing this issue?
c) Suggest two approaches to handle this situation.

Solution

a) Problem: This indicates severe lack of overlap (common support violation). The treated and control groups occupy almost completely different regions of the propensity score distribution. There are essentially no comparable units—treated customers are fundamentally different from control customers.

b) IPW without addressing this:

  • Control units with e(X) = 0.05 would get weights of 1/(1-0.05) ≈ 1.05 (reasonable)
  • But you'd need to extrapolate to estimate E[Y(1)] for these low-propensity units—what would they have done if treated?
  • Treated units with e(X) = 0.95 would get weights of 1/0.95 ≈ 1.05, but control units at e(X) = 0.45 would get weights of 1/(1-0.45) ≈ 1.82
  • Main issue: You're comparing very different types of customers. Estimates would rely heavily on model extrapolation and functional form assumptions, likely producing biased and unstable results.

c) Two approaches:

  1. Trimming: Restrict analysis to the overlap region (e.g., 0.45 ≤ e(X) ≤ 0.65). Discard treated units with e(X) > 0.65 and control units with e(X) < 0.45. This changes your estimand—you're now estimating the ATE only for customers in the overlap region, not the full population.
  2. Use a different identification strategy: The lack of overlap suggests conditional independence may not be the right approach. Consider instrumental variables, difference-in-differences (if you have panel data), or regression discontinuity (if there's a threshold-based treatment rule). The fundamental problem is that these groups are too different to be directly compared.

Question 4: Balance Assessment

After propensity score matching, you check covariate balance and find:
• Past purchases: SMD = 0.05
• Account age: SMD = 0.08
• Cart value: SMD = 0.35

a) Which covariates are well-balanced? Which are not?
b) What might cause cart value to remain imbalanced?
c) Should you proceed with your analysis? What are your options?

Solution

a) Balance assessment:

  • Past purchases (SMD=0.05): Well-balanced ✓ (|SMD| < 0.1)
  • Account age (SMD=0.08): Well-balanced ✓ (|SMD| < 0.1)
  • Cart value (SMD=0.35): Not balanced ✗ (|SMD| > 0.25, significantly imbalanced)

b) Possible causes of cart value imbalance:

  • Cart value might have a non-linear relationship with treatment that your propensity score model didn't capture (e.g., quadratic or threshold effects)
  • Cart value might interact with other covariates in determining treatment
  • Cart value distribution might be highly skewed, making it hard to match on
  • The matching algorithm might have prioritized matching on other covariates

c) Should you proceed? Options:

Don't proceed with current results—cart value imbalance suggests your method hasn't fully removed confounding.

Options to improve:

  1. Improve propensity score model: Add cart_value² or cart_value × other_variables to capture non-linearity
  2. Covariate adjustment: After matching, include cart value as a covariate in your outcome regression (doubly robust approach; see the sketch after this solution)
  3. Exact matching on cart value: Match exactly on cart value bins (e.g., $0-25, $25-50, etc.) AND propensity score
  4. Stratification: Divide data by cart value quartiles, then do propensity score matching within each quartile
  5. Different method: Consider IPW or weighting methods that might achieve better balance

If cart value is a strong confounder (affects both treatment and outcome), proceeding with imbalance could bias your ATE estimate.
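
A minimal sketch of option 2 above (post-matching regression adjustment), assuming matched DataFrames matched_treated and matched_control like those built in Section 11; statsmodels provides the outcome regression:

import pandas as pd
import statsmodels.api as sm

# Stack the matched sample and regress the outcome on treatment plus the
# still-imbalanced covariate; the treatment coefficient is the adjusted ATT.
matched = pd.concat([matched_treated.assign(is_treated=1),
                     matched_control.assign(is_treated=0)], ignore_index=True)

X = sm.add_constant(matched[['is_treated', 'cart_value']])
fit = sm.OLS(matched['purchased'], X).fit(cov_type='HC1')   # robust SEs
print(fit.params['is_treated'], fit.bse['is_treated'])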

Question 5: Sensitivity Analysis Interpretation

You estimate ATT = +0.12 (12 percentage point increase in purchases) using propensity score matching, with p < 0.001. Rosenbaum sensitivity analysis shows:
• Γ = 1.5: p < 0.01 (still significant)
• Γ = 2.0: p = 0.08 (no longer significant at 5%)
• Γ = 2.5: p = 0.15

a) Interpret what Γ = 2.0 means in this context.
b) Are your results robust to unmeasured confounding? Explain.
c) Give an example of an unmeasured confounder that might have Γ ≈ 2.

Solution

a) Interpretation of Γ = 2.0:

Γ = 2.0 means that two customers with identical observed covariates (X) could differ in their odds of receiving treatment by a factor of 2 due to an unmeasured confounder U. Formally:

P(D=1|X,U) / P(D=0|X,U) ≤ 2 × P(D=1|X,U') / P(D=0|X,U')

In practical terms: One customer might be twice as likely to receive the promo compared to a matched customer, because of some unmeasured factor.

b) Robustness assessment:

Results are moderately robust but not bulletproof:

  • Positive sign: The result is fairly robust—it would take a moderate unmeasured confounder (Γ=2) to eliminate statistical significance. Weak unmeasured confounders (Γ < 2) wouldn't overturn the conclusion.
  • Caution needed: If there's a plausible unmeasured confounder that could double treatment odds, your conclusions are fragile. The effect might disappear entirely.
  • Domain knowledge critical: You need to think hard about whether such a confounder exists. If you've measured all major confounders, Γ=2 might be implausible. If not, be cautious.

c) Example unmeasured confounder with Γ ≈ 2:

Customer lifetime value (LTV) prediction score from a proprietary model:

  • Affects treatment: Marketing might have used an internal ML model to predict LTV and preferentially targeted high-LTV customers for promos (even though you don't have access to this score in your data). This could double the odds of treatment for high vs. low LTV customers.
  • Affects outcome: High-LTV customers are more likely to purchase regardless of promos, creating correlation between the unmeasured LTV score and purchase outcomes.
  • Why it matters: If you haven't measured LTV or its components (which might be derived from features you don't have, like customer support interactions, predicted churn, or business-specific engagement scores), this unmeasured confounder could bias your estimate.

If this scenario seems plausible, you should: (1) try to obtain the LTV score and include it, (2) conduct interviews with the marketing team to understand their targeting strategy, or (3) report results with appropriate caution about unmeasured confounding.

Business Case Study: Interview Approach

📊 Case: DoorDash Promotional Campaign Effectiveness

Context: You're a senior data scientist at DoorDash. The marketing team ran a promotional campaign offering $5 off for customers who hadn't ordered in the last 30 days (lapsed customers). They targeted 50,000 customers based on a combination of engagement scores, location, and past ordering patterns. Another 200,000 lapsed customers did not receive the promo.

Initial analysis shows:

  • Promo recipients: 18% made an order within 14 days
  • Non-recipients: 8% made an order within 14 days
  • Naive estimate: 10 percentage point lift (18% - 8%)

The marketing team wants to scale this campaign to all 2M lapsed customers. However, you notice that promo recipients were not randomly selected—they were chosen algorithmically based on likelihood to respond. You also have rich customer data: order history, restaurant preferences, location, time patterns, app engagement, customer support interactions, and more.

Your task: Use propensity score matching to estimate the causal effect of the $5 promo on order conversion. Assess covariate balance, check overlap, conduct sensitivity analysis, and make a recommendation about scaling the campaign.

Step 1: Data Understanding & Initial Assessment

Clarifying Questions:

  • Treatment assignment: How exactly were customers selected for the promo? Was there a propensity model? What features drove selection?
    Answer: Marketing used an internal "reactivation score" predicting likelihood to return, selecting top 50K. Score based on recency, frequency, monetary value (RFM), engagement.
  • Outcome measurement: How do we define "order"? Same-day orders? Any order value threshold?
    Answer: Any completed order ≥$10 within 14 days of promo send date.
  • Available covariates: What customer features do we have pre-treatment?
    Answer: 47 features including RFM metrics, restaurant diversity, avg order value, geography (ZIP), app usage, customer tenure, preferred cuisines, time-of-day patterns, device type.
  • Data quality: Missing values? Measurement errors? Selection issues?
    Answer: ~3% missing for app usage metrics (customers who never downloaded app), ZIP available for 98%.
  • External validity: Was this campaign run in specific markets or nationwide?
    Answer: Nationwide, but proportionally more recipients in urban areas with high restaurant density.

Initial Descriptive Analysis:


# Check sample composition
Treated (n=50,000):
  Mean orders last 90 days: 3.2
  Mean AOV: $28.50
  Urban %: 78%
  High engagement %: 62%

Control (n=200,000):
  Mean orders last 90 days: 1.8
  Mean AOV: $22.10
  Urban %: 58%
  High engagement %: 31%

⚠️ Substantial pre-treatment imbalance!
                    

The treated group is clearly different—more engaged, urban, higher AOV. The naive 10pp estimate is confounded.

Step 2: Estimate Propensity Scores

Model Selection for Propensity Score:

With 47 covariates, we need a flexible model. Consider:

Option 1: Logistic Regression (Baseline)

P(Promo=1 | X) = logit⁻¹(β₀ + β₁X₁ + ... + β₄₇X₄₇)
  • Pros: Interpretable, fast, works well with proper feature engineering
  • Cons: Assumes linear relationships (in logit scale), no interactions unless specified
  • Enhancement: Add key interactions (RFM × Urban, Tenure × Engagement), polynomial terms for continuous vars

Option 2: Gradient Boosted Trees (Recommended)

from sklearn.ensemble import GradientBoostingClassifier
  • Pros: Captures non-linearities automatically, handles interactions, robust to outliers
  • Cons: Black box, harder to diagnose misspecification
  • Best practice: Use with moderate depth (max_depth=4-6) to avoid overfitting propensity scores

Option 3: Ensemble (Doubly Robust)

  • Fit multiple models (logistic, GBM, random forest), average propensity scores
  • Reduces sensitivity to any single model misspecification
  • More computation but worth it for high-stakes decisions

Implementation & Diagnostics:


from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Fit propensity score model
ps_model = GradientBoostingClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    random_state=42
)

# Fit on a training split (X_train), then score the full sample (X_full) to limit in-sample overfitting
ps_model.fit(X_train, treatment_train)
propensity_scores = ps_model.predict_proba(X_full)[:, 1]

# Check propensity score distribution
print(f"Treated PS range: [{propensity_scores[treated==1].min():.3f}, "
      f"{propensity_scores[treated==1].max():.3f}]")
print(f"Control PS range: [{propensity_scores[treated==0].min():.3f}, "
      f"{propensity_scores[treated==0].max():.3f}]")
                    

Common Issue: Extreme Propensity Scores

  • If some treated units have PS ≈ 1.0 or controls have PS ≈ 0.0, there's no overlap
  • These units are "deterministic" based on observables—can't find good matches
  • Solution: Trim sample to common support (e.g., 0.1 < PS < 0.9)

Check Overlap Visually:


import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.hist(propensity_scores[treated==1], bins=50, alpha=0.5,
         label='Treated', density=True)
plt.hist(propensity_scores[treated==0], bins=50, alpha=0.5,
         label='Control', density=True)
plt.xlabel('Propensity Score')
plt.ylabel('Density')
plt.legend()
plt.title('Propensity Score Distribution: Check for Overlap')

# Identify poor overlap regions
overlap_min = max(propensity_scores[treated==1].min(),
                   propensity_scores[treated==0].min())
overlap_max = min(propensity_scores[treated==1].max(),
                   propensity_scores[treated==0].max())
print(f"Common support region: [{overlap_min:.3f}, {overlap_max:.3f}]")
                    

Good overlap: Substantial distribution overlap with similar densities in [0.2, 0.8] range. Poor overlap: Treated mostly >0.8, control mostly <0.2 (weak identification).

Step 3: Choose & Execute Matching Strategy

Matching Algorithm Trade-offs:

Recommended: 1:4 Nearest Neighbor with Caliper

Match each treated unit to 4 nearest control units (by PS distance) within caliper=0.05

Rationale:

  • 1:4 ratio: We have 4× more controls than treated (200K vs 50K). Using multiple matches improves precision without sacrificing balance.
  • Caliper=0.05: Only match if PS distance < 0.05. Prevents bad matches. Trade-off: lose some treated units with poor matches.
  • Why not 1:1? Would waste data—we'd only use 50K of 200K controls. Lower power.
  • Why not IPW instead? With selection on observables this strong, IPW would create extreme weights. Matching is more robust.

Alternative: Optimal Full Matching

  • Creates matched sets with varying ratios (some 1:1, some 1:5, etc.), minimizing total distance
  • Pros: Uses all data, optimal balance
  • Cons: Computationally expensive for 250K observations, harder to explain to stakeholders
  • When to use: If you have time/compute and want maximum precision

Alternative: Stratification

  • Divide PS into quintiles, estimate effects within each, aggregate
  • Pros: Simple, uses all data, easy to interpret
  • Cons: Residual imbalance within strata if heterogeneity is high
  • When to use: As a robustness check or if matching fails due to poor overlap

Execute Matching:


from sklearn.neighbors import NearestNeighbors

# Trim to common support first
in_support = (propensity_scores >= 0.1) & (propensity_scores <= 0.9)
df_trimmed = df[in_support].copy()

treated_trimmed = df_trimmed[df_trimmed['treated'] == 1]
control_trimmed = df_trimmed[df_trimmed['treated'] == 0]

# 1:4 nearest neighbor matching with caliper
caliper = 0.05
nn = NearestNeighbors(n_neighbors=4, metric='euclidean')
nn.fit(control_trimmed[['propensity_score']].values)

distances, indices = nn.kneighbors(
    treated_trimmed[['propensity_score']].values
)

# Apply caliper: only keep matches within distance threshold
valid_matches = distances.max(axis=1) < caliper
matched_treated = treated_trimmed[valid_matches]
matched_control_indices = indices[valid_matches].flatten()
matched_control = control_trimmed.iloc[matched_control_indices]

print(f"Matched {len(matched_treated)} out of {len(treated_trimmed)} treated units")
print(f"Match rate: {len(matched_treated)/len(treated_trimmed)*100:.1f}%")
print(f"Using {len(matched_control)} control observations (with replacement)")
                    

Expected Outcome:

  • Match ~95-98% of treated units (2-5% too extreme for good matches)
  • If match rate < 90%, consider: relaxing caliper, using optimal matching, or acknowledging that some treated units lack comparators
  • Document unmatched units' characteristics for external validity discussion

Step 4: Assess Covariate Balance

Standardized Mean Difference (SMD) for all 47 covariates:


def standardized_mean_diff(treated, control, var):
    mean_diff = treated[var].mean() - control[var].mean()
    pooled_std = np.sqrt(
        (treated[var].var() + control[var].var()) / 2
    )
    return mean_diff / pooled_std if pooled_std > 0 else 0

# Before matching
smd_before = {}
for var in covariates:
    smd_before[var] = standardized_mean_diff(
        treated_trimmed, control_trimmed, var
    )

# After matching
smd_after = {}
for var in covariates:
    smd_after[var] = standardized_mean_diff(
        matched_treated, matched_control, var
    )

# Create balance table
balance_df = pd.DataFrame({
    'Variable': covariates,
    'SMD_Before': [smd_before[v] for v in covariates],
    'SMD_After': [smd_after[v] for v in covariates],
    'Improved': [abs(smd_after[v]) < abs(smd_before[v]) for v in covariates]
})

print("Balance Assessment:")
print(f"Before matching: {(balance_df['SMD_Before'].abs() > 0.1).sum()}/47 "
      f"covariates imbalanced (|SMD| > 0.1)")
print(f"After matching: {(balance_df['SMD_After'].abs() > 0.1).sum()}/47 "
      f"covariates imbalanced")
print(f"Improved: {balance_df['Improved'].sum()}/47 covariates")
                    

Success Criteria:

  • Gold standard: All covariates with |SMD| < 0.1 (stringent)
  • Acceptable: All covariates with |SMD| < 0.25, most < 0.1
  • Problematic: Multiple covariates with |SMD| > 0.25 → matching failed, try a different strategy (a quick check against these thresholds is sketched below)
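
To check these thresholds programmatically, a small sketch using the balance_df built in the previous step:

# List the covariates that remain outside the |SMD| < 0.1 target after matching
flagged = balance_df[balance_df['SMD_After'].abs() > 0.1].copy()
flagged['abs_SMD_After'] = flagged['SMD_After'].abs()
print(flagged.sort_values('abs_SMD_After', ascending=False)[['Variable', 'SMD_Before', 'SMD_After']])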

Create Love Plot (Visual Balance Check):


import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 12))
y_pos = np.arange(len(covariates))

ax.scatter(balance_df['SMD_Before'], y_pos,
           label='Before Matching', alpha=0.6, s=50)
ax.scatter(balance_df['SMD_After'], y_pos,
           label='After Matching', alpha=0.6, s=50)

# Reference lines at ±0.1
ax.axvline(-0.1, color='gray', linestyle='--', alpha=0.5)
ax.axvline(0.1, color='gray', linestyle='--', alpha=0.5)
ax.axvline(0, color='black', linestyle='-', linewidth=0.5)

ax.set_yticks(y_pos)
ax.set_yticklabels(covariates, fontsize=8)
ax.set_xlabel('Standardized Mean Difference')
ax.set_title('Covariate Balance Before and After Matching (Love Plot)')
ax.legend()
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
                    

Good balance: All points clustered around 0 after matching, within ±0.1 region. Poor balance: Many points outside ±0.1, indicating residual confounding.

What if balance is poor for some covariates?

  1. Add interactions to PS model: If "orders_last_90days × urban" is imbalanced, add that interaction to propensity score model.
  2. Exact matching on problematic vars: Force exact matches on categorical vars with poor balance (e.g., geography).
  3. Caliper matching on specific vars: Use multivariate distance (Mahalanobis) including problematic continuous vars.
  4. Covariate adjustment post-matching: After matching, run a regression adjusting for the remaining imbalanced covariates (doubly robust; see the sketch after this list).
  5. Report limitation: If all else fails, acknowledge residual imbalance and conduct sensitivity analysis.
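
For remedy 4, a minimal sketch of regression adjustment on the matched sample. It assumes statsmodels is available and the covariates are numeric (one-hot encode categoricals first); matched_treated, matched_control, and covariates come from the earlier steps:

import pandas as pd
import statsmodels.api as sm

# Outcome regression on the matched sample; duplicated control rows from
# matching with replacement implicitly act as frequency weights
matched_df = pd.concat([matched_treated, matched_control], ignore_index=True)

X = sm.add_constant(matched_df[['treated'] + list(covariates)].astype(float))
ols = sm.OLS(matched_df['ordered'].astype(float), X).fit(cov_type='HC1')

# The coefficient on 'treated' is the covariate-adjusted ATT on the matched sample
# (this SE still ignores the matching/PS steps; bootstrap for formal inference)
print(f"Adjusted ATT: {ols.params['treated']:.4f} (SE {ols.bse['treated']:.4f})")
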
Step 5: Estimate ATT & Inference

Estimate Average Treatment Effect on the Treated (ATT):


# Simple difference in means on the matched sample
# (naive_diff is the unadjusted treated-vs-control gap computed before matching)
outcome_treated = matched_treated['ordered'].mean()
outcome_control = matched_control['ordered'].mean()
att = outcome_treated - outcome_control

print(f"\nTreatment Effect Estimates:")
print(f"  Naive (before matching): {naive_diff:.4f} ({naive_diff*100:.2f} pp)")
print(f"  ATT (after matching): {att:.4f} ({att*100:.2f} pp)")
print(f"  Bias removed: {(naive_diff - att):.4f} ({(naive_diff - att)*100:.2f} pp)")

# Example output:
#   Naive: 0.1000 (10.00 pp)
#   ATT: 0.0420 (4.20 pp)
#   Bias removed: 0.0580 (5.80 pp) — 58% of naive estimate was bias!
                    

Interpretation: Under our identification assumptions, the estimated causal effect is about 4.2pp, not 10pp. Most of the observed gap was selection bias.

Standard Errors & Confidence Intervals:

With matching, a naive SE is misleading: it ignores both the uncertainty from estimating the propensity score and the fact that controls are reused with replacement. Use a bootstrap that resamples treated units together with their matched controls, or specialized matching SEs (e.g., Abadie-Imbens); the plain bootstrap is only an approximation for nearest-neighbor matching with a fixed number of matches:


from scipy import stats

# Bootstrap SE: resample treated units and carry their matched controls along.
# Map each matched treated unit's index label to the positional indices of its
# matched controls in control_trimmed (kneighbors returns positions, not labels).
match_positions = {
    tid: indices[pos]
    for pos, tid in enumerate(treated_trimmed.index)
    if valid_matches[pos]
}

def bootstrap_att(n_bootstrap=1000, seed=42):
    rng = np.random.default_rng(seed)
    treated_ids = np.array(list(match_positions.keys()))
    att_boots = []
    for _ in range(n_bootstrap):
        # Resample treated units (with replacement)
        boot_ids = rng.choice(treated_ids, size=len(treated_ids), replace=True)
        treated_mean = treated_trimmed.loc[boot_ids, 'ordered'].mean()
        # Pool the matched controls of the resampled treated units
        control_pos = np.concatenate([match_positions[tid] for tid in boot_ids])
        control_mean = control_trimmed.iloc[control_pos]['ordered'].mean()
        att_boots.append(treated_mean - control_mean)

    return np.std(att_boots)

se_att = bootstrap_att(n_bootstrap=1000)
ci_lower = att - 1.96 * se_att
ci_upper = att + 1.96 * se_att

print(f"\nInference:")
print(f"  ATT: {att:.4f}")
print(f"  SE: {se_att:.4f}")
print(f"  95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f"  p-value: {2*(1-stats.norm.cdf(abs(att/se_att))):.4f}")

# Example: ATT=0.042, SE=0.008, 95% CI=[0.026, 0.058], p<0.001
                    

Business Interpretation:

ROI Calculation:


# DoorDash business metrics
att_percentage = 4.2  # pp
promo_cost = 5  # dollars
avg_order_value = 35  # dollars
commission_rate = 0.18  # DoorDash takes 18%
incremental_revenue_per_treated = att_percentage/100 * avg_order_value * commission_rate
net_value_per_promo = incremental_revenue_per_treated - promo_cost

print(f"Per-customer economics:")
print(f"  Incremental order probability: ${att_percentage}%")
print(f"  Incremental revenue to DoorDash: ${incremental_revenue_per_treated:.2f}")
print(f"  Promo cost: ${promo_cost:.2f}")
print(f"  Net value: ${net_value_per_promo:.2f}")

# Result: Net value = 0.27 - 5.00 = -4.73 ❌ UNPROFITABLE!
                      

⚠️ Campaign is not profitable with $5 discount!

Recommendation to stakeholders:

  • The causal effect (4.2pp) is real but smaller than naive estimate (10pp)
  • At a $5 discount, ROI is negative: about -$4.74 per customer
  • Breakeven promo cost: at the current lift, incremental commission revenue is only about 0.042 × $35 × 0.18 ≈ $0.26 per treated customer, so the per-customer promo cost would need to fall below roughly $0.26 (or the lift and targeting would need to improve substantially) to break even
  • Consider: Lower discount ($2-3), or target only high-value customers with higher CATE
  • Do NOT scale this campaign to 2M customers—would lose $9.5M
Step 6: Sensitivity Analysis (Rosenbaum Bounds)

Assess robustness to unmeasured confounding:

Rosenbaum bounds answer: "How strong would an unmeasured confounder need to be to overturn our conclusion?"


# Conceptual Rosenbaum bounds (in practice, compute these with the rbounds
# package in R or similar sensitivity-analysis tooling; the p-values below are
# illustrative, not computed here).
# Gamma = odds ratio by which an unmeasured confounder could shift the odds of treatment.
gamma_to_worst_case_p = {
    1.0: 0.001,  # no hidden bias (baseline): our result is significant
    1.5: 0.02,   # confounder raises treatment odds by 50%: still significant
    2.0: 0.15,   # confounder doubles treatment odds: no longer significant at alpha=0.05
    2.5: 0.40,   # clearly not significant
}

for gamma, p_upper in gamma_to_worst_case_p.items():
    print(f"Gamma = {gamma:.1f}: worst-case p-value = {p_upper:.3f}")

# Interpretation: an unmeasured confounder would need to roughly DOUBLE the
# odds of treatment to explain away our finding.
                    

What does Γ=2 mean practically?

We'd need an unmeasured confounder as strong as the following (illustrative examples from this setting):

  • "Customer lifetime value prediction": If marketing had an internal LTV model (unmeasured by you) that doubled treatment odds for high-LTV vs low-LTV customers, AND high-LTV customers are 2× more likely to order regardless of promo.
  • "Strategic targeting based on competitor activity": If DoorDash targeted areas with Uber Eats promotions (unmeasured), and those areas have baseline 2× order rates.

Assessment: These scenarios are plausible but not obvious, so our estimate is reasonably robust. Report the Γ≈2 threshold in the writeup.

Additional Robustness Checks:

  1. Placebo test: Estimate "effect" of promo on orders in the 30 days BEFORE promo send (should be zero).
    placebo_effect = estimate_att(outcome='ordered_before_promo') # Expect placebo_effect ≈ 0, p-value > 0.05
  2. Leave-one-out analysis: Drop each covariate from the PS model one at a time, re-match, and re-estimate the ATT (a sketch follows after this list). If the estimate changes drastically, that covariate is critical and worth investigating further.
  3. Alternative matching methods: Repeat with IPW, stratification, optimal matching. If all agree → strong evidence. If diverge → dig deeper.
  4. Subgroup heterogeneity: Estimate ATT separately for urban vs suburban, high vs low tenure, etc. Helps assess external validity and refine targeting.
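
A sketch of the leave-one-out check from item 2 above. Here fit_ps_and_match is a hypothetical helper that wraps Steps 1–5 (fit the PS model on the given covariates, re-match, return the ATT); it is not defined in this article:

# Leave-one-out: drop each covariate, re-run PS estimation + matching, re-estimate ATT.
# fit_ps_and_match(...) is a hypothetical wrapper around Steps 1-5 above.
loo_att = {}
for dropped in covariates:
    remaining = [c for c in covariates if c != dropped]
    loo_att[dropped] = fit_ps_and_match(remaining)

# Show the covariates whose removal moves the estimate the most
for var, est in sorted(loo_att.items(), key=lambda kv: abs(kv[1] - att), reverse=True)[:10]:
    print(f"Drop {var:<30} ATT = {est:.4f} (shift {est - att:+.4f})")
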
Step 7: Recommendation & Next Steps

Executive Summary:

Key Findings:

  • Causal effect: $5 promo increases order probability by 4.2 percentage points (95% CI: [2.6pp, 5.8pp]), not 10pp as initially estimated.
  • Selection bias: 58% of the naive estimate was due to targeting more engaged customers, not the promo itself.
  • ROI: The current campaign is unprofitable: about -$4.74 per customer. Scaling to 2M customers would lose ~$9.5M.
  • Robustness: Result is robust to moderate unmeasured confounding (Γ≈2), but acknowledge possibility of unmeasured LTV/strategic targeting.

Recommendations:

  1. Do NOT scale the $5 promo as designed—it destroys value.
  2. Test lower discount levels: Run an A/B test with $1, $2, and $3 discounts to measure how the lift scales with discount size; at the current 4.2pp lift, the breakeven promo cost is well under $1, so profitability likely requires a stronger lift at low discounts or tighter targeting.
  3. Personalize targeting: Use causal forests or meta-learners (Weeks 6-7) to identify high-CATE customers who respond better to promos. Target only them.
  4. Optimize promo timing: Our effect is average over 14 days. Test whether promo is more effective on weekends, lunch hours, or specific days.
  5. Long-term effects: Our analysis covers 14 days. Investigate whether promo creates habit formation (order again in weeks 3-8) or just pulls forward demand.

Caveat & Limitations:

  • Assumes no spillover effects (treated customers don't share promo codes with, or otherwise influence, control customers)
  • External validity: Effect may differ in different cities/times—need geographic heterogeneity analysis
  • Unmeasured confounding: If marketing used proprietary signals we don't have, bias could remain (Rosenbaum Γ=2 threshold)

Interview Talking Points (Handling Pushback):

Marketing VP: "But we saw 18% order rate with promo vs 8% without! How can you say the effect is only 4pp?"

Your response: "The 10pp gap reflects two things: (1) the causal effect of the promo, and (2) the fact that we targeted more engaged customers who order more anyway. By matching treated customers to similar controls (same engagement, geography, RFM), we isolate the causal effect at 4.2pp. The other 5.8pp is selection bias—those customers would have ordered at higher rates even without promos."

PM: "Why not just run an A/B test instead of this complicated matching?"

Your response: "Great question! An A/B test would be ideal, but (1) this campaign already ran, and (2) we need to make a decision now about scaling. Matching lets us extract causal insights from existing data. That said, I strongly recommend an A/B test for the optimized version ($1-3 discount) before full rollout. Use matching for retrospective analysis, A/B tests for prospective decisions."

CFO: "How confident are you in this -$4.73 loss estimate? Can we really trust matching?"

Your response: "The 4.2pp effect estimate is statistically robust (p<0.001, 95% CI tight, covariate balance excellent, sensitivity analysis shows Γ=2 threshold). The -$4.73 ROI uses our standard commission rate (18%) and AOV ($35). Even if we're off by 50% on the effect estimate, we'd still be at -$2.36—unprofitable. The directional insight is solid: $5 is too much. We need to either lower the discount or target high-value customers only."

Common Pitfalls in Matching Analysis
  • Pitfall 1: Not checking overlap/common support
    If treated and control PS distributions don't overlap, matching creates extrapolation bias. Always trim to common support and report match rates.
  • Pitfall 2: Ignoring residual imbalance after matching
    Matching reduces but may not eliminate imbalance. Always check SMD for all covariates post-matching. If |SMD|>0.1 remains for important vars, use doubly robust estimation (regression adjustment after matching).
  • Pitfall 3: Overfitting propensity score model
    Using too flexible a PS model (deep trees, no regularization) can overfit, creating perfect separation. Use cross-validation, limit tree depth, and check PS distribution.
  • Pitfall 4: Matching on outcomes or post-treatment variables
    Only include pre-treatment covariates in PS model. Including post-treatment vars or outcomes creates endogeneity bias.
  • Pitfall 5: Naive standard errors
    Standard errors must account for PS estimation uncertainty and matching structure. Use bootstrap or specialized matching SE methods, not naive t-tests.
  • Pitfall 6: Treating matching as proof of no confounding
    Matching only removes confounding from measured variables. Unmeasured confounders can still bias estimates. Always do sensitivity analysis (Rosenbaum bounds) and report limitations.
  • Pitfall 7: Not exploring treatment effect heterogeneity
    ATT is an average—effects vary across customers. Estimate subgroup effects or use CATE methods (Weeks 6-7) to optimize targeting and improve ROI.