
Module 2, Week 3: Matching & Propensity Scores

Article 3 of 13 · 20 min read

📊 Running Example: Promotional Discount Campaign

Remember our e-commerce question: Does offering a 20% discount promo code increase customer purchases?

In Week 1, we learned about potential outcomes and how randomization solves the fundamental problem of causal inference. In Week 2, we used DAGs to identify confounders. Now in Week 3, we'll learn how to adjust for those confounders using matching and propensity scores when we can't run an RCT.

The marketing team already sent promos to selected customers (not random!). We need to use observational data to estimate the causal effect.

1. The Observational Data Challenge

In Week 1, we saw that randomization solves selection bias by making treatment assignment independent of potential outcomes. But what if we can't randomize? What if our marketing team already sent promos to customers they chose—likely targeting loyal customers, recent browsers, or high-value accounts?

Our Observational Data Problem:

• 5,000 customers received the 20% off promo (Treatment group)

• 15,000 customers didn't receive it (Control group)

• Treatment assignment was not random—marketing targeted based on customer characteristics

Selection bias: Treated customers differ systematically from control customers. Simply comparing outcomes would conflate the treatment effect with pre-existing differences.

The solution? Matching and propensity scores—methods that attempt to create comparable groups by adjusting for observed confounders.

2. Identification Strategy: Conditional Independence

Before diving into methods, we need a key assumption that makes causal inference possible with observational data:

Conditional Independence Assumption (CIA)

Also called: Unconfoundedness, Ignorability, Selection on Observables

Formal statement: (Y(1), Y(0)) ⊥ D | X

In words: Conditional on observed covariates X, treatment assignment is independent of potential outcomes. This means we've measured all confounders—all variables that affect both treatment and outcomes.

What This Means for Our Example

From our DAG analysis (Week 2), suppose we identified these confounders:

  • Past purchases: Number of orders in last 90 days
  • Account age: Days since customer registered
  • Cart value: Total value of items currently in cart
  • Browsing activity: Page views in last 7 days
  • Email engagement: Open rate for marketing emails

CIA assumes that within groups of customers with the same values for these variables, promo assignment is "as-if random." If two customers have identical values for all X, the one who received the promo was essentially randomly selected.

⚠️ When CIA Fails:

CIA is untestable—we can't verify if unmeasured confounders exist. If marketing also targeted based on unobserved factors (e.g., customer sentiment from support interactions, predicted churn risk from a proprietary model), CIA is violated and our estimates will be biased. This is why domain knowledge and careful confounder selection are critical.

Positivity Assumption

Along with CIA, we need positivity (also called common support or overlap):

Positivity:

0 < P(D=1 | X=x) < 1 for all x in the support of X

In words: Every customer (or every type of customer characterized by X) must have a non-zero probability of receiving treatment AND control. We can't estimate treatment effects for customer types who always or never receive promos—there's no counterfactual comparison.

3. Exact Matching

The simplest approach: for each treated customer, find control customers with exactly the same values for all confounders X.

How Exact Matching Works

Example: Simple 3-Variable Matching

Suppose we simplify to 3 binary confounders:

  • High past purchases (Yes/No)
  • Active browser (Yes/No)
  • Email subscriber (Yes/No)

This creates 2³ = 8 possible "cells" or strata:

Cell | Past Purchases | Active Browser | Email Sub | Treated | Control
-----|----------------|----------------|-----------|---------|--------
1 | Yes | Yes | Yes | 800 | 1,200
2 | Yes | Yes | No | 600 | 800
... | ... | ... | ... | ... | ...
8 | No | No | No | 150 | 5,000

Strategy: Within each cell, compare treated vs. control customers. They have identical X values, so under CIA, any difference in outcomes is the causal effect for that cell.

Computing ATE with Exact Matching

ATE = Σcells wc × (Ȳtreated,c - Ȳcontrol,c)

where wc = proportion of population in cell c
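
A minimal sketch of this cell-weighted estimator, assuming a pandas DataFrame df with the binary confounder columns plus a 0/1 treatment column D and an outcome column Y (these names are placeholders for illustration):

import pandas as pd

def exact_matching_ate(df, confounders, treatment='D', outcome='Y'):
    """Weight each cell's treated-control difference by its population share."""
    ate, n_total = 0.0, len(df)
    for _, cell in df.groupby(confounders):
        treated = cell.loc[cell[treatment] == 1, outcome]
        control = cell.loc[cell[treatment] == 0, outcome]
        if len(treated) == 0 or len(control) == 0:
            continue  # cell has no counterfactual comparison (positivity issue)
        ate += (len(cell) / n_total) * (treated.mean() - control.mean())
    return ate

Cells containing only treated or only control customers are skipped, which is exactly the positivity problem discussed next.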

The Curse of Dimensionality

Exact matching works great with a few discrete confounders. But what if we have:

  • Many confounders: 10 binary variables = 2¹⁰ = 1,024 cells
  • Continuous confounders: How do we match on exact age (23.4 years) or exact cart value ($47.23)?

As dimensionality increases, many cells will have zero treated or zero control customers (violating positivity), making exact matching infeasible. This is called the curse of dimensionality.

💡 The Solution:

Instead of matching on all confounders X, match on a single number that summarizes all confounders: the propensity score.

4. The Propensity Score

4.1 Definition & Intuition

Propensity Score Definition:

e(X) = P(D=1 | X) = probability of receiving treatment given covariates X

In our example: The propensity score is the probability that a customer receives the 20% off promo, given their past purchases, browsing history, account age, etc.

Intuition: The propensity score is a balancing score—it's a one-dimensional summary of multidimensional confounders. Two customers with the same propensity score have the same probability of treatment, even if their individual covariate values differ.

Example: Understanding Propensity Scores

Three Customers:

  • Alice: 10 past purchases, 50 page views, $100 in cart → e(Alice) = 0.85
    High propensity: she's the type of customer marketing targets
  • Bob: 1 past purchase, 5 page views, $0 in cart → e(Bob) = 0.15
    Low propensity: unlikely to receive promo
  • Carol: 5 past purchases, 25 page views, $50 in cart → e(Carol) = 0.50
    Medium propensity: 50-50 chance

If we find a treated customer with e=0.50 and a control customer with e=0.50, they're comparable—even if their specific X values differ, they had the same likelihood of treatment.

4.2 The Propensity Score Theorem

Why can we reduce high-dimensional X to a single number e(X)? The answer is Rosenbaum and Rubin's (1983) fundamental theorem:

Propensity Score Theorem:

If (Y(1), Y(0)) ⊥ D | X (unconfoundedness given X)

Then (Y(1), Y(0)) ⊥ D | e(X) (unconfoundedness given propensity score)

In words: If treatment assignment is independent of potential outcomes conditional on X, it's also independent conditional on the propensity score e(X). The propensity score is a sufficient statistic for X with respect to balancing treatment and control groups.

Practical implication: Instead of matching on 20 confounders, we can match on just one number: e(X). This solves the curse of dimensionality.

4.3 Estimating Propensity Scores

The propensity score e(X) = P(D=1|X) is typically unknown and must be estimated from data. This is a standard classification/prediction problem.

Common Estimation Methods:

1. Logistic Regression (Most Common)

logit(e(X)) = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ

Fit a logistic regression with treatment D as the outcome and confounders X as predictors. Each customer's predicted probability ê(X) is their estimated propensity score.

2. Machine Learning Methods

  • Random Forests: Can capture non-linear relationships and interactions
  • Gradient Boosting (XGBoost, LightGBM): Often achieves best predictive performance
  • Neural Networks: For complex, high-dimensional confounders

Trade-off: More flexible models fit propensity scores better but may be less stable. If propensity scores are poorly estimated, downstream causal estimates will be biased.

What Variables to Include?

Use your DAG from Week 2! Include:

  • ✓ Confounders (affect both D and Y)
  • ✓ Predictors of Y (even if they don't affect D—improves precision)
  • ✗ Mediators (on causal path from D to Y—don't adjust for these)
  • ✗ Colliders (caused by both D and Y—adjusting creates bias)
  • ✗ Descendants of D (post-treatment variables—create bias)

5. Propensity Score Matching

Once we have estimated propensity scores ê(X) for all customers, we can match treated and control customers with similar scores.

Matching Algorithms

1. Nearest-Neighbor Matching (1:1)

For each treated unit, find the control unit with the closest propensity score.

  • With replacement: Control units can be matched multiple times (reduces bias, increases variance)
  • Without replacement: Each control used only once (increases bias, reduces variance)

2. K-Nearest Neighbors (1:K)

Match each treated unit to K control units (e.g., K=5), then average control outcomes. Reduces variance at the cost of some bias if matches are less exact.

3. Caliper Matching

Only match if |êtreated - êcontrol| < caliper (threshold). Discard treated units without good matches. Common caliper: 0.01 or 0.25 × SD(ê).

4. Radius Matching

Match each treated unit to all control units within a caliper. Balances bias-variance trade-off.
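
As a concrete illustration of options 3 and 4, here is a minimal radius-matching sketch; it assumes DataFrames treated and control that each carry a propensity_score column and the purchased outcome from the running example:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def radius_matching_att(treated, control, radius=0.01,
                        ps_col='propensity_score', outcome_col='purchased'):
    """For each treated unit, average ALL controls whose propensity score lies
    within `radius`; treated units with no control inside the radius are dropped."""
    nn = NearestNeighbors(radius=radius).fit(control[[ps_col]])
    _, neighbor_idx = nn.radius_neighbors(treated[[ps_col]])
    effects = [y_t - control.iloc[idx][outcome_col].mean()
               for y_t, idx in zip(treated[outcome_col], neighbor_idx)
               if len(idx) > 0]  # skip treated units with no match in radius
    return float(np.mean(effects))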

Computing the Treatment Effect

After matching, estimate the Average Treatment Effect on the Treated (ATT):

ATT = (1/NT) Σi:Di=1 [Yi - ȲM(i)]

where ȲM(i) = average outcome for control units matched to treated unit i

Example Walkthrough

Matched Pairs (1:1 Nearest Neighbor):

Treated Customer | ê(X) | Purchased? | Matched Control | ê(X) | Purchased?
-----------------|------|------------|-----------------|------|-----------
Alice | 0.85 | Yes (1) | David | 0.84 | Yes (1)
Bob | 0.50 | Yes (1) | Emma | 0.51 | No (0)
Carol | 0.30 | No (0) | Frank | 0.32 | No (0)

Individual treatment effects: Alice (1-1=0), Bob (1-0=1), Carol (0-0=0)

ATT = (0 + 1 + 0) / 3 = 0.33 → The promo increased purchase rate by 33 percentage points for treated customers

6. Inverse Probability Weighting (IPW)

Instead of matching, we can weight observations by the inverse of their propensity score to create balance.

IPW Intuition

Think of propensity scores as selection probabilities. Units with low e(X) are underrepresented in the treated group (unlikely to receive treatment). We up-weight these rare treated units to reflect the full population.

IPW Estimator for ATE:

ATE = E[Y(1)] - E[Y(0)]
= E[D·Y / e(X)] - E[(1-D)·Y / (1-e(X))]

Weights:
• Treated units: wi = 1/e(Xi)
• Control units: wi = 1/(1-e(Xi))

Example: How IPW Works

Three Customers:

Customer | e(X) | Treated? | Purchased? | Weight
---------|------|----------|------------|-------
Alice | 0.80 | Yes (1) | 1 | 1/0.80 = 1.25
Bob | 0.20 | Yes (1) | 1 | 1/0.20 = 5.0
Carol | 0.20 | No (0) | 0 | 1/(1-0.20) = 1.25

Key insight: Bob has low propensity (0.20) but still received treatment—he's a rare case, so we weight him heavily (5×). This up-weights underrepresented treatment combinations to mimic randomization.

Stabilized Weights (Common Improvement)

Extreme propensity scores (near 0 or 1) create extreme weights, increasing variance. Stabilized weights fix this:

SWi = P(Di=1) / e(Xi) for treated
SWi = P(Di=0) / (1-e(Xi)) for control

Stabilized weights reduce variance while maintaining consistency (they converge to the same ATE).
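
A minimal sketch of the stabilized-weight estimator, assuming NumPy arrays D (treatment indicator), Y (outcome), and e (estimated propensity scores):

import numpy as np

def stabilized_ipw_ate(D, Y, e):
    """ATE with stabilized weights: P(D=1)/e(X) for treated, P(D=0)/(1-e(X)) for control."""
    D, Y, e = np.asarray(D), np.asarray(Y), np.asarray(e)
    p_treated = D.mean()                                  # marginal P(D=1)
    sw = np.where(D == 1, p_treated / e, (1 - p_treated) / (1 - e))
    y1 = np.sum(sw * D * Y) / np.sum(sw * D)              # weighted mean, treated
    y0 = np.sum(sw * (1 - D) * Y) / np.sum(sw * (1 - D))  # weighted mean, control
    return y1 - y0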

Advantages & Disadvantages

✓ Advantages:

  • Uses all data (no discarding unmatched units)
  • Straightforward to estimate ATE (not just ATT)
  • Works well with large samples

✗ Disadvantages:

  • Sensitive to extreme weights (propensity scores near 0/1)
  • Requires accurate propensity score estimation
  • Poor overlap can lead to high variance

7. Stratification & Subclassification

Instead of matching or weighting, we can stratify (group) customers into bins based on propensity scores, then compare treated vs. control within each bin.

How Stratification Works

Step 1: Create Propensity Score Bins

Divide propensity scores into K strata (typically 5-10). Common approach: quintiles (5 bins of equal size).

Step 2: Estimate Treatment Effect Within Each Stratum

Within stratum k: ATEk = Ȳtreated,k - Ȳcontrol,k

Step 3: Aggregate Across Strata

ATE = Σk wk × ATEk

where wk = proportion of population in stratum k (or Nk/N)

Example: 5-Quintile Stratification

Stratum | e(X) Range | NT | NC | ȲT | ȲC | ATEk
--------|------------|------|------|------|------|------
1 (Low) | 0.0-0.2 | 200 | 3,800 | 0.15 | 0.10 | +0.05
2 | 0.2-0.4 | 800 | 3,200 | 0.25 | 0.18 | +0.07
3 | 0.4-0.6 | 1,500 | 2,500 | 0.35 | 0.28 | +0.07
4 | 0.6-0.8 | 1,800 | 2,200 | 0.48 | 0.40 | +0.08
5 (High) | 0.8-1.0 | 700 | 3,300 | 0.55 | 0.45 | +0.10

ATE ≈ 0.20 × 0.05 + 0.20 × 0.07 + 0.20 × 0.07 + 0.20 × 0.08 + 0.20 × 0.10 = 0.074
The promo increased purchase probability by 7.4 percentage points.

Classic result (Cochran, 1968): Stratifying on propensity scores into just 5 groups removes ~90% of bias due to confounding. More strata improve balance but reduce sample size per stratum.
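
A minimal sketch of this stratified estimator, assuming a DataFrame df with the propensity_score, received_promo, and purchased columns used in the running example:

import pandas as pd

def stratified_ate(df, k=5, ps_col='propensity_score',
                   treat_col='received_promo', outcome_col='purchased'):
    """Bin on propensity-score quantiles, then aggregate within-stratum
    treated-control differences with population-share weights."""
    strata = pd.qcut(df[ps_col], q=k, labels=False, duplicates='drop')
    ate = 0.0
    for _, stratum in df.groupby(strata):
        treated = stratum.loc[stratum[treat_col] == 1, outcome_col]
        control = stratum.loc[stratum[treat_col] == 0, outcome_col]
        if len(treated) == 0 or len(control) == 0:
            continue  # stratum lacks overlap; skipped
        ate += (len(stratum) / len(df)) * (treated.mean() - control.mean())
    return ate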

8. Assessing Covariate Balance

After matching/weighting/stratifying, we must check if we've achieved balance—whether treated and control groups have similar covariate distributions.

Why Balance Matters

Good balance is evidence that our method successfully removed confounding. If covariates are still imbalanced after adjustment, our causal estimates may be biased.

Standardized Mean Difference (SMD)

The most common balance metric:

SMD = (X̄treated - X̄control) / √[(s²treated + s²control) / 2]

Rule of thumb: |SMD| < 0.1 indicates good balance; |SMD| < 0.25 is acceptable.

Example: Balance Table

Before and After Matching:

Covariate | SMD (Before) | SMD (After) | Balanced?
----------|--------------|-------------|----------
Past purchases | 0.65 | 0.08 | ✓ Yes
Account age | 0.42 | 0.05 | ✓ Yes
Cart value | 0.58 | 0.12 | ~ Marginal
Page views | 0.51 | 0.07 | ✓ Yes
Email engagement | 0.39 | 0.09 | ✓ Yes

Before matching, treated customers had substantially more past purchases (SMD=0.65). After matching, groups are well-balanced.

Love Plots (Visual Balance Assessment)

A Love plot shows SMD for each covariate before and after adjustment, making it easy to spot remaining imbalances. Points should cluster near zero after adjustment.

9. Common Support & Positivity

Common support (or overlap) means treated and control groups have overlapping propensity score distributions. Without overlap, causal inference is impossible—we're comparing apples to oranges.

Visualizing Overlap

Good Overlap Example:

Treated: propensity scores range from 0.15 to 0.90
Control: propensity scores range from 0.10 to 0.85
→ Substantial overlap in [0.15, 0.85]

Poor Overlap Example:

Treated: propensity scores range from 0.70 to 0.95
Control: propensity scores range from 0.05 to 0.40
→ Almost no overlap! These are fundamentally different groups.

What to Do When Overlap is Poor

  • Trim the sample: Discard treated units with e(X) > 0.9 and control units with e(X) < 0.1. This changes the estimand (you're no longer estimating ATE for the full population, but for the overlap region).
  • Caliper matching: Only match within a narrow propensity score range, discarding poor matches.
  • Overlap weights: Instead of 1/e and 1/(1-e), weight treated units by (1-e) and control units by e (equivalent to tilting by e×(1-e)). This emphasizes units with moderate propensity scores (near 0.5) and targets the effect in the overlap population; see the sketch after this list.
  • Reconsider identification: If overlap is fundamentally poor, you may need a different identification strategy (e.g., instrumental variables, RDD, DiD).
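
A minimal sketch of the overlap-weight estimator referenced above, assuming arrays D, Y, and e as in the IPW section; treated units get weight 1-e and controls get weight e:

import numpy as np

def overlap_weighted_effect(D, Y, e):
    """Average effect in the overlap population: weight treated by (1-e), control by e."""
    D, Y, e = np.asarray(D), np.asarray(Y), np.asarray(e)
    w = np.where(D == 1, 1 - e, e)   # overlap weights never explode near 0 or 1
    y1 = np.sum(w * D * Y) / np.sum(w * D)
    y0 = np.sum(w * (1 - D) * Y) / np.sum(w * (1 - D))
    return y1 - y0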

⚠️ Warning:

Poor overlap violates the positivity assumption. Trying to force causal inference without overlap leads to unstable estimates driven by extreme weights or extrapolation. Always check overlap before reporting results.

10. Sensitivity Analysis

The biggest threat to propensity score methods is unmeasured confounding. Sensitivity analysis asks: "How strong would an unmeasured confounder need to be to change our conclusions?"

Rosenbaum's Sensitivity Analysis

The most common sensitivity analysis for matched studies. Assume an unmeasured confounder U affects treatment assignment. Rosenbaum bounds quantify how much hidden bias would overturn results.

Sensitivity Parameter Γ (Gamma):

Γ = max odds ratio of treatment due to unmeasured confounding

  • Γ = 1: No unmeasured confounding (matched pairs have equal treatment odds)
  • Γ = 2: One matched unit could be 2× more likely to receive treatment due to unmeasured U
  • Γ = 3: Up to 3× more likely, etc.

Interpreting Sensitivity Results

Example Results:

Estimated ATE = +0.10 (10 percentage point increase in purchases), p < 0.01

Sensitivity analysis: Result remains significant (p < 0.05) for Γ up to 1.8, but becomes non-significant at Γ = 2.0.

Interpretation: An unmeasured confounder would need to increase treatment odds by 2× (doubling the odds) to explain away our result. If such a strong confounder is plausible, our conclusions are fragile. If not, we have reasonable confidence.

Other Sensitivity Approaches

  • VanderWeele-Ding E-value: "What's the minimum strength of association (both with treatment and outcome) an unmeasured confounder would need to have to fully explain away the effect?" A computation sketch follows this list.
  • Simulation-based sensitivity: Simulate various unmeasured confounders and see how estimates change.
  • Placebo outcomes: Test your method on outcomes that shouldn't be affected by treatment. If you find spurious effects, your method may be biased.
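
A minimal sketch of the E-value computation, using the matched purchase rates reported later in this article (22.5% treated vs. 18.3% control) purely as illustrative numbers:

import math

def e_value(risk_ratio):
    """Minimum risk-ratio association an unmeasured confounder would need with
    both treatment and outcome to fully explain away the observed effect."""
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))

rr_observed = 0.225 / 0.183                      # ≈ 1.23 relative risk of ordering
print(f"E-value: {e_value(rr_observed):.2f}")    # ≈ 1.76

An E-value of about 1.76 says an unmeasured confounder would need to be associated with both promo receipt and ordering by a risk ratio of roughly 1.8, beyond the measured covariates, to fully explain away the effect.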

11. Practical Implementation in Python

Here's a complete workflow using Python libraries.

# Propensity Score Matching in Python

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# 1. Load data
df = pd.read_csv('promo_data.csv')
# Columns: customer_id, received_promo, purchased, past_purchases,
#          account_age, cart_value, page_views, email_engagement

# 2. Prepare confounders (X) and treatment (D)
confounders = ['past_purchases', 'account_age', 'cart_value',
               'page_views', 'email_engagement']
X = df[confounders]
D = df['received_promo']
Y = df['purchased']

# 3. Estimate propensity scores
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(X_scaled, D)
propensity_scores = ps_model.predict_proba(X_scaled)[:, 1]

df['propensity_score'] = propensity_scores

# 4. Check overlap
plt.hist(df[df['received_promo']==1]['propensity_score'],
         alpha=0.5, label='Treated', bins=30)
plt.hist(df[df['received_promo']==0]['propensity_score'],
         alpha=0.5, label='Control', bins=30)
plt.xlabel('Propensity Score')
plt.legend()
plt.title('Propensity Score Overlap')
plt.show()

# 5. Propensity Score Matching (1:1 nearest neighbor with caliper)
from sklearn.neighbors import NearestNeighbors

treated = df[df['received_promo'] == 1].reset_index(drop=True)
control = df[df['received_promo'] == 0].reset_index(drop=True)

# Fit NN on control propensity scores
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])

# For each treated, find nearest control
distances, indices = nn.kneighbors(treated[['propensity_score']])

# Apply caliper (e.g., 0.01)
caliper = 0.01
valid_matches = distances.flatten() < caliper
matched_treated = treated[valid_matches]
matched_control = control.iloc[indices.flatten()[valid_matches]]

print(f"Matched {len(matched_treated)} out of {len(treated)} treated units")

# 6. Estimate ATT
att = (matched_treated['purchased'].mean() -
       matched_control['purchased'].mean())
print(f"ATT: {att:.4f}")

# 7. Check balance (SMD)
def smd(treated, control, var):
    mean_diff = treated[var].mean() - control[var].mean()
    pooled_std = np.sqrt((treated[var].var() + control[var].var()) / 2)
    return mean_diff / pooled_std

print("\nStandardized Mean Differences (After Matching):")
for var in confounders:
    smd_val = smd(matched_treated, matched_control, var)
    print(f"{var}: {smd_val:.3f}")

# 8. IPW (Alternative approach)
# Calculate weights
df['ipw_weight'] = np.where(
    df['received_promo'] == 1,
    1 / df['propensity_score'],
    1 / (1 - df['propensity_score'])
)

# Trim extreme weights (optional)
df['ipw_weight'] = df['ipw_weight'].clip(upper=df['ipw_weight'].quantile(0.99))

# Estimate ATE with IPW
ate_ipw = (
    (df[df['received_promo']==1]['purchased'] *
     df[df['received_promo']==1]['ipw_weight']).sum() /
    df[df['received_promo']==1]['ipw_weight'].sum()
    -
    (df[df['received_promo']==0]['purchased'] *
     df[df['received_promo']==0]['ipw_weight']).sum() /
    df[df['received_promo']==0]['ipw_weight'].sum()
)
print(f"\nATE (IPW): {ate_ipw:.4f}")

Useful Python Libraries

  • causalml: Uplift modeling and meta-learners (we'll cover in Week 7)
  • DoWhy: Microsoft's causal inference library with DAG support (see the usage sketch after this list)
  • econml: Double ML, causal forests (Week 5-6)
  • psmpy: Dedicated propensity score matching library
  • statsmodels: IPW and regression-based methods
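
As a reference point, here is a sketch of how the DoWhy workflow mentioned above is typically wired up for this problem, reusing df and confounders from the code in this section; treat the method name as indicative, since API details vary across library versions:

from dowhy import CausalModel

# Declare treatment, outcome, and the confounders identified by the DAG
model = CausalModel(data=df,
                    treatment='received_promo',
                    outcome='purchased',
                    common_causes=confounders)

identified_estimand = model.identify_effect()
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_matching")
print(estimate.value)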

11.5 End-to-End Example: DoorDash Discount Campaign

Let's walk through a complete propensity score matching analysis for our promotional discount example.

🎯 Business Context

Scenario: DoorDash marketing team sent a "20% off your next order" promo code to 5,000 customers in March 2024. They targeted customers based on engagement signals (past orders, app opens, cart abandonment). 15,000 customers didn't receive the promo.

Business Question: Did the promo cause customers to order more? Or did we just give discounts to people who would have ordered anyway? We need the causal effect to decide if future promo campaigns are worth the margin hit.

Step 1: Data Structure & Variable Selection

Dataset: 20,000 customers observed over 30 days post-campaign

Variable | Description | Type
---------|-------------|-----
customer_id | Unique customer identifier | String
received_promo | Treatment indicator (1 = received promo) | Binary
ordered | Outcome: ordered in next 30 days (1=Yes, 0=No) | Binary
orders_90d | Orders in past 90 days (pre-treatment) | Count
avg_order_value | Average historical order value | Continuous ($)
days_since_last_order | Recency (days since last order) | Continuous
app_opens_7d | App opens in past 7 days | Count
cart_abandons_30d | Cart abandonments in past 30 days | Count
account_age_days | Days since account creation | Continuous
email_opens_30d | Marketing email opens in past 30 days | Count
is_dashpass | DashPass subscriber (1=Yes, 0=No) | Binary

The pre-treatment variables (orders_90d through is_dashpass) are confounders—they affect both promo targeting (marketing used them) and purchase outcomes (more engaged customers order more).

Step 2: Complete Python Workflow

# ===================================================================
# END-TO-END PROPENSITY SCORE MATCHING: DOORDASH PROMO CAMPAIGN
# ===================================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# ===================================================================
# STEP 1: LOAD AND EXPLORE DATA
# ===================================================================

# Load data
df = pd.read_csv('doordash_promo_campaign.csv')

print(f"Total customers: {len(df)}")
print(f"Treated (received promo): {df['received_promo'].sum()}")
print(f"Control (no promo): {(~df['received_promo'].astype(bool)).sum()}")
print(f"\nOutcome prevalence:")
print(df.groupby('received_promo')['ordered'].agg(['mean', 'count']))

# Naive comparison (BIASED - don't trust this!)
naive_diff = (df[df['received_promo']==1]['ordered'].mean() -
              df[df['received_promo']==0]['ordered'].mean())
print(f"\nNaive difference (BIASED): {naive_diff:.4f} ({naive_diff*100:.2f} pp)")

# ===================================================================
# STEP 2: CHECK PRE-TREATMENT IMBALANCE
# ===================================================================

confounders = ['orders_90d', 'avg_order_value', 'days_since_last_order',
               'app_opens_7d', 'cart_abandons_30d', 'account_age_days',
               'email_opens_30d', 'is_dashpass']

def standardized_mean_diff(treated, control, var):
    """Calculate standardized mean difference"""
    mean_diff = treated[var].mean() - control[var].mean()
    pooled_std = np.sqrt((treated[var].var() + control[var].var()) / 2)
    return mean_diff / pooled_std if pooled_std > 0 else 0

treated = df[df['received_promo'] == 1]
control = df[df['received_promo'] == 0]

print("\n" + "="*60)
print("PRE-MATCHING COVARIATE BALANCE")
print("="*60)
print(f"{'Variable':<25} {'SMD':>10} {'Balanced?':>12}")
print("-"*60)

smd_before = {}
for var in confounders:
    smd = standardized_mean_diff(treated, control, var)
    smd_before[var] = smd
    balanced = "✓ Yes" if abs(smd) < 0.1 else ("~ Marginal" if abs(smd) < 0.25 else "✗ No")
    print(f"{var:<25} {smd:>10.3f} {balanced:>12}")

# ===================================================================
# STEP 3: ESTIMATE PROPENSITY SCORES
# ===================================================================

print("\n" + "="*60)
print("ESTIMATING PROPENSITY SCORES")
print("="*60)

# Prepare features
X = df[confounders].copy()

# Standardize continuous variables for better convergence
scaler = StandardScaler()
continuous_vars = ['avg_order_value', 'days_since_last_order',
                   'account_age_days']
X[continuous_vars] = scaler.fit_transform(X[continuous_vars])

D = df['received_promo'].values
Y = df['ordered'].values

# Method 1: Logistic Regression (interpretable, common)
print("\nFitting logistic regression...")
ps_model_lr = LogisticRegression(max_iter=1000, random_state=42)
ps_model_lr.fit(X, D)
propensity_scores_lr = ps_model_lr.predict_proba(X)[:, 1]

# Method 2: Gradient Boosting (better fit, more flexible)
print("Fitting gradient boosting...")
ps_model_gb = GradientBoostingClassifier(n_estimators=100,
                                         max_depth=3,
                                         random_state=42)
ps_model_gb.fit(X, D)
propensity_scores_gb = ps_model_gb.predict_proba(X)[:, 1]

# We'll use gradient boosting for better prediction
propensity_scores = propensity_scores_gb
df['propensity_score'] = propensity_scores

print(f"\nPropensity score summary:")
print(df.groupby('received_promo')['propensity_score'].describe())

# ===================================================================
# STEP 4: CHECK OVERLAP (COMMON SUPPORT)
# ===================================================================

print("\n" + "="*60)
print("ASSESSING OVERLAP")
print("="*60)

ps_treated = df[df['received_promo']==1]['propensity_score']
ps_control = df[df['received_promo']==0]['propensity_score']

print(f"\nTreated range: [{ps_treated.min():.3f}, {ps_treated.max():.3f}]")
print(f"Control range: [{ps_control.min():.3f}, {ps_control.max():.3f}]")

overlap_min = max(ps_treated.min(), ps_control.min())
overlap_max = min(ps_treated.max(), ps_control.max())
print(f"Overlap region: [{overlap_min:.3f}, {overlap_max:.3f}]")

# Count customers outside overlap
treated_outside = (ps_treated < overlap_min) | (ps_treated > overlap_max)
control_outside = (ps_control < overlap_min) | (ps_control > overlap_max)
print(f"\nTreated outside overlap: {treated_outside.sum()} ({treated_outside.mean()*100:.1f}%)")
print(f"Control outside overlap: {control_outside.sum()} ({control_outside.mean()*100:.1f}%)")

# Visualize overlap
plt.figure(figsize=(10, 5))
plt.hist(ps_control, bins=50, alpha=0.5, label='Control', density=True)
plt.hist(ps_treated, bins=50, alpha=0.5, label='Treated', density=True)
plt.axvline(overlap_min, color='red', linestyle='--', alpha=0.5, label='Overlap boundary')
plt.axvline(overlap_max, color='red', linestyle='--', alpha=0.5)
plt.xlabel('Propensity Score')
plt.ylabel('Density')
plt.title('Propensity Score Distribution: Treated vs Control')
plt.legend()
plt.tight_layout()
plt.savefig('propensity_overlap.png', dpi=300, bbox_inches='tight')
print("\n[Overlap plot saved to propensity_overlap.png]")

# ===================================================================
# STEP 5: PROPENSITY SCORE MATCHING (1:1 NEAREST NEIGHBOR)
# ===================================================================

print("\n" + "="*60)
print("PERFORMING PROPENSITY SCORE MATCHING")
print("="*60)

# Trim to overlap region (optional but recommended)
df_trimmed = df[(df['propensity_score'] >= overlap_min) &
                (df['propensity_score'] <= overlap_max)].copy()

treated_trim = df_trimmed[df_trimmed['received_promo']==1].reset_index(drop=True)
control_trim = df_trimmed[df_trimmed['received_promo']==0].reset_index(drop=True)

print(f"\nAfter trimming to overlap:")
print(f"  Treated: {len(treated_trim)}")
print(f"  Control: {len(control_trim)}")

# 1:1 Nearest Neighbor Matching with Caliper
caliper = 0.01  # Maximum propensity score difference
print(f"\nUsing caliper: {caliper}")

nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control_trim[['propensity_score']])

distances, indices = nn.kneighbors(treated_trim[['propensity_score']])

# Apply caliper
valid_matches = distances.flatten() < caliper
matched_treated = treated_trim[valid_matches].copy()
matched_control = control_trim.iloc[indices.flatten()[valid_matches]].copy()

print(f"\nMatching results:")
print(f"  Matched pairs: {len(matched_treated)}")
print(f"  Unmatched treated (poor matches): {(~valid_matches).sum()}")
print(f"  Matching rate: {valid_matches.mean()*100:.1f}%")

# ===================================================================
# STEP 6: CHECK POST-MATCHING BALANCE
# ===================================================================

print("\n" + "="*60)
print("POST-MATCHING COVARIATE BALANCE")
print("="*60)
print(f"{'Variable':<25} {'SMD Before':>12} {'SMD After':>12} {'Improved?':>12}")
print("-"*60)

smd_after = {}
for var in confounders:
    smd_before_val = smd_before[var]
    smd_after_val = standardized_mean_diff(matched_treated, matched_control, var)
    smd_after[var] = smd_after_val

    improved = "✓" if abs(smd_after_val) < abs(smd_before_val) else "✗"
    balanced = "✓" if abs(smd_after_val) < 0.1 else ("~" if abs(smd_after_val) < 0.25 else "✗")

    print(f"{var:<25} {smd_before_val:>12.3f} {smd_after_val:>12.3f} {improved:>6} {balanced:>5}")

# Create Love plot
fig, ax = plt.subplots(figsize=(8, 6))
vars_sorted = sorted(confounders, key=lambda v: abs(smd_before[v]), reverse=True)
y_pos = np.arange(len(vars_sorted))

ax.scatter([smd_before[v] for v in vars_sorted], y_pos,
           label='Before Matching', marker='o', s=100)
ax.scatter([smd_after[v] for v in vars_sorted], y_pos,
           label='After Matching', marker='s', s=100)
ax.axvline(0, color='black', linestyle='-', linewidth=0.5)
ax.axvline(-0.1, color='gray', linestyle='--', linewidth=0.5, alpha=0.5)
ax.axvline(0.1, color='gray', linestyle='--', linewidth=0.5, alpha=0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(vars_sorted)
ax.set_xlabel('Standardized Mean Difference')
ax.set_title('Covariate Balance: Before vs After Matching (Love Plot)')
ax.legend()
plt.tight_layout()
plt.savefig('balance_love_plot.png', dpi=300, bbox_inches='tight')
print("\n[Love plot saved to balance_love_plot.png]")

# ===================================================================
# STEP 7: ESTIMATE TREATMENT EFFECT (ATT)
# ===================================================================

print("\n" + "="*60)
print("TREATMENT EFFECT ESTIMATION")
print("="*60)

# ATT from matched sample
y_treated = matched_treated['ordered'].values
y_control = matched_control['ordered'].values

att = y_treated.mean() - y_control.mean()

# Standard error and confidence interval (paired t-test)
differences = y_treated - y_control
se_att = differences.std() / np.sqrt(len(differences))
ci_lower = att - 1.96 * se_att
ci_upper = att + 1.96 * se_att

# T-test
t_stat, p_value = stats.ttest_rel(y_treated, y_control)

print(f"\nAverage Treatment Effect on the Treated (ATT):")
print(f"  Point estimate: {att:.4f} ({att*100:.2f} percentage points)")
print(f"  95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f"  Standard error: {se_att:.4f}")
print(f"  t-statistic: {t_stat:.3f}")
print(f"  p-value: {p_value:.4f}")

if p_value < 0.001:
    sig = "***"
elif p_value < 0.01:
    sig = "**"
elif p_value < 0.05:
    sig = "*"
else:
    sig = "ns"
print(f"  Significance: {sig}")

# Comparison
print(f"\nComparison:")
print(f"  Naive estimate (biased): {naive_diff:.4f} ({naive_diff*100:.2f} pp)")
print(f"  PSM estimate (ATT): {att:.4f} ({att*100:.2f} pp)")
print(f"  Difference: {abs(naive_diff - att):.4f} ({abs(naive_diff - att)*100:.2f} pp)")

# ===================================================================
# STEP 8: BUSINESS INTERPRETATION
# ===================================================================

print("\n" + "="*60)
print("BUSINESS IMPLICATIONS")
print("="*60)

# Calculate lift
baseline_rate = y_control.mean()
treated_rate = y_treated.mean()
lift = (treated_rate / baseline_rate - 1) * 100 if baseline_rate > 0 else 0

print(f"\nOrder Rates:")
print(f"  Without promo (matched control): {baseline_rate:.1%}")
print(f"  With promo (matched treated): {treated_rate:.1%}")
print(f"  Relative lift: {lift:.1f}%")

# ROI calculation (example numbers)
avg_order_value = 35  # $35 average order
discount_cost = 0.20 * avg_order_value  # 20% off
revenue_per_order = avg_order_value * 0.30  # 30% margin

incremental_orders = att  # Per customer
net_value = incremental_orders * revenue_per_order - discount_cost

print(f"\nSimplified ROI (per customer who received promo):")
print(f"  Incremental order probability: {att:.1%}")
print(f"  Avg order value: ${avg_order_value}")
print(f"  Discount cost per promo sent: ${discount_cost:.2f}")
print(f"  Margin per incremental order: ${revenue_per_order:.2f}")
print(f"  Net value per promo: ${net_value:.2f}")

if net_value > 0:
    print(f"  ✓ Campaign is PROFITABLE (positive ROI)")
else:
    print(f"  ✗ Campaign is UNPROFITABLE (negative ROI)")

print(f"\nScaled to campaign:")
campaign_size = 5000
total_incremental_orders = att * campaign_size
total_net_value = net_value * campaign_size
print(f"  Total incremental orders: {total_incremental_orders:.0f}")
print(f"  Total net value: ${total_net_value:,.2f}")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)

Step 3: Interpreting Results

Example Output Interpretation:

Naive estimate (biased): +0.0850 (8.50 pp)
PSM estimate (ATT): +0.0420 (4.20 pp)
Difference: 0.0430 (4.30 pp)

Order Rates:
  Without promo: 18.3%
  With promo: 22.5%
  Relative lift: 23.0%

Net value per promo: $1.05 → ✓ Campaign is PROFITABLE
Total net value: $5,250

Key Findings:

  • Selection bias was substantial: The naive estimate (8.5 pp) overstated the effect by 2×. Marketing targeted engaged customers who were already more likely to order.
  • True causal effect (ATT): +4.2 pp — The promo increased order probability by 4.2 percentage points for customers who received it. This is statistically significant (p < 0.01).
  • 23% relative lift: Among matched customers, the promo increased order rates from 18.3% to 22.5%.
  • Positive ROI: Even after accounting for the discount cost ($7), the incremental margin ($1.05 per customer) makes the campaign profitable. Total campaign value: $5,250.

Step 4: Critical Assumptions to Verify

Before presenting to stakeholders, check:

  1. Did we measure all confounders? Interview the marketing team. Did they use any targeting signals we don't have in our data? (e.g., predicted churn score, customer support interactions, referral status)
  2. Is overlap adequate? If >20% of treated customers are outside common support, results may not generalize to the full treated population.
  3. Are covariates balanced? If any |SMD| > 0.25 after matching, consider: adding interactions to propensity model, exact matching on that variable, or regression adjustment.
  4. Run sensitivity analysis: Use Rosenbaum bounds to quantify robustness to unmeasured confounding. Report the Γ value where results become non-significant.
  5. Placebo test: Estimate the "effect" on a pre-treatment outcome (e.g., orders 90-120 days before campaign). Should be near zero—if not, matching didn't work.
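
A minimal sketch of the placebo check in item 5, reusing matched_treated and matched_control from the workflow above and assuming a hypothetical pre-campaign outcome column named ordered_pre_period:

from scipy import stats

# 'ordered_pre_period' is a hypothetical pre-treatment outcome (orders placed
# 90-120 days before the campaign); the promo cannot have affected it.
placebo_att = (matched_treated['ordered_pre_period'].mean() -
               matched_control['ordered_pre_period'].mean())
t_stat, p_value = stats.ttest_rel(matched_treated['ordered_pre_period'],
                                  matched_control['ordered_pre_period'])
print(f"Placebo ATT: {placebo_att:.4f} (p = {p_value:.3f})")
# A placebo ATT far from zero signals residual confounding in the matching.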

Step 5: Communicating to Business Stakeholders

Translation for non-technical audience:

❌ Don't say:

"After estimating propensity scores via gradient boosting and performing 1:1 nearest neighbor matching with a 0.01 caliper, we achieved covariate balance (|SMD| < 0.1) and estimated the ATT using a paired t-test..."

✅ Do say:

"The 20% off promo increased order rates by 4.2 percentage points."

We compared customers who received the promo to very similar customers who didn't—matching them on past ordering behavior, app engagement, and other factors. This tells us the causal effect of the promo itself, not just correlation.

Bottom line: For every 100 promos we send to customers like those in our campaign, we generate about 4 extra orders. After accounting for the discount cost, we make $1.05 per promo sent. The campaign is profitable and should be scaled.

Caveat: This estimate assumes we've captured all major factors marketing used to target customers. If there were hidden targeting rules we don't have data on, the true effect could differ.

💡 Pro Tip:

Always validate your propensity score model. Check: (1) AUC > 0.6 (some predictive power), (2) balance improvement (Love plot), (3) overlap (histograms), and (4) sensitivity to model choice (compare logistic vs. ML methods). If results are highly sensitive to modeling choices, your estimates are fragile.

12. Key Takeaways

Propensity scores reduce high-dimensional confounders to a single balancing score e(X) = P(D=1|X)

Matching: Compare treated and control units with similar propensity scores (estimates ATT)

IPW: Weight observations by 1/e(X) and 1/(1-e(X)) to create pseudo-population (estimates ATE)

Stratification: Divide into propensity score bins and compare within each bin

Balance checking: Use SMD to verify covariates are balanced after adjustment

Overlap: Treated and control must have overlapping propensity score distributions (positivity)

Sensitivity analysis: Quantify robustness to unmeasured confounding (Rosenbaum bounds, E-values)

Conditional independence: Methods assume all confounders are observed—untestable but critical

13. Next Week Preview

Propensity scores help when we observe all confounders. But what if we have:

  • Unmeasured confounders that we can't observe?
  • Panel data (same customers over time)?
  • Natural experiments or quasi-random variation?

Next week, we'll learn Regression & Instrumental Variables:

  • Regression adjustment and its limitations
  • Difference-in-differences (DiD) for panel data
  • Instrumental variables (IV) and Two-Stage Least Squares (2SLS)
  • Regression discontinuity design (RDD)
  • Fixed effects for time-invariant confounders

These methods relax the "observe all confounders" assumption and are crucial for real-world causal inference.

Practice Problems: Matching & Propensity Scores

Test your understanding of propensity score methods with these problems. A worked solution follows each one.

Question 1: Understanding Propensity Scores

Two customers have different characteristics but the same propensity score (e(X) = 0.6). Alice: 10 past purchases, 2 years account age. Bob: 2 past purchases, 8 years account age.

a) What does it mean that they have the same propensity score?
b) Can we match Alice and Bob for causal inference? Why or why not?
c) What assumption must hold for this matching to be valid?

Solution

a) Having the same propensity score means Alice and Bob have the same probability of receiving the promo treatment, even though their individual characteristics differ. The propensity score "balances" their different covariate profiles into a single comparable dimension.

b) Yes, we can match them! The propensity score theorem tells us that if unconfoundedness holds given X, it also holds given e(X). Even though Alice and Bob differ on individual covariates, they are comparable with respect to treatment assignment because they have the same propensity score.

c) Conditional independence must hold: (Y(1), Y(0)) ⊥ D | X. This means we've measured all confounders—all variables that affect both promo assignment and purchase outcomes. If there are unmeasured confounders, matching on propensity scores won't remove all bias.

Question 2: IPW Calculation

Consider three customers:

  • Alice: e(X) = 0.8, Treated (D=1), Purchased (Y=1)
  • Bob: e(X) = 0.2, Treated (D=1), Purchased (Y=1)
  • Carol: e(X) = 0.5, Control (D=0), Did not purchase (Y=0)

a) Calculate the IPW weight for each customer.
b) Why is Bob's weight so much larger than Alice's?
c) Explain intuitively what these weights are doing.

Solution

a) IPW weights:

  • Alice (treated): w = 1/e(X) = 1/0.8 = 1.25
  • Bob (treated): w = 1/e(X) = 1/0.2 = 5.0
  • Carol (control): w = 1/(1-e(X)) = 1/(1-0.5) = 2.0

b) Bob's weight is much larger because he has a low propensity score (0.2) but still received treatment. He represents a rare case—someone who had only a 20% chance of getting the promo but got it anyway. The IPW method up-weights these rare cases to correct for selection bias.

c) Intuition: IPW creates a "pseudo-population" where treatment is independent of covariates. By weighting inversely to the probability of receiving their actual treatment, we simulate what would have happened if everyone had an equal chance of treatment—mimicking randomization. Units in underrepresented treatment-covariate combinations get higher weights to balance out the sample.

Question 3: Overlap Problem

After estimating propensity scores, you find:
• Treated customers: propensity scores range from 0.65 to 0.95
• Control customers: propensity scores range from 0.05 to 0.45

a) What problem does this indicate?
b) What would happen if you applied IPW without addressing this issue?
c) Suggest two approaches to handle this situation.

Solution

a) Problem: This indicates severe lack of overlap (common support violation). The treated and control groups occupy almost completely different regions of the propensity score distribution. There are essentially no comparable units—treated customers are fundamentally different from control customers.

b) IPW without addressing this:

  • Control units with e(X) = 0.05 would get weights of 1/(1-0.05) ≈ 1.05 (reasonable)
  • But you'd need to extrapolate to estimate E[Y(1)] for these low-propensity units—what would they have done if treated?
  • Treated units with e(X) = 0.95 would get weights of 1/0.95 ≈ 1.05, but control units at e(X) = 0.45 would get weights of 1/(1-0.45) ≈ 1.82
  • Main issue: You're comparing very different types of customers. Estimates would rely heavily on model extrapolation and functional form assumptions, likely producing biased and unstable results.

c) Two approaches:

  1. Trimming: Restrict analysis to the overlap region (e.g., 0.45 ≤ e(X) ≤ 0.65). Discard treated units with e(X) > 0.65 and control units with e(X) < 0.45. This changes your estimand—you're now estimating the ATE only for customers in the overlap region, not the full population.
  2. Use a different identification strategy: The lack of overlap suggests conditional independence may not be the right approach. Consider instrumental variables, difference-in-differences (if you have panel data), or regression discontinuity (if there's a threshold-based treatment rule). The fundamental problem is that these groups are too different to be directly compared.

Question 4: Balance Assessment

After propensity score matching, you check covariate balance and find:
• Past purchases: SMD = 0.05
• Account age: SMD = 0.08
• Cart value: SMD = 0.35

a) Which covariates are well-balanced? Which are not?
b) What might cause cart value to remain imbalanced?
c) Should you proceed with your analysis? What are your options?

Solution

a) Balance assessment:

  • Past purchases (SMD=0.05): Well-balanced ✓ (|SMD| < 0.1)
  • Account age (SMD=0.08): Well-balanced ✓ (|SMD| < 0.1)
  • Cart value (SMD=0.35): Not balanced ✗ (|SMD| > 0.25, significantly imbalanced)

b) Possible causes of cart value imbalance:

  • Cart value might have a non-linear relationship with treatment that your propensity score model didn't capture (e.g., quadratic or threshold effects)
  • Cart value might interact with other covariates in determining treatment
  • Cart value distribution might be highly skewed, making it hard to match on
  • The matching algorithm might have prioritized matching on other covariates

c) Should you proceed? Options:

Don't proceed with current results—cart value imbalance suggests your method hasn't fully removed confounding.

Options to improve:

  1. Improve propensity score model: Add cart_value² or cart_value × other_variables to capture non-linearity
  2. Covariate adjustment: After matching, include cart value as a covariate in your outcome regression (doubly robust approach; see the sketch after this solution)
  3. Exact matching on cart value: Match exactly on cart value bins (e.g., $0-25, $25-50, etc.) AND propensity score
  4. Stratification: Divide data by cart value quartiles, then do propensity score matching within each quartile
  5. Different method: Consider IPW or weighting methods that might achieve better balance

If cart value is a strong confounder (affects both treatment and outcome), proceeding with imbalance could bias your ATE estimate.
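
A minimal sketch of option 2 above (post-matching regression adjustment), assuming matched DataFrames matched_treated and matched_control like those built in Section 11; statsmodels provides the outcome regression:

import pandas as pd
import statsmodels.api as sm

# Stack the matched sample and regress the outcome on treatment plus the
# still-imbalanced covariate; the treatment coefficient is the adjusted ATT.
matched = pd.concat([matched_treated.assign(is_treated=1),
                     matched_control.assign(is_treated=0)], ignore_index=True)

X = sm.add_constant(matched[['is_treated', 'cart_value']])
fit = sm.OLS(matched['purchased'], X).fit(cov_type='HC1')   # robust SEs
print(fit.params['is_treated'], fit.bse['is_treated'])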

Question 5: Sensitivity Analysis Interpretation

You estimate ATT = +0.12 (12 percentage point increase in purchases) using propensity score matching, with p < 0.001. Rosenbaum sensitivity analysis shows:
• Γ = 1.5: p < 0.01 (still significant)
• Γ = 2.0: p = 0.08 (no longer significant at 5%)
• Γ = 2.5: p = 0.15

a) Interpret what Γ = 2.0 means in this context.
b) Are your results robust to unmeasured confounding? Explain.
c) Give an example of an unmeasured confounder that might have Γ ≈ 2.

Solution

a) Interpretation of Γ = 2.0:

Γ = 2.0 means that two customers with identical observed covariates (X) could differ in their odds of receiving treatment by a factor of 2 due to an unmeasured confounder U. Formally:

P(D=1|X,U) / P(D=0|X,U) ≤ 2 × P(D=1|X,U') / P(D=0|X,U')

In practical terms: One customer might be twice as likely to receive the promo compared to a matched customer, because of some unmeasured factor.

b) Robustness assessment:

Results are moderately robust but not bulletproof:

  • Positive sign: The result is fairly robust—it would take a moderate unmeasured confounder (Γ=2) to eliminate statistical significance. Weak unmeasured confounders (Γ < 2) wouldn't overturn the conclusion.
  • Caution needed: If there's a plausible unmeasured confounder that could double treatment odds, your conclusions are fragile. The effect might disappear entirely.
  • Domain knowledge critical: You need to think hard about whether such a confounder exists. If you've measured all major confounders, Γ=2 might be implausible. If not, be cautious.

c) Example unmeasured confounder with Γ ≈ 2:

Customer lifetime value (LTV) prediction score from a proprietary model:

  • Affects treatment: Marketing might have used an internal ML model to predict LTV and preferentially targeted high-LTV customers for promos (even though you don't have access to this score in your data). This could double the odds of treatment for high vs. low LTV customers.
  • Affects outcome: High-LTV customers are more likely to purchase regardless of promos, creating correlation between the unmeasured LTV score and purchase outcomes.
  • Why it matters: If you haven't measured LTV or its components (which might be derived from features you don't have, like customer support interactions, predicted churn, or business-specific engagement scores), this unmeasured confounder could bias your estimate.

If this scenario seems plausible, you should: (1) try to obtain the LTV score and include it, (2) conduct interviews with the marketing team to understand their targeting strategy, or (3) report results with appropriate caution about unmeasured confounding.

Business Case Study: Interview Approach

📊 Case: DoorDash Promotional Campaign Effectiveness

Context: You're a senior data scientist at DoorDash. The marketing team ran a promotional campaign offering $5 off for customers who hadn't ordered in the last 30 days (lapsed customers). They targeted 50,000 customers based on a combination of engagement scores, location, and past ordering patterns. Another 200,000 lapsed customers did not receive the promo.

Initial analysis shows:

  • Promo recipients: 18% made an order within 14 days
  • Non-recipients: 8% made an order within 14 days
  • Naive estimate: 10 percentage point lift (18% - 8%)

The marketing team wants to scale this campaign to all 2M lapsed customers. However, you notice that promo recipients were not randomly selected—they were chosen algorithmically based on likelihood to respond. You also have rich customer data: order history, restaurant preferences, location, time patterns, app engagement, customer support interactions, and more.

Your task: Use propensity score matching to estimate the causal effect of the $5 promo on order conversion. Assess covariate balance, check overlap, conduct sensitivity analysis, and make a recommendation about scaling the campaign.

Step 1: Data Understanding & Initial Assessment

Clarifying Questions:

  • Treatment assignment: How exactly were customers selected for the promo? Was there a propensity model? What features drove selection?
    Answer: Marketing used an internal "reactivation score" predicting likelihood to return, selecting top 50K. Score based on recency, frequency, monetary value (RFM), engagement.
  • Outcome measurement: How do we define "order"? Same-day orders? Any order value threshold?
    Answer: Any completed order ≥$10 within 14 days of promo send date.
  • Available covariates: What customer features do we have pre-treatment?
    Answer: 47 features including RFM metrics, restaurant diversity, avg order value, geography (ZIP), app usage, customer tenure, preferred cuisines, time-of-day patterns, device type.
  • Data quality: Missing values? Measurement errors? Selection issues?
    Answer: ~3% missing for app usage metrics (customers who never downloaded app), ZIP available for 98%.
  • External validity: Was this campaign run in specific markets or nationwide?
    Answer: Nationwide, but proportionally more recipients in urban areas with high restaurant density.

Initial Descriptive Analysis:


# Check sample composition
Treated (n=50,000):
  Mean orders last 90 days: 3.2
  Mean AOV: $28.50
  Urban %: 78%
  High engagement %: 62%

Control (n=200,000):
  Mean orders last 90 days: 1.8
  Mean AOV: $22.10
  Urban %: 58%
  High engagement %: 31%

⚠️ Substantial pre-treatment imbalance!
                    

The treated group is clearly different—more engaged, urban, higher AOV. The naive 10pp estimate is confounded.

Step 2: Estimate Propensity Scores

Model Selection for Propensity Score:

With 47 covariates, we need a flexible model. Consider:

Option 1: Logistic Regression (Baseline)

P(Promo=1 | X) = logit⁻¹(β₀ + β₁X₁ + ... + β₄₇X₄₇)
  • Pros: Interpretable, fast, works well with proper feature engineering
  • Cons: Assumes linear relationships (in logit scale), no interactions unless specified
  • Enhancement: Add key interactions (RFM × Urban, Tenure × Engagement), polynomial terms for continuous vars

Option 2: Gradient Boosted Trees (Recommended)

from sklearn.ensemble import GradientBoostingClassifier
  • Pros: Captures non-linearities automatically, handles interactions, robust to outliers
  • Cons: Black box, harder to diagnose misspecification
  • Best practice: Use with moderate depth (max_depth=4-6) to avoid overfitting propensity scores

Option 3: Ensemble (Doubly Robust)

  • Fit multiple models (logistic, GBM, random forest), average propensity scores
  • Reduces sensitivity to any single model misspecification
  • More computation but worth it for high-stakes decisions

Implementation & Diagnostics:


from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Fit propensity score model
ps_model = GradientBoostingClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    random_state=42
)

# Fit on a training split (X_train), then score the full sample (X_full) to limit in-sample overfitting
ps_model.fit(X_train, treatment_train)
propensity_scores = ps_model.predict_proba(X_full)[:, 1]

# Check propensity score distribution
print(f"Treated PS range: [{propensity_scores[treated==1].min():.3f}, "
      f"{propensity_scores[treated==1].max():.3f}]")
print(f"Control PS range: [{propensity_scores[treated==0].min():.3f}, "
      f"{propensity_scores[treated==0].max():.3f}]")
                    

Common Issue: Extreme Propensity Scores

  • If some treated units have PS ≈ 1.0 or controls have PS ≈ 0.0, there's no overlap
  • These units are "deterministic" based on observables—can't find good matches
  • Solution: Trim sample to common support (e.g., 0.1 < PS < 0.9)

Check Overlap Visually:


import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.hist(propensity_scores[treated==1], bins=50, alpha=0.5,
         label='Treated', density=True)
plt.hist(propensity_scores[treated==0], bins=50, alpha=0.5,
         label='Control', density=True)
plt.xlabel('Propensity Score')
plt.ylabel('Density')
plt.legend()
plt.title('Propensity Score Distribution: Check for Overlap')

# Identify poor overlap regions
overlap_min = max(propensity_scores[treated==1].min(),
                   propensity_scores[treated==0].min())
overlap_max = min(propensity_scores[treated==1].max(),
                   propensity_scores[treated==0].max())
print(f"Common support region: [{overlap_min:.3f}, {overlap_max:.3f}]")
                    

Good overlap: Substantial distribution overlap with similar densities in [0.2, 0.8] range. Poor overlap: Treated mostly >0.8, control mostly <0.2 (weak identification).

Step 3: Choose & Execute Matching Strategy

Matching Algorithm Trade-offs:

Recommended: 1:4 Nearest Neighbor with Caliper

Match each treated unit to 4 nearest control units (by PS distance) within caliper=0.05

Rationale:

  • 1:4 ratio: We have 4× more controls than treated (200K vs 50K). Using multiple matches improves precision without sacrificing balance.
  • Caliper=0.05: Only match if PS distance < 0.05. Prevents bad matches. Trade-off: lose some treated units with poor matches.
  • Why not 1:1? Would waste data—we'd only use 50K of 200K controls. Lower power.
  • Why not IPW instead? With selection on observables this strong, IPW would create extreme weights. Matching is more robust.

Alternative: Optimal Full Matching

  • Creates matched sets with varying ratios (some 1:1, some 1:5, etc.), minimizing total distance
  • Pros: Uses all data, optimal balance
  • Cons: Computationally expensive for 250K observations, harder to explain to stakeholders
  • When to use: If you have time/compute and want maximum precision

Alternative: Stratification

  • Divide PS into quintiles, estimate effects within each, aggregate
  • Pros: Simple, uses all data, easy to interpret
  • Cons: Residual imbalance within strata if heterogeneity is high
  • When to use: As a robustness check or if matching fails due to poor overlap

Execute Matching:


from sklearn.neighbors import NearestNeighbors

# Trim to common support first
in_support = (propensity_scores >= 0.1) & (propensity_scores <= 0.9)
df_trimmed = df[in_support].copy()

treated_trimmed = df_trimmed[df_trimmed['treated'] == 1]
control_trimmed = df_trimmed[df_trimmed['treated'] == 0]

# 1:4 nearest neighbor matching with caliper
caliper = 0.05
nn = NearestNeighbors(n_neighbors=4, metric='euclidean')
nn.fit(control_trimmed[['propensity_score']].values)

distances, indices = nn.kneighbors(
    treated_trimmed[['propensity_score']].values
)

# Apply caliper: only keep matches within distance threshold
valid_matches = distances.max(axis=1) < caliper
matched_treated = treated_trimmed[valid_matches]
matched_control_indices = indices[valid_matches].flatten()
matched_control = control_trimmed.iloc[matched_control_indices]

print(f"Matched {len(matched_treated)} out of {len(treated_trimmed)} treated units")
print(f"Match rate: {len(matched_treated)/len(treated_trimmed)*100:.1f}%")
print(f"Using {len(matched_control)} control observations (with replacement)")
                    

Expected Outcome:

  • Match ~95-98% of treated units (2-5% too extreme for good matches)
  • If match rate < 90%, consider: relaxing caliper, using optimal matching, or acknowledging that some treated units lack comparators
  • Document unmatched units' characteristics for external validity discussion

Step 4: Assess Covariate Balance

Standardized Mean Difference (SMD) for all 47 covariates:


def standardized_mean_diff(treated, control, var):
    mean_diff = treated[var].mean() - control[var].mean()
    pooled_std = np.sqrt(
        (treated[var].var() + control[var].var()) / 2
    )
    return mean_diff / pooled_std if pooled_std > 0 else 0

# Before matching
smd_before = {}
for var in covariates:
    smd_before[var] = standardized_mean_diff(
        treated_trimmed, control_trimmed, var
    )

# After matching
smd_after = {}
for var in covariates:
    smd_after[var] = standardized_mean_diff(
        matched_treated, matched_control, var
    )

# Create balance table
balance_df = pd.DataFrame({
    'Variable': covariates,
    'SMD_Before': [smd_before[v] for v in covariates],
    'SMD_After': [smd_after[v] for v in covariates],
    'Improved': [abs(smd_after[v]) < abs(smd_before[v]) for v in covariates]
})

print("Balance Assessment:")
print(f"Before matching: {(balance_df['SMD_Before'].abs() > 0.1).sum()}/47 "
      f"covariates imbalanced (|SMD| > 0.1)")
print(f"After matching: {(balance_df['SMD_After'].abs() > 0.1).sum()}/47 "
      f"covariates imbalanced")
print(f"Improved: {balance_df['Improved'].sum()}/47 covariates")
                    

Success Criteria:

  • Gold standard: All covariates with |SMD| < 0.1 (stringent)
  • Acceptable: All covariates with |SMD| < 0.25, most < 0.1
  • Problematic: Multiple covariates with |SMD| > 0.25 → matching failed, try a different strategy (a quick check against these thresholds is sketched below)
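
To check these thresholds programmatically, a small sketch using the balance_df built in the previous step:

# List the covariates that remain outside the |SMD| < 0.1 target after matching
flagged = balance_df[balance_df['SMD_After'].abs() > 0.1].copy()
flagged['abs_SMD_After'] = flagged['SMD_After'].abs()
print(flagged.sort_values('abs_SMD_After', ascending=False)[['Variable', 'SMD_Before', 'SMD_After']])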

Create Love Plot (Visual Balance Check):


import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 12))
y_pos = np.arange(len(covariates))

ax.scatter(balance_df['SMD_Before'], y_pos,
           label='Before Matching', alpha=0.6, s=50)
ax.scatter(balance_df['SMD_After'], y_pos,
           label='After Matching', alpha=0.6, s=50)

# Reference lines at ±0.1
ax.axvline(-0.1, color='gray', linestyle='--', alpha=0.5)
ax.axvline(0.1, color='gray', linestyle='--', alpha=0.5)
ax.axvline(0, color='black', linestyle='-', linewidth=0.5)

ax.set_yticks(y_pos)
ax.set_yticklabels(covariates, fontsize=8)
ax.set_xlabel('Standardized Mean Difference')
ax.set_title('Covariate Balance Before and After Matching (Love Plot)')
ax.legend()
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
                    

Good balance: All points clustered around 0 after matching, within ±0.1 region. Poor balance: Many points outside ±0.1, indicating residual confounding.

What if balance is poor for some covariates?

  1. Add interactions to PS model: If "orders_last_90days × urban" is imbalanced, add that interaction to propensity score model.
  2. Exact matching on problematic vars: Force exact matches on categorical vars with poor balance (e.g., geography).
  3. Caliper matching on specific vars: Use multivariate distance (Mahalanobis) including problematic continuous vars.
  4. Covariate adjustment post-matching: After matching, run a regression adjusting for the remaining imbalanced covariates (doubly robust; see the sketch after this list).
  5. Report limitation: If all else fails, acknowledge residual imbalance and conduct sensitivity analysis.
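
For remedy 4, a minimal sketch of regression adjustment on the matched sample. It assumes statsmodels is available and the covariates are numeric (one-hot encode categoricals first); matched_treated, matched_control, and covariates come from the earlier steps:

import pandas as pd
import statsmodels.api as sm

# Outcome regression on the matched sample; duplicated control rows from
# matching with replacement implicitly act as frequency weights
matched_df = pd.concat([matched_treated, matched_control], ignore_index=True)

X = sm.add_constant(matched_df[['treated'] + list(covariates)].astype(float))
ols = sm.OLS(matched_df['ordered'].astype(float), X).fit(cov_type='HC1')

# The coefficient on 'treated' is the covariate-adjusted ATT on the matched sample
# (this SE still ignores the matching/PS steps; bootstrap for formal inference)
print(f"Adjusted ATT: {ols.params['treated']:.4f} (SE {ols.bse['treated']:.4f})")
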
Step 5: Estimate ATT & Inference

Estimate Average Treatment Effect on the Treated (ATT):


# Simple difference in means on the matched sample
# (naive_diff is the unadjusted treated-vs-control gap computed before matching)
outcome_treated = matched_treated['ordered'].mean()
outcome_control = matched_control['ordered'].mean()
att = outcome_treated - outcome_control

print(f"\nTreatment Effect Estimates:")
print(f"  Naive (before matching): {naive_diff:.4f} ({naive_diff*100:.2f} pp)")
print(f"  ATT (after matching): {att:.4f} ({att*100:.2f} pp)")
print(f"  Bias removed: {(naive_diff - att):.4f} ({(naive_diff - att)*100:.2f} pp)")

# Example output:
#   Naive: 0.1000 (10.00 pp)
#   ATT: 0.0420 (4.20 pp)
#   Bias removed: 0.0580 (5.80 pp) — 58% of naive estimate was bias!
                    

Interpretation: Under our identification assumptions, the estimated causal effect is about 4.2pp, not 10pp. Most of the observed gap was selection bias.

Standard Errors & Confidence Intervals:

With matching, a naive SE is misleading: it ignores both the uncertainty from estimating the propensity score and the fact that controls are reused with replacement. Use a bootstrap that resamples treated units together with their matched controls, or specialized matching SEs (e.g., Abadie-Imbens); the plain bootstrap is only an approximation for nearest-neighbor matching with a fixed number of matches:


from scipy import stats

# Bootstrap SE: resample treated units and carry their matched controls along.
# Map each matched treated unit's index label to the positional indices of its
# matched controls in control_trimmed (kneighbors returns positions, not labels).
match_positions = {
    tid: indices[pos]
    for pos, tid in enumerate(treated_trimmed.index)
    if valid_matches[pos]
}

def bootstrap_att(n_bootstrap=1000, seed=42):
    rng = np.random.default_rng(seed)
    treated_ids = np.array(list(match_positions.keys()))
    att_boots = []
    for _ in range(n_bootstrap):
        # Resample treated units (with replacement)
        boot_ids = rng.choice(treated_ids, size=len(treated_ids), replace=True)
        treated_mean = treated_trimmed.loc[boot_ids, 'ordered'].mean()
        # Pool the matched controls of the resampled treated units
        control_pos = np.concatenate([match_positions[tid] for tid in boot_ids])
        control_mean = control_trimmed.iloc[control_pos]['ordered'].mean()
        att_boots.append(treated_mean - control_mean)

    return np.std(att_boots)

se_att = bootstrap_att(n_bootstrap=1000)
ci_lower = att - 1.96 * se_att
ci_upper = att + 1.96 * se_att

print(f"\nInference:")
print(f"  ATT: {att:.4f}")
print(f"  SE: {se_att:.4f}")
print(f"  95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f"  p-value: {2*(1-stats.norm.cdf(abs(att/se_att))):.4f}")

# Example: ATT=0.042, SE=0.008, 95% CI=[0.026, 0.058], p<0.001
                    

Business Interpretation:

ROI Calculation:


# DoorDash business metrics
att_percentage = 4.2  # pp
promo_cost = 5  # dollars
avg_order_value = 35  # dollars
commission_rate = 0.18  # DoorDash takes 18%
incremental_revenue_per_treated = att_percentage/100 * avg_order_value * commission_rate
net_value_per_promo = incremental_revenue_per_treated - promo_cost

print(f"Per-customer economics:")
print(f"  Incremental order probability: ${att_percentage}%")
print(f"  Incremental revenue to DoorDash: ${incremental_revenue_per_treated:.2f}")
print(f"  Promo cost: ${promo_cost:.2f}")
print(f"  Net value: ${net_value_per_promo:.2f}")

# Result: Net value = 0.27 - 5.00 = -4.73 ❌ UNPROFITABLE!
                      

⚠️ Campaign is not profitable with $5 discount!

Recommendation to stakeholders:

  • The causal effect (4.2pp) is real but smaller than naive estimate (10pp)
  • At a $5 discount, ROI is negative: about -$4.74 per customer
  • Breakeven promo cost: at the current lift, incremental commission revenue is only about 0.042 × $35 × 0.18 ≈ $0.26 per treated customer, so the per-customer promo cost would need to fall below roughly $0.26 (or the lift and targeting would need to improve substantially) to break even
  • Consider: Lower discount ($2-3), or target only high-value customers with higher CATE
  • Do NOT scale this campaign to 2M customers—would lose $9.5M
Step 6: Sensitivity Analysis (Rosenbaum Bounds)

Assess robustness to unmeasured confounding:

Rosenbaum bounds answer: "How strong would an unmeasured confounder need to be to overturn our conclusion?"


# Conceptual Rosenbaum bounds (in practice, compute these with the rbounds
# package in R or similar sensitivity-analysis tooling; the p-values below are
# illustrative, not computed here).
# Gamma = odds ratio by which an unmeasured confounder could shift the odds of treatment.
gamma_to_worst_case_p = {
    1.0: 0.001,  # no hidden bias (baseline): our result is significant
    1.5: 0.02,   # confounder raises treatment odds by 50%: still significant
    2.0: 0.15,   # confounder doubles treatment odds: no longer significant at alpha=0.05
    2.5: 0.40,   # clearly not significant
}

for gamma, p_upper in gamma_to_worst_case_p.items():
    print(f"Gamma = {gamma:.1f}: worst-case p-value = {p_upper:.3f}")

# Interpretation: an unmeasured confounder would need to roughly DOUBLE the
# odds of treatment to explain away our finding.
                    

What does Γ=2 mean practically?

We'd need an unmeasured confounder as strong as the following (illustrative examples from this setting):

  • "Customer lifetime value prediction": If marketing had an internal LTV model (unmeasured by you) that doubled treatment odds for high-LTV vs low-LTV customers, AND high-LTV customers are 2× more likely to order regardless of promo.
  • "Strategic targeting based on competitor activity": If DoorDash targeted areas with Uber Eats promotions (unmeasured), and those areas have baseline 2× order rates.

Assessment: These scenarios are plausible but not obvious, so our estimate is reasonably robust. Report the Γ≈2 threshold in the writeup.

Additional Robustness Checks:

  1. Placebo test: Estimate "effect" of promo on orders in the 30 days BEFORE promo send (should be zero).
    placebo_effect = estimate_att(outcome='ordered_before_promo') # Expect placebo_effect ≈ 0, p-value > 0.05
  2. Leave-one-out analysis: Drop each covariate from the PS model one at a time, re-match, and re-estimate the ATT (a sketch follows after this list). If the estimate changes drastically, that covariate is critical and worth investigating further.
  3. Alternative matching methods: Repeat with IPW, stratification, optimal matching. If all agree → strong evidence. If diverge → dig deeper.
  4. Subgroup heterogeneity: Estimate ATT separately for urban vs suburban, high vs low tenure, etc. Helps assess external validity and refine targeting.
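
A sketch of the leave-one-out check from item 2 above. Here fit_ps_and_match is a hypothetical helper that wraps Steps 1–5 (fit the PS model on the given covariates, re-match, return the ATT); it is not defined in this article:

# Leave-one-out: drop each covariate, re-run PS estimation + matching, re-estimate ATT.
# fit_ps_and_match(...) is a hypothetical wrapper around Steps 1-5 above.
loo_att = {}
for dropped in covariates:
    remaining = [c for c in covariates if c != dropped]
    loo_att[dropped] = fit_ps_and_match(remaining)

# Show the covariates whose removal moves the estimate the most
for var, est in sorted(loo_att.items(), key=lambda kv: abs(kv[1] - att), reverse=True)[:10]:
    print(f"Drop {var:<30} ATT = {est:.4f} (shift {est - att:+.4f})")
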
Step 7: Recommendation & Next Steps

Executive Summary:

Key Findings:

  • Causal effect: $5 promo increases order probability by 4.2 percentage points (95% CI: [2.6pp, 5.8pp]), not 10pp as initially estimated.
  • Selection bias: 58% of the naive estimate was due to targeting more engaged customers, not the promo itself.
  • ROI: The current campaign is unprofitable: about -$4.74 per customer. Scaling to 2M customers would lose ~$9.5M.
  • Robustness: Result is robust to moderate unmeasured confounding (Γ≈2), but acknowledge possibility of unmeasured LTV/strategic targeting.

Recommendations:

  1. Do NOT scale the $5 promo as designed—it destroys value.
  2. Test lower discount levels: Run an A/B test with $1, $2, and $3 discounts to measure how the lift scales with discount size; at the current 4.2pp lift, the breakeven promo cost is well under $1, so profitability likely requires a stronger lift at low discounts or tighter targeting.
  3. Personalize targeting: Use causal forests or meta-learners (Weeks 6-7) to identify high-CATE customers who respond better to promos. Target only them.
  4. Optimize promo timing: Our effect is average over 14 days. Test whether promo is more effective on weekends, lunch hours, or specific days.
  5. Long-term effects: Our analysis covers 14 days. Investigate whether promo creates habit formation (order again in weeks 3-8) or just pulls forward demand.

Caveat & Limitations:

  • Assumes no spillover effects (treated customers don't share promo codes with, or otherwise influence, control customers)
  • External validity: Effect may differ in different cities/times—need geographic heterogeneity analysis
  • Unmeasured confounding: If marketing used proprietary signals we don't have, bias could remain (Rosenbaum Γ=2 threshold)

Interview Talking Points (Handling Pushback):

Marketing VP: "But we saw 18% order rate with promo vs 8% without! How can you say the effect is only 4pp?"

Your response: "The 10pp gap reflects two things: (1) the causal effect of the promo, and (2) the fact that we targeted more engaged customers who order more anyway. By matching treated customers to similar controls (same engagement, geography, RFM), we isolate the causal effect at 4.2pp. The other 5.8pp is selection bias—those customers would have ordered at higher rates even without promos."

PM: "Why not just run an A/B test instead of this complicated matching?"

Your response: "Great question! An A/B test would be ideal, but (1) this campaign already ran, and (2) we need to make a decision now about scaling. Matching lets us extract causal insights from existing data. That said, I strongly recommend an A/B test for the optimized version ($1-3 discount) before full rollout. Use matching for retrospective analysis, A/B tests for prospective decisions."

CFO: "How confident are you in this -$4.73 loss estimate? Can we really trust matching?"

Your response: "The 4.2pp effect estimate is statistically robust (p<0.001, 95% CI tight, covariate balance excellent, sensitivity analysis shows Γ=2 threshold). The -$4.73 ROI uses our standard commission rate (18%) and AOV ($35). Even if we're off by 50% on the effect estimate, we'd still be at -$2.36—unprofitable. The directional insight is solid: $5 is too much. We need to either lower the discount or target high-value customers only."

Common Pitfalls in Matching Analysis
  • Pitfall 1: Not checking overlap/common support
    If treated and control PS distributions don't overlap, matching creates extrapolation bias. Always trim to common support and report match rates.
  • Pitfall 2: Ignoring residual imbalance after matching
    Matching reduces but may not eliminate imbalance. Always check SMD for all covariates post-matching. If |SMD|>0.1 remains for important vars, use doubly robust estimation (regression adjustment after matching).
  • Pitfall 3: Overfitting propensity score model
    Using too flexible a PS model (deep trees, no regularization) can overfit, creating perfect separation. Use cross-validation, limit tree depth, and check PS distribution.
  • Pitfall 4: Matching on outcomes or post-treatment variables
    Only include pre-treatment covariates in PS model. Including post-treatment vars or outcomes creates endogeneity bias.
  • Pitfall 5: Naive standard errors
    Standard errors must account for PS estimation uncertainty and matching structure. Use bootstrap or specialized matching SE methods, not naive t-tests.
  • Pitfall 6: Treating matching as proof of no confounding
    Matching only removes confounding from measured variables. Unmeasured confounders can still bias estimates. Always do sensitivity analysis (Rosenbaum bounds) and report limitations.
  • Pitfall 7: Not exploring treatment effect heterogeneity
    ATT is an average—effects vary across customers. Estimate subgroup effects or use CATE methods (Weeks 6-7) to optimize targeting and improve ROI.