1. The Problem: ML Meets Causality
We've learned classical causal inference methods (matching, regression, IV), but they all struggle when you have many confounders (high-dimensional X). Traditional approaches tend to:
- Make restrictive assumptions about functional form (linear regression)
- Suffer from curse of dimensionality (matching with 200 covariates is hopeless)
- Require manual feature engineering and model specification
Machine learning excels at prediction with complex, high-dimensional data. Random forests, neural networks, and gradient boosting can model intricate relationships without manual specification. Can we use ML for causal inference?
⚠️ The Challenge:
Naively plugging ML into causal estimation leads to biased estimates and invalid confidence intervals. Regularization bias, overfitting, and improper cross-validation all break standard inference.
Double/Debiased Machine Learning (DML) solves this by carefully separating the prediction task (where ML shines) from the causal estimation task (where we need unbiased estimates and valid inference).
2. Why Naive ML Fails for Causal Inference
Consider the partially linear model for our promo example:
Y = θD + g(X) + ε,   E[ε | D, X] = 0

Where Y is purchase amount, D is promo treatment (0/1), X is our 200+ features, g(X) is an unknown function capturing how the features affect spending, and θ is the ATE we want to estimate.
Naive Approach: Use Lasso for Everything
We might try:
- Include treatment D and all covariates X in a Lasso regression
- Read off the coefficient on D as our causal effect estimate θ̂
This fails in three ways:
Problem 1: Regularization Bias
Lasso shrinks coefficients toward zero to prevent overfitting. But we want an unbiased estimate of θ! The regularization penalty contaminates our causal estimate.
Example: True θ = $15 average increase in spending from promo. Lasso might shrink this to θ̂ = $12 because it's penalizing all coefficients. Even with n→∞, the bias doesn't vanish.
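To make the shrinkage concrete, here is a toy simulation sketch (hypothetical numbers, with a randomized treatment and a fixed penalty chosen purely for illustration, so the only bias at work is regularization):

```python
# Toy illustration of regularization bias (hypothetical setup, not the promo data).
# D is randomized here, so any gap between the estimate and the true effect of $15
# comes from the Lasso penalty shrinking the coefficient on D toward zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, theta = 5000, 200, 15.0

X = rng.normal(size=(n, p))                      # 200 covariates
D = rng.binomial(1, 0.5, size=n)                 # randomized treatment
Y = theta * D + 5 * X[:, :20].sum(axis=1) + rng.normal(scale=10, size=n)

naive = Lasso(alpha=1.0).fit(np.column_stack([D, X]), Y)   # fixed penalty for illustration
print(f"true effect: {theta}, naive Lasso estimate: {naive.coef_[0]:.2f}")
# Typically prints an estimate noticeably below 15; with a confounded treatment and
# penalized controls, the bias gets worse and its direction is harder to predict.
```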
Problem 2: Overfitting to Nuisance Functions
If we use the same data to estimate g(X) and θ, the ML model "learns" spurious patterns. The estimate of θ picks up noise from how well g(X) fits in-sample.
Why it matters: Your standard errors will be wrong. You might think θ̂ is precisely estimated, but you've overfit to idiosyncrasies of your sample. Out-of-sample, the effect could be very different.
Problem 3: Invalid Confidence Intervals
Standard errors from regularized regression don't account for the model selection process. The post-selection inference problem means naive CIs have incorrect coverage.
Consequence: You claim "95% confident θ is between $12-$18" but in reality, the true coverage might be only 70%. Scientific conclusions based on these intervals are unreliable.
💡 The Insight:
We need to use ML for nuisance estimation (g(X), propensity scores) while keeping θ estimation separate and unbiased. DML achieves this through orthogonalization and sample splitting.
3. Neyman Orthogonality: The Key Innovation
The breakthrough idea: construct an estimating equation that's insensitive to small errors in estimating nuisance parameters.
The Setup: Partially Linear Model
Y = θD + g(X) + ε,   E[ε | D, X] = 0
D = m(X) + v,   E[v | X] = 0

θ is our causal parameter. The nuisance functions g(X) (how the covariates shift the outcome) and m(X) = E[D|X] (the propensity model) are what we estimate with ML.
From Direct Estimation to Orthogonal Score
Instead of directly regressing Y on D and X, DML uses the orthogonal (Neyman) score:
ψ(Wi; θ, η) = (Yi − θDi − g(Xi)) · (Di − m(Xi))

Where Wi = (Yi, Di, Xi) and η = (g, m) collects the nuisance parameters.
Why "Orthogonal"?
This score has a magical property: it's locally insensitive to errors in ĝ and m̂. Mathematically, the derivative of E[ψ] with respect to the nuisance parameters η is zero at the true values:

∂η E[ψ(W; θ₀, η)] |η=η₀ = 0
In plain English: Small mistakes in estimating g and m only have second-order effects on our estimate of θ. Errors in nuisances have to interact with each other to bias θ—first-order errors cancel out!
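For readers who want to see the cancellation explicitly, here is a short sketch of the calculation in the partially linear model above (θ₀, g₀, m₀ are the true values; h_g and h_m are arbitrary perturbations of the nuisances):

```latex
\begin{aligned}
\partial_r\,\mathbb{E}\big[(Y-\theta_0 D-g_0(X)-r\,h_g(X))\,(D-m_0(X))\big]\big|_{r=0}
  &= -\,\mathbb{E}\big[h_g(X)\,\mathbb{E}[D-m_0(X)\mid X]\big] = 0, \\
\partial_r\,\mathbb{E}\big[(Y-\theta_0 D-g_0(X))\,(D-m_0(X)-r\,h_m(X))\big]\big|_{r=0}
  &= -\,\mathbb{E}\big[\varepsilon\,h_m(X)\big] = 0 .
\end{aligned}
```

The first line vanishes because E[D − m₀(X) | X] = 0 by definition of the propensity model; the second because the outcome error ε = Y − θ₀D − g₀(X) has conditional mean zero given (D, X).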
🎯 Intuition: Why Residual-on-Residual Works
The orthogonal score multiplies two residuals:
- Ỹi = Yi - θDi - g(Xi): outcome residual after removing treatment effect and X
- D̃i = Di - m(Xi): treatment residual after removing X
If ĝ is slightly wrong, it affects Ỹ. If m̂ is slightly wrong, it affects D̃. But in the product, these errors are "orthogonal" (perpendicular)—they don't accumulate linearly. You need both to be wrong in correlated ways for θ̂ to be biased.
Promo Example: Orthogonal Score in Action
Customer i data:
- Yi = $50 (spent $50)
- Di = 1 (received 20% off promo)
- Xi = [browsing history, demographics, ...] (200 features)
ML predictions:
- ĝ(Xi) = $30 (expected spending given X, regardless of promo)
- m̂(Xi) = 0.7 (70% chance of getting promo given X)
- θ̂ = $15 (current estimate of promo effect)
Compute orthogonal score:
Ỹi = 50 − 15·1 − 30 = 5 (outcome residual), D̃i = 1 − 0.7 = 0.3 (treatment residual), so ψi = Ỹi · D̃i = 5 × 0.3 = 1.5.
We want E[ψ] = 0 at the true θ. If E[ψ] > 0, θ̂ is too small; if E[ψ] < 0, θ̂ is too large. DML searches for θ̂ that sets the average score to zero.
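Setting the sample average of the score to zero and solving for θ gives the residual-on-residual formula that reappears in the algorithm below. A quick sketch, writing the score in terms of the residuals Ỹi = Yi − ĝ(Xi) and D̃i = Di − m̂(Xi) computed in the cross-fitting sections below (note Yi − θDi − g(Xi) = Ỹi − θD̃i, so this is the same score):

```latex
\frac{1}{n}\sum_{i=1}^{n}\big(\tilde Y_i-\theta\,\tilde D_i\big)\,\tilde D_i = 0
\qquad\Longrightarrow\qquad
\hat\theta \;=\; \frac{\sum_{i}\tilde D_i\,\tilde Y_i}{\sum_{i}\tilde D_i^{2}} .
```

This is why the final step of the algorithm below is simply a no-intercept regression of Ỹ on D̃.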
4. Cross-Fitting and Sample Splitting
Orthogonality protects against regularization bias, but we still face overfitting bias. If we use the same data to train ĝ and m̂ that we use to compute θ̂, we create dependence between estimates.
The Problem: In-Sample Overfitting
Suppose we fit a Random Forest for ĝ(X) on all data, then use those predictions to estimate θ.
Issue: The RF has memorized patterns specific to our sample. When we compute residuals Ỹi = Yi - ĝ(Xi), the predictions ĝ(Xi) are "too good" on the training data—artificially small residuals.
This overfitting creates spurious correlation between Ỹ and D̃, biasing θ̂ even though the score is orthogonal!
The Solution: Cross-Fitting (K-Fold Sample Splitting)
Ensure that nuisance function predictions are always out-of-sample:
Cross-Fitting Algorithm:
- Split data into K folds (typically K=2 or K=5)
- For each fold k = 1, ..., K:
- Train ĝ and m̂ on all folds except k
- Predict ĝ(Xi) and m̂(Xi) for observations in fold k
- Compute residuals: Ỹi = Yi - ĝ(Xi), D̃i = Di - m̂(Xi)
- Pool residuals from all folds
- Estimate θ̂ by regressing Ỹ on D̃: θ̂ = (Σ D̃iỸi) / (Σ D̃i²)
Visual: 2-Fold Cross-Fitting
Fold 1 (observations 1-500):
- Train ĝ₂ and m̂₂ on Fold 2 (observations 501-1000)
- Predict ĝ₂(Xi), m̂₂(Xi) for i = 1, ..., 500
- Compute residuals Ỹi, D̃i for Fold 1
Fold 2 (observations 501-1000):
- Train ĝ₁ and m̂₁ on Fold 1 (observations 1-500)
- Predict ĝ₁(Xi), m̂₁(Xi) for i = 501, ..., 1000
- Compute residuals Ỹi, D̃i for Fold 2
Result: All predictions are out-of-sample!
Each observation gets nuisance predictions from a model that never saw that observation during training. This eliminates overfitting bias while still using all data for both nuisance estimation and θ estimation.
🔑 Key Insight: Best of Both Worlds
Unlike train/test split (which discards half the data), cross-fitting uses all data for both training ML models and estimating θ, while maintaining statistical independence needed for valid inference.
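As a compact sketch (assuming arrays X, Y, D are already loaded), scikit-learn's cross_val_predict produces exactly these out-of-fold predictions; the next section spells out the same loop by hand:

```python
# Cross-fitting in a few lines: every prediction comes from a model
# trained on the folds that do NOT contain that observation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

g_hat = cross_val_predict(RandomForestRegressor(n_estimators=500), X, Y, cv=5)   # out-of-fold E[Y|X]
m_hat = cross_val_predict(RandomForestClassifier(n_estimators=500), X, D, cv=5,
                          method="predict_proba")[:, 1]                          # out-of-fold P(D=1|X)

Y_tilde, D_tilde = Y - g_hat, D - m_hat
theta_hat = np.sum(D_tilde * Y_tilde) / np.sum(D_tilde ** 2)                     # residual-on-residual slope
```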
5. The DML Algorithm Step-by-Step
Let's walk through DML for our promo campaign with 200+ features:
Complete DML Procedure:
- Partition data: Randomly split your n observations into K folds of equal size
```python
# Python example
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = list(kf.split(X))   # each entry: (indices outside fold k, indices in fold k)
```
- For each fold k:
- Define auxiliary sample: Ik = all observations except fold k
- Estimate outcome model: Train ĝk(x) = E[Y|X=x] on Ik
- Estimate treatment model: Train m̂k(x) = E[D|X=x] on Ik (for binary treatment, predict the probability)
- Predict on fold k: For observations in fold k, compute ĝk(Xi) and m̂k(Xi)
```python
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

g_pred = np.zeros(len(Y))
m_pred = np.zeros(len(Y))
for I_k, fold_k in folds:                       # I_k = auxiliary sample, fold_k = fold k
    g_model = RandomForestRegressor(n_estimators=500)
    g_model.fit(X[I_k], Y[I_k])                 # outcome model trained on the auxiliary sample

    m_model = RandomForestClassifier(n_estimators=500)
    m_model.fit(X[I_k], D[I_k])                 # treatment model trained on the auxiliary sample

    g_pred[fold_k] = g_model.predict(X[fold_k])                 # out-of-sample g-hat
    m_pred[fold_k] = m_model.predict_proba(X[fold_k])[:, 1]     # out-of-sample propensity
```
- Compute residuals: For all i, compute
```python
Y_tilde = Y - g_pred   # Outcome residuals
D_tilde = D - m_pred   # Treatment residuals
```
- Estimate treatment effect: Regress Ỹ on D̃
```python
theta_hat = np.sum(D_tilde * Y_tilde) / np.sum(D_tilde ** 2)
```
This is equivalent to running OLS of Ỹi on D̃i (the slope coefficient).
- Inference: Compute standard error using the orthogonal score (see next section)
Promo Campaign Example: Full Walkthrough
Setup:
- n = 10,000 customers
- 200 features X: browsing behavior, demographics, purchase history, email engagement, etc.
- Treatment D: 1 if received 20% off promo, 0 otherwise (5,000 treated, 5,000 control)
- Outcome Y: spending in next 30 days (continuous, $0-$500)
DML Execution:
- Split into 5 folds (2,000 customers each)
- For Fold 1:
- Train RF on Folds 2-5 (8,000 customers) to predict Y from X → ĝ₁
- Train RF on Folds 2-5 to predict D from X → m̂₁
- Predict ĝ₁(Xi), m̂₁(Xi) for 2,000 Fold 1 customers
- Compute Ỹi, D̃i for Fold 1
- Repeat for Folds 2, 3, 4, 5 (each fold gets out-of-sample predictions)
- Pool: We now have Ỹi, D̃i for all 10,000 customers
- Regress Ỹ on D̃: θ̂ = $14.73 (estimated effect of 20% promo on spending)
- Compute SE(θ̂) = $2.10 (from orthogonal score variance)
- 95% CI: [$10.61, $18.85]
Interpretation: Receiving the 20% off promo causes customers to spend $14.73 more on average over the next 30 days, with 95% confidence the true effect is between $10.61 and $18.85.
6. Variance Estimation and Statistical Inference
A major advantage of DML is that we get valid standard errors and confidence intervals, despite using ML for nuisance estimation.
Asymptotic Normality
Under regularity conditions (nuisance functions converge fast enough, orthogonality holds), the DML estimator is asymptotically normal with variance σ² that can be consistently estimated.
Computing Standard Errors
The variance σ² is based on the orthogonal score: σ² = E[ψ²] / (E[D̃²])², estimated in four steps:
Step 1: Compute orthogonal scores: ψ̂i = (Ỹi − θ̂D̃i) · D̃i
Step 2: Estimate variance: σ̂² = (1/n Σi ψ̂i²) / (1/n Σi D̃i²)²
Step 3: Standard error: SE(θ̂) = σ̂ / √n
Step 4: Confidence interval (95%): θ̂ ± 1.96 · SE(θ̂)
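A minimal sketch of these steps in code (assuming Y_tilde, D_tilde, and theta_hat from the cross-fitting algorithm above):

```python
import numpy as np

# Orthogonal scores evaluated at theta_hat
psi = (Y_tilde - theta_hat * D_tilde) * D_tilde

n = len(Y_tilde)
J = np.mean(D_tilde ** 2)                             # (minus the) derivative of the score in theta
sigma2 = np.mean(psi ** 2) / J ** 2                   # asymptotic variance of sqrt(n)*(theta_hat - theta)
se = np.sqrt(sigma2 / n)                              # SE(theta_hat)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)   # 95% confidence interval
```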
Hypothesis Testing
Test H₀: θ = 0 (no treatment effect) using a standard t-test, with t = θ̂ / SE(θ̂):
Promo example: θ̂ = $14.73, SE = $2.10
t = 14.73 / 2.10 = 7.01
p-value ≈ 0.000 (highly significant)
✓ Why This Works
Neyman orthogonality ensures that estimation errors in ĝ and m̂ don't inflate the variance of θ̂ (first-order errors cancel). Cross-fitting ensures estimates are independent, allowing the Central Limit Theorem to apply. Together, we get valid asymptotic inference despite using black-box ML.
7. Implementation with EconML
Microsoft's EconML library provides production-ready DML implementations. Let's see how to apply it to our promo campaign:
Basic DML for ATE Estimation
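A minimal sketch with EconML's LinearDML (assuming the promo arrays Y, D, X from the walkthrough above; argument names may differ slightly across EconML versions):

```python
# Minimal sketch: DML for the average treatment effect of the promo.
# Assumes numpy arrays Y (spending), D (promo indicator), X (200+ covariates).
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

est = LinearDML(
    model_y=RandomForestRegressor(n_estimators=500, min_samples_leaf=10),   # g-hat: E[Y|X]
    model_t=RandomForestClassifier(n_estimators=500, min_samples_leaf=10),  # m-hat: P(D=1|X)
    discrete_treatment=True,   # binary promo
    cv=5,                      # 5-fold cross-fitting, handled internally
    random_state=42,
)
est.fit(Y, D, X=None, W=X)     # W = controls; X=None means a single, constant effect

print("ATE:", est.ate())                          # point estimate of theta
print("95% CI:", est.ate_interval(alpha=0.05))    # confidence interval for the ATE
```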
Heterogeneous Treatment Effects (CATE)
DML can also estimate how treatment effects vary across customers:
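A sketch of the same estimator with effect heterogeneity: here X_het and W_controls are hypothetical splits of the 200+ features into a few interpretable effect modifiers and the remaining confounders.

```python
# CATE sketch: the effect is modeled as linear in a few effect modifiers (X_het),
# while the remaining covariates (W_controls) are used only to adjust for confounding.
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

est = LinearDML(
    model_y=RandomForestRegressor(n_estimators=500, min_samples_leaf=10),
    model_t=RandomForestClassifier(n_estimators=500, min_samples_leaf=10),
    discrete_treatment=True,
    cv=5,
    random_state=42,
)
est.fit(Y, D, X=X_het, W=W_controls)

cate = est.effect(X_het)                             # per-customer effect estimates
lb, ub = est.effect_interval(X_het, alpha=0.05)      # pointwise 95% intervals
print(est.summary())                                 # which modifiers shift the effect
```

Because LinearDML assumes the effect is linear in X_het, keep that feature set small and interpretable; next week's tree-based methods relax this.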
Complete Example with Simulated Data
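A self-contained sketch of what such an example might look like: simulate a confounded promo dataset with a known true effect of $15, then recover it with LinearDML (same assumed API as above; sizes kept modest so it runs quickly).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from econml.dml import LinearDML

rng = np.random.default_rng(42)
n, p, theta_true = 5000, 200, 15.0

# Covariates and a confounded treatment: some features raise both promo probability and spending
X = rng.normal(size=(n, p))
propensity = 1 / (1 + np.exp(-(0.8 * X[:, 0] + 0.8 * X[:, 1])))
D = rng.binomial(1, propensity)

# Outcome: nonlinear in X plus a constant $15 treatment effect and noise
baseline = 30 + 10 * np.tanh(X[:, 0]) + 5 * X[:, 1] + 3 * (X[:, 2] > 0)
Y = theta_true * D + baseline + rng.normal(scale=10, size=n)

est = LinearDML(
    model_y=RandomForestRegressor(n_estimators=200, min_samples_leaf=20),
    model_t=RandomForestClassifier(n_estimators=200, min_samples_leaf=20),
    discrete_treatment=True,
    cv=5,
    random_state=42,
)
est.fit(Y, D, X=None, W=X)

print("True effect:", theta_true)
print("DML estimate:", est.ate())
print("95% CI:", est.ate_interval(alpha=0.05))
```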
💡 Practical Tips
- Use ensemble methods (RF, XGBoost) for nuisance functions—they're flexible and reduce variance
- Tune hyperparameters via nested cross-validation within each fold
- Check model fit: validate that ĝ and m̂ have reasonable out-of-sample R² (>0.1)
- Larger K (5 or 10 folds) reduces variance but increases computation time
- EconML handles all cross-fitting automatically—you just specify cv=K
8. Key Takeaways
✓ DML = ML + Valid Inference: Combines flexible ML for nuisance parameters with rigorous causal estimation
✓ Two key ingredients: Neyman orthogonality (protects against regularization bias) + cross-fitting (prevents overfitting)
✓ Orthogonal score: Residual-on-residual regression makes θ̂ insensitive to first-order errors in ĝ and m̂
✓ Cross-fitting: Ensures all nuisance predictions are out-of-sample, eliminating overfitting bias
✓ High-dimensional confounding: DML excels when you have many covariates or unknown functional forms
✓ Valid inference: Get asymptotically normal estimates with correct standard errors and confidence intervals
✓ Not magic: Still requires ignorability (no unobserved confounding) and overlap (common support)
9. Next Week Preview
DML gives us valid causal estimates with high-dimensional X, but it assumes the treatment effect is constant (or follows a specified parametric form). What if treatment effects are heterogeneous, with different customers responding differently to the promo?
Next week: Causal Forests & Tree-Based Methods
- How to adapt random forests for causal inference
- Honest estimation: sample splitting for trees
- Generalized Random Forests (GRF) framework
- Discovering which customers benefit most from promos
- Variable importance for treatment heterogeneity
We'll continue the promo example to learn who responds to discounts, enabling personalized targeting.