1. The Problem: ML Meets Causality
We've learned classical causal inference methods (matching, regression, IV), but they all struggle when you have many confounders (high-dimensional X). Traditional approaches tend to:
- Make restrictive assumptions about functional form (linear regression)
- Suffer from curse of dimensionality (matching with 200 covariates is hopeless)
- Require manual feature engineering and model specification
Machine learning excels at prediction with complex, high-dimensional data. Random forests, neural networks, and gradient boosting can model intricate relationships without manual specification. Can we use ML for causal inference?
⚠️ The Challenge:
Naively plugging ML into causal estimation leads to biased estimates and invalid confidence intervals. Regularization bias, overfitting, and improper cross-validation all break standard inference.
Double/Debiased Machine Learning (DML) solves this by carefully separating the prediction task (where ML shines) from the causal estimation task (where we need unbiased estimates and valid inference).
2. Why Naive ML Fails for Causal Inference
Consider the partially linear model for our promo example:
Y = θD + g(X) + ε,   E[ε | D, X] = 0

Where Y is purchase amount, D is promo treatment (0/1), X is our 200+ features, g(X) is an unknown function capturing how the features affect spending, and θ is the ATE we want to estimate.
Naive Approach: Use Lasso for Everything
We might try:
- Include treatment D and all covariates X in a Lasso regression
- Read off the coefficient on D as our causal effect estimate θ̂
This fails in three ways:
Problem 1: Regularization Bias
Lasso shrinks coefficients toward zero to prevent overfitting. But we want an unbiased estimate of θ! The regularization penalty contaminates our causal estimate.
Example: True θ = $15 average increase in spending from promo. Lasso might shrink this to θ̂ = $12 because it's penalizing all coefficients. Even with n→∞, the bias doesn't vanish.
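To make the shrinkage concrete, here is a toy simulation sketch (hypothetical numbers, with a randomized treatment and a fixed penalty chosen purely for illustration, so the only bias at work is regularization):

```python
# Toy illustration of regularization bias (hypothetical setup, not the promo data).
# D is randomized here, so any gap between the estimate and the true effect of $15
# comes from the Lasso penalty shrinking the coefficient on D toward zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, theta = 5000, 200, 15.0

X = rng.normal(size=(n, p))                      # 200 covariates
D = rng.binomial(1, 0.5, size=n)                 # randomized treatment
Y = theta * D + 5 * X[:, :20].sum(axis=1) + rng.normal(scale=10, size=n)

naive = Lasso(alpha=1.0).fit(np.column_stack([D, X]), Y)   # fixed penalty for illustration
print(f"true effect: {theta}, naive Lasso estimate: {naive.coef_[0]:.2f}")
# Typically prints an estimate noticeably below 15; with a confounded treatment and
# penalized controls, the bias gets worse and its direction is harder to predict.
```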
Problem 2: Overfitting to Nuisance Functions
If we use the same data to estimate g(X) and θ, the ML model "learns" spurious patterns. The estimate of θ picks up noise from how well g(X) fits in-sample.
Why it matters: Your standard errors will be wrong. You might think θ̂ is precisely estimated, but you've overfit to idiosyncrasies of your sample. Out-of-sample, the effect could be very different.
Problem 3: Invalid Confidence Intervals
Standard errors from regularized regression don't account for the model selection process. The post-selection inference problem means naive CIs have incorrect coverage.
Consequence: You claim "95% confident θ is between $12-$18" but in reality, the true coverage might be only 70%. Scientific conclusions based on these intervals are unreliable.
💡 The Insight:
We need to use ML for nuisance estimation (g(X), propensity scores) while keeping θ estimation separate and unbiased. DML achieves this through orthogonalization and sample splitting.
3. Neyman Orthogonality: The Key Innovation
The breakthrough idea: construct an estimating equation that's insensitive to small errors in estimating nuisance parameters.
The Setup: Partially Linear Model
Y = θD + g(X) + ε,   E[ε | D, X] = 0
D = m(X) + v,   E[v | X] = 0

θ is our causal parameter. The nuisance functions g(X) (how the covariates shift the outcome) and m(X) = E[D|X] (the propensity model) are what we estimate with ML.
From Direct Estimation to Orthogonal Score
Instead of directly regressing Y on D and X, DML uses the orthogonal (Neyman) score:
ψ(Wi; θ, η) = (Yi − θDi − g(Xi)) · (Di − m(Xi))

Where Wi = (Yi, Di, Xi) and η = (g, m) collects the nuisance parameters.
Why "Orthogonal"?
This score has a magical property: it's locally insensitive to errors in ĝ and m̂. Mathematically, the derivative of E[ψ] with respect to the nuisance parameters η is zero at the true values:

∂η E[ψ(W; θ₀, η)] |η=η₀ = 0
In plain English: Small mistakes in estimating g and m only have second-order effects on our estimate of θ. Errors in nuisances have to interact with each other to bias θ—first-order errors cancel out!
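For readers who want to see the cancellation explicitly, here is a short sketch of the calculation in the partially linear model above (θ₀, g₀, m₀ are the true values; h_g and h_m are arbitrary perturbations of the nuisances):

```latex
\begin{aligned}
\partial_r\,\mathbb{E}\big[(Y-\theta_0 D-g_0(X)-r\,h_g(X))\,(D-m_0(X))\big]\big|_{r=0}
  &= -\,\mathbb{E}\big[h_g(X)\,\mathbb{E}[D-m_0(X)\mid X]\big] = 0, \\
\partial_r\,\mathbb{E}\big[(Y-\theta_0 D-g_0(X))\,(D-m_0(X)-r\,h_m(X))\big]\big|_{r=0}
  &= -\,\mathbb{E}\big[\varepsilon\,h_m(X)\big] = 0 .
\end{aligned}
```

The first line vanishes because E[D − m₀(X) | X] = 0 by definition of the propensity model; the second because the outcome error ε = Y − θ₀D − g₀(X) has conditional mean zero given (D, X).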
🎯 Intuition: Why Residual-on-Residual Works
The orthogonal score multiplies two residuals:
- Ỹi = Yi - θDi - g(Xi): outcome residual after removing treatment effect and X
- D̃i = Di - m(Xi): treatment residual after removing X
If ĝ is slightly wrong, it affects Ỹ. If m̂ is slightly wrong, it affects D̃. But in the product, these errors are "orthogonal" (perpendicular)—they don't accumulate linearly. You need both to be wrong in correlated ways for θ̂ to be biased.
Promo Example: Orthogonal Score in Action
Customer i data:
- Yi = $50 (spent $50)
- Di = 1 (received 20% off promo)
- Xi = [browsing history, demographics, ...] (200 features)
ML predictions:
- ĝ(Xi) = $30 (expected spending given X, regardless of promo)
- m̂(Xi) = 0.7 (70% chance of getting promo given X)
- θ̂ = $15 (current estimate of promo effect)
Compute orthogonal score:
Ỹi = 50 − 15·1 − 30 = 5 (outcome residual), D̃i = 1 − 0.7 = 0.3 (treatment residual), so ψi = Ỹi · D̃i = 5 × 0.3 = 1.5.
We want E[ψ] = 0 at the true θ. If E[ψ] > 0, θ̂ is too small; if E[ψ] < 0, θ̂ is too large. DML searches for θ̂ that sets the average score to zero.
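Setting the sample average of the score to zero and solving for θ gives the residual-on-residual formula that reappears in the algorithm below. A quick sketch, writing the score in terms of the residuals Ỹi = Yi − ĝ(Xi) and D̃i = Di − m̂(Xi) computed in the cross-fitting sections below (note Yi − θDi − g(Xi) = Ỹi − θD̃i, so this is the same score):

```latex
\frac{1}{n}\sum_{i=1}^{n}\big(\tilde Y_i-\theta\,\tilde D_i\big)\,\tilde D_i = 0
\qquad\Longrightarrow\qquad
\hat\theta \;=\; \frac{\sum_{i}\tilde D_i\,\tilde Y_i}{\sum_{i}\tilde D_i^{2}} .
```

This is why the final step of the algorithm below is simply a no-intercept regression of Ỹ on D̃.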
4. Cross-Fitting and Sample Splitting
Orthogonality protects against regularization bias, but we still face overfitting bias. If we use the same data to train ĝ and m̂ that we use to compute θ̂, we create dependence between estimates.
The Problem: In-Sample Overfitting
Suppose we fit a Random Forest for ĝ(X) on all data, then use those predictions to estimate θ.
Issue: The RF has memorized patterns specific to our sample. When we compute residuals Ỹi = Yi - ĝ(Xi), the predictions ĝ(Xi) are "too good" on the training data—artificially small residuals.
This overfitting creates spurious correlation between Ỹ and D̃, biasing θ̂ even though the score is orthogonal!
The Solution: Cross-Fitting (K-Fold Sample Splitting)
Ensure that nuisance function predictions are always out-of-sample:
Cross-Fitting Algorithm:
- Split data into K folds (typically K=2 or K=5)
- For each fold k = 1, ..., K:
- Train ĝ and m̂ on all folds except k
- Predict ĝ(Xi) and m̂(Xi) for observations in fold k
- Compute residuals: Ỹi = Yi - ĝ(Xi), D̃i = Di - m̂(Xi)
- Pool residuals from all folds
- Estimate θ̂ by regressing Ỹ on D̃: θ̂ = (Σ D̃iỸi) / (Σ D̃i²)
Visual: 2-Fold Cross-Fitting
Fold 1 (observations 1-500):
- Train ĝ₂ and m̂₂ on Fold 2 (observations 501-1000)
- Predict ĝ₂(Xi), m̂₂(Xi) for i = 1, ..., 500
- Compute residuals Ỹi, D̃i for Fold 1
Fold 2 (observations 501-1000):
- Train ĝ₁ and m̂₁ on Fold 1 (observations 1-500)
- Predict ĝ₁(Xi), m̂₁(Xi) for i = 501, ..., 1000
- Compute residuals Ỹi, D̃i for Fold 2
Result: All predictions are out-of-sample!
Each observation gets nuisance predictions from a model that never saw that observation during training. This eliminates overfitting bias while still using all data for both nuisance estimation and θ estimation.
🔑 Key Insight: Best of Both Worlds
Unlike train/test split (which discards half the data), cross-fitting uses all data for both training ML models and estimating θ, while maintaining statistical independence needed for valid inference.
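As a compact sketch (assuming arrays X, Y, D are already loaded), scikit-learn's cross_val_predict produces exactly these out-of-fold predictions; the next section spells out the same loop by hand:

```python
# Cross-fitting in a few lines: every prediction comes from a model
# trained on the folds that do NOT contain that observation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

g_hat = cross_val_predict(RandomForestRegressor(n_estimators=500), X, Y, cv=5)   # out-of-fold E[Y|X]
m_hat = cross_val_predict(RandomForestClassifier(n_estimators=500), X, D, cv=5,
                          method="predict_proba")[:, 1]                          # out-of-fold P(D=1|X)

Y_tilde, D_tilde = Y - g_hat, D - m_hat
theta_hat = np.sum(D_tilde * Y_tilde) / np.sum(D_tilde ** 2)                     # residual-on-residual slope
```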
5. The DML Algorithm Step-by-Step
Let's walk through DML for our promo campaign with 200+ features:
Complete DML Procedure:
- Partition data: Randomly split your n observations into K folds of equal size
```python
# Python example
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = list(kf.split(X))   # each entry: (indices outside fold k, indices in fold k)
```
- For each fold k:
- Define auxiliary sample: Ik = all observations except fold k
- Estimate outcome model: Train ĝk(x) = E[Y|X=x] on Ik
- Estimate treatment model: Train m̂k(x) = E[D|X=x] on Ik (for binary treatment, predict the probability)
- Predict on fold k: For observations in fold k, compute ĝk(Xi) and m̂k(Xi)
```python
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

g_pred = np.zeros(len(Y))
m_pred = np.zeros(len(Y))
for I_k, fold_k in folds:                       # I_k = auxiliary sample, fold_k = fold k
    g_model = RandomForestRegressor(n_estimators=500)
    g_model.fit(X[I_k], Y[I_k])                 # outcome model trained on the auxiliary sample

    m_model = RandomForestClassifier(n_estimators=500)
    m_model.fit(X[I_k], D[I_k])                 # treatment model trained on the auxiliary sample

    g_pred[fold_k] = g_model.predict(X[fold_k])                 # out-of-sample g-hat
    m_pred[fold_k] = m_model.predict_proba(X[fold_k])[:, 1]     # out-of-sample propensity
```
- Compute residuals: For all i, compute
```python
Y_tilde = Y - g_pred   # Outcome residuals
D_tilde = D - m_pred   # Treatment residuals
```
- Estimate treatment effect: Regress Ỹ on D̃
```python
theta_hat = np.sum(D_tilde * Y_tilde) / np.sum(D_tilde ** 2)
```
This is equivalent to running OLS of Ỹi on D̃i (the slope coefficient).
- Inference: Compute standard error using the orthogonal score (see next section)
Promo Campaign Example: Full Walkthrough
Setup:
- n = 10,000 customers
- 200 features X: browsing behavior, demographics, purchase history, email engagement, etc.
- Treatment D: 1 if received 20% off promo, 0 otherwise (5,000 treated, 5,000 control)
- Outcome Y: spending in next 30 days (continuous, $0-$500)
DML Execution:
- Split into 5 folds (2,000 customers each)
- For Fold 1:
- Train RF on Folds 2-5 (8,000 customers) to predict Y from X → ĝ₁
- Train RF on Folds 2-5 to predict D from X → m̂₁
- Predict ĝ₁(Xi), m̂₁(Xi) for 2,000 Fold 1 customers
- Compute Ỹi, D̃i for Fold 1
- Repeat for Folds 2, 3, 4, 5 (each fold gets out-of-sample predictions)
- Pool: We now have Ỹi, D̃i for all 10,000 customers
- Regress Ỹ on D̃: θ̂ = $14.73 (estimated effect of 20% promo on spending)
- Compute SE(θ̂) = $2.10 (from orthogonal score variance)
- 95% CI: [$10.61, $18.85]
Interpretation: Receiving the 20% off promo causes customers to spend $14.73 more on average over the next 30 days, with 95% confidence the true effect is between $10.61 and $18.85.
6. Variance Estimation and Statistical Inference
A major advantage of DML is that we get valid standard errors and confidence intervals, despite using ML for nuisance estimation.
Asymptotic Normality
Under regularity conditions (nuisance functions converge fast enough, orthogonality holds), the DML estimator is asymptotically normal with variance σ² that can be consistently estimated.
Computing Standard Errors
The variance σ² is based on the orthogonal score: σ² = E[ψ²] / (E[D̃²])², estimated in four steps:
Step 1: Compute orthogonal scores: ψ̂i = (Ỹi − θ̂D̃i) · D̃i
Step 2: Estimate variance: σ̂² = (1/n Σi ψ̂i²) / (1/n Σi D̃i²)²
Step 3: Standard error: SE(θ̂) = σ̂ / √n
Step 4: Confidence interval (95%): θ̂ ± 1.96 · SE(θ̂)
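A minimal sketch of these steps in code (assuming Y_tilde, D_tilde, and theta_hat from the cross-fitting algorithm above):

```python
import numpy as np

# Orthogonal scores evaluated at theta_hat
psi = (Y_tilde - theta_hat * D_tilde) * D_tilde

n = len(Y_tilde)
J = np.mean(D_tilde ** 2)                             # (minus the) derivative of the score in theta
sigma2 = np.mean(psi ** 2) / J ** 2                   # asymptotic variance of sqrt(n)*(theta_hat - theta)
se = np.sqrt(sigma2 / n)                              # SE(theta_hat)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)   # 95% confidence interval
```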
Hypothesis Testing
Test H₀: θ = 0 (no treatment effect) using a standard t-test, with t = θ̂ / SE(θ̂):
Promo example: θ̂ = $14.73, SE = $2.10
t = 14.73 / 2.10 = 7.01
p-value ≈ 0.000 (highly significant)
✓ Why This Works
Neyman orthogonality ensures that estimation errors in ĝ and m̂ don't inflate the variance of θ̂ (first-order errors cancel). Cross-fitting ensures estimates are independent, allowing the Central Limit Theorem to apply. Together, we get valid asymptotic inference despite using black-box ML.
7. Implementation with EconML
Microsoft's EconML library provides production-ready DML implementations. Let's see how to apply it to our promo campaign:
Basic DML for ATE Estimation
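A minimal sketch with EconML's LinearDML (assuming the promo arrays Y, D, X from the walkthrough above; argument names may differ slightly across EconML versions):

```python
# Minimal sketch: DML for the average treatment effect of the promo.
# Assumes numpy arrays Y (spending), D (promo indicator), X (200+ covariates).
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

est = LinearDML(
    model_y=RandomForestRegressor(n_estimators=500, min_samples_leaf=10),   # g-hat: E[Y|X]
    model_t=RandomForestClassifier(n_estimators=500, min_samples_leaf=10),  # m-hat: P(D=1|X)
    discrete_treatment=True,   # binary promo
    cv=5,                      # 5-fold cross-fitting, handled internally
    random_state=42,
)
est.fit(Y, D, X=None, W=X)     # W = controls; X=None means a single, constant effect

print("ATE:", est.ate())                          # point estimate of theta
print("95% CI:", est.ate_interval(alpha=0.05))    # confidence interval for the ATE
```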
Heterogeneous Treatment Effects (CATE)
DML can also estimate how treatment effects vary across customers:
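A sketch of the same estimator with effect heterogeneity: here X_het and W_controls are hypothetical splits of the 200+ features into a few interpretable effect modifiers and the remaining confounders.

```python
# CATE sketch: the effect is modeled as linear in a few effect modifiers (X_het),
# while the remaining covariates (W_controls) are used only to adjust for confounding.
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

est = LinearDML(
    model_y=RandomForestRegressor(n_estimators=500, min_samples_leaf=10),
    model_t=RandomForestClassifier(n_estimators=500, min_samples_leaf=10),
    discrete_treatment=True,
    cv=5,
    random_state=42,
)
est.fit(Y, D, X=X_het, W=W_controls)

cate = est.effect(X_het)                             # per-customer effect estimates
lb, ub = est.effect_interval(X_het, alpha=0.05)      # pointwise 95% intervals
print(est.summary())                                 # which modifiers shift the effect
```

Because LinearDML assumes the effect is linear in X_het, keep that feature set small and interpretable; next week's tree-based methods relax this.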
Complete Example with Simulated Data
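A self-contained sketch of what such an example might look like: simulate a confounded promo dataset with a known true effect of $15, then recover it with LinearDML (same assumed API as above; sizes kept modest so it runs quickly).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from econml.dml import LinearDML

rng = np.random.default_rng(42)
n, p, theta_true = 5000, 200, 15.0

# Covariates and a confounded treatment: some features raise both promo probability and spending
X = rng.normal(size=(n, p))
propensity = 1 / (1 + np.exp(-(0.8 * X[:, 0] + 0.8 * X[:, 1])))
D = rng.binomial(1, propensity)

# Outcome: nonlinear in X plus a constant $15 treatment effect and noise
baseline = 30 + 10 * np.tanh(X[:, 0]) + 5 * X[:, 1] + 3 * (X[:, 2] > 0)
Y = theta_true * D + baseline + rng.normal(scale=10, size=n)

est = LinearDML(
    model_y=RandomForestRegressor(n_estimators=200, min_samples_leaf=20),
    model_t=RandomForestClassifier(n_estimators=200, min_samples_leaf=20),
    discrete_treatment=True,
    cv=5,
    random_state=42,
)
est.fit(Y, D, X=None, W=X)

print("True effect:", theta_true)
print("DML estimate:", est.ate())
print("95% CI:", est.ate_interval(alpha=0.05))
```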
💡 Practical Tips
- Use ensemble methods (RF, XGBoost) for nuisance functions—they're flexible and reduce variance
- Tune hyperparameters via nested cross-validation within each fold
- Check model fit: validate that ĝ and m̂ have reasonable out-of-sample R² (>0.1)
- Larger K (5 or 10 folds) reduces variance but increases computation time
- EconML handles all cross-fitting automatically—you just specify cv=K
8. Key Takeaways
✓ DML = ML + Valid Inference: Combines flexible ML for nuisance parameters with rigorous causal estimation
✓ Two key ingredients: Neyman orthogonality (protects against regularization bias) + cross-fitting (prevents overfitting)
✓ Orthogonal score: Residual-on-residual regression makes θ̂ insensitive to first-order errors in ĝ and m̂
✓ Cross-fitting: Ensures all nuisance predictions are out-of-sample, eliminating overfitting bias
✓ High-dimensional confounding: DML excels when you have many covariates or unknown functional forms
✓ Valid inference: Get asymptotically normal estimates with correct standard errors and confidence intervals
✓ Not magic: Still requires ignorability (no unobserved confounding) and overlap (common support)
9. Next Week Preview
DML gives us valid causal estimates with high-dimensional X, but it assumes the treatment effect is constant (or follows a specified parametric form). What if treatment effects are heterogeneous, with different customers responding differently to the promo?
Next week: Causal Forests & Tree-Based Methods
- How to adapt random forests for causal inference
- Honest estimation: sample splitting for trees
- Generalized Random Forests (GRF) framework
- Discovering which customers benefit most from promos
- Variable importance for treatment heterogeneity
We'll continue the promo example to learn who responds to discounts, enabling personalized targeting.