Discover how to adapt random forests for causal inference to estimate heterogeneous treatment effects—understanding who benefits most from a treatment and by how much.
Random forests are powerful prediction machines, but naive application to treatment effect estimation fails badly. Causal Forests adapt the random forest algorithm to directly target conditional average treatment effects (CATE).
Instead of asking "what will Y be?", we ask "how would Y change if we changed treatment?"—and critically, "does this effect vary across individuals?"
Standard regression trees (CART) split on features to minimize prediction error. Causal trees split on features to maximize treatment effect heterogeneity.
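Concretely, where CART picks the split minimizing squared prediction error, a causal tree (Athey and Imbens, 2016) picks the split whose children show the most effect heterogeneity, e.g. maximizing

$$\sum_{\ell \in \{L,R\}} n_\ell \, \hat{\tau}_\ell^{\,2}, \qquad \hat{\tau}_\ell = \bar{Y}_{\ell,\text{treated}} - \bar{Y}_{\ell,\text{control}},$$

so the resulting leaves separate high-responders from low-responders rather than high outcomes from low outcomes.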
Using the same data for both constructing the tree (choosing splits) and estimating leaf values leads to overfitting and biased estimates. Honesty solves this.
Honest Tree Construction:
- Split the training sample into two disjoint halves.
- Use one half only to choose the splits (the tree structure); use the other half only to estimate the treatment effect in each leaf.
- Because no observation influences both the partition and its own leaf estimate, the leaf estimates are unbiased conditional on the tree structure.
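To make both ideas concrete, here is a minimal, self-contained sketch of a single honest split on one feature. The function name, quantile grid, and leaf-size guard are illustrative choices, not the grf implementation:

import numpy as np

def honest_stump(X, Y, W, feature, rng=None):
    """One honest causal split: choose the cut on half A, estimate on half B."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(Y)
    idx = rng.permutation(n)
    A, B = idx[: n // 2], idx[n // 2:]  # disjoint halves

    def leaf_tau(rows):
        # Difference-in-means effect inside a leaf (a real implementation
        # would also guard against empty treated/control cells)
        y, w = Y[rows], W[rows]
        return y[w == 1].mean() - y[w == 0].mean()

    # 1) Choose the cut on half A by maximizing effect heterogeneity:
    #    n_L * tau_L^2 + n_R * tau_R^2
    best_cut, best_gain = None, -np.inf
    for cut in np.quantile(X[A, feature], np.linspace(0.1, 0.9, 9)):
        L, R = A[X[A, feature] <= cut], A[X[A, feature] > cut]
        if min(len(L), len(R)) < 20:  # crude minimum leaf size
            continue
        gain = len(L) * leaf_tau(L) ** 2 + len(R) * leaf_tau(R) ** 2
        if gain > best_gain:
            best_cut, best_gain = cut, gain

    # 2) Honesty: re-estimate both leaf effects on the held-out half B,
    #    which the split selection never saw
    L_B, R_B = B[X[B, feature] <= best_cut], B[X[B, feature] > best_cut]
    return best_cut, leaf_tau(L_B), leaf_tau(R_B)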
Just as random forests improve on single trees, causal forests aggregate many honest causal trees to get better estimates with lower variance.
Causal Forest Algorithm:
1. Draw a random subsample of the training data (and a random subset of candidate features, as in standard random forests).
2. Grow an honest causal tree on that subsample.
3. Repeat for B trees.
For a new observation x, we find which leaf it falls into in each tree, then average treatment effects from those leaves:
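$$\hat{\tau}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{\tau}_b(x),$$

where $\hat{\tau}_b(x)$ is the honest effect estimate in the leaf of tree $b$ containing $x$. Equivalently, the forest defines similarity weights over the training observations, the view that GRF generalizes below.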
GRF extends causal forests to a unified framework for local moment estimation, enabling estimation of various causal quantities beyond simple treatment effects.
GRF Framework (Athey, Tibshirani, Wager 2019):
Instead of targeting a specific estimand, GRF solves local versions of moment equations:
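$$\mathbb{E}\big[\psi_{\theta(x),\,\nu(x)}(O_i) \,\big|\, X_i = x\big] = 0,$$

where $\psi$ is a problem-specific score function, $\theta(x)$ is the target parameter (e.g. a local treatment effect), $\nu(x)$ is an optional nuisance parameter, and the conditioning is operationalized by forest-derived similarity weights $\alpha_i(x)$. Different choices of $\psi$ recover conditional means, quantiles, instrumental-variables estimates, and treatment effects.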
Beyond estimating τ(x), we often want to know which features drive treatment effect heterogeneity. Variable importance measures help prioritize subgroup analyses.
- Split frequency: how often a variable is used for splitting, weighted by the improvement in heterogeneity each split yields.
- Permutation importance: randomly permute a variable and measure the decrease in the forest's ability to detect heterogeneity.
- Effect decomposition: decompose individual-level CATE predictions into feature contributions.
- Best linear projection: project τ̂(X) onto individual features to quantify linear relationships (see the sketch after this list).
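As a minimal sketch of that last measure, assuming tau_hat (forest CATE predictions), X, and feature_names as constructed in the case study below:

from statsmodels.api import OLS, add_constant

# Project predicted CATE onto the features; each coefficient is the best
# linear approximation of how τ(x) moves with that feature. (This is the
# descriptive version; grf's best_linear_projection in R additionally
# uses doubly robust scores.)
blp = OLS(tau_hat, add_constant(X)).fit()
for name, coef, p in zip(feature_names, blp.params[1:], blp.pvalues[1:]):
    print(f"{name}: {coef:+.4f} (p={p:.3f})")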
Let's implement causal forests. The reference implementation is the grf package in R; in Python, econml provides CausalForestDML, which we use below.
| Method | Pros | Cons |
|---|---|---|
| Causal Forest | • Non-parametric, no functional form • Valid CIs • Variable importance • Handles high-dim X | • Slower than meta-learners • Black box (less interpretable) • Needs larger samples |
| DML | • Flexible ML for nuisances • Valid inference • Efficient estimates | • Assumes parametric τ(x) • Requires correct model for θ |
| Meta-learners | • Simple to implement • Fast • Works with any base learner | • No built-in inference • Can be biased (esp. S/T-learner) |
| Matching | • Interpretable • No parametric assumptions | • Curse of dimensionality • Inefficient with many features |
Context: You're a data scientist at DoorDash. The company currently offers a blanket 20% discount to all customers who haven't ordered in 60 days. Marketing wants to optimize this strategy: instead of giving everyone the same discount, can we personalize discount amounts based on who benefits most?
Data: Historical A/B test with 100K lapsed customers:
- Treatment: randomized 20% discount offer (50/50 split)
- Outcome: whether the customer ordered within 30 days
- Covariates: 40 customer features (tenure, order history, app usage, etc.)
Business question: Should we give 20% off to all 2M lapsed customers (costs $X million)? Or can we target only high-CATE customers—those who respond strongly to discounts—to maximize ROI?
Your task: Use Causal Forests to estimate heterogeneous treatment effects τ(x), identify which customer segments benefit most from discounts, and design a personalized targeting policy that maximizes incremental orders per dollar spent.
The Heterogeneity Hypothesis:
Not all customers respond equally to discounts. Likely variation by:
- Recency: days since last order (retrievable vs. long-churned)
- Usage intensity: total orders to date
- Price sensitivity: average order value
- Channel and context: app vs. web usage, urban vs. suburban

Why Causal Forests are Perfect for This:
- Non-parametric: no functional form assumed for how effects vary across 40 features
- Valid confidence intervals, so targeting thresholds rest on inference, not point estimates alone
- Built-in variable importance to surface which customer traits drive the heterogeneity
- Handles high-dimensional X without hand-specifying interaction terms

Alternative Approaches (and why they're worse):
- Meta-learners (S/T/X): simple and fast, but no built-in inference and prone to bias
- DML: valid inference, but assumes a parametric form for τ(x)
- Matching: interpretable, but the curse of dimensionality bites with 40 features
Step 2a: Data Preparation
import pandas as pd
import numpy as np
from econml.dml import CausalForestDML
# Load A/B test data
df = pd.read_csv('lapsed_customers_experiment.csv')
# Treatment: binary (1 = got discount, 0 = control)
W = df['discount_received'].values
# Outcome: binary (1 = ordered, 0 = didn't order)
Y = df['ordered_within_30days'].values
# Covariates: 40 customer features (name the columns once so we can
# reuse them for importance rankings and trees later)
feature_names = ['tenure_days', 'total_orders', 'avg_order_value', 'days_since_last_order',
                 'favorite_cuisine_diversity', 'urban', 'app_user', 'support_tickets',
                 'previous_discounts_used', ...]  # 40 total
X = df[feature_names].values
print(f"Sample size: {len(Y)}")
print(f"Treatment: {W.sum()}/{len(W)} ({W.mean()*100:.1f}% treated)")
print(f"Outcome (control): {Y[W==0].mean()*100:.1f}%")
print(f"Outcome (treated): {Y[W==1].mean()*100:.1f}%")
print(f"Naive ATE: {(Y[W==1].mean() - Y[W==0].mean())*100:.1f} pp")
# Example output:
# Sample size: 100000
# Treatment: 50000/100000 (50.0% treated)
# Outcome (control): 5.2%
# Outcome (treated): 13.4%
# Naive ATE: 8.2 pp
Step 2b: Fit Causal Forest
# Fit causal forest with honest splitting (econml's CausalForestDML,
# imported above; with honest=True each tree's subsample is split in half,
# one half for structure and one for leaf estimates)
cf = CausalForestDML(
    discrete_treatment=True,   # binary discount indicator
    n_estimators=4000,         # More trees = more stable
    min_samples_leaf=50,       # Min obs per leaf (avoid overfitting)
    max_depth=None,            # Let trees grow naturally
    honest=True,               # CRITICAL: use honest trees
    inference=True,            # Standard errors via bootstrap of little bags
    random_state=42
)
# Fit on data (econml's signature is fit(Y, T, X=...); our treatment is W)
cf.fit(Y, W, X=X)
# Estimate CATE and standard errors for each observation
tau_hat = cf.effect(X)                       # Individual treatment effects τ(x_i)
tau_stderr = cf.effect_inference(X).stderr   # Pointwise standard errors
print(f"\nCaTE Statistics:")
print(f" Mean CATE: {tau_hat.mean()*100:.2f} pp")
print(f" Median CATE: {np.median(tau_hat)*100:.2f} pp")
print(f" Min CATE: {tau_hat.min()*100:.2f} pp")
print(f" Max CATE: {tau_hat.max()*100:.2f} pp")
print(f" Std CATE: {tau_hat.std()*100:.2f} pp")
# Example:
# Mean CATE: 8.1 pp (≈ ATE, good sign)
# Median CATE: 7.3 pp
# Min CATE: -2.1 pp (some customers negatively affected!)
# Max CATE: 24.5 pp (high responders)
# Std CATE: 5.8 pp (substantial heterogeneity!)
Key Parameters Explained:
- n_estimators=4000: more trees reduce Monte Carlo noise in both the point estimates and the standard errors.
- min_samples_leaf=50: larger leaves trade a little bias for much more stable within-leaf treated/control contrasts.
- honest=True: disjoint data for split selection vs. leaf estimation; this is what makes the confidence intervals valid.
- inference=True: computes standard errors, at extra computational cost.
Step 2c: Validate Causal Forest Fit
# Check 1: Mean CATE ≈ ATE from naive comparison
ate_naive = Y[W==1].mean() - Y[W==0].mean()
ate_cf = tau_hat.mean()
print(f"ATE (naive): {ate_naive*100:.2f} pp")
print(f"ATE (causal forest): {ate_cf*100:.2f} pp")
print(f"Difference: {abs(ate_naive - ate_cf)*100:.2f} pp")
# Should be close (< 1pp difference)
# Check 2: Fit quality. econml's DML estimators expose score(Y, T, X=...),
# the MSE of the final residual-on-residual regression; lower is better.
# (The R grf package reports an analogous debiased out-of-bag error.)
print(f"\nDML score (final-stage MSE): {cf.score(Y, W, X=X):.4f}")
# Compare against a constant-effect baseline to gauge the value of heterogeneity
# Check 3: Variable importance
var_importance = cf.feature_importances_
top_features = np.argsort(var_importance)[-10:][::-1]
print("\nTop 10 features driving heterogeneity:")
for idx in top_features:
print(f" {feature_names[idx]}: {var_importance[idx]:.4f}")
# Example output:
# 1. days_since_last_order: 0.18
# 2. total_orders: 0.14
# 3. avg_order_value: 0.11
# 4. tenure_days: 0.09
# 5. urban: 0.07
Analysis 1: Quantile Analysis—Distribution of Treatment Effects
# Split customers into quintiles by CATE
cate_quintiles = pd.qcut(tau_hat, q=5, labels=['Q1 (Lowest)', 'Q2', 'Q3', 'Q4', 'Q5 (Highest)'])
for q in ['Q1 (Lowest)', 'Q2', 'Q3', 'Q4', 'Q5 (Highest)']:
mask = (cate_quintiles == q)
print(f"\n{q}:")
print(f" CATE range: [{tau_hat[mask].min()*100:.1f}pp, {tau_hat[mask].max()*100:.1f}pp]")
print(f" Avg CATE: {tau_hat[mask].mean()*100:.1f}pp")
print(f" Share of total lift: {tau_hat[mask].sum()/tau_hat.sum()*100:.1f}%")
# Example output:
# Q5 (Highest): CATE range: [15.2pp, 24.5pp], Avg: 18.9pp
# → Share of total lift: 42% from top 20% of customers!
# Q1 (Lowest): CATE range: [-2.1pp, 2.8pp], Avg: 0.9pp
# → Almost no response—giving discounts is wasteful here
Key insight: Top quintile drives 42% of total incremental orders while being only 20% of customers → massive targeting opportunity!
Analysis 2: Best Linear Projection (BLP) Test for Heterogeneity
# Test whether the forest found real heterogeneity: regress a noisy but
# unbiased per-unit effect proxy on the predicted CATE. In a randomized
# experiment with treatment probability e, the IPW-transformed outcome
#   psi_i = Y_i * (W_i - e) / (e * (1 - e))
# satisfies E[psi | X] = τ(X), so if significant heterogeneity exists the
# slope on predicted CATE should be significantly different from 0
from statsmodels.api import OLS, add_constant
e = W.mean()                        # ≈ 0.5 in this 50/50 experiment
psi = Y * (W - e) / (e * (1 - e))   # per-unit treatment effect proxy
# BLP regression: psi ~ intercept + centered CATE
model = OLS(psi, add_constant(tau_hat - tau_hat.mean())).fit()
print("\nBest Linear Projection Test:")
print(f"  Intercept: {model.params[0]:.4f} (p={model.pvalues[0]:.4f})")  # ≈ ATE
print(f"  Slope: {model.params[1]:.4f} (p={model.pvalues[1]:.4f})")
# Interpretation:
#   Slope ≈ 1      → CATE predictions are well-calibrated
#   p-value < 0.05 → significant heterogeneity detected
#   (R² is not meaningful here: psi is individually very noisy, so even a
#    well-calibrated model explains little of its variance)
# Example:
#   Slope: 0.89 (p < 0.001) → significant heterogeneity!
Analysis 3: Partial Dependence Plots—Which Features Drive Heterogeneity?
import matplotlib.pyplot as plt
# Manual partial dependence for the CATE surface: vary one feature over a
# grid while holding the others at their observed values, then average the
# forest's predictions. (sklearn's partial_dependence utilities expect
# sklearn estimators, so we compute the curves directly.)
features_to_plot = ['days_since_last_order', 'total_orders',
                    'avg_order_value', 'tenure_days']
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for idx, feature in enumerate(features_to_plot):
    ax = axes[idx // 2, idx % 2]
    col = feature_names.index(feature)
    grid = np.percentile(X[:, col], np.linspace(5, 95, 20))
    curve = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, col] = v
        curve.append(cf.effect(X_mod).mean())
    ax.plot(grid, np.array(curve) * 100)
    ax.set_xlabel(feature)
    ax.set_ylabel('CATE (pp)')
    ax.set_title(f'CATE vs {feature}')
    ax.grid(alpha=0.3)
plt.tight_layout()
# Example findings from PD plots:
# - days_since_last_order: CATE peaks at 60-90 days, drops for >120 days
# → These customers are "retrievable," very long-churned customers less so
# - total_orders: CATE highest for moderate users (10-30 orders),
# lower for very light (<5) and heavy (>50) users
# - avg_order_value: Negative relationship—high AOV customers don't need discount
# - tenure_days: U-shaped—new customers and very old customers respond less
Analysis 4: Policy Tree—Interpretable Segmentation
from sklearn.tree import DecisionTreeRegressor, plot_tree
# Fit decision tree on CATE to create interpretable rules
policy_tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=1000)
policy_tree.fit(X, tau_hat)
# Visualize
plt.figure(figsize=(20, 10))
plot_tree(policy_tree, feature_names=feature_names, filled=True)
plt.title('Policy Tree: Which customers have high CATE?')
# Example tree:
# Root: days_since_last_order < 90?
# Yes → total_orders < 20?
# Yes → CATE = 4.2pp (low responders)
# No → CATE = 16.8pp (HIGH responders)
# No → CATE = 2.1pp (long-churned, low response)
# Create targeting segments from the tree's leaves
segments = policy_tree.apply(X)
for seg in np.unique(segments):
    mask = (segments == seg)
    print(f"\nSegment {seg}: n={mask.sum()}")
    print(f"  Avg CATE: {tau_hat[mask].mean()*100:.1f}pp")
# Print the split rules behind each segment
from sklearn.tree import export_text
print(export_text(policy_tree, feature_names=feature_names))
Targeting Strategy: Threshold-Based
# Simulate different targeting policies: Give discount to customers with CATE > threshold
discount_cost = 5 # $5 cost per discount given
revenue_per_order = 8 # $8 revenue per incremental order
thresholds = np.linspace(0, 0.20, 21) # 0% to 20% CATE thresholds
results = []
for threshold in thresholds:
# Target customers with CATE > threshold
target_mask = (tau_hat > threshold)
n_targeted = target_mask.sum()
# Expected incremental orders
incremental_orders = tau_hat[target_mask].sum()
# Cost vs benefit
total_cost = n_targeted * discount_cost
total_revenue = incremental_orders * revenue_per_order
net_value = total_revenue - total_cost
roi = net_value / total_cost if total_cost > 0 else 0
results.append({
'threshold': threshold * 100,
'n_targeted': n_targeted,
'pct_targeted': n_targeted / len(tau_hat) * 100,
'incremental_orders': incremental_orders,
'total_cost': total_cost,
'total_revenue': total_revenue,
'net_value': net_value,
'roi': roi * 100
})
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
Example Results:
| Threshold | % Targeted | Net Value | ROI |
|---|---|---|---|
| 0% | 100% | $310K | 31% |
| 5% | 65% | $420K | 65% |
| 10% | 35% | $490K ⭐ | 140% |
| 15% | 15% | $380K | 250% |
| 20% | 5% | $180K | 360% |
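These numbers are easier to scan plotted; a small optional view of results_df (matplotlib was imported in the partial-dependence step):

# Net value peaks at an interior threshold while ROI keeps rising
fig, ax1 = plt.subplots(figsize=(8, 5))
ax1.plot(results_df['threshold'], results_df['net_value'], marker='o')
ax1.set_xlabel('CATE threshold (pp)')
ax1.set_ylabel('Net value ($)')
ax2 = ax1.twinx()
ax2.plot(results_df['threshold'], results_df['roi'], linestyle='--', color='gray')
ax2.set_ylabel('ROI (%)')
plt.title('Policy simulation: net value and ROI by targeting threshold')
plt.tight_layout()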
Optimal policy: Target customers with CATE > 10% → 35% of customers, $490K net value (58% more than blanket policy!)
Recommendation to Leadership:
Current Policy (Blanket 20% off):
- Discounts all 2M lapsed customers: ~$10M in discount costs (at $5 per discount)
- In the policy simulation: $310K net value, 31% ROI
Proposed Policy (Personalized Targeting at CATE > 10%):
✓ Save $6.5M in discount costs
✓ Increase net value by $180K (58% improvement)
✓ Improve ROI from 31% → 140%
Validation 1: Rank-Weighted Average Treatment Effect (RATE)
# RATE-style check: do the customers the forest ranks higher actually show
# larger treatment effects? Weight each unit by its CATE rank and compare
# weighted treated/control means. (This is a simple proxy for the formal
# RATE of Yadlowsky et al.; grf's rank_average_treatment_effect implements
# the real thing in R.)
from scipy.stats import rankdata
ranks = rankdata(tau_hat)
w = ranks / ranks.max()   # weights increasing in predicted CATE
rate = (np.average(Y[W == 1], weights=w[W == 1])
        - np.average(Y[W == 0], weights=w[W == 0]))
print(f"Rank-weighted ATE: {rate*100:.2f} pp")
print(f"ATE (unweighted): {tau_hat.mean()*100:.2f} pp")
# If rank-weighted > unweighted → high-CATE customers have higher actual effects ✓
# If rank-weighted ≈ unweighted → no usable heterogeneity (CATE is noise)
# Example: 12.4pp vs 8.1pp → Validation success!
Validation 2: Out-of-Sample Policy Evaluation
# Hold out 20% of data for validation
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val, W_train, W_val = train_test_split(
X, Y, W, test_size=0.2, random_state=42
)
# Fit a fresh causal forest on the training split only
cf_oos = CausalForestDML(discrete_treatment=True, n_estimators=4000,
                         min_samples_leaf=50, honest=True,
                         inference=True, random_state=42)
cf_oos.fit(Y_train, W_train, X=X_train)
# Predict CATE on the validation set (out-of-sample!)
tau_hat_val = cf_oos.effect(X_val)
# Evaluate policy on validation set
threshold = 0.10
target_mask_val = (tau_hat_val > threshold)
# Actual difference-in-means among targeted units in the validation set
treated = (W_val == 1)
ate_val_targeted = (Y_val[treated & target_mask_val].mean() -
                    Y_val[~treated & target_mask_val].mean())
print(f"\nOut-of-Sample Validation:")
print(f" Predicted CATE (targeted group): {tau_hat_val[target_mask_val].mean()*100:.2f} pp")
print(f" Actual ATE (targeted group): {ate_val_targeted*100:.2f} pp")
print(f" Difference: {abs(tau_hat_val[target_mask_val].mean() - ate_val_targeted)*100:.2f} pp")
# Small difference (<2pp) → good out-of-sample performance ✓
Validation 3: A/B Test Confirmation (Recommended)
Before rolling out personalized targeting to all 2M customers, run a pilot A/B test:
- Arm A: current blanket policy (20% off for everyone)
- Arm B: CATE-based targeting (20% off only where τ̂(x) > 10pp)
- Compare incremental orders and net value per customer across arms
If pilot confirms CATE-based targeting outperforms blanket policy → roll out to full 2M customer base.
Q1: How do you prevent overfitting with so many features (40)?
Answer:
- Honest splitting: splits are chosen on data that never touches the leaf estimates, so noise-chasing splits don't bias the effect estimates.
- Regularization: min_samples_leaf=50 keeps within-leaf treated/control contrasts stable.
- Averaging over 4,000 subsampled trees washes out any single tree's overfit structure.
- Empirical checks: the BLP slope near 1 and the out-of-sample policy evaluation both confirm the CATEs generalize.
Q2: What if customers with high CATE would have ordered anyway (without discount)?
Answer:
This is a critical concern—we want CATE (treatment effect), not baseline propensity to order.
Why Causal Forest solves this:
- The splitting criterion targets the treated-vs-control contrast, not the outcome level: a segment that orders heavily with or without a discount has a high baseline but a CATE near zero.
- Honest leaf estimates are clean differences-in-means on held-out data, so a high predicted CATE reflects incremental orders, not propensity to order.
Sanity check: Plot baseline outcome (control group) vs. CATE. If uncorrelated → good. If correlated → may be confusing propensity with treatment effect (check model specification).
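A minimal sketch of that check; the baseline model choice here is illustrative:

from sklearn.ensemble import GradientBoostingClassifier

# Baseline order propensity, fit on control units only
base = GradientBoostingClassifier().fit(X[W == 0], Y[W == 0])
baseline_prob = base.predict_proba(X)[:, 1]

# Low correlation → the forest captures incremental response,
# not just who was likely to order anyway
plt.scatter(baseline_prob, tau_hat, s=2, alpha=0.3)
plt.xlabel('Baseline P(order | control)')
plt.ylabel('Predicted CATE')
print(f"corr(baseline, CATE) = {np.corrcoef(baseline_prob, tau_hat)[0, 1]:.3f}")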
Q3: Can we use CATE estimates for continuous treatments (e.g., discount amount)?
Answer:
Yes! Causal forests generalize to continuous treatments. Instead of binary (discount/no discount), estimate dose-response function τ(x, d) where d is discount level (0%, 10%, 20%, 30%).
Approach:
- Randomize discount levels (e.g., 0%, 10%, 20%, 30%) rather than a single on/off offer.
- Estimate the dose-response τ(x, d) with a forest that supports continuous or multi-valued treatments (GRF does; a sketch follows this answer).
- For each customer, pick the discount level maximizing expected incremental revenue minus discount cost.
This enables fully personalized discounts (e.g., 8% for customer A, 22% for customer B).
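A hedged sketch with econml's CausalForestDML and a continuous treatment; the discount_pct column is hypothetical, and the treatment_featurizer argument (available in recent econml versions) lets the dose-response curve bend rather than being linear in the dose:

from econml.dml import CausalForestDML
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical continuous treatment: randomized discount level in [0, 0.3]
D = df['discount_pct'].values  # assumed column name

cf_dose = CausalForestDML(
    treatment_featurizer=PolynomialFeatures(degree=2, include_bias=False),
    n_estimators=2000, honest=True, random_state=42,
)
cf_dose.fit(Y, D, X=X)

# Effect of moving a customer from 0% to 10% vs. from 0% to 20% discount
tau_10 = cf_dose.effect(X, T0=0.0, T1=0.10)
tau_20 = cf_dose.effect(X, T0=0.0, T1=0.20)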
Q4: How do you communicate CATE findings to non-technical stakeholders?
Answer:
- Lead with segments, not coefficients: translate the policy tree into plain rules ("customers lapsed under 90 days with 20+ prior orders respond at ~17pp, vs. ~2-4pp elsewhere").
- Anchor on business numbers stakeholders already track: the top quintile drives 42% of the lift; targeting at CATE > 10pp moves net value from $310K to $490K and ROI from 31% to 140%.
- Present uncertainty as pilot risk, and position the confirmation A/B test as the go/no-go gate before the 2M-customer rollout.