Module 3, Week 6

Causal Forests & Tree-Based Methods

Discover how to adapt random forests for causal inference to estimate heterogeneous treatment effects—understanding who benefits most from a treatment and by how much.

From Prediction to Causation

Random forests are powerful prediction machines, but naive application to treatment effect estimation fails badly. Causal Forests adapt the random forest algorithm to directly target conditional average treatment effects (CATE).

Instead of asking "what will Y be?", we ask "how would Y change if we changed treatment?"—and critically, "does this effect vary across individuals?"

Key Insight: By modifying the splitting criterion and prediction rule, we can grow trees that partition the covariate space to maximize treatment effect heterogeneity, not prediction accuracy.

1. From CART to Causal Trees

Standard regression trees (CART) split on features to minimize prediction error. Causal trees split on features to maximize treatment effect heterogeneity.

Standard Regression Tree (CART):
Split to minimize: Σ (yᵢ - ŷ)²
// Minimize prediction error
Causal Tree:
Split to maximize: Var[τ(X)] across leaves
// Maximize treatment effect variation

Causal Tree Algorithm (Athey & Imbens 2016):

  1. Splitting criterion: Find split that maximizes treatment effect heterogeneity between child nodes
  2. Leaf estimates: Estimate τ̂(x) = Ȳ₁ - Ȳ₀ in each leaf (average difference between treated and control)
  3. Stopping rule: Stop when leaves have minimum sample size or no more heterogeneity
  4. Prediction: For new x, return τ̂ from the leaf x falls into

Why Different Splitting? We don't care if Y is predictable—we care if treatment effects differ across subgroups. A good causal tree creates leaves where treatment effects are homogeneous within but heterogeneous across leaves.
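
To make the splitting criterion concrete, here is a minimal Python sketch. It scores a candidate split by the squared difference in leaf effects, a simplified stand-in for the Athey-Imbens criterion (the real criterion also penalizes the variance of the leaf estimates):

import numpy as np

def leaf_tau(y, w):
    """Leaf-level effect estimate: difference of treated and control means."""
    return y[w == 1].mean() - y[w == 0].mean()

def split_score(y, w, x_col, threshold):
    """Heterogeneity created by splitting x_col at threshold.

    A larger gap between the two child-leaf effects means the split separates
    units with different treatment effects. (The actual causal-tree criterion
    also subtracts a variance penalty; omitted here for brevity.)
    """
    left = x_col <= threshold
    right = ~left
    # Require both treated and control units in both children
    for mask in (left, right):
        if w[mask].sum() == 0 or (1 - w[mask]).sum() == 0:
            return -np.inf
    return (leaf_tau(y[left], w[left]) - leaf_tau(y[right], w[right])) ** 2

# Toy usage: pick the best threshold for a single feature
rng = np.random.default_rng(0)
x = rng.normal(size=500)
w = rng.binomial(1, 0.5, size=500)
y = (1 + 2 * (x > 0)) * w + rng.normal(size=500)   # true effect is 1 vs 3 depending on x
candidates = np.quantile(x, np.linspace(0.1, 0.9, 9))
best = max((split_score(y, w, x, t), t) for t in candidates)
print("best split score, threshold:", best)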

2. Honesty: Sample Splitting for Valid Inference

Using the same data for both constructing the tree (choosing splits) and estimating leaf values leads to overfitting and biased estimates. Honesty solves this.

Honest Tree Construction:

  1. Split data into two subsamples: I (splitting) and J (estimation)
  2. Use sample I to determine tree structure (where to split, which features)
  3. Use sample J to estimate treatment effects τ̂ in each leaf
  4. Never let the same observation influence both structure and estimates

Benefits of Honesty:

  • Unbiased estimates: Leaf estimates aren't inflated by adaptive selection
  • Valid confidence intervals: Can derive asymptotic distributions
  • Better generalization: Tree structure learned on independent data
  • Principled inference: Enables theoretical guarantees

Intuition: It's like using a train/test split, but for tree structure vs. leaf values. The tree "shape" comes from one sample, the numbers in leaves from another. This prevents "peeking" at outcomes when deciding where to split.
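
A minimal sketch of honest estimation, using a transformed-outcome tree as a simplified stand-in for the causal-tree splitting rule from section 1: the rows that choose the splits (sample I) never contribute to the leaf estimates (sample J).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 5))
W = rng.binomial(1, 0.5, size=n)                 # randomized treatment
tau = 1 + 2 * (X[:, 0] > 0)                      # true effect: 1 or 3
Y = tau * W + X[:, 1] + rng.normal(size=n)

# Honest split: half the rows (I) pick the tree structure, half (J) fill in leaf estimates
idx = rng.permutation(n)
I, J = idx[: n // 2], idx[n // 2:]

# Structure from sample I only: tree fit on the IPW transformed outcome
# (2W - 1) * 2Y, whose conditional mean equals tau(x) under 50/50 randomization
Z = (2 * W - 1) * 2 * Y
structure = DecisionTreeRegressor(max_depth=3, min_samples_leaf=200).fit(X[I], Z[I])

# Leaf estimates from sample J only: difference in means within each leaf
leaves_J = structure.apply(X[J])
tau_hat = {}
for leaf in np.unique(leaves_J):
    m = leaves_J == leaf
    yj, wj = Y[J][m], W[J][m]
    tau_hat[leaf] = yj[wj == 1].mean() - yj[wj == 0].mean()

print(tau_hat)   # leaves split on X[:,0] should show effects near 1 and 3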

3. Causal Forests: Ensemble of Honest Trees

Just as random forests improve on single trees, causal forests aggregate many honest causal trees to get better estimates with lower variance.

Causal Forest Algorithm:

  1. For b = 1, ..., B (number of trees):
    • Draw bootstrap sample or subsample
    • Split into I_b (structure) and J_b (estimation)
    • Grow honest causal tree with random feature subsampling
    • Estimate τ̂_b(x) for each leaf using J_b
  2. Final estimate: τ̂(x) = (1/B) Σ τ̂_b(x)
  3. Can also compute pointwise confidence intervals

Key Innovations:

  • Subsampling without replacement: trees are grown on subsamples rather than classic bootstrap samples, which underpins the asymptotic theory for causal forests
  • Adaptive neighbors: Estimate τ(x) using observations in same leaves across trees
  • Variable importance: Identify which features drive treatment heterogeneity
  • Honest splitting + bagging: Double protection against overfitting

How Prediction Works:

For a new observation x, we find which leaf it falls into in each tree, then average treatment effects from those leaves:

τ̂(x) = Σ_b α_b(x) · (Ȳ₁_b - Ȳ₀_b)
// α_b(x) are weights based on leaf membership
// Ȳ₁_b, Ȳ₀_b are treated/control means in leaf
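
A minimal numpy sketch of this weighting idea, reusing the transformed-outcome trick from the honesty sketch in section 2 to grow the trees (a real causal forest would also keep structure and estimation samples separate within each tree). Here α_b(x) is simply the indicator that a training row shares x's leaf in tree b:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=(n, 5))
W = rng.binomial(1, 0.5, size=n)
Y = (1 + 2 * (X[:, 0] > 0)) * W + X[:, 1] + rng.normal(size=n)
Z = (2 * W - 1) * 2 * Y                      # transformed outcome, E[Z|X] = tau(X)

# Grow a small "forest": each tree sees a random half-sample and a feature subset
trees = []
for b in range(50):
    sub = rng.choice(n, size=n // 2, replace=False)
    t = DecisionTreeRegressor(max_depth=4, min_samples_leaf=100, max_features=3,
                              random_state=b).fit(X[sub], Z[sub])
    trees.append((t, sub))

def forest_cate(x_new):
    """tau_hat(x) = average over trees of the diff-in-means in x's leaf."""
    tau_b = []
    for t, sub in trees:
        leaves = t.apply(X[sub])
        leaf = t.apply(x_new.reshape(1, -1))[0]
        m = leaves == leaf                    # alpha_b(x): same-leaf indicator
        y, w = Y[sub][m], W[sub][m]
        if w.sum() and (1 - w).sum():         # need both groups in the leaf
            tau_b.append(y[w == 1].mean() - y[w == 0].mean())
    return float(np.mean(tau_b))

print(forest_cate(np.array([ 1., 0., 0., 0., 0.])))   # true effect 3 (x1 > 0)
print(forest_cate(np.array([-1., 0., 0., 0., 0.])))   # true effect 1 (x1 < 0)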

4. Generalized Random Forests (GRF)

GRF extends causal forests to a unified framework for local moment estimation, enabling estimation of various causal quantities beyond simple treatment effects.

GRF Framework (Athey, Tibshirani, Wager 2019):

Instead of targeting a specific estimand, GRF solves local versions of moment equations:

E[ψ(Oᵢ; θ(x)) | Xᵢ ≈ x] = 0
// ψ is a moment function (e.g., score)
// θ(x) is parameter of interest at x
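
To make this concrete, the heterogeneous-treatment-effect case roughly corresponds to the moment function below (grf additionally centers Y and W with auxiliary regression forests before solving it):

ψ(Oᵢ; τ(x), c(x)) = (Yᵢ - c(x) - τ(x)·Wᵢ) · (1, Wᵢ)
// Setting its local (forest-weighted) average to zero is a weighted
// least-squares regression of Y on W around x; the slope is the CATE τ(x)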

What GRF Can Estimate:

  • Causal effects: CATE with various identification strategies
  • Quantile treatment effects: How treatment affects distribution of Y
  • Survival/hazard effects: Time-to-event treatment effects
  • Instrumental variable effects: Local average treatment effects with IV
  • Policy learning: Optimal treatment assignment rules

Practical Advantage: GRF provides a consistent interface across different causal estimands. Same forest-building machinery, just swap the moment function.

5. Variable Importance for Treatment Heterogeneity

Beyond estimating τ(x), we often want to know which features drive treatment effect heterogeneity. Variable importance measures help prioritize subgroup analyses.

Methods for Assessing Importance:

1. Split-based importance:

How often a variable is used for splitting, weighted by improvement in heterogeneity.

2. Permutation importance:

Randomly permute variable and measure decrease in forest's ability to detect heterogeneity.

3. SHAP values for CATE:

Decompose individual-level CATE predictions into feature contributions.

4. Best Linear Projection (BLP):

Project τ̂(X) onto individual features to quantify linear relationships.

τ̂(Xᵢ) ≈ β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + εᵢ
// β coefficients show importance
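
A simplified Python sketch of this projection. Note that grf's best_linear_projection regresses doubly robust scores rather than the raw CATE predictions, so treat this only as an illustration; the synthetic τ̂ below stands in for a fitted forest's output:

import numpy as np
import statsmodels.api as sm

# Illustration with synthetic CATE predictions (in practice, tau_hat comes from the forest)
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
tau_hat = 1 + 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1000)

# Simplified BLP: OLS of the predicted CATE on the covariates of interest
blp = sm.OLS(tau_hat, sm.add_constant(X[:, :3])).fit()
print(blp.params)      # beta coefficients: how tau_hat varies with X1, X2, X3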

Practical Use Cases:

  • Targeting: Identify high-value customer segments for personalized offers
  • Resource allocation: Focus interventions where they're most effective
  • Hypothesis generation: Discover unexpected moderators of treatment
  • Policy design: Understand which populations benefit most from programs

Lab: CATE Estimation with GRF and CausalML

Let's implement causal forests using the grf package in R and causalml in Python.

R: Using GRF Package

# Install GRF
install.packages("grf")
library(grf)
# Simulate data
n <- 2000
p <- 10
X <- matrix(rnorm(n * p), n, p)
# Treatment depends on X[,1]
propensity <- 1 / (1 + exp(-X[,1]))
W <- rbinom(n, 1, propensity)
# Heterogeneous treatment effect
tau <- 1 + 2 * X[,1] + X[,2] # effect varies with X[,1] and X[,2]
Y <- tau * W + X[,1] + rnorm(n)
# Train causal forest
cf <- causal_forest(X, Y, W,
                    num.trees = 4000,
                    honesty = TRUE,
                    tune.parameters = "all")
# Predict CATE
tau_hat <- predict(cf, X, estimate.variance = TRUE)
# Average treatment effect
ate <- average_treatment_effect(cf)
print(paste("ATE:", round(ate["estimate"], 3)))
print(paste("95% CI:", round(ate["estimate"] - 1.96*ate["std.err"], 3),
"to", round(ate["estimate"] + 1.96*ate["std.err"], 3)))
# Variable importance
varimp <- variable_importance(cf)
ranked_vars <- order(varimp, decreasing = TRUE)
print("Most important variables for heterogeneity:")
print(ranked_vars[1:5])
# Best linear projection
blp <- best_linear_projection(cf, X[, 1:3])
print(blp) # Shows how CATE varies with each covariate

Python: Using CausalML

# Install CausalML first (from a shell): pip install causalml
import numpy as np
from causalml.inference.tree import UpliftTreeClassifier, UpliftRandomForestClassifier
from causalml.inference.meta import BaseXRegressor
from sklearn.ensemble import RandomForestRegressor
# Simulate data
n = 2000
X = np.random.randn(n, 10)
propensity = 1 / (1 + np.exp(-X[:, 0]))
treatment = np.random.binomial(1, propensity)
tau = 1 + 2*X[:, 0] + X[:, 1]
y = tau * treatment + X[:, 0] + np.random.randn(n)
# Method 1: Uplift Random Forest
# causalml expects treatment group labels (strings) plus the control group's name
treatment_str = np.where(treatment == 1, 'treatment', 'control')
uplift_rf = UpliftRandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=50,
    control_name='control'
)
# For classification (convert to binary outcome)
y_binary = (y > np.median(y)).astype(int)
uplift_rf.fit(X, treatment_str, y_binary)
cate_uplift = uplift_rf.predict(X)
# Method 2: X-learner with Random Forest
xl = BaseXRegressor(
    learner=RandomForestRegressor(n_estimators=100)
)
xl.fit(X, treatment, y)
cate_xl = xl.predict(X)
# Evaluate performance
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(tau, cate_xl.flatten())
print(f"MSE for CATE prediction: {mse:.3f}")
# Variable importance (feature importance)
importance = uplift_rf.feature_importances_
for i, imp in enumerate(importance):
    print(f"Feature {i}: {imp:.4f}")

Advanced: Confidence Intervals with GRF

# Get pointwise confidence intervals
tau_hat <- predict(cf, X, estimate.variance = TRUE)
# Extract estimates and standard errors
cate <- tau_hat$predictions
cate_se <- sqrt(tau_hat$variance.estimates)
# Construct 95% CIs
ci_lower <- cate - 1.96 * cate_se
ci_upper <- cate + 1.96 * cate_se
# Visualize CATE vs covariate with CIs
library(ggplot2)
df <- data.frame(x1 = X[,1], cate = cate,
                 ci_lower = ci_lower, ci_upper = ci_upper)
ggplot(df, aes(x = x1, y = cate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper),
              alpha = 0.2, fill = "blue") +
  labs(title = "Conditional Average Treatment Effect",
       x = "Covariate X1", y = "CATE")

Causal Forests vs Other Methods

  • Causal Forest
    Pros: non-parametric (no functional form assumptions); valid confidence intervals; variable importance; handles high-dimensional X.
    Cons: slower than meta-learners; less interpretable (black box); needs larger samples.
  • DML
    Pros: flexible ML for nuisance functions; valid inference; efficient estimates.
    Cons: assumes a parametric form for τ(x); requires a correctly specified model for θ.
  • Meta-learners
    Pros: simple to implement; fast; work with any base learner.
    Cons: no built-in inference; can be biased (especially S- and T-learners).
  • Matching
    Pros: interpretable; no parametric assumptions.
    Cons: curse of dimensionality; inefficient with many features.

When to Use Causal Forests: Best when you expect complex, non-linear treatment heterogeneity and need valid inference. If you just need point estimates quickly, meta-learners might suffice. For sparse heterogeneity or interpretability, consider Lasso-based methods.

Practical Considerations

Hyperparameter Tuning:

  • num.trees: More is better (2000-4000 typical), but diminishing returns
  • min.node.size: Leaf size controls smoothness (larger = smoother estimates)
  • sample.fraction: Subsampling rate (0.5 often works well)
  • honesty.fraction: Split between I and J samples (0.5 is common)
  • Use built-in tuning: tune.parameters = "all" in grf

Diagnostics:

  • Overlap: Check propensity score distributions (same as other methods; see the sketch after this list)
  • Calibration: Does τ̂(X) predict actual treatment heterogeneity?
  • Stability: Re-run with different seeds; estimates should be stable
  • Out-of-bag predictions: Use OOB errors to assess fit quality
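
For the overlap check, a minimal sketch assuming a binary treatment W and covariate matrix X (the 0.05/0.95 cutoffs are a common rule of thumb, not part of any package API):

import numpy as np
from sklearn.linear_model import LogisticRegression

def check_overlap(X, W):
    """Estimate propensity scores and flag regions with poor overlap."""
    e_hat = LogisticRegression(max_iter=1000).fit(X, W).predict_proba(X)[:, 1]
    print("propensity range, treated:", e_hat[W == 1].min(), e_hat[W == 1].max())
    print("propensity range, control:", e_hat[W == 0].min(), e_hat[W == 0].max())
    # Rule of thumb: worry if many units have e_hat outside ~[0.05, 0.95]
    frac_extreme = np.mean((e_hat < 0.05) | (e_hat > 0.95))
    print(f"share of units with extreme propensities: {frac_extreme:.1%}")
    return e_hat

# Usage: e_hat = check_overlap(X, W)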

Common Mistakes:

  • Using too few trees (leads to noisy estimates)
  • Ignoring honesty (biases estimates and invalidates inference)
  • Over-interpreting small CATE differences (check if CIs overlap)
  • Forgetting to check overlap before estimation
  • Not adjusting for multiple testing when exploring subgroups

Key Takeaways

1. Adapt Trees for Causality: Causal trees split on heterogeneity, not prediction error, to discover treatment effect variation
2. Honesty is Essential: Sample splitting between structure and estimation prevents bias and enables valid inference
3. Forests for Stability: Ensemble many honest trees to reduce variance and improve estimates
4. GRF Generalizes: Unified framework for many causal estimands beyond simple treatment effects
5. Variable Importance: Discover which features moderate treatment effects for targeting and understanding
6. When to Use: Best for complex heterogeneity, sufficient sample size, and when you need confidence intervals

Further Reading

Foundational Papers:
  • Athey & Imbens (2016), "Recursive Partitioning for Heterogeneous Causal Effects"
  • Wager & Athey (2018), "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests"
  • Athey, Tibshirani & Wager (2019), "Generalized Random Forests"
Software:
  • grf R package - Reference implementation with excellent documentation
  • CausalML Python - Python library with uplift forests and meta-learners
  • EconML - Also includes causal forest variants
Extensions:
  • Davis & Heller (2017), "Using Causal Forests to Predict Treatment Heterogeneity"
  • Nie & Wager (2021), "Quasi-Oracle Estimation of Heterogeneous Treatment Effects"
  • Künzel et al. (2019) - Comparing causal forests with meta-learners

Business Case Study: Interview Approach

📊 Case: DoorDash Personalized Discount Optimization

Context: You're a data scientist at DoorDash. The company currently offers a blanket 20% discount to all customers who haven't ordered in 60 days. Marketing wants to optimize this strategy: instead of giving everyone the same discount, can we personalize discount amounts based on who benefits most?

Data: Historical A/B test with 100K lapsed customers:

  • Treatment: 50K received 20% discount
  • Control: 50K received no discount
  • Outcome: Order placed within 30 days (binary)
  • Covariates: Customer tenure, order history (RFM), location, restaurant preferences, device type, time of churn, previous discounts used, customer support interactions (~40 features)

Business question: Should we give 20% off to all 2M lapsed customers (costs $X million)? Or can we target only high-CATE customers—those who respond strongly to discounts—to maximize ROI?

Your task: Use Causal Forests to estimate heterogeneous treatment effects τ(x), identify which customer segments benefit most from discounts, and design a personalized targeting policy that maximizes incremental orders per dollar spent.

Step 1: Why Causal Forests? Setup & Motivation

The Heterogeneity Hypothesis:

Not all customers respond equally to discounts. Likely variation by:

  • Price sensitivity: Budget-conscious customers respond more to discounts
  • Engagement: Already-engaged customers might order without discount; low-engagement customers need incentive
  • Churn reason: Moved away? Switched to competitor? Different responses.
  • Order history: Heavy users vs. light users have different discount thresholds

Why Causal Forests are Perfect for This:

  1. Automatic heterogeneity discovery: No need to pre-specify which features create heterogeneity—forest finds them
  2. Non-parametric: Captures complex, non-linear patterns (e.g., discount works for tenure 6-12 months but not <6 or >12)
  3. Honest estimation: Uses sample splitting to avoid overfitting, gives valid CIs for τ(x)
  4. Scalable: Handles 40+ features easily, unlike manual subgroup analysis

Alternative Approaches (and why they're worse):

  1. Single ATE: Just report average treatment effect (e.g., +8% order rate). Problem: Hides heterogeneity, can't optimize targeting.
  2. Manual subgroups: Split by tenure (new/old), location (urban/suburban), analyze each. Problem: 2⁴ = 16 subgroups with 4 binary features; 2⁴⁰ with 40 features (intractable). Plus, ignores interactions.
  3. Regression with interactions: Y ~ Discount × Tenure + Discount × Location + ... Problem: Need to manually specify all interactions. Causal forest does this automatically.
  4. Standard Random Forest on Y: Predicts who orders, not who responds to discount. Confuses baseline propensity with treatment effect.

Step 2: Causal Forest Implementation

Step 2a: Data Preparation


import pandas as pd
import numpy as np
from econml.dml import CausalForestDML

# Load A/B test data
df = pd.read_csv('lapsed_customers_experiment.csv')

# Treatment: binary (1 = got discount, 0 = control)
W = df['discount_received'].values

# Outcome: binary (1 = ordered, 0 = didn't order)
Y = df['ordered_within_30days'].values

# Covariates: 40 customer features
X = df[['tenure_days', 'total_orders', 'avg_order_value', 'days_since_last_order',
        'favorite_cuisine_diversity', 'urban', 'app_user', 'support_tickets',
        'previous_discounts_used', ...  # 40 total
]].values

print(f"Sample size: {len(Y)}")
print(f"Treatment: {W.sum()}/{len(W)} ({W.mean()*100:.1f}% treated)")
print(f"Outcome (control): {Y[W==0].mean()*100:.1f}%")
print(f"Outcome (treated): {Y[W==1].mean()*100:.1f}%")
print(f"Naive ATE: {(Y[W==1].mean() - Y[W==0].mean())*100:.1f} pp")

# Example output:
#   Sample size: 100000
#   Treatment: 50000/100000 (50.0% treated)
#   Outcome (control): 5.2%
#   Outcome (treated): 13.4%
#   Naive ATE: 8.2 pp
                    

Step 2b: Fit Causal Forest


# NOTE: illustrative API. This assumes a Python wrapper with a grf-style
# interface (e.g., skgrf, or econml's CausalForestDML imported in Step 2a);
# parameter and method names below mirror the R grf package and may differ.
from grf import CausalForest

# Fit causal forest with honest splitting
cf = CausalForest(
    n_estimators=4000,      # More trees = more stable
    min_samples_leaf=50,    # Min obs per leaf (avoid overfitting)
    max_depth=None,         # Let tree grow naturally
    honest=True,            # CRITICAL: use honest trees
    honesty_fraction=0.5,   # 50% for splits, 50% for estimates
    inference=True,         # Compute standard errors
    random_state=42
)

# Fit on data
cf.fit(X, Y, W)

# Estimate CATE for each observation
tau_hat = cf.predict(X)  # Individual treatment effects τ(x_i)
tau_stderr = cf.predict_stderr(X)  # Standard errors (from inference)

print(f"\nCaTE Statistics:")
print(f"  Mean CATE: {tau_hat.mean()*100:.2f} pp")
print(f"  Median CATE: {np.median(tau_hat)*100:.2f} pp")
print(f"  Min CATE: {tau_hat.min()*100:.2f} pp")
print(f"  Max CATE: {tau_hat.max()*100:.2f} pp")
print(f"  Std CATE: {tau_hat.std()*100:.2f} pp")

# Example:
#   Mean CATE: 8.1 pp (≈ ATE, good sign)
#   Median CATE: 7.3 pp
#   Min CATE: -2.1 pp (some customers negatively affected!)
#   Max CATE: 24.5 pp (high responders)
#   Std CATE: 5.8 pp (substantial heterogeneity!)
                    

Key Parameters Explained:

  • n_estimators: More trees → more stable CATE estimates. 2000-5000 typical.
  • min_samples_leaf: Trade-off: smaller → more flexible (risk overfit); larger → smoother (less local). 20-100 typical.
  • honest=True: CRITICAL. Uses separate samples for splitting vs estimating. Without this, CATE estimates are biased.
  • honesty_fraction: 0.5 means 50-50 split. Some recommend 0.7 (more data for estimation).
  • inference=True: Computes standard errors for τ(x). Needed for confidence intervals.

Step 2c: Validate Causal Forest Fit


# Check 1: Mean CATE ≈ ATE from naive comparison
ate_naive = Y[W==1].mean() - Y[W==0].mean()
ate_cf = tau_hat.mean()
print(f"ATE (naive): {ate_naive*100:.2f} pp")
print(f"ATE (causal forest): {ate_cf*100:.2f} pp")
print(f"Difference: {abs(ate_naive - ate_cf)*100:.2f} pp")
# Should be close (< 1pp difference)

# Check 2: Out-of-bag prediction quality
# (GRF package automatically computes OOB MSE)
print(f"\nOOB MSE: {cf.oob_prediction_error_:.4f}")
# Lower is better; compare to baseline (constant effect model)

# Check 3: Variable importance
# (feature_names below: list of the 40 covariate column names from Step 2a)
var_importance = cf.feature_importances_
top_features = np.argsort(var_importance)[-10:][::-1]
print("\nTop 10 features driving heterogeneity:")
for idx in top_features:
    print(f"  {feature_names[idx]}: {var_importance[idx]:.4f}")

# Example output:
#   1. days_since_last_order: 0.18
#   2. total_orders: 0.14
#   3. avg_order_value: 0.11
#   4. tenure_days: 0.09
#   5. urban: 0.07
                    
Step 3: Discover & Interpret Heterogeneity

Analysis 1: Quantile Analysis—Distribution of Treatment Effects


# Split customers into quintiles by CATE
cate_quintiles = pd.qcut(tau_hat, q=5, labels=['Q1 (Lowest)', 'Q2', 'Q3', 'Q4', 'Q5 (Highest)'])

for q in ['Q1 (Lowest)', 'Q2', 'Q3', 'Q4', 'Q5 (Highest)']:
    mask = (cate_quintiles == q)
    print(f"\n{q}:")
    print(f"  CATE range: [{tau_hat[mask].min()*100:.1f}pp, {tau_hat[mask].max()*100:.1f}pp]")
    print(f"  Avg CATE: {tau_hat[mask].mean()*100:.1f}pp")
    print(f"  Share of total lift: {tau_hat[mask].sum()/tau_hat.sum()*100:.1f}%")

# Example output:
#   Q5 (Highest): CATE range: [15.2pp, 24.5pp], Avg: 18.9pp
#     → Share of total lift: 42% from top 20% of customers!
#   Q1 (Lowest): CATE range: [-2.1pp, 2.8pp], Avg: 0.9pp
#     → Almost no response—giving discounts is wasteful here
                    

Key insight: Top quintile drives 42% of total incremental orders while being only 20% of customers → massive targeting opportunity!

Analysis 2: Best Linear Projection (BLP) Test for Heterogeneity


# Rough calibration-style check: regress observed outcomes (treated units) on the
# predicted CATE. If real heterogeneity exists, the slope should be significantly
# different from 0. (grf's test_calibration and the rank-weighted ATE in Step 5
# give more rigorous versions of this check.)

from statsmodels.api import OLS, add_constant

# Actual treatment effect proxy (for treated units)
y_treated = Y[W==1]
tau_treated = tau_hat[W==1]

# BLP regression: Y ~ intercept + CATE
model = OLS(y_treated, add_constant(tau_treated)).fit()
print("\nBest Linear Projection Test:")
print(f"  Intercept: {model.params[0]:.4f} (p={model.pvalues[0]:.4f})")
print(f"  Slope: {model.params[1]:.4f} (p={model.pvalues[1]:.4f})")
print(f"  R²: {model.rsquared:.4f}")

# Interpretation:
#   Slope ≈ 1 → CATE predictions are well-calibrated
#   p-value < 0.05 → Significant heterogeneity detected
#   R² > 0 → CATE explains variation in outcomes

# Example:
#   Slope: 0.89 (p < 0.001) → Significant heterogeneity!
#   R²: 0.12 → CATE explains 12% of outcome variation
                    

Analysis 3: Partial Dependence Plots—Which Features Drive Heterogeneity?


import matplotlib.pyplot as plt
from sklearn.inspection import partial_dependence

# Partial dependence: How does CATE vary with each feature?
# (assumes cf exposes a scikit-learn-compatible predict(); otherwise compute PD
#  manually by varying one feature over a grid and averaging cf.predict)
feature_idx = {name: i for i, name in enumerate(feature_names)}  # column lookup
features_to_plot = ['days_since_last_order', 'total_orders',
                    'avg_order_value', 'tenure_days']

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for idx, feature in enumerate(features_to_plot):
    ax = axes[idx // 2, idx % 2]

    # Compute PD
    pd_result = partial_dependence(cf, X, features=[feature_idx[feature]])

    ax.plot(pd_result['values'][0], pd_result['average'][0])
    ax.set_xlabel(feature)
    ax.set_ylabel('CATE (pp)')
    ax.set_title(f'CATE vs {feature}')
    ax.grid(alpha=0.3)

plt.tight_layout()

# Example findings from PD plots:
#   - days_since_last_order: CATE peaks at 60-90 days, drops for >120 days
#     → These customers are "retrievable," very long-churned customers less so
#   - total_orders: CATE highest for moderate users (10-30 orders),
#     lower for very light (<5) and heavy (>50) users
#   - avg_order_value: Negative relationship—high AOV customers don't need discount
#   - tenure_days: U-shaped—new customers and very old customers respond less
                    

Analysis 4: Policy Tree—Interpretable Segmentation


from sklearn.tree import DecisionTreeRegressor, plot_tree

# Fit decision tree on CATE to create interpretable rules
policy_tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=1000)
policy_tree.fit(X, tau_hat)

# Visualize
plt.figure(figsize=(20, 10))
plot_tree(policy_tree, feature_names=feature_names, filled=True)
plt.title('Policy Tree: Which customers have high CATE?')

# Example tree:
#   Root: days_since_last_order < 90?
#     Yes → total_orders < 20?
#       Yes → CATE = 4.2pp (low responders)
#       No → CATE = 16.8pp (HIGH responders)
#     No → CATE = 2.1pp (long-churned, low response)

# Create targeting segments from tree
segments = policy_tree.apply(X)
for seg in np.unique(segments):
    mask = (segments == seg)
    print(f"\nSegment {seg}: n={mask.sum()}")
    print(f"  Avg CATE: {tau_hat[mask].mean()*100:.1f}pp")
    print(f"  Rule: {get_tree_path(policy_tree, seg)}")
                    
Step 4: Design Optimal Targeting Policy

Targeting Strategy: Threshold-Based


# Simulate different targeting policies: Give discount to customers with CATE > threshold

discount_cost = 5  # $5 cost per discount given
revenue_per_order = 8  # $8 revenue per incremental order

thresholds = np.linspace(0, 0.20, 21)  # 0% to 20% CATE thresholds
results = []

for threshold in thresholds:
    # Target customers with CATE > threshold
    target_mask = (tau_hat > threshold)
    n_targeted = target_mask.sum()

    # Expected incremental orders
    incremental_orders = tau_hat[target_mask].sum()

    # Cost vs benefit
    total_cost = n_targeted * discount_cost
    total_revenue = incremental_orders * revenue_per_order
    net_value = total_revenue - total_cost
    roi = net_value / total_cost if total_cost > 0 else 0

    results.append({
        'threshold': threshold * 100,
        'n_targeted': n_targeted,
        'pct_targeted': n_targeted / len(tau_hat) * 100,
        'incremental_orders': incremental_orders,
        'total_cost': total_cost,
        'total_revenue': total_revenue,
        'net_value': net_value,
        'roi': roi * 100
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
                    

Example Results:

Threshold   % Targeted   Net Value   ROI
0%          100%         $310K       31%
5%          65%          $420K       65%
10%         35%          $490K ⭐     140%
15%         15%          $380K       250%
20%         5%           $180K       360%

Optimal policy: Target customers with CATE > 10% → 35% of customers, $490K net value (58% more than blanket policy!)

Recommendation to Leadership:

Current Policy (Blanket 20% off):

  • Target: All 2M lapsed customers (100%)
  • Cost: $10M (2M × $5 discount cost)
  • Incremental orders: 164K (2M × 8.2% ATE)
  • Revenue: $1.31M (164K × $8)
  • Net value: $310K (ROI: 31%)

Proposed Policy (Personalized Targeting at CATE > 10%):

  • Target: 700K customers (35% of lapsed base)
  • Cost: $3.5M (700K × $5)
  • Incremental orders: 113K (targeting high responders)
  • Revenue: $904K (113K × $8)
  • Net value: $490K (ROI: 140%)

✓ Save $6.5M in discount costs
✓ Increase net value by $180K (58% improvement)
✓ Improve ROI from 31% → 140%

Step 5: Validate CATE Estimates & Targeting Policy

Validation 1: Rank-Weighted Average Treatment Effect (RATE)


# RATE test: Do high-CATE customers actually have higher treatment effects?
# Regress outcomes on treatment, weighted by CATE rank

from scipy.stats import rankdata

# Weight each observation by its CATE rank (higher predicted effect = higher weight);
# this is a rough proxy for grf's rank_average_treatment_effect
ranks = rankdata(tau_hat)
weights = ranks / ranks.sum()

# Rank-weighted means for treated/control (normalize weights within each group)
y_treated_weighted = (Y[W==1] * weights[W==1]).sum() / weights[W==1].sum()
y_control_weighted = (Y[W==0] * weights[W==0]).sum() / weights[W==0].sum()
rate = y_treated_weighted - y_control_weighted

print(f"RATE (rank-weighted ATE): {rate*100:.2f} pp")
print(f"ATE (unweighted): {tau_hat.mean()*100:.2f} pp")

# If RATE > ATE → high-CATE customers have higher actual effects ✓
# If RATE ≈ ATE → no heterogeneity (CATE is noise)

# Example: RATE = 12.4pp vs ATE = 8.1pp → Validation success!
                    

Validation 2: Out-of-Sample Policy Evaluation


# Hold out 20% of data for validation
from sklearn.model_selection import train_test_split

X_train, X_val, Y_train, Y_val, W_train, W_val = train_test_split(
    X, Y, W, test_size=0.2, random_state=42
)

# Fit CF on train set
cf.fit(X_train, Y_train, W_train)

# Predict CATE on validation set (out-of-sample!)
tau_hat_val = cf.predict(X_val)

# Evaluate policy on validation set
threshold = 0.10
target_mask_val = (tau_hat_val > threshold)

# Actual outcomes in validation set
ate_val_targeted = (Y_val[W_val==1][target_mask_val[W_val==1]].mean() -
                     Y_val[W_val==0][target_mask_val[W_val==0]].mean())

print(f"\nOut-of-Sample Validation:")
print(f"  Predicted CATE (targeted group): {tau_hat_val[target_mask_val].mean()*100:.2f} pp")
print(f"  Actual ATE (targeted group): {ate_val_targeted*100:.2f} pp")
print(f"  Difference: {abs(tau_hat_val[target_mask_val].mean() - ate_val_targeted)*100:.2f} pp")

# Small difference (<2pp) → good out-of-sample performance ✓
                    

Validation 3: A/B Test Confirmation (Recommended)

Before rolling out personalized targeting to all 2M customers, run a pilot A/B test:

  • Treatment: 50K customers, give discount only if CATE > 10%
  • Control: 50K customers, give discount to all (blanket policy)
  • Measure: Net value (revenue - cost) per customer over 30 days
  • Validate: Treatment group should have higher net value (predicted: +$0.09 per customer)

If pilot confirms CATE-based targeting outperforms blanket policy → roll out to full 2M customer base.

Step 6: Handling Interview Follow-Ups

Q1: How do you prevent overfitting with so many features (40)?

Answer:

  • Honest trees: Separate samples for splitting decisions vs. effect estimates → eliminates overfitting at leaf level
  • min_samples_leaf: Force at least 50-100 observations per leaf → prevents tiny, noisy leaves
  • Cross-validation: Tune hyperparameters (n_trees, min_samples_leaf) using held-out validation set
  • Out-of-bag (OOB) error: Each tree uses bootstrap sample → validate on OOB observations
  • Multiple random splits: Forest averages across 4000 trees, each seeing different random features → reduces variance

Q2: What if customers with high CATE would have ordered anyway (without discount)?

Answer:

This is a critical concern—we want CATE (treatment effect), not baseline propensity to order.

Why Causal Forest solves this:

  • CF explicitly models treatment effect τ(x) = E[Y(1) - Y(0) | X=x], not outcome Y(x)
  • Splits trees based on heterogeneity in treatment effect, not outcome level
  • Uses both treated and control groups to difference out baseline propensity

Sanity check: Plot baseline outcome (control group) vs. CATE. If uncorrelated → good. If correlated → may be confusing propensity with treatment effect (check model specification).
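
A minimal version of that sanity check (the baseline model and the 0.3 cutoff are illustrative choices, continuing with X, W, Y, and tau_hat from the earlier steps):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Baseline order propensity learned on control units only
baseline_model = RandomForestRegressor(n_estimators=200, min_samples_leaf=50)
baseline_model.fit(X[W == 0], Y[W == 0])
baseline_hat = baseline_model.predict(X)

corr = np.corrcoef(baseline_hat, tau_hat)[0, 1]
print(f"corr(baseline outcome, CATE) = {corr:.2f}")
# Near zero: fine. Strongly positive (e.g., > 0.3): revisit the model, since the
# forest may be partly picking up who orders anyway rather than who responds.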

Q3: Can we use CATE estimates for continuous treatments (e.g., discount amount)?

Answer:

Yes! Causal forests generalize to continuous treatments. Instead of binary (discount/no discount), estimate dose-response function τ(x, d) where d is discount level (0%, 10%, 20%, 30%).

Approach:

  1. Run A/B test with multiple discount levels (0%, 10%, 20%, 30%)
  2. Fit causal forest with continuous treatment W ∈ [0, 0.30]
  3. Estimate marginal effect: ∂E[Y|X,W]/∂W for each customer
  4. Optimize discount level per customer: max_d { E[Y|X,d] × revenue - d × cost }

This enables fully personalized discounts (e.g., 8% for customer A, 22% for customer B).
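
A sketch of what that could look like with econml's CausalForestDML (imported in Step 2a). The 'discount_level' column and parameter choices are illustrative, and exact arguments may differ across econml versions:

import numpy as np
from econml.dml import CausalForestDML

# Treat the discount level D in [0, 0.30] as a continuous treatment.
# Assumes the historical experiment randomized several discount levels and that
# df has a 'discount_level' column (hypothetical name for this sketch).
D = df['discount_level'].values

est = CausalForestDML(discrete_treatment=False, n_estimators=2000, random_state=42)
est.fit(Y, D, X=X)

# Per-customer marginal effect of a slightly larger discount: d E[order | X, D] / dD
marginal_effect = est.const_marginal_effect(X)

# Customers with large marginal effects are candidates for deeper discounts;
# the per-customer optimum then trades this off against the discount's cost.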

Q4: How do you communicate CATE findings to non-technical stakeholders?

Answer:

  1. Start with business impact: "We can save $6.5M while increasing net value by $180K"
  2. Visualize quintiles: Bar chart showing CATE for Q1-Q5, label "High Responders" vs "Low Responders"
  3. Use policy tree: Show decision tree (max depth 3) with simple rules: "Target customers who churned 60-90 days ago AND made 10-30 orders"
  4. Show ROI curve: Plot Net Value vs % Targeted, highlight optimal point
  5. Avoid jargon: Don't say "CATE," say "predicted response to discount." Don't say "honest trees," say "validated with held-out data."
  6. Frame as A/B test: "We tested this on historical data and confirmed with pilots. Now we roll out."

Common Pitfalls in Causal Forest Analysis
  • Pitfall 1: Not using honest trees
    Without honest=True, CATE estimates are overfitted (biased). Always use honest splitting for causal inference.
  • Pitfall 2: Confusing CATE with propensity to respond
    High baseline likelihood to order ≠ high treatment effect. Causal Forest separates these, but sanity check by plotting Y(control) vs CATE—should be uncorrelated.
  • Pitfall 3: Not validating out-of-sample
    CATE predictions can overfit. Hold out validation set or use cross-fitting. Always pilot targeting policy in A/B test before full rollout.
  • Pitfall 4: Ignoring standard errors
    CATE estimates have uncertainty. Use inference=True to get standard errors, and don't target customers where CI includes zero.
  • Pitfall 5: Overly aggressive targeting
    Targeting only top 5% maximizes ROI but leaves money on table (small scale). Find optimal threshold that balances scale and efficiency.
  • Pitfall 6: Assuming CATE is stable over time
    Customer preferences change. Retrain causal forest quarterly, and monitor policy performance to detect drift.
  • Pitfall 7: Not considering fairness/equity
    Personalized targeting may systematically exclude certain demographics. Check if CATE correlates with protected attributes, and consider fairness constraints.