
Module 1, Week 1: Potential Outcomes Framework

Article 1 of 13 · 15 min read

📊 Running Example: Promotional Discount Campaign

Throughout this entire series, we'll use a consistent example: Does offering a 20% discount promo code increase customer purchases?

Imagine you're a data scientist at an e-commerce company. Your marketing team wants to know if sending customers a promotional discount code (20% off their next purchase) actually causes them to buy more, or if customers who receive promos were already more likely to purchase anyway.

This simple question will help us explore every concept in causal inference—from potential outcomes to advanced machine learning methods.

1. Introduction: The Causal Question

Every day, businesses and researchers face questions that go beyond prediction:

  • Does our marketing campaign cause more sales?
  • Would a price change lead to higher revenue?
  • Does a new drug improve patient outcomes?

These are causal questions—they ask about the effect of an action or intervention. Unlike predictive modeling (which answers "what will happen?"), causal inference answers "what would happen if?"

🎯 Our Question: Does offering a 20% discount promo code cause customers to make more purchases?

2. Why Correlation ≠ Causation

You've heard this phrase a thousand times, but let's see why it matters with our discount promo example:

Scenario:

• Customers who received 20% off promo: 45% made a purchase

• Customers who didn't receive promo: 20% made a purchase

Should you conclude that the promo caused a 25 percentage point increase in purchases? Not necessarily.

The problem: the marketing team might have selected which customers to send the promo to. Perhaps they targeted customers who:

  • Previously made many purchases (loyal customers)
  • Recently browsed the website or added items to cart
  • Have higher average order values

These customers might have purchased anyway, even without the discount. The observed difference (45% vs 20%) reflects both the causal effect of the promo and pre-existing differences between groups.

⚠️ The Core Problem:

Correlation measures association. Causation requires us to compare what actually happened with what would have happened under a different scenario—for the same individuals.

The Promo Targeting Paradox (Simpson's Paradox)

Let's dig deeper with a concrete numerical example showing how targeting creates misleading correlations:

Scenario: Two Customer Segments

Loyal Customers (80% of promo recipients):

  • Purchase rate with promo: 50%
  • Purchase rate without promo: 40%
  • True causal effect: +10 percentage points

New Customers (20% of promo recipients):

  • Purchase rate with promo: 30%
  • Purchase rate without promo: 10%
  • True causal effect: +20 percentage points

The Paradox:

Overall observed rates: roughly 45% (with promo) vs. 20% (without promo) = a gap of about 25 percentage points

But the true causal effect is only 10-20 points within each group!

The extra 5-15 points come from selection bias: the marketing team sent more promos to loyal customers who already had a 40% baseline purchase rate, versus 10% for new customers.

This is Simpson's Paradox in action: an aggregate correlation can reverse or exaggerate when we account for subgroups. The observed difference conflates the causal effect with compositional differences between groups.
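
The compositional arithmetic can be checked in a few lines. The promo group's 80/20 loyal/new split comes from the text; the control group's one-third/two-thirds split is my assumption, chosen so its overall rate reproduces the ~20% control purchase rate quoted above:

```python
# Reproducing the promo-targeting numbers. The promo group's mix is from the
# text; the control group's mix is an ASSUMED composition for illustration.
rates = {"loyal": {"promo": 0.50, "no_promo": 0.40},
         "new":   {"promo": 0.30, "no_promo": 0.10}}
promo_mix   = {"loyal": 0.8, "new": 0.2}       # stated in the text
control_mix = {"loyal": 1 / 3, "new": 2 / 3}   # assumption: control skews new

observed_promo   = sum(promo_mix[g] * rates[g]["promo"] for g in rates)
observed_control = sum(control_mix[g] * rates[g]["no_promo"] for g in rates)

print(f"aggregate gap: {observed_promo - observed_control:.1%}")
for g in rates:
    print(f"within-group effect ({g}): {rates[g]['promo'] - rates[g]['no_promo']:.0%}")
```

Under these mixes the aggregate gap comes out near 26 points, even though no group's causal effect exceeds 20 points: the excess is pure composition.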

3. The Potential Outcomes Framework

The Potential Outcomes Framework (also called the Rubin Causal Model) provides a formal way to think about causality. Developed by Donald Rubin in the 1970s, it's built on a simple but profound idea:

Core Idea:

Every individual has multiple potential outcomes, depending on which treatment they receive. We observe only one outcome (the one corresponding to the treatment they actually received), but the other outcomes exist conceptually.

Notation:

For each customer i:

  • Wi: Treatment indicator (1 = received 20% off promo, 0 = no promo)
  • Yi(1): Potential outcome if customer i receives the promo
  • Yi(0): Potential outcome if customer i doesn't receive the promo
  • Yi: Observed outcome = Yi(Wi)

Example: Customer Alice

• YAlice(1) = 1 (would purchase if she receives 20% off promo)

• YAlice(0) = 0 (would not purchase without the promo)

If Alice actually receives the promo (WAlice = 1), we observe YAlice = 1.

The causal effect of the promo on Alice is: YAlice(1) - YAlice(0) = 1 - 0 = 1

Visual Example: Five Customers

Here's a table showing potential outcomes for 5 customers. The ✓ marks which outcome we actually observe:

Customer | Y(1) (with promo) | Y(0) (no promo) | Treatment Effect τ | Received Promo? | Observed Outcome
---------|-------------------|-----------------|--------------------|-----------------|------------------
Alice    | 1 ✓               | 0 ?             | +1                 | Yes             | 1
Bob      | 1 ?               | 1 ✓             | 0                  | No              | 1
Carol    | 1 ✓               | 0 ?             | +1                 | Yes             | 1
David    | 0 ?               | 0 ✓             | 0                  | No              | 0
Emma     | 1 ?               | 0 ✓             | +1                 | No              | 0

Notice: We see the checkmarked outcomes, but the counterfactuals (marked with ?) are forever hidden. Bob would purchase even without a promo (τ=0), while Alice and Carol only purchase with the discount (τ=1). The true ATE = (1+0+1+0+1)/5 = 0.6, but we can't compute this directly from observations!
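
Because a simulation is the one setting where both potential outcomes are known, we can compute the true ATE from the table and compare it with the naive difference in observed means, which is all real data would give us:

```python
# The five-customer table as data: (Y(1), Y(0), received_promo) per customer.
customers = {
    "Alice": (1, 0, True),
    "Bob":   (1, 1, False),
    "Carol": (1, 0, True),
    "David": (0, 0, False),
    "Emma":  (1, 0, False),
}

# True ATE: average of individual effects (needs BOTH potential outcomes).
true_ate = sum(y1 - y0 for y1, y0, _ in customers.values()) / len(customers)

# Naive estimate: difference in observed means between the two groups.
treated = [y1 for y1, y0, w in customers.values() if w]
control = [y0 for y1, y0, w in customers.values() if not w]
naive = sum(treated) / len(treated) - sum(control) / len(control)

print(f"true ATE = {true_ate:.2f}, naive difference = {naive:.2f}")  # 0.60 vs 0.67
```

The naive comparison overshoots here because who received the promo is not random with respect to the potential outcomes.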

4. The Fundamental Problem of Causal Inference

Here's the catch: for any individual, we can never observe both potential outcomes. This is called the Fundamental Problem of Causal Inference.

The Fundamental Problem:

We observe Yi(1) or Yi(0), but never both. One potential outcome is always counterfactual—it describes what would have happened in an alternate reality.

If Alice received the 20% off promo, we'll never know what she would have done without it. This missing outcome is the counterfactual.

This is why causal inference is challenging: we need to estimate something that is fundamentally unobservable at the individual level. However, we can estimate average causal effects across groups.

5. Defining Causal Effects

5.1 Average Treatment Effect (ATE)

The Average Treatment Effect is the average difference in outcomes if everyone received the treatment versus if no one received it:

ATE = E[Yi(1) - Yi(0)] = E[Yi(1)] - E[Yi(0)]

In our example: "On average, what is the effect of offering a 20% off promo to a randomly selected customer?"

5.2 Average Treatment Effect on the Treated (ATT)

The ATT is the average effect for those who actually received the treatment:

ATT = E[Yi(1) - Yi(0) | Wi = 1]

In our example: "What is the effect of the 20% off promo among customers who actually received it?"

This matters when treatment effects differ across people, and the treated group isn't randomly selected.

5.3 Conditional Average Treatment Effect (CATE)

The CATE is the average treatment effect for a specific subgroup with characteristics X:

CATE(x) = E[Yi(1) - Yi(0) | Xi = x]

In our example: "What is the effect of the 20% off promo for high-value customers vs. new customers?"

Understanding treatment effect heterogeneity is crucial for personalization and targeting.
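
A hypothetical simulation (all numbers invented for illustration) makes the three estimands concrete: with segment-specific effects and targeted assignment, the ATT drifts away from the ATE toward the CATE of whichever segment is treated most often:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
loyal = rng.random(n) < 0.5               # X: customer segment (illustrative)
tau = np.where(loyal, 0.10, 0.20)         # true per-customer effect by segment
p_treat = np.where(loyal, 0.8, 0.2)       # targeting: loyal customers get promos more
w = rng.random(n) < p_treat               # non-random treatment assignment

ate = tau.mean()                          # effect on a randomly chosen customer
att = tau[w].mean()                       # effect on those actually treated
cate = {"loyal": tau[loyal].mean(), "new": tau[~loyal].mean()}
print(f"ATE ≈ {ate:.2f}  ATT ≈ {att:.2f}  CATE = {cate}")
```

Because roughly 80% of treated customers are loyal (CATE 0.10), the ATT lands near 0.12, below the population ATE of about 0.15.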

5.4 Key Assumptions: SUTVA

Before we can meaningfully define and estimate causal effects, we need a critical assumption called SUTVA (Stable Unit Treatment Value Assumption):

SUTVA has two components:

  1. No interference between units: One customer receiving a promo doesn't affect another customer's purchase decision. Formally: Yi(d1, d2, ..., dn) = Yi(di) — customer i's outcome depends only on their own treatment.
  2. No hidden variations of treatment: All "20% off promos" are identical. There's only one version of treatment and one version of control, so we can write Yi(1) and Yi(0) unambiguously.

Why SUTVA Matters for Our Example:

Potential violations in e-commerce:

  • Interference: Customers might share promo codes with friends/family, or discuss purchases on social media, affecting others' decisions. Network effects violate the no-interference assumption.
  • Treatment variations: If some customers receive the promo via email, others via SMS, and others via app notification, we have multiple "versions" of treatment, violating consistency.
  • Market effects: At scale, widespread discounts might change prices or inventory availability, creating spillovers.

For this series, we'll generally assume SUTVA holds, but keep these potential violations in mind when applying methods to real-world problems.

6. Randomized Controlled Trials: The Gold Standard

How do we solve the fundamental problem? The gold standard is a Randomized Controlled Trial (RCT):

🎲 Randomization:

Randomly assign customers to receive the 20% off promo or not. Each customer has the same probability of receiving the promo, regardless of their characteristics.

Why it works:

Randomization ensures that, on average, the treated and control groups are identical in all characteristics—both observed (age, past purchases) and unobserved (motivation, preferences). Any difference in outcomes can be attributed to the promo.

Mathematically, under randomization:

E[Yi(1) | Wi = 1] = E[Yi(1)]
E[Yi(0) | Wi = 0] = E[Yi(0)]

So we can estimate ATE by comparing average outcomes between groups:

ATE = E[Yi | Wi = 1] - E[Yi | Wi = 0]
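
A quick simulation (numbers invented) confirms the identity: when assignment is a fair coin flip, the simple difference in observed means lands on the true ATE:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
y0 = (rng.random(n) < 0.20).astype(float)    # would purchase anyway (20% baseline)
converted = rng.random(n) < 0.10             # promo converts an extra 10% of customers
y1 = np.maximum(y0, converted.astype(float)) # potential outcome with promo
true_ate = (y1 - y0).mean()                  # knowable only inside a simulation

w = rng.random(n) < 0.5                      # randomization: fair coin per customer
observed = np.where(w, y1, y0)               # we observe only one outcome each
estimate = observed[w].mean() - observed[~w].mean()
print(f"true ATE ≈ {true_ate:.3f}, diff-in-means ≈ {estimate:.3f}")
```

Rerunning with a biased assignment rule (e.g., treating mostly the high-baseline customers) breaks this equality, which is exactly the selection-bias problem of Section 7.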

Practical RCT Challenges in E-commerce

Running an RCT isn't always straightforward. Real-world challenges include:

  • Sample size requirements: To detect a 5 percentage point increase in purchase rate (20% → 25%) with 80% power and 5% significance, you'd need roughly 1,100 customers per group (about 2,200 total) under the standard two-proportion formula. Smaller effects require even larger samples.
  • Spillover effects: What if customers share promo codes on deal-sharing websites? Control group customers might find and use the codes, creating non-compliance. Or social media posts might create hype, violating SUTVA.
  • Duration and timing: How long should the experiment run? Seasonality (holidays, payday cycles) might create confounding. Running too short misses delayed effects; too long increases costs.
  • Ethical considerations: Is it fair to randomly deny discounts to some customers? What if some customers really need the discount? This is less serious for promos than medical trials, but still matters for customer satisfaction and equity.
  • Revenue costs: Giving random discounts means losing margin from customers who would have paid full price. Finance teams may resist purely experimental promos.
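
The sample-size arithmetic above can be sketched with the standard pooled two-proportion z-test formula. This is one common textbook version (no continuity correction); power libraries such as statsmodels use close variants, so treat the result as a ballpark:

```python
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-proportion z-test (pooled variance)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return num / (p1 - p2) ** 2

print(round(n_per_group(0.20, 0.25)))   # ~1,094 customers per group
```

Halving the detectable effect to 2.5 points roughly quadruples the required sample, which is why small-lift promo tests get expensive fast.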

7. Selection Bias and Observational Data

In practice, randomization isn't always possible. Maybe:

  • The marketing team already sent promos to selected customers (observational data)
  • Randomization is too expensive (giving discounts to everyone costs money)
  • You need to analyze historical promotional campaigns

When treatment assignment is not random, we face selection bias:

Selection Bias:

The difference in observed outcomes reflects not just the causal effect, but also pre-existing differences between groups:

E[Yi | Wi = 1] - E[Yi | Wi = 0]
= ATT + Selection Bias

where Selection Bias = E[Yi(0) | Wi = 1] - E[Yi(0) | Wi = 0] is the baseline difference between the two groups. (When treatment effects are heterogeneous, a further term separates ATT from ATE; either way, the naive comparison is biased.)
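
A simulation (all rates invented) shows the decomposition holding exactly: the naive difference in means equals the causal effect on the treated plus the baseline gap E[Y(0)|W=1] − E[Y(0)|W=0]:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
loyal = rng.random(n) < 0.5
y0 = (rng.random(n) < np.where(loyal, 0.40, 0.10)).astype(float)  # baseline purchase
y1 = np.maximum(y0, (rng.random(n) < 0.10).astype(float))         # promo converts some non-buyers
w = rng.random(n) < np.where(loyal, 0.8, 0.2)                     # targeted, NOT random

naive = y1[w].mean() - y0[~w].mean()    # what a naive group comparison reports
att = (y1 - y0)[w].mean()               # causal effect on the treated
bias = y0[w].mean() - y0[~w].mean()     # baseline difference between the groups
print(f"naive = {naive:.3f} = ATT ({att:.3f}) + selection bias ({bias:.3f})")
```

In this setup the selection-bias term dwarfs the true effect on the treated, mirroring the promo-targeting story from Section 2.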

The entire field of causal inference for observational data is about developing methods to remove or adjust for selection bias. We'll explore these methods in subsequent weeks:

Preview: Identification Strategies for Observational Data

Different methods work in different scenarios. Here's a preview of what's coming:

  • Matching & Propensity Scores (Week 3): When you observe all confounders.
    Example: Control for customer age, past purchases, browsing history to make treated/control groups comparable.
  • Regression & Difference-in-Differences (Week 4): When you have panel data (same units over time) or can control for time-invariant confounders.
    Example: Compare how purchases changed for customers who received promos vs. those who didn't, before and after promo rollout.
  • Instrumental Variables (Week 4): When you have a source of quasi-random variation that affects treatment but not outcomes directly.
    Example: Use random email delivery delays as an instrument—affects who sees the promo but doesn't directly affect purchases.
  • Regression Discontinuity (Week 4): When treatment has a sharp threshold.
    Example: Promos sent only to customers with cart value above $100—compare customers just above vs. just below threshold.
  • Double Machine Learning (Week 5): When you have many confounders (high-dimensional X) and need flexible models to control for them.
    Example: Control for hundreds of features (browsing patterns, demographics, preferences) using ML while still getting valid causal estimates.
  • Causal Forests & Meta-Learners (Weeks 6-7): When treatment effects are heterogeneous and you want to personalize—who benefits most from promos?
    Example: Discover that price-sensitive customers respond strongly to promos, while brand-loyal customers don't need them.

Each method makes different assumptions and is suited to different data structures. Choosing the right method requires understanding your causal graph (Week 2) and data availability.

8. Key Takeaways

Causal questions ask "what would happen if?" not just "what will happen?"

Correlation ≠ causation because observed differences may reflect selection bias

Potential outcomes: each unit has multiple potential outcomes Yi(1), Yi(0), but we observe only one

Fundamental problem: counterfactuals are unobservable at the individual level

ATE, ATT, CATE: different ways to define average causal effects

Randomization solves the identification problem by making treated/control groups comparable

Selection bias arises in observational data when treatment assignment is not random

9. Next Week Preview

Now that we understand potential outcomes, we need a way to represent and reason about complex causal relationships. Next week, we'll learn about Causal Graphs and DAGs:

  • How to draw and interpret causal diagrams
  • Confounders, mediators, and colliders
  • d-separation and the backdoor criterion
  • When can we identify causal effects from observational data?

We'll continue using our promotional discount example to see how DAGs help us think about which variables to control for.

Business Case Study: Interview Approach

📊 Case: Subscription Price Change at StreamFlix

Context: You're a data scientist at StreamFlix, a video streaming service. The product team wants to increase the monthly subscription price from $9.99 to $12.99 to improve revenue. However, the CFO is concerned about churn.

The analytics team pulls historical data and finds that customers paying $12.99 (on a legacy premium plan) have a 15% annual churn rate, while customers paying $9.99 have a 25% annual churn rate. The CFO suggests: "The data shows higher prices lead to lower churn—let's raise prices!"

Question: How would you respond to the CFO's interpretation? What analysis would you propose to estimate the causal effect of a price increase on churn?

Step 1: Clarifying Questions to Ask

Before proposing a solution, ask these questions:

  • Data availability: Do we have historical price changes we can study? Can we access customer characteristics (viewing hours, tenure, payment method)?
  • Selection mechanism: How did customers end up on the $12.99 vs $9.99 plan? Was it self-selected? Based on features? Geographic differences?
  • Timeline: How quickly do we need an answer? Can we run an experiment, or must we use historical data?
  • Business constraints: Is the company willing to randomize prices? What's the acceptable risk of revenue loss?
  • Population of interest: Are we estimating the effect on all current customers (ATE) or specifically on $9.99 customers who would receive the increase (ATT)?
  • Outcome definition: How do we define churn? Cancellation vs. downgrade? Time window?

Step 2: Diagnose the Problem (Potential Outcomes Lens)

Key insight: The CFO is confusing correlation with causation.

Potential Outcomes Framework:

For each customer i, define:

  • Yi(0) = churn probability if price stays at $9.99
  • Yi(1) = churn probability if price increases to $12.99
  • τi = Yi(1) - Yi(0) = individual causal effect

Why the observed difference is misleading:

The 10-point gap (25% - 15% = 10pp) reflects:

Observed difference = Causal effect + Selection bias

Selection bias sources:

  • $12.99 customers may be early adopters (more loyal, higher engagement)
  • They might have higher incomes (less price-sensitive)
  • They may value premium features more (higher perceived value)
  • Could be in different geographic markets with different competition

These customers likely would have lower churn even at $9.99 (i.e., E[Y(0) | Price=$12.99] ≠ E[Y(0) | Price=$9.99]).

Step 3: Proposed Methods & Tradeoffs

Approach 1: Randomized Controlled Trial (RCT) ⭐ Gold Standard

Method: Randomly assign a subset of $9.99 customers to receive the $12.99 price; keep others at $9.99 as control. Measure churn over 3-6 months.

Pros:

  • Eliminates selection bias—randomization ensures treated/control groups are comparable
  • Clean causal interpretation: observed difference = ATE
  • No need to worry about unobserved confounders

Cons:

  • Revenue risk: Could lose money if price increase backfires during experiment
  • Customer experience: Some customers pay different prices (fairness concerns, potential PR issues)
  • Time: Need 3-6 months to observe churn patterns
  • Sample size: Need sufficient power (likely ~10K customers to detect 5pp churn difference)

Approach 2: Propensity Score Matching (Week 3 preview)

Method: If we have observational data on past price changes, match $12.99 customers with similar $9.99 customers based on characteristics (tenure, viewing hours, demographics), then compare churn.

Pros:

  • Uses existing data—no need to run experiment
  • Faster results than RCT
  • No revenue risk from experimenting

Cons:

  • Requires assumption: All confounders are observed (unconfoundedness)
  • May still have bias from unmeasured factors (e.g., customer motivation)
  • Requires historical price variation to exist
  • More complex to explain to stakeholders

→ We'll cover matching in Week 3

Approach 3: Regression Discontinuity (Week 4 preview)

Method: If price changes occurred based on a threshold (e.g., all customers who signed up after Jan 1 get $12.99), compare churn for customers just before vs. just after the cutoff.

Pros:

  • Local randomization around cutoff provides causal identification
  • No need to measure/control for confounders
  • Intuitive—customers near threshold are similar

Cons:

  • Only works if a sharp threshold exists
  • Estimates local effect (at threshold), may not generalize to all customers
  • Requires sufficient data near cutoff

→ We'll cover RD in Week 4

Approach 4: Difference-in-Differences (Week 4 preview)

Method: If price increased in one region but not another, compare how churn changed in the treatment region vs. control region before/after the price change.

Pros:

  • Controls for time-invariant differences between regions
  • Uses observational data from natural policy variation
  • Addresses some omitted variable bias

Cons:

  • Requires parallel trends assumption (regions would've evolved similarly without treatment)
  • Need pre-treatment data to test/validate assumption
  • Vulnerable to time-varying confounders

→ We'll cover DiD in Week 4

Step 4: Recommendation & Implementation Plan

Recommended approach: Staged RCT with risk mitigation

Phase 1: Small-scale pilot (2 weeks)

  • Randomize 5% of customers (treatment: 2.5%, control: 2.5%)
  • Monitor early indicators: cancellation attempts, customer service contacts
  • If catastrophic churn (>40%), abort quickly

Phase 2: Full experiment (3 months)

  • Expand to 20% of customer base (10% treatment, 10% control)
  • Track primary outcome: churn rate at 30/60/90 days
  • Track secondary outcomes: viewing hours, customer satisfaction, lifetime value

Analysis plan:

  • Estimate ATE: Compare churn rates between treatment and control
  • Estimate CATE: Examine heterogeneity by customer segments (tenure, engagement level, region)
  • Revenue model: Net impact = (Revenue gain from higher price) - (Revenue loss from incremental churn)
  • Confidence intervals: Report 95% CI to quantify uncertainty

Decision framework:

  • If causal effect < 8pp churn increase: Price increase is profitable (rough breakeven threshold)
  • If 8-12pp: Proceed cautiously; target low-churn segments only
  • If >12pp: Don't implement; explore alternative revenue strategies

Communication to CFO: "The 10-point churn difference we observe likely overstates the true causal effect because $12.99 customers differ systematically. We need an RCT to isolate the true price elasticity. I recommend a staged rollout to measure the effect while limiting downside risk."

Step 5: Common Pitfalls to Avoid

  • Pitfall 1: Naive regression without controls
    Simply regressing churn on price without controlling for confounders will yield biased estimates due to omitted variable bias.
  • Pitfall 2: Cherry-picking comparable customers
    Manually selecting "similar" customers without a principled matching method introduces researcher bias.
  • Pitfall 3: Ignoring SUTVA violations
    If customers talk to each other about prices (social media, family plans), interference violates SUTVA and complicates causal inference.
  • Pitfall 4: Confusing ATE with ATT
    If we only care about the effect on current $9.99 customers (who'd receive the increase), we want ATT, not ATE on all customers.
  • Pitfall 5: Short-term focus
    Churn effects may be delayed (customers churn when cards are re-charged). Need sufficient follow-up time.
  • Pitfall 6: Ignoring treatment effect heterogeneity
    The ATE may mask important variation—some segments might tolerate price increases while others are very price-sensitive. CATE analysis is crucial for targeted strategies.

Advanced Quiz: Test Your Understanding

Challenge yourself with these advanced questions covering the Potential Outcomes Framework. Click on each question to reveal the answer.

Part 1: Conceptual Foundations (40 points)

Question 1 (8 points): Office Hours and Exam Scores

A researcher observes that students who attend office hours score 15 points higher on exams on average than those who don't.

a) Why can't we conclude that attending office hours causes higher exam scores?
b) Describe two specific confounding variables that could explain this correlation, and explain the mechanism through which each operates.

Show Answer

a) We can't conclude causation because students who attend office hours likely differ systematically from those who don't—this is a classic case of selection bias. The 15-point difference could reflect pre-existing differences between groups rather than the causal effect of office hours.

b) Two confounding variables:

  • Student motivation: More motivated students are both more likely to attend office hours AND more likely to study harder independently, leading to higher scores. The mechanism: motivation → office hours attendance; motivation → increased studying → higher scores.
  • Prior knowledge/academic preparation: Students with stronger foundational knowledge are more likely to recognize when they need help (and seek office hours) and also perform better on exams due to their existing knowledge base. The mechanism: prior knowledge → recognizing gaps → office hours; prior knowledge → better exam performance.

Question 2 (10 points): Rubin Causal Model Fundamentals

In the Rubin Causal Model, define the individual treatment effect for unit i using potential outcomes notation. Then explain why we call the impossibility of observing this quantity the "fundamental problem of causal inference." Finally, how does the Average Treatment Effect (ATE) relate to individual treatment effects, and why is estimating the ATE potentially easier despite the fundamental problem?

Show Answer

Individual Treatment Effect (ITE): τᵢ = Yᵢ(1) - Yᵢ(0)

Fundamental Problem: For any individual unit i, we observe either Yᵢ(1) OR Yᵢ(0), but never both simultaneously. Once treatment is assigned, one potential outcome becomes factual (observed) and the other remains counterfactual (unobserved). We cannot observe the ITE for any specific individual because it requires knowledge of both potential outcomes.

Relationship to ATE: ATE = E[Yᵢ(1) - Yᵢ(0)] = E[τᵢ] = the average of all individual treatment effects across the population.

Why ATE is easier: While we can't observe individual treatment effects, we can estimate the average by comparing groups. Under randomization, E[Yᵢ | Dᵢ=1] estimates E[Yᵢ(1)] and E[Yᵢ | Dᵢ=0] estimates E[Yᵢ(0)], so we can identify ATE = E[Yᵢ | Dᵢ=1] - E[Yᵢ | Dᵢ=0] even though we never observe both potential outcomes for the same individual.

Question 3 (12 points): Job Training Program Analysis

Consider a job training program with the following potential outcomes:

Individual | Y(1) | Y(0)
-----------|------|------
Person A   | $45K | $35K
Person B   | $50K | $48K
Person C   | $38K | $40K

a) Calculate the individual treatment effect for each person.
b) Calculate the ATE.
c) Person C has a negative individual treatment effect. Does this violate any assumptions of the potential outcomes framework? Why or why not?
d) If we could only observe Person A in treatment and Persons B and C in control, what would our naive estimate of the ATE be? Why does this differ from the true ATE?

Show Answer

a) Individual Treatment Effects:

  • Person A: τₐ = $45K - $35K = $10K
  • Person B: τᵦ = $50K - $48K = $2K
  • Person C: τᴄ = $38K - $40K = -$2K

b) ATE: ATE = (10 + 2 + (-2))/3 = $10K/3 ≈ $3,333

c) No, this does NOT violate any assumptions. The potential outcomes framework makes no assumption that treatment effects must be positive or homogeneous. Treatment effect heterogeneity (including negative effects for some individuals) is perfectly compatible with the framework. In reality, many interventions help some people while harming others.

d) Naive estimate: We'd compare Person A's observed outcome ($45K) to the average of B and C's observed outcomes (($48K + $40K)/2 = $44K), giving us $45K - $44K = $1K.

This differs from the true ATE ($3,333) because of selection bias. Person A has better potential outcomes than the average (both Y(1) and Y(0) are relatively good), while B and C are a mix. The naive comparison confounds the treatment effect with baseline differences between Person A and the control group. Specifically: E[Y(0)|D=1] = $35K ≠ E[Y(0)|D=0] = $44K.
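
The arithmetic in parts (a), (b), and (d), as a sanity check:

```python
# (Y(1), Y(0)) in $K for each person, taken from the table in the question.
y = {"A": (45, 35), "B": (50, 48), "C": (38, 40)}

ite = {k: y1 - y0 for k, (y1, y0) in y.items()}    # individual treatment effects
ate = sum(ite.values()) / len(ite)                 # (10 + 2 - 2) / 3
naive = y["A"][0] - (y["B"][1] + y["C"][1]) / 2    # A treated; B, C control

print(ite)            # {'A': 10, 'B': 2, 'C': -2}
print(round(ate, 2))  # 3.33
print(naive)          # 1.0
```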

Question 4 (10 points): SUTVA and Violations

Explain the Stable Unit Treatment Value Assumption (SUTVA) and its two components. Then provide a concrete example from educational interventions where SUTVA would be violated, explaining specifically which component is violated and why this matters for causal inference.

Show Answer

SUTVA has two components:

  1. No interference: One unit's treatment doesn't affect another unit's outcome. Formally: Yᵢ(d₁, d₂, ..., dₙ) = Yᵢ(dᵢ) — unit i's outcome depends only on their own treatment, not others' treatments.
  2. No hidden variations of treatment: There's only one version of each treatment level. For any unit, there's only one Y(1) and one Y(0), not multiple versions of the treated/control state.

Educational example violating SUTVA: Consider randomly assigning students in the same classroom to receive tutoring (treatment) or not (control).

Violation: This violates the "no interference" component through peer effects/spillovers. Students who receive tutoring might help their untreated classmates (positive spillover), or treated students might consume more teacher time, leaving less for control students (negative spillover).

Why it matters: If student i's outcome depends on both their own treatment AND how many classmates received tutoring, then we can't write Yᵢ as simply Yᵢ(1) or Yᵢ(0). The potential outcomes are not well-defined. Our estimate of the treatment effect will confound the direct effect of tutoring with the spillover effects, making causal interpretation unclear. We'd be comparing treated students (who benefit from tutoring) to control students (who may be helped OR harmed by being in a partially-treated classroom), rather than a clean counterfactual comparison.

Part 2: Treatment Effects & Heterogeneity (30 points)

Question 5 (12 points): Treatment Effect Heterogeneity

A pharmaceutical company tests a new drug and finds:

  • ATE = 5 mmHg reduction in blood pressure
  • CATE for patients over 65 = 8 mmHg reduction
  • CATE for patients under 65 = 3 mmHg reduction

a) Is it mathematically possible for 80% of individual patients to experience no benefit from the drug while still observing these average effects? Explain your reasoning with a simple numerical example.
b) Why might policymakers care more about CATE than ATE in this scenario?
c) What key assumption must hold for us to interpret these CATEs causally rather than merely as correlational subgroup differences?

Show Answer

a) Yes, absolutely possible. Average effects can mask extreme heterogeneity.

Simple example: Suppose we have 100 patients total.

  • 80 patients: treatment effect = 0 mmHg (no benefit)
  • 20 patients: treatment effect = -25 mmHg (large benefit)

ATE = (80×0 + 20×(-25))/100 = -500/100 = -5 mmHg reduction ✓

This demonstrates that averages conceal individual-level variation. The drug could be completely ineffective for most people but highly effective for a responsive subgroup, still yielding the observed ATE.
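
The averaging step, spelled out:

```python
# 80 patients with no effect, 20 with a 25 mmHg reduction (illustrative numbers).
effects = [0.0] * 80 + [-25.0] * 20
ate = sum(effects) / len(effects)
print(ate)   # -5.0  (i.e., a 5 mmHg average reduction)
```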

b) Policymakers care about CATE because:

  • Resource allocation: If over-65 patients benefit more (8 vs 3 mmHg), we might prioritize coverage/subsidies for elderly patients.
  • Cost-effectiveness: The drug may be cost-effective for older patients but not younger ones.
  • Personalized medicine: Understanding who benefits allows targeted treatment rather than one-size-fits-all.
  • Equity concerns: If the drug works better for one demographic group, we need to understand whether to target that group or invest in alternatives for others.

c) Key assumption: Conditional independence (or ignorability/unconfoundedness)

Formally: (Y(1), Y(0)) ⊥ D | X, where X is the subgroup variable (age).

This means that within each age group, treatment assignment is "as-if random"—there are no unmeasured confounders that affect both treatment assignment and outcomes within that age group. If older patients who receive the drug differ systematically from older patients who don't (beyond just age), the CATE for over-65 may be biased and shouldn't be interpreted causally.

Question 6 (10 points): ATT vs ATE

Define the Average Treatment Effect on the Treated (ATT) formally using potential outcomes notation. Then explain why ATT ≠ ATE when there is selection bias, and provide a concrete scenario where we would specifically care about estimating ATT rather than ATE.

Show Answer

Formal definition:

ATT = E[Y(1) - Y(0) | D=1] = E[Y(1) | D=1] - E[Y(0) | D=1]

The average treatment effect for those who actually received treatment.

Why ATT ≠ ATE under selection bias:

ATE = E[Y(1) - Y(0)] averages over the entire population, while ATT averages only over the treated subpopulation. When there's selection bias, people who select into treatment differ from the overall population in their potential outcomes.

Mathematically: If E[Y(0) | D=1] ≠ E[Y(0)] or E[Y(1) | D=1] ≠ E[Y(1)], then ATT ≠ ATE.

Intuitively: Those who receive treatment may be systematically different (more motivated, sicker, wealthier, etc.), and if treatment effects vary with these characteristics, the effect on the treated differs from the effect on a random person.

Scenario where ATT matters more than ATE:

Example: Job training program evaluation

A voluntary job training program was offered to unemployed workers, and 30% enrolled. We want to know if we should continue funding the program.

Why ATT is the right estimand:

  • The policy-relevant question is: "Did the program help the people who actually participated?" not "Would it help a randomly selected person?"
  • Since participation is voluntary, we can't force everyone to participate. The ATE (effect on the full population including those who'd never enroll) is not actionable.
  • Those who enrolled likely differ from non-enrollees (perhaps more motivated, or more desperate). The program's effect on enrollees is what matters for judging its value.
  • For cost-benefit analysis: We spent money on participants, so we need to know the causal effect on them specifically.

Question 7 (8 points): RCTs and Causation

Critics sometimes argue that even randomized experiments don't establish causation because "correlation doesn't imply causation." Write a careful response to this criticism that explains: (a) what problem randomization solves, and (b) under what conditions the correlation observed in an RCT can be interpreted causally.

Show Answer

Response to criticism:

The maxim "correlation doesn't imply causation" is correct, but it applies to observational correlations where confounding can occur. RCTs are specifically designed to address this problem.

a) What problem randomization solves:

Randomization solves the confounding problem (selection bias). In observational data, the correlation between treatment D and outcome Y may arise because:

  • D causes Y (causal effect) — what we want to measure
  • Some confounder X causes both D and Y (confounding)
  • Y causes D (reverse causation)

Randomization breaks the link between confounders and treatment assignment. By randomly assigning D, we ensure that:

  • Treatment groups are balanced on all covariates (observed and unobserved) in expectation
  • E[Y(0) | D=1] = E[Y(0) | D=0] — no selection bias
  • Any systematic difference in outcomes must be due to the treatment itself, not pre-existing differences

b) Conditions for causal interpretation:

The correlation observed in an RCT can be interpreted causally when:

  1. Randomization is properly implemented: Treatment assignment is truly random and independent of all potential confounders.
  2. SUTVA holds: No interference between units, and treatment is consistently defined.
  3. No post-randomization confounding: We measure outcomes before they can be affected by other factors that might differ between groups.
  4. Compliance: Units actually receive their assigned treatment (or we use appropriate methods like instrumental variables for non-compliance).
  5. No attrition bias: Outcome data is available for both groups, or missingness is unrelated to potential outcomes.

Under these conditions, the correlation is not spurious—it reflects the causal effect because randomization has eliminated all alternative explanations.

Part 3: RCTs and Their Limitations (30 points)

Question 8 (12 points): Perfect RCT with Infinite Sample

In a perfectly executed RCT with infinite sample size:

a) What is E[Y(0) | D=1]? Explain what this notation means and why randomization determines its value.
b) Why does this property allow us to identify the ATE, even though we never observe the same unit in both treatment and control?
c) Pearl's framework uses do-calculus while Rubin uses potential outcomes. Briefly explain what quantity Pearl's P(Y|do(X=1)) corresponds to in Rubin's framework.

Show Answer

a) E[Y(0) | D=1] in a perfect RCT:

What it means: This is the expected value of the control potential outcome for units assigned to treatment. It represents "what would have happened to the treated units if they hadn't been treated" (the counterfactual for the treated group).

Value under randomization: E[Y(0) | D=1] = E[Y(0)] = E[Y(0) | D=0]

Why randomization determines this: because treatment assignment is independent of the potential outcomes, (Y(0), Y(1)) ⊥ D, the distribution of potential outcomes is identical across treatment and control groups. Units assigned to treatment have the same expected Y(0) as units assigned to control, because assignment was random and unrelated to any unit characteristics. With infinite sample size, any random imbalances vanish, and this equality holds exactly.

b) Why this identifies ATE:

We want ATE = E[Y(1)] - E[Y(0)], but we only observe:

  • E[Y | D=1] for treated units (where Y = Y(1) for them)
  • E[Y | D=0] for control units (where Y = Y(0) for them)

Under randomization:

  • E[Y(1) | D=1] = E[Y(1)] (treated units represent the population's Y(1))
  • E[Y(0) | D=0] = E[Y(0)] (control units represent the population's Y(0))

Therefore: ATE = E[Y(1)] - E[Y(0)] = E[Y | D=1] - E[Y | D=0]

Even though we never observe the same individual in both states, randomization ensures that the treated group is a valid stand-in for what everyone's Y(1) would be, and the control group is a valid stand-in for what everyone's Y(0) would be. We identify the population-level causal effect by comparing across randomized groups.
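This identification argument can be illustrated with a quick simulation — a NumPy sketch with made-up numbers (a constant true effect of 2.0 and fair coin-flip assignment), where the simple difference in observed group means recovers the ATE:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical potential outcomes with a true ATE of 2.0.
y0 = rng.normal(loc=10.0, scale=3.0, size=n)
y1 = y0 + 2.0

# Randomized assignment: D is independent of (Y(0), Y(1)).
d = rng.integers(0, 2, size=n)
y = np.where(d == 1, y1, y0)                # observed outcome (one per unit)

diff_in_means = y[d == 1].mean() - y[d == 0].mean()
true_ate = (y1 - y0).mean()
print(f"difference in means = {diff_in_means:.2f}, true ATE = {true_ate:.2f}")
```

No individual contributes both Y(1) and Y(0), yet the group comparison lands on the ATE because each randomized group is a representative stand-in for the whole population.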

c) Pearl's P(Y|do(X=1)) in Rubin's framework:

Pearl's P(Y|do(X=1)) represents the distribution of Y if we intervened to set X=1 for everyone, breaking any causal arrows into X.

In Rubin's framework, this corresponds to the distribution of Y(1), the potential outcome under treatment.

Specifically:

  • P(Y|do(X=1)) ↔ distribution of Y(1)
  • E[Y|do(X=1)] ↔ E[Y(1)]
  • E[Y|do(X=1)] - E[Y|do(X=0)] ↔ E[Y(1)] - E[Y(0)] = ATE

Both frameworks describe the same causal concepts using different mathematical machinery—Pearl uses graphical models and interventions (do-operator), while Rubin uses potential outcomes and counterfactuals.

Question 9 (10 points): External Validity Threats

An RCT finds that a microfinance program increases business profits by $500 annually (p < 0.01, n=5000). List and briefly explain three distinct threats to external validity that might prevent us from generalizing this result to: other geographical contexts, other time periods, and policy implementation at scale.

Show Answer

Three threats to external validity:

1. Population/Sample Heterogeneity (affects geographical generalization):

The study population may differ from target populations in other contexts. For example:

  • The RCT might be conducted in rural India, where business opportunities, credit markets, and cultural factors differ from urban Brazil or rural Kenya.
  • Treatment effects may vary with baseline characteristics (education levels, existing business experience, market saturation, local infrastructure).
  • CATE heterogeneity means the $500 effect observed in one context might not apply to populations with different characteristics.

Why it matters: If the treatment effect is heterogeneous and the study sample differs systematically from other contexts, the ATE estimated in one place won't generalize.

2. Time-Varying Contextual Factors (affects temporal generalization):

Economic, institutional, and technological conditions change over time:

  • The RCT conducted in 2010 might not reflect 2025 conditions (e.g., smartphone adoption has increased, enabling mobile banking; COVID-19 changed business dynamics).
  • Market conditions, competition, regulatory environment, and macroeconomic factors (interest rates, inflation) evolve.
  • The novelty effect: early microfinance interventions might work better than later ones as markets become saturated.

Why it matters: Causal effects aren't immutable constants—they depend on context. An intervention effective in one era may not work in another.

3. General Equilibrium Effects and Scale-Up Issues (affects policy implementation):

Small-scale RCT effects may not reflect large-scale policy implementation:

  • Market equilibrium effects: If only 5,000 businesses get microloans, they may thrive. But if the program scales to millions, increased competition might erode profit margins, reducing the per-business effect.
  • Supply constraints: The small RCT might not strain local resources (skilled labor, raw materials), but scaling up could create bottlenecks.
  • Implementation quality: Small RCTs are often implemented with unusual care and monitoring. Large-scale government programs may suffer from lower fidelity, corruption, or bureaucratic inefficiency.
  • Behavioral responses: At scale, competitors, suppliers, or other market actors might change behavior in response to the widespread program.

Why it matters: Partial equilibrium effects (holding everything else constant) estimated in an RCT may not reflect general equilibrium effects when the intervention is deployed broadly and affects market dynamics.

Question 10 (8 points): Natural Experiments

A researcher wants to study whether meditation reduces stress but cannot randomize who meditates. They propose a "natural experiment" where a company randomly assigns employees to offices, and one office happens to have a meditation room.

a) Why is this weaker than a true RCT for establishing causation?
b) What specific assumption would need to hold for this design to yield causal estimates, and how could it be violated?

Show Answer

a) Why it's weaker than a true RCT:

In a true RCT, we would randomize meditation itself (the treatment). Here, we're randomizing office assignment (an instrumental variable), which only affects meditation access, not actual meditation behavior.

Key differences:

  1. Non-compliance: Not everyone with access will meditate (one-sided non-compliance). The "treatment" (meditation) is not randomly assigned—only the opportunity is random.
  2. Self-selection into actual treatment: Among those with access, who meditates is chosen by employees themselves, likely based on stress levels, personality, or other factors correlated with outcomes.
  3. Reduced power: Since office assignment is only probabilistically related to meditation, the effect is diluted (intent-to-treat effect is weaker than actual treatment effect).
  4. Exclusion restriction concerns: Office assignment might affect stress through pathways other than meditation (office quality, natural light, distance from management, peer composition).

b) Key assumption and potential violations:

The critical assumption: Exclusion Restriction

Office assignment (Z) must affect stress (Y) only through meditation (D), not directly.

Formally: Y(d, z) = Y(d, z') for all z, z' — holding meditation behavior fixed, outcomes do not depend on which office you're in.

How this could be violated:

  • Office quality differences: The office with the meditation room might be newer, quieter, have better lighting, or be farther from a noisy area—all directly affecting stress independent of meditation.
  • Peer effects: Being assigned to an office with meditation-oriented colleagues might reduce stress through social support or workplace culture, not meditation itself.
  • Signaling effects: Employees in the "meditation office" might perceive their employer as more caring, reducing stress even if they never meditate.
  • Spatial/proximity effects: Different office locations might mean different commute times, access to amenities (cafeteria, parking), or exposure to workplace stressors (proximity to demanding managers).

If any of these mechanisms operate, then comparing stress levels between offices confounds the meditation effect with these direct effects of office assignment, making causal inference invalid. We'd need to argue convincingly that offices are otherwise identical, which is difficult to ensure in a natural experiment.

Bonus Challenge Question (10 points)

Bonus: Non-Compliance in RCTs

Suppose we have non-compliance in an RCT: some people assigned to treatment don't take it, and some assigned to control do. Let Z denote random assignment and D denote actual treatment received.

a) Why can't we simply compare outcomes between those with D=1 vs. D=0 to get the ATE?
b) What causal estimand can we identify from this design? (Name it and define it informally)
c) What assumption about how assignment affects outcomes is critical for this identification?

Show Answer

a) Why comparing D=1 vs. D=0 doesn't work:

Even though Z (assignment) was randomized, D (actual treatment) is not randomized. Whether someone actually takes treatment depends on their choice/behavior, which may be correlated with potential outcomes.

Selection bias problem: People who take treatment (D=1) differ systematically from those who don't (D=0), even within the same assignment group. For example:

  • Those assigned to treatment who actually take it (compliers) might be more motivated or health-conscious.
  • Those assigned to control who seek out treatment elsewhere (crossovers/always-takers) might be more desperate or resourceful.

Comparing by actual treatment receipt confounds the treatment effect with these selection effects: E[Y | D=1] - E[Y | D=0] ≠ ATE.

This is precisely the observational correlation problem that randomization was meant to solve—but non-compliance re-introduces it.

b) What we CAN identify: Local Average Treatment Effect (LATE) or Complier Average Causal Effect (CACE)

Definition: LATE is the average treatment effect for compliers—individuals whose treatment status is affected by assignment.

More precisely: units who take treatment if assigned to treatment (Z=1) and don't take it if assigned to control (Z=0).

Formally: LATE = E[Y(1) - Y(0) | D(1)=1, D(0)=0]

where D(z) is the potential treatment received under assignment z.

Intuition: We can estimate the effect for the subpopulation whose behavior is actually changed by randomization, not the entire population. This is weaker than ATE but still causally interpretable.

c) Critical assumption: Exclusion Restriction

Assumption: Assignment (Z) affects outcomes (Y) only through actual treatment received (D), not directly.

Formally: Y(d, z) = Y(d, z') for all z, z' — knowing your assignment doesn't affect your outcome once we condition on actual treatment.

Why it's critical:

  • Without this, we can't separate the effect of treatment from the effect of being assigned to treatment.
  • For example, if people assigned to treatment feel psychologically different (placebo/nocebo effects) even if they don't take it, the exclusion restriction is violated.
  • We use assignment Z as an instrumental variable for actual treatment D. The exclusion restriction ensures Z is a valid instrument—it provides variation in D without directly affecting Y.

Additional assumption (for completeness): Monotonicity

No "defiers" — no one does the opposite of their assignment (if assigned to treatment, they avoid it; if assigned to control, they seek it out). This ensures the effect of Z on D goes in one direction.

With the exclusion restriction, monotonicity, and relevance (Z actually shifts treatment uptake, so the denominator is nonzero), we can use the Wald estimator: LATE = [E(Y|Z=1) - E(Y|Z=0)] / [E(D|Z=1) - E(D|Z=0)]
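The Wald logic can be checked on simulated data. Here is a minimal NumPy sketch with made-up compliance shares (20% always-takers, 30% never-takers, 50% compliers) and a constant unit-level effect of 1.0, so LATE = 1.0; the naive D=1 vs. D=0 comparison is biased upward because always-takers have a higher baseline:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Hypothetical compliance types (monotonicity holds: no defiers).
types = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.3, 0.5])
z = rng.integers(0, 2, size=n)              # random assignment Z
d = np.where(types == "always", 1,
             np.where(types == "never", 0, z))  # treatment received D

# Always-takers have a higher baseline (selection into treatment);
# the treatment effect is 1.0 for everyone, so the LATE is 1.0.
y0 = rng.normal(size=n) + 2.0 * (types == "always")
y1 = y0 + 1.0
y = np.where(d == 1, y1, y0)

naive = y[d == 1].mean() - y[d == 0].mean()        # biased by selection
wald = ((y[z == 1].mean() - y[z == 0].mean())      # ITT on Y ...
        / (d[z == 1].mean() - d[z == 0].mean()))   # ... / ITT on D
print(f"naive = {naive:.2f}, Wald = {wald:.2f}")   # naive biased; Wald near 1.0
```

The naive comparison mixes the treatment effect with the always-takers' higher baseline, while the Wald ratio (intent-to-treat effect divided by the compliance rate) isolates the effect for compliers.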