1. Introduction: The Causal Question
Every day, businesses and researchers face questions that go beyond prediction:
- Does our marketing campaign cause more sales?
- Would a price change lead to higher revenue?
- Does a new drug improve patient outcomes?
These are causal questions—they ask about the effect of an action or intervention. Unlike predictive modeling (which answers "what will happen?"), causal inference answers "what would happen if?"
🎯 Our Question: Does offering a 20% discount promo code cause customers to make more purchases?
2. Why Correlation ≠ Causation
You've heard this phrase a thousand times, but let's see why it matters with our discount promo example:
Scenario:
• Customers who received 20% off promo: 45% made a purchase
• Customers who didn't receive promo: 20% made a purchase
Should you conclude that the promo caused a 25 percentage point increase in purchases? Not necessarily.
The problem: the marketing team might have selected which customers to send the promo to. Perhaps they targeted customers who:
- Previously made many purchases (loyal customers)
- Recently browsed the website or added items to cart
- Have higher average order values
These customers might have purchased anyway, even without the discount. The observed difference (45% vs 20%) reflects both the causal effect of the promo and pre-existing differences between groups.
⚠️ The Core Problem:
Correlation measures association. Causation requires us to compare what actually happened with what would have happened under a different scenario—for the same individuals.
The Promo Targeting Paradox (Simpson's Paradox)
Let's dig deeper with a concrete numerical example showing how targeting creates misleading correlations:
Scenario: Two Customer Segments
Loyal Customers (75% of promo recipients):
- Purchase rate with promo: 50%
- Purchase rate without promo: 40%
- True causal effect: +10 percentage points
New Customers (25% of promo recipients):
- Purchase rate with promo: 30%
- Purchase rate without promo: 10%
- True causal effect: +20 percentage points
The Paradox:
Overall observed rates: 45% (with promo) vs. 20% (without promo) = a 25-point gap
But the true causal effect is only 10-20 points within each segment; averaged over promo recipients (75% loyal, 25% new), it is just 12.5 points.
The remaining ~12.5 points come from selection bias: the marketing team sent most promos to loyal customers, who already had a 40% baseline purchase rate, while the no-promo group was mostly new customers with a 10% baseline (roughly two-thirds new, which is what produces the 20% aggregate).
This is Simpson's Paradox in action: an aggregate association can exaggerate, or even reverse, the relationship that holds within each subgroup. The observed difference conflates the causal effect with compositional differences between the groups.
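To make the arithmetic above concrete, here is a minimal Python sketch. The segment shares are assumptions chosen to match the aggregates in this section (promo recipients about 75% loyal, non-recipients about one-third loyal); everything else is the illustrative rates from the text.

```python
# Simpson's paradox with the promo-targeting numbers from this section.
# Segment shares are illustrative assumptions that reproduce the 45% / 20% aggregates.
segments = {
    "loyal": {"share_treated": 0.75, "share_control": 1/3, "rate_promo": 0.50, "rate_no_promo": 0.40},
    "new":   {"share_treated": 0.25, "share_control": 2/3, "rate_promo": 0.30, "rate_no_promo": 0.10},
}

# Aggregate (observed) purchase rates in the promo and no-promo groups
rate_promo = sum(s["share_treated"] * s["rate_promo"] for s in segments.values())
rate_no_promo = sum(s["share_control"] * s["rate_no_promo"] for s in segments.values())
naive_gap = rate_promo - rate_no_promo

# Within-segment causal effects and their average over promo recipients
within = {name: s["rate_promo"] - s["rate_no_promo"] for name, s in segments.items()}
avg_effect_on_recipients = sum(s["share_treated"] * within[name] for name, s in segments.items())

print(f"Observed gap:              {naive_gap:.1%}")                    # ~25 points
print(f"Within-segment effects:    {within}")                           # +10 and +20 points
print(f"Avg effect on promo group: {avg_effect_on_recipients:.1%}")     # ~12.5 points
```

The observed 25-point gap is roughly double the average effect the promo actually has on its recipients; the rest is targeting.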
3. The Potential Outcomes Framework
The Potential Outcomes Framework (also called the Rubin Causal Model) provides a formal way to think about causality. Developed by Donald Rubin in the 1970s, it's built on a simple but profound idea:
Core Idea:
Every individual has multiple potential outcomes, depending on which treatment they receive. We observe only one outcome (the one corresponding to the treatment they actually received), but the other outcomes exist conceptually.
Notation:
For each customer i:
- Wi: Treatment indicator (1 = received 20% off promo, 0 = no promo)
- Yi(1): Potential outcome if customer i receives the promo
- Yi(0): Potential outcome if customer i doesn't receive the promo
- Yi: Observed outcome = Yi(Wi)
Example: Customer Alice
• YAlice(1) = 1 (would purchase if she receives 20% off promo)
• YAlice(0) = 0 (would not purchase without the promo)
If Alice actually receives the promo (WAlice = 1), we observe YAlice = 1.
The causal effect of the promo on Alice is: YAlice(1) - YAlice(0) = 1 - 0 = 1
Visual Example: Five Customers
Here's a table showing potential outcomes for 5 customers. The ✓ marks which outcome we actually observe:
| Customer | Y(1) (with promo) | Y(0) (no promo) | Treatment Effect τ | Received Promo? | Observed Outcome |
|---|---|---|---|---|---|
| Alice | 1 ✓ | 0 ? | +1 | Yes | 1 |
| Bob | 1 ? | 1 ✓ | 0 | No | 1 |
| Carol | 1 ✓ | 0 ? | +1 | Yes | 1 |
| David | 0 ? | 0 ✓ | 0 | No | 0 |
| Emma | 1 ? | 0 ✓ | +1 | No | 0 |
Notice: We see the checkmarked outcomes, but the counterfactuals (marked with ?) are forever hidden. Bob would purchase even without a promo (τ=0), while Alice and Carol only purchase with the discount (τ=1). The true ATE = (1+0+1+0+1)/5 = 0.6, but we can't compute this directly from observations!
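The same table can be written as a short pandas sketch (the data are the made-up values above). It shows that the naive treated-vs-control comparison of observed outcomes does not equal the true ATE, even in this toy example:

```python
import pandas as pd

# The five customers from the table above. We only "know" both potential
# outcomes because this is invented data; in reality one is always hidden.
df = pd.DataFrame({
    "customer": ["Alice", "Bob", "Carol", "David", "Emma"],
    "y1": [1, 1, 1, 0, 1],   # potential outcome with the promo
    "y0": [0, 1, 0, 0, 0],   # potential outcome without the promo
    "w":  [1, 0, 1, 0, 0],   # 1 = actually received the promo
})

df["effect"] = df["y1"] - df["y0"]                           # individual treatment effect
df["y_obs"] = df["w"] * df["y1"] + (1 - df["w"]) * df["y0"]  # the outcome we actually observe

true_ate = df["effect"].mean()
naive_diff = df.loc[df["w"] == 1, "y_obs"].mean() - df.loc[df["w"] == 0, "y_obs"].mean()

print(f"True ATE:                  {true_ate:.2f}")    # 0.60
print(f"Naive difference in means: {naive_diff:.2f}")  # 1.00 - 0.33 = 0.67
```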
4. The Fundamental Problem of Causal Inference
Here's the catch: for any individual, we can never observe both potential outcomes. This is called the Fundamental Problem of Causal Inference.
The Fundamental Problem:
We observe Yi(1) or Yi(0), but never both. One potential outcome is always counterfactual—it describes what would have happened in an alternate reality.
If Alice received the 20% off promo, we'll never know what she would have done without it. This missing outcome is the counterfactual.
This is why causal inference is challenging: we need to estimate something that is fundamentally unobservable at the individual level. However, we can estimate average causal effects across groups.
5. Defining Causal Effects
5.1 Average Treatment Effect (ATE)
The Average Treatment Effect is the average difference in outcomes if everyone received the treatment versus if no one received it:
ATE = E[Yi(1) - Yi(0)]
In our example: "On average, what is the effect of offering a 20% off promo to a randomly selected customer?"
5.2 Average Treatment Effect on the Treated (ATT)
The ATT is the average effect for those who actually received the treatment:
ATT = E[Yi(1) - Yi(0) | Wi = 1]
In our example: "What is the effect of the 20% off promo among customers who actually received it?"
This matters when treatment effects differ across people, and the treated group isn't randomly selected.
5.3 Conditional Average Treatment Effect (CATE)
The CATE is the average treatment effect for a specific subgroup with characteristics X:
CATE(x) = E[Yi(1) - Yi(0) | Xi = x]
In our example: "What is the effect of the 20% off promo for high-value customers vs. new customers?"
Understanding treatment effect heterogeneity is crucial for personalization and targeting.
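To see how the three estimands can differ, here is a hedged simulation sketch. All names and numbers (a price_sensitive segment, the purchase probabilities, the targeting rule) are invented for illustration; because the data are simulated, we can compute each estimand directly from the potential outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical customer feature: 1 = price-sensitive, 0 = brand-loyal
price_sensitive = rng.binomial(1, 0.4, size=n)

# Invented potential purchase probabilities: the promo helps price-sensitive
# customers (+20 points) more than brand-loyal ones (+5 points).
p0 = np.where(price_sensitive == 1, 0.10, 0.40)
p1 = np.where(price_sensitive == 1, 0.30, 0.45)
y0 = rng.binomial(1, p0)   # potential outcome without the promo
y1 = rng.binomial(1, p1)   # potential outcome with the promo

# Non-random targeting: marketing mostly sends the promo to brand-loyal customers
w = rng.binomial(1, np.where(price_sensitive == 1, 0.2, 0.6))

ate = (y1 - y0).mean()                                               # effect for a random customer
att = (y1 - y0)[w == 1].mean()                                       # effect for those actually treated
cate = {x: (y1 - y0)[price_sensitive == x].mean() for x in (0, 1)}   # effect by segment

print(f"ATE  = {ate:.3f}")    # ~0.11
print(f"ATT  = {att:.3f}")    # ~0.08, pulled toward the brand-loyal effect by targeting
print(f"CATE = {cate}")       # roughly {0: 0.05, 1: 0.20}
```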
5.4 Key Assumptions: SUTVA
Before we can meaningfully define and estimate causal effects, we need a critical assumption called SUTVA (Stable Unit Treatment Value Assumption):
SUTVA has two components:
- No interference between units: One customer receiving a promo doesn't affect another customer's purchase decision. Formally: Yi(d1, d2, ..., dn) = Yi(di) — customer i's outcome depends only on their own treatment.
- No hidden variations of treatment: All "20% off promos" are identical. There's only one version of treatment and one version of control, so we can write Yi(1) and Yi(0) unambiguously.
Why SUTVA Matters for Our Example:
Potential violations in e-commerce:
- Interference: Customers might share promo codes with friends/family, or discuss purchases on social media, affecting others' decisions. Network effects violate the no-interference assumption.
- Treatment variations: If some customers receive the promo via email, others via SMS, and others via app notification, we have multiple "versions" of treatment, violating consistency.
- Market effects: At scale, widespread discounts might change prices or inventory availability, creating spillovers.
For this series, we'll generally assume SUTVA holds, but keep these potential violations in mind when applying methods to real-world problems.
6. Randomized Controlled Trials: The Gold Standard
How do we solve the fundamental problem? The gold standard is a Randomized Controlled Trial (RCT):
🎲 Randomization:
Randomly assign customers to receive the 20% off promo or not. Each customer has the same probability of receiving the promo, regardless of their characteristics.
Why it works:
Randomization ensures that, on average, the treated and control groups are identical in all characteristics—both observed (age, past purchases) and unobserved (motivation, preferences). Any difference in outcomes can be attributed to the promo.
Mathematically, randomization makes treatment assignment independent of the potential outcomes, so
E[Yi(1) | Wi = 1] = E[Yi(1)] and E[Yi(0) | Wi = 0] = E[Yi(0)]
This means we can estimate the ATE by comparing average observed outcomes between groups:
ATE = E[Yi | Wi = 1] - E[Yi | Wi = 0]
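Here is a minimal simulation sketch of that argument (all numbers invented): when the promo is assigned by a coin flip, the simple difference in observed purchase rates recovers the true ATE, even though each customer's counterfactual stays hidden.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Invented potential outcomes: a 20% baseline purchase probability,
# and the promo raises it by 10 points for everyone.
y0 = rng.binomial(1, 0.20, size=n)
y1 = rng.binomial(1, 0.30, size=n)

# Randomized assignment: every customer gets the promo with probability 0.5,
# independently of their potential outcomes.
w = rng.binomial(1, 0.5, size=n)
y_obs = np.where(w == 1, y1, y0)

true_ate = (y1 - y0).mean()
diff_in_means = y_obs[w == 1].mean() - y_obs[w == 0].mean()

print(f"True ATE:            {true_ate:.3f}")       # ~0.10
print(f"Difference in means: {diff_in_means:.3f}")  # ~0.10, unbiased under randomization
```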
Practical RCT Challenges in E-commerce
Running an RCT isn't always straightforward. Real-world challenges include:
- Sample size requirements: To detect a 5 percentage point increase in purchase rate (20% → 25%) with 80% power and 5% significance, you'd need roughly 1,100 customers per group (about 2,200 total); a quick power calculation is sketched after this list. Smaller effects require even larger samples.
- Spillover effects: What if customers share promo codes on deal-sharing websites? Control group customers might find and use the codes, creating non-compliance. Or social media posts might create hype, violating SUTVA.
- Duration and timing: How long should the experiment run? Seasonality (holidays, payday cycles) might create confounding. Running too short misses delayed effects; too long increases costs.
- Ethical considerations: Is it fair to randomly deny discounts to some customers? What if some customers really need the discount? This is less serious for promos than medical trials, but still matters for customer satisfaction and equity.
- Revenue costs: Giving random discounts means losing margin from customers who would have paid full price. Finance teams may resist purely experimental promos.
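The sample-size figure in the first bullet can be sanity-checked with statsmodels. This is a quick sketch assuming a two-sided test on two independent proportions with equal group sizes:

```python
# Sample size to detect a 20% -> 25% lift in purchase rate,
# alpha = 0.05 (two-sided), power = 0.80, equal-sized groups.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.25, 0.20)  # Cohen's h for the two rates
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
print(f"~{n_per_group:.0f} customers per group")  # roughly 1,100
```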
7. Selection Bias and Observational Data
In practice, randomization isn't always possible. Maybe:
- The marketing team already sent promos to selected customers (observational data)
- Randomization is too expensive (giving discounts to everyone costs money)
- You need to analyze historical promotional campaigns
When treatment assignment is not random, we face selection bias:
Selection Bias:
The difference in observed outcomes reflects not just the causal effect, but also pre-existing differences between groups:
E[Yi | Wi = 1] - E[Yi | Wi = 0]
= E[Yi(1) | Wi = 1] - E[Yi(0) | Wi = 1]   (causal effect on the treated, ATT)
+ E[Yi(0) | Wi = 1] - E[Yi(0) | Wi = 0]   (selection bias)
In short: Observed difference = ATT + Selection Bias. The selection-bias term is the gap in baseline (no-promo) purchase rates between the customers who were targeted and those who were not.
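A hedged simulation sketch of this decomposition, with invented numbers: loyal customers are both more likely to be targeted and more likely to purchase anyway, so the naive comparison overstates the promo's effect by exactly the selection-bias term.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500_000

loyal = rng.binomial(1, 0.3, size=n)                     # invented confounder
y0 = rng.binomial(1, np.where(loyal == 1, 0.40, 0.10))   # would purchase anyway
y1 = rng.binomial(1, np.where(loyal == 1, 0.50, 0.30))   # purchase with the promo

# Non-random targeting: loyal customers are far more likely to get the promo
w = rng.binomial(1, np.where(loyal == 1, 0.8, 0.2))
y_obs = np.where(w == 1, y1, y0)

naive = y_obs[w == 1].mean() - y_obs[w == 0].mean()      # observed difference
att = (y1 - y0)[w == 1].mean()                           # causal effect on the treated
selection_bias = y0[w == 1].mean() - y0[w == 0].mean()   # baseline gap between groups

print(f"Naive difference: {naive:.3f}")
print(f"ATT:              {att:.3f}")
print(f"Selection bias:   {selection_bias:.3f}")
print(f"ATT + bias:       {att + selection_bias:.3f}")   # equals the naive difference
```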
The entire field of causal inference for observational data is about developing methods to remove or adjust for selection bias. We'll explore these methods in subsequent weeks:
Preview: Identification Strategies for Observational Data
Different methods work in different scenarios. Here's a preview of what's coming:
- Matching & Propensity Scores (Week 3): When you observe all confounders.
  Example: Control for customer age, past purchases, and browsing history to make treated/control groups comparable.
- Regression & Difference-in-Differences (Week 4): When you have panel data (same units over time) or can control for time-invariant confounders.
  Example: Compare how purchases changed for customers who received promos vs. those who didn't, before and after the promo rollout.
- Instrumental Variables (Week 4): When you have a source of quasi-random variation that affects treatment but not outcomes directly.
  Example: Use random email delivery delays as an instrument: they affect who sees the promo but don't directly affect purchases.
- Regression Discontinuity (Week 4): When treatment has a sharp threshold.
  Example: Promos sent only to customers with cart value above $100, so we compare customers just above vs. just below the threshold.
- Double Machine Learning (Week 5): When you have many confounders (high-dimensional X) and need flexible models to control for them.
  Example: Control for hundreds of features (browsing patterns, demographics, preferences) using ML while still getting valid causal estimates.
- Causal Forests & Meta-Learners (Weeks 6-7): When treatment effects are heterogeneous and you want to personalize: who benefits most from promos?
  Example: Discover that price-sensitive customers respond strongly to promos, while brand-loyal customers don't need them.
Each method makes different assumptions and is suited to different data structures. Choosing the right method requires understanding your causal graph (Week 2) and data availability.
8. Key Takeaways
✓ Causal questions ask "what would happen if?" not just "what will happen?"
✓ Correlation ≠ causation because observed differences may reflect selection bias
✓ Potential outcomes: each unit has multiple potential outcomes Yi(1), Yi(0), but we observe only one
✓ Fundamental problem: counterfactuals are unobservable at the individual level
✓ ATE, ATT, CATE: different ways to define average causal effects
✓ Randomization solves the identification problem by making treated/control groups comparable
✓ Selection bias arises in observational data when treatment assignment is not random
9. Next Week Preview
Now that we understand potential outcomes, we need a way to represent and reason about complex causal relationships. Next week, we'll learn about Causal Graphs and DAGs:
- How to draw and interpret causal diagrams
- Confounders, mediators, and colliders
- d-separation and the backdoor criterion
- When can we identify causal effects from observational data?
We'll continue using our promotional discount example to see how DAGs help us think about which variables to control for.