1. Motivation: Which Variables Should We Control For?
Last week, we learned that observational data suffers from selection bias. In our promo example, customers who receive discounts differ from those who don't. The solution seems simple: control for those differences!
The Naive Approach:
"Let's just control for everything! We have data on customer age, gender, location, past purchases, browsing history, device type, time of day, cart value... let's throw it all into a regression."
⚠️ The Problem:
Controlling for the wrong variables can make bias worse, not better. Some variables we should control for, some we shouldn't, and some we absolutely must not.
Three Questions That Arise Naturally:
Question 1: Should we control for past purchase behavior?
Loyal customers get more promos AND purchase more. Controlling for loyalty seems right... or does it create problems?
Question 2: Should we control for email open rate?
Only customers who open promo emails see the discount. Is email open rate a confounder, a mediator, or something else?
Question 3: Should we control for cart abandonment?
We often send promos to customers who abandoned carts. Both promo receipt and purchases depend on cart abandonment. Should we control for it?
To answer these questions rigorously, we need a formal way to represent causal relationships. Enter: Directed Acyclic Graphs (DAGs).
2. Introduction to Causal Graphs (DAGs)
A Directed Acyclic Graph (DAG) is a visual representation of causal relationships among variables.
DAG Components:
- Nodes (circles): Represent variables
- Directed edges (arrows): Represent direct causal effects (X → Y means "X causes Y")
- Acyclic: No variable can cause itself through a chain of effects (no cycles/loops)
⚠️ Common Misconceptions About DAGs
❌ Myth: "Arrows in DAGs represent correlations"
✅ Reality: Arrows represent causal effects, not mere associations. Correlation ≠ causation ≠ arrows in a DAG
❌ Myth: "Control for everything to be safe"
✅ Reality: Controlling for colliders and mediators makes bias worse or blocks causal pathways
❌ Myth: "Conditioning on variables in regression equals experimental control"
✅ Reality: Statistical control can remove confounding only if you control for the right variables and measure them perfectly
❌ Myth: "If X and Y are associated, draw X → Y or Y → X"
✅ Reality: They could be confounded (both caused by Z) with no direct causal link
Simple Example: Smoking → Lung Cancer
The arrow indicates that smoking directly causes an increased risk of lung cancer.
Adding a Confounder:
    Genetics
   ↙        ↘
Smoking  →  Lung Cancer
Genetic predisposition affects both smoking behavior AND lung cancer risk independently. This is a confounder.
Key DAG Terminology:
Parent:
A variable with an arrow pointing to another. (Genetics is a parent of Smoking)
Child:
A variable that has an arrow pointing to it. (Smoking is a child of Genetics)
Path:
A sequence of edges connecting variables (direction doesn't matter for paths)
Directed Path:
A path where all arrows point in the same direction (following arrow heads)
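To make the terminology concrete, here is a minimal sketch (our own illustration, not part of the original notes) that encodes the smoking DAG with the networkx package; parents and children correspond directly to a directed graph's predecessors and successors.

```python
import networkx as nx

# Each directed edge is a direct causal claim: Genetics -> Smoking, etc.
dag = nx.DiGraph([
    ("Genetics", "Smoking"),
    ("Genetics", "Lung Cancer"),
    ("Smoking", "Lung Cancer"),
])

print(nx.is_directed_acyclic_graph(dag))   # True: a DAG must contain no cycles
print(list(dag.predecessors("Smoking")))   # parents of Smoking: ['Genetics']
print(list(dag.successors("Genetics")))    # children of Genetics: ['Smoking', 'Lung Cancer']

# Directed paths from Genetics to Lung Cancer (edges followed head to tail).
print(list(nx.all_simple_paths(dag, "Genetics", "Lung Cancer")))
```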
3. Building a DAG for Our Promo Example
Let's construct a DAG for our e-commerce promo scenario step by step, thinking carefully about what causes what.
Step 1: Core Relationship
Promo → Purchase
This is the causal effect we want to estimate.
Step 2: Add Customer Loyalty (Confounder)
   Customer Loyalty
   ↙             ↘
Promo  →  Purchase
Loyal customers are more likely to be targeted for promos AND more likely to purchase anyway. This creates a backdoor path: Promo ← Loyalty → Purchase
Step 3: Add Email Engagement (Mediator)
      Customer Loyalty
     ↙               ↘
Promo → Email Open → Purchase
The promo affects purchase through email opens. Email engagement is a mediator: it's part of the causal mechanism.
Step 4: Complete DAG with Cart Abandonment (Collider)
          Customer Loyalty
         ↙        ↓        ↘
  Promo  →  Email Open  →  Purchase
         ↘                ↙
           Cart Abandon

(There is also a direct edge Promo → Purchase, omitted from the sketch for readability.)
Cart abandonment is caused by both promo receipt (we trigger promos for cart abandoners) AND by purchase intent (people who don't purchase abandon carts). This is a collider.
This DAG encodes our assumptions about the causal structure. Now let's learn how to use it to determine which variables to control for.
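One way to make these assumptions tangible is to simulate data that follows the drawn edges. The sketch below is a toy data-generating process: every coefficient, the sample size, and the use of Purchase (standing in for purchase intent) as a parent of Cart Abandon are illustrative assumptions, not estimates from real customers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 50_000

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Loyalty: exogenous confounder.
loyalty = rng.normal(size=n)

# Loyalty -> Promo: loyal customers are more likely to be targeted.
promo = rng.binomial(1, sigmoid(loyalty - 0.5))

# Promo -> Email Open and Loyalty -> Email Open (the mediator).
email_open = rng.binomial(1, sigmoid(1.5 * promo + 0.5 * loyalty - 1.0))

# Email Open -> Purchase, Loyalty -> Purchase, plus a small direct Promo -> Purchase effect.
purchase = rng.binomial(1, sigmoid(0.3 * promo + 1.0 * email_open + 0.8 * loyalty - 1.5))

# Promo -> Cart Abandon <- Purchase (the collider).
cart_abandon = rng.binomial(1, sigmoid(1.0 * promo - 1.5 * purchase - 0.5))

df = pd.DataFrame({"loyalty": loyalty, "promo": promo, "email_open": email_open,
                   "purchase": purchase, "cart_abandon": cart_abandon})
print(df.mean().round(3))
```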
4. Confounders: The Variables We Must Control
Definition: Confounder
A variable that causally affects both treatment and outcome, creating a spurious association (a backdoor path) between them.
Structure: Confounder → Treatment, Confounder → Outcome
Customer Loyalty as a Confounder
Why it's a confounder:
- Loyalty → Promo: Marketing targets loyal customers for retention promos
- Loyalty → Purchase: Loyal customers purchase more frequently, independent of promos
The problem:
If we don't control for loyalty, we'll attribute purchases to the promo when they're actually driven by pre-existing loyalty. The association is confounded.
The solution:
Control for loyalty (e.g., number of past purchases, account age) to "block" the backdoor path and isolate the causal effect of the promo.
Answer to Question 1:
✅ Yes, control for customer loyalty
It's a confounder. Failing to control for it leads to omitted variable bias: we'd overestimate the promo effect because loyal customers purchase more anyway.
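A quick numerical check of this claim on simulated data (the "true" promo effect of 0.2 and everything else here are invented numbers, not estimates from real customers): the unadjusted regression overstates the promo effect, and adding loyalty as a control recovers it.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 50_000

loyalty = rng.normal(size=n)
promo = rng.binomial(1, 1 / (1 + np.exp(-loyalty)))          # Loyalty -> Promo
purchase = 0.2 * promo + 0.8 * loyalty + rng.normal(size=n)  # true promo effect = 0.2

df = pd.DataFrame({"promo": promo, "loyalty": loyalty, "purchase": purchase})

naive = smf.ols("purchase ~ promo", data=df).fit()
adjusted = smf.ols("purchase ~ promo + loyalty", data=df).fit()

print(naive.params["promo"])     # biased upward: promo recipients are more loyal
print(adjusted.params["promo"])  # close to the true 0.2 once loyalty is controlled
```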
5. Mediators: The Mechanism of Causation
Definition: Mediator
A variable that lies on the causal path between treatment and outcome. The treatment affects the mediator, which in turn affects the outcome.
Structure: Treatment → Mediator → Outcome
Email Open as a Mediator
Why it's a mediator:
- Promo → Email Open: You can't respond to a promo without opening the email
- Email Open → Purchase: Seeing the discount triggers a purchase decision
The problem with controlling for mediators:
If we control for email opens, we're blocking part of the causal effect we want to measure! The promo works through email engagement; that's the mechanism.
Controlling for a mediator gives you the direct effect (promo → purchase not through email), but destroys the indirect effect (promo → email → purchase). Usually, we want the total effect (direct + indirect).
When Might We Control for Mediators?
Sometimes we specifically want to decompose effects:
- Mediation analysis: How much of the promo effect is mediated by email opens vs. other channels?
- Mechanism testing: Does the promo work only through email awareness, or are there other pathways?
But for estimating the total causal effect of promos on purchases, do NOT control for mediators.
Answer to Question 2:
❌ No, do NOT control for email opens (for total effect)
Email open is a mediator. Controlling for it blocks the causal pathway and underestimates the promo's total effect. Only control for it if you're specifically interested in direct effects.
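The same kind of simulation shows what conditioning on a mediator does. In this hedged sketch the promo is randomized, the direct effect is 0.1, and the mediated effect is 0.7 × 0.5 = 0.35; all numbers are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 50_000

promo = rng.binomial(1, 0.5, size=n)                            # randomized: no confounding here
email_open = 0.7 * promo + rng.normal(size=n)                   # Promo -> Email Open
purchase = 0.1 * promo + 0.5 * email_open + rng.normal(size=n)  # direct 0.1, indirect 0.35

df = pd.DataFrame({"promo": promo, "email_open": email_open, "purchase": purchase})

total = smf.ols("purchase ~ promo", data=df).fit()
direct = smf.ols("purchase ~ promo + email_open", data=df).fit()

print(total.params["promo"])   # ~0.45 = total effect (direct + indirect)
print(direct.params["promo"])  # ~0.10 = direct effect only; the mediated part is blocked
```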
6. Colliders: The Variables We Must NOT Control
Definition: Collider
A variable that is causally affected by two or more other variables. Arrows collide into it.
Structure: A → Collider ← B
Cart Abandonment as a Collider
Promo → Cart Abandon ← Purchase Intent
Why it's a collider:
- Promo → Cart Abandon: We send promos to people who abandoned carts
- Purchase Intent → Cart Abandon: People with low intent abandon carts
⚠️ Collider Bias (Selection Bias)
Controlling for a collider creates a spurious association between its causes, even if they're independent. This is called collider bias or selection bias.
Example: How Collider Bias Works
Scenario:
Imagine promo receipt and purchase intent are initially independent (no causal relationship). But both affect cart abandonment.
What happens if we condition on cart abandonment?
Among people who abandoned carts:
- If someone received a promo, they're more likely to have high purchase intent: low-intent people abandon carts even without a promo, so among abandoners the promo "explains away" the abandonment and makes low intent less likely
- This creates a spurious positive correlation between promo and purchase intent within the conditioned group
Conditioning on a collider "opens" a path between its parents that was previously blocked, inducing bias.
Real-World Example: Berkson's Paradox
Hospital admission example: Talent and Beauty are independent. But both increase probability of being admitted to a prestigious hospital. Among hospital patients, talent and beauty appear negatively correlated (because if you're there despite low talent, you must be very beautiful, and vice versa).
This is collider bias: conditioning on hospital admission creates a spurious correlation.
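A few lines of simulation reproduce the pattern (the admission rule "talent + beauty above a threshold" is our own illustrative assumption, not a real admissions model):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

talent = rng.normal(size=n)
beauty = rng.normal(size=n)            # independent of talent by construction
admitted = (talent + beauty > 1.5)     # both raise the chance of admission (the collider)

print(np.corrcoef(talent, beauty)[0, 1])                      # ~0 in the full population
print(np.corrcoef(talent[admitted], beauty[admitted])[0, 1])  # clearly negative among the admitted
```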
Answer to Question 3:
❌ No, do NOT control for cart abandonment
It's a collider. Controlling for it induces bias by creating a spurious association between promo receipt and purchase intent. Leave it uncontrolled to avoid collider bias.
Summary: When to Control for Variables
| Structure | Example | Should Control? | Why? |
|---|---|---|---|
| Confounder | X ← C → Y | ✅ Yes | Blocks backdoor path; removes spurious association |
| Mediator | X → M → Y | ❌ No (for total effect) | Would block causal pathway; only control for direct effects |
| Collider | X → L ← Y | ❌ No | Would open blocked path; induces spurious association |
7. d-Separation: Reading Independence from Graphs
Now that we understand confounders, mediators, and colliders, we need a formal rule to determine when two variables are independent given a set of conditioning variables. This is called d-separation (directional separation).
Definition: d-Separation
Two variables X and Y are d-separated by a set Z if all paths between X and Y are "blocked" by Z. If X and Y are d-separated by Z, then X ⊥ Y | Z (X and Y are independent conditional on Z).
Three Path Types and Blocking Rules:
1. Chain: X → M → Y
- Path is open: When M is not conditioned on
- Path is blocked: When M is conditioned on
Intuition: Conditioning on the mediator M blocks information flow through the chain.
2. Fork: X ← C → Y
- Path is open: When C is not conditioned on
- Path is blocked: When C is conditioned on
Intuition: Conditioning on the confounder C blocks the spurious association it creates.
3. Collider: X → L ← Y
- Path is blocked: When L is not conditioned on
- Path is open: When L is conditioned on (or any descendant of L)
Intuition: Colliders naturally block paths. Conditioning on them opens the path (induces collider bias).
Example: Applying d-Separation to Our DAG
Question: Is Promo ⊥ Purchase | {Loyalty, Cart Abandon}?
DAG paths from Promo to Purchase:
- Direct path: Promo → Purchase (always open, this is the causal effect we want)
- Backdoor path: Promo ← Loyalty → Purchase (blocked by conditioning on Loyalty)
- Collider path: Promo → Cart Abandon ← Purchase (blocked unless we condition on Cart Abandon)
Answer:
No, they are NOT d-separated. The direct path is always open (it's the causal path), and conditioning on Cart Abandon additionally opens the collider path, adding bias. But if we condition on Loyalty alone (not Cart Abandon), all backdoor paths are blocked, allowing us to identify the causal effect.
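You can check statements like this mechanically. The sketch below assumes a recent networkx (≥ 3.0), where nx.d_separated performs exactly this test; the newest releases rename it is_d_separator, so adjust accordingly. The edge list mirrors the promo DAG, including the direct Promo → Purchase edge.

```python
import networkx as nx

# Edges of the completed promo DAG from Section 3.
dag = nx.DiGraph([
    ("Loyalty", "Promo"), ("Loyalty", "EmailOpen"), ("Loyalty", "Purchase"),
    ("Promo", "EmailOpen"), ("EmailOpen", "Purchase"), ("Promo", "Purchase"),
    ("Promo", "CartAbandon"), ("Purchase", "CartAbandon"),
])

# Promo and Purchase are never d-separated: the causal paths stay open.
print(nx.d_separated(dag, {"Promo"}, {"Purchase"}, {"Loyalty", "CartAbandon"}))  # False

# An independence the DAG *does* imply (and that could be tested against data):
print(nx.d_separated(dag, {"Loyalty"}, {"CartAbandon"}, {"Promo", "Purchase"}))  # True
```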
8. The Backdoor Criterion: Identifying Valid Control Sets
The backdoor criterion gives us a formal rule for which variables to control for to identify causal effects from observational data.
Backdoor Criterion (Pearl, 1995)
A set of variables Z satisfies the backdoor criterion relative to (X, Y) if:
- Z blocks all backdoor paths from X to Y (paths that have an arrow INTO X)
- Z contains no descendants of X (don't control for mediators or their descendants)
What is a Backdoor Path?
A backdoor path is any path from treatment X to outcome Y that begins with an arrow pointing INTO X (i.e., X ← ...).
These paths represent spurious associations (confounding) that we need to block.
Applying Backdoor to Our Promo Example
          Customer Loyalty
         ↙        ↓        ↘
  Promo  →  Email Open  →  Purchase
         ↘                ↙
           Cart Abandon

(Plus the direct edge Promo → Purchase, omitted from the sketch for readability.)
Backdoor paths from Promo to Purchase:
- Promo ← Loyalty → Purchase (OPEN, must block)
- Promo ← Loyalty → Email Open → Purchase (OPEN through Loyalty)
Valid adjustment sets (examples):
- Z = {Loyalty} ✅ (blocks all backdoor paths, doesn't include descendants)
- Z = {} ❌ (doesn't block backdoor paths)
- Z = {Loyalty, Email Open} ❌ (includes mediator, blocks indirect effect)
- Z = {Loyalty, Cart Abandon} ❌ (includes collider, induces bias)
Using Backdoor to Estimate Causal Effects
If Z satisfies the backdoor criterion, then:
P(Y | do(X = x)) = Σ_z P(Y | X = x, Z = z) · P(Z = z)
This is the adjustment formula. It says: to estimate the causal effect of X on Y, stratify by Z and average across strata.
In practice: run a regression of Purchase on Promo and Loyalty, or use matching/weighting methods.
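Here is a hedged sketch of the adjustment formula itself, on simulated data with a binary loyalty segment (every probability below is invented): estimate P(Purchase | Promo, Z) within each stratum of Z, then average the per-stratum contrasts with weights P(Z).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200_000

loyal = rng.binomial(1, 0.4, size=n)                           # Z: binary loyalty segment
promo = rng.binomial(1, np.where(loyal == 1, 0.7, 0.2))        # Loyalty -> Promo
purchase = rng.binomial(1, 0.1 + 0.10 * promo + 0.30 * loyal)  # true uplift = +0.10

df = pd.DataFrame({"loyal": loyal, "promo": promo, "purchase": purchase})

# Naive contrast mixes the promo effect with loyalty differences.
naive = df.loc[df.promo == 1, "purchase"].mean() - df.loc[df.promo == 0, "purchase"].mean()

# Adjustment formula: sum over z of [P(purchase | promo, z) contrast] * P(z).
p_z = df["loyal"].value_counts(normalize=True)
strata = df.groupby(["loyal", "promo"])["purchase"].mean()
adjusted = sum(p_z[z] * (strata[(z, 1)] - strata[(z, 0)]) for z in p_z.index)

print(round(naive, 3))     # inflated by confounding (~0.25)
print(round(adjusted, 3))  # close to the true +0.10
```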
9. The Frontdoor Criterion: When Backdoor Fails
Sometimes we can't satisfy the backdoor criterion because we have unmeasured confounders. The frontdoor criterion provides an alternative identification strategy using mediators.
Scenario: Unmeasured Confounder
    U (unmeasured)
   ↙             ↘
Promo  →  M  →  Purchase
U is an unmeasured confounder (e.g., customer motivation we can't observe). We can't use backdoor. But if we can measure mediator M, we might use frontdoor.
Frontdoor Criterion
A set of variables M satisfies the frontdoor criterion relative to (X, Y) if:
- M intercepts all directed paths from X to Y (M is a mediator)
- There are no backdoor paths from X to M
- All backdoor paths from M to Y are blocked by X
Practical Note:
Frontdoor identification is elegant theoretically but rarely used in practice because:
- It requires strong assumptions (all causal paths mediated, no X → M confounders)
- It requires measuring all mediators perfectly
- Instrumental variables (Week 4) are often more practical for unmeasured confounding
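For completeness, here is what the frontdoor formula, P(y | do(x)) = Σ_m P(m | x) Σ_x' P(y | x', m) P(x'), looks like when run on simulated binary data. The data-generating process (an unmeasured U, a fully mediating M, invented coefficients) is our own and exists only to show the mechanics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 300_000

u = rng.normal(size=n)                                    # unmeasured confounder
x = rng.binomial(1, 1 / (1 + np.exp(-u)))                 # U -> X
m = rng.binomial(1, 0.2 + 0.6 * x)                        # X -> M (no U -> M)
y = rng.binomial(1, 0.1 + 0.4 * m + 0.3 * (u > 0))        # M -> Y and U -> Y (no direct X -> Y)

df = pd.DataFrame({"x": x, "m": m, "y": y})

def frontdoor_mean(df, x_val):
    """P(Y=1 | do(X=x_val)) via the frontdoor formula, for binary X and M."""
    p_x = df["x"].value_counts(normalize=True)
    p_m_given_x = df.loc[df.x == x_val, "m"].value_counts(normalize=True)
    total = 0.0
    for m_val, pm in p_m_given_x.items():
        inner = sum(
            df.loc[(df.x == xp) & (df.m == m_val), "y"].mean() * p_x[xp]
            for xp in p_x.index
        )
        total += pm * inner
    return total

effect = frontdoor_mean(df, 1) - frontdoor_mean(df, 0)
naive = df.loc[df.x == 1, "y"].mean() - df.loc[df.x == 0, "y"].mean()
print(round(naive, 3), round(effect, 3))  # naive is inflated by U; frontdoor is ~0.24, the true effect
```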
10. Brief Introduction to do-Calculus
Backdoor and frontdoor criteria are special cases of a more general framework: do-calculus, developed by Judea Pearl.
The do-Operator
P(Y | do(X=x)) represents the distribution of Y if we intervened to set X=x, breaking all arrows into X.
This is different from P(Y | X=x), which is the conditional distribution in observational data (includes confounding).
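A small simulation makes the distinction concrete (the structural equations and probabilities below are invented for illustration): we compare the observational conditional P(Y=1 | X=1) with the interventional P(Y=1 | do(X=1)), obtained by setting X for everyone and re-drawing Y.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

u = rng.binomial(1, 0.5, size=n)                     # confounder
x_obs = rng.binomial(1, np.where(u == 1, 0.8, 0.2))  # U -> X in observational data

def structural_y(x):
    # Structural equation for Y: caused by X and by the confounder U.
    return rng.binomial(1, 0.2 + 0.1 * x + 0.4 * u)

# Observational: among X=1 units, U tends to be 1, so Y looks much higher.
y_obs = structural_y(x_obs)
print(y_obs[x_obs == 1].mean())                      # ~0.62

# Interventional: do(X=1) sets X for everyone, severing the U -> X arrow.
print(structural_y(np.ones(n, dtype=int)).mean())    # ~0.50
```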
do-Calculus Rules (Simplified)
do-calculus provides three rules for manipulating do-expressions to determine if a causal effect is identifiable:
- Rule 1 (Insertion/deletion of observations): When can we ignore observing a variable?
- Rule 2 (Action/observation exchange): When can we replace do(X) with observing X?
- Rule 3 (Insertion/deletion of actions): When can we ignore an intervention?
These rules are applied algorithmically to derive identifiability results. Backdoor and frontdoor are derivable from do-calculus.
Practical Takeaway:
For most practical applications, you don't need to master do-calculus. Understanding backdoor/frontdoor criteria and d-separation is sufficient. Software like dagitty can automatically determine valid adjustment sets from your DAG.
11. Practical DAG Construction and Pitfalls
How to Build a Good DAG:
- Start with treatment and outcome: These are your core nodes
- Add known confounders: Variables that affect both treatment and outcome
- Think about selection mechanisms: How is treatment assigned? What affects it?
- Consider mediators: What pathways connect treatment to outcome?
- Don't include everything: Only include variables causally related to treatment or outcome
- Use domain knowledge: DAGs encode assumptions; get expert input
- Test implications: DAGs imply conditional independencies; test them in data
Common Pitfalls:
Pitfall 1: Including Non-Causal Associations
Only draw arrows for direct causal effects, not mere correlations. If A and B are correlated through unmeasured C, don't draw A → B or B → A.
Pitfall 2: Forgetting Unmeasured Confounders
Include unmeasured confounders in your DAG (marked as latent). This helps you recognize when identification is impossible without additional assumptions.
Pitfall 3: Overcontrolling
Don't just "control for everything." Controlling for mediators, colliders, or descendants of treatment introduces bias.
Pitfall 4: Reverse Causation
Be careful about arrow direction. Does loyalty cause promo receipt, or does promo receipt cause loyalty? Often both (feedback loops), which violates acyclicity; you may need to model time explicitly.
Pitfall 5: Ignoring Sample Selection
If your sample is selected based on certain criteria (e.g., only analyzing customers who received an email), you're conditioning on a collider. Model the selection process explicitly.
Tools for DAG Analysis:
- DAGitty: Web-based tool for drawing DAGs and computing adjustment sets (http://dagitty.net)
- ggdag (R): R package for creating and analyzing DAGs
- CausalNex (Python): Python library for causal reasoning with Bayesian networks
- DoWhy (Python): Microsoft's library for causal inference, includes DAG-based identification
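As a closing illustration of tooling, here is a hedged DoWhy sketch on simulated data. The graph-string format and method names follow DoWhy's documented CausalModel API at the time of writing, but check the current docs: accepted graph formats have changed across versions.

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(6)
n = 10_000
loyalty = rng.normal(size=n)
promo = rng.binomial(1, 1 / (1 + np.exp(-loyalty)))
purchase = 0.2 * promo + 0.8 * loyalty + rng.normal(size=n)
df = pd.DataFrame({"promo": promo, "loyalty": loyalty, "purchase": purchase})

# DAG as a DOT string; recent DoWhy versions also accept a networkx DiGraph.
graph = "digraph { loyalty -> promo; loyalty -> purchase; promo -> purchase; }"

model = CausalModel(data=df, treatment="promo", outcome="purchase", graph=graph)
estimand = model.identify_effect()                  # finds the backdoor set {loyalty}
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)                               # ~0.2, the simulated promo effect
```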
12. Key Takeaways
✅ DAGs provide a formal language for encoding causal assumptions and determining which variables to control for
✅ Confounders (X ← C → Y) create backdoor paths and must be controlled for to block spurious associations
✅ Mediators (X → M → Y) lie on the causal path; don't control for them if you want the total effect
✅ Colliders (X → L ← Y) block paths when uncontrolled; conditioning on them opens paths and induces bias
✅ d-separation tells us when variables are conditionally independent given a set of controls
✅ Backdoor criterion identifies valid adjustment sets: block all backdoor paths, don't include descendants of treatment
✅ Frontdoor criterion provides identification when unmeasured confounding exists but all effects are mediated
✅ do-calculus is the general framework; backdoor/frontdoor are special cases
✅ DAGs encode assumptions; they're only as good as your domain knowledge. Use them to make assumptions explicit and testable
13. Next Week Preview
Now we know which variables to control for (thanks to DAGs and backdoor criterion). Next week, we'll learn how to control for them using matching and propensity scores:
- Exact matching and coarsened exact matching
- Propensity score estimation and interpretation
- Matching, stratification, and inverse probability weighting
- Assessing covariate balance and overlap
- Sensitivity analysis for unobserved confounding
We'll continue with our promotional discount example, using customer loyalty and other features to create comparable treatment and control groups.