Module 1, Week 2: Causal Graphs and DAGs

Article 2 of 13 • 18 min read

📊 Continuing Our Example: 20% Discount Promo

Last week, we asked: Does offering a 20% discount promo code cause customers to purchase more?

We learned about potential outcomes and the fundamental problem of causal inference. This week, we'll use causal graphs to map out the variables that might affect both promo assignment and purchases, helping us figure out which variables we need to control for in observational data.

1. Motivation: Which Variables Should We Control For?

Last week, we learned that observational data suffers from selection bias. In our promo example, customers who receive discounts differ from those who don't. The solution seems simple: control for those differences!

The Naive Approach:

"Let's just control for everything! We have data on customer age, gender, location, past purchases, browsing history, device type, time of day, cart value... let's throw it all into a regression."

⚠️ The Problem:

Controlling for the wrong variables can make bias worse, not better. Some variables we should control for, some we shouldn't, and some we absolutely must not.

Three Questions That Arise Naturally:

Question 1: Should we control for past purchase behavior?

Loyal customers get more promos AND purchase more. Controlling for loyalty seems right... or does it create problems?

Question 2: Should we control for email open rate?

Only customers who open promo emails see the discount. Is email open rate a confounder, a mediator, or something else?

Question 3: Should we control for cart abandonment?

We often send promos to customers who abandoned carts. Both promo receipt and purchases depend on cart abandonment. Should we control for it?

To answer these questions rigorously, we need a formal way to represent causal relationships. Enter: Directed Acyclic Graphs (DAGs).

2. Introduction to Causal Graphs (DAGs)

A Directed Acyclic Graph (DAG) is a visual representation of causal relationships among variables.

DAG Components:

  • Nodes (circles): Represent variables
  • Directed edges (arrows): Represent direct causal effects (X → Y means "X causes Y")
  • Acyclic: No variable can cause itself through a chain of effects (no cycles/loops)

⚠️ Common Misconceptions About DAGs

✗ Myth: "Arrows in DAGs represent correlations"

✓ Reality: Arrows represent causal effects, not mere associations. Correlation ≠ causation, and a correlation alone never justifies drawing an arrow

✗ Myth: "Control for everything to be safe"

✓ Reality: Controlling for colliders and mediators makes bias worse or blocks causal pathways

✗ Myth: "Conditioning on variables in regression equals experimental control"

✓ Reality: Statistical control can remove confounding only if you control for the right variables and measure them perfectly

✗ Myth: "If X and Y are associated, draw X → Y or Y → X"

✓ Reality: They could be confounded (both caused by Z) with no direct causal link

Simple Example: Smoking → Lung Cancer

Smoking → Lung Cancer

Arrow indicates that smoking directly causes increased lung cancer risk.

Adding a Confounder:


        Genetics
          ↙    ↘
    Smoking → Lung Cancer
              

Genetic predisposition affects both smoking behavior AND lung cancer risk independently. This is a confounder.

Key DAG Terminology:

Parent:

A variable with an arrow pointing to another. (Genetics is a parent of Smoking)

Child:

A variable that has an arrow pointing to it. (Smoking is a child of Genetics)

Path:

A sequence of edges connecting variables (direction doesn't matter for paths)

Directed Path:

A path where all arrows point in the same direction (following arrow heads)

3. Building a DAG for Our Promo Example

Let's construct a DAG for our e-commerce promo scenario step by step, thinking carefully about what causes what.

Step 1: Core Relationship

Promo → Purchase

This is the causal effect we want to estimate.

Step 2: Add Customer Loyalty (Confounder)


      Customer Loyalty
          ↙         ↘
       Promo    →    Purchase


Loyal customers are more likely to be targeted for promos AND more likely to purchase anyway. This creates a backdoor path: Promo ← Loyalty → Purchase

Step 3: Add Email Engagement (Mediator)


      Customer Loyalty
          ↙         ↘
       Promo  →  Email Open  →  Purchase


The promo affects purchase through email opens. Email engagement is a mediator: it's part of the causal mechanism.

Step 4: Complete DAG with Cart Abandonment (Collider)


      Customer Loyalty
          ↙    ↓    ↘
    Promo → Email Open → Purchase
       ↘       ↓        ↙
         Cart Abandon


Cart abandonment sits downstream of both the promo (receiving a discount changes whether a customer starts a cart and later abandons it) AND purchase intent (customers with low intent abandon their carts). Arrows from both collide into it, so it is a collider.

This DAG encodes our assumptions about the causal structure. Now let's learn how to use it to determine which variables to control for.
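If you like to keep the graph next to your analysis code, here is a minimal sketch (using networkx; the node names are just illustrative labels, and only the edges discussed in the text are included) that encodes the promo DAG and verifies it is actually acyclic:

import networkx as nx

# Edges follow the DAG above: an entry (a, b) means "a causes b".
edges = [
    ("Loyalty", "Promo"),
    ("Loyalty", "Email Open"),
    ("Loyalty", "Purchase"),
    ("Promo", "Email Open"),
    ("Email Open", "Purchase"),
    ("Promo", "Cart Abandon"),
    ("Purchase", "Cart Abandon"),
]

dag = nx.DiGraph(edges)

# A causal DAG must contain no cycles.
assert nx.is_directed_acyclic_graph(dag)

# Parents and children, in the terminology of Section 2.
print("Parents of Promo:   ", sorted(dag.predecessors("Promo")))
print("Children of Promo:  ", sorted(dag.successors("Promo")))
print("Parents of Purchase:", sorted(dag.predecessors("Purchase")))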

4. Confounders: The Variables We Must Control

Definition: Confounder

A variable that causally affects both treatment and outcome, creating a spurious association (a backdoor path) between them.

Structure: Confounder → Treatment, Confounder → Outcome

Customer Loyalty as a Confounder

Why it's a confounder:

  • Loyalty → Promo: Marketing targets loyal customers for retention promos
  • Loyalty → Purchase: Loyal customers purchase more frequently, independent of promos

The problem:

If we don't control for loyalty, we'll attribute purchases to the promo when they're actually driven by pre-existing loyalty. The association is confounded.

The solution:

Control for loyalty (e.g., number of past purchases, account age) to "block" the backdoor path and isolate the causal effect of the promo.

Answer to Question 1:

✓ Yes, control for customer loyalty

It's a confounder. Failing to control for it leads to omitted variable bias: we'd overestimate the promo effect because loyal customers purchase more anyway.
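To see how large the damage from an uncontrolled confounder can be, here is a small simulation sketch with made-up numbers (loyalty raises both promo receipt and purchase probability; the true promo effect is +5 percentage points):

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: loyalty confounds promo and purchase.
loyalty = rng.binomial(1, 0.4, n)                    # 1 = loyal customer
promo = rng.binomial(1, 0.2 + 0.5 * loyalty)         # loyal customers get promos more often
purchase = rng.binomial(1, 0.1 + 0.05 * promo + 0.3 * loyalty)  # true promo effect: +0.05

# Naive comparison, ignoring loyalty.
naive = purchase[promo == 1].mean() - purchase[promo == 0].mean()

# Adjust for loyalty: compare within strata, then average over the loyalty distribution.
adjusted = sum(
    (purchase[(promo == 1) & (loyalty == z)].mean()
     - purchase[(promo == 0) & (loyalty == z)].mean()) * (loyalty == z).mean()
    for z in (0, 1)
)

print(f"naive difference:  {naive:.3f}")     # inflated by confounding (about 0.20 here)
print(f"loyalty-adjusted:  {adjusted:.3f}")  # close to the true +0.05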

5. Mediators: The Mechanism of Causation

Definition: Mediator

A variable that lies on the causal path between treatment and outcome. The treatment affects the mediator, which in turn affects the outcome.

Structure: Treatment → Mediator → Outcome

Email Open as a Mediator

Why it's a mediator:

  • Promo → Email Open: You can't respond to a promo without opening the email
  • Email Open → Purchase: Seeing the discount triggers a purchase decision

The problem with controlling for mediators:

If we control for email opens, we're blocking part of the causal effect we want to measure. The promo works through email engagement; that's the mechanism.

Controlling for a mediator gives you the direct effect (promo → purchase not through email) but removes the indirect effect (promo → email → purchase). Usually, we want the total effect (direct + indirect).

When Might We Control for Mediators?

Sometimes we specifically want to decompose effects:

  • Mediation analysis: How much of the promo effect is mediated by email opens vs. other channels?
  • Mechanism testing: Does the promo work only through email awareness, or are there other pathways?

But for estimating the total causal effect of promos on purchases, do NOT control for mediators.

Answer to Question 2:

✗ No, do NOT control for email opens (for total effect)

Email open is a mediator. Controlling for it blocks the causal pathway and underestimates the promo's total effect. Only control for it if you're specifically interested in direct effects.
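A quick simulation makes the total-vs-direct distinction concrete. All coefficients below are made up; the point is only that adding the mediator to the regression changes which effect you recover:

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical linear model: Promo -> Email Open -> Purchase, plus a direct effect.
promo = rng.binomial(1, 0.5, n).astype(float)
email_open = 0.6 * promo + rng.normal(0, 1, n)
purchase = 0.10 * promo + 0.50 * email_open + rng.normal(0, 1, n)
# Total effect of promo = 0.10 (direct) + 0.6 * 0.50 (indirect) = 0.40

def ols(y, *xs):
    """Coefficients of y regressed on an intercept plus the given regressors."""
    X = np.column_stack([np.ones_like(y), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

total = ols(purchase, promo)[1]               # no mediator in the model
direct = ols(purchase, promo, email_open)[1]  # mediator included ("controlled for")

print(f"without the mediator: {total:.2f}")   # ~0.40, the total effect
print(f"with the mediator:    {direct:.2f}")  # ~0.10, only the direct effect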

6. Colliders: The Variables We Must NOT Control

Definition: Collider

A variable that is causally affected by two or more other variables. Arrows collide into it.

Structure: A → Collider ← B

Cart Abandonment as a Collider


    Promo → Cart Abandon ← Purchase Intent
              

Why it's a collider:

  • Promo → Cart Abandon: Receiving a promo nudges customers to start carts, some of which they later abandon
  • Purchase Intent → Cart Abandon: People with low intent abandon carts

⚠️ Collider Bias (Selection Bias)

Controlling for a collider creates a spurious association between its causes, even if they're independent. This is called collider bias or selection bias.

Example: How Collider Bias Works

Scenario:

Imagine promo receipt and purchase intent are initially independent (no causal relationship). But both affect cart abandonment.

What happens if we condition on cart abandonment?

Among people who abandoned carts:

  • If someone received a promo, the promo already "explains" the abandonment, so low purchase intent becomes a less likely explanation; promo recipients therefore look like they have higher intent
  • This creates a spurious positive correlation between promo and purchase intent within the conditioned group

Conditioning on a collider "opens" a path between its parents that was previously blocked, inducing bias.
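Here is a small simulation sketch of exactly this mechanism: promo and purchase intent are generated independently, both raise cart abandonment, and a correlation appears only once we condition on abandonment (all numbers are made up):

import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Hypothetical setup: promo and purchase intent are independent by construction.
promo = rng.binomial(1, 0.5, n)
high_intent = rng.binomial(1, 0.5, n)

# Both promo receipt and low intent raise the chance of abandoning a cart (the collider).
abandoned = rng.binomial(1, 0.1 + 0.3 * promo + 0.4 * (1 - high_intent))

corr_all = np.corrcoef(promo, high_intent)[0, 1]
mask = abandoned == 1
corr_conditioned = np.corrcoef(promo[mask], high_intent[mask])[0, 1]

print(f"corr(promo, intent), everyone:        {corr_all:+.3f}")          # ~0 by design
print(f"corr(promo, intent), abandoners only: {corr_conditioned:+.3f}")  # positive: collider bias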

Real-World Example: Berkson's Paradox

Hospital admission example: Talent and Beauty are independent. But both increase probability of being admitted to a prestigious hospital. Among hospital patients, talent and beauty appear negatively correlated (because if you're there despite low talent, you must be very beautiful, and vice versa).

This is collider bias: conditioning on hospital admission creates a spurious correlation.

Answer to Question 3:

✗ No, do NOT control for cart abandonment

It's a collider. Controlling for it induces bias by creating a spurious association between promo receipt and purchase intent. Leave it uncontrolled to avoid collider bias.

Summary: When to Control for Variables

Structure | Example | Should Control? | Why?
Confounder | X ← C → Y | ✓ Yes | Blocks backdoor path; removes spurious association
Mediator | X → M → Y | ✗ No (for total effect) | Would block the causal pathway; only control for it to isolate direct effects
Collider | X → L ← Y | ✗ No | Would open a blocked path; induces spurious association

7. d-Separation: Reading Independence from Graphs

Now that we understand confounders, mediators, and colliders, we need a formal rule to determine when two variables are independent given a set of conditioning variables. This is called d-separation (directional separation).

Definition: d-Separation

Two variables X and Y are d-separated by a set Z if all paths between X and Y are "blocked" by Z. If X and Y are d-separated by Z, then X ⊥ Y | Z (X and Y are independent conditional on Z).

Three Path Types and Blocking Rules:

1. Chain: X → M → Y

  • Path is open: When M is not conditioned on
  • Path is blocked: When M is conditioned on

Intuition: Conditioning on the mediator M blocks information flow through the chain.

2. Fork: X ← C → Y

  • Path is open: When C is not conditioned on
  • Path is blocked: When C is conditioned on

Intuition: Conditioning on the confounder C blocks the spurious association it creates.

3. Collider: X → L ← Y

  • Path is blocked: When L is not conditioned on
  • Path is open: When L is conditioned on (or any descendant of L)

Intuition: Colliders naturally block paths. Conditioning on them opens the path (induces collider bias).

Example: Applying d-Separation to Our DAG

Question: Is Promo ⊥ Purchase | {Loyalty, Cart Abandon}?

DAG paths from Promo to Purchase:

  1. Direct path: Promo → Purchase (always open; this is the causal effect we want)
  2. Backdoor path: Promo ← Loyalty → Purchase (blocked by conditioning on Loyalty)
  3. Collider path: Promo → Cart Abandon ← Purchase (blocked by default, but opened here because we condition on Cart Abandon)

Answer:

No, they are NOT d-separated. The direct causal path is always open, and conditioning on Cart Abandon opens the collider path as well. If we instead condition on Loyalty alone (not Cart Abandon), every non-causal path is blocked while the causal path stays open, which is exactly what we need to identify the causal effect.
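The three blocking rules are mechanical enough to code up. The sketch below applies them to the promo DAG for a few explicitly listed paths (real tools such as dagitty or networkx enumerate all paths for you; node names are illustrative):

# Edges of the promo DAG; (a, b) means "a causes b".
edges = {("Loyalty", "Promo"), ("Loyalty", "Email Open"), ("Loyalty", "Purchase"),
         ("Promo", "Email Open"), ("Email Open", "Purchase"),
         ("Promo", "Cart Abandon"), ("Purchase", "Cart Abandon")}

def descendants(node):
    """All nodes reachable from `node` by following arrows."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for a, b in edges:
            if a == current and b not in found:
                found.add(b)
                stack.append(b)
    return found

def path_blocked(path, conditioned):
    """Apply the chain / fork / collider rules to each interior node of a path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        arrows_collide = (prev, node) in edges and (nxt, node) in edges
        if arrows_collide:
            # Collider: blocked unless we condition on it (or one of its descendants).
            if node not in conditioned and not (descendants(node) & conditioned):
                return True
        elif node in conditioned:
            # Chain or fork: conditioning on the middle node blocks the path.
            return True
    return False

paths = [
    ["Promo", "Purchase"],                    # direct causal path
    ["Promo", "Loyalty", "Purchase"],         # backdoor via the confounder
    ["Promo", "Cart Abandon", "Purchase"],    # path through the collider
]
for Z in [set(), {"Loyalty"}, {"Loyalty", "Cart Abandon"}]:
    label = ", ".join(sorted(Z)) or "nothing"
    print(f"conditioning on {label}:")
    for p in paths:
        status = "blocked" if path_blocked(p, Z) else "open"
        print("   " + " - ".join(p) + ": " + status)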

8. The Backdoor Criterion: Identifying Valid Control Sets

The backdoor criterion gives us a formal rule for which variables to control for to identify causal effects from observational data.

Backdoor Criterion (Pearl, 1995)

A set of variables Z satisfies the backdoor criterion relative to (X, Y) if:

  1. Z blocks all backdoor paths from X to Y (paths that have an arrow INTO X)
  2. Z contains no descendants of X (don't control for mediators or their descendants)

What is a Backdoor Path?

A backdoor path is any path from treatment X to outcome Y that begins with an arrow pointing INTO X (i.e., X ← ...).

These paths represent spurious associations (confounding) that we need to block.

Applying Backdoor to Our Promo Example


      Customer Loyalty
          ↙    ↓    ↘
    Promo → Email Open → Purchase
       ↘       ↓        ↙
         Cart Abandon
              

Backdoor paths from Promo to Purchase:

  1. Promo ← Loyalty → Purchase (OPEN, must block)
  2. Promo ← Loyalty → Email Open → Purchase (OPEN through Loyalty)

Valid adjustment sets (examples):

  • Z = {Loyalty} ✓ (blocks all backdoor paths, doesn't include descendants)
  • Z = {} ✗ (doesn't block backdoor paths)
  • Z = {Loyalty, Email Open} ✗ (includes a mediator, blocks the indirect effect)
  • Z = {Loyalty, Cart Abandon} ✗ (includes a collider, induces bias)

Using Backdoor to Estimate Causal Effects

If Z satisfies the backdoor criterion, then:

P(Y | do(X=x)) = Σ_z P(Y | X=x, Z=z) · P(Z=z)

This is the adjustment formula. It says: to estimate the causal effect of X on Y, stratify by Z and average across strata.

In practice: run a regression of Purchase on Promo and Loyalty, or use matching/weighting methods.
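As a tiny numerical illustration of the adjustment formula, here is a sketch with a single binary confounder and made-up conditional probabilities:

# Adjustment formula with one binary confounder Z (all probabilities are assumptions).
P_Z = {0: 0.6, 1: 0.4}                 # P(Z = z)
P_Y1_given_XZ = {                      # P(Y = 1 | X = x, Z = z)
    (0, 0): 0.10, (0, 1): 0.40,
    (1, 0): 0.15, (1, 1): 0.45,
}

def p_y1_do_x(x):
    """Adjustment formula: sum over z of P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(P_Y1_given_XZ[(x, z)] * P_Z[z] for z in P_Z)

print(f"P(Y=1 | do(X=1)) = {p_y1_do_x(1):.2f}")   # 0.15*0.6 + 0.45*0.4 = 0.27
print(f"P(Y=1 | do(X=0)) = {p_y1_do_x(0):.2f}")   # 0.10*0.6 + 0.40*0.4 = 0.22
print(f"Average treatment effect = {p_y1_do_x(1) - p_y1_do_x(0):.2f}")  # 0.05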

9. The Frontdoor Criterion: When Backdoor Fails

Sometimes we can't satisfy the backdoor criterion because we have unmeasured confounders. The frontdoor criterion provides an alternative identification strategy using mediators.

Scenario: Unmeasured Confounder


         U (unmeasured)
        ↙              ↘
    Promo  →  M  →  Purchase
              

U is an unmeasured confounder (e.g., customer motivation we can't observe). We can't use backdoor. But if we can measure mediator M, we might use frontdoor.

Frontdoor Criterion

A set of variables M satisfies the frontdoor criterion relative to (X, Y) if:

  1. M intercepts all directed paths from X to Y (M is a mediator)
  2. There are no backdoor paths from X to M
  3. All backdoor paths from M to Y are blocked by X

Practical Note:

Frontdoor identification is elegant theoretically but rarely used in practice because:

  • It requires strong assumptions (all paths mediated, no X→M confounders)
  • It requires measuring all mediators perfectly
  • Instrumental variables (Week 4) are often more practical for unmeasured confounding

10. Brief Introduction to do-Calculus

Backdoor and frontdoor criteria are special cases of a more general framework: do-calculus, developed by Judea Pearl.

The do-Operator

P(Y | do(X=x)) represents the distribution of Y if we intervened to set X=x, breaking all arrows into X.

This is different from P(Y | X=x), which is the conditional distribution in observational data (includes confounding).

do-Calculus Rules (Simplified)

do-calculus provides three rules for manipulating do-expressions to determine if a causal effect is identifiable:

  1. Rule 1 (Insertion/deletion of observations): When can we ignore observing a variable?
  2. Rule 2 (Action/observation exchange): When can we replace do(X) with observing X?
  3. Rule 3 (Insertion/deletion of actions): When can we ignore an intervention?

These rules are applied algorithmically to derive identifiability results. Backdoor and frontdoor are derivable from do-calculus.

Practical Takeaway:

For most practical applications, you don't need to master do-calculus. Understanding backdoor/frontdoor criteria and d-separation is sufficient. Software like dagitty can automatically determine valid adjustment sets from your DAG.
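For example, DoWhy (listed under tools in Section 11) can derive the adjustment set from your DAG and apply the adjustment formula for you. A hedged sketch on simulated promo data follows; the exact graph-string format and method names can differ slightly across DoWhy versions, so treat this as a starting point and check the current docs:

import numpy as np
import pandas as pd
from dowhy import CausalModel   # pip install dowhy

# Simulated stand-in data for the promo example (column names are illustrative).
rng = np.random.default_rng(3)
n = 5_000
loyalty = rng.binomial(1, 0.4, n)
promo = rng.binomial(1, 0.2 + 0.5 * loyalty)
purchase = rng.binomial(1, 0.1 + 0.05 * promo + 0.3 * loyalty)
df = pd.DataFrame({"loyalty": loyalty, "promo": promo, "purchase": purchase})

# The DAG in GML form: Loyalty -> Promo, Loyalty -> Purchase, Promo -> Purchase.
gml = """graph[directed 1
  node[id "loyalty" label "loyalty"]
  node[id "promo" label "promo"]
  node[id "purchase" label "purchase"]
  edge[source "loyalty" target "promo"]
  edge[source "loyalty" target "purchase"]
  edge[source "promo" target "purchase"]]"""

model = CausalModel(data=df, treatment="promo", outcome="purchase", graph=gml)
estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(estimand)   # should report {loyalty} as a backdoor adjustment set

estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)   # close to the simulated +0.05 effect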

11. Practical DAG Construction and Pitfalls

How to Build a Good DAG:

  1. Start with treatment and outcome: These are your core nodes
  2. Add known confounders: Variables that affect both treatment and outcome
  3. Think about selection mechanisms: How is treatment assigned? What affects it?
  4. Consider mediators: What pathways connect treatment to outcome?
  5. Don't include everything: Only include variables causally related to treatment or outcome
  6. Use domain knowledge: DAGs encode assumptions; get expert input
  7. Test implications: DAGs imply conditional independencies; test them in data

Common Pitfalls:

Pitfall 1: Including Non-Causal Associations

Only draw arrows for direct causal effects, not mere correlations. If A and B are correlated through unmeasured C, don't draw A → B or B → A.

Pitfall 2: Forgetting Unmeasured Confounders

Include unmeasured confounders in your DAG (marked as latent). This helps you recognize when identification is impossible without additional assumptions.

Pitfall 3: Overcontrolling

Don't just "control for everything." Controlling for mediators, colliders, or descendants of treatment introduces bias.

Pitfall 4: Reverse Causation

Be careful about arrow direction. Does loyalty cause promo receipt, or does promo receipt cause loyalty? Often both (feedback loops), which violates acyclicity; you may need to model time explicitly.

Pitfall 5: Ignoring Sample Selection

If your sample is selected based on certain criteria (e.g., only analyzing customers who received an email), you're conditioning on a collider. Model the selection process explicitly.

Tools for DAG Analysis:

  • DAGitty: Web-based tool for drawing DAGs and computing adjustment sets (http://dagitty.net)
  • ggdag (R): R package for creating and analyzing DAGs
  • CausalNex (Python): Python library for causal reasoning with Bayesian networks
  • DoWhy (Python): Microsoft's library for causal inference, includes DAG-based identification

12. Key Takeaways

✓ DAGs provide a formal language for encoding causal assumptions and determining which variables to control for

✓ Confounders (X ← C → Y) create backdoor paths and must be controlled for to block spurious associations

✓ Mediators (X → M → Y) lie on the causal path; don't control for them if you want the total effect

✓ Colliders (X → L ← Y) block paths when uncontrolled; conditioning on them opens paths and induces bias

✓ d-separation tells us when variables are conditionally independent given a set of controls

✓ Backdoor criterion identifies valid adjustment sets: block all backdoor paths, don't include descendants of treatment

✓ Frontdoor criterion provides identification when unmeasured confounding exists but all effects are mediated

✓ do-calculus is the general framework; backdoor/frontdoor are special cases

✓ DAGs encode assumptions; they're only as good as your domain knowledge. Use them to make assumptions explicit and testable

13. Next Week Preview

Now we know which variables to control for (thanks to DAGs and backdoor criterion). Next week, we'll learn how to control for them using matching and propensity scores:

  • Exact matching and coarsened exact matching
  • Propensity score estimation and interpretation
  • Matching, stratification, and inverse probability weighting
  • Assessing covariate balance and overlap
  • Sensitivity analysis for unobserved confounding

We'll continue with our promotional discount example, using customer loyalty and other features to create comparable treatment and control groups.

Business Case Study: Interview Approach

📊 Case: Measuring the Impact of Premium Features on Customer Retention

Context: You're a senior data scientist at CloudStore, a SaaS company offering cloud storage. The product team recently launched premium features (advanced collaboration tools, version control, priority support) available to all users. They want to understand if premium feature adoption causes higher retention.

Initial analysis shows users who adopted premium features have 40% annual retention vs. 25% for non-adopters (a 15pp gap). The VP of Product concludes: "Premium features cause massive retention gains; let's invest more in these features and push adoption!"

However, you notice several complications:

  • Premium features were not randomly assigned; users self-selected based on need
  • Adoption required setup time, so only engaged users completed onboarding
  • The company targeted large enterprise teams with premium upsell campaigns
  • Usage data shows adopters were already more active before adoption

Your task: Draw a causal graph (DAG) to represent this scenario, identify which variables to control for, explain your reasoning using the backdoor criterion, and propose an analysis plan.

Step 1: Clarifying Questions & Data Discovery

Ask these questions to understand the data generation process:

Treatment Assignment:

  • Q: How did users get premium features? Self-serve activation? Sales-driven? Invited beta?
  • A: Self-serve for smaller teams (<10 users); enterprise teams were actively targeted by sales based on contract size and engagement metrics.
  • Q: Was there any randomization or A/B test involved?
  • A: No, completely observational.
  • Q: Were there barriers to adoption (cost, complexity, prerequisites)?
  • A: No additional cost (freemium model), but required 15-30 min onboarding and admin privileges.

Available Data:

  • Q: What user-level features do we have before premium adoption?
  • A: Account age, team size, industry, pre-adoption activity (logins, uploads, storage used), engagement scores, plan type.
  • Q: Can we measure user engagement continuously, or only at adoption?
  • A: Continuous; we have weekly engagement metrics pre- and post-adoption.
  • Q: What's our definition of retention?
  • A: Active usage in month 12 after observation period starts (binary outcome).

Potential Confounders:

  • Q: What drives both adoption and retention?
  • Hypotheses: Team size (large teams need collaboration → adopt features; also have inertia → stay retained), Industry (tech companies more engaged), Product fit (high-value use cases drive both), Engagement momentum (already-active users both adopt and stay).
  • Q: Could retention affect adoption (reverse causation)?
  • A: Unlikely at the individual level (adoption happens first), but at the cohort level, maybe: users who plan to stay might invest in setup. Need to carefully define timing.

Mediators & Colliders:

  • Q: How do premium features affect retention? What's the mechanism?
  • Hypotheses: Premium features → Increased collaboration → More lock-in/stickiness → Retention. Also: Premium features → Perceived value → Retention.
  • Q: Are there any post-treatment variables that might be colliders?
  • Hypothesis: Customer support ticket volume might be a collider (caused by both adoption complexity and by dissatisfaction/churn risk).

These questions help you understand the causal structure before drawing the DAG. In a real interview, spend 10-15 minutes on this phase.

Step 2: Construct the Causal DAG

Walk through DAG construction step-by-step, explaining each decision:

Step 2a: Core Causal Relationship

Premium Adoption → Retention

This is the causal effect we want to estimate.

Step 2b: Add Observed Pre-Treatment Confounders


       Team Size          Industry
          ↓    ↓              ↓
          ↓    └─────┬────────┘
          ↓          ↓
          └──→ Premium Adoption → Retention
                     

Reasoning:

  • Team Size → Premium Adoption: Larger teams need collaboration features
  • Team Size → Retention: Larger teams have higher switching costs, inertia
  • Industry → Premium Adoption: Tech companies adopt faster
  • Industry → Retention: Some industries have higher baseline retention

These create backdoor paths that bias our estimate if not controlled.

Step 2c: Add Pre-Treatment Engagement (Time-Varying Confounder)


       Team Size          Industry
          ↓    ↓              ↓
          ↓    └─────┬────────┘
          ↓          ↓
          └──→ Premium Adoption → Retention
                  ↑                  ↑
                  └──────────────────┘
                Pre-Adoption Engagement
                     

Reasoning:

  • Pre-Adoption Engagement → Premium Adoption: Engaged users are more likely to explore and adopt new features
  • Pre-Adoption Engagement → Retention: Already-engaged users are more likely to stay regardless of premium features

Critical confounder! If not controlled, we'll massively overestimate the premium feature effect.

Step 2d: Add Sales Targeting (Selection Mechanism)


       Team Size          Industry
          ↓    ↓              ↓
          ↓    └─────┬────────┘
          ↓          ↓
          ↓    Sales Targeting
          ↓          ↓
          └──→ Premium Adoption → Retention
                  ↑                  ↑
                  └──────────────────┘
                Pre-Adoption Engagement
                     

Reasoning:

  • Industry & Team Size → Sales Targeting: Enterprise teams in certain industries get targeted outreach
  • Sales Targeting → Premium Adoption: Outreach increases adoption likelihood

Sales Targeting is a mediator between Team Size/Industry and Adoption. We typically don't need to control for it (it's part of how Team Size affects Adoption), but it helps us understand the selection mechanism.

Step 2e: Add Post-Treatment Mediator (Collaboration Activity)


       Team Size          Industry
          ↓    ↓              ↓
          ↓    └─────┬────────┘
          ↓          ↓
          ↓    Sales Targeting
          ↓          ↓
          └──→ Premium Adoption → Post-Adoption Collaboration → Retention
                  ↑                                                ↑
                  └────────────────────────────────────────────────┘
                            Pre-Adoption Engagement
                     

Reasoning:

  • Premium Adoption → Post-Adoption Collaboration: Premium features enable more team collaboration
  • Post-Adoption Collaboration → Retention: Collaboration creates lock-in and value

This is a mediator! DON'T control for it if estimating total effect.

However, if you want to understand mechanism ("How much of retention gain is mediated through collaboration?"), you might do mediation analysisβ€”but that's a separate question.

Step 2f: Add Collider (Support Tickets)


       Team Size          Industry
          ↓    ↓              ↓
          ↓    └─────┬────────┘
          ↓          ↓
          ↓    Sales Targeting
          ↓          ↓
          └──→ Premium Adoption → Post-Adoption Collaboration → Retention
                  ↓                                                ↑
                  ↓                                                ↑
                  └──────→ Support Tickets ←───────────────────────┘
                                                   (Churn Risk)
                     

Reasoning:

  • Premium Adoption → Support Tickets: Adopting new features causes setup questions, complexity issues
  • Churn Risk → Support Tickets: Dissatisfied users contact support before churning

This is a collider! Conditioning on support tickets would induce bias (collider bias).

Example: Among users who contacted support, premium adopters might appear to have lower retention (because non-adopters only contact support when very dissatisfied), creating a spurious negative association.

Step 2g: Final Complete DAG


    Team Size              Industry        Pre-Adoption
        ↓                      ↓            Engagement
        ↓                      ↓                ↓
        └────────┬─────────────┴─────────────────
                 ↓                              ↓
           Sales Targeting                      ↓
                 ↓                              ↓
                 └──────────┬───────────────────┘
                            ↓
                      Premium Adoption
                            ↓
                            ├────→ Post-Adoption Collaboration
                            ↓                ↓
                            ↓                ↓
                      Support Tickets       ↓
                            ↑                ↓
                            └────────┬───────┘
                                     ↓
                                 Retention
                     

This DAG encodes our causal assumptions. Now we use it to determine what to control for.

Step 3: Apply Backdoor Criterion & Identify Adjustment Set

Step 3a: List All Paths from Premium Adoption to Retention

  1. Causal path (frontdoor):
    Premium Adoption → Post-Adoption Collaboration → Retention
    This is the mechanism we want to capture (keep open)
  2. Backdoor path via Team Size:
    Premium Adoption ← Team Size → Retention
    (Confounding: Large teams both adopt more AND have higher baseline retention)
  3. Backdoor path via Industry:
    Premium Adoption ← Industry → Retention
    (Confounding: Industry affects both adoption and retention)
  4. Backdoor path via Pre-Adoption Engagement:
    Premium Adoption ← Pre-Adoption Engagement → Retention
    (Critical confounder: Engaged users both adopt and retain more)
  5. Collider path (blocked by default):
    Premium Adoption → Support Tickets ← Retention
    (This path is naturally blocked unless we condition on Support Tickets)

Step 3b: Determine What to Control For (Backdoor Criterion)

Backdoor Criterion Requirements:

  1. Control set must block all backdoor paths (paths with arrow into treatment)
  2. Control set must NOT include descendants of treatment (no mediators/colliders)

Valid Adjustment Sets:

✓ Minimal Sufficient Set:

Z = { Team Size, Industry, Pre-Adoption Engagement }

  • Blocks all three backdoor paths
  • Doesn't include any post-treatment variables
  • Most efficient (fewest variables while satisfying backdoor criterion)

Alternative Valid Set (with Sales Targeting):

Z = { Team Size, Industry, Pre-Adoption Engagement, Sales Targeting }

  • Also valid (Sales Targeting is pre-treatment)
  • Unnecessary (Team Size + Industry already block paths through Sales Targeting)
  • Could improve precision if Sales Targeting strongly predicts adoption

✗ Invalid Set (includes mediator):

Z = { Team Size, Industry, Pre-Adoption Engagement, Post-Adoption Collaboration }

  • Violates backdoor criterion (includes descendant of treatment)
  • Would block the causal pathway Premium → Collaboration → Retention
  • Underestimates total effect (only captures direct effect)

✗ Invalid Set (includes collider):

Z = { Team Size, Industry, Pre-Adoption Engagement, Support Tickets }

  • Conditioning on collider induces spurious association
  • Creates selection bias within support ticket sub-population
  • Biases the estimate (magnitude and direction unpredictable)

✗ Insufficient Set (missing confounder):

Z = { Team Size, Industry }

  • Doesn't block backdoor path through Pre-Adoption Engagement
  • Residual confounding remains; omitted variable bias
  • Would overestimate premium feature effect

Step 3c: Verify with d-Separation

Check: Is Premium Adoption independent of Retention given our control set?

After conditioning on Z = { Team Size, Industry, Pre-Adoption Engagement }:

  • All backdoor paths blocked (forks closed by conditioning on confounders)
  • Causal path remains open (Premium → Collaboration → Retention)
  • Collider path remains blocked (not conditioning on Support Tickets)

✓ d-separation holds for backdoor paths. We can identify the causal effect!

Step 4: Detailed Analysis Plan & Methods

Phase 1: Validate DAG Assumptions (Testable Implications)

Before trusting our DAG, test its implied conditional independencies:

  1. Test 1: Team Size ⊥ Pre-Adoption Engagement | Industry?
    Our DAG implies these should be independent given Industry. Run regression/correlation test. If violated, revise DAG.
  2. Test 2: Industry ⊥ Support Tickets | { Premium Adoption, Retention }?
    Our DAG implies Industry doesn't directly cause Support Tickets. Test this.
  3. Test 3: Check for unmeasured confounding using placebo outcomes
    Test if Premium Adoption predicts pre-treatment retention/activity. Should be zero if we've captured all confounders.

If tests fail, update the DAG and re-derive adjustment set. Iterate until DAG is consistent with data.
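Here is a sketch of Tests 1 and 3 in code, using a small simulated stand-in table (all column names such as pre_engagement and pre_period_retention are hypothetical; swap in the real CloudStore fields). The same toy table is reused by the estimation sketches in Phase 3:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in table; every column name and coefficient is made up for illustration.
rng = np.random.default_rng(4)
n = 5_000
df = pd.DataFrame({
    "industry": rng.choice(["tech", "retail", "other"], n),
    "team_size": rng.poisson(8, n),
    "pre_engagement": rng.normal(0, 1, n),
})
adoption_logit = 0.1 * df["team_size"] + df["pre_engagement"] - 1
df["premium_adoption"] = rng.binomial(1, 1 / (1 + np.exp(-adoption_logit)))
df["pre_period_retention"] = rng.binomial(1, 0.30 + 0.10 * (df["pre_engagement"] > 0))
df["retention"] = rng.binomial(
    1,
    0.22
    + 0.03 * df["premium_adoption"]                  # simulated true effect: +3pp
    + 0.10 * (df["pre_engagement"] > 0)
    + 0.005 * np.clip(df["team_size"], 0, 20),
)

# Test 1: the DAG says Team Size is independent of Pre-Adoption Engagement given
# Industry, so the team_size coefficient here should be close to zero.
ci_test = smf.ols("pre_engagement ~ team_size + C(industry)", data=df).fit()
print(ci_test.params["team_size"], ci_test.pvalues["team_size"])

# Test 3 (placebo): adoption should not "predict" retention measured BEFORE adoption
# once confounders are controlled; a clearly nonzero coefficient points to an
# unmeasured confounder or a mis-specified DAG.
placebo = smf.ols(
    "pre_period_retention ~ premium_adoption + team_size + C(industry) + pre_engagement",
    data=df,
).fit()
print(placebo.params["premium_adoption"])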

Phase 2: Check Overlap/Positivity

Ensure we have common support, meaning both adopters and non-adopters exist across the covariate space:

  • Plot distributions of Team Size, Industry, Pre-Adoption Engagement for adopters vs. non-adopters
  • Check for regions where P(Adoption | X) ≈ 0 or ≈ 1 (no overlap → can't estimate causal effect)
  • Consider trimming extreme propensity scores (Week 3 methods)
  • If overlap is poor, consider restricting analysis to common support region

Overlap violations mean we can't credibly estimate effects for some subpopulations; be transparent about this.

Phase 3: Estimate Causal Effect (Multiple Methods for Robustness)

Method 1: Regression Adjustment

Retention ~ Premium Adoption + Team Size + Industry + Pre-Adoption Engagement
  • Pros: Simple, interpretable, fast
  • Cons: Assumes linear relationships, no interactions
  • When to use: As baseline; if relationships are roughly linear
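A minimal sketch of Method 1 with statsmodels, reusing the toy table and hypothetical column names from the Phase 1 sketch above (a linear probability model keeps the coefficient readable as a percentage-point effect):

import statsmodels.formula.api as smf   # df is the toy table from the Phase 1 sketch

m1 = smf.ols(
    "retention ~ premium_adoption + team_size + C(industry) + pre_engagement",
    data=df,
).fit(cov_type="HC1")                          # heteroskedasticity-robust standard errors
print(m1.params["premium_adoption"])           # should land near the simulated +0.03
print(m1.conf_int().loc["premium_adoption"])   # rough 95% confidence interval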

Method 2: Propensity Score Matching (Week 3 preview)

Estimate P(Premium Adoption | Z), then match adopters to similar non-adopters:

  • Pros: Non-parametric, transparent balance checking, handles non-linearities
  • Cons: Loses unmatched units, matching algorithm choices matter
  • When to use: When you want to mimic an RCT design; when stakeholders want "matched pairs"

→ We'll cover this in detail in Week 3

Method 3: Inverse Probability Weighting (IPW)

Weight each observation by inverse of propensity score to create pseudo-population:

  • Pros: Uses all data, simple weighted mean comparison
  • Cons: Sensitive to extreme weights (need weight trimming)
  • When to use: When you want ATE (not ATT); when overlap is good
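A minimal IPW sketch on the same toy table (hypothetical column names), with basic weight trimming:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Propensity model P(adoption = 1 | Team Size, Industry, Pre-Adoption Engagement),
# again using the toy df from the Phase 1 sketch.
X = pd.get_dummies(df[["team_size", "pre_engagement", "industry"]], drop_first=True)
ps = LogisticRegression(max_iter=1000).fit(X, df["premium_adoption"]).predict_proba(X)[:, 1]
ps = np.clip(ps, 0.01, 0.99)          # trim extreme propensities to stabilize weights

t = df["premium_adoption"].to_numpy()
y = df["retention"].to_numpy()

# Weighted comparison of adopters (weight 1/ps) and non-adopters (weight 1/(1-ps)).
ate_ipw = (np.average(y[t == 1], weights=1 / ps[t == 1])
           - np.average(y[t == 0], weights=1 / (1 - ps[t == 0])))
print(f"IPW estimate of the ATE: {ate_ipw:.3f}")   # should be near the simulated +0.03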

Method 4: Doubly Robust Estimation (Best Practice)

Combine outcome regression + propensity weighting:

  • Pros: Consistent if EITHER outcome model OR propensity model is correct
  • Cons: More complex, requires both models
  • When to use: For robustness; as your primary estimate
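And a doubly robust (AIPW) sketch that combines outcome regressions with the propensity scores ps (and t, y) from the IPW block above, on the same toy table with the same caveats:

import numpy as np
import statsmodels.formula.api as smf

# Outcome models fit separately among adopters and non-adopters.
mu1 = smf.ols("retention ~ team_size + C(industry) + pre_engagement",
              data=df[df["premium_adoption"] == 1]).fit().predict(df).to_numpy()
mu0 = smf.ols("retention ~ team_size + C(industry) + pre_engagement",
              data=df[df["premium_adoption"] == 0]).fit().predict(df).to_numpy()

# AIPW estimator: consistent if either the outcome models or the propensity model is right.
aipw = np.mean(mu1 - mu0 + t * (y - mu1) / ps - (1 - t) * (y - mu0) / (1 - ps))
print(f"AIPW (doubly robust) estimate: {aipw:.3f}")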

Recommendation: Report estimates from all methods. If they agree → strong evidence. If they diverge → investigate sensitivity.

Phase 4: Sensitivity Analysis

Assess robustness to unmeasured confounding:

  1. Rosenbaum bounds (Week 3): How strong would an unmeasured confounder need to be to overturn our conclusion?
  2. Negative control outcomes: Estimate "effect" of Premium Adoption on outcomes it couldn't plausibly cause (e.g., pre-treatment retention). Should be zero.
  3. Placebo tests: Estimate effect using "fake" treatment date before actual adoption. Should find no effect.
  4. Subgroup analysis: Check if effects vary by Team Size or Industry (heterogeneity). It's unlikely that premium features work equally well for everyone.

Phase 5: Mechanism Analysis (Optionalβ€”If Time Permits)

If stakeholders ask "HOW do premium features improve retention?", conduct mediation analysis:

  • Estimate indirect effect: Premium → Collaboration → Retention
  • Estimate direct effect: Premium → Retention (controlling for Collaboration)
  • Decompose total effect = direct effect + indirect effect

Caution: Mediation analysis requires an additional "sequential ignorability" assumption: mediator assignment is unconfounded given covariates. Often violated.

Step 5: Present Findings & Handle Pushback

Scenario 1: Effect is Much Smaller Than Naive Estimate

Naive estimate: 15pp retention gap (40% vs 25%)

Causal estimate after adjustment: 3pp (95% CI: [1pp, 5pp])

How to communicate:

"The observed 15-point retention gap is misleading because it conflates the causal effect of premium features with selection bias. Users who adopted premium features were already more engaged, came from larger teams with higher baseline retention, and were in industries with better retention."

"After controlling for these pre-existing differences using causal inference methods, we estimate premium features cause a 3 percentage point increase in retention. This is still valuableβ€”it's a 12% relative lift on the 25% baselineβ€”but it's much smaller than the naive 15pp gap."

"The good news: This is the true causal effect we can expect if we increase adoption. The 15pp gap was an upper bound inflated by selection bias."

Anticipated pushback:

VP: "So we should abandon premium features? Only 3pp isn't worth the investment!"

Your response: "Not at all. A 3pp retention increase across our user base translates to $X million in incremental revenue (show calculation). The real insight is about targeting: we should focus on driving adoption among users most likely to benefit, rather than assuming premium features are a silver bullet for everyone. Heterogeneous effects analysis (subgroup breakdown) will help us optimize."

Scenario 2: Explaining Why We Can't Control for Post-Treatment Variables

Anticipated question:

PM: "Why didn't you control for post-adoption collaboration activity? That seems important!"

Your explanation:

"Post-adoption collaboration is a mediatorβ€”it's part of HOW premium features improve retention. If we control for it, we'd be blocking the causal pathway we want to measure. Think of it this way: premium features work by enabling collaboration, which creates stickiness."

"If we condition on collaboration, we'd only estimate the direct effect (premium features improving retention through pathways OTHER than collaboration), which misses the main mechanism. That would underestimate the true impact."

"However, if you're specifically interested in how much of the effect is mediated through collaboration, we can do a separate mediation analysis to decompose the total effect."

Scenario 3: Explaining Why We Can't Just "Control for Everything"

Anticipated question:

Analyst: "To be safe, shouldn't we control for every variable we have? More controls = less bias, right?"

Your explanation:

"Actually, no. Controlling for the wrong variables can increase bias. There are two dangerous cases:"

  1. Colliders: Variables caused by both treatment and outcome. Conditioning on them induces spurious correlation (collider bias). In our case, support tickets are a collider: they're caused by both premium adoption complexity AND by churn risk. Controlling for them would create bias.
  2. Mediators: Variables on the causal pathway from treatment to outcome. Controlling for them blocks the effect we want to measure. Post-adoption collaboration is a mediator: it's how premium features work.

"This is why we use causal graphs (DAGs) to carefully determine the minimal sufficient adjustment setβ€”the smallest set of variables that blocks all confounding without introducing new bias."

Key Messaging Points for Executive Summary:

  • The Problem: Observational data shows adopters have 15pp higher retention, but this conflates causation with selection bias.
  • Our Approach: We used causal graphs (DAGs) to identify confounders and applied multiple causal inference methods to isolate the true effect.
  • Key Finding: Premium features cause a 3pp retention increase (95% CI: [1, 5pp]), about 20% of the naive estimate.
  • Why the Difference: Most of the observed gap reflects pre-existing differences; adopters were already more engaged, on larger teams, and in higher-retention industries.
  • Business Impact: 3pp retention lift = $X million in incremental LTV. ROI is positive if adoption costs < $Y per user.
  • Recommendation: Continue investing in premium features, but optimize targeting to high-value segments most likely to benefit (use CATE/heterogeneity analysis).
  • Next Steps: Run sensitivity analysis for unmeasured confounding; explore which user segments benefit most; consider A/B test for validation.

Step 6: Handling Interview Variations & Follow-Up Questions

What if the interviewer asks: "What if we can't measure pre-adoption engagement?"

This is testing whether you understand unmeasured confounding and alternative identification strategies.

Answer:

"If pre-adoption engagement is unmeasured but correlated with both adoption and retention, we have an unmeasured confounder. The backdoor criterion fails. We'd need an alternative identification strategy:"

  1. Instrumental Variable (Week 4): Find something that affects adoption but not retention (except through adoption). Example: Random email delivery times might affect whether users see the premium feature announcement, without directly affecting retention.
  2. Difference-in-Differences (Week 4): If premium features rolled out at different times to different segments, compare retention trends before/after rollout, using not-yet-treated groups as controls.
  3. Regression Discontinuity (Week 4): If there's a threshold for premium feature eligibility (e.g., team size > 10), compare users just above vs. just below the threshold.
  4. Sensitivity Analysis: Use Rosenbaum bounds to quantify how strong an unmeasured confounder would need to be to change our conclusions.

What if the interviewer asks: "How would you validate your DAG?"

Answer:

  1. Test conditional independencies: The DAG implies certain variables should be independent given others. Test these using regression/correlation. If violated, revise DAG.
  2. Domain expert review: Present DAG to product managers, engineers who built premium features, customer success team. Do they agree with causal arrows?
  3. Placebo/falsification tests: If Premium Adoption has a causal effect, it shouldn't "predict" pre-treatment outcomes. Test if adoption predicts retention in the month BEFORE adoption; this should be zero (after controlling for confounders).
  4. Compare with quasi-experiment: If we have any natural experiment (e.g., premium features temporarily unavailable in some regions), use that as a benchmark to validate observational estimates.

What if the interviewer asks: "What if premium adoption affects team size (reverse causation)?"

Answer:

"If premium features are so valuable that they cause teams to grow (recruiting more members to take advantage of collaboration tools), we have feedback/bidirectional causation: Team Size β†’ Premium Adoption AND Premium Adoption β†’ Team Size (future). This violates the acyclicity assumption (no cycles)."

Solutions:

  • Measure time explicitly: Use Team Size at T-1 (before adoption) as confounder. Exclude post-adoption team size from DAG.
  • Lagged DAG: Create a time-indexed DAG with TeamSize_{t-1} → Adoption_t → TeamSize_t → Retention_{t+1}.
  • Fixed effects: If using panel data, include user fixed effects to control for time-invariant team characteristics.

What if the interviewer asks: "What if we want heterogeneous effects (CATE)β€”which users benefit most?"

Answer:

"Great question! Heterogeneous treatment effects are critical for targeting. We'd estimate Conditional Average Treatment Effects (CATE) by subgroup:"

  1. Regression with interactions: Add Premium × TeamSize, Premium × Industry, Premium × Engagement interactions to our regression model.
  2. Subgroup analysis: Stratify by Team Size quartiles, Industry, Engagement terciles, and estimate separate effects within each stratum using propensity matching.
  3. Causal Forests (Week 6): Use machine learning to flexibly estimate CATE as a function of all covariates; discovers heterogeneity without pre-specifying interactions.
  4. Meta-learners (Week 7): T-learner, S-learner, X-learner to estimate individualized treatment effects.

This lets us build a targeting policy: "Offer premium to users with CATE > threshold."
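Here is a sketch of approach 1 on the same toy table (hypothetical column names): interaction terms let the premium effect vary with team size and engagement, and the fitted coefficients give a rough CATE for any user profile. (In the toy data the simulated effect is constant, so both profiles below should come out similar; with real data they may differ.)

import statsmodels.formula.api as smf   # df is the toy table from the Phase 1 sketch

# Interactions let the premium effect differ by team size and engagement.
cate_model = smf.ols(
    "retention ~ premium_adoption * (team_size + pre_engagement) + C(industry)",
    data=df,
).fit()

b = cate_model.params
def cate(team_size, pre_engagement):
    """Model-implied effect of adoption for a user with these covariates."""
    return (b["premium_adoption"]
            + b["premium_adoption:team_size"] * team_size
            + b["premium_adoption:pre_engagement"] * pre_engagement)

print(f"estimated effect, team of 3:  {cate(3, 0.0):+.3f}")
print(f"estimated effect, team of 25: {cate(25, 0.0):+.3f}")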

Common Pitfalls & Mistakes to Avoid
  • Pitfall 1: Drawing arrows based on correlations
    DAG arrows represent causal effects, not associations. Just because X and Y are correlated doesn't mean X β†’ Y or Y β†’ X. They could both be caused by unmeasured Z.
  • Pitfall 2: Forgetting to include unmeasured confounders in DAG
    Show U (unmeasured) nodes in your DAG to make assumptions explicit. Helps identify when backdoor criterion fails and you need alternatives (IV, RDD, DiD).
  • Pitfall 3: Confusing "controlling for X helps" with "controlling for X is necessary"
    Just because X is correlated with outcome doesn't mean you should control for it. Could be a mediator (shouldn't control) or collider (shouldn't control). Use DAG, not correlations, to decide.
  • Pitfall 4: Ignoring sample selection bias
    If your analysis dataset is a selected sample (e.g., only users who completed onboarding), you're conditioning on a potential collider. Must model selection explicitly or acknowledge limitation.
  • Pitfall 5: Treating DAG as ground truth rather than assumption
    DAGs encode assumptions, not facts. Be humble: present the DAG as "our assumed causal structure based on domain knowledge," validate where possible, and do sensitivity analysis.
  • Pitfall 6: Not considering time/dynamics
    Many real scenarios have feedback loops or time-varying confounding (e.g., engagement affects adoption, adoption affects future engagement). Use lagged/time-indexed DAGs or panel methods.
  • Pitfall 7: Overconfidence in observational causal estimates
    Even with perfect DAG and sufficient adjustment set, unobserved confounding could remain. Always recommend A/B test for validation if feasible. Present observational estimates with appropriate caveats.