Essential patterns and trade-offs every ML engineer should understand
Machine learning in production is less about fancy algorithms and more about building reliable systems that actually work at scale. The best way to learn this is through a real example.
In this guide, we'll design a restaurant ranking system for DoorDash from scratch. Instead of abstract theory, we'll introduce each concept—data pipelines, feature stores, two-stage retrieval, monitoring—exactly when we need it to solve a real problem. By the end, you'll understand how all the pieces fit together.
Our Problem: Design a system that shows users the most relevant restaurants when they open the DoorDash app. It needs to be fast (<200ms), personalized, and handle millions of users.
Before we jump into solutions, let's break down what this problem is really asking. Here are the questions that should immediately pop up:
What does "most relevant" mean?
Are users searching for something specific ("sushi near me") or are we proposing restaurants unprompted? These are two different problems:
For this design, we'll focus on the browse case (no query, just show relevant restaurants). The search case is similar but adds text matching.
How fast is "fast"?
<200ms is mentioned. Why does this matter? Because users leave if it's slow. Every 100ms of latency costs you conversions. This constraint immediately rules out certain approaches:
What does "personalized" mean?
Users should see restaurants aligned with their preferences. This is where the ML modeling happens.
We need to predict: "Will this user order from this restaurant?" This requires:
What does "millions of users" tell us?
Scale on the user side:
What about "hundreds of thousands of restaurants"?
Scale on the inventory side:
This is why we use a two-stage approach: filter 500K down to 200 candidates (fast), then rank those 200 precisely (slower but acceptable).
Why are we doing this? (The metric question)
The problem doesn't explicitly state the goal, but we need to ask: what metric defines success?
This metric drives every design decision. Latency matters because slow = lower conversions. Personalization matters because relevant = higher conversions.
What We Know So Far
By now, just from breaking down the problem statement, we already have a rough mental picture of what the system will look like:
What's next: Now that we understand the high-level structure, let's dig into each piece—how do we actually build the data pipelines? How does the two-stage retrieval work? What happens when things break? We'll walk through each component step by step.
Before we build anything, let's trace the data. Every ML system starts with events—things users do. For DoorDash, we have three types of participants generating data:
Consumer events:
App open: {user_id, timestamp, location}
Click: {user_id, restaurant_id, position_in_list}
Order: {user_id, restaurant_id, order_value, items}
Rating: {user_id, restaurant_id, rating}

Merchant events:
Menu update: {restaurant_id, menu_changes, timestamp}
Item availability: {restaurant_id, item_id, available: false}
Hours change: {restaurant_id, open_time, close_time}

Dasher events:
Order accepted: {dasher_id, order_id, timestamp}
Location ping: {dasher_id, lat, lng, timestamp}
Delivery completed: {dasher_id, order_id, delivery_time}

These events are the raw material for our ML system. They flow in real time, millions per hour. Consumer events help us predict what users want, merchant events tell us what's available, and dasher events help us estimate delivery time.
We need a system that can:
Event Ingestion Architecture
⚠️ Why Split Into Multiple Paths?
Using a single Kafka topic for everything creates problems:
Key Takeaway
Events are routed to different destinations based on their purpose: batch analytics needs long retention and high throughput, real-time features need low latency and recent data, and monitoring metrics bypass Kafka entirely for speed. This separation prevents slow consumers from blocking fast ones.
Why Kafka?
Raw events aren't useful by themselves. We need to aggregate them into features. Here's what we can derive:
We also have static data that doesn't change frequently. This comes from databases, not event streams:
Key difference: Events are continuous streams (clicks happen every second), while static data arrives as batched updates (ratings recomputed daily or weekly).
Events flow into two separate pipelines with different goals:
Path 1: Real-Time (Streaming)
Goal: Update features that matter right now
Tools: Kafka → Stream processing (Kafka Streams/Flink) → Redis cache
Latency: Milliseconds to seconds
Path 2: Batch (Historical)
Goal: Compute features from long-term patterns
Tools: Kafka → S3 (data lake) → Spark/Airflow (batch processing) → Feature Store
Latency: Hours to days (runs overnight, who cares)
Key Insight: Same Events, Different Timescales
A single "user places order" event contributes to:
(user_123, restaurant_456) → ordered = True

The same data, consumed three different ways, at three different speeds.
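As a concrete sketch (all names and field values are illustrative, not DoorDash's actual schemas), here is one order event fanning out to three consumers at three different speeds:

```python
from collections import defaultdict

# One "user places order" event, as it might arrive off the event stream.
event = {"user_id": "user_123", "restaurant_id": "restaurant_456",
         "order_value": 32.50, "timestamp": 1700000000}

# 1. Real-time path (seconds): bump a live order counter for the restaurant.
live_order_counts = defaultdict(int)
live_order_counts[event["restaurant_id"]] += 1

# 2. Batch path (overnight): append to a log that Spark aggregates later.
event_log = []
event_log.append(event)

# 3. Training data (weekly): the same event becomes a positive label.
training_example = (event["user_id"], event["restaurant_id"], True)

print(training_example)  # ('user_123', 'restaurant_456', True)
```

Same bytes in, three destinations out: a cache increment now, a log row tonight, a label next week.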
We have two types of data flowing into the system:
1. User Events (Dynamic - Changes Every Second)
2. Static Data (Changes Rarely - Updated Daily/Hourly)
Raw events like "user clicked restaurant" aren't useful by themselves. Models need numbers. We need to transform raw data into features—numerical representations that capture patterns.
Example: Turning Events into Features
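As a hedged example (field names are assumed for illustration), a handful of raw click events can be aggregated into a cuisine-affinity feature vector:

```python
from collections import Counter

# Illustrative click events, mirroring the event schemas above.
clicks = [
    {"user_id": "u1", "restaurant_id": "r1", "cuisine": "sushi"},
    {"user_id": "u1", "restaurant_id": "r2", "cuisine": "sushi"},
    {"user_id": "u1", "restaurant_id": "r3", "cuisine": "pizza"},
]

def cuisine_preferences(events):
    """Turn raw click events into a normalized cuisine-affinity vector —
    the kind of numeric representation a model can actually consume."""
    counts = Counter(e["cuisine"] for e in events)
    total = sum(counts.values())
    return {cuisine: n / total for cuisine, n in counts.items()}

features = cuisine_preferences(clicks)
print(features)  # sushi ≈ 0.67, pizza ≈ 0.33
```

The raw events say "this user clicked"; the feature says "this user is two-thirds a sushi person" — and that's what the model trains on.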
Here's the tricky part: we need features in two places:
⚠️ Training-Serving Skew
If you compute features differently in training vs serving, your model will fail in production:
SELECT cuisine, COUNT(*) FROM clicks GROUP BY cuisine

✓ Solution: Feature Store
One place to define features. Write the logic once, use everywhere:
Tools: Feast, Tecton, AWS Feature Store
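Feast, Tecton, and friends each have their own APIs; the core principle can be sketched generically as a single registry that both the training pipeline and the serving path resolve features from (everything below is an illustration, not any real feature store's API):

```python
from collections import Counter

# Minimal sketch of the feature-store principle: one registered
# definition, reused by both training and serving.
FEATURE_REGISTRY = {}

def feature(name):
    """Decorator that registers a feature computation under a name."""
    def register(fn):
        FEATURE_REGISTRY[name] = fn
        return fn
    return register

@feature("cuisine_click_share")
def cuisine_click_share(clicks):
    # Same logic whether called from a batch training job or the online
    # serving path — which is exactly what prevents training-serving skew.
    counts = Counter(c["cuisine"] for c in clicks)
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}

# Training and serving both resolve the feature by name:
compute = FEATURE_REGISTRY["cuisine_click_share"]
print(compute([{"cuisine": "sushi"}, {"cuisine": "sushi"}]))  # {'sushi': 1.0}
```

The point isn't the decorator — it's that there is exactly one place where "cuisine_click_share" is defined, so the SQL-in-training, Python-in-serving divergence can't happen.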
Two feature computation paths:
Once we have features, how do we actually rank restaurants? Let's go from simplest to most sophisticated:
Approach 1: Heuristic Ranking (No ML)
Approach 2: Logistic Regression (Simple ML)
Approach 3: Gradient Boosted Trees (LightGBM/XGBoost)
Approach 4: Neural Networks (Two-Tower / Deep Learning)
Recommended Approach for DoorDash
Problem: We have 500K restaurants. Running our ML model on all of them for every user request = too slow. We'd miss our 200ms deadline.
Solution: Don't rank everything. Instead, use two stages:
Why this works: Stage 1 eliminates obviously bad candidates cheaply. Stage 2 uses expensive computation only on promising options. Total: ~100ms, well under our 200ms budget.
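A minimal sketch of the two-stage idea, with simulated scores standing in for the real retrieval index and ML model (the catalog, scores, and cutoffs here are illustrative):

```python
import heapq
import random

random.seed(0)

# Hypothetical 500K-restaurant catalog with a cheap precomputed quality
# score; in production Stage 1 would hit a search index, not a list scan.
restaurants = [{"id": i, "quality": random.random()} for i in range(500_000)]

def cheap_filter(candidates, k=200):
    """Stage 1 (retrieval): cheap score, keep top-k of the full catalog."""
    return heapq.nlargest(k, candidates, key=lambda r: r["quality"])

def expensive_model_score(user_id, restaurant):
    """Stage 2 stand-in for an ML model: too slow to run on 500K items,
    fine to run on 200."""
    return restaurant["quality"] * random.random()

def rank(user_id):
    shortlist = cheap_filter(restaurants)              # 500K -> 200
    shortlist.sort(key=lambda r: expensive_model_score(user_id, r),
                   reverse=True)                       # precise ranking on 200
    return shortlist[:20]                              # one page of results

top = rank("user_123")
print(len(top))  # 20
```

The expensive model runs 200 times instead of 500,000 — that factor of 2,500 is the entire latency trick.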
Real-time vs Batch
For DoorDash, we need online serving (computing predictions in real-time per request) because:
We can't precompute rankings for every user-location-time combination. That would be billions of possibilities.
100M requests per day = 1,200 requests per second on average, but dinner time (6-8pm) has 5x that. How do we handle it?
One server can't handle 6,000 requests/second. Solution: Run 100 identical model servers behind a load balancer. Each handles 60 req/sec. Easy to add more during peak hours.
What if the ML model times out? Feature store is down? Don't show an error—show something:
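A sketch of this graceful-degradation pattern (the fallback list, timeout, and error types are illustrative):

```python
# Precomputed daily by a batch job: popular restaurants per region.
# Worse than personalized ranking, far better than an empty screen.
PRECOMPUTED_POPULAR = ["r42", "r7", "r13"]

def rank_with_model(user_id):
    # Simulate an outage: model server didn't answer within budget.
    raise TimeoutError("model server did not respond in 100ms")

def get_ranking(user_id):
    """Degrade, don't fail: on model or feature-store errors, serve the
    popularity fallback instead of an error page."""
    try:
        return rank_with_model(user_id)
    except (TimeoutError, ConnectionError):
        return PRECOMPUTED_POPULAR

print(get_ranking("user_123"))  # ['r42', 'r7', 'r13']
```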
Your model is in production. Now what? You can't just "set it and forget it." Things break. User behavior changes. You need to watch it constantly.
⚠️ Your Model Creates Its Own Reality
Here's a vicious cycle:
But maybe B is actually better—users just never saw it!
✓ Solution: Exploration
Randomize 10% of rankings. Show some users "Pizza Place B" at the top even if the model ranks it #10. Now we get unbiased data on whether B is actually good. This prevents the model from getting stuck ranking the same restaurants forever.
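A minimal sketch of this exploration policy (the 10% rate matches the text; the full-shuffle choice is one of several reasonable options):

```python
import random

def rank_with_exploration(model_ranking, explore_rate=0.10, rng=random):
    """With probability explore_rate, shuffle the ranking so lower-ranked
    restaurants get impressions and we collect unbiased feedback on them."""
    if rng.random() < explore_rate:
        explored = list(model_ranking)
        rng.shuffle(explored)
        return explored
    return model_ranking  # exploit: serve the model's ranking as-is

ranking = ["Pizza Place A", "Sushi Spot", "Pizza Place B"]
result = rank_with_exploration(ranking)
print(len(result))  # 3 — same restaurants, order varies ~10% of the time
```

In practice you'd explore more gently (e.g. swap a few positions rather than shuffling everything), but the principle is the same: sacrifice a little short-term conversion for unbiased data.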
You've built your ML system. But how do you know if changes actually improve things? How do you test new models, features, or ranking strategies without breaking production? The answer: rigorous experimentation.
Every change—new model, new feature, different ranking algorithm—goes through controlled experimentation before full rollout.
Standard A/B Test Setup
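Assignment is typically a deterministic hash of the user ID, so the same user sees the same variant on every request and on every server. A sketch (using a stable hash, since Python's built-in hash() is salted per process):

```python
import hashlib

def assign_variant(user_id, treatment_pct=50):
    """Deterministic 50/50 split: hash the user ID into buckets 0-99 and
    send the first treatment_pct buckets to treatment."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Stable across calls, processes, and machines:
print(assign_variant("user_123") == assign_variant("user_123"))  # True
```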
hash(user_id) % 100 < 50

Scenario: Adding "Current Wait Time" Feature
Results After 2 Weeks
Decision: SHIP IT ✓
Primary metric improved significantly, secondary metrics positive or neutral, no guardrail violations. Roll out to 100% of traffic.
A/B tests are static: 50/50 split for the full duration. But what if one variant is clearly winning after 3 days? Multi-armed bandits dynamically allocate more traffic to better-performing variants.
Thompson Sampling Example
Testing 3 different ranking algorithms simultaneously:
Advantage: Minimizes "regret" (serving bad experiences). You don't waste 50% of traffic on a losing variant for 2 weeks.
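A runnable sketch of Thompson sampling over three variants (the conversion counts are made up for illustration): each variant keeps a Beta(conversions + 1, non-conversions + 1) posterior over its conversion rate, and each request is served by whichever variant draws the highest sample.

```python
import random

random.seed(42)

variants = {
    "algo_A": {"conversions": 120, "impressions": 1000},
    "algo_B": {"conversions": 150, "impressions": 1000},
    "algo_C": {"conversions": 90,  "impressions": 1000},
}

def choose_variant():
    """Sample a plausible conversion rate from each variant's posterior
    and serve the variant with the highest sample."""
    samples = {}
    for name, s in variants.items():
        alpha = s["conversions"] + 1
        beta = s["impressions"] - s["conversions"] + 1
        samples[name] = random.betavariate(alpha, beta)
    return max(samples, key=samples.get)

# Over many simulated requests, the best-performing variant soaks up
# most of the traffic, while weak variants are starved quickly.
served = [choose_variant() for _ in range(1000)]
print(max(set(served), key=served.count))  # algo_B dominates
```

Uncertain variants still get occasional traffic (their posteriors are wide), which is exactly the exploration/exploitation balance a fixed 50/50 split can't give you.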
Network Effects & Spillover
Problem: Restaurant capacity is shared. If treatment group orders more from "Joe's Pizza," control group sees longer wait times.
Solution: Test at restaurant-level or geographic clusters, not user-level.
Sample Ratio Mismatch (SRM)
Problem: You expect 50/50 split, but see 52/48. Something's wrong with randomization or logging.
Solution: Automated SRM checks. If detected, invalidate experiment and debug.
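One way to automate the check is a two-sided z-test on the observed split versus the expected 50/50. This stdlib-only sketch uses a deliberately strict alpha, since the check runs on every experiment and false alarms are costly:

```python
import math

def srm_check(n_control, n_treatment, expected=0.5, alpha=0.001):
    """Sample-ratio-mismatch check: is the observed split statistically
    incompatible with the expected one? Returns True if SRM is detected."""
    n = n_control + n_treatment
    observed = n_control / n
    se = math.sqrt(expected * (1 - expected) / n)
    z = (observed - expected) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return p_value < alpha  # True => invalidate the experiment and debug

# A 52/48 split over 100K users is far too unbalanced to be chance:
print(srm_check(52_000, 48_000))  # True (SRM detected)
# A genuinely balanced split passes:
print(srm_check(50_100, 49_900))  # False
```

The counterintuitive part: 52/48 "looks close" but at 100K users it's a 12-sigma event — SRM checks catch bugs that eyeballing never will.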
P-Hacking & Multiple Testing
Problem: You test 20 metrics. One shows p<0.05 by chance. You declare victory.
Solution: Pre-register primary metric. Apply Bonferroni correction for multiple tests. Require consistent directional wins.
Experimentation Infrastructure Requirements
Cold Start: New Restaurant Opens
Problem: "Joe's Pizza" just opened. Zero historical orders. Model has no idea if it's good.
Solution: Use content-based features (cuisine, price range, owner's other restaurants). Boost new restaurants temporarily for exploration. Give them a chance.
Real-Time Capacity
Problem: Model ranks "Burger King" #1. But they're slammed—60 minute wait. User orders, gets mad, cancels.
Solution: Add real-time feature: current order volume. Predict prep time. Penalize overloaded restaurants.
Model Works in Jupyter, Fails in Prod
Problem: Your notebook model gets 90% accuracy. In production: 60% accuracy. Why? Training-serving skew (features computed differently).
Solution: Integration tests. Compute features the same way in training and serving. Use feature store.
Every ML system involves fundamental tradeoffs. There are no universal "right" answers—only choices that match your constraints. Here are the most critical decisions you'll face:
1. Latency vs. Accuracy
The tradeoff: More complex models (deep neural networks, large ensembles) are more accurate but slower. Simpler models (linear models, small trees) are fast but less accurate.
Decision: For user-facing ranking, 50ms LightGBM wins. 2% accuracy isn't worth 200ms extra latency—users will leave. But for offline batch scoring (fraud detection, content moderation), use the neural network.
2. Real-Time vs. Batch Processing
The tradeoff: Real-time features are fresher but expensive to compute. Batch features are stale but cheap.
Decision: Use real-time for features that change rapidly and matter immediately (current wait time, inventory). Use batch for stable features (historical preferences, aggregate stats). Hybrid is common: 80% batch, 20% real-time.
3. Personalization vs. Simplicity
The tradeoff: Per-user models are more accurate but require more infrastructure, data, and cold-start handling. Global models are simpler but less tailored.
Decision: Global model with user features wins for most applications. Segments work for domains with distinct user types (B2B vs. B2C, power users vs. casual).
4. Model Complexity vs. Maintainability
The tradeoff: Complex models (ensembles, neural nets, multi-stage pipelines) squeeze out extra accuracy but are harder to debug, deploy, and explain.
Decision: Start simple. Only add complexity when you've exhausted simpler approaches and the business impact justifies the cost. Engineers leave, complex systems stay—make sure you can maintain it.
5. Precomputation vs. On-Demand
The tradeoff: Precomputing predictions (batch scoring) is simple and cheap. Computing on-demand (online serving) is flexible but requires more infrastructure.
Decision: For DoorDash, on-demand is necessary. For email spam detection, precomputation is fine. For YouTube recommendations, hybrid (precompute candidate pool, rerank on-demand).
6. Exploration vs. Exploitation
The tradeoff: Showing users what the model thinks is best (exploitation) maximizes short-term conversion. Showing random options (exploration) gathers data for long-term improvement but hurts immediate metrics.
Decision: Always explore at least 5-10%. The data quality improvement is worth the small conversion hit. Use bandits if you have the infrastructure.
Principle: Start Simple, Add Complexity Incrementally
Don't build the perfect system on day one. Instead:
Each version works and ships. You learn what matters by measuring real behavior, not guessing in a design doc.
Let's walk through a complete example: Sarah opens the DoorDash app at 6:45 PM on a Friday in San Francisco. What happens behind the scenes, from data ingestion to ranking, in under 150ms?
User Context
1. Request Arrives (t=0ms)
Load balancer routes request to ranking service instance #47 (current load: 45 req/sec)
2. Fetch User Features from Redis (t=3ms)
Cache hit! User features were precomputed last night by batch job, stored in Redis with 24h TTL.
3. Candidate Generation via ElasticSearch (t=23ms)
Optimization: ElasticSearch index sharded by geography. Mission District shard handles this query locally.
4. Batch Fetch Restaurant Features (t=28ms)
Batching is key: Single Redis MGET for 200 restaurants vs. 200 individual GETs. Saves ~150ms.
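To make the round-trip arithmetic concrete, here is a sketch with an in-memory stand-in for Redis (with a real client this would be redis.Redis().mget(keys); the per-call latency figure is an assumption for illustration):

```python
# Stub standing in for Redis; the point is one round trip vs. 200.
class FakeRedis:
    ROUND_TRIP_MS = 0.8  # assumed per-call network latency

    def __init__(self, data):
        self.data = data
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1              # one round trip per key
        return self.data.get(key)

    def mget(self, keys):
        self.round_trips += 1              # one round trip for the batch
        return [self.data.get(k) for k in keys]

store = FakeRedis({f"restaurant:{i}": {"rating": 4.2} for i in range(200)})
keys = [f"restaurant:{i}" for i in range(200)]

naive = [store.get(k) for k in keys]       # 200 round trips
batched = store.mget(keys)                 # 1 round trip

print(store.round_trips)                   # 201 total across both styles
print((200 - 1) * FakeRedis.ROUND_TRIP_MS) # ~159 ms saved by batching
```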
5. Compute Real-Time Features (t=30ms)
6. Model Inference: Rank 200 Restaurants (t=83ms)
Why LightGBM? Fast inference (0.4ms per restaurant), handles mixed feature types, good accuracy for tabular data.
7. Post-Processing & Business Rules (t=93ms)
8. Return Response (t=115ms ✓)
Key Takeaways from This Flow
Final Thoughts
ML system design is about understanding constraints and making smart trade-offs. There's no one-size-fits-all solution—what works for Netflix doesn't work for a startup, and what works for ad ranking doesn't work for fraud detection.
Start simple. Batch serving + periodic retraining gets you 80% of the way there. Add complexity only when you have evidence that it's needed. Measure everything, optimize bottlenecks, and always have a fallback plan.