ML System Design for Starters

/home/blog/ml/notes
← back to blogML System Design for StartersEssential patterns and trade-offs every ML engineer should understand
Chaïmae Sriti•October 2025
Machine learning in production is less about fancy algorithms and more about building reliable systems that actually work at scale. The best way to learn this is through a real example.
In this guide, we'll design a restaurant ranking system for DoorDash from scratch. Instead of abstract theory, we'll introduce each concept—data pipelines, feature stores, two-stage retrieval, monitoring—exactly when we need it to solve a real problem. By the end, you'll understand how all the pieces fit together.
Contents1. Data Sources & Event Streaming
2. Feature Engineering & Storage
3. Model Serving: The Two-Stage Architecture
4. Scaling to 100M Requests/Day
5. Monitoring & Iteration
6. A/B Testing & Experimentation
7. Common Pitfalls & Solutions
8. Key System Design Tradeoffs
9. End-to-End Case Study
Our Problem: Design a system that shows users the most relevant restaurants when they open the DoorDash app. It needs to be fast (<200ms), personalized, and handle millions of users.
Questions We Should Already Be AskingBefore we jump into solutions, let's break down what this problem is really asking. Here are the questions that should immediately pop up:
What does "most relevant" mean?
Are users searching for something specific ("sushi near me") or are we proposing restaurants unprompted? These are two different problems:
Query-based: User types "sushi" → we need search + ranking
Browse/discovery: User opens app → we need personalized recommendations
For this design, we'll focus on the browse case (no query, just show relevant restaurants). The search case is similar but adds text matching.
How fast is "fast"?
<200ms is mentioned. Why does this matter? Because users leave if it's slow. Every 100ms of latency costs you conversions. This constraint immediately rules out certain approaches:
❌ Can't use a huge neural network that takes 2 seconds to score each restaurant
❌ Can't score all 500K restaurants—we need to filter first
✓ Need caching, fast lookups (Redis, not database joins)
What does "personalized" mean?
Users should see restaurants aligned with their preferences. This is where the ML modeling happens.
We need to predict: "Will this user order from this restaurant?" This requires:
User history (past orders, cuisines they love, price sensitivity)
Context (time of day, location, weather)
A trained model that learns patterns from millions of past user-restaurant interactions
What does "millions of users" tell us?
Scale on the user side:
10M daily active users
Peak: 6,000 requests/second (dinner time)
Can't use a single server—need horizontal scaling, load balancing
Need to cache user features (can't recompute for every request)
What about "tens of thousands of restaurants"?
Scale on the inventory side:
500K restaurants total (across all cities)
Each has changing state: open/closed, current wait time, menu availability
Need efficient retrieval: geo-spatial search, filtering
Can't rank all 500K restaurants in <200ms—need candidate generation
This is why we use a two-stage approach: filter 500K down to 200 candidates (fast), then rank those 200 precisely (slower but acceptable).
Why are we doing this? (The metric question)
The problem doesn't explicitly state the goal, but we need to ask: what metric defines success?
Order conversion rate: % of users who open the app and place an order
Not "model accuracy"—we care about business outcomes
A slightly less accurate model that's 2x faster might increase conversions more
This metric drives every design decision. Latency matters because slow = lower conversions. Personalization matters because relevant = higher conversions.
What We Know So Far
By now, just from breaking down the problem statement, we already have a rough mental picture of what the system will look like:
The core flow: User opens app → ML model runs inference in real-time → returns top-K restaurants (ranked list) → user sees personalized results
Data collection: Every user interaction (clicks, orders, time spent browsing) is logged live through an event streaming system. These events flow into storage for later analysis.
Model training: The ML model trains on historical data (user behavior from the past 30-90 days). It needs to retrain periodically (daily or weekly) to adapt to changing user preferences and new restaurants.
Two timelines: There's an offline pipeline (batch processing historical data, training models overnight) and an online pipeline (real-time inference serving predictions in <200ms).
The scale challenge: Millions of users, hundreds of thousands of restaurants, peak traffic in seconds—this rules out naive approaches (can't just "run the model on everything").
What's next: Now that we understand the high-level structure, let's dig into each piece—how do we actually build the data pipelines? How does the two-stage retrieval work? What happens when things break? We'll walk through each component step by step.
1. Where Does the Data Come From?Before we build anything, let's trace the data. Every ML system starts with events—things users do. For DoorDash, we have three types of participants generating data:
Consumer Events (people ordering):• Opens app → {user_id, timestamp, location}
• Clicks restaurant → {user_id, restaurant_id, position_in_list}
• Places order → {user_id, restaurant_id, order_value, items}
• Rates order → {user_id, restaurant_id, rating}
Merchant Events (restaurants):• Updates menu → {restaurant_id, menu_changes, timestamp}
• Marks item unavailable → {restaurant_id, item_id, available: false}
• Changes hours → {restaurant_id, open_time, close_time}
Dasher Events (delivery drivers):• Accepts order → {dasher_id, order_id, timestamp}
• Updates location → {dasher_id, lat, lng, timestamp}
• Completes delivery → {dasher_id, order_id, delivery_time}
These events are the raw material for our ML system. They flow in real-time, millions per hour. Consumer events help us predict what users want, merchant events tell us what's available, and dasher events help us estimate delivery time.
The Event Stream: KafkaWe need a system that can:
Handle 100K+ events per second (peak dinner time)
Not lose data (every click matters for training)
Let multiple consumers read the same events (one writes to storage, one updates real-time features, one monitors conversions)
Event Ingestion Architecture
Mobile App / Backend
User interactions
↓
Event Gateway (Kafka Producer)
{user_id, event_type, payload, timestamp}
↙
↓
↘
Topic: analytics-events
Retention: 30 days
Partitioned by event_date
↓
Batch Consumer
→ S3 Data Lake
Optimized for throughput
Batch: hourly
Topic: realtime-events
Retention: 1 day
Partitioned by user_id
↓
Stream Processor
Flink/Kafka Streams
Aggregate per user
→ Redis (recent signals)
Direct Write
(No Kafka)
Low-latency metrics
↓
Metrics Store
Prometheus/Datadog
Pre-aggregated
→ Real-time dashboards
⚠️ Why Split Into Multiple Paths?
Using a single Kafka topic for everything creates problems:
Consumer lag: Slow batch consumers block fast real-time consumers
Retention conflicts: Analytics needs 30+ days, real-time needs hours
Hot partitions: Partitioning by user_id causes uneven load (power users)
Wrong tool: Kafka is for event streaming, not low-latency metrics (use direct writes instead)
Key Takeaway
Events are routed to different destinations based on their purpose: batch analytics needs long retention and high throughput, real-time features need low latency and recent data, and monitoring metrics bypass Kafka entirely for speed. This separation prevents slow consumers from blocking fast ones.
Why Kafka?
High throughput: Can handle millions of events per second
Durability: Events are persisted to disk, not just in memory
Replay: If a consumer crashes, it can replay events from where it left off
Multiple consumers: Same event stream feeds different systems (no duplicate API calls)
What Can We Extract From These Events?Raw events aren't useful by themselves. We need to aggregate them into features. Here's what we can derive:
From consumer events:• Favorite cuisine (Italian appears in 80% of clicks)
• Price sensitivity (avg order value: $22)
• Browsing time (clicks 5 restaurants before ordering)
• Preferred delivery time (orders mostly between 6-8pm)
From merchant events:• Menu update frequency (changes menu 2x per week)
• Item availability patterns (out of stock every Friday night)
• Operating hours consistency (open 7 days vs weekends only)
From dasher events:• Average delivery time per zone (downtown: 25 min, suburbs: 40 min)
• Peak hour congestion patterns (6-8pm deliveries take 2x longer)
• Dasher reliability (acceptance rate, completion rate)
But Wait—Not All Data Is EventsWe also have static data that doesn't change frequently. This comes from databases, not event streams:
Restaurant Static Data:• Menu items and prices (stored in PostgreSQL)
• Location and cuisine type (rarely changes)
• Historical rating (4.5 stars, computed weekly)
• Average delivery time (25 min, updated daily)
Merchant Static Data:• Business info (owner, account creation date)
• Monthly order volume (500 orders/month)
• Quality score (4.7, updated weekly)
Dasher Static Data:• Driver profile (vehicle type, service area)
• Overall rating (4.8 stars)
• Completion rate (95% of accepted orders completed)
• Delivery zones (operates in downtown SF)
Key difference: Events are continuous streams (clicks happen every second), static data is batched updates (ratings recomputed daily/weekly).
Two Processing PathsEvents flow into two separate pipelines with different goals:
Path 1: Real-Time (Streaming)
Goal: Update features that matter right now
User just clicked 3 Italian restaurants in the last 5 minutes → boost Italian in next request
Restaurant is getting slammed (10 orders in last 10 min) → increase estimated delivery time
Tools: Kafka → Stream processing (Kafka Streams/Flink) → Redis cache
Latency: Milliseconds to seconds
Path 2: Batch (Historical)
Goal: Compute features from long-term patterns
User's favorite cuisine over the last 90 days (aggregate all past orders)
Restaurant's average rating over the last month
User's typical order value (median of last 50 orders)
Tools: Kafka → S3 (data lake) → Spark/Airflow (batch processing) → Feature Store
Latency: Hours to days (runs overnight, who cares)
Key Insight: Same Events, Different Timescales
A single "user places order" event contributes to:
Real-time: "This restaurant is busy right now" (updated in seconds)
Batch: "This restaurant has a 4.5 star average" (recomputed daily)
Model training: A labeled example: (user_123, restaurant_456) → ordered = True
The same data, consumed three different ways, at three different speeds.
Putting It Together: Two Data SourcesWe have two types of data flowing into the system:
1. User Events (Dynamic - Changes Every Second)
Three types of users generate events:
• Consumers: Browse restaurants, click, order, rate (most events)
• Merchants (Restaurants): Update menu, mark items unavailable, change hours
• Dashers (Delivery drivers): Accept orders, update location, complete deliveries
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#e1bee7','primaryTextColor':'#000','primaryBorderColor':'#9c27b0','lineColor':'#666','secondaryColor':'#c5e1a5','tertiaryColor':'#ffccbc','fontSize':'14px'}}}%%
graph LR
    Events[User Events<br/>clicks, orders] --> Online[Online Pipeline<br/>Stream Processing]
    Events --> Offline[Offline Pipeline<br/>Batch Processing]

    Online --> Redis[Redis]
    Redis --> Serve[Serve Predictions<br/>Latency: ms]

    Offline --> S3[S3 Data Lake]
    S3 --> Train[Train Model<br/>Latency: hours-days]

    style Events fill:#bbdefb,stroke:#1976d2,stroke-width:2px
    style Online fill:#e1bee7,stroke:#9c27b0,stroke-width:2px
    style Offline fill:#c5e1a5,stroke:#689f38,stroke-width:2px
    style Redis fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    style S3 fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    style Serve fill:#ffccbc,stroke:#f4511e,stroke-width:2px
    style Train fill:#b3e5fc,stroke:#039be5,stroke-width:2px
Online Pipeline (Real-Time):User clicks → Kafka → Flink/Kafka Streams → Redis → Used for predictions (ms latency)
Offline Pipeline (Batch):User clicks → Kafka → S3 → Spark/Airflow → Used for training (hours-days latency)
2. Static Data (Changes Rarely - Updated Daily/Hourly)
Three types of static data:
• Restaurant data: Menu, hours, location, cuisine type, average rating
• Merchant data: Restaurant owner info, business metrics, payout history
• Dasher data: Driver ratings, vehicle type, completion rate, delivery zones
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#fff9c4','primaryTextColor':'#000','primaryBorderColor':'#fbc02d','lineColor':'#666','secondaryColor':'#c5e1a5','tertiaryColor':'#e1bee7','fontSize':'14px'}}}%%
graph LR
    Static[Static Data<br/>restaurant, merchant,<br/>dasher info] --> DB[Database<br/>PostgreSQL/DynamoDB]
    DB --> Batch[Batch Jobs<br/>Spark/Airflow]
    Batch --> Cache[Cache<br/>Redis]
    Cache --> Serve[Serve Predictions<br/>Latency: ms]
    Batch --> Train[Train Model<br/>Latency: hours-days]

    style Static fill:#dcedc8,stroke:#689f38,stroke-width:2px
    style DB fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    style Batch fill:#c5e1a5,stroke:#689f38,stroke-width:2px
    style Cache fill:#e1bee7,stroke:#9c27b0,stroke-width:2px
    style Serve fill:#ffccbc,stroke:#f4511e,stroke-width:2px
    style Train fill:#b3e5fc,stroke:#039be5,stroke-width:2px
Static Data Pipeline:Static data → PostgreSQL/DynamoDB (database) → Spark/Airflow (batch process) → Redis (cache) + Used for training
Examples:
• Restaurant: 4.5 star rating, Italian cuisine, avg delivery 25 min
• Merchant: Account since 2020, 500 orders/month, 4.7 quality score
• Dasher: 4.8 rating, 95% completion rate, drives in downtown SF
Updated: hourly or daily (doesn't change per request)
2. How Do We Transform Raw Data into Features?Raw events like "user clicked restaurant" aren't useful by themselves. Models need numbers. We need to transform raw data into features—numerical representations that capture patterns.
Example: Turning Events into Features
Raw events (last 7 days):user_123 clicked restaurant_456 (Italian)
user_123 clicked restaurant_789 (Italian)
user_123 ordered from restaurant_456 (Italian, $25)
user_123 clicked restaurant_234 (Mexican)
Becomes features:user_favorite_cuisine = "Italian" (appears in 75% of clicks)
user_avg_order_value = 25.0
user_click_to_order_rate = 0.25 (1 order / 4 clicks)
user_orders_last_7days = 1
The Feature Store ProblemHere's the tricky part: we need features in two places:
Training (offline): Compute features from historical data in S3 using SparkExample: "What was user_123's favorite cuisine on March 15, 2024?"
Serving (online): Compute features in real-time from Redis cacheExample: "What is user_123's favorite cuisine RIGHT NOW?"
⚠️ Training-Serving Skew
If you compute features differently in training vs serving, your model will fail in production:
Training: SQL query: SELECT cuisine, COUNT(*) FROM clicks GROUP BY cuisine
Serving: Python function with slightly different logic
→ Result: Model sees different data in production than in training = garbage predictions
✓ Solution: Feature Store
One place to define features. Write the logic once, use everywhere:
@feature
def user_favorite_cuisine(user_id, timestamp):
    # Same code runs in training AND serving
    clicks = get_clicks(user_id, last_7_days(timestamp))
    return most_common(clicks, "cuisine")

# Training: user_favorite_cuisine(123, "2024-03-15")
# Serving:  user_favorite_cuisine(123, now())
Tools: Feast, Tecton, AWS Feature Store
Feature Pipeline Architecture%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#e1bee7','primaryTextColor':'#000','primaryBorderColor':'#9c27b0','lineColor':'#666','fontSize':'14px'}}}%%
graph TB
    Events[Raw Events<br/>Kafka] --> Batch[Batch Processing<br/>Spark - runs daily]
    Events --> Stream[Stream Processing<br/>Flink - real-time]

    Static[Static Data<br/>PostgreSQL] --> Batch

    Batch --> OfflineStore[Offline Feature Store<br/>S3 Parquet]
    Stream --> OnlineStore[Online Feature Store<br/>Redis]
    Batch --> OnlineStore

    OfflineStore --> Training[Model Training<br/>Historical features]
    OnlineStore --> Serving[Model Serving<br/>Real-time features]

    style Events fill:#bbdefb,stroke:#1976d2,stroke-width:2px
    style Static fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    style Batch fill:#c5e1a5,stroke:#689f38,stroke-width:2px
    style Stream fill:#e1bee7,stroke:#9c27b0,stroke-width:2px
    style OfflineStore fill:#ffccbc,stroke:#f4511e,stroke-width:2px
    style OnlineStore fill:#ffccbc,stroke:#f4511e,stroke-width:2px
    style Training fill:#b3e5fc,stroke:#039be5,stroke-width:2px
    style Serving fill:#b3e5fc,stroke:#039be5,stroke-width:2px
Two feature computation paths:
Batch features (slow, historical):• Computed daily from S3 using Spark
• Example: "user's favorite cuisine over last 90 days"
• Stored in offline store (training) + synced to online store (serving)
Real-time features (fast, recent):• Computed on-the-fly from Kafka using Flink
• Example: "user clicked 3 Italian restaurants in last 5 minutes"
• Stored directly in online store (Redis)
Modeling Approaches: From Simple to AdvancedOnce we have features, how do we actually rank restaurants? Let's go from simplest to most sophisticated:
Approach 1: Heuristic Ranking (No ML)
How it works:score = (restaurant_rating / distance_km) * cuisine_match_bonus
# Example: (4.5 / 2.0) * 1.5 = 3.375
# Rank all restaurants by score, return top 20
Pros:• No training needed
• Fast (5ms)
• Easy to debug
• Good baseline
Cons:• Not personalized
• Hard to tune
• Ignores user history
• Leaves money on table
Use when: You're just starting, need something fast, or as a fallback when ML fails
Approach 2: Logistic Regression (Simple ML)
How it works:• Learn weights for each feature: P(order) = sigmoid(w1*distance + w2*rating + w3*cuisine_match + ...)
• Train on historical data: (user, restaurant) pairs with label = 1 if ordered, 0 if not
• Rank by predicted probability
Pros:• Personalized (uses user features)
• Very fast inference (10ms)
• Interpretable (can see feature weights)
• Easy to train
Cons:• Assumes linear relationships
• Can't capture complex patterns
• Needs manual feature engineering
• Lower accuracy than tree models
Use when: You need fast, interpretable predictions and your problem is relatively simple
Approach 3: Gradient Boosted Trees (LightGBM/XGBoost)
How it works:• Build ensemble of decision trees that learn complex feature interactions
• Input: 200 features (user, restaurant, context, cross-features like user_favorite_cuisine == restaurant_cuisine)
• Output: P(order | user, restaurant) for each candidate
# Training
model = LightGBM(objective='binary', num_trees=500)
model.fit(features, labels)  # labels: 1=ordered, 0=not ordered

# Serving
scores = model.predict(candidate_features)  # Score 200 restaurants
return top_k(scores, k=20)
Pros:• High accuracy (best bang for buck)
• Learns feature interactions
• Handles missing data well
• Fast inference (50ms for 200 candidates)
• Industry standard for ranking
Cons:• Slower than logistic regression
• Still needs feature engineering
• Harder to interpret
• Large model size (200MB)
Use when: You need production-grade accuracy and can afford 50ms latency (most ranking systems use this)
Approach 4: Neural Networks (Two-Tower / Deep Learning)
How it works (Two-Tower example):User Tower:  [user_id, past_cuisines, location] → Dense layers → user_embedding (128-dim)
Restaurant Tower: [restaurant_id, cuisine, rating] → Dense layers → restaurant_embedding (128-dim)
Score = dot_product(user_embedding, restaurant_embedding)

# Serving trick:
# 1. Precompute restaurant embeddings (batch, once per day)
# 2. At inference: compute user embedding (fast) + vector search (ANN) → top 200 restaurants
# 3. Pass to LightGBM for final ranking
Pros:• Can search millions of items efficiently (ANN)
• Learns representations automatically
• Captures complex patterns
• Good for candidate generation (Stage 1)
Cons:• Harder to train (GPUs, hyperparameters)
• Less interpretable (black box)
• Requires more data
• Overkill for smaller catalogs
• Infrastructure complexity (vector DB)
Use when: You have millions of items (like YouTube videos) or need to learn complex embeddings. For DoorDash (500K restaurants), LightGBM is often sufficient.
Recommended Approach for DoorDash
Stage 1 (Candidate Generation): Simple heuristics + geo-filtering• Filter: distance < 5km, is_open=true, accepts_delivery=true
• Sort by: distance × rating → keep top 200
• Latency: ~20ms
Stage 2 (Ranking): LightGBM• Score 200 candidates with full feature set (200 features)
• Predict P(order | user, restaurant)
• Latency: ~50ms
Total latency: 70ms (well under 200ms budget)
3. The Two-Stage TrickProblem: We have 500K restaurants. Running our ML model on all of them for every user request = too slow. We'd miss our 200ms deadline.
Solution: Don't rank everything. Instead, use two stages:
Stage 1: Candidate Generation (Fast & Rough)
• Goal: Narrow 500K restaurants down to ~200 candidates
• Method: Simple filters + heuristics
- Filter: Must be open, within 5km, accepts delivery
- Heuristic: Sort by (distance × rating)
• Latency: ~20ms
• It's okay if we miss some good restaurants here
Stage 2: Ranking (Precise & Slow)
• Goal: Rank these 200 candidates precisely
• Method: Full ML model with all features
- Model: LightGBM (gradient boosted trees)
- Predicts: P(user will order | restaurant)
• Latency: ~80ms (for 200 restaurants)
• Return top 20 to show the user
Why this works: Stage 1 eliminates obviously bad candidates cheaply. Stage 2 uses expensive computation only on promising options. Total: ~100ms, well under our 200ms budget.
Real-time vs Batch
For DoorDash, we need online serving (computing predictions in real-time per request) because:
User location changes (traveling, at work vs home)
Restaurant availability changes (busy, closed, out of items)
Context matters (time of day, weather)
We can't precompute rankings for every user-location-time combination. That would be billions of possibilities.
Full Serving ArchitectureUser opens DoorDash app
    ↓
[API Gateway / Load Balancer]
    ├─> Distributes requests across 100+ servers
    └─> Rate limiting, authentication
    ↓
[Ranking Service Instance 1...N] (each handles ~60 req/sec)
    │
    ├─[1] Feature Fetching (parallel) - 5ms
    │   ├─> Redis (User Features)
    │   │   └─> {user_123: {fav_cuisine: "Italian", avg_price: 22}}
    │   └─> Redis (Restaurant Features)
    │       └─> {rest_456: {rating: 4.5, delivery_time: 25, is_open: true}}
    │
    ├─[2] Candidate Generation - 20ms
    │   ├─> ElasticSearch / Vector DB
    │   │   └─> Geo-search: restaurants within 5km
    │   │   └─> Filter: is_open=true, accepts_delivery=true
    │   │   └─> Returns ~500 restaurants
    │   │
    │   └─> Heuristic Ranking
    │       └─> Score = 1/distance × rating
    │       └─> Keep top 200 candidates
    │
    ├─[3] Model Inference (LightGBM) - 50ms
    │   ├─> Load model from memory (loaded at startup)
    │   ├─> For each of 200 restaurants:
    │   │   └─> Compute features (user × restaurant × context)
    │   │   └─> Predict: P(order | shown)
    │   └─> Sort by score, keep top 50
    │
    ├─[4] Post-Processing - 10ms
    │   ├─> Business rules:
    │   │   └─> Boost promoted restaurants (+0.1 to score)
    │   │   └─> Boost new restaurants (cold start handling)
    │   │   └─> Diversity: max 3 from same cuisine in top 10
    │   │
    │   └─> Exploration (10% of traffic):
    │       └─> Randomize 2-3 positions for data collection
    │
    └─[5] Response - 115ms total
        └─> Return top 20 restaurants with metadata

Fallback Strategy (if main path fails):
    ├─> If Redis timeout: Use stale cache (5min old)
    ├─> If model timeout: Use heuristic ranking (distance × rating)
    └─> If total > 200ms: Return partial results (top 10 instead of 20)

Latency Breakdown:
    Feature fetch:     5ms  (parallel Redis calls)
    Candidate gen:    20ms  (ElasticSearch + filtering)
    Model inference:  50ms  (200 restaurants × LightGBM)
    Post-processing:  10ms  (sorting + business rules)
    API overhead:     30ms  (network, serialization)
    ─────────────────────
    Total:           115ms  (< 200ms target ✓)
4. Handling Scale: 100M Requests/Day100M requests per day = 1,200 requests per second on average, but dinner time (6-8pm) has 5x that. How do we handle it?
Caching Everything PossibleUser features: Redis cache (24hr TTL) — "user_123 loves Italian" doesn't change hourly
Restaurant features: Redis cache (1hr TTL) — ratings update slowly
Model predictions: DON'T cache — too personalized (user + location + time + restaurants combo)
Horizontal ScalingOne server can't handle 6,000 requests/second. Solution: Run 100 identical model servers behind a load balancer. Each handles 60 req/sec. Easy to add more during peak hours.
Infrastructure Layout (for 6,000 req/sec peak):

[Users] (10M daily active)
    ↓
[CDN / Edge Locations]
    ├─> Static assets cached
    └─> API requests forwarded
    ↓
[Load Balancer - AWS ALB / NGINX]
    ├─> Health checks every 10s
    ├─> Route to healthy instances only
    └─> Sticky sessions (optional)
    ↓
[Ranking Service Cluster] (Auto-scaling: 50-150 instances)
    ├─> Instance 1  (4 vCPU, 8GB RAM) → 60 req/sec
    ├─> Instance 2  (4 vCPU, 8GB RAM) → 60 req/sec
    ├─> ...
    └─> Instance 100 → handles total 6,000 req/sec
    │
    └─> Each instance runs:
        ├─> Model loaded in memory (200MB)
        ├─> Connection pool to Redis (10 connections)
        └─> Connection pool to ElasticSearch (5 connections)

[Feature Store - Redis Cluster]
    ├─> Primary: 3 nodes (sharded by user_id)
    ├─> Replicas: 6 nodes (2 replicas per primary)
    ├─> Cache size: 500GB total
    └─> Eviction: LRU (least recently used)

[Restaurant Database - ElasticSearch]
    ├─> 10 nodes
    ├─> Sharded by restaurant_id
    └─> Geo-spatial index for location queries

[Model Registry - S3]
    └─> model_v23.pkl (updated daily)

[Monitoring - Datadog/Prometheus]
    ├─> Latency metrics (p50, p95, p99)
    ├─> Error rates per instance
    └─> Auto-scaling triggers:
        └─> If p95 latency > 150ms → add 10 instances

Cost Optimization:
    ├─> Off-peak (2am-6am): scale down to 20 instances
    ├─> Peak (6pm-8pm): scale up to 150 instances
    └─> Spot instances for 50% of fleet (30% cost savings)
Fallback When Things BreakWhat if the ML model times out? Feature store is down? Don't show an error—show something:
If model times out: Use simple heuristic (nearby + highly rated)
If feature store down: Use stale cached features (better than nothing)
If everything fails: Show popular restaurants in the area
5. Monitoring: How Do We Know It's Working?Your model is in production. Now what? You can't just "set it and forget it." Things break. User behavior changes. You need to watch it constantly.
What We Actually Care AboutBusiness metric (most important):• Order conversion rate: % of users who see rankings and place an order
• If this drops, something is wrong—even if model accuracy looks fine
System health:• Latency p95 < 150ms (alert if breached)
• Error rate < 0.1%
• Fallback rate (how often we use the backup ranking)
Model health:• Are prediction scores reasonable? (avg score ~0.3 for our model)
• Distribution check: If suddenly 90% of restaurants get score >0.9, something's broken
The Position Bias Problem⚠️ Your Model Creates Its Own Reality
Here's a vicious cycle:
Model ranks "Pizza Place A" #1
Users see it first → 50% click rate
"Pizza Place B" is ranked #10 → only 5% click rate
Training data: "A has 10x more clicks than B"
Model learns: "A must be better!" → ranks A even higher
But maybe B is actually better—users just never saw it!
✓ Solution: Exploration
Randomize 10% of rankings. Show some users "Pizza Place B" at the top even if the model ranks it #10. Now we get unbiased data on whether B is actually good. This prevents the model from getting stuck ranking the same restaurants forever.
6. A/B Testing & Experimentation: How Do We Iterate?You've built your ML system. But how do you know if changes actually improve things? How do you test new models, features, or ranking strategies without breaking production? The answer: rigorous experimentation.
The A/B Testing FrameworkEvery change—new model, new feature, different ranking algorithm—goes through controlled experimentation before full rollout.
Standard A/B Test Setup
Control Group (50% of traffic):• Uses current production model (v1.0)
• Baseline conversion rate: 8.0%
Treatment Group (50% of traffic):• Uses new experimental model (v2.0)
• Hypothesis: Adding real-time restaurant capacity features improves conversion
Randomization:• Users randomly assigned based on: hash(user_id) % 100 < 50
• Sticky assignment: same user always sees same variant
Concrete Example: Testing a New FeatureScenario: Adding "Current Wait Time" Feature
Hypothesis: Showing estimated wait time improves user satisfaction and reduces cancellations
Experiment Design:• Duration: 2 weeks
• Sample size: 100,000 users per variant
• Primary metric: Order conversion rate
• Secondary metrics: Average order value, cancellation rate, time to order
• Guardrail metrics: App crash rate, latency p95 (must stay < 200ms)
Results After 2 Weeks
Control (No Wait Time):• Conversion rate: 8.0% (baseline)
• Avg order value: $22.50
• Cancellation rate: 12%
• Time to order: 180 seconds
Treatment (With Wait Time):• Conversion rate: 8.4% (+5% relative ✓)
• Avg order value: $22.30 (-0.9%, not significant)
• Cancellation rate: 9.5% (-21% relative ✓)
• Time to order: 165 seconds (-8.3%, faster decisions ✓)
Decision: SHIP IT ✓
Primary metric improved significantly, secondary metrics positive or neutral, no guardrail violations. Roll out to 100% of traffic.
Multi-Armed Bandit: Adaptive ExplorationA/B tests are static: 50/50 split for the full duration. But what if one variant is clearly winning after 3 days? Multi-armed bandits dynamically allocate more traffic to better-performing variants.
Thompson Sampling Example
Testing 3 different ranking algorithms simultaneously:
Day 1: Equal traffic (33% each)
Day 3: Algorithm B showing +2% conversion
→ Shift traffic: A=25%, B=50%, C=25%
Day 7: B confirmed best, C clearly worst
→ Final allocation: A=20%, B=70%, C=10%
Day 14: Ship B to 100%, retire A and C
Advantage: Minimizes "regret" (serving bad experiences). You don't waste 50% of traffic on a losing variant for 2 weeks.
Common Experimentation PitfallsNetwork Effects & Spillover
Problem: Restaurant capacity is shared. If treatment group orders more from "Joe's Pizza," control group sees longer wait times.
Solution: Test at restaurant-level or geographic clusters, not user-level.
Sample Ratio Mismatch (SRM)
Problem: You expect 50/50 split, but see 52/48. Something's wrong with randomization or logging.
Solution: Automated SRM checks. If detected, invalidate experiment and debug.
P-Hacking & Multiple Testing
Problem: You test 20 metrics. One shows p<0.05 by chance. You declare victory.
Solution: Pre-register primary metric. Apply Bonferroni correction for multiple tests. Require consistent directional wins.
Experimentation Infrastructure Requirements
Assignment service: Consistent, logged user→variant mapping
Feature flags: Instant rollback if something breaks
Metrics pipeline: Real-time dashboards showing conversion, latency, errors per variant
Statistical engine: Automated significance testing, confidence intervals, power analysis
7. Common Pitfalls & SolutionsCold Start: New Restaurant Opens
Problem: "Joe's Pizza" just opened. Zero historical orders. Model has no idea if it's good.
Solution: Use content-based features (cuisine, price range, owner's other restaurants). Boost new restaurants temporarily for exploration. Give them a chance.
Real-Time Capacity
Problem: Model ranks "Burger King" #1. But they're slammed—60 minute wait. User orders, gets mad, cancels.
Solution: Add real-time feature: current order volume. Predict prep time. Penalize overloaded restaurants.
Model Works in Jupyter, Fails in Prod
Problem: Your notebook model gets 90% accuracy. In production: 60% accuracy. Why? Training-serving skew (features computed differently).
Solution: Integration tests. Compute features the same way in training and serving. Use feature store.
8. Key System Design TradeoffsEvery ML system involves fundamental tradeoffs. There are no universal "right" answers—only choices that match your constraints. Here are the most critical decisions you'll face:
1. Latency vs. Accuracy
The tradeoff: More complex models (deep neural networks, large ensembles) are more accurate but slower. Simpler models (linear models, small trees) are fast but less accurate.
Example:
• LightGBM (300 trees): 50ms inference, 87% accuracy
• Deep neural network: 250ms inference, 89% accuracy (+2%)
• Linear model: 5ms inference, 82% accuracy (-5%)
Decision: For user-facing ranking, 50ms LightGBM wins. 2% accuracy isn't worth 200ms extra latency—users will leave. But for offline batch scoring (fraud detection, content moderation), use the neural network.
2. Real-Time vs. Batch Processing
The tradeoff: Real-time features are fresher but expensive to compute. Batch features are stale but cheap.
Example:
• Real-time: "Restaurant received 10 orders in last 10 minutes" → compute per request
• Batch: "Restaurant avg rating over last 30 days" → precompute daily
Decision: Use real-time for features that change rapidly and matter immediately (current wait time, inventory). Use batch for stable features (historical preferences, aggregate stats). Hybrid is common: 80% batch, 20% real-time.
3. Personalization vs. Simplicity
The tradeoff: Per-user models are more accurate but require more infrastructure, data, and cold-start handling. Global models are simpler but less tailored.
Options:
• One model per user: Max personalization, but 10M users = 10M models (infeasible)
• One global model with user features: Practical personalization via user embeddings
• Clustered models: 100 user segments, one model per segment (middle ground)
Decision: Global model with user features wins for most applications. Segments work for domains with distinct user types (B2B vs. B2C, power users vs. casual).
4. Model Complexity vs. Maintainability
The tradeoff: Complex models (ensembles, neural nets, multi-stage pipelines) squeeze out extra accuracy but are harder to debug, deploy, and explain.
Example:
• Simple: LightGBM → One model, easy to inspect feature importance
• Complex: Ensemble of 5 models + neural reranker → +2% accuracy, 5x ops burden
Decision: Start simple. Only add complexity when you've exhausted simpler approaches and the business impact justifies the cost. Engineers leave, complex systems stay—make sure you can maintain it.
5. Precomputation vs. On-Demand
The tradeoff: Precomputing predictions (batch scoring) is simple and cheap. Computing on-demand (online serving) is flexible but requires more infrastructure.
Precomputation works when:
• Output space is small (email: spam or not spam)
• Context doesn't change (recommendation for tomorrow's email)
• Latency isn't critical (nightly batch job is fine)
On-demand required when:
• Output space is huge (rank 500K restaurants per user-location combo)
• Context matters (real-time location, current inventory)
• Results must be instant (user is waiting)
Decision: For DoorDash, on-demand is necessary. For email spam detection, precomputation is fine. For YouTube recommendations, hybrid (precompute candidate pool, rerank on-demand).
6. Exploration vs. Exploitation
The tradeoff: Showing users what the model thinks is best (exploitation) maximizes short-term conversion. Showing random options (exploration) gathers data for long-term improvement but hurts immediate metrics.
Strategies:
• 100% exploitation: Best short-term, but model gets stuck in local optima
• 10% exploration: Slight short-term hit, much better long-term data quality
• Multi-armed bandit: Adaptive—explore less as confidence grows
Decision: Always explore at least 5-10%. The data quality improvement is worth the small conversion hit. Use bandits if you have the infrastructure.
Principle: Start Simple, Add Complexity Incrementally
Don't build the perfect system on day one. Instead:
V1 (Week 1): No ML—rank by distance × rating → 8% conversion (baseline)
V2 (Week 4): Simple LightGBM with 10 features → 8.4% conversion (+5%)
V3 (Week 12): Full system (two-stage, feature store, exploration) → 9% (+12%)
Each version works and ships. You learn what matters by measuring real behavior, not guessing in a design doc.
9. End-to-End Case Study: From Request to ResponseLet's walk through a complete example: Sarah opens the DoorDash app at 6:45 PM on a Friday in San Francisco. What happens behind the scenes, from data ingestion to ranking, in under 150ms?
User Context
• User: Sarah (user_id: 12847)
• Location: Mission District, SF (37.7599, -122.4148)
• Time: Friday, 6:45 PM
• Historical behavior: Orders Italian 60% of the time, avg order value $25, prefers 4.5+ star restaurants
Step-by-Step Execution1. Request Arrives (t=0ms)
POST /api/v1/restaurants/rank
{"user_id": "12847", "lat": 37.7599, "lng": -122.4148}
Load balancer routes request to ranking service instance #47 (current load: 45 req/sec)
2. Fetch User Features from Redis (t=3ms)
GET user:12847:features
→ {"fav_cuisine": "Italian", "avg_order_value": 25, "price_sensitivity": 0.7,
"orders_last_30d": 8, "avg_rating_ordered": 4.6}
Cache hit! User features were precomputed last night by batch job, stored in Redis with 24h TTL.
3. Candidate Generation via ElasticSearch (t=23ms)
Query: geo_distance filter (5km radius) + is_open=true + accepts_delivery=true
→ Returns 847 restaurants within range
Apply heuristic ranking: score = rating / (distance_km + 1)
→ Keep top 200 candidates for detailed ranking
Optimization: ElasticSearch index sharded by geography. Mission District shard handles this query locally.
4. Batch Fetch Restaurant Features (t=28ms)
MGET restaurant:8472:features restaurant:1923:features ... (200 keys)
→ For each: {"rating": 4.5, "avg_delivery_min": 28, "cuisine": "Italian",
"orders_last_hour": 12, "prep_time_estimate": 20}
Batching is key: Single Redis MGET for 200 restaurants vs. 200 individual GETs. Saves ~150ms.
5. Compute Real-Time Features (t=30ms)
For each of 200 restaurants:
• Distance: haversine(user_lat, user_lng, restaurant_lat, restaurant_lng)
• User-restaurant affinity: user.fav_cuisine == restaurant.cuisine
• Time context: is_dinner_time=1, is_weekend=0, is_raining=0
• Estimated delivery time: restaurant.prep_time + distance/avg_speed + current_wait
6. Model Inference: Rank 200 Restaurants (t=83ms)
Model: LightGBM (300 trees, depth=6) loaded in memory at service startup
For each restaurant, predict: P(user will order | shown)
Features (35 total):
• User: fav_cuisine, avg_order_value, price_sensitivity, ...
• Restaurant: rating, cuisine, avg_delivery, orders_last_hour, ...
• Interaction: distance, cuisine_match, price_diff, ...
• Context: time_of_day, day_of_week, weather, ...
Output: Ranked list with scores
1. "Tony's Pizza" (Italian, 1.2km, 4.7★) → score: 0.42
2. "Pasta Fino" (Italian, 2.1km, 4.6★) → score: 0.38
3. "Mission Tacos" (Mexican, 0.8km, 4.8★) → score: 0.35
...
Why LightGBM? Fast inference (0.4ms per restaurant), handles mixed feature types, good accuracy for tabular data.
7. Post-Processing & Business Rules (t=93ms)
Apply business logic:
• Boost promoted restaurants: "Tony's Pizza" paid for promotion → +0.05 to score
• Cold-start handling: "New Sushi Spot" opened yesterday → boost to position #8
• Diversity: Max 3 Italian restaurants in top 10 (user sees variety)
Exploration (10% of requests):
• Random check: hash(user_id + timestamp) % 100 < 10 → YES, explore!
• Randomly shuffle positions 7-10 to gather data on lower-ranked restaurants
8. Return Response (t=115ms ✓)
{
  "restaurants": [
    {"id": "8472", "name": "Tony's Pizza", "score": 0.47, "eta_min": 28, ...},
    {"id": "1923", "name": "Pasta Fino", "score": 0.38, "eta_min": 32, ...},
    ...
  ],
  "latency_ms": 115,
  "request_id": "req_friday_1845_12847"
}
What Happens Next?Logging for training:• User shown: ["Tony's Pizza" at #1, "Pasta Fino" at #2, ...]
• User clicked: "Tony's Pizza" (position 1, score 0.47)
• User ordered: YES, order_value=$28
• Label for training: (user_12847, restaurant_8472, context) → ordered=1
Event stream:• Event: "ranking_shown" → Kafka → S3 (for batch training)
• Event: "restaurant_clicked" → Kafka → Redis (update real-time features)
• Event: "order_placed" → Kafka → Metrics (business dashboards)
Model retraining (next morning):• Spark job aggregates yesterday's events from S3
• Generates 50K new training examples with labels
• Retrains model on last 90 days of data (10M examples)
• A/B tests new model vs. current production model
• If +1% conversion → deploy new model to 100%
Key Takeaways from This Flow
Data → Features → Model → Decision: Clear progression through the system
Batch + Real-time: User features computed daily, context features computed per request
Two-stage retrieval: 847 candidates → 200 → 20 shown. Each stage optimized for speed vs. accuracy
Caching everywhere: User features (Redis), restaurant features (Redis), geo index (ElasticSearch)
Business rules + ML: Model provides scores, business logic adjusts (promotions, diversity, exploration)
Closed loop: Predictions → user actions → logged events → training data → better model
Final Thoughts
ML system design is about understanding constraints and making smart trade-offs. There's no one-size-fits-all solution—what works for Netflix doesn't work for a startup, and what works for ad ranking doesn't work for fraud detection.
Start simple. Batch serving + periodic retraining gets you 80% of the way there. Add complexity only when you have evidence that it's needed. Measure everything, optimize bottlenecks, and always have a fallback plan.
← back to blog