Interview Tips

System Design for ML Engineers

Essential system design patterns for ML systems. Covers scalability, monitoring, data pipelines, and model serving architectures.

Sarah Chen
February 22, 2026
18 min read

System design interviews for ML engineers are increasingly common at top tech companies. Unlike traditional system design, these interviews focus on the unique challenges of building, deploying, and maintaining machine learning systems at scale. This guide covers the essential patterns and concepts you need to master.

## ML System Design Interview Format

### What to Expect

Typical structure (45-60 minutes):

- 5 min: Problem statement and clarifying questions
- 10 min: High-level design and requirements
- 20 min: Deep dive into components
- 10 min: Scaling, monitoring, and improvements
- 5 min: Questions for interviewer

Common problem types:

- Design a recommendation system (Netflix, Spotify, YouTube)
- Design a fraud detection system
- Design a search ranking system
- Design an ad click prediction system
- Design a content moderation system

### How You're Evaluated

Technical depth:

- Understanding of ML system components
- Awareness of tradeoffs
- Appropriate algorithm selection

System thinking:

- End-to-end pipeline design
- Scalability considerations
- Fault tolerance and monitoring

Communication:

- Clear explanation of choices
- Structured approach
- Handling ambiguity well

## The ML System Design Framework

### Step 1: Clarify Requirements

Functional requirements:

- What should the system do?
- Who are the users?
- What inputs and outputs?

Non-functional requirements:

- Latency requirements (real-time vs. batch?)
- Scale (QPS, data volume, model size?)
- Accuracy vs. other tradeoffs?

Constraints:

- Budget limitations
- Existing infrastructure
- Timeline for delivery

Example questions to ask:

For a recommendation system:

- "What are we recommending? Products, content, users?"
- "What's the expected QPS for recommendations?"
- "How quickly should recommendations reflect new user behavior?"
- "What's more important: relevance, diversity, or novelty?"

### Step 2: Define Metrics

Online metrics (production):

- Click-through rate (CTR)
- Conversion rate
- User engagement (time spent, return visits)
- Revenue per user

Offline metrics (development):

- Precision, recall, F1 for classification
- RMSE, MAE for regression
- AUC-ROC for ranking
- NDCG for recommendations
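To make one of these metrics concrete, NDCG@k can be computed in a few lines. This is a minimal sketch using graded relevance labels, not a production implementation:

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: items ranked higher contribute more.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering,
    # so a perfect ranking scores 1.0.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0], k=4)   # relevance labels in ranked order
inverted = ndcg_at_k([0, 1, 2, 3], k=4)  # worst ordering of the same labels
```

Swapping any pair so a more relevant item ranks lower strictly decreases the score, which is why NDCG is preferred over plain precision when the order of results matters.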

Business metrics:

- Revenue impact
- User satisfaction (NPS)
- Operational cost

The key insight:

Always tie ML metrics to business outcomes. "We optimize for CTR because 1% CTR improvement correlates to $X million annual revenue increase."

### Step 3: High-Level Architecture

Standard components:

1. **Data Collection & Storage**
   - Event logging
   - Data lake/warehouse
   - Feature store

2. **Feature Engineering Pipeline**
   - Batch feature computation
   - Real-time feature serving
   - Feature versioning

3. **Training Pipeline**
   - Data preprocessing
   - Model training
   - Hyperparameter tuning
   - Model versioning

4. **Model Serving**
   - Inference service
   - A/B testing infrastructure
   - Fallback mechanisms

5. **Monitoring & Feedback**
   - Performance monitoring
   - Data quality monitoring
   - Feedback loops

## Deep Dive: Key Components

### Data Pipeline Design

Batch data pipeline:

```
Raw Events → ETL → Data Warehouse → Feature Extraction → Feature Store
                                                              ↓
                                   Training Data Generation → Model Training
```

Real-time data pipeline:

```
Events → Stream Processing (Kafka/Kinesis) → Real-time Features
                                                    ↓
                                          Feature Store (online)
```

Key considerations:

- Data freshness requirements
- Backfill capabilities
- Data quality checks
- Schema evolution

Common tools:

- Batch: Spark, Airflow, dbt
- Streaming: Kafka, Flink, Kinesis
- Storage: S3, BigQuery, Snowflake
- Feature store: Feast, Tecton, Vertex AI Feature Store

### Feature Engineering at Scale

Batch features:

- User historical statistics
- Item popularity metrics
- Time-aggregated features
- Complex SQL transformations

Real-time features:

- Current session activity
- Recent clicks/views
- Time since last action
- Real-time counters

Feature store benefits:

- Consistency between training and serving
- Feature sharing across teams
- Point-in-time correctness
- Feature versioning and lineage

**Design pattern: Feature serving**

```
Request → Feature Store (online) → Concatenate features
                                          ↓
                                   Model Service → Prediction
```
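The serving path can be sketched in code. This is a toy illustration: the in-memory dictionary stands in for an online feature store client (e.g. Feast or Tecton), and the entity keys and feature names are made up for the example:

```python
from typing import Dict, List

# Hypothetical online store contents; a real system would call a
# feature store SDK or a low-latency key-value service instead.
ONLINE_STORE: Dict[str, Dict[str, float]] = {
    "user:42": {"avg_watch_time": 31.5, "sessions_7d": 4.0},
    "item:7": {"popularity": 0.82, "age_days": 12.0},
}

# The model expects features in a fixed order.
FEATURE_ORDER: List[str] = ["avg_watch_time", "sessions_7d", "popularity", "age_days"]

def fetch_features(keys: List[str]) -> Dict[str, float]:
    # Gather features for all entity keys; unknown entities contribute nothing.
    merged: Dict[str, float] = {}
    for key in keys:
        merged.update(ONLINE_STORE.get(key, {}))
    return merged

def build_feature_vector(user_id: int, item_id: int) -> List[float]:
    # Concatenate user and item features, defaulting missing values to 0.0.
    raw = fetch_features([f"user:{user_id}", f"item:{item_id}"])
    return [raw.get(name, 0.0) for name in FEATURE_ORDER]

vec = build_feature_vector(42, 7)
```

The fixed `FEATURE_ORDER` is the detail that keeps training and serving consistent: both sides must assemble the vector identically.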

### Training Pipeline Design

Components:

1. Data validation
2. Feature transformation
3. Model training
4. Hyperparameter optimization
5. Model evaluation
6. Model registry

Training patterns:

Batch retraining:

- Schedule: Daily/weekly
- Full dataset retraining
- Simplest approach
- Good for most use cases

Online learning:

- Continuous updates from new data
- Lower latency to adapt
- More complex to implement
- Good for fast-changing data

Transfer learning/fine-tuning:

- Start with pretrained model
- Fine-tune on specific data
- Faster training
- Good for LLMs and computer vision

Key considerations:

- Reproducibility (fixed seeds, versioned data)
- Training data sampling strategies
- Handling data imbalance
- Validation set strategy
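A minimal sketch of the reproducibility point: fix RNG seeds and fingerprint the training data, so every model version can be traced back to exactly what produced it. A real pipeline would also seed numpy/torch and record the fingerprint in the model registry:

```python
import hashlib
import random

def set_seeds(seed: int) -> None:
    # Fix the stdlib RNG; real pipelines also seed numpy, torch,
    # and any distributed data shufflers.
    random.seed(seed)

def dataset_fingerprint(rows) -> str:
    # Hash the training rows so a model version records exactly
    # which data produced it.
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()[:12]

set_seeds(13)
sample_a = random.sample(range(100), 5)
set_seeds(13)
sample_b = random.sample(range(100), 5)
# Identical seeds yield identical samples, so the run is repeatable.
```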

### Model Serving Architectures

**Pattern 1: Direct Model Serving**

```
Client → API Gateway → Model Service → Response
```

- Simplest architecture
- Single model per request
- Good for low-complexity predictions

**Pattern 2: Two-Stage Retrieval + Ranking**

```
Client → Retrieval (fast, broad) → Ranking (precise, slow) → Response
```

- Common for recommendations/search
- Retrieval: Get top 1000 candidates
- Ranking: Score and sort top 100
- Balance latency and accuracy
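The two-stage pattern can be sketched as follows. The scoring functions here are toy stand-ins: in practice the retrieval stage would hit an ANN index or collaborative-filtering candidates, and the ranking stage would call a learned model:

```python
from typing import Callable, List

def two_stage_recommend(
    user_id: int,
    all_items: List[int],
    retrieval_score: Callable[[int, int], float],
    ranking_score: Callable[[int, int], float],
    n_candidates: int = 1000,
    n_results: int = 100,
) -> List[int]:
    # Stage 1: cheap scoring over the full catalog to get candidates.
    candidates = sorted(all_items, key=lambda i: retrieval_score(user_id, i),
                        reverse=True)[:n_candidates]
    # Stage 2: the expensive model scores only the candidate set.
    ranked = sorted(candidates, key=lambda i: ranking_score(user_id, i),
                    reverse=True)
    return ranked[:n_results]

# Toy scorers standing in for an ANN index and a deep ranking model.
items = list(range(10_000))
recs = two_stage_recommend(
    user_id=1,
    all_items=items,
    retrieval_score=lambda u, i: -(i % 97),  # cheap heuristic
    ranking_score=lambda u, i: -i,           # "precise" model
    n_candidates=500,
    n_results=10,
)
```

The win is that the expensive scorer runs on 500 items instead of 10,000, which is how the latency budget is met without scoring the whole catalog with the heavy model.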

**Pattern 3: Ensemble Serving**

```
          ↗ Model A ↘
Client →    Model B  → Combiner → Response
          ↘ Model C ↗
```

- Multiple models contribute
- Weighted combination or voting
- More robust predictions
- Higher complexity and latency
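A weighted-average combiner is the simplest ensemble strategy. The weights below are illustrative; in practice they would be tuned on a validation set:

```python
from typing import List

def weighted_ensemble(scores: List[float], weights: List[float]) -> float:
    # Combine per-model scores with fixed weights, normalized so the
    # result stays on the same scale as the inputs.
    assert len(scores) == len(weights)
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

# Three models (e.g. a GBDT, a deep net, and a heuristic) scoring one item.
p = weighted_ensemble([0.9, 0.7, 0.5], weights=[0.5, 0.3, 0.2])
```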

Serving infrastructure choices:

- TensorFlow Serving
- TorchServe
- Triton Inference Server
- Custom FastAPI/Flask services

Optimization techniques:

- Model quantization (FP32 → INT8)
- Model distillation
- Batching requests
- GPU inference servers
- Caching frequent predictions

### A/B Testing for ML

Why A/B testing is critical:

- Offline metrics don't always correlate with online performance
- Detect unintended consequences
- Measure business impact

A/B testing architecture:

```
User Request → Experiment Assignment → Model A or Model B
                                              ↓
                         Log experiment data → Analysis Pipeline
```

Key considerations:

- Traffic splitting (50/50, 90/10, etc.)
- Statistical significance
- Guardrail metrics (metrics that shouldn't degrade)
- Long-term vs. short-term effects
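Traffic splitting is usually done with deterministic hashing, so a given user always sees the same variant and different experiments bucket independently. A minimal sketch:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    # Hash (experiment, user) into a bucket 0-99. Including the experiment
    # name in the hash decorrelates assignments across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Assignment is stable across calls, and roughly treatment_pct% of
# users land in the treatment group.
variants = [assign_variant(str(u), "ranker_v2") for u in range(10_000)]
```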

Common pitfalls:

- Peeking at results too early
- Novelty effects (users react to change itself)
- Network effects (users influence each other)
- Multiple testing problem

### Monitoring ML Systems

Model performance monitoring:

- Prediction distribution drift
- Feature drift
- Prediction latency
- Error rates

Data quality monitoring:

- Missing values
- Schema changes
- Distribution shifts
- Anomaly detection
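One common way to quantify distribution shift is the Population Stability Index (PSI) between training-time and live feature values; a frequent rule of thumb treats PSI above 0.2 as drift worth alerting on. A minimal sketch:

```python
import math
from typing import List

def psi(expected: List[float], actual: List[float], n_bins: int = 10) -> float:
    # Population Stability Index between a baseline (training) distribution
    # and a live one, using equal-width bins over the baseline's range.
    lo, hi = min(expected), max(expected)
    width = (hi - lo) or 1.0

    def fractions(values: List[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width * n_bins)
            counts[max(0, min(idx, n_bins - 1))] += 1
        # A small smoothing term avoids log(0) for empty bins.
        return [(c + 1e-6) / (len(values) + n_bins * 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]       # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]  # mass moved to the upper half
```

An unchanged distribution scores near zero, while the shifted one blows past the 0.2 alert threshold; in production this check would run per feature on a schedule.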

Infrastructure monitoring:

- Service health
- Resource utilization
- Queue depths
- Error rates

Alerting strategy:

- Performance degradation alerts
- Data freshness alerts
- Model staleness alerts
- Fallback activation alerts

## Example: Design a Recommendation System

Let's walk through designing a Netflix-style recommendation system.

### Requirements

Functional:

- Generate personalized content recommendations
- Support different surfaces (home page, similar items, etc.)
- Real-time personalization based on recent activity

Non-functional:

- 50ms latency for recommendation serving
- 100K QPS at peak
- Fresh recommendations within 1 hour of new behavior

### High-Level Design

The architecture includes three main layers:

1. Data Pipeline: User Events → Kafka → Spark ETL → Feature Store
2. Training Pipeline: Training Data → Model Training → Model Registry
3. Serving Layer: User Request → Candidate Gen → Ranking Model → Re-rank Rules → Response

### Component Deep Dive

Candidate Generation:

- Purpose: Quickly find ~1000 potentially relevant items from millions
- Approach: Collaborative filtering, content-based filtering, or both
- Latency budget: 10-15ms

Ranking Model:

- Purpose: Score candidates based on likelihood of engagement
- Approach: Deep learning model with user and item features
- Latency budget: 20-25ms

Re-ranking:

- Purpose: Apply business rules and diversification
- Examples: Remove watched items, ensure genre diversity, apply content policies
- Latency budget: 5-10ms
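Re-ranking rules are typically simple deterministic filters applied in model-score order. A sketch, where the item IDs, genre labels, and the per-genre cap are all illustrative:

```python
from typing import Dict, List, Set

def rerank(
    ranked_items: List[int],
    genres: Dict[int, str],
    watched: Set[int],
    max_per_genre: int = 2,
) -> List[int]:
    # Apply business rules after model ranking: drop already-watched items
    # and cap how many items of one genre appear, preserving model order.
    per_genre: Dict[str, int] = {}
    result: List[int] = []
    for item in ranked_items:
        if item in watched:
            continue
        genre = genres.get(item, "unknown")
        if per_genre.get(genre, 0) >= max_per_genre:
            continue
        per_genre[genre] = per_genre.get(genre, 0) + 1
        result.append(item)
    return result

GENRES = {1: "drama", 2: "drama", 3: "drama", 4: "comedy"}
final = rerank([1, 2, 3, 4], GENRES, watched={2})
```

Because the rules are cheap set lookups and counters, this stage fits comfortably in the 5-10ms budget.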

Real-time personalization:

- Recent view history updates in Kafka
- Streamed to feature store
- Available for next recommendation request

### Scale Considerations

Handling 100K QPS:

- Horizontal scaling of serving tier
- Caching frequent recommendations
- Precomputed recommendations for popular users
- CDN for static content

Data scale:

- Millions of users, millions of items
- Billions of interactions
- Terabytes of training data

## Practice Problems

To prepare for ML system design interviews, practice these common problems:

1. **Design a fraud detection system for financial transactions**
   - Key challenges: Real-time detection, imbalanced data, adversarial behavior

2. **Design a search ranking system**
   - Key challenges: Query understanding, real-time ranking, personalization

3. **Design an ad click prediction system**
   - Key challenges: High QPS, explore-exploit tradeoff, feedback delay

4. **Design a content moderation system**
   - Key challenges: Multiple modalities (text, image, video), false positive cost, scale

5. **Design a chatbot/conversational AI system**
   - Key challenges: Context management, response generation, safety

For each problem, practice:

- Asking clarifying questions
- Drawing the architecture
- Explaining component choices
- Discussing tradeoffs and alternatives

## Conclusion

ML system design requires thinking about the full lifecycle: from data collection through serving and monitoring. The key is demonstrating that you understand:

1. The unique challenges of ML systems (data quality, model freshness, feedback loops)
2. Standard architectural patterns for each component
3. Tradeoffs between different approaches
4. How to scale and maintain ML systems in production

Practice by designing systems end-to-end and being prepared to go deep on any component. The best candidates show both breadth (understanding the full picture) and depth (detailed knowledge of specific areas).
