# System Design for ML Engineers
Essential system design patterns for ML systems. Covers scalability, monitoring, data pipelines, and model serving architectures.
System design interviews for ML engineers are increasingly common at top tech companies. Unlike traditional system design, these interviews focus on the unique challenges of building, deploying, and maintaining machine learning systems at scale. This guide covers the essential patterns and concepts you need to master.
## ML System Design Interview Format
### What to Expect
Typical structure (45-60 minutes):
- 5 min: Problem statement and clarifying questions
- 10 min: High-level design and requirements
- 20 min: Deep dive into components
- 10 min: Scaling, monitoring, and improvements
- 5 min: Questions for the interviewer
Common problem types:
- Design a recommendation system (Netflix, Spotify, YouTube)
- Design a fraud detection system
- Design a search ranking system
- Design an ad click prediction system
- Design a content moderation system
### How You're Evaluated
Technical depth:
- Understanding of ML system components
- Awareness of tradeoffs
- Appropriate algorithm selection

System thinking:
- End-to-end pipeline design
- Scalability considerations
- Fault tolerance and monitoring

Communication:
- Clear explanation of choices
- Structured approach
- Handling ambiguity well
## The ML System Design Framework
### Step 1: Clarify Requirements
Functional requirements:
- What should the system do?
- Who are the users?
- What are the inputs and outputs?

Non-functional requirements:
- Latency requirements (real-time vs. batch?)
- Scale (QPS, data volume, model size?)
- Accuracy vs. other tradeoffs?

Constraints:
- Budget limitations
- Existing infrastructure
- Timeline for delivery
Example questions to ask:
For a recommendation system:
- "What are we recommending? Products, content, users?"
- "What's the expected QPS for recommendations?"
- "How quickly should recommendations reflect new user behavior?"
- "What's more important: relevance, diversity, or novelty?"
### Step 2: Define Metrics
Online metrics (production):
- Click-through rate (CTR)
- Conversion rate
- User engagement (time spent, return visits)
- Revenue per user
Offline metrics (development):
- Precision, recall, F1 for classification
- RMSE, MAE for regression
- AUC-ROC for ranking
- NDCG for recommendations
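As a concrete example of an offline ranking metric, NDCG can be sketched in a few lines. This is a minimal illustration over graded relevance labels, not a production implementation:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: later positions are down-weighted by log2(rank + 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=None):
    # NDCG = DCG of the actual ranking / DCG of the ideal (sorted) ranking.
    actual = ranked_relevances[:k] if k else ranked_relevances
    ideal = sorted(ranked_relevances, reverse=True)
    ideal = ideal[:k] if k else ideal
    ideal_dcg = dcg(ideal)
    return dcg(actual) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfectly ordered list scores 1.0; pushing relevant items down the list lowers the score, which is exactly the behavior you want for recommendation quality.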
Business metrics:
- Revenue impact
- User satisfaction (NPS)
- Operational cost
The key insight:
Always tie ML metrics to business outcomes: "We optimize for CTR because a 1% CTR improvement translates into roughly $X million in additional annual revenue."
### Step 3: High-Level Architecture
Standard components:
1. **Data Collection & Storage**
   - Event logging
   - Data lake/warehouse
   - Feature store
2. **Feature Engineering Pipeline**
   - Batch feature computation
   - Real-time feature serving
   - Feature versioning
3. **Training Pipeline**
   - Data preprocessing
   - Model training
   - Hyperparameter tuning
   - Model versioning
4. **Model Serving**
   - Inference service
   - A/B testing infrastructure
   - Fallback mechanisms
5. **Monitoring & Feedback**
   - Performance monitoring
   - Data quality monitoring
   - Feedback loops
## Deep Dive: Key Components
### Data Pipeline Design
Batch data pipeline:
```
Raw Events → ETL → Data Warehouse → Feature Extraction → Feature Store
                                            ↓
                        Training Data Generation → Model Training
```
Real-time data pipeline:
```
Events → Stream Processing (Kafka/Kinesis) → Real-time Features
                                                    ↓
                                          Feature Store (online)
```
Key considerations:
- Data freshness requirements
- Backfill capabilities
- Data quality checks
- Schema evolution
Common tools:
- Batch: Spark, Airflow, dbt
- Streaming: Kafka, Flink, Kinesis
- Storage: S3, BigQuery, Snowflake
- Feature store: Feast, Tecton, Vertex AI Feature Store
### Feature Engineering at Scale
Batch features:
- User historical statistics
- Item popularity metrics
- Time-aggregated features
- Complex SQL transformations
Real-time features:
- Current session activity
- Recent clicks/views
- Time since last action
- Real-time counters
Feature store benefits:
- Consistency between training and serving
- Feature sharing across teams
- Point-in-time correctness
- Feature versioning and lineage
**Design pattern: Feature serving**
```
Request → Feature Store (online) → Concatenate features
                                          ↓
                          Model Service → Prediction
```
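A minimal sketch of this pattern in Python. The in-memory dicts stand in for an online feature store such as Redis or Feast, and all IDs and feature names here are hypothetical:

```python
# Stand-ins for an online feature store keyed by entity ID (hypothetical data).
USER_FEATURES = {"u1": {"avg_watch_time": 42.0, "country_us": 1.0}}
ITEM_FEATURES = {"i9": {"popularity": 0.8, "genre_drama": 1.0}}

def build_feature_vector(user_id, item_id, feature_order):
    """Fetch user and item features, then concatenate them into the
    fixed-order vector the model was trained on (missing values -> 0.0)."""
    merged = {**USER_FEATURES.get(user_id, {}), **ITEM_FEATURES.get(item_id, {})}
    return [merged.get(name, 0.0) for name in feature_order]
```

The fixed `feature_order` is what prevents training/serving skew: the same ordered schema must be used both to build training rows and to assemble serving requests.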
### Training Pipeline Design
Components:
1. Data validation
2. Feature transformation
3. Model training
4. Hyperparameter optimization
5. Model evaluation
6. Model registry
Training patterns:
Batch retraining:
- Schedule: Daily/weekly
- Full dataset retraining
- Simplest approach
- Good for most use cases
Online learning:
- Continuous updates from new data
- Lower latency to adapt
- More complex to implement
- Good for fast-changing data
Transfer learning/fine-tuning:
- Start with a pretrained model
- Fine-tune on specific data
- Faster training
- Good for LLMs and computer vision
Key considerations:
- Reproducibility (fixed seeds, versioned data)
- Training data sampling strategies
- Handling data imbalance
- Validation set strategy
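One way to make the reproducibility point concrete: derive the sampling seed from the data version, so a rerun on the same snapshot produces an identical training sample. This is a sketch; the versioning scheme is an assumption, not a prescription:

```python
import hashlib
import random

def seeded_rng(data_version, base_seed=42):
    # Deterministic RNG derived from the dataset version string, so
    # retraining on the same snapshot reproduces the same samples.
    digest = hashlib.sha256(f"{data_version}:{base_seed}".encode()).hexdigest()
    return random.Random(int(digest[:16], 16))

def sample_training_rows(rows, n, data_version):
    # Reproducible downsampling tied to the versioned data.
    return seeded_rng(data_version).sample(rows, n)
```

Tying the seed to the data version (rather than hardcoding one global seed) also means different snapshots get different, but still reproducible, samples.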
### Model Serving Architectures
**Pattern 1: Direct Model Serving**
```
Client → API Gateway → Model Service → Response
```
- Simplest architecture
- Single model per request
- Good for low-complexity predictions
**Pattern 2: Two-Stage Retrieval + Ranking**
```
Client → Retrieval (fast, broad) → Ranking (precise, slow) → Response
```
- Common for recommendations/search
- Retrieval: Get top 1000 candidates
- Ranking: Score and sort top 100
- Balance latency and accuracy
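The two stages can be sketched as plain functions. The scoring callables stand in for an ANN index and a learned ranker; the interface is illustrative:

```python
def recommend(user_id, all_items, retrieve_score, rank_score,
              retrieval_k=1000, final_k=100):
    # Stage 1: a cheap score prunes the full catalog to retrieval_k candidates.
    candidates = sorted(all_items,
                        key=lambda item: retrieve_score(user_id, item),
                        reverse=True)[:retrieval_k]
    # Stage 2: the expensive model only ever sees the candidate set.
    ranked = sorted(candidates,
                    key=lambda item: rank_score(user_id, item),
                    reverse=True)
    return ranked[:final_k]
```

The win is that the expensive ranker scores `retrieval_k` items instead of millions, which is what keeps the latency budget feasible.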
**Pattern 3: Ensemble Serving**
```
       → Model A ↘
Client → Model B → Combiner → Response
       → Model C ↗
```
- Multiple models contribute
- Weighted combination or voting
- More robust predictions
- Higher complexity and latency
Serving infrastructure choices:
- TensorFlow Serving
- TorchServe
- Triton Inference Server
- Custom FastAPI/Flask services
Optimization techniques:
- Model quantization (FP32 → INT8)
- Model distillation
- Batching requests
- GPU inference servers
- Caching frequent predictions
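The last technique, caching, can be sketched with a small TTL cache. This is a toy in-process version; production systems would typically use Redis or an LRU with proper eviction:

```python
import time

class PredictionCache:
    """Tiny TTL cache for model scores: repeated requests for the same
    key within the TTL skip the model call entirely."""
    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, timestamp)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]          # fresh hit: no inference call
        value = compute()            # miss or stale: run inference
        self._store[key] = (value, now)
        return value
```

The TTL is the knob that trades staleness against load: popular (user, item) pairs get served from memory while predictions stay at most `ttl_seconds` old.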
### A/B Testing for ML
Why A/B testing is critical:
- Offline metrics don't always correlate with online performance
- Detect unintended consequences
- Measure business impact
A/B testing architecture:
```
User Request → Experiment Assignment → Model A or Model B
                                              ↓
                     Log experiment data → Analysis Pipeline
```
Key considerations:
- Traffic splitting (50/50, 90/10, etc.)
- Statistical significance
- Guardrail metrics (metrics that shouldn't degrade)
- Long-term vs. short-term effects
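Deterministic traffic splitting is commonly done by hashing the user ID together with the experiment name, so each user gets a sticky assignment and different experiments split independently. A sketch; real platforms add layers, holdouts, and exclusion groups:

```python
import hashlib

def assign_variant(user_id, experiment, treatment_fraction=0.5):
    # Hash (experiment, user) into [0, 1): stable per user, and the salt
    # from the experiment name decorrelates buckets across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_fraction else "control"
```

Because the assignment is a pure function of (experiment, user), it needs no database lookup and any service can recompute it consistently.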
Common pitfalls:
- Peeking at results too early
- Novelty effects (users react to the change itself)
- Network effects (users influence each other)
- The multiple testing problem
### Monitoring ML Systems
Model performance monitoring:
- Prediction distribution drift
- Feature drift
- Prediction latency
- Error rates
Data quality monitoring:
- Missing values
- Schema changes
- Distribution shifts
- Anomaly detection
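Distribution-shift checks are often implemented with the Population Stability Index over binned feature values. A minimal sketch, using the common rule-of-thumb thresholds (< 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate):

```python
import math

def psi(expected_counts, actual_counts):
    # Population Stability Index between a reference (training-time) and a
    # current (serving-time) histogram over the same bins.
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, 1e-6)  # epsilon guards against empty bins
        a_pct = max(a / a_total, 1e-6)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score
```

Computing PSI per feature on a schedule, and alerting when it crosses the investigate threshold, is a cheap first line of defense against silent drift.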
Infrastructure monitoring:
- Service health
- Resource utilization
- Queue depths
- Error rates
Alerting strategy:
- Performance degradation alerts
- Data freshness alerts
- Model staleness alerts
- Fallback activation alerts
## Example: Design a Recommendation System
Let's walk through designing a Netflix-style recommendation system.
### Requirements
Functional:
- Generate personalized content recommendations
- Support different surfaces (home page, similar items, etc.)
- Real-time personalization based on recent activity
Non-functional:
- 50ms latency for recommendation serving
- 100K QPS at peak
- Fresh recommendations within 1 hour of new behavior
### High-Level Design
The architecture includes three main layers:
1. Data Pipeline: User Events → Kafka → Spark ETL → Feature Store
2. Training Pipeline: Training Data → Model Training → Model Registry
3. Serving Layer: User Request → Candidate Gen → Ranking Model → Re-rank Rules → Response
### Component Deep Dive
Candidate Generation:
- Purpose: Quickly find ~1000 potentially relevant items from millions
- Approach: Collaborative filtering, content-based filtering, or both
- Latency budget: 10-15ms
Ranking Model:
- Purpose: Score candidates based on likelihood of engagement
- Approach: Deep learning model with user and item features
- Latency budget: 20-25ms
Re-ranking:
- Purpose: Apply business rules and diversification
- Examples: Remove watched items, ensure genre diversity, apply content policies
- Latency budget: 5-10ms
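A sketch of the re-ranking step: drop already-watched items and cap per-genre counts while otherwise preserving model order. The item tuples and the genre cap are illustrative choices, not a fixed interface:

```python
def rerank(scored_items, watched, max_per_genre=2):
    """Apply business rules to a model-ordered candidate list.
    scored_items: (item_id, genre) tuples, highest model score first."""
    result, genre_counts = [], {}
    for item_id, genre in scored_items:
        if item_id in watched:
            continue  # rule: never recommend already-watched content
        if genre_counts.get(genre, 0) >= max_per_genre:
            continue  # rule: genre diversity cap
        result.append(item_id)
        genre_counts[genre] = genre_counts.get(genre, 0) + 1
    return result
```

Because the rules are simple filters over an already-sorted list, this step fits comfortably in a 5-10ms budget even for thousands of candidates.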
Real-time personalization:
- Recent view history updates in Kafka
- Streamed to the feature store
- Available for the next recommendation request
### Scale Considerations
Handling 100K QPS:
- Horizontal scaling of the serving tier
- Caching frequent recommendations
- Precomputed recommendations for popular users
- CDN for static content
Data scale:
- Millions of users, millions of items
- Billions of interactions
- Terabytes of training data
## Practice Problems
To prepare for ML system design interviews, practice these common problems:
1. **Design a fraud detection system for financial transactions**
   - Key challenges: Real-time detection, imbalanced data, adversarial behavior
2. **Design a search ranking system**
   - Key challenges: Query understanding, real-time ranking, personalization
3. **Design an ad click prediction system**
   - Key challenges: High QPS, explore-exploit tradeoff, feedback delay
4. **Design a content moderation system**
   - Key challenges: Multiple modalities (text, image, video), false positive cost, scale
5. **Design a chatbot/conversational AI system**
   - Key challenges: Context management, response generation, safety
For each problem, practice:
- Asking clarifying questions
- Drawing the architecture
- Explaining component choices
- Discussing tradeoffs and alternatives
## Conclusion
ML system design requires thinking about the full lifecycle: from data collection through serving and monitoring. The key is demonstrating that you understand:
1. The unique challenges of ML systems (data quality, model freshness, feedback loops)
2. Standard architectural patterns for each component
3. Tradeoffs between different approaches
4. How to scale and maintain ML systems in production
Practice by designing systems end-to-end and being prepared to go deep on any component. The best candidates show both breadth (understanding the full picture) and depth (detailed knowledge of specific areas).