Deployment Patterns #

Different ML deployment scenarios require different strategies. Patterns help manage risk and enable gradual rollout.

Common Deployment Scenarios #

New Product/Service #

When offering ML capabilities for the first time, use gradual ramp-up patterns such as canary deployment.

Automating Human Tasks #

When humans currently perform the task (e.g., inspectors), shadow mode leverages their existing judgments as ground truth.

Model Replacement #

When replacing an existing ML system with an improved version, use canary or blue-green patterns.

Key Patterns #

Shadow Mode Deployment #

  • Run ML system in parallel with human operators
  • AI outputs are logged but not used for decisions
  • Verify AI accuracy against human judgments
  • Ideal when replacing existing human processes

graph TD
    A[Input Data] --> B[Human Inspector]
    A --> C[AI System]
    B --> D[Decision]
    C --> E[Logged Predictions]

Canary Deployment #

  • Start with a small percentage of traffic (e.g., 5%)
  • Monitor performance metrics closely
  • Gradually increase traffic if performing well
  • Easy rollback if issues emerge
  • Named after the “canary in a coal mine”: a small, closely monitored slice of traffic surfaces problems early

Blue-Green Deployment #

  • Maintain two production environments
  • Blue: Current/previous system
  • Green: New ML system
  • Switch traffic instantly from blue to green
  • Keep blue running for immediate rollback
  • Full or gradual traffic shifting

graph TD
    A[Router] --> B{Load Balancer}
    B --> C[Blue Environment<br/>Old System]
    B --> D[Green Environment<br/>New ML System]

Pattern Selection #

  • New systems: Start with canary deployment
  • Replacing humans: Use shadow mode for validation
  • System updates: Blue-green for instant rollback capability
  • High-risk applications: Prefer human-in-the-loop patterns

Implementation Considerations #

Gradual Rollout Strategy #

  • Start with 1-5% traffic in canary deployments
  • Monitor both ML metrics and system performance
  • Implement automated rollback triggers (see the sketch after this list)
  • Use feature flags for fine-grained control
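
As a sketch of the automated rollback trigger mentioned above: a watcher polls canary metrics and rolls back on a threshold breach. The thresholds, the metrics dict shape, and the get_canary_metrics/rollback/promote callables are assumptions to be wired into your own tooling.

import time

ERROR_RATE_LIMIT = 0.02      # assumed SLO: roll back if >2% of canary requests error
LATENCY_P99_LIMIT_MS = 250   # assumed SLO: roll back if p99 latency exceeds 250 ms

def watch_canary(get_canary_metrics, rollback, promote,
                 check_interval_s=60, checks=30):
    """Poll canary metrics; roll back on a breach, promote after clean polls.

    get_canary_metrics is assumed to return {"error_rate": ..., "p99_ms": ...};
    rollback and promote are callables supplied by your deployment tooling.
    """
    for _ in range(checks):
        metrics = get_canary_metrics()
        if (metrics["error_rate"] > ERROR_RATE_LIMIT
                or metrics["p99_ms"] > LATENCY_P99_LIMIT_MS):
            rollback()
            return "rolled_back"
        time.sleep(check_interval_s)
    promote()
    return "promoted"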

Infrastructure Requirements #

  • Container orchestration (Kubernetes) for blue-green (see the sketch after this list)
  • Load balancers supporting weight-based routing
  • Monitoring and alerting systems for automated rollbacks
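
For blue-green on Kubernetes, one minimal approach is to repoint a Service's label selector from the blue Deployment to the green one. A hedged sketch using the official kubernetes Python client; the service name, namespace, and color labels are assumptions about how your Deployments are labeled (weight-based routing, by contrast, typically requires a service mesh or a weighted load balancer).

from kubernetes import client, config

def switch_service_to_green(service_name="model-serving", namespace="default"):
    # Load credentials from ~/.kube/config; use load_incluster_config() in a pod.
    config.load_kube_config()
    # Patch the Service selector so traffic flows to pods labeled color=green.
    body = {"spec": {"selector": {"app": "model-serving", "color": "green"}}}
    client.CoreV1Api().patch_namespaced_service(service_name, namespace, body)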

Code Examples #

Shadow Mode Implementation #

import requests
import json
from datetime import datetime

class ShadowModePredictor:
    def __init__(self, production_url, shadow_url):
        self.production_url = production_url
        self.shadow_url = shadow_url
        self.log_file = "shadow_predictions.log"

    def predict(self, input_data, human_decision=None):
        # Get human decision for comparison
        if human_decision is None:
            human_decision = self.get_human_judgment(input_data)

        # Shadow prediction (non-blocking)
        try:
            shadow_response = requests.post(
                self.shadow_url,
                json=input_data,
                timeout=1.0  # Short timeout to avoid blocking
            )
            shadow_result = shadow_response.json()
        except requests.RequestException:
            shadow_result = {"error": "shadow_request_failed"}

        # Log shadow prediction with human decision
        log_entry = {
            "timestamp": datetime.now().isoformat(),
            "input": input_data,
            "human_decision": human_decision,
            "shadow_result": shadow_result
        }

        with open(self.log_file, "a") as f:
            f.write(json.dumps(log_entry) + "\n")

        return human_decision

    def get_human_judgment(self, input_data):
        # Placeholder - replace with actual human interface
        return input("Human decision: ")
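
Once the shadow log accumulates, agreement with human judgments can be measured offline. A minimal sketch that reads the JSON-lines log written above; it assumes the shadow service returns a "prediction" field, which is not specified in the class itself.

import json

def shadow_agreement_rate(log_file="shadow_predictions.log"):
    """Fraction of logged cases where the shadow model matched the human."""
    total = agree = 0
    with open(log_file) as f:
        for line in f:
            entry = json.loads(line)
            shadow = entry["shadow_result"]
            if "error" in shadow:
                continue  # skip failed shadow calls
            total += 1
            # Assumes the shadow service responds with {"prediction": ...}
            if shadow.get("prediction") == entry["human_decision"]:
                agree += 1
    return agree / total if total else 0.0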

Canary Deployment with Feature Flags #

import random

class CanaryDeployment:
    def __init__(self, v1_predictor, v2_predictor, canary_percentage=5):
        self.v1 = v1_predictor
        self.v2 = v2_predictor
        self.canary_percent = canary_percentage
        self.metrics = {"v1_requests": 0, "v2_requests": 0, "errors": 0}

    def predict(self, input_data):
        # Route to canary based on percentage
        if random.randint(1, 100) <= self.canary_percent:
            return self._predict_v2(input_data)
        else:
            return self._predict_v1(input_data)

    def _predict_v1(self, input_data):
        try:
            result = self.v1.predict(input_data)
            self.metrics["v1_requests"] += 1
            return result
        except Exception as e:
            self.metrics["errors"] += 1
            raise

    def _predict_v2(self, input_data):
        try:
            result = self.v2.predict(input_data)
            self.metrics["v2_requests"] += 1
            return result
        except Exception as e:
            self.metrics["errors"] += 1
            # Fallback to v1 on errors
            return self._predict_v1(input_data)

    def get_metrics(self):
        return self.metrics.copy()

Simple Blue-Green Toggle #

class BlueGreenDeployment:
    def __init__(self, blue_predictor, green_predictor):
        self.blue = blue_predictor
        self.green = green_predictor
        self.active = "blue"  # or "green"

    def switch_to_green(self):
        # Run health checks on green
        if self._health_check(self.green):
            self.active = "green"
            return True
        return False

    def switch_to_blue(self):
        self.active = "blue"
        return True

    def predict(self, input_data):
        predictor = self.green if self.active == "green" else self.blue
        return predictor.predict(input_data)

    def _health_check(self, predictor):
        try:
            # Simple health check with test input
            result = predictor.predict({"test": "data"})
            return isinstance(result, dict) and "prediction" in result
        except Exception:
            return False

Performance Optimization Techniques #

Model Compression & Acceleration #

  • Quantization: Reduce precision (FP32 → INT8) for faster inference (sketched below)
  • Pruning: Remove redundant parameters to reduce model size
  • Knowledge Distillation: Train smaller models to mimic larger ones
  • ONNX Runtime: Cross-platform acceleration with optimized kernels
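
As an illustration of the quantization bullet above, a minimal PyTorch dynamic-quantization sketch; the two-layer model is a stand-in for your trained network.

import torch
import torch.nn as nn

# Stand-in model; substitute your trained network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
model.eval()

# Dynamic quantization: weights stored as INT8, activations quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement for CPU inference.
with torch.no_grad():
    output = quantized(torch.randn(1, 128))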

Infrastructure Optimization #

  • GPU Utilization: Batch requests for GPU-accelerated prediction
  • Caching: Cache frequent predictions to reduce compute load (sketched below)
  • Load Balancing: Distribute requests across multiple model instances
  • Resource Scaling: Auto-scale based on request patterns
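
A minimal sketch of the caching idea above: an LRU cache keyed by a hash of the JSON-serialized input. It assumes inputs are JSON-serializable and the model is deterministic; the wrapped predictor interface mirrors the earlier examples.

import hashlib
import json
from collections import OrderedDict

class PredictionCache:
    """Small LRU cache for repeated prediction requests."""

    def __init__(self, predictor, max_entries=10_000):
        self.predictor = predictor
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def predict(self, input_data):
        key = hashlib.sha256(
            json.dumps(input_data, sort_keys=True).encode()
        ).hexdigest()
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as recently used
            return self._cache[key]
        result = self.predictor.predict(input_data)
        self._cache[key] = result
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return result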

Latency Reduction Strategies #

  • Edge Deployment: Process data closer to source when possible
  • Request Batching: Group multiple predictions to utilize vectorization (sketched below)
  • Model Warmup: Pre-load models in memory for instant availability
  • Connection Pooling: Reuse connections in high-throughput scenarios
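
A sketch of server-side request batching: incoming requests queue briefly and are flushed to the model as one vectorized call. The predict_batch callable (list of inputs in, list of outputs out) is an assumed interface.

import queue
import threading

class MicroBatcher:
    """Collect requests briefly, then run them as one vectorized batch."""

    def __init__(self, predict_batch, max_batch=32, max_wait_s=0.01):
        self.predict_batch = predict_batch  # callable: list of inputs -> list of outputs
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._queue = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def predict(self, input_data):
        """Blocking call used by request handlers; returns this input's result."""
        done = threading.Event()
        slot = {}
        self._queue.put((input_data, done, slot))
        done.wait()
        return slot["result"]

    def _worker(self):
        while True:
            batch = [self._queue.get()]  # block until the first request arrives
            while len(batch) < self.max_batch:
                try:
                    batch.append(self._queue.get(timeout=self.max_wait_s))
                except queue.Empty:
                    break
            results = self.predict_batch([item[0] for item in batch])
            for (_, done, slot), result in zip(batch, results):
                slot["result"] = result
                done.set()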

Troubleshooting Common Issues #

Performance Degradation Problems #

  • Memory Leaks: Monitor process memory usage over time (sketched below)
  • GPU Memory Fragmentation: Restart services periodically
  • Cold Starts: Implement model warming strategies
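
For the memory-leak bullet above, a small psutil-based check that compares current resident memory against a recorded baseline; the limit is an assumed threshold to wire into your alerting.

import psutil

def check_memory(baseline_mb, limit_mb=2048):
    """Return (current RSS in MB, growth since baseline); alert on breach."""
    rss_mb = psutil.Process().memory_info().rss / (1024 * 1024)
    if rss_mb > limit_mb:
        # Hook into your alerting; a periodic restart is a stopgap, not a fix.
        print(f"ALERT: RSS {rss_mb:.0f} MB exceeds limit of {limit_mb} MB")
    return rss_mb, rss_mb - baseline_mb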

Accuracy Degradation Issues #

  • Data Drift: Compare training vs production distributions (sketched below)
  • Concept Drift: Monitor prediction confidence scores
  • Input Validation: Check for unexpected input formats or ranges
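
For the data-drift bullet, a minimal per-feature check using the two-sample Kolmogorov-Smirnov test from SciPy; the alpha value and the one-feature-at-a-time approach are simplifications.

from scipy.stats import ks_2samp

def detect_feature_drift(train_values, prod_values, alpha=0.01):
    """Two-sample KS test on one numeric feature.

    A small p-value suggests the production distribution has drifted
    from the training distribution.
    """
    statistic, p_value = ks_2samp(train_values, prod_values)
    return {"statistic": statistic, "p_value": p_value, "drifted": p_value < alpha}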

Infrastructure & Reliability Issues #

  • Network Timeouts: Increase timeouts and implement retry logic
  • Resource Constraints: Monitor CPU/GPU utilization and scale accordingly
  • Dependency Failures: Implement circuit breakers for external services (sketched below)
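
A minimal circuit-breaker sketch for the dependency-failures bullet: after repeated failures the breaker opens and fails fast, then retries after a cooldown. The failure count and cooldown are assumed thresholds.

import time

class CircuitBreaker:
    """Stop calling a failing dependency; retry after a cooldown."""

    def __init__(self, call, max_failures=5, reset_after_s=30.0):
        self.call = call
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def __call__(self, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: dependency unavailable")
            self.opened_at = None  # cooldown elapsed; try again ("half-open")
        try:
            result = self.call(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # success closes the circuit
        return result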

Deployment-Rollback Anti-Patterns #

  • Big Bang Deployment: Full switch without validation (avoid!)
  • No Monitoring: Deploying without proper observability
  • Manual Rollbacks: Relying on complex manual processes
  • Ignoring Validation: Skipping automated tests before deployment

Real-World Case Studies #

Netflix Recommendation System #

  • Challenge: Deploying personalized recommendation models serving >100M users
  • Solution: Canary deployments starting with 1% traffic, monitoring engagement metrics
  • Results: 3-5% improvement in user engagement with 99.9% uptime
  • Lessons: Automated rollback on accuracy drop >0.1%; A/B testing integration

Waymo Autonomous Driving #

  • Challenge: Ultra-high-stakes deployment in self-driving vehicles
  • Solution: Extensive shadow mode testing followed by gradual rollout with human monitoring
  • Results: Millions of safe miles driven before full automation in controlled areas
  • Lessons: Safety metrics take precedence over accuracy; continuous human validation

Airbnb Search Ranking #

  • Challenge: Real-time search results for millions of listings worldwide
  • Solution: Blue-green deployment for instant rollback capability, continuous A/B testing
  • Results: 10-20% improvement in booking conversion rates
  • Lessons: Feature flags for gradual feature rollout; comprehensive metrics monitoring

Practical Implementation Framework #

# Example deployment pipeline using MLflow and Kubernetes.
# Helper methods (create_canary_deployment, update_traffic_routing,
# gradual_rollout, rollback_to_previous, collect_metrics) are placeholders
# for infrastructure-specific logic.
import mlflow
import kubernetes

class MLDeploymentPipeline:
    def deploy_with_canary(self, new_model_uri):
        # 1. Deploy to canary environment
        canary_deployment = self.create_canary_deployment(new_model_uri)

        # 2. Route 5% traffic to canary
        self.update_traffic_routing(canary_weight=5)

        # 3. Monitor for 24 hours
        metrics = self.monitor_canary_performance()

        # Promote only if accuracy drop <1% and added latency <50 (ms assumed)
        if metrics["accuracy_drop"] < 0.01 and metrics["latency_increase"] < 50:
            # 4. Gradually roll out to 100%
            self.gradual_rollout(canary_deployment)
            return {"status": "success"}
        else:
            # 5. Rollback to previous version
            self.rollback_to_previous()
            return {"status": "rollback", "reason": "performance_degraded"}

    def monitor_canary_performance(self):
        # Track accuracy, latency, error rates for 24+ hours
        return self.collect_metrics(hours=24)