Deployment Patterns #

Different ML deployment scenarios require different strategies. Patterns help manage risk and enable gradual rollout.

Common Deployment Scenarios #

New Product/Service #

When offering ML capabilities for the first time, use gradual ramp-up patterns such as canary deployment.

Automating Human Tasks #

When humans currently perform the task (e.g., inspectors), shadow mode leverages their existing judgments as ground truth.

Model Replacement #

When replacing an existing ML system with an improved version, use canary or blue-green patterns.

Key Patterns #

Shadow Mode Deployment #

  • Run ML system in parallel with human operators
  • AI outputs are logged but not used for decisions
  • Verify AI accuracy against human judgments
  • Ideal when replacing existing human processes

graph TD
    A[Input Data] --> B[Human Inspector]
    A --> C[AI System]
    B --> D[Decision]
    C --> E[Logged Predictions]

Canary Deployment #

  • Start with a small percentage of traffic (e.g., 5%)
  • Monitor performance metrics closely
  • Gradually increase traffic if performing well
  • Easy rollback if issues emerge
  • Named after the “canary in a coal mine”: a small, closely monitored slice of traffic surfaces problems early

Blue-Green Deployment #

  • Maintain two production environments
  • Blue: Current/previous system
  • Green: New ML system
  • Switch traffic instantly from blue to green
  • Keep blue running for immediate rollback
  • Full or gradual traffic shifting

graph TD
    A[Router] --> B{Load Balancer}
    B --> C[Blue Environment<br/>Old System]
    B --> D[Green Environment<br/>New ML System]

Pattern Selection #

  • New systems: Start with canary deployment
  • Replacing humans: Use shadow mode for validation
  • System updates: Blue-green for instant rollback capability
  • High-risk applications: Prefer human-in-the-loop patterns

Implementation Considerations #

Gradual Rollout Strategy #

  • Start with 1-5% traffic in canary deployments
  • Monitor both ML metrics and system performance
  • Implement automated rollback triggers (see the sketch after this list)
  • Use feature flags for fine-grained control
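
As a sketch of the automated rollback trigger mentioned above: a watcher polls canary metrics and rolls back on a threshold breach. The thresholds, the metrics dict shape, and the get_canary_metrics/rollback/promote callables are assumptions to be wired into your own tooling.

import time

ERROR_RATE_LIMIT = 0.02      # assumed SLO: roll back if >2% of canary requests error
LATENCY_P99_LIMIT_MS = 250   # assumed SLO: roll back if p99 latency exceeds 250 ms

def watch_canary(get_canary_metrics, rollback, promote,
                 check_interval_s=60, checks=30):
    """Poll canary metrics; roll back on a breach, promote after clean polls.

    get_canary_metrics is assumed to return {"error_rate": ..., "p99_ms": ...};
    rollback and promote are callables supplied by your deployment tooling.
    """
    for _ in range(checks):
        metrics = get_canary_metrics()
        if (metrics["error_rate"] > ERROR_RATE_LIMIT
                or metrics["p99_ms"] > LATENCY_P99_LIMIT_MS):
            rollback()
            return "rolled_back"
        time.sleep(check_interval_s)
    promote()
    return "promoted"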

Infrastructure Requirements #

  • Container orchestration (Kubernetes) for blue-green (see the sketch after this list)
  • Load balancers supporting weight-based routing
  • Monitoring and alerting systems for automated rollbacks
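
For blue-green on Kubernetes, one minimal approach is to repoint a Service's label selector from the blue Deployment to the green one. A hedged sketch using the official kubernetes Python client; the service name, namespace, and color labels are assumptions about how your Deployments are labeled (weight-based routing, by contrast, typically requires a service mesh or a weighted load balancer).

from kubernetes import client, config

def switch_service_to_green(service_name="model-serving", namespace="default"):
    # Load credentials from ~/.kube/config; use load_incluster_config() in a pod.
    config.load_kube_config()
    # Patch the Service selector so traffic flows to pods labeled color=green.
    body = {"spec": {"selector": {"app": "model-serving", "color": "green"}}}
    client.CoreV1Api().patch_namespaced_service(service_name, namespace, body)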

Code Examples #

Shadow Mode Implementation #

import requests
import json
from datetime import datetime

class ShadowModePredictor:
    def __init__(self, production_url, shadow_url):
        self.production_url = production_url
        self.shadow_url = shadow_url
        self.log_file = "shadow_predictions.log"

    def predict(self, input_data, human_decision=None):
        # Get human decision for comparison
        if human_decision is None:
            human_decision = self.get_human_judgment(input_data)

        # Shadow prediction (non-blocking)
        try:
            shadow_response = requests.post(
                self.shadow_url,
                json=input_data,
                timeout=1.0  # Short timeout to avoid blocking
            )
            shadow_result = shadow_response.json()
        except requests.RequestException:
            shadow_result = {"error": "shadow_request_failed"}

        # Log shadow prediction with human decision
        log_entry = {
            "timestamp": datetime.now().isoformat(),
            "input": input_data,
            "human_decision": human_decision,
            "shadow_result": shadow_result
        }

        with open(self.log_file, "a") as f:
            f.write(json.dumps(log_entry) + "\n")

        return human_decision

    def get_human_judgment(self, input_data):
        # Placeholder - replace with actual human interface
        return input("Human decision: ")
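
Once the shadow log accumulates, agreement with human judgments can be measured offline. A minimal sketch that reads the JSON-lines log written above; it assumes the shadow service returns a "prediction" field, which is not specified in the class itself.

import json

def shadow_agreement_rate(log_file="shadow_predictions.log"):
    """Fraction of logged cases where the shadow model matched the human."""
    total = agree = 0
    with open(log_file) as f:
        for line in f:
            entry = json.loads(line)
            shadow = entry["shadow_result"]
            if "error" in shadow:
                continue  # skip failed shadow calls
            total += 1
            # Assumes the shadow service responds with {"prediction": ...}
            if shadow.get("prediction") == entry["human_decision"]:
                agree += 1
    return agree / total if total else 0.0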

Canary Deployment with Feature Flags #

import random

class CanaryDeployment:
    def __init__(self, v1_predictor, v2_predictor, canary_percentage=5):
        self.v1 = v1_predictor
        self.v2 = v2_predictor
        self.canary_percent = canary_percentage
        self.metrics = {"v1_requests": 0, "v2_requests": 0, "errors": 0}

    def predict(self, input_data):
        # Route to canary based on percentage
        if random.randint(1, 100) <= self.canary_percent:
            return self._predict_v2(input_data)
        else:
            return self._predict_v1(input_data)

    def _predict_v1(self, input_data):
        try:
            result = self.v1.predict(input_data)
            self.metrics["v1_requests"] += 1
            return result
        except Exception as e:
            self.metrics["errors"] += 1
            raise

    def _predict_v2(self, input_data):
        try:
            result = self.v2.predict(input_data)
            self.metrics["v2_requests"] += 1
            return result
        except Exception as e:
            self.metrics["errors"] += 1
            # Fallback to v1 on errors
            return self._predict_v1(input_data)

    def get_metrics(self):
        return self.metrics.copy()

Simple Blue-Green Toggle #

class BlueGreenDeployment:
    def __init__(self, blue_predictor, green_predictor):
        self.blue = blue_predictor
        self.green = green_predictor
        self.active = "blue"  # or "green"

    def switch_to_green(self):
        # Run health checks on green
        if self._health_check(self.green):
            self.active = "green"
            return True
        return False

    def switch_to_blue(self):
        self.active = "blue"
        return True

    def predict(self, input_data):
        predictor = self.green if self.active == "green" else self.blue
        return predictor.predict(input_data)

    def _health_check(self, predictor):
        try:
            # Simple health check with test input
            result = predictor.predict({"test": "data"})
            return isinstance(result, dict) and "prediction" in result
        except Exception:
            return False

Performance Optimization Techniques #

Model Compression & Acceleration #

  • Quantization: Reduce precision (FP32 → INT8) for faster inference (sketched below)
  • Pruning: Remove redundant parameters to reduce model size
  • Knowledge Distillation: Train smaller models to mimic larger ones
  • ONNX Runtime: Cross-platform acceleration with optimized kernels
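
As an illustration of the quantization bullet above, a minimal PyTorch dynamic-quantization sketch; the two-layer model is a stand-in for your trained network.

import torch
import torch.nn as nn

# Stand-in model; substitute your trained network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
model.eval()

# Dynamic quantization: weights stored as INT8, activations quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement for CPU inference.
with torch.no_grad():
    output = quantized(torch.randn(1, 128))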

Infrastructure Optimization #

  • GPU Utilization: Batch requests for GPU-accelerated prediction
  • Caching: Cache frequent predictions to reduce compute load (sketched below)
  • Load Balancing: Distribute requests across multiple model instances
  • Resource Scaling: Auto-scale based on request patterns
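
A minimal sketch of the caching idea above: an LRU cache keyed by a hash of the JSON-serialized input. It assumes inputs are JSON-serializable and the model is deterministic; the wrapped predictor interface mirrors the earlier examples.

import hashlib
import json
from collections import OrderedDict

class PredictionCache:
    """Small LRU cache for repeated prediction requests."""

    def __init__(self, predictor, max_entries=10_000):
        self.predictor = predictor
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def predict(self, input_data):
        key = hashlib.sha256(
            json.dumps(input_data, sort_keys=True).encode()
        ).hexdigest()
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as recently used
            return self._cache[key]
        result = self.predictor.predict(input_data)
        self._cache[key] = result
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return result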

Latency Reduction Strategies #

  • Edge Deployment: Process data closer to source when possible
  • Request Batching: Group multiple predictions to utilize vectorization (sketched below)
  • Model Warmup: Pre-load models in memory for instant availability
  • Connection Pooling: Reuse connections in high-throughput scenarios
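
A sketch of server-side request batching: incoming requests queue briefly and are flushed to the model as one vectorized call. The predict_batch callable (list of inputs in, list of outputs out) is an assumed interface.

import queue
import threading

class MicroBatcher:
    """Collect requests briefly, then run them as one vectorized batch."""

    def __init__(self, predict_batch, max_batch=32, max_wait_s=0.01):
        self.predict_batch = predict_batch  # callable: list of inputs -> list of outputs
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._queue = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def predict(self, input_data):
        """Blocking call used by request handlers; returns this input's result."""
        done = threading.Event()
        slot = {}
        self._queue.put((input_data, done, slot))
        done.wait()
        return slot["result"]

    def _worker(self):
        while True:
            batch = [self._queue.get()]  # block until the first request arrives
            while len(batch) < self.max_batch:
                try:
                    batch.append(self._queue.get(timeout=self.max_wait_s))
                except queue.Empty:
                    break
            results = self.predict_batch([item[0] for item in batch])
            for (_, done, slot), result in zip(batch, results):
                slot["result"] = result
                done.set()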

Troubleshooting Common Issues #

Performance Degradation Problems #

  • Memory Leaks: Monitor process memory usage over time (sketched below)
  • GPU Memory Fragmentation: Restart services periodically
  • Cold Starts: Implement model warming strategies
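
For the memory-leak bullet above, a small psutil-based check that compares current resident memory against a recorded baseline; the limit is an assumed threshold to wire into your alerting.

import psutil

def check_memory(baseline_mb, limit_mb=2048):
    """Return (current RSS in MB, growth since baseline); alert on breach."""
    rss_mb = psutil.Process().memory_info().rss / (1024 * 1024)
    if rss_mb > limit_mb:
        # Hook into your alerting; a periodic restart is a stopgap, not a fix.
        print(f"ALERT: RSS {rss_mb:.0f} MB exceeds limit of {limit_mb} MB")
    return rss_mb, rss_mb - baseline_mb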

Accuracy Degradation Issues #

  • Data Drift: Compare training vs production distributions (sketched below)
  • Concept Drift: Monitor prediction confidence scores
  • Input Validation: Check for unexpected input formats or ranges
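
For the data-drift bullet, a minimal per-feature check using the two-sample Kolmogorov-Smirnov test from SciPy; the alpha value and the one-feature-at-a-time approach are simplifications.

from scipy.stats import ks_2samp

def detect_feature_drift(train_values, prod_values, alpha=0.01):
    """Two-sample KS test on one numeric feature.

    A small p-value suggests the production distribution has drifted
    from the training distribution.
    """
    statistic, p_value = ks_2samp(train_values, prod_values)
    return {"statistic": statistic, "p_value": p_value, "drifted": p_value < alpha}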

Infrastructure & Reliability Issues #

  • Network Timeouts: Increase timeouts and implement retry logic
  • Resource Constraints: Monitor CPU/GPU utilization and scale accordingly
  • Dependency Failures: Implement circuit breakers for external services (sketched below)
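
A minimal circuit-breaker sketch for the dependency-failures bullet: after repeated failures the breaker opens and fails fast, then retries after a cooldown. The failure count and cooldown are assumed thresholds.

import time

class CircuitBreaker:
    """Stop calling a failing dependency; retry after a cooldown."""

    def __init__(self, call, max_failures=5, reset_after_s=30.0):
        self.call = call
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def __call__(self, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: dependency unavailable")
            self.opened_at = None  # cooldown elapsed; try again ("half-open")
        try:
            result = self.call(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # success closes the circuit
        return result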

Deployment-Rollback Anti-Patterns #

  • Big Bang Deployment: Full switch without validation (avoid!)
  • No Monitoring: Deploying without proper observability
  • Manual Rollbacks: Relying on complex manual processes
  • Ignoring Validation: Skipping automated tests before deployment

Real-World Case Studies #

Netflix Recommendation System #

  • Challenge: Deploying personalized recommendation models serving >100M users
  • Solution: Canary deployments starting with 1% traffic, monitoring engagement metrics
  • Results: 3-5% improvement in user engagement with 99.9% uptime
  • Lessons: Automated rollback on accuracy drop >0.1%; A/B testing integration

Waymo Autonomous Driving #

  • Challenge: Ultra-high-stakes deployment in self-driving vehicles
  • Solution: Extensive shadow mode testing followed by gradual rollout with human monitoring
  • Results: Millions of safe miles driven before full automation in controlled areas
  • Lessons: Safety metrics take precedence over accuracy; continuous human validation

Airbnb Search Ranking #

  • Challenge: Real-time search results for millions of listings worldwide
  • Solution: Blue-green deployment for instant rollback capability, continuous A/B testing
  • Results: 10-20% improvement in booking conversion rates
  • Lessons: Feature flags for gradual feature rollout; comprehensive metrics monitoring

Practical Implementation Framework #

# Example deployment pipeline using MLflow and Kubernetes.
# Helper methods (create_canary_deployment, update_traffic_routing,
# gradual_rollout, rollback_to_previous, collect_metrics) are placeholders
# for infrastructure-specific logic.
import mlflow
import kubernetes

class MLDeploymentPipeline:
    def deploy_with_canary(self, new_model_uri):
        # 1. Deploy to canary environment
        canary_deployment = self.create_canary_deployment(new_model_uri)

        # 2. Route 5% traffic to canary
        self.update_traffic_routing(canary_weight=5)

        # 3. Monitor for 24 hours
        metrics = self.monitor_canary_performance()

        # Promote only if accuracy drop <1% and added latency <50 (ms assumed)
        if metrics["accuracy_drop"] < 0.01 and metrics["latency_increase"] < 50:
            # 4. Gradually roll out to 100%
            self.gradual_rollout(canary_deployment)
            return {"status": "success"}
        else:
            # 5. Rollback to previous version
            self.rollback_to_previous()
            return {"status": "rollback", "reason": "performance_degraded"}

    def monitor_canary_performance(self):
        # Track accuracy, latency, error rates for 24+ hours
        return self.collect_metrics(hours=24)