Design Video Conferencing System #
Problem Statement #
Design a scalable video conferencing platform that supports real-time audio/video communication for multiple participants. The system must handle varying network conditions, manage room-based sessions, and provide high-quality streams with minimal latency while supporting screen sharing and messaging features.
Requirements #
Functional Requirements #
- Create and join video conference rooms with unique room IDs
- Real-time video/audio streaming for multiple participants (up to 100 per room)
- Screen sharing and presentation capabilities
- Text chat within conference rooms
- Recording of conference sessions
- User authentication and room access control
Non-Functional Requirements #
- Low latency (<150ms) for audio/video streams
- High availability with 99.95% uptime
- Support for HD video (1080p) with adaptive bitrate
- Encrypted media streams (transport encryption for all calls, end-to-end encryption where feasible)
- Scalability to handle millions of concurrent users
Key Constraints & Assumptions #
- Scale assumptions: 10M daily active users, 500k concurrent users, 50k active conference rooms; an average of 10 participants per room during peak hours ^[Assumption: Based on video conferencing growth patterns; 500k concurrent users across 50k rooms gives the 10-participant average.]
- SLA: 99.95% availability, p99 latency <150ms for streams, <500ms for joins
- Network conditions: Handle poor connectivity with adaptive quality (240p up to 1080p), support various bandwidths (100Kbps to 10Mbps)
- Participant limits: Rooms scale from 2 to 100 participants, with different quality settings for larger rooms
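These constraints drive the media-server architecture choices below. A back-of-envelope sketch (a hedged estimate; the ~2.5 Mbps figure for a 1080p stream is an assumption) shows why pure stream forwarding breaks down as rooms grow:

```python
# Back-of-envelope bandwidth estimate for an SFU-style room.
# Assumption: each participant uplinks one 1080p stream at ~2.5 Mbps
# and subscribes to every other participant's stream.

def sfu_room_bandwidth_mbps(participants: int, bitrate_mbps: float = 2.5):
    """Return (per-client downlink, total server egress) in Mbps."""
    downlink = (participants - 1) * bitrate_mbps  # streams each client receives
    server_egress = participants * downlink       # server forwards every stream to everyone else
    return downlink, server_egress

print(sfu_room_bandwidth_mbps(10))   # → (22.5, 225.0)
print(sfu_room_bandwidth_mbps(100))  # → (247.5, 24750.0) -- infeasible without mixing/simulcast
```

At 100 participants the naive numbers are unworkable, which is why the design falls back to MCU mixing or reduced-quality layers for large rooms.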
High-Level Design #
The system uses a hybrid peer-to-peer and server-side architecture, with WebRTC for direct browser communication and SFU/MCU servers for multi-party calls. A signaling server manages room state and participant coordination.
graph TD
A[Participant A] --> B[Signaling Server]
A --> C[STUN/TURN Server]
A --> D[SFU Server]
E[Participant B] --> D
E --> B
F[Participant C] --> G[MCU Server]
B --> H[Room Management Service]
H --> I[Redis Cache]
H --> J[PostgreSQL DB]
D --> K[WebRTC Gateway]
L[Load Balancer] --> D
L --> G
M[Media Server Cluster] --> N[NATS Message Bus]
N --> O[Chat Service]
N --> P[Recording Service]
^[Mermaid diagram showing hybrid P2P-SFU architecture for scalable video conferencing.]
Data Model #
- Rooms: Relational storage (PostgreSQL) with room_id, participants list, settings, creation_time
- Participants: Cached in Redis for real-time presence, with session state and media capabilities
- Messages: Time-series database for chat history and events
- Recordings: Object storage (S3) for video files with metadata in PostgreSQL
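The room and participant records above might be modeled like this (a minimal sketch; field names and the in-memory structure are illustrative assumptions, with PostgreSQL/Redis as the actual backing stores):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Participant:
    # Mirrors the "stream_capabilities" in the join API below
    user_id: str
    video: bool = True
    audio: bool = True

@dataclass
class Room:
    room_id: str
    max_participants: int = 50
    participants: dict[str, Participant] = field(default_factory=dict)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def join(self, p: Participant) -> bool:
        if len(self.participants) >= self.max_participants:
            return False  # room is full; reject the join
        self.participants[p.user_id] = p
        return True

room = Room("abc123", max_participants=2)
print(room.join(Participant("u1")), room.join(Participant("u2")), room.join(Participant("u3")))
# → True True False
```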
API Design #
WebSocket-based signaling with REST APIs:
- POST /api/v1/rooms - Create room:
{"name": "Team Meeting", "max_participants": 50}
→ {"roomId": "abc123", "join_url": "https://vc.com/join/abc123"}
- POST /api/v1/rooms/{roomId}/join - Join room:
{"userId": "user1", "stream_capabilities": {"video": true, "audio": true}}
→ WebSocket connection established
- WebSocket events for WebRTC signaling:
{"type": "offer", "sdp": "..."}
{"type": "ice_candidate", "candidate": "..."}
- POST /api/v1/rooms/{roomId}/record - Start recording:
{"duration_minutes": 60}
→ {"recordingId": "rec001", "status": "started"}
- GET /api/v1/rooms/{roomId}/chat - Fetch chat messages with pagination
^[APIs use JWT authentication, WebSockets maintain persistent connections for real-time signaling.]
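The signaling server's core job is relaying the offer/answer/ICE events above between peers without interpreting them. A minimal in-memory relay sketch (hypothetical structure; a real deployment would use persistent WebSockets, e.g. Socket.IO, rather than lists as outboxes):

```python
import json

# room_id -> {peer_id: outbox of pending signaling messages}
rooms: dict[str, dict[str, list]] = {}

def join(room_id: str, peer_id: str) -> None:
    rooms.setdefault(room_id, {})[peer_id] = []

def signal(room_id: str, sender: str, message: dict) -> None:
    """Relay a signaling message to every other peer in the room.
    The server never inspects the SDP; it only forwards it."""
    for peer_id, outbox in rooms[room_id].items():
        if peer_id != sender:
            outbox.append(json.dumps({"from": sender, **message}))

join("abc123", "A")
join("abc123", "B")
signal("abc123", "A", {"type": "offer", "sdp": "..."})
print(rooms["abc123"]["B"][0])  # → {"from": "A", "type": "offer", "sdp": "..."}
```

Keeping the server SDP-agnostic is what lets the same relay handle offers, answers, and ICE candidates alike.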
Detailed Design #
- Signaling Server: Node.js with Socket.IO for WebRTC signaling, manages room state and participant discovery
- Media Servers: Janus SFUs for selective forwarding (small groups), MCUs for larger rooms (mixing streams)
- STUN/TURN Servers: Coturn for NAT traversal; STUN resolves most connections directly, with TURN relaying the remainder
- Room Management: Service for room lifecycle, participant limits, and access control policies
- WebRTC Implementation: Browser-native for direct peer connections, server-side transcoding when needed
- Bandwidth Adaptation: Adaptive bitrate streaming with SVC (Scalable Video Coding) for quality adjustment
- Security: DTLS-SRTP for media encryption, room passwords/tokens for access control
- Caching: Redis for room state and participant presence, with pub/sub for real-time updates
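The bandwidth-adaptation step can be sketched as picking the highest simulcast/SVC layer that fits the receiver's estimated bandwidth (the layer bitrates and the 90% headroom factor are assumptions, not values from a real SFU):

```python
# Simulcast/SVC layer selection sketch: choose the best quality layer
# that fits within the receiver's estimated bandwidth.

LAYERS = [            # (label, bitrate in kbps), lowest quality first
    ("180p", 150),
    ("360p", 500),
    ("720p", 1500),
    ("1080p", 2500),
]

def select_layer(estimated_kbps: float, headroom: float = 0.9):
    """Spend only `headroom` of the bandwidth estimate, leaving slack
    for audio and estimation error; always fall back to the lowest layer."""
    budget = estimated_kbps * headroom
    chosen = LAYERS[0]
    for label, kbps in LAYERS:
        if kbps <= budget:
            chosen = (label, kbps)
    return chosen

print(select_layer(2000))  # → ('720p', 1500)
print(select_layer(100))   # → ('180p', 150)
```

The SFU re-runs this selection per subscriber as bandwidth estimates change, which is how one uplink stream serves clients on very different networks.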
Scalability & Bottlenecks #
- Horizontal Scaling: Media server autoscaling based on room count and participant numbers
- Load Distribution: Geographic load balancers route users to nearest media servers for reduced latency
- Participant Limits: SFU for <10 participants, MCU/switching for larger rooms to reduce bandwidth
- Caching Strategy: Distributed Redis clusters for room state, 99% hit rate for active room data
- Bottlenecks: CPU-intensive transcoding on media servers; mitigated with GPU acceleration and workload distribution
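Load distribution across the media server fleet needs every participant of a room to land on the same SFU. One common way to get that without shared state is rendezvous (highest-random-weight) hashing; a sketch (server names are hypothetical, and a real router would also weight by region and current load):

```python
import hashlib

# Rendezvous hashing sketch: deterministically pin a room to one media
# server so all participants of that room connect to the same SFU.

SERVERS = ["sfu-us-east-1", "sfu-us-west-1", "sfu-eu-west-1"]

def pick_server(room_id: str, servers=SERVERS) -> str:
    def weight(server: str) -> int:
        # Hash (room, server) pairs; the server with the highest hash wins.
        h = hashlib.sha256(f"{room_id}:{server}".encode()).hexdigest()
        return int(h, 16)
    return max(servers, key=weight)

# Every caller computes the same answer for the same room:
print(pick_server("abc123") == pick_server("abc123"))  # → True
```

Unlike modulo hashing, adding or removing a server only remaps the rooms that hashed to that server, which keeps autoscaling events from reshuffling every active call.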
Trade-offs & Alternatives #
- SFU vs MCU: SFU preserves per-stream quality but increases client downlink bandwidth, while MCU reduces client bandwidth but adds latency and server CPU cost for mixing
- P2P vs Server-assisted: P2P minimizes server load but breaks down at scale, while server-assisted is more complex to operate but scales to large rooms
- Recording Options: Server-side recording guarantees consistent quality, while client-side recording is more private but varies by device
- Persistent vs Ephemeral Rooms: Ephemeral rooms reduce storage needs, while persistent rooms enable meeting history and logs
Future Improvements #
- Integration with calendars for scheduling
- AI-powered noise cancellation and background blur
- Virtual backgrounds and avatars for privacy
- Breakout rooms for larger meetings
- Live transcription and translation
Interview Talking Points #
- Explain SFU/MCU choice: SFU for scalability with selective forwarding vs. MCU for large rooms with mixing
- Discuss WebRTC complexity: Browser-native enables direct peer connections but requires signaling servers
- Address latency: Geographic distribution and peer selection minimize round-trip times
- Compare P2P vs Server: full-mesh P2P scales poorly beyond 5-6 participants, while server-based forwarding stays reliable at scale
- Handle network issues: Adaptive bitrate streaming maintains continuity in poor conditions
- Security approach: End-to-end encryption with secure key exchange for private communications
- Bottleneck mitigation: Horizontal scaling and workload distribution handle concurrent user spikes
- Quality vs Scale: Trading off resolution/frame rate vs. participant count in large rooms