Design Top K Heavy Hitters #
Problem Statement #
Design a system that identifies and returns the top K most frequent items (heavy hitters) in a continuous data stream over a specified time interval. It must process high-volume streams in real time, supporting applications such as trending services, analytics, and DDoS detection on platforms handling billions of events daily.
Requirements #
Functional Requirements #
- Identify and retrieve the top K most frequent items (e.g., top 10 viewed videos) for a given time window (e.g., last 5 minutes, 1 hour).
- Support configurable time intervals and values of K.
- Process streaming data continuously without assuming bounded data size.
- Provide both approximate (fast) and exact (slow) results based on query needs.
Non-Functional Requirements #
- High scalability to handle millions of events per second.
- Availability with fault tolerance to hardware failures.
- Low-latency retrieval (tens of milliseconds for approximate results, seconds for exact).
- High accuracy for exact results; acceptable approximation for fast results.
Key Constraints & Assumptions #
- Scale assumptions: 10 billion events/day, peak throughput of 500k events/sec; support K up to 1000 and time windows from 1 minute to 24 hours. ^[Assumption: Reasonable scale for major platforms like YouTube.]
- SLA: 99.9% availability, p99 latency <100ms for fast queries.
- Data size: Item IDs fit in 64-bit integers or short strings; no item metadata storage required.
- Time precision: Granular timestamps for events; system clocks synchronized via NTP.
- Accuracy tolerance: Fast path allows ~5% overestimation error; slow path requires exact counts. ^[Assumption: Based on typical Count-Min Sketch error bounds.]
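The accuracy tolerance above translates directly into Count-Min Sketch dimensions. A quick back-of-envelope sketch of the standard sizing formulas (the epsilon/delta values below are illustrative, not prescribed by the requirements):

```python
import math

def cms_dimensions(epsilon: float, delta: float) -> tuple[int, int]:
    """Width/depth for a Count-Min Sketch so that each estimate overcounts
    by at most epsilon * N (N = total events seen) with probability >= 1 - delta."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1.0 / delta))
    return width, depth

# Example: 0.01% overcount with 99.9% confidence.
w, d = cms_dimensions(epsilon=1e-4, delta=1e-3)
memory_bytes = w * d * 8  # 64-bit counters -> roughly 1.5 MB per sketch
```

A sketch this small fits comfortably in processor memory, which is why the fast path can keep one sketch per time window per node.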
High-Level Design #
The system uses a Lambda architecture with fast and slow processing paths, coupled with message queues and distributed storage. Key components include:
- API Gateway: Entry point for event ingestion and queries.
- Load Balancer & Processors: Distribute events to processors for aggregation.
- Distributed Messaging (Kafka): Handles high-volume event streams.
- Fast Path Processors: Use Count-Min Sketch for approximate real-time results.
- Slow Path Processors & MapReduce: Batch processing for exact results over longer windows.
- Storage Layer: Distributed file system (e.g., HDFS) and cache for quick access.
- Query Service: Merges results from fast/slow paths for final delivery.
```mermaid
graph TD
    A[User Events] --> B[API Gateway]
    B --> C[Load Balancer]
    C --> D[Event Processors]
    D --> E[Kafka Topics]
    E --> F[Fast Path: Count-Min Sketch]
    F --> G[Cache/Filter Results]
    E --> H[Slow Path: Data Partitioner]
    H --> I[Partition Processors]
    I --> J[HDFS Storage]
    J --> K[MapReduce Jobs]
    K --> L[Exact Top K]
    M[Query Requests] --> B
    B --> N[Query Service]
    G --> N
    L --> N
    N --> O[Top K Response]
```
^[Mermaid diagram illustrating the flow from event ingestion to query response.]
Data Model #
- Event Schema: `{item_id: string/int, timestamp: epoch_ms, event_type: string}` (e.g., a video_id or search_term).
- Count Storage: In the fast path, an in-memory Count-Min Sketch (2D array: width ~10k, height ~10, holding estimated counts).
- Persistent Storage: Distributed file/object store (e.g., HDFS or S3) with files partitioned by time window, each containing `{item_id, exact_count}` maps.
- Query Index: Cached top-K lists per time window, updated periodically.
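A minimal sketch of the fast-path count structure described above, assuming a hash-per-row scheme (the hashing choice here is illustrative; any family of independent hashes works). Note that a sketch alone cannot enumerate items, so top-K extraction needs a separately tracked candidate set:

```python
import hashlib
import heapq

class CountMinSketch:
    """Minimal Count-Min Sketch: depth rows of width counters. Each item
    hashes to one counter per row; the estimate is the minimum over rows,
    so counts are never underestimated."""

    def __init__(self, width: int = 10_000, depth: int = 10):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item: str):
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(), salt=row.to_bytes(8, "little"))
            yield row, int.from_bytes(h.digest()[:8], "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        return min(self.table[row][col] for row, col in self._buckets(item))

def top_k(sketch: CountMinSketch, candidates: list[str], k: int):
    # The candidate list must be maintained alongside the sketch.
    return heapq.nlargest(k, ((c, sketch.estimate(c)) for c in candidates),
                          key=lambda pair: pair[1])

cms = CountMinSketch()
for item in ["a"] * 5 + ["b"] * 2:
    cms.add(item)
```

Merging sketches of identical dimensions is just element-wise addition of the tables, which is what makes per-processor sketches easy to combine at query time.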
API Design #
Core endpoints exposed via RESTful API:
- `POST /events` - Ingest events (batch supported). Request body: `[{item_id, timestamp, event_type}]`. Response: 202 Accepted.
- `GET /topk?k=10&window=5m` - Retrieve the top K items. Query params: `k` (int), `window` (e.g., "5m", "1h"). Response: `[{"item_id": "vid123", "count": 1000}, ...]` with a timestamp and an approximation flag.
^[APIs assume JSON payloads and HTTP status codes; authentication via API keys.]
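To make the endpoint contracts concrete, here is a sketch of the request and response payloads. Field names beyond those listed above (`window`, `as_of`, `approximate`, `items`) are assumptions for illustration:

```python
import json

# Hypothetical POST /events body, matching the schema above.
ingest_body = [
    {"item_id": "vid123", "timestamp": 1700000000000, "event_type": "view"},
    {"item_id": "vid456", "timestamp": 1700000000123, "event_type": "view"},
]

# Hypothetical GET /topk response; the approximation flag marks fast-path results.
topk_response = {
    "window": "5m",
    "as_of": 1700000300000,
    "approximate": True,
    "items": [{"item_id": "vid123", "count": 1000}],
}

# Round-trip through JSON to confirm the shapes serialize cleanly.
assert json.loads(json.dumps(ingest_body)) == ingest_body
assert json.loads(json.dumps(topk_response)) == topk_response
```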
Detailed Design #
- Event Ingestion & Pre-aggregation: API Gateway buffers events in in-memory hash maps, flushing on thresholds or intervals to reduce load.
- Distributed Messaging: Kafka partitions events by item_id hash for even distribution, preventing hotspots.
- Fast Path (Probabilistic): Processors update Count-Min Sketch; query merges multiple sketches for approximate top-K.
- Slow Path (Exact): Data Partitioner assigns events to shards; Partition Processors aggregate in batches; MapReduce emits sorted frequency lists, merging across shards.
- Technology Choices: Kafka for durability & scalability; Redis for fast-path caching; Spark for MapReduce simplicity.
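The slow path's map/reduce shape can be sketched in a few lines. This is a toy single-process stand-in for the Spark job, assuming events are partitioned by `item_id` hash so that summing per-shard counts yields exact totals:

```python
import heapq
from collections import Counter

def map_partition(events: list[dict]) -> Counter:
    """'Map' step: each partition processor counts its own shard exactly."""
    return Counter(e["item_id"] for e in events)

def reduce_top_k(partition_counts: list[Counter], k: int):
    """'Reduce' step: merge per-shard counts, then take the k most frequent."""
    total = Counter()
    for counts in partition_counts:
        total.update(counts)
    return heapq.nlargest(k, total.items(), key=lambda kv: kv[1])

shard_a = [{"item_id": "vid1"}, {"item_id": "vid1"}, {"item_id": "vid2"}]
shard_b = [{"item_id": "vid3"}]
result = reduce_top_k([map_partition(shard_a), map_partition(shard_b)], k=2)
```

Because each shard's list is exact, the merged result is exact; the cost is the batch latency the slow path accepts by design.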
Scalability & Bottlenecks #
- Horizontal Scaling: Add processors/shards as load grows; load balancer auto-scales.
- Data Partitioning & Sharding: Consistent hashing across processors reduces hotspots.
- Caching & Replication: In-memory sketches replicated for availability; distributed storage handles petabytes.
- Bottlenecks: Network I/O in merging sketches/lists; memory limits in sketches (tuned via width/height); MapReduce latency for long windows. Mitigate with CDN-like edge caching for queries.
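The consistent-hashing scheme mentioned above can be illustrated with a toy ring using virtual nodes (node names and vnode count are arbitrary; a production system would use a library or coordinator rather than this sketch):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes: each item_id maps to the
    first node clockwise on the ring, so adding or removing a processor
    remaps only a small fraction of items."""

    def __init__(self, nodes: list[str], vnodes: int = 100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, item_id: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(item_id)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["proc-1", "proc-2", "proc-3"])
owner = ring.node_for("vid123")
```

Virtual nodes smooth out the load imbalance a bare hash ring would have, which is the hotspot concern the bullet above raises.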
Trade-offs & Alternatives #
- Fast vs. Slow Path: Fast path trades accuracy (~5-10% error) for speed (<50ms latency); slow path ensures exactness but adds 10-60s delay.
- Count-Min Sketch vs. Alternatives: Chosen for memory efficiency (fixed-size, sub-linear space independent of the number of distinct items) over Lossy Counting or Space-Saving; those alternatives offer deterministic error guarantees but track explicit candidate sets and may need more space.
- Lambda vs. Kappa Architecture: Simplifies real-time + batch; Kappa (stream-only) could reduce complexity but risks losing historical data precision.
- SQL vs. NoSQL: NoSQL (e.g., Redis/S3) preferred for unstructured frequency data; SQL could work for exact counts but limits stream processing speed.
Future Improvements #
- Implement hybrid sketches (e.g., Count-Median) for better accuracy with slight space increase.
- Add machine learning to predict heavy hitters based on patterns.
- Introduce in-storage indexing for sub-second exact queries.
- Multi-datacenter replication for global consistency in queries.
Interview Talking Points #
- Emphasize Lambda architecture benefits: balances real-time approximation with batch accuracy, common in big data systems.
- Discuss Count-Min Sketch trade-offs: probabilistic but space-efficient for massive streams vs. exact but memory-intensive options.
- Highlight partitioning strategies: they enable scalability, but uneven loads can cause performance degradation.
- Compare fast/slow paths: approximate results suit dashboards; exact for critical business metrics.
- Data flow complexity: Event ingestion → processing → query merging; focus on bottlenecks like merging large K lists.
- Assumptions impact: Scale estimates drive tech choices; flexible APIs allow varied K/window queries.
- Alternatives evaluation: Kafka + Spark simplifies; Redis caching reduces storage loads.
- Future-proofing: ML forecasting could evolve the system from reactive to predictive analytics.