Real-Time Analytics Pipeline

A design for real-time analytics built on event stream processing.

Problem Statement & Constraints

Design a scalable system to ingest and process high-volume user event streams from a web application in real-time, enabling immediate analytics and dashboards for metrics like engagement. The pipeline must handle variable loads, ensure data accuracy despite late arrivals, and support fault-tolerant operations to maintain continuous availability.

Functional Requirements

  • Ingest user event streams (clicks, views, conversions) from web/mobile apps.
  • Process events in real-time with stateful aggregations.
  • Provide analytics dashboards and API for metric queries.
  • Archive raw events for batch reprocessing.

Non-Functional Requirements

  • Scale: Handle 10k–100k events/sec; peak-hour bursts.
  • Availability: 99.9% uptime; resilient to temporary source unavailability.
  • Consistency: Eventual consistency for aggregations; at-least-once event processing.
  • Latency: P99 < 500ms for event-to-dashboard visibility.
  • Workload Profile:
    • Read:Write ratio: ~20:80
    • Peak throughput: 100k events/sec
    • Retention: 30 days hot; 1y archive

High-Level Architecture

graph LR
  App --> Collector
  Collector --> Kafka
  Kafka --> StreamProc[Processor]
  StreamProc --> OLAP
  OLAP --> Dashboard
  StreamProc -.->|late| DLQ
  Kafka -.->|archive| Lake

Apps emit events to a stateless Collector fleet that enriches payloads and publishes to Kafka. A Stream Processor (Flink) consumes the stream, computes stateful aggregations using event-time watermarks, and sinks results into an OLAP database for real-time Dashboards. Late events exceeding allowed windows route to a DLQ, while all raw events archive continuously to a Data Lake.
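The collector stage of this flow can be sketched as a minimal in-memory pipeline. This is illustrative only: the `enrich` and `publish` functions and the `server_ts` field are hypothetical names, and a dict stands in for the Kafka producer.

```python
import json
import time

def enrich(raw: dict) -> dict:
    """Collector-side enrichment: attach a server receive timestamp
    and normalize the event before publishing downstream."""
    event = dict(raw)
    event["server_ts"] = time.time()          # receive time, distinct from event time
    event.setdefault("event_type", "unknown")
    return event

def publish(topic: str, event: dict, log: dict) -> None:
    """Stand-in for a Kafka producer: append the serialized event
    to an in-memory topic log."""
    log.setdefault(topic, []).append(json.dumps(event))

# Simulate two client events flowing through the Collector onto a topic.
kafka_log = {}
for raw in [{"session_id": "s1", "event_type": "click", "ts": 1700000000.0},
            {"session_id": "s2", "ts": 1700000001.5}]:
    publish("user_events", enrich(raw), kafka_log)

print(len(kafka_log["user_events"]))  # → 2
```

Because the Collector is stateless, any instance can enrich any event, which is what allows the fleet to scale horizontally under bursts.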

Data Design

Kafka buffers high-throughput raw streams and intermediate aggregates. A columnar OLAP datastore (ClickHouse/Druid) relies on aggressive time-based partitioning for millisecond-scale analytical queries, while the Data Lake provides cheap, long-term retention for raw Parquet files.

Event Stream (Kafka Topics)

| Topic       | Partition Key | Throughput   | Retention |
|-------------|---------------|--------------|-----------|
| user_events | session_id    | 100k msg/s   | 7 days    |
| aggregates  | metric_name   | 1k msg/s     | 24 hours  |
| late_events | event_id      | < 100 msg/s  | 14 days   |
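Partitioning `user_events` by `session_id` matters because all events sharing a key hash to the same partition, which preserves per-session ordering. A sketch of the idea (the hash choice here is illustrative; Kafka's default producer uses murmur2, not CRC32):

```python
import zlib

NUM_PARTITIONS = 12

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition key to a partition by hashing, as a keyed
    Kafka producer does (hash function illustrative)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("session-abc")
p2 = partition_for("session-abc")
print(p1 == p2)  # → True: same session always lands on the same partition
```

The trade-off is that a single very hot session cannot be spread across partitions, so key skew directly becomes partition skew.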

Metrics Store (OLAP - ClickHouse/Druid)

| Table       | Granularity | Partitioning | Primary Key            |
|-------------|-------------|--------------|------------------------|
| raw_events  | Event-level | YYYY-MM-DD   | session_id, timestamp  |
| minly_aggs  | 1 Minute    | YYYY-MM      | metric_type, window_start |
| hourly_aggs | 1 Hour      | YYYY         | metric_type, window_start |

Deep Dive & Trade-offs

Deep Dive

  • Stream processing: Apache Flink computes stateful aggregations, relying on exactly-once checkpointing for accuracy across varied time windows.

  • Watermarks & lateness: Event-time watermarks handle out-of-order data. Allowed-lateness settings trigger retroactive updates for moderately late events, while events arriving beyond the allowed lateness are routed to a DLQ.

  • OLAP query layer: Columnar stores optimized for range scans support Query APIs for dashboard polling and real-time WebSocket pushes.

  • Event archival: Kafka Connect mirrors raw topics to a data lake in Parquet, completely decoupling real-time processing from offline ML training.

  • Load shedding: Kafka lag backpressure triggers dynamic shedding of non-critical events to protect high-value data freshness.

  • Schema registry: A centralized registry enforces Avro/Protobuf schemas, validating messages stream-side to prevent corrupted OLAP sinks.
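The watermark and lateness behavior above can be sketched in a few lines. This is an illustrative model, not Flink's API: the watermark trails the maximum event time seen by a bounded out-of-orderness delay, and events older than the watermark minus the allowed lateness go to the DLQ.

```python
MAX_OUT_OF_ORDER_S = 5   # assumed out-of-orderness bound
ALLOWED_LATENESS_S = 10  # assumed allowed lateness

class WatermarkRouter:
    """Routes events by event time: accept on-time and tolerably-late
    events, divert hopelessly-late ones to the DLQ."""

    def __init__(self):
        self.max_event_ts = float("-inf")
        self.accepted, self.dlq = [], []

    @property
    def watermark(self):
        # Watermark trails the newest event time by the out-of-order bound.
        return self.max_event_ts - MAX_OUT_OF_ORDER_S

    def route(self, event):
        self.max_event_ts = max(self.max_event_ts, event["ts"])
        if event["ts"] < self.watermark - ALLOWED_LATENESS_S:
            self.dlq.append(event)       # window already finalized: too late
        else:
            self.accepted.append(event)  # on time or within allowed lateness

router = WatermarkRouter()
for ts in [100, 102, 98, 120, 104]:
    router.route({"ts": ts})

print([e["ts"] for e in router.dlq])  # → [104]: 16s behind the watermark
```

The event at `ts=98` is out of order but still within the watermark bound, while `ts=104` arrives after `ts=120` has pushed the watermark past its window's allowed lateness.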

Trade-offs

  • Flink vs. Kafka Streams: Flink offers robust exactly-once state management but requires a cluster; Kafka Streams is a lightweight library but lacks comprehensive multi-sink guarantees.

  • Event-time vs. Processing-time: Event-time is accurate but requires complex watermark management; Processing-time is simpler but inconsistent under lag or replay.

  • OLAP Choice: ClickHouse is simpler for high-perf single-nodes; Druid scales better for high-concurrency queries; Managed (BigQuery) reduces ops but limits tuning.

Operational Excellence

SLIs / SLOs

  • SLO: P99 end-to-end event processing latency (collector to OLAP) < 500 ms.
  • SLO: Dashboard data freshness < 2 seconds from the latest processed event.
  • SLIs: event_processing_latency_p99, consumer_lag_seconds, dashboard_freshness_lag, event_drop_rate, dlq_event_count.
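The latency SLIs above are percentile-based, so computing them correctly matters. A sketch of `event_processing_latency_p99` using the nearest-rank method (the sample values are made up for illustration):

```python
import math

def p99(samples):
    """Nearest-rank P99: the value at rank ceil(0.99 * n) in sorted order."""
    s = sorted(samples)
    idx = max(0, math.ceil(0.99 * len(s)) - 1)
    return s[idx]

latencies_ms = [120, 180, 95, 210, 480, 150, 300, 99, 110, 205]
SLO_MS = 500

value = p99(latencies_ms)
print(value, value < SLO_MS)  # → 480 True: within the 500 ms SLO
```

In practice this computation lives in the metrics backend (e.g. over a histogram), but the nearest-rank definition is what the SLO threshold is checked against.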

Reliability & Resiliency

  • Replay: Re-run historical topics in staging to validate windowing logic.
  • Chaos: Kill task managers to verify exactly-once checkpoint restoration.
  • Scale: Load-test at 3x peak to validate collector and OLAP throughput.