Ad Click Aggregator

A real-time aggregation system for ad click analytics data.

Problem Statement & Constraints

Design a system to aggregate millions of ad click events in real-time to provide up-to-the-minute reporting for advertisers. The system must handle high-volume streams, filter out fraudulent or duplicate clicks, and ensure that click counts are accurate for billing purposes.

Functional Requirements

  • Aggregate clicks by ad ID and time window (e.g., 1 minute).
  • Detect and filter duplicate clicks.
  • Provide an API for querying aggregated results in near real time.

Non-Functional Requirements

  • Scale: 10 billion click events per day (peak 200k events/sec).
  • Latency: End-to-end delay (event time to appearance in reports) < 1 minute.
  • Consistency: Exactly-once semantics for billing; exact final counts (probabilistic structures acceptable for pre-aggregation).
  • Availability: Robustness against regional outages or stream spikes.
  • Workload Profile:
    • Read:Write ratio: ~20:80
    • Peak throughput: 200k events/sec
    • Retention: 90 days hot, 1y archive
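
A quick back-of-envelope check of the stated scale numbers (the 200-byte event size is an assumed figure for illustration; all other inputs come from the requirements above):

```python
EVENTS_PER_DAY = 10_000_000_000
SECONDS_PER_DAY = 86_400

avg_eps = EVENTS_PER_DAY / SECONDS_PER_DAY        # average events/sec (~116k)
peak_eps = 200_000                                # stated peak
burst_factor = peak_eps / avg_eps                 # headroom the system must absorb

EVENT_BYTES = 200                                 # assumed raw payload size
daily_raw_gb = EVENTS_PER_DAY * EVENT_BYTES / 1e9 # raw ingest per day, before compression

print(f"avg ~{avg_eps:,.0f} events/sec, peak/avg ~{burst_factor:.2f}x")
print(f"~{daily_raw_gb:,.0f} GB/day raw")
```

The ~1.7x peak-to-average ratio is what the autoscaling and backpressure mechanisms discussed later need to absorb.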

High-Level Architecture

graph LR
  SDK --> Ingest
  Ingest --> Kafka
  Kafka --> Dedup
  Dedup --> Agg
  Agg --> OLAP[(OLAP)]
  OLAP --> Query
  Kafka --> Fraud
  Fraud -.->|flag| Dedup

Stateless HTTP/gRPC APIs ingest events into Kafka. A two-stage deduplication layer (Bloom filter + Redis) filters duplicates before stream engines (Flink/Spark) perform 1-minute tumbling-window aggregations. Final counts are written to an OLAP store (ClickHouse/Druid), while a parallel fraud-scoring sidecar feeds corrections back to the dedup stage.
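
The two-stage deduplication can be sketched as below. The in-memory `BloomFilter` and the plain set standing in for Redis are illustrative stand-ins, not the production components:

```python
import hashlib


class BloomFilter:
    """Tiny in-process Bloom filter (stand-in for a shared production bitset)."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class TwoStageDedup:
    """Stage 1: the Bloom filter cheaply passes definitely-new click_ids.
    Stage 2: only Bloom 'maybe' hits pay for an exact lookup (Redis with a
    TTL in production; a plain set here)."""

    def __init__(self):
        self.bloom = BloomFilter()
        self.exact = set()  # stand-in for Redis

    def is_duplicate(self, click_id: str) -> bool:
        if not self.bloom.might_contain(click_id):
            # Definitely new: record in both stages without an exact-store read.
            self.bloom.add(click_id)
            self.exact.add(click_id)
            return False
        # Possible Bloom false positive: confirm against the exact store.
        if click_id in self.exact:
            return True
        self.exact.add(click_id)
        return False
```

The Bloom stage exists to avoid a Redis round trip for the common case of a brand-new click ID; only the rare "maybe seen" hits pay for the exact check.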

Data Design

Kafka topics buffer the high-throughput event streams, partitioned by ad ID to guarantee per-ad ordering. A columnar OLAP database tailored for real-time aggregation provides long-term reporting and sub-second roll-ups.
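
Key-based partitioning can be illustrated with a minimal sketch; Kafka's default partitioner hashes the key with murmur2, so the SHA-1 mapping and the 12-partition count here are stand-in assumptions:

```python
import hashlib

NUM_PARTITIONS = 12  # assumed partition count, for illustration only


def partition_for(ad_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Deterministic key -> partition mapping (Kafka's default partitioner
    # uses murmur2; a stable SHA-1 hash stands in for the same idea here).
    digest = hashlib.sha1(ad_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# Every event keyed by the same ad ID maps to the same partition, so
# Kafka's per-partition ordering yields per-ad ordering for free.
partitions = {partition_for("ad-42") for _ in range(1000)}
assert len(partitions) == 1
```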

Message Stream (Kafka Topics)

| Topic          | Partition Key | Description                            | Retention |
| -------------- | ------------- | -------------------------------------- | --------- |
| raw_clicks     | click_id      | Original event stream from SDK.        | 7 days    |
| ad_events      | ad_id         | Deduplicated events for aggregation.   | 24 hours  |
| fraud_verdicts | ad_id         | ML flags for retroactive subtraction.  | 24 hours  |

Reporting Schema (OLAP - ClickHouse)

| Table      | Column      | Type              | Description                      |
| ---------- | ----------- | ----------------- | -------------------------------- |
| clicks_agg | ad_id       | UInt32 (PK)       | Unique advertisement ID.         |
| clicks_agg | window_ts   | DateTime (PK)     | 1-min window start timestamp.    |
| clicks_agg | click_count | AggregateFunction | Rolling count for the window.    |
| clicks_agg | revenue     | Decimal           | Sum of bid price for clicked ads.|
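
The AggregateFunction column holds mergeable partial states rather than final values (in ClickHouse terms, an AggregatingMergeTree merges partials at read time). A minimal sketch of that merge semantics, with hypothetical per-worker partials:

```python
from collections import defaultdict


def merge_partials(partials):
    """Combine per-worker partial aggregates keyed by (ad_id, window_ts).
    Each input is ((ad_id, window_ts), clicks, revenue); merging is just
    addition because counts and sums are associative and commutative."""
    totals = defaultdict(lambda: {"clicks": 0, "revenue": 0.0})
    for (ad_id, window_ts), clicks, revenue in partials:
        row = totals[(ad_id, window_ts)]
        row["clicks"] += clicks
        row["revenue"] += revenue
    return dict(totals)


partials = [
    ((42, "2024-01-01T00:00"), 3, 1.50),  # partial from worker A
    ((42, "2024-01-01T00:00"), 2, 0.75),  # partial from worker B
    ((7,  "2024-01-01T00:00"), 1, 0.10),
]
merged = merge_partials(partials)
```

Because merging is order-independent, late-arriving partials simply fold into the existing row instead of forcing a full recompute.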

Deep Dive & Trade-offs

Deep Dive

  • Exactly-once semantics: Kafka transactional producers and stream-engine checkpointing make each read-process-write cycle atomic; idempotent upserts into the OLAP store extend the guarantee end to end.

  • Backpressure & flow control: Rate limits per advertiser throttle ingestion. Consumer lag thresholds trigger autoscaling to maintain a 1-minute freshness SLO.

  • Data reconciliation: Nightly batch jobs re-read raw events to validate real-time aggregates, generating adjustment records for discrepancies to ensure billing integrity.
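
The reconciliation step can be sketched as a recount-and-diff; the record shapes below are assumptions for illustration:

```python
from collections import Counter


def reconcile(raw_events, realtime_counts):
    """Recount raw events per (ad_id, window_ts) and emit signed adjustment
    records wherever the real-time aggregate disagrees with the batch truth."""
    truth = Counter((e["ad_id"], e["window_ts"]) for e in raw_events)
    adjustments = []
    for key in truth.keys() | realtime_counts.keys():
        delta = truth[key] - realtime_counts.get(key, 0)
        if delta != 0:
            adjustments.append({"key": key, "delta": delta})
    return adjustments


# Example: the real-time path over-counted by one for ad 42 in window "w0",
# so the nightly job emits a -1 adjustment rather than mutating history.
raw = [{"ad_id": 42, "window_ts": "w0"}] * 5
adjustments = reconcile(raw, {(42, "w0"): 6})
```

Emitting signed deltas instead of overwriting rows keeps the billing trail auditable.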

Trade-offs

  • Bloom filter + Redis vs. Pure Redis: Two-stage is memory-efficient but allows rare duplicates; Pure Redis is exact but expensive in memory and I/O at 200k events/sec.

  • Tumbling vs. Sliding Windows: Tumbling is simpler and cheaper; Sliding provides smoother curves at the cost of higher CPU and state overhead.

  • Real-time vs. Batch Fraud Scoring: Real-time catches fraud early but adds latency/infra; Batch is simpler but results in temporary report inflation until correction runs.
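
The window-assignment difference behind the tumbling-vs-sliding trade-off, sketched with an assumed 60 s window and 15 s slide (the sliding formula mirrors the usual stream-engine window-start calculation):

```python
WINDOW = 60  # assumed 1-minute window, in seconds
SLIDE = 15   # assumed slide for the sliding variant


def tumbling_window(ts: int) -> int:
    # Each event belongs to exactly one window: one state entry per key.
    return ts - ts % WINDOW


def sliding_windows(ts: int, size: int = WINDOW, slide: int = SLIDE) -> list:
    # Each event belongs to size/slide overlapping windows: smoother
    # curves, but proportionally more state and CPU per event.
    last_start = ts - ts % slide
    starts = []
    start = last_start
    while start > ts - size:
        starts.append(start)
        start -= slide
    return sorted(starts)
```

An event at t=130 s lands in one tumbling window (start 120) but four sliding windows (75, 90, 105, 120), which is exactly where the extra state cost comes from.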

Operational Excellence

SLIs / SLOs

  • SLO: 99% of click events are reflected in query results within 1 minute of event time.
  • SLO: 99.9% of query API requests return in < 500 ms.
  • SLIs: kafka_consumer_lag, aggregation_window_latency_p95, query_latency_p99, dedup_false_positive_rate, fraud_flag_rate.
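
The 1-minute freshness SLO can be measured as the fraction of events visible within the deadline; a minimal sketch with hypothetical timestamps:

```python
def freshness_sli(event_ts, visible_ts, slo_seconds=60):
    """Fraction of events reflected in query results within slo_seconds of
    event time; the SLO above targets >= 0.99 over the measurement window."""
    delays = [v - e for e, v in zip(event_ts, visible_ts)]
    within = sum(1 for d in delays if d <= slo_seconds)
    return within / len(delays)
```

For example, events with visibility delays of 30 s, 35 s, 75 s, and 50 s score 0.75, breaching the 99% target.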

Reliability & Resiliency

  • Verification: Replay historical Kafka topics to validate aggregation and late-event handling.
  • Chaos: Kill task managers mid-checkpoint to verify exactly-once recovery.
  • Load: Test ingestion at 2x peak traffic to validate backpressure and scaling.