Spark Trial

Batch processing framework and scalable ETL data workflows.

Last modified on March 1, 2026

Context & Motivation

Context: spark-trial demonstrates end-to-end ETL for NYC Yellow Taxi trip data using Apache Spark with Scala and SBT, performing schema alignment, validation, and statistical aggregations across multiple years.

Motivation: I have used Spark SQL and PySpark in the past, but also wanted to complete my understand of the Spark ecosystem by writing a Scala-based Spark application that handles real-world data challenges: schema drift across years, large IO volumes, and the need for reproducible analytics. The project serves as a practical example of Spark’s core abstractions (DataFrames, Datasets, RDDs) and optimization techniques (partitioning, caching) while also addressing common pitfalls like schema evolution and data skew.

The Local Implementation

Current Logic: A Scala/SBT project that downloads parquet-formatted trip data from public sources, loads into Spark, normalizes schemas across years, runs validations, and computes aggregated yearly and quarterly metrics with configurable year ranges and quarterly sampling (Jan/Apr/Jul/Oct).
Bottleneck: Large IO volume and schema drift across years; local runs are limited by available memory and CPU.

Comparison to Industry Standards

My Project: Practical ETL example focused on reproducible analytics for public datasets.
Industry: Production ETL adds orchestration (Airflow/Argo), data cataloging, monitoring, and schema evolution tooling.
Gap Analysis: To reach production readiness, integrate with workflow orchestration for scheduling and error recovery, add data lineage and schema versioning systems, implement comprehensive monitoring (job metrics, data quality checks), and automate handling of schema evolution and drift.

Risks & Mitigations

Schema drift: implement schema merging strategies and robust validation with automatic alerting on unexpected column changes.
Data skew: monitor partition sizes and repartition when a small number of keys dominate; use salted keys or custom partitioners for known-skewed columns.
Resource exhaustion on large datasets: set memory guardrails per executor, use adaptive query execution (AQE) to auto-tune shuffle partitions, and profile spill-to-disk behavior.

Model Serving & Inference tag
Principles for highly available production ML inference; utilizing immutable model registries, dynamic request micro-batching, safe canary rollouts, graceful fallback degradation, and efficient GPU memory management.
Distributed Web Crawler tag
A highly resilient architectural design for a Google-scale web crawler; heavily focusing on breadth-first search (BFS), extensive DNS resolution caching, and polite handling of malicious domains.
Video Transcoding & Streaming Pipeline tag
An inherently scalable video ingestion and transcoding system architecture; asynchronously chunking heavy media, extracting actionable features, and steadily outputting adaptive bitrates via worker pools.
Intervals & Constraints tag
A framework for balancing Latency (system Completeness) against Verification (data Integrity) by effectively choosing between Speculative execution and Pessimistic consensus intervals.