Theory
From a DataFrame line to bytes on the wire
Spark turns your code into a DAG, splits it at shuffle boundaries (operations that move data across the network — joins, groupBy, distinct), and the boundaries become stages. Each stage is a set of tasks — one per partition — run in parallel on executors.
Performance debugging in Spark reduces to three questions:
- How many partitions am I shuffling? Too few = no parallelism. Too many = task-overhead death.
- Is the join skewed? One key with 90% of the rows will land on one executor and stall the whole stage. Salting or AQE skew join handling fixes it.
- Am I spilling to disk? A shuffle that doesn't fit in executor memory writes to disk and your job slows 10×.
The Spark UI's Stage view answers all three.