What is a 'shuffle' in Spark?

Moving data across the network between executors at a stage boundary

A re-ordering of rows for sorting

Spark's Execution Model — Jobs, Stages, Tasks — Semantic Web Academy

Theory

From a DataFrame line to bytes on the wire

Spark turns your code into a DAG and splits it at shuffle boundaries: the points where an operation such as a join, groupBy, or distinct has to move data across the network. The segments between those boundaries become stages. Each stage is a set of tasks, one per partition, that run in parallel on executors.
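To make the stage-splitting idea concrete, here is a toy model in plain Python. It is not Spark's planner: the WIDE_OPS set and the split_into_stages function are hypothetical names invented for this sketch, and it treats the pipeline as a simple linear chain rather than a full DAG.

```python
# Toy model only. Real Spark derives shuffle boundaries from its physical
# plan, not from a fixed list of operation names like this one.
WIDE_OPS = {"join", "groupBy", "distinct"}


def split_into_stages(ops):
    """Cut a linear chain of operations into stages at each wide operation."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            # A wide operation needs shuffled input, so it opens a new stage.
            stages.append([])
        stages[-1].append(op)
    return stages


pipeline = ["read", "filter", "groupBy", "select", "join", "write"]
stages = split_into_stages(pipeline)

for number, stage in enumerate(stages):
    print(f"stage {number}: {' -> '.join(stage)}")
```

Two wide operations in the chain give two shuffles and therefore three stages. The narrow operations (read, filter, select, write) stay in whichever stage they already belong to.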

Performance debugging in Spark reduces to three questions:

How many partitions am I shuffling? Too few and there is not enough parallelism to keep the executors busy. Too many and per-task scheduling overhead outweighs the actual work.
Is the join skewed? One key with 90% of the rows lands in a single partition, so one task stalls the whole stage. Salting the key or enabling AQE's skew-join handling usually fixes it.
Am I spilling to disk? A shuffle that doesn't fit in executor memory spills to disk, which can slow the job by an order of magnitude.

The Spark UI's Stage view answers all three.
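The skew question is the easiest of the three to simulate without a cluster. The sketch below is a plain-Python stand-in, not Spark code: partition_of uses crc32 as a substitute for Spark's own hash partitioner, and the function names, the 16 salts, and the 8 partitions are all assumptions chosen for illustration.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    # Deterministic stand-in for a hash partitioner (Spark uses a different hash).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rows_per_partition(keys, num_partitions):
    """Count how many rows each shuffle partition would receive."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salt(keys, num_salts):
    # Append a rotating suffix so one hot key becomes num_salts distinct keys.
    return [f"{k}#{i % num_salts}" for i, k in enumerate(keys)]


# 90% of the rows share the key 'unknown'; the rest are spread over 100 cities.
keys = ["unknown"] * 900 + [f"city_{i}" for i in range(100)]

plain = rows_per_partition(keys, 8)
salted = rows_per_partition(salt(keys, 16), 8)

print("largest partition without salting:", max(plain))
print("largest partition with salting:   ", max(salted))
```

Without salting, every 'unknown' row hashes to the same partition, so one task does at least 90% of the work. Salting spreads the hot key over several partitions, at the cost of extra work later: a salted aggregation needs a second pass to merge the partial results, and a salted join needs the other side replicated once per salt value. In Spark 3.x, adaptive query execution can split skewed join partitions for you when spark.sql.adaptive.enabled and spark.sql.adaptive.skewJoin.enabled are on.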

Analogy

A Spark job is an airport baggage system. Reading and filtering bags on a single belt is fast — that's a narrow transformation, no bag leaves its lane. But when every passenger must be re-sorted by destination city, all the bags have to be hauled across the terminal to new carousels: that cross-terminal haul is the shuffle, and it's where time and money evaporate. If one city (say 'unknown') gets 90% of the bags, that single carousel jams while the others sit idle — skew. And if a carousel overflows onto the floor, that's spill to disk. Debugging Spark is just finding which carousel is on fire.

Stages and shuffles

A Spark job, drawn

Each shuffle boundary creates a new stage. Each stage's tasks run in parallel, one per partition. Find the slow stage in the UI and you have found the bug.
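Finding the slow stage usually means comparing the median task time with the slowest task, which is what the summary metrics on a stage's detail page in the Spark UI let you do by eye. The sketch below does the same arithmetic on made-up numbers. The stage names, the durations, and the summarize and most_skewed helpers are all hypothetical; nothing here reads from a real Spark application.

```python
from statistics import median


def summarize(stage_tasks):
    """Map each stage to (median, max, max/median) of its task durations in seconds."""
    summary = {}
    for stage, durations in stage_tasks.items():
        med = median(durations)
        worst = max(durations)
        ratio = worst / med if med > 0 else float("inf")
        summary[stage] = (med, worst, ratio)
    return summary


def most_skewed(stage_tasks):
    """Return the stage whose slowest task is furthest above its median task."""
    summary = summarize(stage_tasks)
    return max(summary, key=lambda stage: summary[stage][2])


# Invented task durations, for illustration only.
stage_tasks = {
    "stage 0": [4.0, 5.0, 4.5, 5.5],
    "stage 1": [6.0, 7.0, 6.5, 120.0],
    "stage 2": [3.0, 3.5, 3.0, 4.0],
}

print("look here first:", most_skewed(stage_tasks))
```

A stage whose slowest task is many times its median is a skew suspect. A stage whose median itself is high points more towards too few partitions or spill, so the ratio is a starting point for the investigation, not a diagnosis.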

Reflect

Most 'Spark is slow' tickets are actually 'one stage is slow because of skew or spill'. The Stage view of the Spark UI is the most important UI in this whole track — learn its layout cold.

›Which stage of your slowest job has the longest median task time?
›How often does your team's runbook actually say 'open the Spark UI first'?
