Joins and Skew — Where Big Data Hurts

Broadcast vs shuffle joins, salting skewed keys, AQE.

0/3 done

Theory

Three joins, two failure modes

Broadcast (BHJ) — small side fits in memory; Spark ships it to every executor and joins locally. Fast. Default trigger is spark.sql.autoBroadcastJoinThreshold (10MB default — often worth raising).

Sort-Merge (SMJ) — both sides large; both are shuffled, sorted, then merged. The workhorse for big joins.

Shuffle Hash (SHJ) — middle ground, rarely chosen manually.

The two ways big joins blow up:

  • Skew — one join key dominates. Fix with AQE skew join (Spark 3.x auto-splits skewed partitions) or manual key salting.
  • Cartesian explosion — a join condition is missing or non-equi; row count multiplies. Always inspect explain() for CartesianProduct.

Analogy

A broadcast join is handing every cashier their own pocket copy of the small loyalty-card lookup — each one resolves customers locally, no queue. A sort-merge join is when both lists are huge: you alphabetise two giant guest lists and walk them side by side. Skew is the wedding where 300 guests are all called 'Smith' — one usher drowns while the rest finish early; salting hands out 'Smith-1, Smith-2, ...' wristbands so the crowd spreads across every usher. Pick the join that matches the size of your lists, and never let one hot key become the whole party.

Reading in progress · 0 of 3 activities done