Which join strategy ships the small side to every executor?

Joins and Skew — Where Big Data Hurts — Semantic Web Academy

Three joins, two failure modes

Broadcast (BHJ) — small side fits in memory; Spark ships it to every executor and joins locally. Fast. Default trigger is spark.sql.autoBroadcastJoinThreshold (10MB default — often worth raising).

Sort-Merge (SMJ) — both sides large; both are shuffled, sorted, then merged. The workhorse for big joins.

Shuffle Hash (SHJ) — middle ground, rarely chosen manually.

The two ways big joins blow up:

Skew — one join key dominates. Fix with AQE skew join (Spark 3.x auto-splits skewed partitions) or manual key salting.
Cartesian explosion — a join condition is missing or non-equi; row count multiplies. Always inspect explain() for CartesianProduct.

Joins and Skew — Where Big Data Hurts

Theory

Three joins, two failure modes

Analogy