PySpark Idioms — Pythonic Without Killing Performance

UDF cost, `selectExpr`, `withColumn` traps, when to drop to SQL.

0/2 done

Theory

Three habits that keep PySpark fast

  1. Avoid Python UDFs when SQL functions exist. A Python UDF serialises every row to Python, runs it, serialises back. 10–100× slower than F.when / F.regexp_extract / F.expr. If you must, prefer Pandas UDFs (vectorised, Arrow-based).
  2. Don't chain 50 withColumn calls — each one creates a new logical plan node. Use select(F.col('a'), F.col('b'), ...) or selectExpr for bulk transforms.
  3. Drop into Spark SQL for non-trivial transformations. spark.sql("""WITH ... SELECT ...""") is often clearer than the equivalent DataFrame chain, and the Catalyst optimiser treats them identically.

Analogy

A built-in Spark SQL function is a factory robot arm: it works on the whole conveyor of rows at once, inside the engine, at full speed. A plain Python UDF is a human hired to inspect each item by hand — and worse, every item must be lifted off the belt, carried to the human's desk, inspected, and carried back (the serialise-to-Python round-trip). For a million rows, that walk to the desk and back is the entire cost. A Pandas UDF at least lets the human inspect a whole crate at a time instead of one screw. Reach for the robot arm first; hire the human only when no robot exists for the job.

Reflect

PySpark's biggest performance trap is its biggest ergonomic win: Python is too easy to drop into. Train your team to reach for F.expr/spark.sql first and Python UDFs only when the SQL truly can't express it.

  • Where in your codebase do Python UDFs hide that could be replaced with `F.expr` or built-ins?
  • What's the team norm: code review catches UDFs, or does CI flag them?

Reading in progress · 0 of 2 activities done