From Batch to Streaming — When and Why

Latency is the question, not the technology. Prerequisite: the Apache Kafka & Streaming track.


Theory

Streaming is a latency choice, not a fashion choice

If the business can wait an hour, batch is cheaper, simpler and easier to debug. Streaming earns its complexity only when:

  • A downstream system needs sub-minute latency (fraud, personalisation, alerting).
  • A source is inherently streaming (clickstream, IoT, CDC) and you don't want to wait for the next batch window.
  • The data must be enriched on the fly before storage (stateful joins, deduplication).

This level builds on the Apache Kafka & Streaming track. If you haven't taken it, treat it as the prerequisite for broker internals (topics, partitions, consumer groups, offsets). Here we focus on the data engineering concerns: CDC, exactly-once processing, stream→table duality, and Flink basics.

Analogy

Batch is the postal service: cheap, scheduled, reliable, and you don't ask for a parcel to arrive in 30 seconds. Streaming is the phone line: continuous and low-latency, but you pay for it always being on, and the failure modes are subtler (a dropped call is harder to diagnose than a lost letter). Most platforms need both: fraud detection over the phone, monthly billing through the post.

Reflect

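The honest test: write down the latency SLA of each downstream consumer of a candidate stream. If none is under 15 minutes, micro-batch (for example, Spark Structured Streaming triggered every 5 minutes) typically delivers most of the value at a fraction of the operational cost of true streaming. As a rough rule of thumb, think of it as around 90% of the value for around 30% of the cost.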

  • Which of your streams could be micro-batches without anyone noticing?
  • For which batch jobs would consumers genuinely pay for lower latency, and what budget would you set?
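
The honest test above can be written down as a few lines of Python. This is a minimal sketch, not a sizing tool: the `Consumer` class, the `recommend_mode` function and the example consumers are all hypothetical names invented for illustration, and the only rule taken from the text is the 15-minute threshold.

```python
from dataclasses import dataclass

# Threshold from the lesson: if no consumer needs data faster than this,
# micro-batch is usually enough.
MICRO_BATCH_THRESHOLD_S = 15 * 60


@dataclass(frozen=True)
class Consumer:
    name: str            # hypothetical downstream consumer
    latency_sla_s: int   # maximum acceptable end-to-end delay, in seconds


def recommend_mode(consumers: list[Consumer]) -> str:
    """Return 'streaming' if any SLA is under 15 minutes, else 'micro-batch'."""
    if not consumers:
        raise ValueError("list at least one downstream consumer")
    # The tightest SLA drives the decision; looser consumers don't matter.
    tightest = min(c.latency_sla_s for c in consumers)
    return "streaming" if tightest < MICRO_BATCH_THRESHOLD_S else "micro-batch"


# Hypothetical example streams and their consumers.
clickstream_consumers = [
    Consumer("daily_dashboard", 24 * 3600),
    Consumer("hourly_report", 3600),
]
payments_consumers = [
    Consumer("fraud_scoring", 30),
    Consumer("monthly_billing", 30 * 24 * 3600),
]

print(recommend_mode(clickstream_consumers))  # micro-batch
print(recommend_mode(payments_consumers))     # streaming
```

Note that one tight consumer is enough to tip a stream into true streaming: in the payments example, fraud scoring sets the requirement even though billing could wait a month.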
