Cube Pre-aggregations Deep Dive

Rollup, originalSql, refresh strategies and the matching algorithm.

0/1 done

Overview

How Cube actually picks a rollup

Pre-aggregations are not magic — there is a deterministic matching algorithm. For an incoming query Q, Cube looks at every pre-aggregation P on the cubes Q touches and asks:

  1. Are Q's measures a subset of P's measures? No → skip.
  2. Are Q's dimensions a subset of P's dimensions? No → skip.
  3. Is Q's time grain ≥ P's grain? (day-rollup can answer month, but not the reverse.)
  4. Are Q's filters compatible with P's filter constraints?

First match wins. If nothing matches, Cube falls back to the live SQL.

The three pre-aggregation kinds

KindWhat it storesWhen to use
rollupPre-aggregated measures by dims+grain95% of cases
original_sqlA snapshot of the base SQLWhen you can't change the warehouse but want a refresh boundary
rollup_joinRollup of multiple cubes joined firstHigh-cardinality joins that always co-occur

Refresh strategies

  • Cron-stylerefresh_key: { every: '30 minute' }. Simple, predictable.
  • SQL key — refresh when MAX(updated_at) changes. Cheap when the source table has a watermark.
  • Incremental — refresh only the latest partition. Mandatory at TB scale.

Pre-organised library shelves

Pre-aggregation matching is search routing in a library. A patron asks for 'all 19th-century French novels'; the librarian checks: do we have a pre-organised shelf for 19th-century European novels by country? Yes — pull the shelf, hand the patron the France slice. Subset of dimensions, compatible grain — served instantly. No matching shelf? Walk the stacks (scan the fact table). Pre-aggregations are pre-organised shelves your engine maintains overnight.

Reflect

The instrumentation move that makes pre-aggregation design a science instead of guesswork: log every query Cube serves with its (measures, dimensions, grain, was-rollup-hit?) tuple. After a week, the top 20 unique tuples cover 80% of traffic. Build pre-aggregations for those 20 — and walk away from the rest.

  • Do you currently log query shape and rollup-hit-rate? If not, that's the cheapest performance win available.
  • What is the smallest set of pre-aggregations that would cover 80% of your slowest dashboards?

Reading in progress · 0 of 1 activity done