Linking tables, and using joins instead of other tools

Joins are the heart of PLIDA work. The whole reason the spine exists is to link tables. Most analyses end in a three-way join that mixes a small demographic table, a spine mapping, and a huge fact table.

This chapter covers the join types that matter for PLIDA, the two habits that matter most (push filters before joins, and use semi_join/anti_join instead of filter(x %in% ...)), and the point at which Arrow stops and you need DuckDB. The evidence lives in Appendix A3.

Setup

Same recipe as Data manipulation: load tidyverse plus the DuckDB-facing packages, set a memory and thread budget, and either open a connection to the materialised plida_tables.duckdb database (default) or skip the database and point DuckDB at parquet folders directly. This chapter joins five tables, so we create five lazy tbl() handles up front — itr_db, demo_db, ato_sp_db, dhda_sp_db, mbs_db. None of them reads anything yet; they are just references to tables that the dplyr verbs below will pipe against.

Code
library(tidyverse)
library(scales)
library(duckdb)
library(DBI)
library(dbplyr)

# Centralised paths for this project — a short file that just defines
# where the DuckDB and parquet folders live. See the Setup chapter.
source("R/00-paths.R")
data_dir <- paths$duckdb_dir # replace with your own path

con <- dbConnect(
  duckdb::duckdb(),
  dbdir     = file.path(data_dir, "plida_tables.duckdb"),
  read_only = TRUE
)
dbExecute(con, "SET memory_limit = '20GB'") # set to about 90% of your RAM
[1] 0
Code
dbExecute(con, "SET threads TO 8")          # set to your core count minus 1
[1] 0
Code
itr_db     <- tbl(con, "itr")        |> rename_with(tolower)
demo_db    <- tbl(con, "demo")       |> rename_with(tolower)
ato_sp_db  <- tbl(con, "ato_spine")  |> rename_with(tolower)
dhda_sp_db <- tbl(con, "dhda_spine") |> rename_with(tolower)
mbs_db     <- tbl(con, "mbs")        |> rename_with(tolower)
Code
library(tidyverse)
library(scales)
library(glue)
library(duckdb)
library(DBI)
library(dbplyr)

source("R/00-paths.R")
pq <- paths$parquet_root # replace with your own parquet folder

con <- dbConnect(duckdb::duckdb())              # in-memory DuckDB, no file
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET threads TO 8")

# Helper: `tbl()` on a SQL fragment that asks DuckDB to read a parquet glob.
from_parquet <- function(glob) {
  tbl(con, sql(glue("SELECT * FROM read_parquet('{glob}')")))
}

itr_db     <- from_parquet(file.path(pq, "itr_context/*.parquet")) |> rename_with(tolower)
demo_db    <- from_parquet(file.path(pq, "demo/*.parquet"))        |> rename_with(tolower)
ato_sp_db  <- from_parquet(file.path(pq, "ato-spine/*.parquet"))   |> rename_with(tolower)
dhda_sp_db <- from_parquet(file.path(pq, "dhda-spine/*.parquet"))  |> rename_with(tolower)
mbs_db     <- from_parquet(file.path(pq, "dhda-mbs/*.parquet"))    |> rename_with(tolower)

The running example

A small enough slice of the ITR to see each join type clearly: who was a manager (ANZSCO major 1) in 2014 and what were they doing in 2015? Two year-slices of the ITR, ~1.5M and ~10M rows, joined on the ATO identifier.

Code
mgrs_14 <- itr_db |>
  filter(incm_yr == 2014L,
         ocptn_grp_cd >= 1000L, ocptn_grp_cd < 2000L) |>
  select(synthetic_aeuid, occ_14 = ocptn_grp_cd)

y15 <- itr_db |>
  filter(incm_yr == 2015L, !is.na(ocptn_grp_cd)) |>
  select(synthetic_aeuid, occ_15 = ocptn_grp_cd)

Each of these is a lazy handle. Nothing has scanned a row yet.
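To see what a handle will eventually run, ask dbplyr for the SQL — show_query() prints it without executing anything:

```r
# Print the SQL behind the lazy handle -- no rows are scanned.
mgrs_14 |> show_query()
# The incm_yr and ocptn_grp_cd filters should appear in the WHERE
# clause, waiting for DuckDB to execute when we eventually collect().
```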

The four “does it match?” joins

inner_join() — intersection

Keep rows where the key exists on both sides. The pipeline below is every PLIDA transition question in miniature: what were 2014 managers doing in 2015?

Code
mgr_stayers <- mgrs_14 |>
  inner_join(y15, by = "synthetic_aeuid") |>
  mutate(
    occ_major_15 = as.integer(floor(occ_15 / 1000)),
    still_mgr    = occ_major_15 == 1L
  ) |>
  count(occ_major_15, still_mgr) |>
  collect()
head(mgr_stayers)

⏱️ inner join (1.5M × 10M): 0.118s

Code
occ_labels <- c(
  "1" = "Managers", "2" = "Professionals", "3" = "Technicians",
  "4" = "Community/personal", "5" = "Clerical/admin",
  "6" = "Sales", "7" = "Machinery ops", "8" = "Labourers"
)
mgr_stayers |>
  filter(occ_major_15 %in% 2:8) |>
  mutate(occ_15_label = factor(occ_labels[as.character(occ_major_15)],
                               levels = occ_labels)) |>
  ggplot(aes(occ_15_label, n)) +
  geom_col(fill = "#1b9e77") +
  scale_y_continuous(labels = label_comma()) +
  labs(x = "ANZSCO major group in 2015", y = "Managers in 2014",
       title = "What managers did next (1-year)",
       subtitle = "People who reported ANZSCO 1 in 2014, across their 2015 occupation") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

left_join() — every left row, matched or not

Useful when you want non-matches too — “of 2014 managers, which ones did not file in 2015?”

Code
mgr_all <- mgrs_14 |>
  left_join(y15, by = "synthetic_aeuid") |>
  mutate(filed_2015 = !is.na(occ_15)) |>
  count(filed_2015) |>
  collect()
mgr_all

⏱️ left join (1.5M × 10M): 0.109s

semi_join() and anti_join() — filtering joins

These are the most under-used verbs in the dplyr vocabulary for PLIDA work. semi_join() keeps rows from the left table that have at least one match on the right, without pulling the right-hand columns in. anti_join() keeps rows with no match.

Code
mgr_exits <- mgrs_14 |>
  anti_join(y15, by = "synthetic_aeuid") |>
  summarise(n = n()) |>
  collect()
mgr_exits

⏱️ anti join (1.5M vs 10M): 0.109s

Code
mgr_stayers_ct <- mgrs_14 |>
  semi_join(y15, by = "synthetic_aeuid") |>
  summarise(n = n()) |>
  collect()
mgr_stayers_ct

⏱️ semi join (1.5M vs 10M): 0.123s

Important: Use semi_join() / anti_join() instead of filter(x %in% ...)

The obvious-looking alternative to a semi-join is:

# DON'T — silently collects the right-hand column into R memory
y15_keys <- y15 |> pull(synthetic_aeuid)
mgrs_14 |> filter(synthetic_aeuid %in% y15_keys) |> summarise(n = n())

This is a landmine on large data. pull() pulls the entire right-hand key column into an R vector — for a 10M-row right side, that’s already 160 MB; for a 372M-row one, it breaks the machine. And then %in% has to send that vector back to DuckDB inside the query, which is slow both to build and to run.

semi_join() and anti_join() keep the whole operation inside DuckDB: the ID column on the right stays in the database, and DuckDB does the matching without anything coming back to R. No big R vector, no giant SQL literal.

Rule of thumb: if you find yourself writing filter(x %in% (y |> pull(key))), it is a semi-join. Write the semi-join.

This is the first and most common way joins replace other data-manipulation tools. The second is de-duplication — semi_join(y |> distinct()) is typically cheaper than a distinct() on the joined result. The third is set difference — anti_join() is exactly “in A but not in B”.
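A sketch of the de-duplication pattern with this chapter's tables (the result names here are mine, not from the benchmark code):

```r
# Joining first and de-duplicating afterwards materialises a wide,
# possibly row-multiplied intermediate result ...
slow_ids <- mgrs_14 |>
  inner_join(y15, by = "synthetic_aeuid") |>
  distinct(synthetic_aeuid)

# ... whereas a semi-join against the distinct right-hand keys never
# widens or multiplies rows in the first place.
fast_ids <- mgrs_14 |>
  semi_join(y15 |> distinct(synthetic_aeuid), by = "synthetic_aeuid") |>
  distinct(synthetic_aeuid)
```

Both are lazy handles that produce the same IDs; the second gives DuckDB much less intermediate state to carry.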

full_join() — union of both sides

Keep every row from both sides, filling NA where there is no match. The classic PLIDA diagnostic it answers: how many people appear in the ATO spine but not the DHDA spine?

Code
spine_coverage <- ato_sp_db |>
  transmute(spine_id, in_ato = TRUE) |>
  full_join(dhda_sp_db |> transmute(spine_id, in_dhda = TRUE), by = "spine_id") |>
  mutate(in_ato  = coalesce(in_ato,  FALSE),
         in_dhda = coalesce(in_dhda, FALSE)) |>
  count(in_ato, in_dhda) |>
  collect()
spine_coverage

⏱️ full join (30M × 30M): 0.552s

In the synthetic fplida corpus every person is present in every spine, so the off-diagonals are zero. On real PLIDA, the off-diagonals tell you how much attrition each spine is hiding — the reason to write this join in the first place.

The habit that matters most: filter before joining

The single change that most reliably makes a slow join fast on a large table is moving filter() calls before the join().

# DON'T
mbs_db |>
  inner_join(dhda_sp_db, by = "synthetic_aeuid") |>
  filter(svcdate >= "2015-01-01", svcdate < "2015-02-01")

# DO
mbs_db |>
  filter(svcdate >= "2015-01-01", svcdate < "2015-02-01") |>
  inner_join(dhda_sp_db, by = "synthetic_aeuid")

Why the order matters: the first (DON’T) pipeline first tries to join every row of MBS (372 million of them) to the spine, and only then throws away the rows outside the date range. Under Arrow this usually runs out of memory. DuckDB will often rearrange the steps for you behind the scenes, but writing the filter first is safer: it means only the rows you actually want ever reach the join.

Two corollaries:

  • Select columns early. If you only need synthetic_aeuid and numserv from MBS, say so before the join: parquet reads only the columns the query touches, but only when the select() sits upstream of everything else. Every extra column costs memory inside the join.
  • If something is slow, check the SQL with show_query(). You will see whether your filter made it into the inner scan, or whether it got stuck outside the join.
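That check is one line. show_query() prints the SQL dbplyr generated; dplyr::explain() on the same pipeline prints DuckDB's plan instead:

```r
# Same "DO" pipeline as above: filter first, then join.
mbs_db |>
  filter(svcdate >= "2015-01-01", svcdate < "2015-02-01") |>
  inner_join(dhda_sp_db, by = "synthetic_aeuid") |>
  show_query()
# Look for the svcdate conditions in the subquery that scans MBS.
# In the explain() output, the filter should sit directly on the
# MBS scan, not above the join.
```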

The PLIDA three-table join

The canonical analytical question joins three tables: demographics on the universal spine_id, an agency spine table to bridge identifiers, and a large fact table keyed on the agency identifier.

Code
big <- demo_db |>
  select(spine_id, year_of_birth, core_gender) |>
  inner_join(dhda_sp_db, by = "spine_id") |>
  inner_join(mbs_db |> select(synthetic_aeuid, numserv, feecharged),
             by = "synthetic_aeuid") |>
  mutate(age_2015   = 2015L - year_of_birth,
         age_decade = as.integer(floor(age_2015 / 10) * 10)) |>
  filter(age_2015 >= 0L, age_2015 < 100L) |>
  group_by(age_decade, core_gender) |>
  summarise(services = sum(numserv,    na.rm = TRUE),
            fees     = sum(feecharged, na.rm = TRUE),
            claims   = n(), .groups = "drop") |>
  collect()
head(big)

⏱️ 3-table join on MBS (372M): 3.073s

Code
big |>
  filter(core_gender %in% c("M", "F")) |>
  mutate(gender = if_else(core_gender == "F", "Female", "Male")) |>
  ggplot(aes(age_decade, services / 1e6, fill = gender)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = label_comma(suffix = "M")) +
  scale_fill_manual(values = c("Female" = "#E69F00", "Male" = "#56B4E9")) +
  labs(x = "Age group (decades)", y = "MBS services, 2015", fill = NULL,
       title = "MBS service use by age and sex, 2015",
       subtitle = "30M × 30M × 372M, one DuckDB query") +
  theme_minimal(base_size = 11)

Three tables, the largest 372M rows, a 20 GB memory budget. On the T3 benchmark (A3) DuckDB finishes this in about 4 seconds. Arrow is killed by the operating system part-way through, because it needs to keep the whole join in memory at once. dplyr / data.table cannot attempt it at all because the MBS parquet is larger than RAM.

Important: The size rule for joins

Why does DuckDB manage where Arrow doesn’t? The join needs to build a lookup structure that contains one side of the join. If that structure fits in RAM, Arrow is fine. If it doesn’t, Arrow dies — it has no mechanism for spilling pieces of the lookup to disk. DuckDB does: when it runs low on memory, it writes some of the intermediate work to a scratch directory and continues. On a 32 GB machine, the MBS join (372M rows) is over the Arrow line and under the DuckDB line.

Default to DuckDB for any PLIDA join that involves a real fact table. Arrow is fine for joins between small-to-medium tables.
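One related knob, as a sketch (the scratch path is a placeholder, not from this project): DuckDB's temp_directory setting controls where it spills intermediate join state when the memory limit is reached, so point it at a disk with room to spare.

```r
# Tell DuckDB where it may spill when a join exceeds memory_limit.
# '/scratch/duckdb_tmp' is a placeholder -- use a fast disk with space.
dbExecute(con, "SET temp_directory = '/scratch/duckdb_tmp'")
```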

Storage choices that affect join speed

A couple of one-off choices about how the data is written change how fast the joins that use it are. All backed up by numbers in A3.

  • Sorting a fact table by the person-ID column helps single-person lookups a lot. On the 372M-row MBS table, sorting by SYNTHETIC_AEUID made “give me everything for person X” about 47× faster. This is the same trick as in Reading and storing: once the data is sorted, the parquet reader can skip all the chunks that don’t contain person X.
  • Sorting does not help broad analytical joins. A three-way join filtered on year_of_birth gets no benefit from sorting MBS by synthetic_aeuid, because people born in 1970 are spread uniformly across the whole range of IDs — there’s no contiguous range of IDs to skip to. Sort for single-person lookups, not for group analyses.
  • Integer join keys are about 1.9× faster than 12-character string keys. If you are joining on synthetic_aeuid all day, creating a numeric version once at ingest (e.g. hash(synthetic_aeuid)) and joining on that is worth benchmarking. Keep the original string around — hashes are not guaranteed unique — but for the hot-path joins the integer version is faster.
  • Declaring a PRIMARY KEY or UNIQUE constraint does not make joins faster. It slows down the one-time ingest (roughly 4×) and gives you an integrity check at write-time — nothing more. Use constraints if you want DuckDB to guarantee no duplicates, not for query speed.
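A minimal sketch of the integer-key idea, using DuckDB's built-in hash() (which returns a 64-bit unsigned integer). The table and key names follow this chapter; mbs_hashed and aeuid_hash are illustrative names of my own, and this needs a writable (not read_only) connection:

```r
# One-off at ingest: materialise an integer surrogate key alongside
# the original 12-character string key.
dbExecute(con, "
  CREATE TABLE mbs_hashed AS
  SELECT *, hash(synthetic_aeuid) AS aeuid_hash
  FROM mbs
")
# Join on aeuid_hash for the hot-path joins; keep synthetic_aeuid
# around to resolve the (rare) possibility of hash collisions.
```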

A quick recipe card

  • inner_join() for “matches on both sides”.
  • left_join() for “every left row, matched or not”.
  • semi_join() for “left rows with a match on the right” — use this instead of filter(x %in% ...).
  • anti_join() for “left rows with no match on the right”.
  • full_join() for the symmetric case — diagnostics, spine-coverage questions.
  • Put every filter() as close to the scan as possible. Select columns early.
  • For small/medium tables, Arrow and DuckDB both work. For anything that looks like a PLIDA fact-table join, use DuckDB.
  • Sort fact tables by the join key if your workload is dominated by point lookups. Don’t otherwise.

What lives in A3

  • T3: demo × dhda_spine × mbs (372M rows) across five backends.
  • Walkthrough timings for small inner / left / semi / anti / full joins.
  • Storage variants: baseline vs sorted vs integer-hash-keyed MBS, across point / narrow / full queries.
  • PRIMARY KEY variant build cost and query timings.