Linking tables, and using joins instead of other tools

Joins are the heart of PLIDA work. The whole reason the spine exists is to link tables. Most analyses end in a three-way join that mixes a small demographic table, a spine mapping, and a huge fact table.

This chapter covers the join types that matter for PLIDA, the two habits that matter most (push filters before joins, and use semi_join/anti_join instead of filter(x %in% ...)), and the point at which Arrow stops and you need DuckDB. The evidence lives in Appendix A3.

Setup

Same recipe as Data manipulation: load tidyverse plus the DuckDB-facing packages, set a memory and thread budget, and either open a connection to the materialised plida_tables.duckdb database (default) or skip the database and point DuckDB at parquet folders directly. This chapter joins five tables, so we create five lazy tbl() handles up front — itr_db, demo_db, ato_sp_db, dhda_sp_db, mbs_db. None of them reads anything yet; they are just references to tables that the dplyr verbs below will pipe against.

Code
library(tidyverse)
library(scales)
library(duckdb)
library(DBI)
library(dbplyr)

# Centralised paths for this project — a short file that just defines
# where the DuckDB and parquet folders live. See the Setup chapter.
source("R/00-paths.R")
data_dir <- paths$duckdb_dir # replace with your own path

con <- dbConnect(
  duckdb::duckdb(),
  dbdir     = file.path(data_dir, "plida_tables.duckdb"),
  read_only = TRUE
)
dbExecute(con, "SET memory_limit = '20GB'") # set to about 90% of your RAM
[1] 0
Code
dbExecute(con, "SET threads TO 8")          # set to your core count minus 1
[1] 0
Code
itr_db     <- tbl(con, "itr")        |> rename_with(tolower)
demo_db    <- tbl(con, "demo")       |> rename_with(tolower)
ato_sp_db  <- tbl(con, "ato_spine")  |> rename_with(tolower)
dhda_sp_db <- tbl(con, "dhda_spine") |> rename_with(tolower)
mbs_db     <- tbl(con, "mbs")        |> rename_with(tolower)
Code
library(tidyverse)
library(scales)
library(glue)
library(duckdb)
library(DBI)
library(dbplyr)

source("R/00-paths.R")
pq <- paths$parquet_root # replace with your own parquet folder

con <- dbConnect(duckdb::duckdb())              # in-memory DuckDB, no file
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET threads TO 8")

# Helper: `tbl()` on a SQL fragment that asks DuckDB to read a parquet glob.
from_parquet <- function(glob) {
  tbl(con, sql(glue("SELECT * FROM read_parquet('{glob}')")))
}

itr_db     <- from_parquet(file.path(pq, "itr_context/*.parquet")) |> rename_with(tolower)
demo_db    <- from_parquet(file.path(pq, "demo/*.parquet"))        |> rename_with(tolower)
ato_sp_db  <- from_parquet(file.path(pq, "ato-spine/*.parquet"))   |> rename_with(tolower)
dhda_sp_db <- from_parquet(file.path(pq, "dhda-spine/*.parquet"))  |> rename_with(tolower)
mbs_db     <- from_parquet(file.path(pq, "dhda-mbs/*.parquet"))    |> rename_with(tolower)

The running example

A small enough slice of the ITR to see each join type clearly: who was a manager (ANZSCO major 1) in 2014 and what were they doing in 2015? Two year-slices of the ITR, ~1.5M and ~10M rows, joined on the ATO identifier.

Code
mgrs_14 <- itr_db |>
  filter(incm_yr == 2014L,
         ocptn_grp_cd >= 1000L, ocptn_grp_cd < 2000L) |>
  select(synthetic_aeuid, occ_14 = ocptn_grp_cd)

y15 <- itr_db |>
  filter(incm_yr == 2015L, !is.na(ocptn_grp_cd)) |>
  select(synthetic_aeuid, occ_15 = ocptn_grp_cd)

Each of these is a lazy handle. Nothing has scanned a row yet.
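To see what a handle will eventually run, ask dbplyr for the SQL — show_query() prints it without executing anything:

```r
# Print the SQL behind the lazy handle -- no rows are scanned.
mgrs_14 |> show_query()
# The incm_yr and ocptn_grp_cd filters should appear in the WHERE
# clause, waiting for DuckDB to execute when we eventually collect().
```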

The four “does it match?” joins

inner_join() — intersection

Keep rows where the key exists on both sides. The pipeline below is every PLIDA transition question in miniature: what were 2014 managers doing in 2015?

Code
mgr_stayers <- mgrs_14 |>
  inner_join(y15, by = "synthetic_aeuid") |>
  mutate(
    occ_major_15 = as.integer(floor(occ_15 / 1000)),
    still_mgr    = occ_major_15 == 1L
  ) |>
  count(occ_major_15, still_mgr) |>
  collect()
head(mgr_stayers)

⏱️ inner join (1.5M × 10M): 0.118s

Code
occ_labels <- c(
  "1" = "Managers", "2" = "Professionals", "3" = "Technicians",
  "4" = "Community/personal", "5" = "Clerical/admin",
  "6" = "Sales", "7" = "Machinery ops", "8" = "Labourers"
)
mgr_stayers |>
  filter(occ_major_15 %in% 2:8) |>
  mutate(occ_15_label = factor(occ_labels[as.character(occ_major_15)],
                               levels = occ_labels)) |>
  ggplot(aes(occ_15_label, n)) +
  geom_col(fill = "#1b9e77") +
  scale_y_continuous(labels = label_comma()) +
  labs(x = "ANZSCO major group in 2015", y = "Managers in 2014",
       title = "What managers did next (1-year)",
       subtitle = "People who reported ANZSCO 1 in 2014, across their 2015 occupation") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

left_join() — every left row, matched or not

Useful when you want non-matches too — “of 2014 managers, which ones did not file in 2015?”

Code
mgr_all <- mgrs_14 |>
  left_join(y15, by = "synthetic_aeuid") |>
  mutate(filed_2015 = !is.na(occ_15)) |>
  count(filed_2015) |>
  collect()
mgr_all

⏱️ left join (1.5M × 10M): 0.109s

semi_join() and anti_join() — filtering joins

These are the most under-used verbs in the dplyr vocabulary for PLIDA work. semi_join() keeps rows from the left table that have at least one match on the right, without pulling the right-hand columns in. anti_join() keeps rows with no match.

Code
mgr_exits <- mgrs_14 |>
  anti_join(y15, by = "synthetic_aeuid") |>
  summarise(n = n()) |>
  collect()
mgr_exits

⏱️ anti join (1.5M vs 10M): 0.109s

Code
mgr_stayers_ct <- mgrs_14 |>
  semi_join(y15, by = "synthetic_aeuid") |>
  summarise(n = n()) |>
  collect()
mgr_stayers_ct

⏱️ semi join (1.5M vs 10M): 0.123s

Important: Use semi_join() / anti_join() instead of filter(x %in% ...)

The obvious-looking alternative to a semi-join is:

# DON'T — silently collects the right-hand column into R memory
y15_keys <- y15 |> pull(synthetic_aeuid)
mgrs_14 |> filter(synthetic_aeuid %in% y15_keys) |> summarise(n = n())

This is a landmine on large data. pull() pulls the entire right-hand key column into an R vector — for a 10M-row right side, that’s already 160 MB; for a 372M-row one, it breaks the machine. And then %in% has to send that vector back to DuckDB inside the query, which is slow both to build and to run.

semi_join() and anti_join() keep the whole operation inside DuckDB: the ID column on the right stays in the database, and DuckDB does the matching without anything coming back to R. No big R vector, no giant SQL literal.

Rule of thumb: if you find yourself writing filter(x %in% (y |> pull(key))), it is a semi-join. Write the semi-join.

This is the first and most common way joins replace other data-manipulation tools. The second is de-duplication — semi_join(y |> distinct()) is typically cheaper than a distinct() on the joined result. The third is set difference — anti_join() is exactly “in A but not in B”.
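A sketch of the de-duplication pattern with this chapter's tables (the result names here are mine, not from the benchmark code):

```r
# Joining first and de-duplicating afterwards materialises a wide,
# possibly row-multiplied intermediate result ...
slow_ids <- mgrs_14 |>
  inner_join(y15, by = "synthetic_aeuid") |>
  distinct(synthetic_aeuid)

# ... whereas a semi-join against the distinct right-hand keys never
# widens or multiplies rows in the first place.
fast_ids <- mgrs_14 |>
  semi_join(y15 |> distinct(synthetic_aeuid), by = "synthetic_aeuid") |>
  distinct(synthetic_aeuid)
```

Both are lazy handles that produce the same IDs; the second gives DuckDB much less intermediate state to carry.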

full_join() — union of both sides

Keep every row from both sides, filling NA where there is no match. The classic PLIDA diagnostic it answers: how many people appear in the ATO spine but not the DHDA spine?

Code
spine_coverage <- ato_sp_db |>
  transmute(spine_id, in_ato = TRUE) |>
  full_join(dhda_sp_db |> transmute(spine_id, in_dhda = TRUE), by = "spine_id") |>
  mutate(in_ato  = coalesce(in_ato,  FALSE),
         in_dhda = coalesce(in_dhda, FALSE)) |>
  count(in_ato, in_dhda) |>
  collect()
spine_coverage

⏱️ full join (30M × 30M): 0.552s

In the synthetic fplida corpus every person is present in every spine, so the off-diagonals are zero. On real PLIDA, the off-diagonals tell you how much attrition each spine is hiding — the reason to write this join in the first place.

The habit that matters most: filter before joining

The single change that most reliably makes a slow join fast on a large table is moving filter() calls before the join().

# DON'T
mbs_db |>
  inner_join(dhda_sp_db, by = "synthetic_aeuid") |>
  filter(svcdate >= "2015-01-01", svcdate < "2015-02-01")

# DO
mbs_db |>
  filter(svcdate >= "2015-01-01", svcdate < "2015-02-01") |>
  inner_join(dhda_sp_db, by = "synthetic_aeuid")

Why the order matters: the first (DON’T) pipeline first tries to join every row of MBS (372 million of them) to the spine, and only then throws away the rows outside the date range. Under Arrow this usually runs out of memory. DuckDB will often rearrange the steps for you behind the scenes, but writing the filter first is safer: it means only the rows you actually want ever reach the join.

Two corollaries:

  • Select columns early. If you only need synthetic_aeuid and numserv from MBS, say so before the join: parquet reads only the columns the query touches, but only when the select() sits upstream of everything else. Every extra column costs memory inside the join.
  • If something is slow, check the SQL with show_query(). You will see whether your filter made it into the inner scan, or whether it got stuck outside the join.
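That check is one line. show_query() prints the SQL dbplyr generated; dplyr::explain() on the same pipeline prints DuckDB's plan instead:

```r
# Same "DO" pipeline as above: filter first, then join.
mbs_db |>
  filter(svcdate >= "2015-01-01", svcdate < "2015-02-01") |>
  inner_join(dhda_sp_db, by = "synthetic_aeuid") |>
  show_query()
# Look for the svcdate conditions in the subquery that scans MBS.
# In the explain() output, the filter should sit directly on the
# MBS scan, not above the join.
```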

The PLIDA three-table join

The canonical analytical question joins three tables: demographics on the universal spine_id, an agency spine table to bridge identifiers, and a large fact table keyed on the agency identifier.

Code
big <- demo_db |>
  select(spine_id, year_of_birth, core_gender) |>
  inner_join(dhda_sp_db, by = "spine_id") |>
  inner_join(mbs_db |> select(synthetic_aeuid, numserv, feecharged),
             by = "synthetic_aeuid") |>
  mutate(age_2015   = 2015L - year_of_birth,
         age_decade = as.integer(floor(age_2015 / 10) * 10)) |>
  filter(age_2015 >= 0L, age_2015 < 100L) |>
  group_by(age_decade, core_gender) |>
  summarise(services = sum(numserv,    na.rm = TRUE),
            fees     = sum(feecharged, na.rm = TRUE),
            claims   = n(), .groups = "drop") |>
  collect()
head(big)

⏱️ 3-table join on MBS (372M): 3.073s

Code
big |>
  filter(core_gender %in% c("M", "F")) |>
  mutate(gender = if_else(core_gender == "F", "Female", "Male")) |>
  ggplot(aes(age_decade, services / 1e6, fill = gender)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = label_comma(suffix = "M")) +
  scale_fill_manual(values = c("Female" = "#E69F00", "Male" = "#56B4E9")) +
  labs(x = "Age group (decades)", y = "MBS services, 2015", fill = NULL,
       title = "MBS service use by age and sex, 2015",
       subtitle = "30M × 30M × 372M, one DuckDB query") +
  theme_minimal(base_size = 11)

Three tables, the largest 372M rows, a 20 GB memory budget. On the T3 benchmark (A3) DuckDB finishes this in about 4 seconds. Arrow is killed by the operating system part-way through, because it needs to keep the whole join in memory at once. dplyr / data.table cannot attempt it at all because the MBS parquet is larger than RAM.

Important: The size rule for joins

Why does DuckDB manage where Arrow doesn’t? The join needs to build a lookup structure that contains one side of the join. If that structure fits in RAM, Arrow is fine. If it doesn’t, Arrow dies — it has no mechanism for spilling pieces of the lookup to disk. DuckDB does: when it runs low on memory, it writes some of the intermediate work to a scratch directory and continues. On a 32 GB machine, the MBS join (372M rows) is over the Arrow line and under the DuckDB line.

Default to DuckDB for any PLIDA join that involves a real fact table. Arrow is fine for joins between small-to-medium tables.
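One related knob, as a sketch (the scratch path is a placeholder, not from this project): DuckDB's temp_directory setting controls where it spills intermediate join state when the memory limit is reached, so point it at a disk with room to spare.

```r
# Tell DuckDB where it may spill when a join exceeds memory_limit.
# '/scratch/duckdb_tmp' is a placeholder -- use a fast disk with space.
dbExecute(con, "SET temp_directory = '/scratch/duckdb_tmp'")
```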

Storage choices that affect join speed

A couple of one-off choices about how the data is written change how fast the joins that use it are. All backed up by numbers in A3.

  • Sorting a fact table by the person-ID column helps single-person lookups a lot. On the 372M-row MBS table, sorting by SYNTHETIC_AEUID made “give me everything for person X” about 47× faster. This is the same trick as in Reading and storing: once the data is sorted, the parquet reader can skip all the chunks that don’t contain person X.
  • Sorting does not help broad analytical joins. A three-way join filtered on year_of_birth gets no benefit from sorting MBS by synthetic_aeuid, because people born in 1970 are spread uniformly across the whole range of IDs — there’s no contiguous range of IDs to skip to. Sort for single-person lookups, not for group analyses.
  • Integer join keys are about 1.9× faster than 12-character string keys. If you are joining on synthetic_aeuid all day, creating a numeric version once at ingest (e.g. hash(synthetic_aeuid)) and joining on that is worth benchmarking. Keep the original string around — hashes are not guaranteed unique — but for the hot-path joins the integer version is faster.
  • Declaring a PRIMARY KEY or UNIQUE constraint does not make joins faster. It slows down the one-time ingest (roughly 4×) and gives you an integrity check at write-time — nothing more. Use constraints if you want DuckDB to guarantee no duplicates, not for query speed.
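A minimal sketch of the integer-key idea, using DuckDB's built-in hash() (which returns a 64-bit unsigned integer). The table and key names follow this chapter; mbs_hashed and aeuid_hash are illustrative names of my own, and this needs a writable (not read_only) connection:

```r
# One-off at ingest: materialise an integer surrogate key alongside
# the original 12-character string key.
dbExecute(con, "
  CREATE TABLE mbs_hashed AS
  SELECT *, hash(synthetic_aeuid) AS aeuid_hash
  FROM mbs
")
# Join on aeuid_hash for the hot-path joins; keep synthetic_aeuid
# around to resolve the (rare) possibility of hash collisions.
```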

A quick recipe card

  • inner_join() for “matches on both sides”.
  • left_join() for “every left row, matched or not”.
  • semi_join() for “left rows with a match on the right” — use this instead of filter(x %in% ...).
  • anti_join() for “left rows with no match on the right”.
  • full_join() for the symmetric case — diagnostics, spine-coverage questions.
  • Put every filter() as close to the scan as possible. Select columns early.
  • For small/medium tables, Arrow and DuckDB both work. For anything that looks like a PLIDA fact-table join, use DuckDB.
  • Sort fact tables by the join key if your workload is dominated by point lookups. Don’t otherwise.

What lives in A3

  • T3: demo × dhda_spine × mbs (372M rows) across five backends.
  • Walkthrough timings for small inner / left / semi / anti / full joins.
  • Storage variants: baseline vs sorted vs integer-hash-keyed MBS, across point / narrow / full queries.
  • PRIMARY KEY variant build cost and query timings.