This case study asks: how does MBS service use vary by sex and age? It’s the simplest genuinely useful PLIDA question that requires joining across datasets — demographics from one source, health service use from another, linked via the spine.
The MBS 2015 claims file has 372 million rows. Loading it into memory is not an option on a 32 GB machine — we tested that in the join benchmarks, and even arrow’s join engine was killed by the operating system. DuckDB handles the whole thing in a few seconds.
This is the standard recipe from the Joins chapter: a read-only connection to the materialised DuckDB database, three lazy tbl() handles, dbplyr verbs, collect() at the end. Everything else that follows in this case study is just choosing the right joins and pushing filters to the right place.
Setup
Same opening as case study 1 and the recipe from Joins: load tidyverse plus the DuckDB-facing packages, set a memory and thread budget, and either open the existing database or point DuckDB at the parquet folders. We don’t create lazy tbl() handles up front in this case study — they’re declared below next to where they’re used, so the join chain is easier to follow.
From a DuckDB database (default)

library(tidyverse)
library(scales)
library(tictoc)
library(duckdb)
library(DBI)
library(dbplyr)

# Centralised paths for this project — a short file that just defines
# where the DuckDB and parquet folders live. See the Setup chapter.
source("R/00-paths.R")
data_dir <- paths$duckdb_dir  # replace with your own path

con <- dbConnect(
  duckdb::duckdb(),
  dbdir = file.path(data_dir, "plida_tables.duckdb"),
  read_only = TRUE
)
dbExecute(con, "SET memory_limit = '20GB'")  # Set at about 90% of your RAM
[1] 0
dbExecute(con, "SET threads TO 8")  # Set to your number of cores less one
[1] 0
From parquet, no database file
library(tidyverse)
library(scales)
library(tictoc)
library(glue)
library(duckdb)
library(DBI)
library(dbplyr)

source("R/00-paths.R")
pq <- paths$parquet_root  # replace with your own parquet folder

con <- dbConnect(duckdb::duckdb())  # in-memory DuckDB, no file
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET threads TO 8")

# Each lazy handle is a one-liner that points at a parquet glob. The
# joins in the main body then work identically to the database version.
from_parquet <- function(glob) {
  tbl(con, sql(glue("SELECT * FROM read_parquet('{glob}')")))
}

# demo       <- from_parquet(file.path(pq, "demo/*.parquet")) |> rename_with(tolower)
# dhda_spine <- from_parquet(file.path(pq, "dhda-spine/*.parquet")) |> rename_with(tolower)
# mbs        <- from_parquet(file.path(pq, "dhda-mbs/*.parquet")) |> rename_with(tolower)
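Before building any joins, it can be worth a one-line sanity check that the connection actually sees the tables you expect. This is a sketch using standard DBI and dbplyr calls; it assumes the database variant (the parquet variant registers no tables, so peek at a `from_parquet()` handle instead):

```r
# List the tables DuckDB can see on this connection, then peek at one
# table's column names without reading any rows (the tbl() stays lazy).
dbListTables(con)
tbl(con, "demo") |> colnames()
```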
The PLIDA join model
PLIDA data comes from different government agencies, and each agency has its own person identifier (SYNTHETIC_AEUID). The spine tables link them: each spine maps spine_id (a universal key) to one agency’s identifier. So the join chain for “demographics → MBS” is:

demo[SPINE_ID] → dhda_spine[spine_id, SYNTHETIC_AEUID] → mbs[SYNTHETIC_AEUID]
Why the spine tables exist: no single agency holds a universal person identifier. The spine tables are the linkage layer — they map each agency’s identifier to a common spine_id. You’ll always need at least one spine join to link data across agencies.
Set up the lazy tables

demo       <- tbl(con, "demo")       |> rename_with(tolower)
dhda_spine <- tbl(con, "dhda_spine") |> rename_with(tolower)
mbs        <- tbl(con, "mbs")        |> rename_with(tolower)

Three handles. Nothing loaded. rename_with(tolower) pushes every column name to lowercase snake_case as it comes out of DuckDB, so the rest of the pipe can stay lowercase. Let’s check what we’re working with:
demo |>head(3) |>collect()
dhda_spine |>head(3) |>collect()
mbs |>head(3) |>collect()
Build and run the query
For each sex and 5-year age bracket, how many MBS services were used in 2015 and how much was charged?

tic("Three-table join on 372M MBS rows")
result <- demo |>
  select(spine_id, core_gender, year_of_birth) |>
  inner_join(dhda_spine, by = "spine_id") |>
  inner_join(
    mbs |> select(synthetic_aeuid, numserv, feecharged),
    by = "synthetic_aeuid"
  ) |>
  mutate(
    age_2015 = 2015L - year_of_birth,
    age_group = as.integer(floor(age_2015 / 5) * 5)
  ) |>
  filter(age_2015 >= 0L, age_2015 < 100L) |>
  group_by(core_gender, age_group) |>
  summarise(
    services = sum(numserv, na.rm = TRUE),
    fee_charged = sum(feecharged, na.rm = TRUE),
    claims = n(),
    .groups = "drop"
  ) |>
  collect()
toc()
Three-table join on 372M MBS rows: 6.451 sec elapsed
result
That joined 30 million demographic records through 30 million spine records to 372 million MBS claims, bucketed by age, and collected a tidy result — in the time shown above.

Here’s the SQL dbplyr generated:
<SQL>
SELECT
core_gender,
age_group,
SUM(numserv) AS services,
SUM(feecharged) AS fee_charged,
COUNT(*) AS claims
FROM (
SELECT q01.*, CAST(FLOOR(age_2015 / 5.0) * 5.0 AS INTEGER) AS age_group
FROM (
SELECT q01.*, 2015 - year_of_birth AS age_2015
FROM (
SELECT
demo.SPINE_ID AS spine_id,
CORE_GENDER AS core_gender,
YEAR_OF_BIRTH AS year_of_birth,
dhda_spine.SYNTHETIC_AEUID AS synthetic_aeuid,
NUMSERV AS numserv,
FEECHARGED AS feecharged
FROM demo
INNER JOIN dhda_spine
ON (demo.SPINE_ID = dhda_spine.spine_id)
INNER JOIN mbs
ON (dhda_spine.SYNTHETIC_AEUID = mbs.SYNTHETIC_AEUID)
) q01
) q01
WHERE (age_2015 >= 0) AND (age_2015 < 100)
) q01
GROUP BY core_gender, age_group
Things to notice
Normalise column case up front. rename_with(tolower) at the top of each tbl() means the rest of the pipe joins on spine_id and synthetic_aeuid without any one-off renames. That is the idiom the book uses throughout (Data manipulation § The mental model).
select() before joining. We select only the columns we need from each table before the join. DuckDB pushes the projection down, so it reads fewer columns from disk and the hash table is smaller.
Filters also go before joins. The age_2015 >= 0 filter is on the already-joined result here because the column only exists after the mutate, but where a filter exists on a raw column (e.g. year_of_birth >= 1960), push it upstream of every join. Arrow’s optimiser will not do this transformation for you; see Joins § The habit that matters most.
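As a sketch of that habit, using the hypothetical year_of_birth >= 1960 filter from the paragraph above (this query is illustrative only and not part of the case study):

```r
# Filter on the raw column *before* the joins, so DuckDB scans fewer
# demo rows and builds a smaller hash table. The derived age_2015
# filter, by contrast, can only come after the mutate.
services_1960s <- demo |>
  select(spine_id, core_gender, year_of_birth) |>
  filter(year_of_birth >= 1960) |>   # pushed upstream of every join
  inner_join(dhda_spine, by = "spine_id") |>
  inner_join(
    mbs |> select(synthetic_aeuid, numserv),
    by = "synthetic_aeuid"
  ) |>
  count(core_gender, wt = numserv, name = "services") |>
  collect()
```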
Important: semi_join() / anti_join() instead of filter(x %in% ...)
Not used here, but worth calling out because it is the single most useful join idiom for PLIDA work. If you ever find yourself writing filter(synthetic_aeuid %in% (other_table |> pull(synthetic_aeuid))), rewrite it as semi_join(other_table, by = "synthetic_aeuid"). The %in% form silently materialises the right‑hand column into R memory and ships a giant IN (...) literal back in SQL; the semi_join() keeps the whole operation inside DuckDB. See Joins § filtering joins.
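A minimal before/after sketch of that rewrite — other_table here is a stand-in for any lazy tbl() holding the IDs you want to keep:

```r
# Slow: pull() materialises the whole ID column into R, and dbplyr then
# ships it back to DuckDB as a giant IN (...) literal.
# ids <- other_table |> pull(synthetic_aeuid)
# mbs |> filter(synthetic_aeuid %in% ids)

# Fast: the filtering join stays inside DuckDB end to end, and no ID
# column ever crosses into R memory.
mbs |> semi_join(other_table, by = "synthetic_aeuid")
```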
Warning: collect() comes last — not in the middle
If you accidentally write demo |> collect() |> inner_join(dhda_spine, ...), you’ve loaded 30 million rows into R memory before joining — and the join to MBS would try to pull 372 million rows in too. Keep the pipe lazy until the result is aggregated down to something small.
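The same point as a two-line contrast (a sketch, reusing the handles from this case study):

```r
# Wrong: materialises 30M demo rows in R before the join even starts.
# demo |> collect() |> inner_join(dhda_spine |> collect(), by = "spine_id")

# Right: the join and aggregation run in DuckDB; only the small
# per-gender summary crosses into R at the end.
demo |>
  inner_join(dhda_spine, by = "spine_id") |>
  count(core_gender) |>
  collect()
```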
Plot
result |>
  filter(core_gender %in% c("M", "F")) |>
  ggplot(aes(age_group, services / 1e6, fill = core_gender)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = label_comma(suffix = "M")) +
  scale_fill_manual(
    values = c("F" = "#E69F00", "M" = "#56B4E9"),
    labels = c("F" = "Female", "M" = "Male")
  ) +
  labs(
    x = "Age group", y = "MBS services (millions)",
    fill = NULL,
    title = "MBS service use by age and sex, 2015",
    subtitle = "fplida synthetic data, 30M person corpus"
  ) +
  theme_minimal(base_size = 11)
result |>
  filter(core_gender %in% c("M", "F")) |>
  mutate(fee_per_service = fee_charged / services) |>
  ggplot(aes(age_group, fee_per_service, colour = core_gender)) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 1.5) +
  scale_y_continuous(labels = label_dollar()) +
  scale_colour_manual(
    values = c("F" = "#E69F00", "M" = "#56B4E9"),
    labels = c("F" = "Female", "M" = "Male")
  ) +
  labs(
    x = "Age group", y = "Average fee per service",
    colour = NULL,
    title = "Average MBS fee per service by age and sex",
    subtitle = "fplida synthetic data"
  ) +
  theme_minimal(base_size = 11)
When dbplyr is not enough
For the vast majority of PLIDA questions, dplyr verbs translate cleanly. When they do not — typically a DuckDB-specific function with no tidy counterpart — the lightest escape is a single sql() fragment inside a mutate() or filter():

demo |> mutate(something = sql("SOME_DUCKDB_FUNCTION(col)"))
You stay in the dplyr pipe and hand DuckDB a literal expression for that one column. Reserve it for the genuinely irreducible case; if more than one sql(...) appears in a pipeline, rewrite the tidyverse side instead. dbGetQuery() with a hand-written SQL query is almost never the right answer — it is used once in the site for ASOF JOIN and not elsewhere.
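One practical habit when reaching for the escape hatch: check the translation with show_query() before running anything. A small sketch — the CAST expression here is just a placeholder for whatever DuckDB-specific expression you actually need:

```r
# The sql() fragment is passed through to DuckDB verbatim; show_query()
# prints the generated SQL so you can confirm it landed where intended.
demo |>
  mutate(yob_text = sql("CAST(year_of_birth AS VARCHAR)")) |>
  show_query()
```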
A note on table layout
We found in the storage-layout join experiments that sorting MBS by the join key makes point lookups 47× faster but doesn’t speed up broad analytical queries like this one. If you’re looking up one specific person’s claims, sort the table. For population-level analysis, it doesn’t matter.
The one thing that helps every join: using an integer hash key instead of the string AEUID. We measured a 2× speedup.