Case study 2: Who uses health services?

A three-table join across the PLIDA spine

This case study asks: how does MBS service use vary by sex and age? It’s the simplest genuinely useful PLIDA question that requires joining across datasets — demographics from one source, health service use from another, linked via the spine.

The MBS 2015 claims file has 372 million rows. Loading it into memory is not an option on a 32 GB machine — we tested that in the join benchmarks, and even arrow’s join engine was killed by the operating system. DuckDB handles the whole thing in a few seconds.

This is the standard recipe from the Joins chapter: a read-only connection to the materialised DuckDB database, three lazy tbl() handles, dbplyr verbs, collect() at the end. Everything else that follows in this case study is just choosing the right joins and pushing filters to the right place.

Setup

Same opening as case study 1 and the recipe from Joins: load tidyverse plus the DuckDB-facing packages, set a memory and thread budget, and either open the existing database or point DuckDB at the parquet folders. We don’t create lazy tbl() handles up front in this case study — they’re declared below next to where they’re used, so the join chain is easier to follow.

Code
library(tidyverse)
library(scales)
library(tictoc)
library(duckdb)
library(DBI)
library(dbplyr)

# Centralised paths for this project — a short file that just defines
# where the DuckDB and parquet folders live. See the Setup chapter.
source("R/00-paths.R")
data_dir <- paths$duckdb_dir # replace with your own path

con <- dbConnect(
  duckdb::duckdb(),
  dbdir     = file.path(data_dir, "plida_tables.duckdb"),
  read_only = TRUE
)
dbExecute(con, "SET memory_limit = '20GB'") # Set below your total RAM (here 20 of 32 GB)
[1] 0
Code
dbExecute(con, "SET threads TO 8")          # Set to one fewer than your core count
[1] 0
Code
library(tidyverse)
library(scales)
library(tictoc)
library(glue)
library(duckdb)
library(DBI)
library(dbplyr)

source("R/00-paths.R")
pq <- paths$parquet_root # replace with your own parquet folder

con <- dbConnect(duckdb::duckdb())              # in-memory DuckDB, no file
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET threads TO 8")

# Each lazy handle is a one-liner that points at a parquet glob. The
# joins in the main body then work identically to the database version.
from_parquet <- function(glob) {
  tbl(con, sql(glue("SELECT * FROM read_parquet('{glob}')")))
}
# demo       <- from_parquet(file.path(pq, "demo/*.parquet"))       |> rename_with(tolower)
# dhda_spine <- from_parquet(file.path(pq, "dhda-spine/*.parquet")) |> rename_with(tolower)
# mbs        <- from_parquet(file.path(pq, "dhda-mbs/*.parquet"))   |> rename_with(tolower)

The PLIDA join model

PLIDA data comes from different government agencies, and each agency uses its own person identifier (for the health datasets in this case study, SYNTHETIC_AEUID). The spine tables link them: each spine table maps spine_id (a universal key) to one agency's identifier.

So the join chain for “demographics → MBS” is:

demo[SPINE_ID]  →  dhda_spine[spine_id, SYNTHETIC_AEUID]  →  mbs[SYNTHETIC_AEUID]
Note: Why the spine tables exist

No single agency holds a universal person identifier. The spine tables are the linkage layer — they map each agency’s identifier to a common spine_id. You’ll always need at least one spine join to link data across agencies.
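As a toy illustration of that chain (three made-up miniature tables, not real PLIDA data; plain tibbles here, but the verbs are identical on lazy tbl() handles):

```r
library(dplyr)

# Hypothetical miniature versions of the three tables
demo_mini  <- tibble(spine_id = c("s1", "s2"), core_gender = c("F", "M"))
spine_mini <- tibble(spine_id = c("s1", "s2"),
                     synthetic_aeuid = c("a9", "a7"))
mbs_mini   <- tibble(synthetic_aeuid = c("a9", "a9", "a7"),
                     numserv = c(1, 2, 1))

svc <- demo_mini |>
  inner_join(spine_mini, by = "spine_id") |>        # demographics -> spine
  inner_join(mbs_mini, by = "synthetic_aeuid") |>   # spine -> claims
  count(core_gender, wt = numserv, name = "services")
svc
# F: 3 services, M: 1 service
```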

Set up the lazy tables

Code
demo       <- tbl(con, "demo")       |> rename_with(tolower)
dhda_spine <- tbl(con, "dhda_spine") |> rename_with(tolower)
mbs        <- tbl(con, "mbs")        |> rename_with(tolower)

Three handles. Nothing loaded. rename_with(tolower) lowercases every column name as it comes out of DuckDB, so the rest of the pipe can stay lowercase. Let’s check what we’re working with:

Code
demo |> head(3) |> collect()
Code
dhda_spine |> head(3) |> collect()
Code
mbs |> head(3) |> collect()

Build and run the query

For each sex and 5-year age bracket, how many MBS services were used in 2015 and how much was charged?

Code
tic("Three-table join on 372M MBS rows")
result <- demo |>
  select(spine_id, core_gender, year_of_birth) |>
  inner_join(dhda_spine, by = "spine_id") |>
  inner_join(
    mbs |> select(synthetic_aeuid, numserv, feecharged),
    by = "synthetic_aeuid"
  ) |>
  mutate(
    age_2015  = 2015L - year_of_birth,
    age_group = as.integer(floor(age_2015 / 5) * 5)
  ) |>
  filter(age_2015 >= 0L, age_2015 < 100L) |>
  group_by(core_gender, age_group) |>
  summarise(
    services    = sum(numserv, na.rm = TRUE),
    fee_charged = sum(feecharged, na.rm = TRUE),
    claims      = n(),
    .groups = "drop"
  ) |>
  collect()
toc()
Three-table join on 372M MBS rows: 6.451 sec elapsed
Code
result

That joined 30 million demographic records through 30 million spine records to 372 million MBS claims, bucketed by age, and collected a tidy result — in the time shown above.

Here’s the SQL dbplyr generated:

Code
demo |>
  select(spine_id, core_gender, year_of_birth) |>
  inner_join(dhda_spine, by = "spine_id") |>
  inner_join(mbs |> select(synthetic_aeuid, numserv, feecharged), by = "synthetic_aeuid") |>
  mutate(age_2015 = 2015L - year_of_birth,
         age_group = as.integer(floor(age_2015 / 5) * 5)) |>
  filter(age_2015 >= 0L, age_2015 < 100L) |>
  group_by(core_gender, age_group) |>
  summarise(services = sum(numserv, na.rm = TRUE),
            fee_charged = sum(feecharged, na.rm = TRUE),
            claims = n(), .groups = "drop") |>
  show_query()
<SQL>
SELECT
  core_gender,
  age_group,
  SUM(numserv) AS services,
  SUM(feecharged) AS fee_charged,
  COUNT(*) AS claims
FROM (
  SELECT q01.*, CAST(FLOOR(age_2015 / 5.0) * 5.0 AS INTEGER) AS age_group
  FROM (
    SELECT q01.*, 2015 - year_of_birth AS age_2015
    FROM (
      SELECT
        demo.SPINE_ID AS spine_id,
        CORE_GENDER AS core_gender,
        YEAR_OF_BIRTH AS year_of_birth,
        dhda_spine.SYNTHETIC_AEUID AS synthetic_aeuid,
        NUMSERV AS numserv,
        FEECHARGED AS feecharged
      FROM demo
      INNER JOIN dhda_spine
        ON (demo.SPINE_ID = dhda_spine.spine_id)
      INNER JOIN mbs
        ON (dhda_spine.SYNTHETIC_AEUID = mbs.SYNTHETIC_AEUID)
    ) q01
  ) q01
  WHERE (age_2015 >= 0) AND (age_2015 < 100)
) q01
GROUP BY core_gender, age_group

Things to notice

Normalise column case up front. rename_with(tolower) at the top of each tbl() means the rest of the pipe joins on spine_id and synthetic_aeuid without any one-off renames. That is the idiom the book uses throughout (Data manipulation § The mental model).

select() before joining. We select only the columns we need from each table before the join. DuckDB pushes the projection down, so it reads fewer columns from disk and the hash table is smaller.

Filters also go before joins. The age_2015 >= 0 filter is on the already-joined result here because the column only exists after the mutate, but where a filter exists on a raw column (e.g. year_of_birth >= 1960), push it upstream of every join. Arrow’s optimiser will not do this transformation for you; see Joins § The habit that matters most.
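A sketch of that habit (the 1960 cutoff and the toy tables are invented for illustration; on lazy handles dbplyr translates the same verb order into SQL with the filter ahead of the join):

```r
library(dplyr)

demo_mini  <- tibble(spine_id = 1:4,
                     year_of_birth = c(1950L, 1965L, 1980L, 1959L))
spine_mini <- tibble(spine_id = 1:4, synthetic_aeuid = 101:104)

# Filter on the raw column BEFORE the join, so the join sees fewer rows
kept <- demo_mini |>
  filter(year_of_birth >= 1960L) |>
  inner_join(spine_mini, by = "spine_id")
kept
```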

Important: semi_join() / anti_join() instead of filter(x %in% ...)

Not used here, but worth calling out because it is the single most useful join idiom for PLIDA work. If you ever find yourself writing filter(synthetic_aeuid %in% (other_table |> pull(synthetic_aeuid))), rewrite it as semi_join(other_table, by = "synthetic_aeuid"). The %in% form silently materialises the right-hand column into R memory and ships a giant IN (...) literal back to the database; semi_join() keeps the whole operation inside DuckDB. See Joins § filtering joins.
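A minimal contrast on invented toy data (plain tibbles give the same rows either way; the difference only bites on lazy tbl() handles, where only the semi_join() form stays inside DuckDB):

```r
library(dplyr)

mbs_mini <- tibble(synthetic_aeuid = c("a1", "a2", "a3"),
                   numserv = c(2, 1, 5))
cohort   <- tibble(synthetic_aeuid = c("a1", "a3"))

# Avoid: pulls the cohort column into R, then ships an IN (...) literal
bad  <- mbs_mini |>
  filter(synthetic_aeuid %in% (cohort |> pull(synthetic_aeuid)))

# Prefer: a filtering join the database can run as a hash semi-join
good <- mbs_mini |> semi_join(cohort, by = "synthetic_aeuid")

identical(bad, good)   # same rows either way; only the execution differs
```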

Warning: collect() comes last, not in the middle

If you accidentally write demo |> collect() |> inner_join(dhda_spine, ...), you’ve loaded 30 million rows into R memory before joining — and the join to MBS would try to pull 372 million rows in too. Keep the pipe lazy until the result is aggregated down to something small.
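A small in-memory sketch of the right shape, assuming the duckdb package is installed (toy claims table; the point is that collect() only ever receives the aggregated rows):

```r
library(duckdb)
library(dplyr)
library(dbplyr)

con <- dbConnect(duckdb::duckdb())   # throwaway in-memory database
dbWriteTable(con, "claims",
             data.frame(sex = c("F", "F", "M"), numserv = c(1, 2, 1)))

# Aggregate lazily inside DuckDB; collect() pulls back only the summary
res <- tbl(con, "claims") |>
  group_by(sex) |>
  summarise(services = sum(numserv, na.rm = TRUE), .groups = "drop") |>
  collect()

dbDisconnect(con, shutdown = TRUE)
res
```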

Plot

Code
result |>
  filter(core_gender %in% c("M", "F")) |>
  ggplot(aes(age_group, services / 1e6, fill = core_gender)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = label_comma(suffix = "M")) +
  scale_fill_manual(values = c("F" = "#E69F00", "M" = "#56B4E9"),
                    labels = c("F" = "Female", "M" = "Male")) +
  labs(x = "Age group", y = "MBS services (millions)",
       fill = NULL,
       title = "MBS service use by age and sex, 2015",
       subtitle = "fplida synthetic data, 30M person corpus") +
  theme_minimal(base_size = 11)

Code
result |>
  filter(core_gender %in% c("M", "F")) |>
  mutate(fee_per_service = fee_charged / services) |>
  ggplot(aes(age_group, fee_per_service, colour = core_gender)) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 1.5) +
  scale_y_continuous(labels = label_dollar()) +
  scale_colour_manual(values = c("F" = "#E69F00", "M" = "#56B4E9"),
                      labels = c("F" = "Female", "M" = "Male")) +
  labs(x = "Age group", y = "Average fee per service",
       colour = NULL,
       title = "Average MBS fee per service by age and sex",
       subtitle = "fplida synthetic data") +
  theme_minimal(base_size = 11)

When dbplyr is not enough

For the vast majority of PLIDA questions, dplyr verbs translate cleanly. When they do not — typically a DuckDB-specific function with no tidy counterpart — the lightest escape is a single sql() fragment inside a mutate() or filter():

demo |>
  mutate(something = sql("SOME_DUCKDB_FUNCTION(col)"))

You stay in the dplyr pipe and hand DuckDB a literal expression for that one column. Reserve it for the genuinely irreducible case; if more than one sql(...) appears in a pipeline, rewrite the tidyverse side instead. dbGetQuery() with a hand-written SQL query is almost never the right answer: it is used exactly once in this book, for ASOF JOIN, and nowhere else.
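A concrete sketch on a throwaway in-memory database (toy table; levenshtein() is a built-in DuckDB scalar function, chosen here purely as an example of something without a direct dplyr translation):

```r
library(duckdb)
library(dplyr)
library(dbplyr)

con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "people",
             data.frame(name = c("smith", "smyth", "jones")))

# sql() hands DuckDB one literal expression for one column;
# everything around it is still an ordinary dplyr pipe
dists <- tbl(con, "people") |>
  mutate(dist_to_smith = sql("levenshtein(name, 'smith')")) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
dists
```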

A note on table layout

We found in the storage-layout join experiments that sorting MBS by the join key makes point lookups 47× faster but doesn’t speed up broad analytical queries like this one. If you’re looking up one specific person’s claims, sort the table. For population-level analysis, it doesn’t matter.

The one thing that helps every join: using an integer hash key instead of the string AEUID. We measured a 2× speedup.
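One way to materialise such a key, as a sketch rather than the book's exact benchmarked recipe: DuckDB's built-in hash() produces a 64-bit key (collisions are theoretically possible but vanishingly unlikely at 30M distinct IDs), and applying the same hash() to both sides lets the join run on integers instead of strings. Shown here on a toy table with a writable scratch connection:

```r
library(duckdb)
library(DBI)

con_rw <- dbConnect(duckdb::duckdb())   # writable scratch database
dbWriteTable(con_rw, "mbs",
             data.frame(SYNTHETIC_AEUID = c("a1", "a2"), NUMSERV = c(1, 2)))

# Add an integer surrogate key derived from the string identifier
dbExecute(con_rw, "
  CREATE TABLE mbs_hashed AS
  SELECT hash(SYNTHETIC_AEUID) AS aeuid_key, *
  FROM mbs
")
hashed <- dbGetQuery(con_rw, "SELECT aeuid_key, NUMSERV FROM mbs_hashed")
dbDisconnect(con_rw, shutdown = TRUE)
```

Apply the identical hash() expression to the spine table and join on aeuid_key.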