---
title: "Case study 1: How much do people earn?"
subtitle: "A single-table workflow on individual tax returns"
---
This case study asks the most boring useful question you can ask of the individual tax returns: **what does the distribution of taxable income look like, and how has it moved over time?** It uses one table — `itr_inc`, the ATO income sub-table — and a few dplyr verbs. If you're picking up PLIDA for the first time, start here.
## Setup
The [Reading and storing](01-reading-storing.qmd) chapter covers the one-off setup: CSVs get converted to parquet, and (optionally) the parquet is loaded into a materialised `plida_tables.duckdb` database. Every case study from here on does the same three things at the top: load packages, point DuckDB at the data, and grab a lazy `tbl()` handle to the table we actually want. Here that's `itr_inc` — the ATO income sub-table with one row per person per income year. `rename_with(tolower)` pushes the `SCREAMING_CASE` column names down to `snake_case` so the rest of the pipe reads cleanly; this is lazy, so no data moves.
::: {.panel-tabset}
## From a DuckDB database (default)
```{r setup}
library(tidyverse)
library(scales)
library(tictoc)
library(duckdb)
library(DBI)
library(dbplyr)
# Centralised paths for this project — a short file that just defines
# where the DuckDB and parquet folders live. See the Setup chapter.
source("R/00-paths.R")
data_dir <- paths$duckdb_dir # replace with your own path
con <- dbConnect(
duckdb::duckdb(),
dbdir = file.path(data_dir, "plida_tables.duckdb"),
read_only = TRUE
)
dbExecute(con, "SET memory_limit = '20GB'") # set to about 90% of your RAM
dbExecute(con, "SET threads TO 8") # set to your number of cores minus 1
itr_inc <- tbl(con, "itr_inc") |> rename_with(tolower)
```
## From parquet, no database file
```{r setup-parquet}
#| eval: false
library(tidyverse)
library(scales)
library(tictoc)
library(glue)
library(duckdb)
library(DBI)
library(dbplyr)
source("R/00-paths.R")
pq <- paths$parquet_root # replace with your own parquet folder
con <- dbConnect(duckdb::duckdb()) # in-memory DuckDB, no file
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET threads TO 8")
itr_inc <- tbl(con, sql(glue(
"SELECT * FROM read_parquet('{pq}/itr_inc/*.parquet')"
))) |>
rename_with(tolower)
```
:::
`itr_inc` is a lazy `dbplyr` handle: it knows about the table but hasn't read a row. Computation happens in DuckDB; the result comes back to R only when we call `collect()`.
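A quick way to convince yourself that nothing has run yet is `show_query()`, which prints the SQL dbplyr has built so far without executing it:

```{r show-query}
itr_inc |>
  filter(!is.na(ti_or_lss_amt)) |>
  show_query()
```

The printed `SELECT ... WHERE` is what DuckDB will eventually run; until `collect()`, every verb just grows this query.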
::: {.callout-note}
## Why a materialised DuckDB table and not views over parquet?
For a single scan we would be fine either way. The reason the default recipe materialises tables is that most real analyses escalate: one filter today, a join tomorrow, a window function next week. Materialised storage pays once and is faster at everything after that. See [Reading and storing § Step 3](01-reading-storing.qmd#step-3-load-the-parquet-into-duckdb) and [A3 T3](A3-join-benchmarks.qmd#t3-three-table-join) for the evidence.
:::
```{r glimpse}
glimpse(itr_inc)
```
Four dollar columns (gross payments, business income, interest, and taxable income or loss) plus `synthetic_aeuid` and `incm_yr`. About 170 million rows across all financial years. The one to focus on is `ti_or_lss_amt` — **taxable income or loss**, the ATO's headline "how much did this person make, for tax purposes" number.
::: {.callout-note}
## The ITR is split across four sub-tables
`ato-pit_itr/` in the fplida corpus contains four distinct schemas: context (occupation code and residency), **income** (what we use here), deductions, and withholding/tax. The conversion step splits them into separate parquet subdirs before ingest, so we get four DuckDB tables — `itr`, `itr_inc`, `itr_ded`, `itr_whld` — that all share the `(synthetic_aeuid, incm_yr)` key. Any combined question joins them on that key.
:::
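As a sketch of that pattern (the [Joins](03-joins.qmd) chapter covers it properly), pairing income with deductions is one lazy `left_join` on the shared key. Not evaluated here; column names beyond the key are whatever the sub-table carries:

```{r join-sketch}
#| eval: false
itr_ded <- tbl(con, "itr_ded") |> rename_with(tolower)

# Still lazy: dbplyr turns this into a single SQL JOIN executed by DuckDB.
itr_inc |>
  left_join(itr_ded, by = c("synthetic_aeuid", "incm_yr"))
```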
## Median taxable income by year
The simplest possible earnings question: what is the median taxable income in each financial year?
```{r median-by-year}
tic("Median + deciles on 170M rows")
earnings_by_year <- itr_inc |>
filter(!is.na(ti_or_lss_amt)) |>
group_by(incm_yr) |>
summarise(
n = n(),
mean_ti = mean(ti_or_lss_amt, na.rm = TRUE),
median_ti = median(ti_or_lss_amt, na.rm = TRUE),
p10_ti = quantile_cont(ti_or_lss_amt, 0.10),
p90_ti = quantile_cont(ti_or_lss_amt, 0.90),
.groups = "drop"
) |>
arrange(incm_yr) |>
collect()
toc()
earnings_by_year
```
Everything up to `collect()` is lazy — the dplyr chain builds a SQL query that DuckDB executes. 170 million input rows collapse to one row per year. That is the pattern: push the heavy work to DuckDB, `collect()` only when the result fits in memory.
::: {.callout-tip}
## The golden rule
`collect()` only when the result is small. If you find yourself wanting to `collect()` and _then_ `filter()`, you have it backwards.
:::
::: {.callout-note}
## `quantile_cont()` is a DuckDB function
dbplyr translates `quantile_cont(x, p)` straight to DuckDB's `QUANTILE_CONT(x, p)` aggregate, which interpolates linearly between adjacent values — same semantics as base R's default `quantile()`. `median()` also translates, to DuckDB's `MEDIAN()`. Rolling percentiles combine `window_frame()` with `quantile_cont()` the same way; see [Window functions](04-window-functions.qmd).
:::
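If you want to verify the interpolation claim, a throwaway in-memory DuckDB makes the comparison with base R cheap. This is a hypothetical sanity check, separate from the analysis connection:

```{r quantile-check}
#| eval: false
x <- c(10, 20, 30, 40)
con2 <- dbConnect(duckdb::duckdb())
dbWriteTable(con2, "tst", data.frame(x = x))

tbl(con2, "tst") |> summarise(q10 = quantile_cont(x, 0.10)) |> collect()
quantile(x, 0.10) # base R default (type 7): both give 13

dbDisconnect(con2, shutdown = TRUE)
```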
```{r plot-median}
#| fig-width: 8
#| fig-height: 4.5
earnings_by_year |>
ggplot(aes(incm_yr)) +
geom_ribbon(aes(ymin = p10_ti, ymax = p90_ti), alpha = 0.15, fill = "#1b9e77") +
geom_line(aes(y = median_ti), colour = "#1b9e77", linewidth = 0.9) +
geom_line(aes(y = mean_ti), colour = "#d95f02", linewidth = 0.7, linetype = 2) +
scale_y_continuous(labels = label_dollar(accuracy = 1)) +
labs(x = "Income year", y = "Taxable income (AUD)",
title = "Taxable income by year: median and 10–90% range",
subtitle = "Solid green: median. Dashed orange: mean. Band: P10–P90.") +
theme_minimal(base_size = 11)
```
Two things jump out even on synthetic data. The mean sits well above the median — the income distribution has a long upper tail, as it does in real ATO data. And the P10–P90 band is wide: the tenth and ninetieth percentiles differ by an order of magnitude.
## The income distribution, bucketed
Medians and deciles are a nice summary, but for "how much do people earn?" it is usually clearer to count people in income brackets. Six brackets, one groupby.
```{r brackets}
tic("Income brackets by year on 170M rows")
brackets <- itr_inc |>
filter(!is.na(ti_or_lss_amt)) |>
mutate(
bracket = case_when(
ti_or_lss_amt < 0 ~ "loss",
ti_or_lss_amt < 18200 ~ "<$18.2k",
ti_or_lss_amt < 45000 ~ "$18.2–45k",
ti_or_lss_amt < 120000 ~ "$45–120k",
ti_or_lss_amt < 180000 ~ "$120–180k",
TRUE ~ "$180k+"
)
) |>
count(incm_yr, bracket) |>
collect()
toc()
head(brackets, 8)
```
`case_when()` translates cleanly into a SQL `CASE WHEN` expression; the `count()` adds a `GROUP BY` and a `COUNT(*)`. Same pattern as above, same shape of result.
```{r plot-brackets}
#| fig-width: 8
#| fig-height: 5
bracket_levels <- c("loss", "<$18.2k", "$18.2–45k",
"$45–120k", "$120–180k", "$180k+")
brackets |>
mutate(bracket = factor(bracket, levels = bracket_levels)) |>
ggplot(aes(incm_yr, n, fill = bracket)) +
geom_col(position = "fill") +
scale_y_continuous(labels = label_percent()) +
scale_fill_brewer(palette = "RdYlGn", direction = -1) +
labs(x = "Income year", y = "Share of returns", fill = "Taxable income",
title = "Share of individual tax returns by income bracket",
subtitle = "fplida synthetic data, 30M person corpus") +
theme_minimal(base_size = 11)
```
The brackets track the tax-free threshold (\$18,200) and the old marginal rate cut-ins (\$45k, \$120k, \$180k). Reading the chart left to right tells you whether the distribution is compressing or spreading across years. On synthetic data the picture is stationary by construction; on real ATO data this is the first chart you draw.
## What just happened
You aggregated 170 million rows twice and plotted both results — without loading the data into R memory. The same operation with `read_parquet()` and dplyr-on-tibbles takes about 35 seconds and 14.5 GB of RAM (see [A2](A2-data-manipulation-benchmarks.qmd)). On a smaller machine it would crash.
::: {.callout-important}
## The comparison that matters
The same groupby on 170M rows: **dplyr-on-tibbles takes 35 seconds and 14.5 GB of RAM. dbplyr over DuckDB takes under a second and uses negligible memory.** See the full [benchmark results](A2-data-manipulation-benchmarks.qmd).
:::
The workflow is always:
1. Open a connection, grab a lazy `tbl()`.
2. Build the query with dplyr verbs.
3. `collect()` the (small) result into R.
4. ggplot2 from there.
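In skeleton form (placeholders, not runnable code), that is:

```{r workflow-skeleton}
#| eval: false
con <- dbConnect(duckdb::duckdb(), dbdir = "...", read_only = TRUE)  # 1. connect
tbl(con, "itr_inc") |>                                               #    lazy handle
  filter(...) |>                                                     # 2. build the
  group_by(...) |>                                                   #    query with
  summarise(...) |>                                                  #    dplyr verbs
  collect() |>                                                       # 3. small result
  ggplot(aes(...)) + geom_line()                                     # 4. plot it
```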
## What's next
The obvious follow-up — "how does taxable income vary _by occupation_?" — is a join between `itr_inc` and `itr` on `(synthetic_aeuid, incm_yr)`. Both tables share the ATO key, so no spine is involved. That is a good next exercise after reading the [Joins](03-joins.qmd) chapter.
[Case study 2](cs02-health-services.qmd) is the first genuinely cross-agency join: linking demographics to MBS claims via the PLIDA spine.
## When you might reach for something else
The default recipe above works for every single-table PLIDA question this project has thrown at it. Two alternatives are worth knowing about for specific situations:
- **`arrow::open_dataset()`** is useful if you have parquet files on disk but haven't stood up a DuckDB database yet — for example, a one-off look at someone else's parquet extract. On scans and groupbys Arrow is indistinguishable from DuckDB ([A2](A2-data-manipulation-benchmarks.qmd)); it just gives up on large joins and window functions ([A3](A3-join-benchmarks.qmd), [A4](A4-window-function-benchmarks.qmd)).
- **DuckDB views over parquet** give you the same dbplyr interface as this case study without materialising a table. Good for ad-hoc exploration of a parquet directory you don't want to ingest. See [Reading and storing § Step 3](01-reading-storing.qmd#step-3-load-the-parquet-into-duckdb).
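Both alternatives are a line or two on top of the same paths. A sketch, assuming the in-memory (writable) connection and `pq` from the parquet tab above:

```{r alt-sketches}
#| eval: false
# Arrow: point a dataset at the parquet directory, then use dplyr verbs on `ds`.
ds <- arrow::open_dataset(file.path(pq, "itr_inc"))

# DuckDB view: same dbplyr interface as this case study, nothing materialised.
dbExecute(con, glue(
  "CREATE OR REPLACE VIEW itr_inc AS
   SELECT * FROM read_parquet('{pq}/itr_inc/*.parquet')"
))
itr_inc <- tbl(con, "itr_inc") |> rename_with(tolower)
```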
For anything you will run more than a couple of times, materialised tables plus `dbplyr` is the recipe the rest of this site uses.
```{r disconnect}
#| include: false
dbDisconnect(con, shutdown = TRUE)
```