Case study 1: How much do people earn?

A single-table workflow on individual tax returns

This case study asks the most boring useful question you can ask of the individual tax returns: what does the distribution of taxable income look like, and how has it moved over time? It uses one table — itr_inc, the ATO income sub-table — and a few dplyr verbs. If you’re picking up PLIDA for the first time, start here.

Setup

The Reading and storing chapter covers the one-off setup: CSVs get converted to parquet, and (optionally) the parquet is loaded into a materialised plida_tables.duckdb database. Every case study from here on does the same three things at the top: load packages, point DuckDB at the data, and grab a lazy tbl() handle to the table we actually want. Here that’s itr_inc — the ATO income sub-table with one row per person per income year. rename_with(tolower) pushes the SCREAMING_CASE column names down to snake_case so the rest of the pipe reads cleanly; this is lazy, so no data moves.

Code
library(tidyverse)
library(scales)
library(tictoc)
library(duckdb)
library(DBI)
library(dbplyr)

# Centralised paths for this project — a short file that just defines
# where the DuckDB and parquet folders live. See the Setup chapter.
source("R/00-paths.R")
data_dir <- paths$duckdb_dir # replace with your own path

con <- dbConnect(
  duckdb::duckdb(),
  dbdir     = file.path(data_dir, "plida_tables.duckdb"),
  read_only = TRUE
)
dbExecute(con, "SET memory_limit = '20GB'") # Set to about 90% of your RAM
[1] 0
Code
dbExecute(con, "SET threads TO 8")          # Set to your number of cores minus 1
[1] 0
Code
itr_inc <- tbl(con, "itr_inc") |> rename_with(tolower)
If you haven’t materialised the database yet, the same lazy handle can come straight from the parquet files, via an in-memory DuckDB and read_parquet():

Code
library(tidyverse)
library(scales)
library(tictoc)
library(glue)
library(duckdb)
library(DBI)
library(dbplyr)

source("R/00-paths.R")
pq <- paths$parquet_root # replace with your own parquet folder

con <- dbConnect(duckdb::duckdb())              # in-memory DuckDB, no file
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET threads TO 8")

itr_inc <- tbl(con, sql(glue(
  "SELECT * FROM read_parquet('{pq}/itr_inc/*.parquet')"
))) |>
  rename_with(tolower)

itr_inc is a lazy dbplyr handle — it knows about the table but hasn’t read a row. Computation happens in DuckDB; the result comes back to R only when we call collect().
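To see the laziness concretely, here is a self-contained toy version (toy table and column names, not the real itr_inc schema): show_query() prints the SQL dbplyr has built so far, and nothing crosses into R memory until collect().

```r
library(duckdb)
library(DBI)
library(dplyr)
library(dbplyr)

con_toy <- dbConnect(duckdb::duckdb())  # throwaway in-memory DuckDB
dbWriteTable(con_toy, "toy",
             data.frame(INCM_YR = c(2010, 2011), TI = c(100, 200)))

toy <- tbl(con_toy, "toy") |> rename_with(tolower)  # lazy: no rows read yet

show_query(toy)           # the SQL DuckDB would run, including the renames
result <- collect(toy)    # only now does the query execute; a 2-row tibble
dbDisconnect(con_toy, shutdown = TRUE)
```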

Note: Why a materialised DuckDB table and not views over parquet?

For a single scan we would be fine either way. The reason the default recipe materialises tables is that most real analyses escalate: one filter today, a join tomorrow, a window function next week. Materialised storage pays once and is faster at everything after that. See Reading and storing § Step 3 and A3 T3 for the evidence.

Code
glimpse(itr_inc)
Rows: ??
Columns: 6
Database: DuckDB 1.4.1 [root@Darwin 25.2.0:R 4.5.2//Users/willmackey/offline/using-plida-data/duckdb/plida_tables.duckdb]
$ synthetic_aeuid        <chr> "000000000200", "00000000021D", "0000000002D7",…
$ incm_yr                <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,…
$ grs_pmt_totl_calcd_amt <dbl> 26281.00, 13700.00, 76830.00, 20000.53, 68883.0…
$ bus_incm_totl_amt      <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 13791…
$ intst_amt              <dbl> 477.83, 0.00, 0.00, 0.00, 2688.06, 0.00, 0.00, …
$ ti_or_lss_amt          <dbl> 22859.83, 13652.09, 74658.35, 19050.94, 70922.8…

Four dollar columns (gross payments, business income, interest, and taxable income or loss) plus synthetic_aeuid and incm_yr. About 170 million rows across all financial years. The one to focus on is ti_or_lss_amt, taxable income or loss: the ATO’s headline “how much did this person make, for tax purposes” number.

Note: The ITR is split across four sub-tables

ato-pit_itr/ in the fplida corpus contains four distinct schemas: context (occupation code and residency), income (what we use here), deductions, and withholding/tax. The conversion step splits them into separate parquet subdirs before ingest, so we get four DuckDB tables — itr, itr_inc, itr_ded, itr_whld — that all share the (synthetic_aeuid, incm_yr) key. Any combined question joins them on that key.
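The join itself is one line once the handles exist. A self-contained toy sketch with two tiny stand-in tables (total_ded_amt is an illustrative column name, not the real itr_ded schema):

```r
library(duckdb)
library(DBI)
library(dplyr)

con_toy <- dbConnect(duckdb::duckdb())
dbWriteTable(con_toy, "itr_inc_toy",
             data.frame(synthetic_aeuid = c("A", "B"),
                        incm_yr         = c(2010, 2010),
                        ti_or_lss_amt   = c(50000, 20000)))
dbWriteTable(con_toy, "itr_ded_toy",
             data.frame(synthetic_aeuid = c("A", "B"),
                        incm_yr         = c(2010, 2010),
                        total_ded_amt   = c(1200, 300)))  # illustrative name

combined <- tbl(con_toy, "itr_inc_toy") |>
  inner_join(tbl(con_toy, "itr_ded_toy"),
             by = c("synthetic_aeuid", "incm_yr")) |>  # the shared key
  collect()
dbDisconnect(con_toy, shutdown = TRUE)
```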

Median taxable income by year

The simplest possible earnings question: what is the median taxable income in each financial year?

Code
tic("Median + deciles on 170M rows")
earnings_by_year <- itr_inc |>
  filter(!is.na(ti_or_lss_amt)) |>
  group_by(incm_yr) |>
  summarise(
    n         = n(),
    mean_ti   = mean(ti_or_lss_amt, na.rm = TRUE),
    median_ti = median(ti_or_lss_amt, na.rm = TRUE),
    p10_ti    = quantile_cont(ti_or_lss_amt, 0.10),
    p90_ti    = quantile_cont(ti_or_lss_amt, 0.90),
    .groups = "drop"
  ) |>
  arrange(incm_yr) |>
  collect()
toc()
Median + deciles on 170M rows: 6.142 sec elapsed
Code
earnings_by_year

Everything up to collect() is lazy — the dplyr chain builds a SQL query that DuckDB executes. 170 million input rows collapse to one row per year. That is the pattern: push the heavy work to DuckDB, collect() only when the result fits in memory.
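The ordering matters. A toy contrast on an in-memory DuckDB (table and column names here are illustrative):

```r
library(duckdb)
library(DBI)
library(dplyr)

con_toy <- dbConnect(duckdb::duckdb())
dbWriteTable(con_toy, "t",
             data.frame(yr  = rep(2010:2014, each = 1000),
                        amt = rep(c(0.25, 0.75), 2500)))

# Right way round: DuckDB filters and aggregates; five rows cross into R.
small <- tbl(con_toy, "t") |>
  filter(amt > 0.5) |>
  count(yr) |>
  collect()

# Backwards (avoid): collect() first would pull all 5,000 rows into R and
# filter locally; harmless on a toy, fatal on 170 million rows.
dbDisconnect(con_toy, shutdown = TRUE)
```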

Tip: The golden rule

collect() only when the result is small. If you find yourself wanting to collect() and then filter(), you have it backwards.

Note: quantile_cont() is a DuckDB function

dbplyr translates quantile_cont(x, p) straight to DuckDB’s QUANTILE_CONT(x, p) aggregate, which interpolates linearly between adjacent values — same semantics as base R’s default quantile(). median() also translates, to DuckDB’s MEDIAN(). Rolling percentiles combine window_frame() with quantile_cont() the same way; see Window functions.

Code
earnings_by_year |>
  ggplot(aes(incm_yr)) +
  geom_ribbon(aes(ymin = p10_ti, ymax = p90_ti), alpha = 0.15, fill = "#1b9e77") +
  geom_line(aes(y = median_ti), colour = "#1b9e77", linewidth = 0.9) +
  geom_line(aes(y = mean_ti),   colour = "#d95f02", linewidth = 0.7, linetype = 2) +
  scale_y_continuous(labels = label_dollar(accuracy = 1)) +
  labs(x = "Income year", y = "Taxable income (AUD)",
       title = "Taxable income by year: median and 10–90% range",
       subtitle = "Solid green: median. Dashed orange: mean. Band: P10–P90.") +
  theme_minimal(base_size = 11)

Two things jump out even on synthetic data. The mean sits well above the median — the income distribution has a long upper tail, as it does in real ATO data. And the P10–P90 band is wide: the tenth and ninetieth percentiles differ by an order of magnitude.

The income distribution, bucketed

Medians and deciles are a nice summary, but for “how much do people earn?” it is usually clearer to count people in income brackets. Six brackets, one groupby.

Code
tic("Income brackets by year on 170M rows")
brackets <- itr_inc |>
  filter(!is.na(ti_or_lss_amt)) |>
  mutate(
    bracket = case_when(
      ti_or_lss_amt <        0 ~ "loss",
      ti_or_lss_amt <    18200 ~ "<$18.2k",
      ti_or_lss_amt <    45000 ~ "$18.2–45k",
      ti_or_lss_amt <   120000 ~ "$45–120k",
      ti_or_lss_amt <   180000 ~ "$120–180k",
      TRUE                     ~ "$180k+"
    )
  ) |>
  count(incm_yr, bracket) |>
  collect()
toc()
Income brackets by year on 170M rows: 0.561 sec elapsed
Code
head(brackets, 8)

case_when() translates cleanly into a SQL CASE WHEN expression; the count() adds a GROUP BY and a COUNT(*). Same pattern as above, same shape of result.
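You can inspect that translation without a database at all, using dbplyr’s simulated generic backend (the real query goes through DuckDB’s translator, but the CASE WHEN shape is the same):

```r
library(dbplyr)

# Translate a case_when() expression to SQL against a simulated connection.
translate_sql(
  case_when(ti_or_lss_amt < 0     ~ "loss",
            ti_or_lss_amt < 18200 ~ "<$18.2k",
            TRUE                  ~ "other"),
  con = simulate_dbi()
)
# renders as a CASE WHEN ... THEN ... ELSE ... END expression
```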

Code
bracket_levels <- c("loss", "<$18.2k", "$18.2–45k",
                    "$45–120k", "$120–180k", "$180k+")

brackets |>
  mutate(bracket = factor(bracket, levels = bracket_levels)) |>
  ggplot(aes(incm_yr, n, fill = bracket)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = label_percent()) +
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +
  labs(x = "Income year", y = "Share of returns", fill = "Taxable income",
       title = "Share of individual tax returns by income bracket",
       subtitle = "fplida synthetic data, 30M person corpus") +
  theme_minimal(base_size = 11)

The brackets track the tax-free threshold ($18,200) and the old marginal rate cut-ins ($45k, $120k, $180k). Reading the chart left to right tells you whether the distribution is compressing or spreading across years. On synthetic data the picture is stationary by construction; on real ATO data this is the first chart you draw.

What just happened

You aggregated 170 million rows twice and plotted both results — without loading the data into R memory. The same operation with read_parquet() and dplyr-on-tibbles takes about 35 seconds and 14.5 GB of RAM (see A2). On a smaller machine it would crash.

Important: The comparison that matters

The same groupby on 170M rows: dplyr-on-tibbles takes 35 seconds and 14.5 GB of RAM. dbplyr over DuckDB takes under a second and uses negligible memory. See the full benchmark results.

The workflow is always:

  1. Open a connection, grab a lazy tbl().
  2. Build the query with dplyr verbs.
  3. collect() the (small) result into R.
  4. ggplot2 from there.

What’s next

The obvious follow-up — “how does taxable income vary by occupation?” — is a join between itr_inc and itr on (synthetic_aeuid, incm_yr). Both tables share the ATO key, so no spine is involved. That is a good next exercise after reading the Joins chapter.
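A hedged sketch of that join, reusing the con and itr_inc handles from the setup above; ocptn_cd is a placeholder for whatever the occupation column in itr is actually called (check with glimpse() first):

```r
# Context sub-table: occupation code and residency live here.
itr_ctx <- tbl(con, "itr") |> rename_with(tolower)

median_ti_by_occupation <- itr_inc |>
  inner_join(itr_ctx, by = c("synthetic_aeuid", "incm_yr")) |>
  group_by(incm_yr, ocptn_cd) |>        # placeholder column name
  summarise(median_ti = median(ti_or_lss_amt, na.rm = TRUE),
            .groups = "drop") |>
  collect()                             # one row per year x occupation
```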

Case study 2 is the first genuinely cross-agency join: linking demographics to MBS claims via the PLIDA spine.

When you might reach for something else

The default recipe above works for every single-table PLIDA question this project has thrown at it. Two alternatives are worth knowing about for specific situations:

  • arrow::open_dataset() is useful if you have parquet files on disk but haven’t stood up a DuckDB database yet — for example, a one-off look at someone else’s parquet extract. On scans and groupbys Arrow is indistinguishable from DuckDB (A2); it just gives up on large joins and window functions (A3, A4).
  • DuckDB views over parquet give you the same dbplyr interface as this case study without materialising a table. Good for ad-hoc exploration of a parquet directory you don’t want to ingest. See Reading and storing § Step 3.
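The view variant is a few lines. A self-contained sketch (a toy parquet file is written to a temp directory so the view has something to scan; paths and names are illustrative):

```r
library(duckdb)
library(DBI)
library(dplyr)

con_v <- dbConnect(duckdb::duckdb())

# Toy stand-in: write one small parquet file to scan.
pq_dir <- file.path(tempdir(), "itr_inc_toy")
dir.create(pq_dir, showWarnings = FALSE)
dbWriteTable(con_v, "tmp", data.frame(INCM_YR = 2010, TI = 100))
dbExecute(con_v, sprintf(
  "COPY tmp TO '%s/part-0.parquet' (FORMAT PARQUET)", pq_dir))

# The view: same dbplyr interface as a table, nothing materialised.
dbExecute(con_v, sprintf(
  "CREATE VIEW itr_inc_v AS SELECT * FROM read_parquet('%s/*.parquet')",
  pq_dir))
res <- tbl(con_v, "itr_inc_v") |> rename_with(tolower) |> collect()
dbDisconnect(con_v, shutdown = TRUE)
```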

For anything you will run more than a couple of times, materialised tables plus dbplyr is the recipe the rest of this site uses.