Using PLIDA: larger-than-memory data in R
This site is a work in progress. The code, benchmarks, and recommendations are still being tested and may change. Don’t rely on anything here yet.
What this site is
A worked demonstration of modern R data-engineering tools — arrow, duckdb / dbplyr, data.table, and base dplyr — applied to PLIDA-style administrative microdata that is much larger than the machine’s RAM. All code runs on a 32 GB MacBook against a ~750 GB synthetic PLIDA corpus produced by the fplida package.
Each main chapter describes one piece of the larger-than-memory workflow and shows the recommended way to do it. The evidence behind the recommendations — the timings, memory footprints, and the things we tried that did not work — lives in the benchmark appendices at the back of the book. Read the main chapters for the what and how; dip into the appendices when you want the why.
The short answer
For PLIDA-style data on a machine with 32 GB of RAM:
- Store raw data as parquet. Convert CSV → parquet with DuckDB's `read_csv_auto()` piped into `COPY ... TO` parquet. It is roughly 2× faster than the Arrow equivalent on this corpus and writes slightly smaller files. Keep zstd compression, medium row groups (`ROW_GROUP_SIZE 250000`), and medium-sized output files. See A1.
- Put the parquet into a materialised DuckDB database once the workload becomes repetitive or join-heavy. Set `memory_limit` explicitly (~60% of physical RAM). Views over parquet are fine for one-off exploration. See A1.
- Query the database through dbplyr. Compose dplyr verbs against `tbl(con, "name")` and call `collect()` only on the final small result. Tidyverse all the way — raw SQL is almost never needed. See A2.
- Use filtering joins (`semi_join`, `anti_join`) instead of `filter(x %in% ...)` when the right-hand side is large. Push `filter()` upstream of every join. Expect DuckDB to handle 100M+ row joins that Arrow cannot finish on a 32 GB machine. See A3.
- For window functions on billion-row panel data, chunk by a person-hash bucket. A single `LAG()` over 700M STP rows does not fit; ten independent calls over 70M-row buckets complete in about 20 seconds. See A4.
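The first recommendation — conversion through DuckDB's CSV reader — looks like this in practice. A minimal, self-contained sketch: the tiny CSV and the temp-file paths are stand-ins for your real raw files, and the column names (`ppn`, `numserv`) are illustrative.

```r
library(duckdb)
library(DBI)

# Illustrative stand-ins for a real raw CSV and its parquet destination.
csv_path     <- tempfile(fileext = ".csv")
parquet_path <- tempfile(fileext = ".parquet")
write.csv(data.frame(ppn = 1:5, numserv = c(2L, 1L, 3L, 1L, 2L)),
          csv_path, row.names = FALSE)

# An in-memory DuckDB is enough for the conversion itself.
con <- dbConnect(duckdb::duckdb())
dbExecute(con, sprintf(
  "COPY (SELECT * FROM read_csv_auto('%s'))
   TO '%s' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 250000)",
  csv_path, parquet_path))

# The resulting parquet is readable by DuckDB, arrow, Python, ...
total <- dbGetQuery(con, sprintf(
  "SELECT sum(numserv) AS total FROM read_parquet('%s')", parquet_path))
dbDisconnect(con, shutdown = TRUE)
```

The `COPY (SELECT ...) TO ...` form is what lets the conversion stream: DuckDB never holds the whole CSV in memory.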
The rest of this site walks through each of those recommendations in turn.
A very short introduction to DuckDB
DuckDB is a database that runs inside your R session. No server to start, no login, no network. A DuckDB database is a single file on disk (here, plida_tables.duckdb) that you open from R the same way you would open a connection to any database, and close when you’re done. It is the same general idea as SQLite — if you have used that for small projects, DuckDB will feel familiar — but it is designed for the analysis workload we do every day: big tables, filters and groupbys, joins across tables.
Three things about DuckDB matter for PLIDA work, in plain English.
- It only reads the columns your query touches. When you write `summarise(services = sum(numserv))` against a 370-million-row MBS table, DuckDB reads the `numserv` column from disk and ignores everything else. With `read_csv()` on a tibble, R reads every column of every row whether you use it or not. That's the biggest reason the examples in this book go from "takes 35 seconds and 14 GB of RAM" to "takes under a second and uses almost no RAM" without any cleverness on your part — just the storage format plus the query engine.
- It works when the data is bigger than your RAM. A join or a sort that cannot fit entirely in memory normally kills your R session (you see "cannot allocate vector of size…"). DuckDB notices this is happening and starts temporarily writing pieces of the intermediate work out to a scratch directory on disk. The query finishes, just a bit slower. Without this feature, a three-way join on the 372-million-row MBS table on a 32 GB laptop is impossible.
- It reads parquet files directly. You can either point DuckDB at a parquet file (a zero-copy "view") or copy the parquet into a DuckDB table (slightly faster for repeated queries; see Reading and storing). Either way, parquet stays the canonical on-disk format — other R packages (`arrow`), Python, and so on can all read the same files.
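The view-versus-table choice in the last point can be sketched in a few lines. A self-contained toy example — the table names `itr_view` and `itr_tbl`, the columns, and the temp-file path are all illustrative:

```r
library(duckdb)
library(DBI)

# Write a two-row parquet file to stand in for a real extract.
pq  <- tempfile(fileext = ".parquet")
con <- dbConnect(duckdb::duckdb())
dbExecute(con, sprintf(
  "COPY (SELECT * FROM (VALUES (1, 10), (2, 20)) t(ppn, amt))
   TO '%s' (FORMAT parquet)", pq))

# (a) Zero-copy view: the parquet file stays the source of truth.
dbExecute(con, sprintf(
  "CREATE VIEW itr_view AS SELECT * FROM read_parquet('%s')", pq))

# (b) Materialised table: data copied into the database, faster on repeat queries.
dbExecute(con, sprintf(
  "CREATE TABLE itr_tbl AS SELECT * FROM read_parquet('%s')", pq))

# Both answer queries identically.
from_view <- dbGetQuery(con, "SELECT sum(amt) AS total FROM itr_view")
from_tbl  <- dbGetQuery(con, "SELECT sum(amt) AS total FROM itr_tbl")
```

The only behavioural difference is where the bytes live: drop the parquet file and the view breaks, while the table keeps working.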
The only Rule of DuckDB Club worth memorising is: set the memory limit explicitly (below), and always use `read_only = TRUE` when you are analysing rather than building the database.
From R you use two interfaces to DuckDB together — DBI for the connection and one-off setup statements, and dbplyr for the queries themselves:
```r
library(duckdb)
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(duckdb::duckdb(),
                 dbdir = "plida_tables.duckdb",
                 read_only = TRUE)
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET threads TO 8")

# Write tidyverse R. DuckDB executes it.
tbl(con, "itr") |>
  filter(incm_yr == 2015L) |>
  count(ocptn_grp_cd) |>
  collect()
```

Every query in the main chapters and case studies is written in tidyverse syntax — `filter`, `mutate`, `group_by`, `summarise`, `inner_join`, `semi_join`, `lag`, `window_order`, `window_frame`. dbplyr translates those verbs into SQL and DuckDB executes them; you do not have to write SQL. The only exceptions are a small number of one-off setup steps where there is no tidyverse equivalent — CSV → parquet conversion (`COPY ... TO` parquet), connection pragmas (`SET memory_limit = ...`), and `CREATE TABLE` for the one-off ingest. Everything after that is dplyr.
You can see the SQL that dbplyr generated for any lazy query with show_query(), and the DuckDB plan with explain(). Use these when something is slow; do not write the SQL yourself.
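For example, a minimal self-contained illustration of both inspection tools — here against an in-memory DuckDB with a three-row stand-in for the `itr` table:

```r
library(duckdb)
library(DBI)
library(dplyr)
library(dbplyr)

# Toy stand-in for the real itr table.
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "itr",
             data.frame(incm_yr      = c(2014L, 2015L, 2015L),
                        ocptn_grp_cd = c("1", "2", "2")))

# A lazy query: nothing has run against DuckDB yet.
lazy <- tbl(con, "itr") |>
  filter(incm_yr == 2015L) |>
  count(ocptn_grp_cd)

show_query(lazy)   # prints the SQL dbplyr generated
explain(lazy)      # prints DuckDB's execution plan for that SQL

res <- collect(lazy)   # only now does the query execute
```

`show_query()` answers "what SQL did my dplyr pipeline become?"; `explain()` answers "how will DuckDB run it?" — together they cover most slow-query debugging without writing SQL by hand.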
Two knobs are worth setting on every connection:
- `SET memory_limit = '20GB'` (or roughly 60% of physical RAM). Without this, DuckDB will happily claim most of free memory, collide with R, and the OS will kill one of them.
- `read_only = TRUE` when attaching an already-built database for analysis. Multiple R sessions can attach the same `.duckdb` file read-only simultaneously — useful when you iterate in one session and render a Quarto document in another.
The rest of the site assumes this is the default query engine and treats Arrow, dplyr-on-tibbles, and data.table as alternatives you reach for in specific situations.
How the chapters fit together
The main chapters follow the life of a PLIDA query — from raw file to answer — and each chapter references the appendix that holds the evidence:
- Setup — install `fplida`, generate a synthetic corpus, set paths.
- Reading and storing raw data — CSV → parquet → DuckDB, with a disciplined storage layout. Evidence in A1.
- Data manipulation — the dplyr verbs through `dbplyr` (and Arrow), and the small translation traps that bite once. Evidence in A2.
- Joins — inner, left, semi, anti, full; when a `semi_join` replaces a `%in%` filter; when filters must be pushed before the join. Evidence in A3.
- Window functions — `lag()`, `row_number()`, rolling windows with `window_frame()`, and the chunk-by-hash pattern for billion-row panels. Evidence in A4.
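The chunk-by-hash pattern mentioned in the window-functions chapter can be sketched on a toy panel. This is a miniature of the idea only — the table name `stp`, the columns, and three buckets (rather than ten 70M-row ones) are illustrative; `hash()` is DuckDB's built-in hash function, passed through by dbplyr:

```r
library(duckdb)
library(DBI)
library(dplyr)
library(dbplyr)

# Toy panel: six people observed in two periods each.
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "stp",
             data.frame(ppn    = rep(1:6, each = 2),
                        period = rep(1:2, 6),
                        earn   = 1:12))

# Each person falls in exactly one hash bucket, so the LAG() windows are
# independent and each bucket can be processed (and collected) on its own.
n_buckets <- 3L
chunks <- lapply(seq_len(n_buckets) - 1L, function(b) {
  tbl(con, "stp") |>
    filter(hash(ppn) %% !!n_buckets == !!b) |>  # hash() runs in DuckDB
    group_by(ppn) |>
    window_order(period) |>
    mutate(prev_earn = lag(earn)) |>
    ungroup() |>
    collect()
})
panel <- bind_rows(chunks)
```

Because the window is partitioned by person, splitting people across buckets changes nothing about the answer — it only caps how much of the panel any single query touches.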
Four short case studies apply the recipe end-to-end: earnings, health services, CSV to chart in 20 minutes, and job switching and health shocks.
Reproducibility
All source data lives outside this repository at /Users/willmackey/offline/fplida-data/fplida_30m_csv/ (synthetic, produced by fplida). All intermediate parquet and DuckDB files live at /Users/willmackey/offline/using-plida-data/ so the project folder stays small. The scripts in R/ are numbered in dependency order; R/01-convert-csv-to-parquet.R runs first, then R/01b-convert-supplementary.R, R/02-build-duckdb.R, and finally R/04-run-benchmarks.R.