Using PLIDA: larger-than-memory data in R
This site is a work in progress. The code, benchmarks, and recommendations are still being tested and may change. Don’t rely on anything here yet.
What this site is
A worked demonstration of modern R data-engineering tools — arrow, duckdb / dbplyr, data.table, and base dplyr — applied to PLIDA-style administrative microdata that is much larger than the machine’s RAM. All code runs on a 32 GB MacBook against a ~750 GB synthetic PLIDA corpus produced by the fplida package.
Each main chapter describes one piece of the larger-than-memory workflow and shows the recommended way to do it. The evidence behind the recommendations — the timings, memory footprints, and the things we tried that did not work — lives in the benchmark appendices at the back of the book. Read the main chapters for the what and how; dip into the appendices when you want the why.
The short answer
For PLIDA-style data on a machine with 32 GB of RAM:
- Store raw data as parquet. Convert CSV → parquet with DuckDB's `read_csv_auto()` piped into `COPY ... TO` parquet. It is roughly 2× faster than the Arrow equivalent on this corpus and writes slightly smaller files. Keep zstd compression, medium row groups (`ROW_GROUP_SIZE 250000`), and medium-sized output files. See A1.
- Put the parquet into a materialised DuckDB database once the workload becomes repetitive or join-heavy. Set `memory_limit` explicitly (~60% of physical RAM). Views over parquet are fine for one-off exploration. See A1.
- Query the database through dbplyr. Compose dplyr verbs against `tbl(con, "name")` and call `collect()` only on the final small result. Tidyverse all the way — raw SQL is almost never needed. See A2.
- Use filtering joins (`semi_join`, `anti_join`) instead of `filter(x %in% ...)` when the right-hand side is large. Push `filter()` upstream of every join. Expect DuckDB to handle 100M+ row joins that Arrow cannot finish on a 32 GB machine. See A3.
- For window functions on billion-row panel data, chunk by a person-hash bucket. A single `LAG()` over 700M STP rows does not fit; ten independent calls over 70M-row buckets complete in about 20 seconds. See A4.
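The first recommendation — conversion through DuckDB's CSV reader — looks like this in practice. A minimal, self-contained sketch: the tiny CSV and the temp-file paths are stand-ins for your real raw files, and the column names (`ppn`, `numserv`) are illustrative.

```r
library(duckdb)
library(DBI)

# Illustrative stand-ins for a real raw CSV and its parquet destination.
csv_path     <- tempfile(fileext = ".csv")
parquet_path <- tempfile(fileext = ".parquet")
write.csv(data.frame(ppn = 1:5, numserv = c(2L, 1L, 3L, 1L, 2L)),
          csv_path, row.names = FALSE)

# An in-memory DuckDB is enough for the conversion itself.
con <- dbConnect(duckdb::duckdb())
dbExecute(con, sprintf(
  "COPY (SELECT * FROM read_csv_auto('%s'))
   TO '%s' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 250000)",
  csv_path, parquet_path))

# The resulting parquet is readable by DuckDB, arrow, Python, ...
total <- dbGetQuery(con, sprintf(
  "SELECT sum(numserv) AS total FROM read_parquet('%s')", parquet_path))
dbDisconnect(con, shutdown = TRUE)
```

The `COPY (SELECT ...) TO ...` form is what lets the conversion stream: DuckDB never holds the whole CSV in memory.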
The rest of this site walks through each of those recommendations in turn.
A very short introduction to DuckDB
DuckDB is a database that runs inside your R session. No server to start, no login, no network. A DuckDB database is a single file on disk (here, plida_tables.duckdb) that you open from R the same way you would open a connection to any database, and close when you’re done. It is the same general idea as SQLite — if you have used that for small projects, DuckDB will feel familiar — but it is designed for the analysis workload we do every day: big tables, filters and groupbys, joins across tables.
Three things about DuckDB matter for PLIDA work, in plain English.
- It only reads the columns your query touches. When you write `summarise(services = sum(numserv))` against a 370-million-row MBS table, DuckDB reads the `numserv` column from disk and ignores everything else. With `read_csv()` on a tibble, R reads every column of every row whether you use it or not. That's the biggest reason the examples in this book go from "takes 35 seconds and 14 GB of RAM" to "takes under a second and uses almost no RAM" without any cleverness on your part — just the storage format plus the query engine.
- It works when the data is bigger than your RAM. A join or a sort that cannot fit entirely in memory normally kills your R session (you see "cannot allocate vector of size…"). DuckDB notices this is happening and starts temporarily writing pieces of the intermediate work out to a scratch directory on disk. The query finishes, just a bit slower. Without this feature, a three-way join on the 372-million-row MBS table on a 32 GB laptop is impossible.
- It reads parquet files directly. You can either point DuckDB at a parquet file (a zero-copy "view") or copy the parquet into a DuckDB table (slightly faster for repeated queries; see Reading and storing). Either way, parquet stays the canonical on-disk format — other R packages (`arrow`), Python, and so on can all read the same files.
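The view-versus-table choice in the last point can be sketched in a few lines. A self-contained toy example — the table names `itr_view` and `itr_tbl`, the columns, and the temp-file path are all illustrative:

```r
library(duckdb)
library(DBI)

# Write a two-row parquet file to stand in for a real extract.
pq  <- tempfile(fileext = ".parquet")
con <- dbConnect(duckdb::duckdb())
dbExecute(con, sprintf(
  "COPY (SELECT * FROM (VALUES (1, 10), (2, 20)) t(ppn, amt))
   TO '%s' (FORMAT parquet)", pq))

# (a) Zero-copy view: the parquet file stays the source of truth.
dbExecute(con, sprintf(
  "CREATE VIEW itr_view AS SELECT * FROM read_parquet('%s')", pq))

# (b) Materialised table: data copied into the database, faster on repeat queries.
dbExecute(con, sprintf(
  "CREATE TABLE itr_tbl AS SELECT * FROM read_parquet('%s')", pq))

# Both answer queries identically.
from_view <- dbGetQuery(con, "SELECT sum(amt) AS total FROM itr_view")
from_tbl  <- dbGetQuery(con, "SELECT sum(amt) AS total FROM itr_tbl")
```

The only behavioural difference is where the bytes live: drop the parquet file and the view breaks, while the table keeps working.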
The only Rule of DuckDB Club worth memorising is: set the memory limit explicitly (below), and always use `read_only = TRUE` when you are analysing rather than building the database.
From R you use two interfaces to DuckDB together — DBI for the connection and one-off setup statements, and dbplyr for the queries themselves:
```r
library(duckdb)
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(duckdb::duckdb(),
                 dbdir = "plida_tables.duckdb",
                 read_only = TRUE)
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET threads TO 8")

# Write tidyverse R. DuckDB executes it.
tbl(con, "itr") |>
  filter(incm_yr == 2015L) |>
  count(ocptn_grp_cd) |>
  collect()
```

Every query in the main chapters and case studies is written in tidyverse syntax — `filter`, `mutate`, `group_by`, `summarise`, `inner_join`, `semi_join`, `lag`, `window_order`, `window_frame`. dbplyr translates those verbs into SQL and DuckDB executes them; you do not have to write SQL. The only exceptions are a small number of one-off setup steps where there is no tidyverse equivalent — CSV → parquet conversion (`COPY ... TO` parquet), connection pragmas (`SET memory_limit = ...`), and `CREATE TABLE` for the one-off ingest. Everything after that is dplyr.
You can see the SQL that dbplyr generated for any lazy query with show_query(), and the DuckDB plan with explain(). Use these when something is slow; do not write the SQL yourself.
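For example, a minimal self-contained illustration of both inspection tools — here against an in-memory DuckDB with a three-row stand-in for the `itr` table:

```r
library(duckdb)
library(DBI)
library(dplyr)
library(dbplyr)

# Toy stand-in for the real itr table.
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "itr",
             data.frame(incm_yr      = c(2014L, 2015L, 2015L),
                        ocptn_grp_cd = c("1", "2", "2")))

# A lazy query: nothing has run against DuckDB yet.
lazy <- tbl(con, "itr") |>
  filter(incm_yr == 2015L) |>
  count(ocptn_grp_cd)

show_query(lazy)   # prints the SQL dbplyr generated
explain(lazy)      # prints DuckDB's execution plan for that SQL

res <- collect(lazy)   # only now does the query execute
```

`show_query()` answers "what SQL did my dplyr pipeline become?"; `explain()` answers "how will DuckDB run it?" — together they cover most slow-query debugging without writing SQL by hand.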
Two knobs are worth setting on every connection:
- `SET memory_limit = '20GB'` (or roughly 60% of physical RAM). Without this, DuckDB will happily claim most of free memory, collide with R, and the OS will kill one of them.
- `read_only = TRUE` when attaching an already-built database for analysis. Multiple R sessions can attach the same `.duckdb` file read-only simultaneously — useful when you iterate in one session and render a Quarto document in another.
The rest of the site assumes this is the default query engine and treats Arrow, dplyr-on-tibbles, and data.table as alternatives you reach for in specific situations.
How the chapters fit together
The main chapters follow the life of a PLIDA query — from raw file to answer — and each chapter references the appendix that holds the evidence:
- Setup — install `fplida`, generate a synthetic corpus, set paths.
- Reading and storing raw data — CSV → parquet → DuckDB, with a disciplined storage layout. Evidence in A1.
- Data manipulation — the dplyr verbs through `dbplyr` (and Arrow), and the small translation traps that bite once. Evidence in A2.
- Joins — inner, left, semi, anti, full; when a `semi_join` replaces a `%in%` filter; when filters must be pushed before the join. Evidence in A3.
- Window functions — `lag()`, `row_number()`, rolling windows with `window_frame()`, and the chunk-by-hash pattern for billion-row panels. Evidence in A4.
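The chunk-by-hash pattern mentioned in the window-functions chapter can be sketched on a toy panel. This is a miniature of the idea only — the table name `stp`, the columns, and three buckets (rather than ten 70M-row ones) are illustrative; `hash()` is DuckDB's built-in hash function, passed through by dbplyr:

```r
library(duckdb)
library(DBI)
library(dplyr)
library(dbplyr)

# Toy panel: six people observed in two periods each.
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "stp",
             data.frame(ppn    = rep(1:6, each = 2),
                        period = rep(1:2, 6),
                        earn   = 1:12))

# Each person falls in exactly one hash bucket, so the LAG() windows are
# independent and each bucket can be processed (and collected) on its own.
n_buckets <- 3L
chunks <- lapply(seq_len(n_buckets) - 1L, function(b) {
  tbl(con, "stp") |>
    filter(hash(ppn) %% !!n_buckets == !!b) |>  # hash() runs in DuckDB
    group_by(ppn) |>
    window_order(period) |>
    mutate(prev_earn = lag(earn)) |>
    ungroup() |>
    collect()
})
panel <- bind_rows(chunks)
```

Because the window is partitioned by person, splitting people across buckets changes nothing about the answer — it only caps how much of the panel any single query touches.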
Four short case studies apply the recipe end-to-end: earnings, health services, CSV to chart in 20 minutes, and job switching and health shocks.
Reproducibility
All source data lives outside this repository at /Users/willmackey/offline/fplida-data/fplida_30m_csv/ (synthetic, produced by fplida). All intermediate parquet and DuckDB files live at /Users/willmackey/offline/using-plida-data/ so the project folder stays small. The scripts in R/ are numbered in dependency order; R/01-convert-csv-to-parquet.R runs first, then R/01b-convert-supplementary.R, R/02-build-duckdb.R, and finally R/04-run-benchmarks.R.