# Setup
Before running anything on this site, you need two things: the synthetic data, and a place to put intermediate files. This page walks through both.
## Step 1: Install fplida
fplida is an R package with a Rust backend that generates synthetic data mirroring PLIDA’s structure. It produces realistic, cross-dataset-consistent microdata for 34 administrative and survey datasets — everything from Census and tax returns to MBS claims and STP payroll.
```bash
# Requires R 4.5+ and Rust 1.68+ (via rustup)
export PATH="$HOME/.cargo/bin:$PATH"

# Clone and install
git clone https://github.com/wfmackey/fplida.git
R CMD INSTALL fplida/
```

## Step 2: Generate the data
The size you generate depends on what you want to test. The case studies on this site use 30 million people (~750 GB of CSV), but you can start much smaller:
| People | Approx CSV size | Approx parquet size | Good for |
|---|---|---|---|
| 100K | ~2.5 GB | ~500 MB | Quick testing, learning the API |
| 1M | ~25 GB | ~5 GB | Realistic single-table workflows |
| 10M | ~250 GB | ~50 GB | Join and window function testing |
| 30M | ~750 GB | ~130 GB | Full PLIDA-scale stress testing |
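The CSV column implies roughly 25 KB of CSV per person (2.5 GB for 100K people), so you can estimate other scales with simple arithmetic. A sketch using that implied per-person figure, not anything fplida itself reports:

```bash
# ~25 KB of CSV per person, implied by the size table (not an fplida figure)
people=1000000
kb_per_person=25
echo "$(( people * kb_per_person / 1000 / 1000 )) GB"   # prints "25 GB"
```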
The total disk size understates the challenge — it’s the dataset-level row counts that matter. PLIDA datasets are not one-row-per-person; MBS has ~12 claims per person per year, STP has ~52 payroll records per employed person per year, and the ITR has multiple sub-tables per person per year. Even at 1 million people, joins across these tables involve tens of millions of rows:
| Dataset | 100K people | 1M people | 30M people |
|---|---|---|---|
| Census (person-level) | 100K | 1M | 30M |
| ITR context (person × year × sub-table) | ~570K | ~5.7M | ~170M |
| MBS claims (person × service × year) | ~1.2M | ~12M | ~372M |
| STP payroll (person × employer × week) | ~11.6M | ~116M | ~3.5B |
| Travellers (person × quarter) | ~370K | ~3.7M | ~110M |
At 1 million people, joining MBS to demographics already involves 12 million claims — enough that naive in-memory approaches start to struggle and the tools on this site start to matter.
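Those row counts are just the per-person multipliers at work; a back-of-envelope sketch using the ~12 claims per person per year quoted above:

```bash
# MBS rows involved in a one-year join at 1M people (multiplier from the text)
people=1000000
claims_per_person_per_year=12
echo $(( people * claims_per_person_per_year ))   # prints 12000000
```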
fplida can output parquet directly, and parquet is faster to work with (as we show in the Reading and storing chapter). But most data in PLIDA is delivered as CSV. If you’re learning these tools to prepare for working with real PLIDA data, generating CSV and converting it yourself is the more realistic exercise.
Either way, the call is the same; only `export_format` changes:
```r
library(fplida)

# Generate at your chosen scale — parquet is faster and smaller
result <- build_fplida(
  n = 1000000,
  export_format = "parquet",
  output_dir = "/path/to/your/fplida-data"
)

# Or CSV, if you want to follow the CSV-to-parquet conversion steps on this site
result <- build_fplida(
  n = 1000000,
  export_format = "csv",
  output_dir = "/path/to/your/fplida-data"
)
```

The output is a directory tree with one subfolder per dataset (e.g. `ato-pit_itr/`, `dhda-mbs/`, `abs-core/`), each containing data files and a spine file that maps `spine_id` to that agency's person identifier.
If you just want to follow along with the case studies, 1 million people is enough to see the patterns. The timings and memory numbers on this site are from a 30M run — yours will be proportionally smaller, but the code is the same.
## Step 3: Set your paths
Everything on this site reads data from two locations:
- `fplida_path`: where fplida wrote its output (the directory containing `ato-pit_itr/`, `dhda-mbs/`, etc.)
- `work_path`: where this project stores intermediate files (parquet conversions, DuckDB databases). This should be on a drive with plenty of space — at 30M people, expect ~200 GB of intermediates.
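Before a large run, it is worth checking free space on the drive that will hold `work_path`. A minimal check, where `$HOME` is only a stand-in for your actual directory:

```bash
# Free space on the filesystem that will hold work_path intermediates.
# Substitute your intended work_path directory for $HOME.
df -h "$HOME"
```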
Open `R/_common.R` and edit the two paths:
```r
# R/_common.R — edit these to match your setup
fplida_path <- "/path/to/your/fplida-data"
work_path <- "/path/to/your/working-directory"
```

Every script and `.qmd` file on this site sources `R/00-paths.R`, which reads `R/_common.R` and builds the full path structure from those two variables. You set them once; everything else follows.
`R/_common.R` is in `.gitignore` so your local paths don’t get committed.
If you prefer, add these to your `.Renviron` instead of editing `R/_common.R`:

```
FPLIDA_PATH=/path/to/your/fplida-data
USING_PLIDA_WORK_PATH=/path/to/your/working-directory
```
`R/00-paths.R` checks for these as a fallback.
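If you export these in your shell profile as well (R inherits exported environment variables), you can confirm they are visible before starting R; a small bash sketch:

```bash
# Print each fallback variable, or <unset> if it is not in the environment.
for v in FPLIDA_PATH USING_PLIDA_WORK_PATH; do
  printf '%s=%s\n' "$v" "${!v:-<unset>}"
done
```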
## Step 4: Install R packages
The case studies use these packages:
```r
install.packages(c(
  "duckdb", "duckplyr", "arrow", "dbplyr",
  "dplyr", "tidyverse", "data.table", "dtplyr",
  "ggplot2", "scales", "tictoc", "fs", "glue", "here"
))
```

## What’s next
With the data generated and paths set, you’re ready to go:
- If you generated CSV, start with Reading and storing to convert your data and build the DuckDB database.
- If you generated parquet directly, skip ahead to Reading and storing § Step 3 to build the database.
- If you just want to see what this is all about, read the case studies.