Setup

Before running anything on this site, you need two things: the synthetic data, and a place to put intermediate files. This page walks through both.

Step 1: Install fplida

fplida is an R package with a Rust backend that generates synthetic data mirroring PLIDA’s structure. It produces realistic, cross-dataset-consistent microdata for 34 administrative and survey datasets — everything from Census and tax returns to MBS claims and STP payroll.

# Requires R 4.5+ and Rust 1.68+ (via rustup)
export PATH="$HOME/.cargo/bin:$PATH"

# Clone and install
git clone https://github.com/wfmackey/fplida.git
R CMD INSTALL fplida/
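Once the install finishes, a quick sanity check from R confirms the package loads and the generator function used below is available (illustrative; `build_fplida` is the entry point shown in Step 2):

```r
# Confirm the package installed and loads cleanly
library(fplida)
packageVersion("fplida")

# build_fplida() is the generator used in Step 2
exists("build_fplida")
```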

Step 2: Generate the data

The size you generate depends on what you want to test. The case studies on this site use 30 million people (~750 GB of CSV), but you can start much smaller:

People   Approx CSV size   Approx parquet size   Good for
100K     ~2.5 GB           ~500 MB               Quick testing, learning the API
1M       ~25 GB            ~5 GB                 Realistic single-table workflows
10M      ~250 GB           ~50 GB                Join and window function testing
30M      ~750 GB           ~130 GB               Full PLIDA-scale stress testing

The total disk size understates the challenge — it’s the dataset-level row counts that matter. PLIDA datasets are not one-row-per-person; MBS has ~12 claims per person per year, STP has ~52 payroll records per employed person per year, and the ITR has multiple sub-tables per person per year. Even at 1 million people, joins across these tables involve tens of millions of rows:

Dataset                                   100K people   1M people   30M people
Census (person-level)                     100K          1M          30M
ITR context (person × year × sub-table)   ~570K         ~5.7M       ~170M
MBS claims (person × service × year)      ~1.2M         ~12M        ~372M
STP payroll (person × employer × week)    ~11.6M        ~116M       ~3.5B
Travellers (person × quarter)             ~370K         ~3.7M       ~110M

At 1 million people, joining MBS to demographics already involves 12 million claims — enough that naive in-memory approaches start to struggle and the tools on this site start to matter.
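For joins at this scale, a database engine that streams from disk is the usual answer. A minimal sketch using duckdb and dplyr — note the file paths, column names (`spine_id`, `age_group`), and the assumption that both tables carry `spine_id` directly are illustrative guesses about the fplida layout, not guaranteed; in practice you may need to link via each dataset's spine file:

```r
library(duckdb)  # also attaches DBI
library(dplyr)

con <- dbConnect(duckdb())

# Expose the parquet files as views; nothing is loaded into R memory yet.
# File locations here are assumptions about the fplida output layout.
dbExecute(con, "CREATE VIEW mbs AS
                SELECT * FROM read_parquet('dhda-mbs/*.parquet')")
dbExecute(con, "CREATE VIEW persons AS
                SELECT * FROM read_parquet('abs-core/*.parquet')")

# The join and aggregation run inside DuckDB; only the small summary
# table comes back into R when collect() is called.
claims_by_age <- tbl(con, "mbs") |>
  inner_join(tbl(con, "persons"), by = "spine_id") |>
  count(age_group) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```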

Note: Why CSV?

fplida can output parquet directly, and parquet is faster to work with (as we show in the Reading and storing chapter). But most data in PLIDA is delivered as CSV. If you’re learning these tools to prepare for working with real PLIDA data, generating CSV and converting it yourself is the more realistic exercise.

Either way, you call build_fplida() with your chosen export_format:

library(fplida)

# Generate at your chosen scale — parquet is faster and smaller
result <- build_fplida(
  n = 1000000,
  export_format = "parquet",
  output_dir = "/path/to/your/fplida-data"
)

# Or CSV if you want to follow the CSV-to-parquet conversion steps on this site
result <- build_fplida(
  n = 1000000,
  export_format = "csv",
  output_dir = "/path/to/your/fplida-data"
)

The output is a directory tree with one subfolder per dataset (e.g. ato-pit_itr/, dhda-mbs/, abs-core/), each containing data files and a spine file that maps spine_id to that agency’s person identifier.
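You can confirm the generation worked by listing the output tree. A quick sketch using the fs package (reuse the output_dir you passed to build_fplida; the exact file names inside each dataset folder depend on the export format):

```r
library(fs)

# One subfolder per dataset, e.g. ato-pit_itr/, dhda-mbs/, abs-core/
dir_ls("/path/to/your/fplida-data", type = "directory")

# Each dataset folder holds its data files plus a spine file
dir_ls("/path/to/your/fplida-data/dhda-mbs")
```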

Tip: Start small

If you just want to follow along with the case studies, 1 million people is enough to see the patterns. The timings and memory numbers on this site are from a 30M run — yours will be proportionally smaller, but the code is the same.

Step 3: Set your paths

Everything on this site reads data from two locations:

  • fplida_path: where fplida wrote its output (the directory containing ato-pit_itr/, dhda-mbs/, etc.)
  • work_path: where this project stores intermediate files (parquet conversions, DuckDB databases). This should be on a drive with plenty of space — at 30M people, expect ~200 GB of intermediates.

Open R/_common.R and edit the two paths:

# R/_common.R — edit these to match your setup
fplida_path <- "/path/to/your/fplida-data"
work_path   <- "/path/to/your/working-directory"

Every script and .qmd file on this site sources R/00-paths.R, which reads R/_common.R and builds the full path structure from those two variables. You set them once; everything else follows.
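The pattern is simple enough to sketch. This is an illustrative outline only — the actual contents of R/00-paths.R may differ, and the subdirectory names (`parquet/`, `duckdb/`) are assumptions:

```r
# Illustrative sketch, not the real R/00-paths.R
source(here::here("R", "_common.R"))  # defines fplida_path and work_path

# Derive the full path structure from the two user-set variables
parquet_path <- file.path(work_path, "parquet")  # CSV-to-parquet output
duckdb_path  <- file.path(work_path, "duckdb")   # database files

fs::dir_create(c(parquet_path, duckdb_path))
```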

R/_common.R is in .gitignore so your local paths don’t get committed.

Note: Alternative: environment variables

If you prefer, add these to your .Renviron instead of editing R/_common.R:

FPLIDA_PATH=/path/to/your/fplida-data
USING_PLIDA_WORK_PATH=/path/to/your/working-directory

R/00-paths.R checks for these as a fallback.
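A fallback like this is typically a few lines of base R. The sketch below shows the general pattern (illustrative only; the real check in R/00-paths.R may be written differently):

```r
# Fall back to environment variables when _common.R didn't set the paths
if (!exists("fplida_path")) {
  fplida_path <- Sys.getenv("FPLIDA_PATH")
}
if (!exists("work_path")) {
  work_path <- Sys.getenv("USING_PLIDA_WORK_PATH")
}

# Fail early with a clear message if neither source provided a path
stopifnot(nzchar(fplida_path), nzchar(work_path))
```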

Step 4: Install R packages

The case studies use these packages:

install.packages(c(
  "duckdb", "duckplyr", "arrow", "dbplyr",
  "dplyr", "tidyverse", "data.table", "dtplyr",
  "ggplot2", "scales", "tictoc", "fs", "glue", "here"
))
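After installing, it can be worth verifying that every package actually loads before starting a long run. A small check (same package list as above):

```r
pkgs <- c(
  "duckdb", "duckplyr", "arrow", "dbplyr",
  "dplyr", "tidyverse", "data.table", "dtplyr",
  "ggplot2", "scales", "tictoc", "fs", "glue", "here"
)

# requireNamespace() loads quietly and returns FALSE on failure
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) {
  stop("Missing packages: ", paste(missing, collapse = ", "))
}
```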

What’s next

With the data generated and paths set, you’re ready to go: