Reading and storing raw data

Administrative microdata usually arrives as CSV, and the CSVs are always bigger than you want them to be. The first job is to get that data off the “CSV on disk” default and onto something faster, smaller, and typed. On PLIDA-style data the right answer is almost always the same: convert to parquet with DuckDB, and load the parquet into a materialised DuckDB database.

This chapter walks through that recipe. The evidence for each claim is in Appendix A1; cross-references below point to the relevant experiment.

Note: The SQL bits on this page are one-off

This is the only chapter where a couple of DuckDB-native DDL commands appear — the CSV → parquet COPY and the CREATE TABLE that ingests the parquet. Both are one-shot setup steps that have no dbplyr equivalent. You run them once; every analysis after that (chapters 2–4 and the case studies) is ordinary tidyverse R.

Setup

This is the chapter that builds the DuckDB database the rest of the book relies on, so the setup is deliberately light: a few packages for reading and writing, and a source() of the paths helper from the Setup chapter so we can refer to the parquet and DuckDB folders by name. There is no persistent connection to open here — the short-lived DuckDB connections used to convert CSVs and ingest parquet are created inline in each example below and closed immediately after.

library(tidyverse)
library(arrow)
library(duckdb)
library(DBI)
library(fs)

# Centralised paths for this project — a short file that just defines
# where the DuckDB and parquet folders live. See the Setup chapter.
source("R/00-paths.R")
data_dir <- paths$duckdb_dir # replace with your own path

What the data looks like

The fplida package generates a realistic simulated PLIDA corpus: a spine-based synthetic panel of individuals linked across ATO, DHDA, DSS, and other datasets, with cross-dataset consistency. We work against the 30-million-person variant, which on disk is about 751 GB of CSV:

Dataset Size (GB)
dhda-mbs 329
ato-stp 242
ato-pit_itr 31
ato-pit_ps 20
ato-pit_ie 17
ncver-tva 14
ato-sae 13
ha-travellers 11
_system 11
dhda-air 9
others 54

A single one of these — the 2015 MBS claims file — is 44.5 GB on its own. That is the stress-test file we return to throughout the book.

For the demonstration we restrict attention to five tables, chosen to mimic the join pattern you would use for a labour-market study of health shocks: demographics on the left, an income table and a services-use table on the right, linked via the PLIDA spine.

label | source | purpose | rows
demo | abs-core/plidage-core-demog-cb-c21-2006-latest.csv | birth year, gender, country of birth — keyed by SPINE_ID | 30M
abs_spine | abs-core/abs-spine.csv | spine_id → ABS SYNTHETIC_AEUID | 30M
ato_spine | ato-pit_itr/ato-spine.csv | spine_id → ATO SYNTHETIC_AEUID | 30M
dhda_spine | dhda-mbs/dhda-spine.csv | spine_id → DHDA SYNTHETIC_AEUID | 30M
itr | ato-pit_itr/madip-ge-atoitrcontext.csv | income year, occupation code, residency — by ATO aeuid | 170M
mbs | dhda-mbs/madipge-mbs-d-claims-2015.csv | MBS claims for 2015 — by DHDA aeuid | 372M

The spine files are the PLIDA-specific glue. Each dataset provider (ABS, ATO, DHDA) has its own person identifier (SYNTHETIC_AEUID), and the *-spine.csv tables map each provider’s identifier back to a common spine_id. The canonical question — “how many MBS services per person by sex and age, for people born 1960–1970?” — is therefore a three-table join through the spine:

demo[SPINE_ID]
  ⟕ dhda_spine[spine_id, SYNTHETIC_AEUID]
  ⟕ mbs[SYNTHETIC_AEUID, NUMSERV, BENPAID]

This shape of question appears so often in PLIDA work that most of the case studies and benchmarks on this site are variations of it.
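In the dbplyr style used from chapter 2 onward, that three-table join can be sketched as a dplyr pipeline. The version below runs against an in-memory DuckDB with made-up rows; the column names (year_of_birth, gender, NUMSERV) are assumptions based on the table descriptions above, not the real PLIDA schema:

```r
library(DBI)
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb::duckdb())  # throwaway in-memory database

# Toy stand-ins for demo, dhda_spine, and mbs
dbWriteTable(con, "demo", data.frame(
  SPINE_ID = 1:3, year_of_birth = c(1965L, 1968L, 1980L),
  gender = c("F", "M", "F")))
dbWriteTable(con, "dhda_spine", data.frame(
  spine_id = 1:3, SYNTHETIC_AEUID = c("a", "b", "c")))
dbWriteTable(con, "mbs", data.frame(
  SYNTHETIC_AEUID = c("a", "a", "b", "c"), NUMSERV = c(1, 2, 1, 5)))

# demo ⟕ dhda_spine ⟕ mbs, restricted to births 1960-1970
res <- tbl(con, "demo") |>
  filter(between(year_of_birth, 1960, 1970)) |>
  left_join(tbl(con, "dhda_spine"), by = c("SPINE_ID" = "spine_id")) |>
  left_join(tbl(con, "mbs"), by = "SYNTHETIC_AEUID") |>
  group_by(gender) |>
  summarise(services = sum(NUMSERV, na.rm = TRUE)) |>
  collect()

res  # one row per gender: F = 3 services, M = 1
dbDisconnect(con, shutdown = TRUE)
```

The whole pipeline is translated to a single SQL query and executed inside DuckDB; only the two-row summary crosses into R.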

Step 1 — CSV to parquet, using DuckDB

CSV should essentially always become parquet before serious work begins. Four main reasons:

  • Parquet knows the column types; CSV doesn’t. In a CSV, every column is a string until something reads it and guesses. year_of_birth might be read as a number in one session and a character in the next. Parquet files record the type of each column once, at write time, and every tool that reads them agrees.
  • Parquet lets you read one column at a time. PLIDA tables have dozens of columns and most queries touch a handful. A CSV read always reads every column of every row. Parquet is arranged so that the reader can fetch only the columns the query actually uses. This is the single biggest reason these pipelines are fast.
  • Parquet files are much smaller. Parquet compresses each column separately, and we ask it to use the strong zstd algorithm. The 44.5 GB MBS 2015 CSV becomes 18.7 GB of parquet. Smaller files move through memory and disk faster.
  • Parquet stores mini-summaries that let readers skip chunks. A parquet file is internally split into groups of rows (a few hundred thousand at a time). For each group, the file records the min and max of every column. When you ask for filter(incm_yr == 2014), the reader looks at those min/max summaries and skips any group whose year range doesn’t include 2014 — without decompressing or reading the rows. This is the other reason queries that look at a sliver of the data are fast.
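The first two points are easy to see on a toy file. This sketch uses arrow (already loaded in Setup) on a throwaway temp file; `col_select` is the column-pruning knob on `read_parquet()`:

```r
library(arrow)

tf <- tempfile(fileext = ".parquet")
write_parquet(
  data.frame(year_of_birth = c(1961L, 1972L), sex = c("F", "M")),
  tf)

# Column pruning: only the year_of_birth bytes are read off disk ...
d <- read_parquet(tf, col_select = "year_of_birth")

# ... and the integer type recorded at write time survives the round
# trip -- no per-session re-guessing, unlike CSV.
stopifnot(is.integer(d$year_of_birth), names(d) == "year_of_birth")
```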

Parquet is also the format everyone uses. arrow, duckdb, Python’s pandas, polars, pretty much anything — all read parquet natively. CSV has a thousand dialects.

The only reason not to convert is if the file is small enough that you never cared about speed anyway.

The recommended converter is DuckDB’s CSV reader piped into COPY ... TO ... (FORMAT PARQUET):

library(duckdb); library(DBI)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "SET threads TO 8")
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET preserve_insertion_order = false")

dbExecute(con, "
  COPY (
    SELECT * FROM read_csv_auto('dhda-mbs/madipge-mbs-d-claims-2015.csv')
  ) TO 'data/parquet/dhda-mbs'
  (FORMAT PARQUET, COMPRESSION zstd,
   ROW_GROUP_SIZE 250000, ROW_GROUPS_PER_FILE 20)
")

Five details in that snippet are worth spelling out:

  1. read_csv_auto() reads in chunks. It doesn’t load the whole CSV into memory first. A 44 GB file passes through a memory footprint of a few GB while being written to parquet.
  2. COMPRESSION zstd is worth setting explicitly. The default compression is weaker. On the itr_context table, explicit zstd cut the parquet size from 1.82 GB to 1.22 GB.
  3. ROW_GROUP_SIZE 250000 sets the size of the row-groups we described above (the internal chunks of the parquet file, with their min/max summaries). 250,000 rows per group is a sensible middle ground: big enough that compression and chunk-skipping work well, small enough that the parquet metadata doesn’t balloon. Very small groups (say, 10,000) slow queries down noticeably.
  4. ROW_GROUPS_PER_FILE 20 gives about five million rows per output file. Reading a few medium files in parallel is faster than reading one enormous file. Think of it like splitting a long CSV into shards so multiple CPU cores can work at once.
  5. preserve_insertion_order = false unlocks a modest speed-up on bulk conversions where we do not care about the order of rows in the output.
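Scaled up to a whole folder of CSVs, the same COPY can be driven from a small loop. This is a sketch, not the book's pipeline script: the function name and directory arguments are hypothetical, and sprintf() splices each file path into the one-shot SQL:

```r
library(DBI)
library(duckdb)
library(fs)

# Hypothetical helper: convert every CSV in csv_dir into its own
# parquet folder under parquet_dir, using the COPY settings above.
csvs_to_parquet <- function(csv_dir, parquet_dir, threads = 8) {
  con <- dbConnect(duckdb::duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE))
  dbExecute(con, sprintf("SET threads TO %d", threads))
  dbExecute(con, "SET preserve_insertion_order = false")
  dir_create(parquet_dir)
  for (csv in dir_ls(csv_dir, glob = "*.csv")) {
    out <- path(parquet_dir, path_ext_remove(path_file(csv)))
    dbExecute(con, sprintf("
      COPY (SELECT * FROM read_csv_auto('%s')) TO '%s'
      (FORMAT PARQUET, COMPRESSION zstd,
       ROW_GROUP_SIZE 250000, ROW_GROUPS_PER_FILE 20)", csv, out))
  }
}
```

One connection converts all the files; each CSV becomes a folder of parquet shards named after it.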

On the six large-file benchmark (three 48–51 GB MBS files plus three 48 GB STP files), DuckDB averaged 67.3s per file against Arrow’s 130.0s — roughly a 2× speed-up — while writing slightly smaller parquet. Peak memory stayed under 3 GB in both cases. Full numbers in A1 §Arrow vs DuckDB.

Tip: Why not Arrow?

Arrow was the obvious first choice — open_csv_dataset() plus write_dataset() is the natural R-native streaming path, and it works. The benchmark just shows that DuckDB is faster at this specific step on this corpus. Arrow is still useful as a low-overhead query layer over parquet; see the data manipulation chapter.

Note: Should we skip parquet and ingest CSV straight into DuckDB?

DuckDB can create a table directly from a CSV with CREATE TABLE x AS SELECT * FROM read_csv_auto(...), so in principle you could skip the parquet intermediate. Don’t, for three reasons:

  • Parquet is a useful artefact on its own — other tools read it without a DuckDB connection.
  • CSV-to-DuckDB-table takes about the same time as CSV-to-parquet plus parquet-to-DuckDB-table, and you end up with only the DuckDB copy.
  • If the DuckDB file ever corrupts or needs resharding, parquet gives you a cheap rebuild.

Step 2 — Get the parquet layout right

Parquet is not one format. How you write the file — the compression, the row-group size, whether you split the output into folders by year — changes how fast the reader can scan it. Three rules, all backed up in A1:

  • Use explicit zstd. The project-default compression left the file at 1.82 GB; explicit zstd + 250k row groups rewrote the same data at 1.22 GB.
  • Partitioning is folder-based splitting. Parquet lets you split a table into a folder of files by the value of one column, e.g. by_year/year=2014/, by_year/year=2015/, etc. A filter on year == 2014 then touches only the files in that subfolder, regardless of what’s inside. It’s worth doing when you routinely filter on that column, and pointless otherwise. For a multi-year table like itr_context, year-partitioning made filter(incm_yr == 2015) run in 0.028s vs 0.080s. For MBS 2015 — which is already one year — there is nothing to gain.
  • Keep row groups medium-sized. Tiny row groups (10,000 rows) slow full-table scans down noticeably — a summarise over the whole table went from 0.10s to 0.24s. The middle ground (250,000 rows) is usually right.
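Partitioning is one extra clause on the same COPY: PARTITION_BY. A runnable toy version — the VALUES rows stand in for itr_context, and the output path is a temp folder:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
out <- file.path(tempdir(), "itr_by_year")

# PARTITION_BY writes one incm_yr=<value>/ subfolder per distinct
# year (hive-style), so a filter on incm_yr can skip whole folders.
dbExecute(con, sprintf("
  COPY (SELECT * FROM (VALUES (2014, 'x'), (2015, 'y'))
        AS t(incm_yr, occ_code))
  TO '%s'
  (FORMAT PARQUET, COMPRESSION zstd, PARTITION_BY (incm_yr))", out))

basename(list.dirs(out, recursive = FALSE))
dbDisconnect(con, shutdown = TRUE)
```

read_parquet() pointed at the folder with hive_partitioning enabled (the default) sees incm_yr as an ordinary column again.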

The one sorting trick that matters

For big tables where you look up one person at a time — “give me every MBS claim for person X” — you can get a huge speed-up by sorting the table by the person ID at write time. Here is why it matters, in the same language as the row-group explanation above:

Parquet stores a min and max of every column for every row-group. When the data is unsorted, every row-group’s range of synthetic_aeuid covers basically the whole ID space, so a filter like synthetic_aeuid == "X" can’t skip any row-group — the reader has to open all of them. When the data is sorted by synthetic_aeuid, each row-group holds a tight range of IDs, and the reader can skip every row-group except the one containing “X”. On the 371-million-row MBS table this is about 47× faster for single-person lookups (see A3).

The catch: sorting only helps when the filter produces a contiguous range of the sorted column. “People born in 1970” are uniformly spread across the whole range of person IDs (year_of_birth has nothing to do with how IDs were assigned), so sorting MBS by synthetic_aeuid does nothing to speed up a three-way join filtered on year_of_birth. Sort for single-person lookups; do not sort for broad analytical queries.
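Write-time sorting is nothing more than an ORDER BY inside the COPY. A toy, self-contained version, with three VALUES rows in place of the MBS table:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
out <- file.path(tempdir(), "mbs_by_person.parquet")

# The ORDER BY inside the COPY is the whole trick: rows land in
# person-ID order, so each row-group's min/max of synthetic_aeuid
# covers a tight range and single-person filters can skip groups.
dbExecute(con, sprintf("
  COPY (
    SELECT * FROM (VALUES ('c', 2), ('a', 1), ('b', 5))
         AS t(synthetic_aeuid, numserv)
    ORDER BY synthetic_aeuid
  ) TO '%s' (FORMAT PARQUET, COMPRESSION zstd)", out))

ids <- dbGetQuery(con, sprintf(
  "SELECT synthetic_aeuid FROM read_parquet('%s')", out))$synthetic_aeuid
ids  # "a" "b" "c" -- stored in sorted order
dbDisconnect(con, shutdown = TRUE)
```

On the real corpus the same ORDER BY goes over read_parquet() of the unsorted MBS shards; it is a one-off rewrite, paid once.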

Step 3 — Load the parquet into DuckDB

Once the parquet files exist there are two natural ways to make them available to DuckDB.

Option A — “views” over parquet (zero copy)

con <- dbConnect(duckdb::duckdb(), dbdir = "plida_views.duckdb")
dbExecute(con, "
  CREATE OR REPLACE VIEW mbs AS
  SELECT * FROM read_parquet('/offline/.../dhda-mbs/*.parquet')
")

A view is a name you give to a query. Here you are telling DuckDB: “whenever anyone asks for the table mbs, go read these parquet files.” The parquet is not copied anywhere. Setting this up takes milliseconds. Every query then does its own parquet read, but only of the columns and row-groups it needs — which is fast.

Option B — “materialised tables” (copy the parquet into DuckDB once)

con <- dbConnect(duckdb::duckdb(), dbdir = "plida_tables.duckdb")
dbExecute(con, "
  CREATE OR REPLACE TABLE mbs AS
  SELECT * FROM read_parquet('/offline/.../dhda-mbs/*.parquet')
")

Here DuckDB reads the parquet once and writes the data into its own internal storage inside the .duckdb file. After this one-off ingest, queries against the mbs table never touch the parquet again. Building the table takes minutes for a large fact table like MBS; subsequent queries are a bit faster and more predictable, because DuckDB owns the storage and doesn’t have to translate from parquet each time.

When to pick which

  • Single-table filter + summarise: views are fine. On the benchmark tasks T1 and T2 the view was actually slightly faster than the table and the difference was tiny.
  • Joins across big tables: tables pull away. On the three-way MBS join (T3), the materialised table finished in 3.7s; the view version took 6.4s. See A3.
  • Running the same pipeline many times: pay the two-minute ingest cost once and then every run is a bit faster. If you are going to run this pipeline more than a couple of times, tables win.

The practical default: tables for anything you run more than once, views for ad-hoc exploration. Prefer views when the underlying parquet changes often, when disk is tight (tables duplicate the storage), or when other tools need to read the same files.
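Either option can be scripted over a whole parquet tree. This helper is hypothetical (not from the book's scripts): it loops over the parquet folders and registers one view per folder, named after the folder, with quoted identifiers so names like dhda-mbs stay legal in SQL:

```r
library(DBI)
library(duckdb)
library(fs)

# Hypothetical helper: one zero-copy view per parquet folder.
# Swapping VIEW for TABLE turns this into the materialised ingest.
make_views <- function(con, parquet_root) {
  for (d in dir_ls(parquet_root, type = "directory")) {
    dbExecute(con, sprintf(
      "CREATE OR REPLACE VIEW \"%s\" AS SELECT * FROM read_parquet('%s/*.parquet')",
      path_file(d), d))
  }
}
```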

Step 4 — Set the right connection defaults

Two knobs make the difference between a DuckDB connection that behaves and one that crashes the machine:

con <- dbConnect(duckdb::duckdb(),
                 dbdir = "plida_tables.duckdb",
                 read_only = TRUE)
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET threads TO 8")

  • memory_limit should be about 60% of physical RAM. On a 32 GB machine, 20 GB leaves enough for R and the OS. Without this, DuckDB will try to grab most of free memory and one of the two processes will be killed.
  • threads controls parallelism. For bulk ingest and scans, more is better up to core count; for small ad-hoc queries the default is already fine.
  • read_only = TRUE on an already-built database lets multiple sessions attach simultaneously. Useful when you iterate in one R session and render a Quarto document in another.
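The read_only behaviour is worth a tiny self-contained demo. A sketch against a throwaway database file: reads succeed, writes are refused, and (across sessions) further read-only attachments are allowed:

```r
library(DBI)
library(duckdb)

db <- file.path(tempdir(), "demo.duckdb")

# Build the database once, read-write ...
con <- dbConnect(duckdb::duckdb(), dbdir = db)
dbWriteTable(con, "t", data.frame(x = 1:3))
dbDisconnect(con, shutdown = TRUE)

# ... then attach read-only: reads work, any write is refused,
# which is what makes simultaneous attachments safe.
ro <- dbConnect(duckdb::duckdb(), dbdir = db, read_only = TRUE)
n <- dbGetQuery(ro, "SELECT count(*) AS n FROM t")$n
write_failed <- inherits(
  try(dbExecute(ro, "CREATE TABLE y (z INT)"), silent = TRUE),
  "try-error")
dbDisconnect(ro, shutdown = TRUE)

n             # 3
write_failed  # TRUE
```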

Putting the pipeline together

The end-to-end recipe for a new PLIDA dataset:

# 1. CSV → parquet (one-shot, outside the analysis)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET threads TO 8")
dbExecute(con, "SET preserve_insertion_order = false")
dbExecute(con, "
  COPY (SELECT * FROM read_csv_auto('raw/*.csv'))
  TO 'parquet/table'
  (FORMAT PARQUET, COMPRESSION zstd,
   ROW_GROUP_SIZE 250000, ROW_GROUPS_PER_FILE 20)
")
dbDisconnect(con, shutdown = TRUE)

# 2. parquet → DuckDB database (one-shot)
con <- dbConnect(duckdb::duckdb(), dbdir = "plida_tables.duckdb")
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "CREATE TABLE mbs AS
             SELECT * FROM read_parquet('parquet/table/*.parquet')")
dbDisconnect(con, shutdown = TRUE)

# 3. analysis (every day)
library(dplyr); library(dbplyr)
con <- dbConnect(duckdb::duckdb(),
                 dbdir = "plida_tables.duckdb",
                 read_only = TRUE)
dbExecute(con, "SET memory_limit = '20GB'")

tbl(con, "mbs") |>
  filter(incm_yr == 2015L) |>
  count(month = month(svcdate)) |>
  collect()

Everything that follows in this book is a variation on step 3.

What lives in A1

  • Arrow vs DuckDB on the six-file CSV → parquet benchmark (time, memory, output size).
  • Per-file conversion log across the corpus, and the scaling relationship with CSV size.
  • Parquet layout experiments: explicit zstd, partitioning, row-group size.
  • DuckDB build cost for materialised tables vs views.