Case study 3: From CSV to chart in 20 minutes

The full pipeline: convert, load, query, plot

The first two case studies assumed you already had parquet files and a DuckDB database ready to go. This one starts from scratch: raw CSVs, no intermediate files, nothing pre-built. By the end you will have a chart on screen.

This is what you would actually do on your first day with a new PLIDA extract.

Setup

Unlike the earlier case studies, this one starts from raw CSVs — no parquet, no DuckDB database yet — so there is no pre-existing connection to open here. The setup just loads the packages we need for each of the four steps that follow: arrow and fs for walking the CSV folder, duckdb/DBI/dbplyr for converting CSVs and later querying the database, glue for small bits of path interpolation, tictoc for timing, and tidyverse + scales for the usual data-munging and plotting. The DuckDB connection itself is opened inside each step (CSV → parquet, then parquet → DuckDB database, then querying the database) so the code in each step reads the way you would write it if you were doing this for the first time on a new dataset.

Code
library(tidyverse)
library(scales)
library(arrow)
library(fs)
library(glue)
library(tictoc)
library(duckdb)
library(DBI)
library(dbplyr)

# Centralised paths for this project — a short file that just defines
# where the DuckDB and parquet folders live. See the Setup chapter.
source("R/00-paths.R")
data_dir <- paths$duckdb_dir # replace with your own path

Step 1: Look at what you have

You have been given a directory of CSVs. Before doing anything clever, just look at them.

Code
csv_root <- paths$csv_root
csv_root <- paths$csv_root
datasets <- dir_ls(csv_root, type = "directory")
tibble(
  dataset  = path_file(datasets),
  n_files  = map_int(datasets, \(d) length(dir_ls(d, glob = "*.csv"))),
  total_gb = map_dbl(datasets, \(d) sum(file_size(dir_ls(d, glob = "*.csv"))) / 1e9) |> round(1)
) |>
  arrange(desc(total_gb)) |>
  head(10)

751 GB of CSV across 34 datasets. On a 32 GB machine, you are not reading any of this into R with read_csv(). Don’t even try.

Warning: Do not read_csv() a PLIDA file

It is tempting. You have done it a thousand times on smaller data. But a single MBS year file is 44.5 GB — far more than your machine’s 32 GB of RAM. Even a “medium” ITR file will eat several GB and slow everything else down. Use DuckDB’s streaming CSV reader instead (next step).
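If you just want a quick look inside a big CSV, DuckDB can peek without loading it. A minimal sketch — the tiny temporary CSV here is a stand-in for a real PLIDA file; in practice point big_csv at your own extract:

```r
library(duckdb)
library(DBI)

# Stand-in for a real PLIDA CSV -- point this at the 44.5 GB file in practice.
big_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(SYNTHETIC_AEUID = c("a1", "a2"), NUMSERV = c(3L, 1L)),
          big_csv, row.names = FALSE)

con <- dbConnect(duckdb::duckdb())

# LIMIT is pushed into the streaming reader, so only a handful of rows
# ever leave the file -- safe even on a 44.5 GB CSV.
peek <- dbGetQuery(con, sprintf(
  "SELECT * FROM read_csv_auto('%s') LIMIT 5", big_csv))

# DESCRIBE shows the sniffed column types without scanning the whole file.
types <- dbGetQuery(con, sprintf(
  "DESCRIBE SELECT * FROM read_csv_auto('%s')", big_csv))

dbDisconnect(con, shutdown = TRUE)
```

The same LIMIT trick works for eyeballing any of the 34 datasets before you commit to converting them.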

Step 2: Convert the CSVs you need to parquet

You almost certainly do not need all 34 datasets. Pick the ones relevant to your question and convert them. We covered why parquet is better than CSV in the Reading and storing chapter — the short version is: smaller, faster, typed, and columnar.

The recommended converter is DuckDB’s streaming CSV reader piped into COPY ... TO parquet. On the six‑file benchmark in A1 it was about 2× faster than Arrow for this step, wrote slightly smaller parquet, and stayed under 3 GB of RAM on a 44.5 GB source file.

Here is the recipe for one file:

library(duckdb); library(DBI)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "SET threads TO 8")
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET preserve_insertion_order = false")

dbExecute(con, "
  COPY (
    SELECT * FROM read_csv_auto('path/to/big-file.csv')
  ) TO 'output/dir/'
  (FORMAT PARQUET, COMPRESSION zstd,
   ROW_GROUP_SIZE 250000, ROW_GROUPS_PER_FILE 20)
")

dbDisconnect(con, shutdown = TRUE)

Five details worth knowing:

  1. read_csv_auto() is streaming. DuckDB never materialises the whole CSV in memory.
  2. COMPRESSION zstd is worth setting explicitly — it shaves a meaningful fraction off the output (e.g. 1.82 GB → 1.22 GB on one of our reference tables).
  3. ROW_GROUP_SIZE 250000 is the right middle ground. Tiny row groups (10k) make full scans several times slower.
  4. ROW_GROUPS_PER_FILE 20 gives about 5 million rows per output file. Many medium‑sized files parallelise far better than one enormous one.
  5. preserve_insertion_order = false is a DuckDB pragma worth flipping for bulk export where row order does not matter.
Tip: How the streaming works

DuckDB’s CSV reader operates on batches of rows, pushes them through any SELECT projection, and streams each batch straight to the parquet writer. A 44.5 GB input passes through about 2–3 GB of RAM; the memory budget does not grow with the file. The full measurement is in A1: peak RSS stays flat regardless of input size.

Do this for each dataset you need. It is boring but fast, and you only do it once.
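The per-dataset loop can be scripted in a few lines. A sketch, assuming one sub-directory of CSVs per dataset — the tiny temporary layout below stands in for your real csv_root and parquet folders:

```r
library(duckdb); library(DBI); library(fs); library(glue)

# Tiny stand-in layout: one sub-directory of CSVs per dataset, mirroring
# a real extract. In practice, point csv_root / parquet_root at your folders.
csv_root     <- path(tempdir(), "csv")
parquet_root <- path(tempdir(), "parquet")
for (ds in c("demo", "mbs")) {
  dir_create(path(csv_root, ds))
  write.csv(data.frame(id = 1:3), path(csv_root, ds, "part-0.csv"),
            row.names = FALSE)
}

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "SET preserve_insertion_order = false")

# One streaming COPY per dataset directory; the glob picks up every part file.
for (d in dir_ls(csv_root, type = "directory")) {
  out_dir <- path(parquet_root, path_file(d))
  dir_create(out_dir)
  dbExecute(con, glue(
    "COPY (SELECT * FROM read_csv_auto('{d}/*.csv')) TO '{out_dir}'
     (FORMAT PARQUET, COMPRESSION zstd,
      ROW_GROUP_SIZE 250000, ROW_GROUPS_PER_FILE 20)"))
}

dbDisconnect(con, shutdown = TRUE)
```

On real data you would also add the threads and memory_limit settings from the single-file recipe above.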

Step 3: Build a DuckDB database

With the parquet files in place, you now have a choice. You can query the parquet directly (with either arrow or DuckDB views), or you can load the parquet into a materialised DuckDB database. We tested both in Reading and storing and in the join benchmarks, and the answer is clear: materialise if you are going to query the data more than a couple of times.

library(duckdb); library(DBI)

con <- dbConnect(duckdb::duckdb(), dbdir = "my_project.duckdb", read_only = FALSE)

# DuckDB will use up to 20 GB for the ingest. On a 32 GB machine this
# leaves room for R, the OS, and the rest of the world.
dbExecute(con, "SET memory_limit = '20GB'")
dbExecute(con, "SET threads TO 8")

# One table per parquet directory. DuckDB reads the glob.
dbExecute(con, "
  CREATE TABLE demo AS
  SELECT * FROM read_parquet('parquet/demo/*.parquet')
")

dbExecute(con, "
  CREATE TABLE mbs AS
  SELECT * FROM read_parquet('parquet/dhda-mbs/*.parquet')
  ORDER BY SYNTHETIC_AEUID
")

Notice the ORDER BY on the MBS table. We showed in the storage-variant experiments that sorting a fact table by its join key makes point lookups 47× faster. It costs a couple of extra minutes at build time and pays off on every subsequent query that filters by the join key.

Note: When sorting helps (and when it doesn’t)

Sorting by the join key is a big win for point lookups (looking up one person’s data: 47× faster) and a modest help for narrow filtered joins when the filter correlates with the key. It does not help full-table-scan joins where you touch every row anyway — see the storage-variant experiments for the honest numbers.
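The point-lookup pattern the note describes looks like this in dbplyr. A toy in-memory table stands in for the 372-million-row mbs table:

```r
library(duckdb); library(DBI); library(dplyr); library(dbplyr)

con <- dbConnect(duckdb::duckdb())

# Toy fact table sorted by its join key -- stand-in for the real mbs table.
dbExecute(con, "
  CREATE TABLE mbs AS
  SELECT * FROM (VALUES ('a1', 2), ('a1', 5), ('b7', 1))
    AS t(SYNTHETIC_AEUID, NUMSERV)
  ORDER BY SYNTHETIC_AEUID
")

# The point-lookup shape: filter on the sort key, aggregate, collect.
# On a sorted table, DuckDB's zonemaps skip every row group whose key
# range cannot contain 'a1' -- that is where the 47x comes from.
one_person <- tbl(con, "mbs") |>
  filter(SYNTHETIC_AEUID == "a1") |>
  summarise(services = sum(NUMSERV, na.rm = TRUE)) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```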

If you also want faster joins across the board, add a hashed integer key at ingest time:

dbExecute(con, "
  CREATE TABLE mbs AS
  SELECT hash(SYNTHETIC_AEUID) AS aeuid_hash, *
  FROM read_parquet('parquet/dhda-mbs/*.parquet')
  ORDER BY aeuid_hash
")

Integer joins are about 2× faster than string joins. Keep the original SYNTHETIC_AEUID for display; use the hash for joining.
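A sketch of the hashed key in use, with toy in-memory tables standing in for the real spine and fact tables:

```r
library(duckdb); library(DBI); library(dplyr); library(dbplyr)

con <- dbConnect(duckdb::duckdb())

# Toy spine and fact tables, both carrying the hashed integer key.
dbExecute(con, "
  CREATE TABLE spine AS
  SELECT hash(id) AS aeuid_hash, id AS SYNTHETIC_AEUID
  FROM (VALUES ('a1'), ('b7')) t(id)
")
dbExecute(con, "
  CREATE TABLE mbs AS
  SELECT hash(id) AS aeuid_hash, id AS SYNTHETIC_AEUID, n AS NUMSERV
  FROM (VALUES ('a1', 2), ('b7', 1), ('a1', 5)) t(id, n)
")

# Join on the integer hash; keep the readable string key for display only.
joined <- tbl(con, "spine") |>
  inner_join(tbl(con, "mbs") |> select(aeuid_hash, NUMSERV),
             by = "aeuid_hash") |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```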

Step 4: Query with dbplyr

Now you are in the pleasant part. Open a read-only connection (so multiple R sessions can share it) and use dplyr verbs as usual. This is the same recipe used in case studies 1 and 2 — here we open the book’s pre-built plida_tables.duckdb, which also holds the dhda_spine table; substitute the database you built in step 3 if you are working from scratch.

Code
con <- dbConnect(
  duckdb::duckdb(),
  dbdir = file.path(paths$duckdb_dir, "plida_tables.duckdb"),
  read_only = TRUE
)

demo       <- tbl(con, "demo")
dhda_spine <- tbl(con, "dhda_spine")
mbs        <- tbl(con, "mbs")

tic("Three-table join + groupby on 372M rows")
result <- demo |>
  select(SPINE_ID, CORE_GENDER, YEAR_OF_BIRTH) |>
  inner_join(dhda_spine |> rename(SPINE_ID = spine_id), by = "SPINE_ID") |>
  inner_join(mbs |> select(SYNTHETIC_AEUID, NUMSERV, BTOS), by = "SYNTHETIC_AEUID") |>
  mutate(age_2015 = 2015L - YEAR_OF_BIRTH,
         age_group = (age_2015 %/% 10L) * 10L) |>
  filter(age_2015 >= 0L, age_2015 < 100L) |>
  group_by(CORE_GENDER, age_group, BTOS) |>
  summarise(services = sum(NUMSERV, na.rm = TRUE),
            claims = n(), .groups = "drop") |>
  collect()
toc()
Three-table join + groupby on 372M rows: 2.568 sec elapsed
Code
result

That joined 30 million people to 372 million MBS claims, grouped by sex, decade of age, and broad type of service, and returned a tidy tibble. All in a few seconds.

Step 5: Plot

Code
result |>
  filter(CORE_GENDER %in% c("M", "F"),
         BTOS %in% c("GP", "SP", "P", "O")) |>
  mutate(BTOS = recode(BTOS,
    "GP" = "GP", "SP" = "Specialist", "P" = "Pathology", "O" = "Other")) |>
  ggplot(aes(age_group, services / 1e6, fill = BTOS)) +
  geom_col() +
  facet_wrap(~ CORE_GENDER, labeller = labeller(
    CORE_GENDER = c("F" = "Female", "M" = "Male"))) +
  scale_y_continuous(labels = label_comma(suffix = "M")) +
  labs(x = "Age group", y = "MBS services (millions)",
       fill = "Service type",
       title = "MBS service use by age, sex, and service type, 2015",
       subtitle = "fplida synthetic data") +
  theme_minimal(base_size = 11)

The whole pipeline, summarised

Step                    What                                  Time                        Memory
Convert CSV → parquet   DuckDB COPY ... TO parquet            ~1.5 s/GB                   <3 GB
Build DuckDB            CREATE TABLE ... FROM read_parquet()  ~2 min for 372M rows        ~10 GB peak
Query + collect         dbplyr verbs → collect()              ~3–4 s for a 3-table join   ~2 GB
Plot                    ggplot2 on the small result           instant                     trivial

Total wall time for the one-off setup (steps 1–2): maybe 10–15 minutes for a realistic subset of PLIDA. After that, every query is seconds. If you are used to waiting minutes for read_csv() to load a file — or crashing R entirely — this is a different world.

Common mistakes

Important: The four things that crash your machine

These are the mistakes that will freeze a 32 GB laptop. Every one of them involves accidentally pulling too much data into R memory. The fix is always the same: push the filtering and aggregation to DuckDB, collect() only the small result.

A few things that will bite you if you are not expecting them:

Calling collect() too early. If you write mbs |> collect() |> filter(...), you have just tried to load 372 million rows into R memory. Your machine will freeze. Always filter before collecting. The rule is: collect() goes at the end, and the result should be small.
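The right and wrong orderings, side by side — here mtcars is registered as an in-memory stand-in for a 372-million-row table:

```r
library(duckdb); library(DBI); library(dplyr); library(dbplyr)

con <- dbConnect(duckdb::duckdb())
# Stand-in table; imagine 372 million rows instead of 32.
duckdb_register(con, "mbs_demo", mtcars)

# Right: filter and aggregate in DuckDB, collect only the small result.
small <- tbl(con, "mbs_demo") |>
  filter(cyl == 4) |>
  summarise(n = n()) |>
  collect()

# Wrong (do not do this on real data): collect() first pulls every row
# into R, then filters -- at PLIDA scale this is what freezes the machine.
# tbl(con, "mbs_demo") |> collect() |> filter(cyl == 4)

dbDisconnect(con, shutdown = TRUE)
```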

Forgetting read_only = TRUE. By default, DuckDB locks the file for exclusive access. If you open it in your R session and then try to render a Quarto doc that also opens it, one of them will fail. Always pass read_only = TRUE when you are not writing.

Not setting memory_limit. DuckDB will cheerfully try to use all your RAM for a big sort or join. On a 32 GB machine, set memory_limit = '20GB' so the OS and R can breathe.

One enormous parquet file. If you write 372 million rows to a single parquet file, DuckDB cannot parallelise the read and zonemaps are less useful. Set ROW_GROUPS_PER_FILE 20 (≈5 million rows per file) in the COPY ... TO parquet step.