---
title: "A1 · Reading and storing benchmarks"
subtitle: "CSV → parquet, parquet layout, DuckDB ingest"
---
This appendix backs up the claims in [Reading and storing raw data](01-reading-storing.qmd). Four experiments, each with timings and peak resident-set size (RSS):
1. Arrow vs DuckDB on CSV → parquet, across six large files.
2. Per-file conversion log and scaling with CSV size.
3. Parquet layout variants (zstd, row-group size, partitioning).
4. DuckDB build cost for materialised tables vs views.
All runs are on a 32 GB MacBook with local NVMe. Each backend is spawned in a fresh R subprocess so one run cannot pollute the next.
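The subprocess isolation could be sketched roughly as below, using `callr` (an assumption on our part; the actual scripts may spawn subprocesses differently). The point is that each timed run executes in a throwaway R session, so library state, caches, and allocator warm-up from one backend cannot leak into the next.

```r
library(callr)

# Hypothetical helper: run `fun` in a fresh R subprocess and return its
# wall-clock time in seconds. Everything `fun` touches dies with the
# subprocess, so consecutive runs start from a cold state.
time_in_subprocess <- function(fun, ...) {
  callr::r(function(fun, ...) {
    t0 <- proc.time()[["elapsed"]]
    fun(...)
    proc.time()[["elapsed"]] - t0
  }, args = list(fun = fun, ...))
}

# e.g. time_in_subprocess(function(n) sum(rnorm(n)), n = 1e6)
```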
## Setup
Unlike the main chapters, this appendix does not open a DuckDB connection or run any live queries. The benchmark runs happened earlier, each in a fresh R subprocess, inside the `R/` scripts listed below each section (`R/01-convert-csv-to-parquet.R`, `R/10-benchmark-csv-backends.R`, and so on). Each script wrote its timings and peak-memory readings to a parquet log in `benchmarks/`. The setup chunk just loads the tidyverse, `arrow`/`fs` for reading those logs, `scales`/`knitr` for rendering the tables and charts below, and `R/helpers.R`, which defines `latest_results()`: a tiny wrapper that keeps only the most recent run for each key combination, so re-running a benchmark doesn't pile up stale numbers. The four `*_log` objects created here are just tibbles that the rest of the page filters and renders.
```{r setup}
library(tidyverse)
library(scales)
library(arrow)
library(fs)
library(knitr)
# Centralised paths plus a small helper (`latest_results()`) that picks
# the most recent run for each (label, ...) combination from the
# benchmark logs. See the Setup chapter.
source("R/00-paths.R")
source("R/helpers.R")
read_log <- function(file, keys) {
  log_path <- file.path(paths$bench_dir, file)
  if (file_exists(log_path)) latest_results(read_parquet(log_path), keys) else tibble()
}

convert_log <- read_log("convert-log.parquet", c("label", "file"))
backend_log <- read_log("csv-backend-log.parquet", c("approach", "label"))
layout_log  <- read_log("parquet-layout-log.parquet", c("task", "variant", "engine"))
build_log   <- read_log("duckdb-build-log.parquet", c("task", "approach", "label"))
```
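For reference, `latest_results()` could be sketched roughly as follows, assuming each log row carries a `run_at` timestamp column (an assumption; the real definition lives in `R/helpers.R`):

```r
library(dplyr)

# Hypothetical sketch: keep only the newest run for each key combination,
# assuming a `run_at` timestamp column in the log.
latest_results <- function(log, keys) {
  log |>
    group_by(across(all_of(keys))) |>
    slice_max(run_at, n = 1, with_ties = FALSE) |>
    ungroup()
}
```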
## Arrow vs DuckDB on CSV to parquet
The benchmark converts **six genuinely large source files** from this corpus — three MBS yearly claim files (`48-51 GB` each) and three STP yearly payroll files (`48 GB` each) — with each backend in turn. Both sides write zstd-compressed parquet with medium row groups and medium-sized output files. DuckDB uses `ROW_GROUP_SIZE 250000` and `ROW_GROUPS_PER_FILE 20`; Arrow uses `min_rows_per_group = max_rows_per_group = 250000L` and `max_rows_per_file = 5e6`.
The benchmark lives in `R/10-benchmark-csv-backends.R` and logs to `benchmarks/csv-backend-log.parquet`.
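The two recipes under test could be sketched like this; the helper names and paths are illustrative, but the compression and row-group settings match the ones stated above.

```r
library(DBI)
library(duckdb)
library(arrow)

# DuckDB recipe: stream the CSV straight to zstd parquet with the
# benchmark's row-group settings. `csv_path`/`out_dir` are placeholders.
convert_with_duckdb <- function(csv_path, out_dir) {
  con <- dbConnect(duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE))
  dbExecute(con, sprintf("
    COPY (SELECT * FROM read_csv_auto('%s'))
    TO '%s'
    (FORMAT parquet, COMPRESSION zstd,
     ROW_GROUP_SIZE 250000, ROW_GROUPS_PER_FILE 20)",
    csv_path, out_dir))
}

# Arrow recipe with the matching settings.
convert_with_arrow <- function(csv_path, out_dir) {
  open_dataset(csv_path, format = "csv") |>
    write_dataset(out_dir, format = "parquet", compression = "zstd",
                  min_rows_per_group = 250000L,
                  max_rows_per_group = 250000L,
                  max_rows_per_file = 5e6)
}
```

Neither function materialises the CSV in memory: DuckDB streams through its CSV reader, and Arrow scans the file as a dataset, which is what keeps peak RSS flat on files far larger than RAM.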
```{r backend-table}
#| echo: false
if (nrow(backend_log) > 0) {
  backend_log |>
    transmute(
      label,
      approach,
      csv_gb = round(csv_gb, 1),
      parquet_gb = round(parquet_gb, 2),
      n_files,
      seconds = round(seconds, 1),
      rss_peak_gb = round(rss_peak_gb, 2)
    ) |>
    arrange(label, approach) |>
    kable()
}
```
```{r backend-summary}
#| echo: false
if (nrow(backend_log) > 0) {
  backend_log |>
    group_by(approach) |>
    summarise(
      files = n(),
      mean_seconds = round(mean(seconds), 1),
      median_seconds = round(median(seconds), 1),
      mean_parquet_gb = round(mean(parquet_gb), 2),
      mean_rss_gb = round(mean(rss_peak_gb), 2),
      .groups = "drop"
    ) |>
    kable()
}
```
The result is not close:
- **DuckDB was about 2× faster on average.** Across the six files, Arrow averaged `130.0s`; DuckDB averaged `67.3s`.
- **DuckDB wrote slightly smaller parquet.** Mean output size was `8.57 GB` for Arrow and `8.21 GB` for DuckDB on the same source files.
- **Both stayed within the machine budget.** Arrow's peak RSS averaged `2.41 GB`; DuckDB's averaged `2.73 GB`.
That is strong enough to change the project default. For this corpus, on this machine, **DuckDB is the better CSV → parquet converter** — and it is the recipe recommended in [Reading and storing](01-reading-storing.qmd#step-1-csv-to-parquet-using-duckdb).
## Per-file conversion log
Every file run through the production DuckDB conversion pipeline, with elapsed time, input size, and peak RSS.
```{r convert-table}
#| echo: false
if (nrow(convert_log) > 0) {
  convert_log |>
    filter(!is.na(seconds), !is.na(csv_gb)) |>
    transmute(
      label,
      approach,
      file,
      csv_gb = round(csv_gb, 2),
      parquet_gb = round(parquet_gb, 2),
      ratio = round(parquet_gb / csv_gb, 2),
      seconds = round(seconds, 1),
      peak_rss_gb = round(rss_peak_gb, 2)
    ) |>
    arrange(desc(csv_gb)) |>
    head(15) |>
    kable()
} else {
  cat("(convert log not yet available — run R/01-convert-csv-to-parquet.R)")
}
```
The headline row is the 44.5 GB MBS 2015 file converted to 18.7 GB of parquet in about 2 minutes with peak resident-set size under 2 GB. That is the whole point of the out-of-core path, regardless of engine: the file is an order of magnitude larger than the memory budget, and the budget never flinches.
```{r convert-plot}
#| echo: false
#| fig-width: 7
#| fig-height: 4
if (nrow(convert_log) > 0) {
  convert_log |>
    filter(!is.na(seconds), !is.na(csv_gb)) |>
    ggplot(aes(csv_gb, seconds)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE, linewidth = 0.4) +
    scale_x_log10("CSV size (GB)") +
    scale_y_log10("Conversion time (s)") +
    labs(title = "Conversion time scales linearly with CSV size",
         subtitle = "Out-of-core CSV → parquet conversion") +
    theme_minimal(base_size = 11)
}
```
Across the files we converted, wall time is essentially linear in CSV size and peak RSS stays modest.
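Given that linearity, the slope of a simple linear fit doubles as a throughput estimate. A quick check one could run on the log (a sketch, assuming the `seconds` and `csv_gb` columns shown above):

```r
# Hypothetical throughput check: fit seconds ~ csv_gb on the conversion
# log and report GB/s plus the fit quality.
throughput_summary <- function(log) {
  fit <- lm(seconds ~ csv_gb, data = log)
  list(
    gb_per_sec = 1 / coef(fit)[["csv_gb"]],  # invert s/GB to GB/s
    r_squared  = summary(fit)$r.squared
  )
}
```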
## Parquet layout variants
Parquet is not one format; it is a family of layouts, and the layout matters. We rewrote the 1.8 GB `itr_context` parquet in a few variants and reran one year-filter query plus one full groupby in both engines.
```{r layout-table}
#| echo: false
if (nrow(layout_log) > 0) {
  size_rows <- layout_log |>
    filter(task %in% c("layout", "write")) |>
    transmute(variant, size_gb) |>
    distinct(variant, .keep_all = TRUE)
  timing_rows <- layout_log |>
    filter(task %in% c("year_filter", "full_groupby")) |>
    transmute(variant, metric = paste(task, engine, sep = "_"), seconds) |>
    pivot_wider(names_from = metric, values_from = seconds)
  size_rows |>
    left_join(timing_rows, by = "variant") |>
    mutate(
      variant = recode(
        variant,
        baseline_current = "current layout",
        zstd_rg250k = "zstd + 250k row groups",
        year_part_zstd_rg250k = "year partitions + zstd + 250k",
        year_part_zstd_rg10k = "year partitions + zstd + 10k"
      ),
      across(where(is.double), ~ round(.x, 3))
    ) |>
    arrange(match(
      variant,
      c("current layout", "zstd + 250k row groups",
        "year partitions + zstd + 250k", "year partitions + zstd + 10k")
    )) |>
    kable()
}
```
Three takeaways:
- **Explicit zstd is worth it.** On `itr_context`, the current layout was 1.82 GB; the explicit `zstd + 250k` rewrite was 1.22 GB.
- **Year partitioning helps year filters when the table is actually multi-year.** On the same table, DuckDB's `incm_yr == 2015` count fell from about `0.080s` on the current layout to `0.028s` on the year-partitioned layout; Arrow moved from `0.076s` to `0.066s`.
- **Tiny row groups are bad.** The `10k`-row-group variant made the full groupby much slower: Arrow went from about `0.149s` to `0.515s`, DuckDB from `0.100s` to `0.235s`.
The absolute numbers are small because `itr_context` is only 1.8 GB, but the direction is the point. The practical rule: **use zstd, keep row groups medium-sized, and only partition on columns you actually filter on a lot.** For PLIDA that usually means year on multi-year panel-style tables; it does _not_ mean inventing partitions for a single-year fact table like MBS 2015.
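The winning layout can be produced in one DuckDB `COPY`. A hedged sketch, wrapped as a hypothetical helper so the paths stay illustrative (`incm_yr` is the partition column from the text):

```r
library(DBI)
library(duckdb)

# Hypothetical helper: rewrite a parquet table with zstd compression,
# 250k-row row groups, and hive-style partitions on `part_col`.
rewrite_partitioned <- function(src_glob, out_dir, part_col) {
  con <- dbConnect(duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE))
  dbExecute(con, sprintf("
    COPY (SELECT * FROM read_parquet('%s'))
    TO '%s'
    (FORMAT parquet, COMPRESSION zstd,
     ROW_GROUP_SIZE 250000, PARTITION_BY (%s))",
    src_glob, out_dir, part_col))
}

# e.g. rewrite_partitioned("parquet/itr_context/*.parquet",
#                          "parquet/itr_context_by_year", "incm_yr")
```

`PARTITION_BY` writes one `incm_yr=<year>/` directory per distinct value, which is exactly what lets the year-filter queries above skip whole files.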
## DuckDB ingest: materialised tables and views
Once parquet exists, the two ways to expose it to DuckDB are views (zero-copy, re-read on every query) and materialised tables (one-off ingest, native storage).
Views build in a few milliseconds each. Tables build in time linear in the input size:
```{r build-table}
#| echo: false
if (nrow(build_log) > 0) {
  build_log |>
    filter(approach == "duckdb-table") |>
    transmute(label,
              seconds = round(seconds, 1),
              peak_rss_gb = round(rss_peak_gb, 2),
              db_size_gb = round(db_size_gb, 2),
              n_rows = comma(n_rows)) |>
    kable()
}
```
Two observations:
1. **MBS dominates build time.** The 371 million MBS rows take about two minutes to ingest — about 17 MB/s from parquet into DuckDB storage. Everything else is near-instant by comparison.
2. **Peak RSS during ingest is bounded** at roughly the `memory_limit` set on the connection (here 20 GB). We set this explicitly because otherwise DuckDB will happily claim most of free RAM and collide with R.
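The two ingest styles, with the explicit memory cap, could be sketched as follows; the helper, table name, and paths are illustrative, not the project's actual ingest script.

```r
library(DBI)
library(duckdb)

# Hypothetical sketch of both ingest styles against one parquet source.
ingest <- function(con, parquet_glob) {
  # Cap DuckDB's memory so it cannot collide with the R session.
  dbExecute(con, "SET memory_limit = '20GB'")
  # View: zero-copy; the parquet files are re-read on every query.
  dbExecute(con, sprintf(
    "CREATE OR REPLACE VIEW mbs AS SELECT * FROM read_parquet('%s')",
    parquet_glob))
  # Table: one-off ingest into DuckDB's native storage.
  dbExecute(con, sprintf(
    "CREATE OR REPLACE TABLE mbs_tbl AS SELECT * FROM read_parquet('%s')",
    parquet_glob))
}
```

Queries against the view pay the parquet scan every time; queries against the table pay the ingest once and read native storage thereafter, which is the trade-off the table above quantifies.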
## What lives in other benchmark appendices
Storage-layout variants that affect _joins_ specifically — sorting the fact table by the join key, integer surrogate keys, `PRIMARY KEY` — are in [A3](A3-join-benchmarks.qmd). The three everyday-dplyr workloads (filter, groupby, join) across five backends are split across [A2](A2-data-manipulation-benchmarks.qmd) and [A3](A3-join-benchmarks.qmd).