A1 · Reading and storing benchmarks

CSV → parquet, parquet layout, DuckDB ingest

This appendix backs up the claims in Reading and storing raw data. Four experiments, each with timings and peak resident-set size (RSS):

  1. Arrow vs DuckDB on CSV → parquet, across six large files.
  2. Per-file conversion log and scaling with CSV size.
  3. Parquet layout variants (zstd, row-group size, partitioning).
  4. DuckDB build cost for materialised tables vs views.

All runs are on a 32 GB MacBook with local NVMe. Each backend is spawned in a fresh R subprocess so one run cannot pollute the next.

Setup

Unlike the main chapters, this appendix does not open a DuckDB connection or run any live queries. The benchmark runs happened earlier, each in a fresh R subprocess, inside the R/ scripts listed below each section (R/01-convert-csv-to-parquet.R, R/10-benchmark-csv-backends.R, and so on). Each script wrote its timings and peak-memory readings to a parquet log in benchmarks/.

The setup below just loads tidyverse plus arrow/fs for reading those logs, scales/knitr for rendering the tables and charts you see below, and R/helpers.R, which defines latest_results(): a tiny wrapper that keeps only the most recent run for each key combination, so re-running a benchmark doesn't pile up stale numbers. The four *_log objects created here are plain tibbles that the rest of the page filters and renders.

Code
library(tidyverse)
library(scales)
library(arrow)
library(fs)
library(knitr)

# Centralised paths plus a small helper (`latest_results()`) that picks
# the most recent run for each (label, ...) combination from the
# benchmark logs. See the Setup chapter.
source("R/00-paths.R")
source("R/helpers.R")

convert_log <- if (file_exists(file.path(paths$bench_dir, "convert-log.parquet"))) {
  latest_results(read_parquet(file.path(paths$bench_dir, "convert-log.parquet")),
                 c("label", "file"))
} else tibble()

backend_log <- if (file_exists(file.path(paths$bench_dir, "csv-backend-log.parquet"))) {
  latest_results(read_parquet(file.path(paths$bench_dir, "csv-backend-log.parquet")),
                 c("approach", "label"))
} else tibble()

layout_log <- if (file_exists(file.path(paths$bench_dir, "parquet-layout-log.parquet"))) {
  latest_results(read_parquet(file.path(paths$bench_dir, "parquet-layout-log.parquet")),
                 c("task", "variant", "engine"))
} else tibble()

build_log <- if (file_exists(file.path(paths$bench_dir, "duckdb-build-log.parquet"))) {
  latest_results(read_parquet(file.path(paths$bench_dir, "duckdb-build-log.parquet")),
                 c("task", "approach", "label"))
} else tibble()
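
The real latest_results() lives in R/helpers.R; a minimal sketch of what such a helper might look like, assuming each log row carries a `run` timestamp column written at benchmark time (that column name is an assumption, not something this page verifies):

```r
# Hypothetical sketch of latest_results(). For each combination of the
# key columns, keep only the row with the most recent `run` timestamp.
latest_results <- function(log, keys) {
  log |>
    dplyr::group_by(dplyr::across(dplyr::all_of(keys))) |>
    dplyr::slice_max(run, n = 1, with_ties = FALSE) |>
    dplyr::ungroup()
}
```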

Arrow vs DuckDB on CSV → parquet

The benchmark converts six genuinely large source files from this corpus — three MBS yearly claim files (48-51 GB each) and three STP yearly payroll files (48 GB each) — with each backend in turn. Both sides write zstd-compressed parquet with medium row groups and medium-sized output files. DuckDB uses ROW_GROUP_SIZE 250000 and ROW_GROUPS_PER_FILE 20; Arrow uses min_rows_per_group = max_rows_per_group = 250000L and max_rows_per_file = 5e6.

The benchmark lives in R/10-benchmark-csv-backends.R and logs to benchmarks/csv-backend-log.parquet.
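
The two recipes look roughly like this; the real calls live in R/10-benchmark-csv-backends.R, and the function names and paths below are placeholders for illustration:

```r
library(arrow)
library(duckdb)

# Arrow: stream the CSV and write multi-file zstd parquet,
# with the row-group settings quoted above.
convert_with_arrow <- function(csv_path, out_dir) {
  open_dataset(csv_path, format = "csv") |>
    write_dataset(
      out_dir,
      format             = "parquet",
      compression        = "zstd",
      min_rows_per_group = 250000L,
      max_rows_per_group = 250000L,
      max_rows_per_file  = 5e6
    )
}

# DuckDB: a single COPY statement. ROW_GROUPS_PER_FILE 20 starts a new
# file every 20 row groups, which keeps output files medium-sized.
convert_with_duckdb <- function(csv_path, out_dir) {
  con <- dbConnect(duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE))
  dbExecute(con, sprintf(
    "COPY (SELECT * FROM read_csv_auto('%s'))
     TO '%s' (FORMAT parquet, COMPRESSION zstd,
              ROW_GROUP_SIZE 250000, ROW_GROUPS_PER_FILE 20)",
    csv_path, out_dir
  ))
}
```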

label        approach       csv_gb   parquet_gb   n_files   seconds   rss_peak_gb
mbs-2018     arrow-stream     48.1        11.85        81     136.8          2.69
mbs-2018     duckdb-copy      48.1        11.51        67      81.7          2.99
mbs-2019     arrow-stream     49.3        12.14        83     137.4          2.66
mbs-2019     duckdb-copy      49.3        11.80        69      80.1          2.69
mbs-2020     arrow-stream     50.6        12.43        85     136.0          0.81
mbs-2020     duckdb-copy      50.6        12.08        70      82.5          2.67
stp-fy2223   arrow-stream     48.1         5.00       140     121.6          2.70
stp-fy2223   duckdb-copy      48.1         4.60       123      53.0          2.70
stp-fy2324   arrow-stream     48.1         5.01       140     121.7          2.60
stp-fy2324   duckdb-copy      48.1         4.62       121      52.7          2.70
stp-fy2425   arrow-stream     48.1         5.02       140     124.8          2.99
stp-fy2425   duckdb-copy      48.1         4.63       122      53.7          2.61

approach       files   mean_seconds   median_seconds   mean_parquet_gb   mean_rss_gb
arrow-stream       6          129.7            130.4              8.57          2.41
duckdb-copy        6           67.3             66.9              8.21          2.73

The result is not close:

  • DuckDB was about 2× faster on average. Across the six files, Arrow averaged 129.7s; DuckDB averaged 67.3s.
  • DuckDB wrote slightly smaller parquet. Mean output size was 8.57 GB for Arrow and 8.21 GB for DuckDB on the same source files.
  • Both stayed within the machine budget. Arrow’s peak RSS averaged 2.41 GB; DuckDB’s averaged 2.73 GB.

That is strong enough to change the project default. For this corpus, on this machine, DuckDB is the better CSV → parquet converter — and it is the recipe recommended in Reading and storing.

Per-file conversion log

Every file run through the conversion pipeline, with elapsed time, input size, and peak RSS. The runs logged here all used the arrow-stream recipe.

label           approach       file                                        csv_gb  parquet_gb  ratio  seconds  peak_rss_gb
stp-fy2324      arrow-stream   madipge-ato-d-stp-fy2324.csv                 48.14        8.59   0.18    118.8         0.55
stp-fy2425      arrow-stream   madipge-ato-d-stp-fy2425.csv                 48.14        8.61   0.18    118.9         0.56
stp-fy2223      arrow-stream   madipge-ato-d-stp-fy2223.csv                 48.11        8.56   0.18    118.2         0.44
stp-fy2122      arrow-stream   madipge-ato-d-stp-fy2122.csv                 48.10        8.53   0.18    114.8         0.55
stp-fy2021      arrow-stream   madipge-ato-d-stp-fy2021.csv                 48.08        8.50   0.18    113.9         0.44
dhda-mbs-2015   arrow-stream   madipge-mbs-d-claims-2015.csv                44.54       18.73   0.42    122.3         1.76
demo            arrow-stream   plidage-core-demog-cb-c21-2006-latest.csv     0.84        0.24   0.29      2.4         0.60
ato-pit_itr     arrow-stream   madipge-ato-d-ded-exp-off-fy2324.csv          0.77        0.40   0.52      2.5         1.59
ato-pit_itr     arrow-stream   madipge-ato-d-whld-debt-fy2324.csv            0.77        0.54   0.70      2.5         1.76
stp-jobs        arrow-stream   madipge-ato-d-stp-jobs-2020-current.csv       0.76        0.20   0.26      1.8         0.61
ato-pit_itr     arrow-stream   madipge-ato-d-ded-exp-off-fy2223.csv          0.75        0.39   0.52      2.5         1.58
ato-pit_itr     arrow-stream   madipge-ato-d-whld-debt-fy2223.csv            0.75        0.53   0.71      2.4         1.76
abs-spine       arrow-stream   abs-spine.csv                                 0.71        0.48   0.68      1.2         1.04
ato-spine       arrow-stream   ato-spine.csv                                 0.71        0.48   0.68      1.2         1.26
ato-spine-stp   arrow-stream   ato-spine.csv                                 0.71        0.48   0.68      1.2         1.02

The headline row is the 44.5 GB MBS 2015 file converted to 18.7 GB of parquet in about two minutes with peak resident-set size under 2 GB. That is the whole point of the out-of-core path, regardless of engine: the file is more than twenty times larger than the memory actually used, and the memory budget is never threatened.

Across the files we converted, wall time is essentially linear in CSV size and peak RSS stays modest.
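
One quick way to check the linearity claim against the log itself (this assumes the convert_log tibble loaded in Setup is non-empty):

```r
# Fit wall time against CSV size across the logged conversions.
# The slope is seconds of wall time per GB of CSV; the intercept is
# the fixed per-file startup cost.
fit <- lm(seconds ~ csv_gb, data = convert_log)
coef(fit)
summary(fit)$r.squared   # close to 1 if the relationship is linear
```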

Parquet layout variants

Parquet is not one format; it is a family of layouts, and the layout matters. We rewrote the 1.8 GB itr_context parquet in a few variants and reran one year-filter query plus one full groupby in both engines.

variant                          size_gb   full_groupby_arrow   full_groupby_duckdb   year_filter_arrow   year_filter_duckdb
current layout                     1.816                0.211                 0.125               0.076                0.080
zstd + 250k row groups             1.215                0.146                 0.097               0.237                0.049
year partitions + zstd + 250k      1.216                0.149                 0.100               0.066                0.028
year partitions + zstd + 10k       1.412                0.515                 0.235               0.098                0.043

All query columns are elapsed seconds.

Three takeaways:

  • Explicit zstd is worth it. On itr_context, the current layout was 1.82 GB; the explicit zstd + 250k rewrite was 1.22 GB.
  • Year partitioning helps year filters when the table is actually multi-year. On the same table, DuckDB’s incm_yr == 2015 count fell from about 0.080s on the current layout to 0.028s on the year-partitioned layout; Arrow moved from 0.076s to 0.066s.
  • Tiny row groups are bad. The 10k-row-group variant made the full groupby much slower: Arrow went from about 0.149s to 0.515s, DuckDB from 0.100s to 0.235s.

The absolute numbers are small because itr_context is only 1.8 GB, but the direction is the point. The practical rule: use zstd, keep row groups medium-sized, and only partition on columns you actually filter on a lot. For PLIDA that usually means year on multi-year panel-style tables; it does not mean inventing partitions for a single-year fact table like MBS 2015.
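
Applying that rule with arrow might look like the sketch below; the output path is a placeholder, and incm_yr is the year column from the filter query above:

```r
library(arrow)

# Rewrite a dataset as zstd parquet with medium row groups, partitioned
# by year. Only worth it for tables that are genuinely multi-year.
open_dataset("itr_context/") |>
  write_dataset(
    "itr_context_by_year/",          # placeholder output path
    partitioning       = "incm_yr",
    format             = "parquet",
    compression        = "zstd",
    min_rows_per_group = 250000L,
    max_rows_per_group = 250000L
  )
```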

DuckDB ingest: materialised tables and views

Once parquet exists, the two ways to expose it to DuckDB are views (zero-copy, re-read on every query) and materialised tables (one-off ingest, native storage).
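
In DuckDB SQL each option is one statement; the database file, table names, and parquet paths below are placeholders:

```r
library(duckdb)
con <- dbConnect(duckdb("warehouse.duckdb"))  # placeholder database file

# View: zero-copy; the parquet is re-read on every query.
dbExecute(con, "
  CREATE OR REPLACE VIEW demo AS
  SELECT * FROM read_parquet('parquet/demo/*.parquet')
")

# Table: one-off ingest into DuckDB's native storage.
dbExecute(con, "
  CREATE OR REPLACE TABLE demo_tbl AS
  SELECT * FROM read_parquet('parquet/demo/*.parquet')
")
dbDisconnect(con, shutdown = TRUE)
```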

Views build in a few milliseconds per view. Tables build in time linear in the input size:

label                        seconds   peak_rss_gb   db_size_gb        n_rows
abs_spine                        2.8          1.86         0.60    30,000,000
ato_spine                        2.4          2.13         0.90    30,000,000
ato_spine_stp                    4.6          8.04        68.77            NA
demo                             2.4          1.83         0.30    30,000,000
dhda_spine                       3.4          2.45         1.20    30,000,000
itr                             11.0          4.55         2.98   169,966,495
itr_ded                         14.3          9.09         6.52   155,700,807
itr_inc                         12.3         12.24         9.31   155,700,807
itr_whld                        16.2         12.24        12.28   155,700,807
mbs                            116.1          9.46        26.65   371,739,667
stp year 1 (create + sort)     155.7          4.96        34.98            NA
stp year 2 (insert + sort)     164.0          6.42        43.32            NA
stp year 3 (insert + sort)     163.3          6.50        51.65            NA
stp year 4 (insert + sort)     166.4          6.03        59.98            NA
stp year 5 (insert + sort)     162.8          6.54        68.31            NA
stp_jobs                         3.1          7.97        68.47            NA

Two observations:

  1. MBS dominates build time. The 371 million MBS rows take about two minutes to ingest, roughly 3.2 million rows per second into DuckDB storage. Everything else is near-instant by comparison.
  2. Peak RSS during ingest stays below the memory_limit set on the connection (here 20 GB). We set this explicitly because otherwise DuckDB will happily claim most of free RAM and collide with R.
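
Setting the cap explicitly is one PRAGMA on the connection; the 20 GB figure matches the runs above:

```r
library(duckdb)
con <- dbConnect(duckdb())
# Cap DuckDB's memory so a large ingest cannot crowd out the R session.
dbExecute(con, "SET memory_limit = '20GB'")
# Confirm the setting took effect.
dbGetQuery(con, "SELECT current_setting('memory_limit') AS memory_limit")
dbDisconnect(con, shutdown = TRUE)
```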

What lives in other benchmark appendices

Storage-layout variants that affect joins specifically — sorting the fact table by the join key, integer surrogate keys, PRIMARY KEY — are in A3. The three everyday-dplyr workloads (filter, groupby, join) across five backends are split across A2 and A3.