---
title: "A2 · Data manipulation benchmarks"
subtitle: "Filter + count (T1) and groupby (T2) across five backends"
---
This appendix backs up the claims in [Data manipulation](02-data-manipulation.qmd). Two tasks, each run against five backends — **dplyr on in-memory tibbles, data.table, arrow datasets, DuckDB views over parquet, DuckDB materialised tables** — with each backend in a fresh R subprocess (`R/04-run-benchmarks.R`) so that memory from a prior run cannot pollute the next measurement.
## Setup
No DuckDB connection on this page — the runs happened earlier, each in a fresh R subprocess, driven by `R/04-run-benchmarks.R`. That script ran T1 (filter + count) and T2 (groupby) against five backends, recorded the wall-clock time and peak memory of each, and wrote them all to `benchmarks/benchmark-log.parquet`. This setup chunk just reads that log (via `latest_results()` from `R/helpers.R`, which keeps only the most recent run for each task × approach pair), and defines a couple of small lookups — the ordered set of approaches and the plain-English task labels — that the tables and charts below use.
```{r setup}
library(tidyverse)
library(scales)
library(arrow)
library(fs)
library(knitr)
# Paths and helpers. `latest_results()` keeps only the most recent run
# for each (task, approach) pair from the benchmark log.
source("R/00-paths.R")
source("R/helpers.R")
log_path <- file.path(paths$bench_dir, "benchmark-log.parquet")
bench <- if (file_exists(log_path)) {
  latest_results(read_parquet(log_path), c("task", "approach"))
} else tibble()
approach_levels <- c("dplyr", "datatable", "arrow", "duckdb_view", "duckdb_table")
task_labels <- c(
  t1 = "T1: filter + count on 1 year of ITR",
  t2 = "T2: groupby count on all years of ITR"
)
bench_t12 <- bench |> filter(task %in% c("t1", "t2"))
```
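For readers who don't want to open `R/helpers.R`, a minimal sketch of what `latest_results()` is assumed to do: keep, for each key combination, only the newest row. The `timestamp` column name is an assumption here; the real helper may differ.

```r
library(dplyr)

# Sketch only — the real implementation lives in R/helpers.R.
# Assumes the benchmark log carries a `timestamp` column.
latest_results_sketch <- function(log, keys) {
  log |>
    group_by(across(all_of(keys))) |>
    slice_max(timestamp, n = 1, with_ties = FALSE) |>
    ungroup()
}
```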
## The tasks
- **T1 filter + count.** One year of ITR (~8 M rows), count rows with non-null occupation code. The easy one; included to get a baseline for "small file, nothing complicated".
- **T2 groupby count.** All ~10 years of ITR pooled (~170 M rows), count rows per occupation group. A full scan of the biggest (but column-narrow) table, exercising the groupby operator.
The three-way MBS join (T3) is in [A3](A3-join-benchmarks.qmd).
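Expressed as dplyr verbs, both tasks are tiny — which is the point: the same pipeline runs against every backend, and only the table handle changes. The column names (`occ_code`, `occ_group`) are illustrative stand-ins for the real ITR schema, with toy tibbles in place of the real data.

```r
library(dplyr)

# Toy stand-ins for the real ITR tables; in the benchmark the handle
# is a tibble, data.table, arrow dataset, or duckdb lazy table instead.
itr_one_year  <- tibble(occ_code  = c("A", NA, "B"))
itr_all_years <- tibble(occ_group = c("x", "x", "y"))

# T1: filter + count on one year
t1 <- itr_one_year |>
  filter(!is.na(occ_code)) |>
  count()

# T2: groupby count over all years
t2 <- itr_all_years |>
  count(occ_group)
```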
## Results
```{r bench-table}
#| echo: false
if (nrow(bench_t12) > 0) {
  bench_t12 |>
    mutate(approach = factor(approach, levels = approach_levels)) |>
    arrange(task, approach) |>
    transmute(
      task,
      approach,
      seconds = round(seconds, 2),
      peak_rss_gb = round(rss_after_gb, 2),
      rss_delta_gb = round(rss_delta_gb, 2),
      n_result,
      error = if_else(is.na(error), "", substr(error, 1, 60))
    ) |>
    kable()
} else {
  cat("(no benchmarks yet — run `Rscript R/04-run-benchmarks.R`)")
}
```
```{r bench-plot}
#| echo: false
#| fig-width: 8
#| fig-height: 5
if (nrow(bench_t12) > 0) {
  bench_t12 |>
    filter(is.na(error)) |>
    mutate(
      approach = factor(approach, levels = approach_levels),
      task = factor(task, levels = names(task_labels), labels = task_labels)
    ) |>
    ggplot(aes(approach, seconds, fill = approach)) +
    geom_col() +
    facet_wrap(~ task, scales = "free_y", ncol = 1) +
    scale_y_continuous("Seconds") +
    labs(x = NULL) +
    theme_minimal(base_size = 11) +
    theme(
      legend.position = "none",
      axis.text.x = element_text(angle = 20, hjust = 1)
    )
}
```
### T1: filter + count on one year of ITR
Small-file regime. Every backend handles this comfortably, and the raw speed difference is mostly down to fixed costs — starting a DuckDB connection and opening a parquet file are of a similar order of magnitude to the actual work. The key story here is _memory_ rather than time: dplyr and data.table both materialise the whole year's worth of ITR (~250 MB in memory), while arrow and duckdb stream and hold almost nothing in R memory.
This is the regime where the choice of backend genuinely does not matter for performance, and you might as well reach for the most ergonomic API (dbplyr, in my view).
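As a concrete example of that ergonomic option, T1 through dbplyr over DuckDB might look like the sketch below. The parquet path and column name are illustrative, not the project's real ones.

```r
library(DBI)
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# Lazy reference to the parquet file; nothing is read into R yet.
itr <- tbl(con, sql("SELECT * FROM read_parquet('data/itr-2019.parquet')"))

# Ordinary dplyr verbs, translated to SQL and executed inside DuckDB;
# only the single-row count crosses into R at collect().
itr |>
  filter(!is.na(occ_code)) |>
  count() |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```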
### T2: groupby count on all ITR years
Medium regime. Now we are scanning ~170 M rows, about 2–3 GB of parquet. dplyr-on-tibbles is paying the full materialisation cost — it has to load every row into R memory before grouping. data.table is the same story (arrow → data.table conversion is a full copy). Arrow and DuckDB both stream the scan and only materialise the ~1000-row result.
The DuckDB table version is meaningfully faster than the DuckDB view version because DuckDB's own storage is denser and the groupby runs against column stripes directly, without going through the parquet decoder on every query.
## Peak memory
```{r rss-plot}
#| echo: false
#| fig-width: 8
#| fig-height: 4
if (nrow(bench_t12) > 0) {
  bench_t12 |>
    filter(is.na(error)) |>
    mutate(
      approach = factor(approach, levels = approach_levels),
      task = factor(task, levels = names(task_labels), labels = task_labels)
    ) |>
    ggplot(aes(approach, rss_after_gb, fill = approach)) +
    geom_col() +
    facet_wrap(~ task, ncol = 2) +
    scale_y_continuous("Peak RSS after task (GB)") +
    labs(x = NULL, title = "Memory footprint after each task") +
    theme_minimal(base_size = 11) +
    theme(
      legend.position = "none",
      axis.text.x = element_text(angle = 20, hjust = 1)
    )
}
```
The memory plot is arguably the more important one. For the small task (T1), every backend is fine. For the medium task (T2), dplyr and data.table are already paying several GB for something that DuckDB handles in a few hundred MB.
## Walkthrough timings
A separate experiment walked through the same dplyr pipeline across DuckDB and Arrow, verb by verb, collecting per-chunk wall-clock times. The results reinforce the T1/T2 story and add a useful set of intermediate operations.
| Operation (on a big table) | DuckDB | Arrow |
|---|---|---|
| filter + count on one year (~10M) | sub-second | sub-second |
| groupby on all years (~170M) | ~1s | ~1s |
| mutate + arrange on 30M demographic rows | ~1s | ~1s |
The short version: **for scans, filters, mutates, and groupbys, DuckDB and Arrow are indistinguishable** on this corpus. The differences only open up once joins and windows enter the picture — covered in [A3](A3-join-benchmarks.qmd) and [A4](A4-window-function-benchmarks.qmd).
Two further findings from the walkthrough that belong here:
- **`as.integer()` does not mean the same thing in Arrow and DuckDB.** Arrow truncates; DuckDB rounds half-to-even. Using `as.integer(year_of_birth / 10) * 10` silently produces different decade counts. `floor(x / 10) * 10` works consistently across both. The discussion is in [Data manipulation § The `as.integer()` trap](02-data-manipulation.qmd#the-as-integer-trap-truncate-vs-round).
- **`%/%` and `%%` are unreliable through Arrow's translator.** Use `floor(x / y)` and explicit modulus expressions for portability.
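A minimal base-R illustration of the first trap (base R and Arrow truncate on `as.integer()`, while DuckDB rounds, so the same dplyr code yields different decades through each translator):

```r
x <- 1948

as.integer(x / 10) * 10  # truncation (base R and Arrow): 1940
round(x / 10) * 10       # DuckDB-style rounding:         1950
floor(x / 10) * 10       # portable, identical everywhere: 1940
```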
## Reproducing
```bash
Rscript R/04-run-benchmarks.R
```
Writes `benchmarks/benchmark-log.parquet`, which this page consumes.