---
title: "Data manipulation"
subtitle: "dplyr verbs against a DuckDB database"
---
## Setup
Load `tidyverse` plus the handful of packages DuckDB needs (`duckdb`, `DBI`, `dbplyr`), and set a memory and thread budget so DuckDB does not grab all of the machine. There are two ways to point DuckDB at our data — **from a DuckDB database** or **from parquet directly**. They give you the same lazy `tbl()` handles and the analysis code below works identically against either.
::: {.panel-tabset}
## From a DuckDB database (default)
Use the materialised `plida_tables.duckdb` we built in [Reading and storing](01-reading-storing.qmd). Open it **read-only** so you can have multiple R sessions attached at once (handy when you are iterating in one session and rendering a Quarto document in another). This is what every chapter and case study from here on assumes.
```{r setup}
library(tidyverse)
library(scales)
library(arrow) # used below in the "Arrow as a secondary engine" section
library(duckdb)
library(DBI)
library(dbplyr)
# Centralised paths for this project — a short file that just defines
# where the DuckDB and parquet folders live. See the Setup chapter.
source("R/00-paths.R")
data_dir <- paths$duckdb_dir # replace with your own path
con <- dbConnect(
duckdb::duckdb(),
dbdir = file.path(data_dir, "plida_tables.duckdb"),
read_only = TRUE
)
dbExecute(con, "SET memory_limit = '30GB'") # roughly 90% of your machine's RAM
dbExecute(con, "SET threads TO 9") # your number of cores minus 1
```
## From parquet, no database file
If you have parquet on disk but haven't ingested it into a DuckDB database yet, you can skip the database entirely. Open an **in-memory** DuckDB (pass no `dbdir =`) and hand `tbl()` a bit of SQL that tells DuckDB where to read the parquet from. Everything downstream — `filter`, `summarise`, `inner_join`, `lag`, `window_frame`, the works — is identical. One line per table.
```{r setup-parquet}
#| eval: false
library(tidyverse)
library(scales)
library(duckdb)
library(DBI)
library(dbplyr)
library(glue)
source("R/00-paths.R")
pq <- paths$parquet_root # replace with your own parquet folder
con <- dbConnect(duckdb::duckdb()) # in-memory DuckDB, no file
dbExecute(con, "SET memory_limit = '30GB'")
dbExecute(con, "SET threads TO 9")
# Point each lazy `tbl()` at a parquet glob. No CREATE TABLE, no ingest step.
itr_db <- tbl(con, sql(glue("SELECT * FROM read_parquet('{pq}/itr_context/*.parquet')")))
demo_db <- tbl(con, sql(glue("SELECT * FROM read_parquet('{pq}/demo/*.parquet')")))
# ...one line per table you need.
```
Views over parquet are a touch slower than a materialised table on repeated joins (see [A3](A3-join-benchmarks.qmd#t3-three-table-join)), but for one-off exploration or scripts that already have parquet on disk, this is the fastest path from zero to querying.
:::
```{r scaffold}
#| include: false
# Internal rendering scaffolding for the `#| time_it: "..."` chunk option
# used below. Readers can ignore this.
library(tictoc); library(knitr)
timings <- tibble()
knitr::knit_hooks$set(time_it = function(before, options, envir) {
if (before) {
gc(full = TRUE, verbose = FALSE)
assign(".time_it_t0", Sys.time(), envir = envir)
} else {
secs <- as.numeric(difftime(Sys.time(), get(".time_it_t0", envir = envir), units = "secs"))
task <- options$time_it
approach <- if (!is.null(options$approach)) options$approach
else if (grepl("-duckdb$", options$label)) "duckdb"
else if (grepl("-arrow$", options$label)) "arrow"
else NA_character_
timings <<- bind_rows(timings, tibble(
task = task, approach = approach, seconds = secs
))
sprintf("\n::: {.callout-note appearance='minimal' collapse='false'}\n⏱️ **%s** (%s): **%.3fs**\n:::\n", task, approach, secs)
}
})
pq <- paths$parquet_root
```
## Why are we doing this?
With a _normal_ in-memory dataset, we would just read it in and do some things to it:
```{r normal}
#| eval: false
read_csv("some-data.csv") |>
rename_with(tolower) |>
mutate(occ_major = as.integer(floor(ocptn_grp_cd / 1000))) |>
count(occ_major) |>
arrange(occ_major)
```
This usually works great, and the `tidyverse` syntax lets us write human-readable and expressive code that does the thing we want it to do. The problem arises when the data is too big for our computer's memory. It either won't be read in (you will encounter an error or your computer will explode) or it will take forever to do simple tasks.
Luckily, we can change a few things here and there and apply our neat code to enormous datasets that would never fit in 64GB of memory (or even 256GB, even if you could afford it). Here we first 'connect' to our dataset and _then_ run the same `tidyverse` code over it. At the end we use `collect()` to send the job to DuckDB, which does the processing out of R's memory. Shortly after, we have our result.
```{r big-data-example}
#| time_it: "filter + mutate + count"
itr_db <- tbl(con, "itr") # <1>
itr_db |>
rename_with(tolower) |>
filter(incm_yr == 2015L, !is.na(ocptn_grp_cd)) |>
mutate(occ_major = as.integer(floor(ocptn_grp_cd / 1000))) |>
count(occ_major) |>
arrange(occ_major) |>
collect() # <2>
```
1. Connect lazily to the `itr` table — nothing is read yet.
2. Execute the whole pipeline inside DuckDB and return the small result to R as a tibble.
Once the data is in parquet and DuckDB, the day-to-day work is ordinary `dplyr`: `filter`, `mutate`, `group_by`, `summarise`, `count`, `arrange`. The verbs look the same as on a tibble. The difference is where the computation happens — **the dplyr code is translated to SQL and executed inside DuckDB** — and that difference is the reason you can run these pipelines over 170-million-row tables without the laptop noticing.
This chapter covers the everyday verbs, the small translation traps that bite once, and the escape hatches for the things dplyr does not cover. Joins and window functions get their own chapters. The evidence behind the speed and memory claims is in [Appendix A2](A2-data-manipulation-benchmarks.qmd).
## The mental model: build the pipe, then `collect()` at the end
`tbl(con, "name")` returns a **lazy** handle to a DuckDB table. "Lazy" means nothing actually runs when you apply a verb — each verb just adds a step to a pipeline description that DuckDB will run later. The actual work only happens when you call `collect()`, which executes the whole pipeline and pulls the result back into R as a tibble. (There is also `compute()`, which runs the pipeline but leaves the result inside DuckDB as a table.) The same pattern applies to Arrow: `open_dataset()` returns a lazy object that you build up with verbs and `collect()` at the end.
```{r read-duckdb}
# rename_with(tolower) lowercases SCREAMING_CASE columns so the rest of the
# pipe can stay in snake_case. It is lazy — no data moves.
itr_db <- tbl(con, "itr") |> rename_with(tolower)
demo_db <- tbl(con, "demo") |> rename_with(tolower)
ato_sp_db <- tbl(con, "ato_spine") |> rename_with(tolower)
dhda_sp_db <- tbl(con, "dhda_spine") |> rename_with(tolower)
```
Two key rules for getting this right:
- **Keep the pipe lazy until the last step.** A `collect()` in the middle of a pipeline forces the work so far back into R memory, and every verb after it runs in R rather than in DuckDB. You lose the speed advantage. Save `collect()` for the final, small result.
- **Make the last step small.** A `collect()` that tries to pull billions of rows into R will crash your session. Aggregate or `slice_head()` first.
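The lazy/`collect()`/`compute()` distinction is easiest to see on a toy table. A minimal self-contained sketch — an in-memory DuckDB with made-up data, not the PLIDA connection (`con_demo` and `toy` are illustrative names):

```r
library(dplyr)
library(duckdb)
library(DBI)

con_demo <- dbConnect(duckdb::duckdb()) # throwaway in-memory DuckDB
dbWriteTable(con_demo, "toy", data.frame(yr = c(2014L, 2015L, 2015L), x = 1:3))

lazy <- tbl(con_demo, "toy") |> filter(yr == 2015L) # nothing has run yet
nrow(lazy) # NA — a lazy tbl does not know its size until the query executes

res <- lazy |>
  summarise(total = sum(x, na.rm = TRUE)) |>
  collect() # the whole pipeline runs inside DuckDB; `res` is an ordinary tibble
res$total # 5

staged <- compute(lazy) # runs the filter but keeps the result as a DuckDB table
# `staged` is still a lazy handle — useful for reusing an expensive intermediate

dbDisconnect(con_demo, shutdown = TRUE)
```

Nothing touches R memory until `collect()`; `compute()` is the variant you reach for when several later pipelines share the same expensive prefix.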
Two useful diagnostics on a lazy object:
```{r glimpse-db}
glimpse(itr_db)
```
```{r show_query}
# See the SQL that dbplyr would send to DuckDB
itr_db |> filter(incm_yr == 2015L) |> count(ocptn_grp_cd) |> show_query()
```
`show_query()` prints the SQL — handy for sanity-checking what DuckDB is being asked to do, and for spotting the occasional translation surprise.
## Filter and count: a single year of ITR
A single year of ITR is about 10 million rows. Count the returns filed in 2015 by ANZSCO occupation major group.
```{r filter-duckdb}
#| time_it: "filter + count (10M)"
itr_db |>
filter(incm_yr == 2015L, !is.na(ocptn_grp_cd)) |>
mutate(occ_major = as.integer(floor(ocptn_grp_cd / 1000))) |>
count(occ_major) |>
collect() |>
arrange(occ_major)
```
A fraction of a second. DuckDB uses the per-row-group min/max summaries we described in [Reading and storing](01-reading-storing.qmd#step-1-csv-to-parquet-using-duckdb) to skip every row-group that doesn't contain any 2015 rows, so only 2015 rows are actually decoded. Memory stays flat: ordinary dplyr on a tibble would have paid around 250 MB to hold the whole year of data before doing anything with it.
The T1 benchmark in [A2](A2-data-manipulation-benchmarks.qmd#t1-filter-and-count) shows this explicitly — DuckDB and Arrow both finish in well under a second while `dplyr`-on-tibbles takes 4+ seconds and loads the whole year into R memory first.
## Groupby across all years
Same shape, now over 170 million rows spanning ten financial years.
```{r groupby-duckdb}
#| time_it: "groupby (170M)"
occ_by_year_db <- itr_db |>
filter(!is.na(ocptn_grp_cd)) |>
mutate(occ_major = as.integer(floor(ocptn_grp_cd / 1000))) |>
count(incm_yr, occ_major) |>
collect()
head(occ_by_year_db)
```
```{r plot-groupby}
#| fig-width: 8
#| fig-height: 4.5
occ_labels <- c(
"1" = "Managers", "2" = "Professionals", "3" = "Technicians",
"4" = "Community/personal", "5" = "Clerical/admin",
"6" = "Sales", "7" = "Machinery operators", "8" = "Labourers"
)
occ_by_year_db |>
filter(occ_major %in% 1:8) |>
mutate(occ_label = factor(occ_labels[as.character(occ_major)], levels = occ_labels)) |>
ggplot(aes(incm_yr, n, colour = occ_label)) +
geom_line(linewidth = 0.7) +
scale_y_continuous(labels = label_comma()) +
labs(x = "Income year", y = "Tax returns filed", colour = "ANZSCO major group",
title = "Returns by occupation group over a decade",
subtitle = "170M rows scanned in under a second") +
theme_minimal(base_size = 11)
```
The T2 benchmark ([A2 §T2](A2-data-manipulation-benchmarks.qmd#t2-groupby-across-all-years)) is where the regime changes become visible. `dplyr` on tibbles takes about 35 seconds and 14.5 GB of RAM — it has to load every row before grouping. DuckDB and Arrow both stream the scan and hold only the result (~1000 rows). That is roughly a 200× speed-up and a 50× memory reduction, on identical dplyr code, for changing one thing: what `tbl` points at.
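That "change what `tbl` points at" claim can be checked end-to-end on a toy table: one pipeline definition, run once as ordinary dplyr on a tibble and once as SQL inside DuckDB (a sketch with made-up data; `pipeline`, `toy`, and `con_demo` are illustrative names):

```r
library(dplyr)
library(duckdb)
library(DBI)

# One pipeline definition, two execution engines
pipeline <- function(df) {
  df |>
    filter(!is.na(grp)) |>
    count(yr, grp) |>
    arrange(yr, grp)
}

toy <- tibble(
  yr  = c(2014L, 2014L, 2015L, 2015L, 2015L),
  grp = c(1L, 1L, 2L, 2L, NA)
)

in_r <- pipeline(toy) # ordinary dplyr, in R memory

con_demo <- dbConnect(duckdb::duckdb())
dbWriteTable(con_demo, "toy", toy)
in_db <- pipeline(tbl(con_demo, "toy")) |> collect() # same code, in DuckDB

all(in_r$n == in_db$n) # TRUE — identical answers from both engines
dbDisconnect(con_demo, shutdown = TRUE)
```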
## `mutate()`: watch for small translation differences
`mutate()` is where small differences between backends appear. Your dplyr code is being translated — into DuckDB's SQL on one side, or into Arrow's equivalent function on the other — and the translations don't always agree on the fine detail. For the everyday operations (`x + 1`, `paste0`, `case_when`, `is.na`) everything is fine. The innocuous-looking expression "round a year down to the nearest decade" is the textbook trap.
```{r mutate-duckdb}
#| time_it: "mutate + arrange (30M)"
gender_dist_db <- demo_db |>
filter(!is.na(year_of_birth)) |>
mutate(
# Use floor() explicitly — do NOT rely on as.integer() to truncate,
# because as.integer(x) means different things in R/arrow vs DuckDB.
birth_decade = as.integer(floor(year_of_birth / 10) * 10),
gender = case_when(
core_gender == "M" ~ "Male",
core_gender == "F" ~ "Female",
TRUE ~ "Other/unknown"
)
) |>
count(birth_decade, gender) |>
arrange(birth_decade, gender) |>
collect()
head(gender_dist_db, 8)
```
::: {.callout-warning}
## The `as.integer()` trap: truncate vs round
An earlier draft of this pipeline used `as.integer(year_of_birth / 10) * 10L`. It looks equivalent to the `floor()` version but gives different answers across backends:
- In R (and therefore in Arrow), `as.integer()` **truncates toward zero**: `as.integer(192.6) = 192`, so `1926` maps to decade `1920`.
- In DuckDB, dbplyr translates `as.integer(...)` to SQL's `CAST(... AS INTEGER)`, which **rounds to the nearest integer**: `CAST(192.6 AS INTEGER) = 193`, so `1926` maps to decade `1930`.
The consequence is silent and substantive. For this data, the "1920" Arrow bucket covered years 1920–1929 while the "1920" DuckDB bucket covered roughly 1915–1924, and the decade counts disagreed by tens of percent.
**When you want truncation, say so explicitly with `floor()` (or `trunc()`); do not rely on `as.integer()` to do it for you across engines.** The same caution applies to `%/%`, `%%`, and bankers' rounding — all of which have backend-specific translations.
:::
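The divergence is easy to reproduce on a two-row toy table (an illustrative sketch against an in-memory DuckDB, not the real `demo` table):

```r
library(dplyr)
library(duckdb)
library(DBI)

con_demo <- dbConnect(duckdb::duckdb())
dbWriteTable(con_demo, "births", data.frame(year_of_birth = c(1924L, 1926L)))

decades <- tbl(con_demo, "births") |>
  mutate(
    # translated to CAST(... AS INTEGER): rounds, so 192.6 becomes 193
    via_cast  = as.integer(year_of_birth / 10) * 10L,
    # explicit truncation: 192.6 floors to 192 on every backend
    via_floor = as.integer(floor(year_of_birth / 10) * 10)
  ) |>
  collect()

decades # 1924 agrees (1920 both ways); 1926 silently diverges (1930 vs 1920)
dbDisconnect(con_demo, shutdown = TRUE)
```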
A few patterns are worth knowing by heart because they translate cleanly:
- `case_when()` and `if_else()` — both work, both produce clean SQL.
- `stringr` helpers (`str_detect`, `str_sub`, `str_replace`) — work for most regex-free patterns.
- Basic date parts: `year()`, `month()`, `day()` — both backends.
- `coalesce()` — both backends.
And a few that routinely need help:
- **Integer division (`%/%`) and modulus (`%%`)** — Arrow's translator is unreliable. Use `floor(x / y)` and `x - floor(x / y) * y` for portability.
- **Long-tail date arithmetic** — interval arithmetic, `lubridate` helpers beyond `year()`/`month()`/`day()` — DuckDB handles more than Arrow, but both can bail. If dbplyr does not translate cleanly, swap a `lubridate` call for an equivalent arithmetic one (e.g. `as.integer(difftime(a, b, units = "days"))` instead of a `%--%` interval).
- **Rolling windows and frame clauses** — use `window_order()` + `window_frame(from, to)` in dbplyr; see the [Window functions chapter](04-window-functions.qmd).
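For `%/%` and `%%`, the portable spelling from the list above looks like this in practice (toy data and an in-memory DuckDB; `con_demo` and `vals` are illustrative names):

```r
library(dplyr)
library(duckdb)
library(DBI)

con_demo <- dbConnect(duckdb::duckdb())
dbWriteTable(con_demo, "vals", data.frame(x = c(17L, 25L, 103L)))

digits <- tbl(con_demo, "vals") |>
  mutate(
    tens = as.integer(floor(x / 10)),          # portable x %/% 10
    ones = as.integer(x - floor(x / 10) * 10)  # portable x %% 10
  ) |>
  arrange(x) |>
  collect()

digits # tens = 1, 2, 10; ones = 7, 5, 3
dbDisconnect(con_demo, shutdown = TRUE)
```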
## When the translation fails
You find out at `collect()` time: a dbplyr translation error (or, under Arrow, a "function not implemented" error). The right response, in order of preference:
1. **Rewrite using something the translator already knows.** The `floor()` example above. `stringr::str_c()` instead of `paste0()`. `coalesce()` instead of `ifelse()` when you're handling missing values. This covers almost every case you will meet.
2. **Last resort: drop a `sql()` fragment into a single `mutate()` or `filter()`.** This keeps the pipe in tidyverse but hands DuckDB one literal SQL expression where there is no tidy alternative:
```r
demo_db |>
  mutate(age_bucket = sql("(2015 - YEAR_OF_BIRTH) // 10 * 10")) |> # // is DuckDB integer division; plain / would be float division
count(age_bucket)
```
Reserve this for genuinely DuckDB-specific features that have no dplyr counterpart. **If you find yourself writing more than one `sql(...)` per pipeline, rewrite the tidyverse side instead.**
You should almost never reach for `dbGetQuery()` (running a hand-written SQL query in one shot). Writing an entire query as a SQL string means giving up lazy composition, the familiar dplyr syntax, and the automatic translation we rely on everywhere else in the book.
## Arrow as a secondary engine
`arrow` gives you the same dplyr verbs over parquet without needing a DuckDB database. It is a reasonable alternative in a few specific situations:
- You are looking at someone else's parquet folder and do not want to stand up a DuckDB database just to explore it.
- You are doing a simple "read, filter, aggregate, save" pipeline with no joins or window functions.
- You want the output to be read seamlessly by Python tools that prefer the Arrow format.
```{r read-arrow}
itr_ar <- open_dataset(file.path(pq, "itr_context")) |> rename_with(tolower)
itr_ar |>
filter(incm_yr == 2015L, !is.na(ocptn_grp_cd)) |>
count(ocptn_grp_cd) |>
collect() |>
head()
```
For scans and simple groupbys Arrow and DuckDB are indistinguishable (see the walkthrough timings in [A2](A2-data-manipulation-benchmarks.qmd#walkthrough-timings)). The gap opens on joins and window functions — covered in the [joins](03-joins.qmd) and [window functions](04-window-functions.qmd) chapters — and that is where DuckDB becomes the safe default.
## Small result, large source: the whole point
Most PLIDA analyses end in a plot or a small table. The pipeline should look like a pyramid: a wide, filtered scan at the bottom; aggregations squeezing the result down through the middle; a small `collect()` at the top. Everything below `collect()` runs in DuckDB; everything above runs in R.
```r
library(dplyr); library(dbplyr); library(ggplot2)
tbl(con, "itr") |>
filter(incm_yr >= 2010L, !is.na(ocptn_grp_cd)) |>
mutate(occ_major = as.integer(floor(ocptn_grp_cd / 1000))) |>
count(incm_yr, occ_major) |>
collect() |> # ← here, and only here
ggplot(aes(incm_yr, n, colour = factor(occ_major))) +
geom_line()
```
The result that hits R is a few hundred rows. DuckDB did the work over 170 million.
## What lives in [A2](A2-data-manipulation-benchmarks.qmd)
- T1: filter + count on one year of ITR — five backends side by side.
- T2: groupby count over all ITR years — five backends side by side.
- Walkthrough timings: DuckDB vs Arrow on the same dplyr code, across filter, groupby, mutate, and arrange.
- Peak memory footprint for each approach.
```{r disconnect}
#| include: false
dbDisconnect(con, shutdown = TRUE)
```