---
title: "Data manipulation"
subtitle: "dplyr verbs against a DuckDB database"
---
## Setup
Load `tidyverse` plus the handful of packages DuckDB needs (`duckdb`, `DBI`, `dbplyr`), and set a memory and thread budget so DuckDB does not grab all of the machine. There are two ways to point DuckDB at our data — **from a DuckDB database** or **from parquet directly**. They give you the same lazy `tbl()` handles and the analysis code below works identically against either.
::: {.panel-tabset}
## From a DuckDB database (default)
Use the materialised `plida_tables.duckdb` we built in [Reading and storing](01-reading-storing.qmd). Open it **read-only** so you can have multiple R sessions attached at once (handy when you are iterating in one session and rendering a Quarto document in another). This is what every chapter and case study from here on assumes.
```{r setup}
library(tidyverse)
library(scales)
library(arrow) # used below in the "Arrow as a secondary engine" section
library(duckdb)
library(DBI)
library(dbplyr)
# Centralised paths for this project — a short file that just defines
# where the DuckDB and parquet folders live. See the Setup chapter.
source("R/00-paths.R")
data_dir <- paths$duckdb_dir # replace with your own path
con <- dbConnect(
duckdb::duckdb(),
dbdir = file.path(data_dir, "plida_tables.duckdb"),
read_only = TRUE
)
dbExecute(con, "SET memory_limit = '30GB'") # roughly 90% of your machine's RAM
dbExecute(con, "SET threads TO 9") # your number of cores minus 1
```
## From parquet, no database file
If you have parquet on disk but haven't ingested it into a DuckDB database yet, you can skip the database entirely. Open an **in-memory** DuckDB (pass no `dbdir =`) and hand `tbl()` a bit of SQL that tells DuckDB where to read the parquet from. Everything downstream — `filter`, `summarise`, `inner_join`, `lag`, `window_frame`, the works — is identical. One line per table.
```{r setup-parquet}
#| eval: false
library(tidyverse)
library(scales)
library(duckdb)
library(DBI)
library(dbplyr)
library(glue)
source("R/00-paths.R")
pq <- paths$parquet_root # replace with your own parquet folder
con <- dbConnect(duckdb::duckdb()) # in-memory DuckDB, no file
dbExecute(con, "SET memory_limit = '30GB'")
dbExecute(con, "SET threads TO 9")
# Point each lazy `tbl()` at a parquet glob. No CREATE TABLE, no ingest step.
itr_db <- tbl(con, sql(glue("SELECT * FROM read_parquet('{pq}/itr_context/*.parquet')")))
demo_db <- tbl(con, sql(glue("SELECT * FROM read_parquet('{pq}/demo/*.parquet')")))
# ...one line per table you need.
```
Views over parquet are a touch slower than a materialised table on repeated joins (see [A3](A3-join-benchmarks.qmd#t3-three-table-join)), but for one-off exploration or scripts that already have parquet on disk, this is the fastest path from zero to querying.
:::
```{r scaffold}
#| include: false
# Internal rendering scaffolding for the `#| time_it: "..."` chunk option
# used below. Readers can ignore this.
library(tictoc); library(knitr)
timings <- tibble()
knitr::knit_hooks$set(time_it = function(before, options, envir) {
if (before) {
gc(full = TRUE, verbose = FALSE)
assign(".time_it_t0", Sys.time(), envir = envir)
} else {
secs <- as.numeric(difftime(Sys.time(), get(".time_it_t0", envir = envir), units = "secs"))
task <- options$time_it
approach <- if (!is.null(options$approach)) options$approach
else if (grepl("-duckdb$", options$label)) "duckdb"
else if (grepl("-arrow$", options$label)) "arrow"
else NA_character_
timings <<- bind_rows(timings, tibble(
task = task, approach = approach, seconds = secs
))
sprintf("\n::: {.callout-note appearance='minimal' collapse='false'}\n⏱️ **%s** (%s): **%.3fs**\n:::\n", task, approach, secs)
}
})
pq <- paths$parquet_root
```
## Why are we doing this?
With a _normal_ in-memory dataset, we would just read it in and do some things to it:
```{r normal}
#| eval: false
read_csv("some-data.csv") |>
rename_with(tolower) |>
mutate(occ_major = as.integer(floor(ocptn_grp_cd / 1000))) |>
count(occ_major) |>
arrange(occ_major)
```
This usually works great, and the `tidyverse` syntax lets us write human-readable and expressive code that does the thing we want it to do. The problem arises when the data is too big for our computer's memory. It either won't be read in (you will encounter an error or your computer will explode) or it will take forever to do simple tasks.
Luckily, we can change a few things here and there and apply our neat code to enormous datasets that would never fit in 64GB of memory (or even 256GB, even if you could afford it). Here we first 'connect' to our dataset and _then_ run the same `tidyverse` code over it. At the end we use `collect()` to send the job to DuckDB, which does the processing out of R's memory. Shortly after, we have our result.
```{r big-data-example}
#| time_it: "filter + mutate + count"
itr_db <- tbl(con, "itr") # <1>
itr_db |>
rename_with(tolower) |>
filter(incm_yr == 2015L, !is.na(ocptn_grp_cd)) |>
mutate(occ_major = as.integer(floor(ocptn_grp_cd / 1000))) |>
count(occ_major) |>
arrange(occ_major) |>
collect() # <2>
```
1. Connect lazily to the `itr` table — nothing is read yet.
2. Execute the whole pipeline inside DuckDB and return the small result to R as a tibble.
Once the data is in parquet and DuckDB, the day-to-day work is ordinary `dplyr`: `filter`, `mutate`, `group_by`, `summarise`, `count`, `arrange`. The verbs look the same as on a tibble. The difference is where the computation happens — **the dplyr code is translated to SQL and executed inside DuckDB** — and that difference is the reason you can run these pipelines over 170-million-row tables without the laptop noticing.
This chapter covers the everyday verbs, the small translation traps that bite once, and the escape hatches for the things dplyr does not cover. Joins and window functions get their own chapters. The evidence behind the speed and memory claims is in [Appendix A2](A2-data-manipulation-benchmarks.qmd).
## The mental model: build the pipe, then `collect()` at the end
`tbl(con, "name")` returns a **lazy** handle to a DuckDB table. "Lazy" means nothing actually runs when you apply a verb — each verb just adds a step to a pipeline description that DuckDB will run later. The actual work only happens when you call `collect()`, which executes the whole pipeline and pulls the result back into R as a tibble. (There is also `compute()`, which runs the pipeline but leaves the result inside DuckDB as a table.) The same pattern applies to Arrow: `open_dataset()` returns a lazy object that you build up with verbs and `collect()` at the end.
```{r read-duckdb}
# rename_with(tolower) lowercases SCREAMING_CASE columns so the rest of the
# pipe can stay in snake_case. It is lazy — no data moves.
itr_db <- tbl(con, "itr") |> rename_with(tolower)
demo_db <- tbl(con, "demo") |> rename_with(tolower)
ato_sp_db <- tbl(con, "ato_spine") |> rename_with(tolower)
dhda_sp_db <- tbl(con, "dhda_spine") |> rename_with(tolower)
```
Two key rules for getting this right:
- **Keep the pipe lazy until the last step.** A `collect()` in the middle of a pipeline forces the work so far back into R memory, and every verb after it runs in R rather than in DuckDB. You lose the speed advantage. Save `collect()` for the final, small result.
- **Make the last step small.** A `collect()` that tries to pull billions of rows into R will crash your session. Aggregate or `slice_head()` first.
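The lazy/`collect()`/`compute()` distinction is easiest to see on a toy table. A minimal self-contained sketch — an in-memory DuckDB with made-up data, not the PLIDA connection (`con_demo` and `toy` are illustrative names):

```r
library(dplyr)
library(duckdb)
library(DBI)

con_demo <- dbConnect(duckdb::duckdb()) # throwaway in-memory DuckDB
dbWriteTable(con_demo, "toy", data.frame(yr = c(2014L, 2015L, 2015L), x = 1:3))

lazy <- tbl(con_demo, "toy") |> filter(yr == 2015L) # nothing has run yet
nrow(lazy) # NA — a lazy tbl does not know its size until the query executes

res <- lazy |>
  summarise(total = sum(x, na.rm = TRUE)) |>
  collect() # the whole pipeline runs inside DuckDB; `res` is an ordinary tibble
res$total # 5

staged <- compute(lazy) # runs the filter but keeps the result as a DuckDB table
# `staged` is still a lazy handle — useful for reusing an expensive intermediate

dbDisconnect(con_demo, shutdown = TRUE)
```

Nothing touches R memory until `collect()`; `compute()` is the variant you reach for when several later pipelines share the same expensive prefix.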
Two useful diagnostics on a lazy object:
```{r glimpse-db}
glimpse(itr_db)
```
```{r show_query}
# See the SQL that dbplyr would send to DuckDB
itr_db |> filter(incm_yr == 2015L) |> count(ocptn_grp_cd) |> show_query()
```
`show_query()` prints the SQL — handy for sanity-checking what DuckDB is being asked to do, and for spotting the occasional translation surprise.
## Filter and count: a single year of ITR
A single year of ITR is about 10 million rows. Count the returns filed in 2015 by ANZSCO occupation major group.
```{r filter-duckdb}
#| time_it: "filter + count (10M)"
itr_db |>
filter(incm_yr == 2015L, !is.na(ocptn_grp_cd)) |>
mutate(occ_major = as.integer(floor(ocptn_grp_cd / 1000))) |>
count(occ_major) |>
collect() |>
arrange(occ_major)
```
A fraction of a second. DuckDB uses the per-row-group min/max summaries we described in [Reading and storing](01-reading-storing.qmd#step-1-csv-to-parquet-using-duckdb) to skip every row-group that doesn't contain any 2015 rows, so only 2015 rows are actually decoded. Memory stays flat: ordinary dplyr on a tibble would have paid around 250 MB to hold the whole year of data before doing anything with it.
The T1 benchmark in [A2](A2-data-manipulation-benchmarks.qmd#t1-filter-and-count) shows this explicitly — DuckDB and Arrow both finish in well under a second while `dplyr`-on-tibbles takes 4+ seconds and loads the whole year into R memory first.
## Groupby across all years
Same shape, now over 170 million rows spanning ten financial years.
```{r groupby-duckdb}
#| time_it: "groupby (170M)"
occ_by_year_db <- itr_db |>
filter(!is.na(ocptn_grp_cd)) |>
mutate(occ_major = as.integer(floor(ocptn_grp_cd / 1000))) |>
count(incm_yr, occ_major) |>
collect()
head(occ_by_year_db)
```
```{r plot-groupby}
#| fig-width: 8
#| fig-height: 4.5
occ_labels <- c(
"1" = "Managers", "2" = "Professionals", "3" = "Technicians",
"4" = "Community/personal", "5" = "Clerical/admin",
"6" = "Sales", "7" = "Machinery operators", "8" = "Labourers"
)
occ_by_year_db |>
filter(occ_major %in% 1:8) |>
mutate(occ_label = factor(occ_labels[as.character(occ_major)], levels = occ_labels)) |>
ggplot(aes(incm_yr, n, colour = occ_label)) +
geom_line(linewidth = 0.7) +
scale_y_continuous(labels = label_comma()) +
labs(x = "Income year", y = "Tax returns filed", colour = "ANZSCO major group",
title = "Returns by occupation group over a decade",
subtitle = "170M rows scanned in under a second") +
theme_minimal(base_size = 11)
```
The T2 benchmark ([A2 §T2](A2-data-manipulation-benchmarks.qmd#t2-groupby-across-all-years)) is where the regime changes become visible. `dplyr` on tibbles takes about 35 seconds and 14.5 GB of RAM — it has to load every row before grouping. DuckDB and Arrow both stream the scan and hold only the result (~1000 rows). That is roughly a 200× speed-up and a 50× memory reduction, on identical dplyr code, for changing one thing: what `tbl` points at.
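That "change what `tbl` points at" claim can be checked end-to-end on a toy table: one pipeline definition, run once as ordinary dplyr on a tibble and once as SQL inside DuckDB (a sketch with made-up data; `pipeline`, `toy`, and `con_demo` are illustrative names):

```r
library(dplyr)
library(duckdb)
library(DBI)

# One pipeline definition, two execution engines
pipeline <- function(df) {
  df |>
    filter(!is.na(grp)) |>
    count(yr, grp) |>
    arrange(yr, grp)
}

toy <- tibble(
  yr  = c(2014L, 2014L, 2015L, 2015L, 2015L),
  grp = c(1L, 1L, 2L, 2L, NA)
)

in_r <- pipeline(toy) # ordinary dplyr, in R memory

con_demo <- dbConnect(duckdb::duckdb())
dbWriteTable(con_demo, "toy", toy)
in_db <- pipeline(tbl(con_demo, "toy")) |> collect() # same code, in DuckDB

all(in_r$n == in_db$n) # TRUE — identical answers from both engines
dbDisconnect(con_demo, shutdown = TRUE)
```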
## `mutate()`: watch for small translation differences
`mutate()` is where small differences between backends appear. Your dplyr code is being translated — into DuckDB's SQL on one side, or into Arrow's equivalent function on the other — and the translations don't always agree on the fine detail. For the everyday operations (`x + 1`, `paste0`, `case_when`, `is.na`) everything is fine. The innocuous-looking expression "round a year down to the nearest decade" is the textbook trap.
```{r mutate-duckdb}
#| time_it: "mutate + arrange (30M)"
gender_dist_db <- demo_db |>
filter(!is.na(year_of_birth)) |>
mutate(
# Use floor() explicitly — do NOT rely on as.integer() to truncate,
# because as.integer(x) means different things in R/arrow vs DuckDB.
birth_decade = as.integer(floor(year_of_birth / 10) * 10),
gender = case_when(
core_gender == "M" ~ "Male",
core_gender == "F" ~ "Female",
TRUE ~ "Other/unknown"
)
) |>
count(birth_decade, gender) |>
arrange(birth_decade, gender) |>
collect()
head(gender_dist_db, 8)
```
::: {.callout-warning}
## The `as.integer()` trap: truncate vs round
An earlier draft of this pipeline used `as.integer(year_of_birth / 10) * 10L`. It looks equivalent to the `floor()` version but gives different answers across backends:
- In R (and therefore in Arrow), `as.integer()` **truncates toward zero**: `as.integer(192.6) = 192`, so `1926` maps to decade `1920`.
- In DuckDB, dbplyr translates `as.integer(...)` to SQL's `CAST(... AS INTEGER)`, which **rounds to the nearest integer**: `CAST(192.6 AS INTEGER) = 193`, so `1926` maps to decade `1930`.
The consequence is silent and substantive. For this data, the "1920" Arrow bucket covered years 1920–1929 while the "1920" DuckDB bucket covered roughly 1915–1924, and the decade counts disagreed by tens of percent.
**When you want truncation, say so explicitly with `floor()` (or `trunc()`); do not rely on `as.integer()` to do it for you across engines.** The same caution applies to `%/%`, `%%`, and bankers' rounding — all of which have backend-specific translations.
:::
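The divergence is easy to reproduce on a two-row toy table (an illustrative sketch against an in-memory DuckDB, not the real `demo` table):

```r
library(dplyr)
library(duckdb)
library(DBI)

con_demo <- dbConnect(duckdb::duckdb())
dbWriteTable(con_demo, "births", data.frame(year_of_birth = c(1924L, 1926L)))

decades <- tbl(con_demo, "births") |>
  mutate(
    # translated to CAST(... AS INTEGER): rounds, so 192.6 becomes 193
    via_cast  = as.integer(year_of_birth / 10) * 10L,
    # explicit truncation: 192.6 floors to 192 on every backend
    via_floor = as.integer(floor(year_of_birth / 10) * 10)
  ) |>
  collect()

decades # 1924 agrees (1920 both ways); 1926 silently diverges (1930 vs 1920)
dbDisconnect(con_demo, shutdown = TRUE)
```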
A few patterns are worth knowing by heart because they translate cleanly:
- `case_when()` and `if_else()` — both work, both produce clean SQL.
- `stringr` helpers (`str_detect`, `str_sub`, `str_replace`) — work for most regex-free patterns.
- Basic date parts: `year()`, `month()`, `day()` — both backends.
- `coalesce()` — both backends.
And a few that routinely need help:
- **Integer division (`%/%`) and modulus (`%%`)** — Arrow's translator is unreliable. Use `floor(x / y)` and `x - floor(x / y) * y` for portability.
- **Long-tail date arithmetic** — interval arithmetic, `lubridate` helpers beyond `year()`/`month()`/`day()` — DuckDB handles more than Arrow, but both can bail. If dbplyr does not translate cleanly, swap a `lubridate` call for an equivalent arithmetic one (e.g. `as.integer(difftime(a, b, units = "days"))` instead of a `%--%` interval).
- **Rolling windows and frame clauses** — use `window_order()` + `window_frame(from, to)` in dbplyr; see the [Window functions chapter](04-window-functions.qmd).
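For `%/%` and `%%`, the portable spelling from the list above looks like this in practice (toy data and an in-memory DuckDB; `con_demo` and `vals` are illustrative names):

```r
library(dplyr)
library(duckdb)
library(DBI)

con_demo <- dbConnect(duckdb::duckdb())
dbWriteTable(con_demo, "vals", data.frame(x = c(17L, 25L, 103L)))

digits <- tbl(con_demo, "vals") |>
  mutate(
    tens = as.integer(floor(x / 10)),          # portable x %/% 10
    ones = as.integer(x - floor(x / 10) * 10)  # portable x %% 10
  ) |>
  arrange(x) |>
  collect()

digits # tens = 1, 2, 10; ones = 7, 5, 3
dbDisconnect(con_demo, shutdown = TRUE)
```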
## When the translation fails
You find out at `collect()` time: a dbplyr translation error (or, under Arrow, a "function not implemented" error). The right response, in order of preference:
1. **Rewrite using something the translator already knows.** The `floor()` example above. `stringr::str_c()` instead of `paste0()`. `coalesce()` instead of `ifelse()` when you're handling missing values. This covers almost every case you will meet.
2. **Last resort: drop a `sql()` fragment into a single `mutate()` or `filter()`.** This keeps the pipe in tidyverse but hands DuckDB one literal SQL expression where there is no tidy alternative:
```r
demo_db |>
  mutate(age_bucket = sql("(2015 - YEAR_OF_BIRTH) // 10 * 10")) |> # // is DuckDB integer division; plain / would be float division
count(age_bucket)
```
Reserve this for genuinely DuckDB-specific features that have no dplyr counterpart. **If you find yourself writing more than one `sql(...)` per pipeline, rewrite the tidyverse side instead.**
You should almost never reach for `dbGetQuery()` (running a hand-written SQL query in one shot). Writing an entire query as a SQL string means giving up lazy composition, the familiar dplyr syntax, and the automatic translation we rely on everywhere else in the book.
## Arrow as a secondary engine
`arrow` gives you the same dplyr verbs over parquet without needing a DuckDB database. It is a reasonable alternative in a few specific situations:
- You are looking at someone else's parquet folder and do not want to stand up a DuckDB database just to explore it.
- You are doing a simple "read, filter, aggregate, save" pipeline with no joins or window functions.
- You want the output to be read seamlessly by Python tools that prefer the Arrow format.
```{r read-arrow}
itr_ar <- open_dataset(file.path(pq, "itr_context")) |> rename_with(tolower)
itr_ar |>
filter(incm_yr == 2015L, !is.na(ocptn_grp_cd)) |>
count(ocptn_grp_cd) |>
collect() |>
head()
```
For scans and simple groupbys Arrow and DuckDB are indistinguishable (see the walkthrough timings in [A2](A2-data-manipulation-benchmarks.qmd#walkthrough-timings)). The gap opens on joins and window functions — covered in the [joins](03-joins.qmd) and [window functions](04-window-functions.qmd) chapters — and that is where DuckDB becomes the safe default.
## Small result, large source: the whole point
Most PLIDA analyses end in a plot or a small table. The pipeline should look like a pyramid: a wide, filtered scan at the bottom; aggregations squeezing the result down through the middle; a small `collect()` at the top. Everything below `collect()` runs in DuckDB; everything above runs in R.
```r
library(dplyr); library(dbplyr); library(ggplot2)
tbl(con, "itr") |>
filter(incm_yr >= 2010L, !is.na(ocptn_grp_cd)) |>
mutate(occ_major = as.integer(floor(ocptn_grp_cd / 1000))) |>
count(incm_yr, occ_major) |>
collect() |> # ← here, and only here
ggplot(aes(incm_yr, n, colour = factor(occ_major))) +
geom_line()
```
The result that hits R is a few hundred rows. DuckDB did the work over 170 million.
## What lives in [A2](A2-data-manipulation-benchmarks.qmd)
- T1: filter + count on one year of ITR — five backends side by side.
- T2: groupby count over all ITR years — five backends side by side.
- Walkthrough timings: DuckDB vs Arrow on the same dplyr code, across filter, groupby, mutate, and arrange.
- Peak memory footprint for each approach.
```{r disconnect}
#| include: false
dbDisconnect(con, shutdown = TRUE)
```