This article details the fallback mechanism in duckplyr, which allows support for all dplyr verbs and R functions.
The duckplyr package aims at providing a fully compatible drop-in replacement for dplyr. All operations, R functions, and data types that are supported by dplyr should work in an identical way with duckplyr. This is achieved in two ways:
The following operation is supported by duckplyr:
The explain()
function shows what happens under the
hood:
The plan shows three operations:
b
column and removing the
a
column).Each operation is supported by DuckDB. The resulting object contains a plan for the entire pipeline that is executed lazily, only when the data is needed.
DuckDB accepts a tree of interconnected relation objects as input. Each relation object represents a logical step of the execution plan. The duckplyr package translates dplyr verbs into relation objects.
The last_rel()
function shows the last relation that has
been materialized:
It is NULL
because nothing has been computed yet.
Converting the object to a data frame triggers the computation:
The last_rel()
function now shows a relation that
describes logical plan for executing the whole pipeline.
Using a custom function with a side effect is not supported by DuckDB and triggers a dplyr fallback:
verbose_plus_one <- function(x) {
message("Adding one to ", paste(x, collapse = ", "))
x + 1
}
fallback <-
duckplyr::duckdb_tibble(a = 1:3) |>
arrange(desc(a)) |>
mutate(b = verbose_plus_one(a)) |>
select(-a)
The verbose_plus_one()
function is not supported by
DuckDB, so the mutate()
step is forwarded to dplyr and
already executed (eagerly) when the pipeline is defined. This is
confirmed by the last_rel()
function:
Only the arrange()
step is executed by DuckDB. Because
the dplyr implementation of mutate()
needs the data before
it can proceed, the data is first converted to a data frame, and this
triggers the materialization of the first step.
The explain()
function also confirms indirectly that at
least a part of the operation is handled by dplyr:
The final plan now only consists of a data frame scan. This is the
result of the mutate()
step, which at this stage already
has been executed by dplyr.
Converting the final object to a data frame triggers the rest of the computation:
The last_rel()
function confirms that only the final
select()
is handled by DuckDB again.
For any duck frame, one can control the automatic materialization. For fallbacks to dplyr, automatic materialization must be allowed for the duck frame at hand, as dplyr necessitates eager evaluation.
Therefore, by making a data frame stingy, one can ensure a pipeline
will error when a fallback to dplyr would have normally happened. See
vignette("prudence")
for details.
By using operations supported by duckplyr and avoiding fallbacks as
much as possible, your pipelines will be executed by DuckDB in an
optimized way. As of duckplyr 1.1.0, most DuckDB functions can be used
directly, see vignette("duckdb")
for details.
Using the fallback_sitrep()
and
fallback_config()
functions you can examine and change
settings related to fallbacks.
You can choose to make fallbacks verbose with
fallback_config(info = TRUE)
.
You can change settings related to logging and reporting fallback to duckplyr development team to inform their work.
See vignette("telemetry")
for details.
The fallback mechanism in duckplyr allows for a seamless integration of dplyr verbs and R functions that are not supported by DuckDB. It is transparent to the user and only triggers when necessary. With small or medium-sized data sets, it will not even be noticeable in most settings.
See vignette("large")
for techniques for working with
large data, vignette("limits")
for the currently
implementated translations, vignette("duckdb")
for direct
access to DuckDB functions, vignette("prudence")
for
details on controlling fallback behavior, and
vignette("telemetry")
for the automatic reporting of
fallback situations.