The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
There are two parts to dbplyr SQL translation: translating dplyr
verbs, and translating expressions within those verbs. This vignette
describes how entire verbs are translated;
vignette("translation-function")
describes how individual
expressions within those verbs are translated.
All dplyr verbs generate a SELECT
statement. To
demonstrate we’ll make a temporary database with a couple of tables
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights <- copy_to(con, nycflights13::flights)
airports <- copy_to(con, nycflights13::airports)
select()
and mutate()
modify the
SELECT
clause:
## <SQL>
## SELECT `dep_delay`, `arr_delay`
## FROM `nycflights13::flights`
flights %>%
select(distance, air_time) %>%
mutate(speed = distance / (air_time / 60)) %>%
show_query()
## <SQL>
## SELECT `distance`, `air_time`, `distance` / (`air_time` / 60.0) AS `speed`
## FROM `nycflights13::flights`
filter()
generates a WHERE
clause:
## <SQL>
## SELECT `nycflights13::flights`.*
## FROM `nycflights13::flights`
## WHERE (`month` = 1.0) AND (`day` = 1.0)
arrange()
generates an ORDER BY
clause:
## <SQL>
## SELECT `nycflights13::flights`.*
## FROM `nycflights13::flights`
## ORDER BY `carrier`, `arr_delay` DESC
summarise()
and group_by()
work
together to generate a GROUP BY
clause:
flights %>%
group_by(month, day) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
show_query()
## `summarise()` has grouped output by "month". You can override using the
## `.groups` argument.
## <SQL>
## SELECT `month`, `day`, AVG(`dep_delay`) AS `delay`
## FROM `nycflights13::flights`
## GROUP BY `month`, `day`
R | SQL |
---|---|
inner_join() |
SELECT * FROM x JOIN y ON x.a = y.a |
left_join() |
SELECT * FROM x LEFT JOIN y ON x.a = y.a |
right_join() |
SELECT * FROM x RIGHT JOIN y ON x.a = y.a |
full_join() |
SELECT * FROM x FULL JOIN y ON x.a = y.a |
semi_join() |
SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a) |
anti_join() |
SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a) |
intersect(x, y) |
SELECT * FROM x INTERSECT SELECT * FROM y |
union(x, y) |
SELECT * FROM x UNION SELECT * FROM y |
setdiff(x, y) |
SELECT * FROM x EXCEPT SELECT * FROM y |
x
and y
don’t have to be tables in the same
database. If you specify copy = TRUE
, dplyr will copy the
y
table into the same location as the x
variable. This is useful if you’ve downloaded a summarised dataset and
determined a subset of interest that you now want the full data for. You
can use semi_join(x, y, copy = TRUE)
to upload the indices
of interest to a temporary table in the same database as x
,
and then perform a efficient semi join in the database.
If you’re working with large data, it maybe also be helpful to set
auto_index = TRUE
. That will automatically add an index on
the join variables to the temporary table.
The verb level SQL translation is implemented on top of
tbl_lazy
, which basically tracks the operations you perform
in a pipeline (see lazy-ops.R
). Turning that into a SQL
query takes place in three steps:
sql_build()
recurses over the lazy op data structure
building up query objects (select_query()
,
join_query()
, set_op_query()
etc) that
represent the different subtypes of SELECT
queries that we
might generate.
sql_optimise()
takes a pass over these SQL objects,
looking for potential optimisations. Currently this only involves
removing subqueries where possible.
sql_render()
calls an SQL generation function
(sql_query_select()
, sql_query_join()
,
sql_query_semi_join()
, sql_query_set_op()
, …)
to produce the actual SQL. Each of these functions is a generic, taking
the connection as an argument, so that the translation can be customised
for different databases.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.