## {DBI}/{dbplyr} backend

Connecting to Databricks SQL Warehouses typically involves using one of the methods below:
Method | Advantages | Disadvantages |
---|---|---|
ODBC | Works with various authentication mechanisms, robust, acceptable performance. | Heavy driver that can be difficult to install on corporate devices. Very slow with larger write operations. |
JDBC | Not sure to be honest. | Slower than ODBC. Requires Java. Very slow with larger write operations. |
Statement Execution API | Straightforward API that can handle data of all sizes. | No direct support for write operations. |
Databricks SQL Connector (`databricks-sql-connector`) | Convenient to work with and good performance. | It’s in Python, requires {reticulate}. |
That leaves a void for something that is:

- Native to R
- Easy to install
- Acceptable performance (or better)
- Handles bulk uploads efficiently
- Supports OAuth U2M (no tokens!)
{brickster} now provides both {DBI} and {dbplyr} backends for working with Databricks SQL warehouses. It uses the Statement Execution API in combination with the Files API (the latter specifically for large data uploads).
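To follow along, {brickster} needs to be installed - a quick sketch, assuming installation from CRAN:

```r
# install {brickster} (assumes a CRAN release is available)
install.packages("brickster")
```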
Connecting to a Databricks SQL warehouse requires creating a {DBI} connection:
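For example, a minimal sketch of establishing a connection - the argument names shown (`warehouse_id`, `staging_volume`) are placeholders to check against the `DatabricksSQL()` driver documentation, and authentication (OAuth U2M or a token) is assumed to already be configured in your environment:

```r
library(DBI)
library(brickster)

# `warehouse_id` identifies the SQL warehouse to run statements against;
# `staging_volume` is optional here and only used for large writes (see below)
con <- dbConnect(
  DatabricksSQL(),
  warehouse_id = "<warehouse-id>",
  staging_volume = "/Volumes/<catalog>/<schema>/<volume>"
)
```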
## {DBI} Backend

Execute SQL directly and return results as data frames:
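For example, using the standard {DBI} generics against the `samples.nyctaxi.trips` sample table (assuming the connection `con` created above):

```r
# run a query on the SQL warehouse and return the result as a data frame
trips <- dbGetQuery(
  con,
  "SELECT * FROM samples.nyctaxi.trips LIMIT 10"
)
```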
Tables can be referenced via `Id()`, `I()`, or using the name as-is:
# List available tables
tables <- dbListTables(con)
# Check if specific table exists
dbExistsTable(con, "samples.nyctaxi.trips")
dbExistsTable(con, I("samples.nyctaxi.trips"))
dbExistsTable(con, Id("samples", "nyctaxi", "trips"))
# Get column information
dbListFields(con, "samples.nyctaxi.trips")
dbListFields(con, I("samples.nyctaxi.trips"))
dbListFields(con, Id("samples", "nyctaxi", "trips"))
When writing data (append, overwrite, etc.) there are two possible behaviours:

1. Embed the data directly in the generated SQL (an in-line CTAS)
2. Write `.parquet` files to a Volume directory and then `COPY INTO` or CTAS
For data larger than 50k rows {brickster} will only permit the use of method (2), which requires `staging_volume` to be specified when establishing the connection, or passed directly to `dbWriteTable()`. Ensure that `staging_volume` is a valid Volume path and that you have permission to write files.
# small data (150 rows)
# generates an in-line CTAS
dbWriteTable(
  conn = con,
  name = Id(catalog = "<catalog>", schema = "<schema>", table = "<table>"),
  value = iris,
  overwrite = TRUE
)

# bigger data (4 million rows)
# writes parquet files to volume then CTAS
iris_big <- dplyr::sample_n(iris, replace = TRUE, size = 4000000)
dbWriteTable(
  conn = con,
  name = Id(catalog = "<catalog>", schema = "<schema>", table = "<table>"),
  value = iris_big,
  overwrite = TRUE,
  staging_volume = "/Volumes/<catalog>/<schema>/<volume>/...", # or inherited from connection
  progress = TRUE
)
## {dbplyr} Backend

As in the {DBI} backend, tables can be referenced via `Id()`, `I()`, or using the name as-is in `tbl()`:
# Connect to existing tables
tbl(con, "samples.nyctaxi.trips")
tbl(con, I("samples.nyctaxi.trips"))
tbl(con, Id("samples", "nyctaxi", "trips"))
tbl(con, in_catalog("samples", "nyctaxi", "trips"))
Chain dplyr operations - they execute remotely on Databricks:
# Filter and select (translated to SQL)
long_trips <- tbl(con, "samples.nyctaxi.trips") |>
  filter(trip_distance > 10) |>
  select(
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    trip_distance,
    fare_amount
  )

# View the generated SQL (without executing)
show_query(long_trips)

# Execute and collect results
long_trips |> collect()
As a general reminder, call `collect()` at the latest point possible in your analysis to reduce the amount of computation required locally.
# Summary statistics by pickup zip code
trips_summary <- tbl(con, "samples.nyctaxi.trips") |>
  group_by(pickup_zip) |>
  summarise(
    trip_count = n(),
    total_fare_amount = sum(fare_amount, na.rm = TRUE),
    total_trip_distance = sum(trip_distance, na.rm = TRUE),
    avg_fare_amount = mean(fare_amount, na.rm = TRUE)
  ) |>
  arrange(desc(avg_fare_amount))

# Execute to get the 20 most expensive pickup zip codes with more than 20 trips
top_zips <- trips_summary |>
  filter(trip_count > 20) |>
  head(20) |>
  collect()
A key difference is that temporary tables are not supported - this makes functions like `copy_to()` only usable when specifying `temporary` as `FALSE`, which will use `dbWriteTable()` to create a table.
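A sketch of what that looks like, using placeholder catalog/schema names:

```r
library(dplyr)

# `temporary = FALSE` is required; under the hood the table is created
# via dbWriteTable()
remote_iris <- copy_to(
  con,
  iris,
  name = I("<catalog>.<schema>.iris"),
  temporary = FALSE,
  overwrite = TRUE
)
```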
The connection for the `DatabricksSQL` driver is different to other {DBI} backends as it doesn’t have a persistent session (it’s all just API calls). This means calling `dbDisconnect()` serves no purpose when it comes to freeing resources on the SQL warehouse.
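In other words, the following is safe to call but is effectively a no-op from the warehouse’s perspective:

```r
# no persistent session exists, so nothing is freed on the warehouse side
dbDisconnect(con)
```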
## {WebR} Support

{brickster}’s core dependencies are all {WebR} compatible. The backend uses {nanoarrow} as a fallback when {arrow} is unavailable (currently {arrow} is a `Suggests`). It’s recommended to always have {arrow} installed to improve the performance of data loading.
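For example:

```r
# optional, but speeds up materialising query results
install.packages("arrow")
```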