The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Using etl

Ben Baumer

2023-10-12

etl is an R package to facilitate Extract - Transform - Load (ETL) operations for medium data. The end result is generally a populated SQL database, but the user interaction takes place solely within R.

Using etl

Instantiate an etl object using a string that determines the class of the resulting object, and the package that provides access to that data. The trivial mtcars database is built into etl.

library(etl)
cars <- etl("mtcars")
## No database was specified so I created one for you at:
## /tmp/Rtmp6KULdZ/filed38756869699.sqlite3
class(cars)
## [1] "etl_mtcars"           "etl"                  "src_SQLiteConnection"
## [4] "src_dbi"              "src_sql"              "src"

Pay careful attention to where the SQLite database is stored. The default location is a temporary directory, but you will want to move this to a more secure location if you want this storage to be persistent. See file.copy() for examples on how to move a file.

Connect to a local or remote database

etl works with a local or remote database to store your data. Every etl object extends a dplyr::src_dbi object. If, as in the example above, you do not specify a SQL source, a local RSQLite database will be created for you. However, you can also specify any source that inherits from dplyr::src_dbi.

Note: If you want to use a database other than a local RSQLite, you must create the mtcars database and have permission to write to it first!

# For PostgreSQL
library(RPostgreSQL)
db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")

# Alternatively, for MySQL
library(RMySQL)
db <- src_mysql(dbname = "mtcars", user = "r-user", password = "mypass", host = "localhost")
cars <- etl("mtcars", db)

At the heart of etl are three functions: etl_extract(), etl_transform(), and etl_load().

Extract

The first step is to acquire data from an online source.

cars %>%
  etl_extract()
## Extracting raw data...

This creates a local store of raw data.

Transform

These data may need to be transformed from their raw form to files suitable for importing into SQL (usually CSVs).

cars %>%
  etl_transform()

Load

Populate the SQL database with the transformed data.

cars %>%
  etl_load()
## Loading 12 file(s) into the database...

Do it all at once

To populate the whole database from scratch, use etl_create.

cars %>%
  etl_create()

You can also update an existing database without re-initializing, but watch out for primary key collisions.

cars %>%
  etl_update()

Step-by-step

Under the hood, there are three functions that etl_update chains together:

getS3method("etl_update", "default")
## function (obj, ...) 
## {
##     obj <- obj %>% etl_extract(...) %>% etl_transform(...) %>% 
##         etl_load(...)
##     invisible(obj)
## }
## <bytecode: 0x562333745908>
## <environment: namespace:etl>

etl_create is simply a call to etl_update that forces the SQL database to be written from scratch.

getS3method("etl_create", "default")
## function (obj, ...) 
## {
##     obj <- obj %>% etl_init(...) %>% etl_update(...) %>% etl_cleanup(...)
##     invisible(obj)
## }
## <bytecode: 0x562334250fc0>
## <environment: namespace:etl>

Do Your Analysis

Now that your database is populated, you can work with it as a src data table just like any other dplyr source.

cars %>%
  tbl("mtcars") %>%
  group_by(cyl) %>%
  summarise(N = n(), mean_mpg = mean(mpg))
## Warning: Missing values are always removed in SQL aggregation functions.
## Use `na.rm = TRUE` to silence this warning
## This warning is displayed once every 8 hours.
## # Source:   SQL [3 x 3]
## # Database: sqlite 3.41.2 [/tmp/Rtmp6KULdZ/filed38756869699.sqlite3]
##     cyl     N mean_mpg
##   <int> <int>    <dbl>
## 1     4    11     26.7
## 2     6     7     19.7
## 3     8    14     15.1

Extending etl

Create your own ETL packages

Suppose you want to create your own ETL package called pkgname. All you have to do is write a package that requires etl, and then you have to write one S3 methods:

etl_extract.etl_pkgname()

You may also wish to write

etl_transform.etl_pkgname()
etl_load.etl_pkgname()

All of these functions must take and return an object of class etl_pkgname that inherits from etl. Please see the “Extending etl” vignette for more information.

Use other ETL packages

Packages that use the etl framework are available on CRAN and/or GitHub:

tools::dependsOnPkgs("etl")
## [1] "macleish"

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.