The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Transfer REDCap data to a database and use in R without exceeding available memory. Compatible with all databases but specifically optimized for DuckDB—a fast and portable SQL engine with first-class integration in R/Posit products.
R objects live entirely in memory, causing three problems if not using a specialized framework:
redquack’s solution to this problem is to:
From CRAN:
# install.packages("pak")
::pak("redquack") pak
Development version:
::pak("dylanpieper/redquack") pak
These packages are used in the examples and are not imported by redquack:
::pak(c("dplyr", "duckdb", "keyring")) pak
Your REDCap API token allows R to interface with REDCap and should be stored securely. I recommend using the keyring package to store your API token. For example:
::key_set("redcap_token") keyring
Data from REDCap is transferred to a database via a DBI connection in chunks of record IDs:
library(redquack)
<- DBI::dbConnect(duckdb::duckdb(), "redcap.duckdb")
duckdb
<- redcap_to_db(
result conn = duckdb,
redcap_uri = "https://redcap.example.org/api/",
token = keyring::key_get("redcap_token"),
record_id_name = "record_id",
chunk_size = 1000
# Increase chunk size for memory-efficient systems (faster)
# Decrease chunk size for memory-constrained systems (slower)
)
The function returns a list with class
redcap_transfer_result
:
success
: Logical if the transfer was completed with no
failed processingerror_chunks
: Vector of chunk numbers that failed
processingtime_s
: Numeric value for total seconds to transfer and
optimize dataThe database created by redcap_to_db()
contains two
tables:
data
: Contains all exported REDCap records with
optimized column types
<- DBI::dbGetQuery(duckdb, "SELECT * FROM data LIMIT 1000") data
log
: Contains timestamped logs of the transfer
process for troubleshooting
<- DBI::dbGetQuery(duckdb, "SELECT * FROM log") log
Data is imported as VARCHAR/TEXT for consistent handling across chunks.
For DuckDB, data types are automatically optimized after transfer to improve query performance:
In DuckDB, you can query the data to inspect the data types:
::dbGetQuery(duckdb, "PRAGMA table_info(data)") DBI
You can also automatically convert data types in R using readr:
::type_convert(data) readr
To optimize query performance with other databases, you must alter the data table manually.
Query and collect the data with dplyr:
library(dplyr)
<- tbl(duckdb, "data") |>
demographics filter(is.na(redcap_repeat_instrument)) |>
select(record_id, age, race, sex, gender) |>
collect()
If you collect()
your data into memory in the last step,
it can make a slow process nearly instantaneous. The following example
data is 2,825,092 rows x 397 columns:
system.time(
<- duckdb |>
records tbl("data") |>
collect() |>
group_by(redcap_repeat_instrument) |>
summarize(count = n()) |>
arrange(desc(count))
)#> user system elapsed
#> 5.048 5.006 6.077
system.time(
<- duckdb |>
records tbl("data") |>
group_by(redcap_repeat_instrument) |>
summarize(count = n()) |>
arrange(desc(count)) |>
collect()
)#> user system elapsed
#> 0.040 0.015 0.040
You can also write a Parquet file directly from DuckDB and use arrow. A Parquet file will be about 5 times smaller than a DuckDB file:
::dbExecute(duckdb, "COPY (SELECT * FROM data) TO 'redcap.parquet' (FORMAT PARQUET)") DBI
Remember to close the connection when finished:
::dbDisconnect(duckdb) DBI
While this package is only optimized for DuckDB, I invite
collaborators to help optimize it for other databases. The pathway I
suggest right now is to target your edits in
R/optimize_data_types.R
. Feel free to submit a PR and share
any other ideas you may have.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.