Creating a cdm reference using Spark

So far we’ve been using a local Spark connection to introduce the OmopOnSpark package. In practice, however, when working with patient-level health data our data will most likely live in the cloud-based Databricks platform, which is built around Apache Spark. Once we have created our cdm reference, the same code we have seen when working with a local Spark dataset will also work with Databricks; it is just the way we connect that differs.

To create your connection, follow the instructions at https://spark.posit.co/deployment/databricks-connect.html. Briefly, you would first save two environment variables in your .Renviron file.

# open your .Renviron file for editing
usethis::edit_r_environ()

DATABRICKS_HOST = "Enter here your Workspace URL"
DATABRICKS_TOKEN = "Enter here your personal token" 
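
After restarting your R session, you can confirm the variables were picked up with Sys.getenv(). A minimal sketch, checking the token's length rather than printing it:

Sys.getenv("DATABRICKS_HOST")
nchar(Sys.getenv("DATABRICKS_TOKEN")) > 0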

With these saved, you should now be able to connect with sparklyr, specifying your cluster ID.

library(sparklyr)
# connect to Databricks using the credentials saved in .Renviron
con <- spark_connect(
  cluster_id = "Enter here your cluster ID",
  method = "databricks_connect"
)
con

With this, we can check that everything is working and that we have an open connection.

connection_is_open(con)
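
If the connection is open, this returns TRUE. In a script it can be useful to fail fast when the connection was not established; a minimal defensive sketch:

# stop early if the Databricks connection could not be opened
stopifnot(connection_is_open(con))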

We should now be able to create a reference to a table. Let’s say our OMOP CDM data is in a catalog called “my_catalog” and a schema called “my_omop_schema”. We can then create a reference to our person table.

library(dplyr)
tbl(con, I("my_catalog.my_omop_schema.person"))

We should also be able to collect the first five rows of this table into R.

tbl(con, I("my_catalog.my_omop_schema.person")) |> 
  head(5) |> 
  collect()
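
Other dplyr verbs are translated to Spark SQL in the same lazy way, with only collect() bringing results into R. For example, we could count the rows of the person table without pulling the data locally (a sketch using the same hypothetical catalog and schema names):

tbl(con, I("my_catalog.my_omop_schema.person")) |> 
  count() |> 
  collect()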

We should also be able to go in the other direction and copy data from R to a Spark dataframe.

spark_cars_df <- sdf_copy_to(con,
                             cars,
                             overwrite = TRUE)
spark_cars_df
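
The returned spark_cars_df is itself a lazy table reference, so it can be queried with dplyr before collecting. A brief illustration using the columns of the built-in cars dataset:

spark_cars_df |> 
  summarise(
    mean_speed = mean(speed, na.rm = TRUE),
    mean_dist = mean(dist, na.rm = TRUE)
  ) |> 
  collect()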

If these basics are working, we are well set up to start working with OmopOnSpark. Here we specify our cdm schema as we’ve seen above. Now let’s say we have another schema called “my_results_schema” where we want to save any study-specific tables; we’ll use this when specifying the write schema. In addition, we can give a write prefix, and all the tables we create while working with this cdm reference will start with that prefix.

library(OmopOnSpark)
cdm <- cdmFromSpark(con,
  cdmSchema = "my_catalog.my_omop_schema",
  writeSchema = "my_catalog.my_results_schema",
  writePrefix = "study_1_"
)
cdm
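
From here, the same workflow shown with the local Spark connection should apply. As a hedged illustration, assuming the cdm reference exposes CDM tables by name as in the earlier local examples:

cdm$person |> 
  head(5) |> 
  collect()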
