
Read in SAS data in parallel into Spark

Jan Wijffels

2021-04-19

This R package allows R users to easily import large SAS datasets into Spark tables in parallel.

The package uses the spark-sas7bdat Spark package to read a SAS dataset into Spark. That Spark package imports the data in parallel on the Spark cluster using the Parso library; the whole process is launched from R through sparklyr.

More information about the spark-sas7bdat Spark package can be found at https://github.com/saurfang/spark-sas7bdat; for sparklyr, see https://spark.rstudio.com.
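
Loading spark.sas7bdat registers the spark-sas7bdat dependency with sparklyr for you. For reference, a minimal sketch of how the Spark package could also be attached manually through the sparklyr configuration; the version string below is an assumption and must match your Spark/Scala version:

library(sparklyr)
config <- spark_config()
# Assumed version string - pick the release matching your Spark/Scala setup
config$sparklyr.defaultPackages <- c(config$sparklyr.defaultPackages,
                                     "saurfang:spark-sas7bdat:3.0.0-s_2.12")
sc <- spark_connect(master = "local", config = config)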

Example

The following example reads the file iris.sas7bdat in parallel into a Spark table called sas_example. Do try this with bigger data on your own cluster, and see the sparklyr documentation for how to connect to your Spark cluster.

library(sparklyr)
library(spark.sas7bdat)

# Path to the example SAS file shipped with the package
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")

# Connect to a local Spark instance and import the file as table 'sas_example'
sc <- spark_connect(master = "local")
x <- spark_read_sas(sc, path = mysasfile, table = "sas_example")
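
As a quick sanity check after the import, the sparklyr helpers sdf_nrow() and sdf_ncol() report the dimensions of the Spark table:

sdf_nrow(x)    # number of rows read from the SAS file
sdf_ncol(x)    # number of columns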

The result is a pointer to a Spark table, which can be used in further dplyr statements. These are translated to Spark SQL and executed in parallel on the Spark cluster.

library(dplyr)
library(magrittr)
# Aggregate per species; the computation runs inside Spark, not in R
x %>% group_by(Species) %>%
  summarise(count = n(), length = mean(Sepal_Length), width = mean(Sepal_Width))
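
To bring a (small) aggregated result back into R as a regular data.frame, end the pipeline with collect(), and close the connection with spark_disconnect() when you are done. A minimal sketch:

result <- x %>%
  group_by(Species) %>%
  summarise(mean_length = mean(Sepal_Length)) %>%
  collect()    # pulls the aggregated rows into a local data.frame

spark_disconnect(sc)    # close the Spark connection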

Support in big data and Spark analysis

Need support in big data and Spark analysis? Contact BNOSAC: http://www.bnosac.be
