Introduction

stddiff.spark provides Spark-compatible implementations of the standardized difference calculations from the stddiff package. The interface is identical to stddiff, so you can swap your existing calls in place without changing your workflow.

Because Spark DataFrames do not have native factor types, categorical variables are encoded using alphabetic ordering: the first level alphabetically becomes 0, the second becomes 1, and so on. This ensures consistent, deterministic calculations for binary and multi-level categorical variables.
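
The encoding rule can be sketched in a few lines of base R (illustrative only; this is not a function exported by the package):

# alphabetically ordered levels receive the codes 0, 1, 2, ...
levels <- sort(unique(c("Treatment", "Control", "Placebo")))
setNames(seq_along(levels) - 1, levels)
#   Control   Placebo Treatment
#         0         1         2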

Note: If you want to choose a specific reference category, you must update the values in your Spark DataFrame so that the desired reference level comes first alphabetically. For example:

library(dplyr)
# Suppose the original categories are "Control" and "Treatment"
spark_df <- spark_df %>%
  mutate(group = ifelse(group == "Treatment", "A_Treatment", group))

Here, prefixing “Treatment” with “A_” ensures it comes first alphabetically, making it the reference level for standardized difference calculations.

Functions automatically dispatch to the stddiff package when non-Spark data is supplied, so the same code works seamlessly on both local R data frames and Spark DataFrames.
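
A minimal sketch of this behaviour, using the example objects created in the Usage section below (assumes the stddiff package is installed):

# local data frame: handed off to stddiff
stddiff.numeric(my_data, gcol = 1, vcol = 2:4)

# Spark DataFrame: computed by the Spark implementation
stddiff.numeric(spark_df, gcol = 1, vcol = 2:4)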

Installation

CRAN

install.packages("stddiff.spark")

GitHub

# install.packages("remotes") # if you don’t have it
remotes::install_github("alicja-januszkiewicz/stddiff.spark")

Usage

library(sparklyr)
library(dplyr)
library(stddiff.spark)

# connect to Spark
sc <- spark_connect(master = "local")

# create example local data
my_data <- data.frame(
  treatment = c(1, 0, 1, 0, 1, 0),
  age       = c(34, 28, 45, 30, 50, 33),
  bmi       = c(22.1, 24.3, 27.8, 23.5, 28.2, 25.0),
  weight    = c(70, 65, 85, 68, 90, 72)
)

# copy data to Spark
spark_df <- copy_to(sc, my_data, overwrite = TRUE)

# compute standardized differences for the numeric variables
# gcol = column index of the group variable; vcol = column indices of the variables to compare
stddiff.numeric(spark_df, gcol = 1, vcol = 2:4)
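
# stddiff also provides stddiff.binary() and stddiff.category(); assuming
# stddiff.spark mirrors those names, categorical variables would be handled
# the same way (hypothetical column indices shown, not part of my_data):
# stddiff.binary(spark_df, gcol = 1, vcol = 5)
# stddiff.category(spark_df, gcol = 1, vcol = 6)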

# disconnect Spark
spark_disconnect(sc)
