stddiff.spark provides Spark-compatible implementations of the standardized difference calculations from the stddiff package. The interface is identical to stddiff, so you can swap your existing calls in place without changing your workflow.
Because Spark DataFrames do not have native factor types, categorical variables are encoded using alphabetic ordering: the first level alphabetically becomes 0, the second becomes 1, and so on. This ensures consistent, deterministic calculations for binary and multi-level categorical variables.
> [!NOTE]
> If you want to choose a specific reference category, you must update the values in your Spark DataFrame so that the desired reference level comes first alphabetically. For example:
>
> ```r
> library(dplyr)
>
> # Suppose the original categories are "Control" and "Treatment"
> spark_df <- spark_df %>%
>   mutate(group = ifelse(group == "Treatment", "A_Treatment", group))
> ```
>
> Here, prefixing “Treatment” with “A_” ensures it comes first alphabetically, making it the reference level for standardized difference calculations.
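With the reference level controlled this way, categorical columns can be compared like any other variable. A minimal sketch, assuming stddiff.spark mirrors stddiff's `stddiff.binary()` function with the same positional `gcol`/`vcol` arguments (the column names and data here are purely illustrative):

```r
library(sparklyr)
library(stddiff.spark)

sc <- spark_connect(master = "local")

demo_data <- data.frame(
  treatment = c(1, 0, 1, 0, 1, 0),
  sex = c("Female", "Male", "Male", "Female", "Male", "Female")
)
demo_df <- copy_to(sc, demo_data, overwrite = TRUE)

# "Female" sorts before "Male", so it is encoded as 0 and "Male" as 1
stddiff.binary(demo_df, gcol = 1, vcol = 2)

spark_disconnect(sc)
```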
Functions automatically dispatch to the stddiff package
when non-Spark data is supplied, so the same code works seamlessly on
both local R data frames and Spark DataFrames.
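A minimal sketch of this dispatch, assuming the stddiff package is installed and using the same positional `gcol`/`vcol` arguments shown in the usage example below:

```r
library(stddiff.spark)

# a small local data frame (illustrative data)
local_df <- data.frame(
  treatment = c(1, 0, 1, 0),
  age = c(34, 28, 45, 30)
)

# non-Spark input: falls back to the stddiff implementation
stddiff.numeric(local_df, gcol = 1, vcol = 2)
```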
install.packages("stddiff.spark")# install.packages("remotes") # if you don’t have it
remotes::install_github("alicja-januszkiewicz/stddiff.spark")library(sparklyr)
library(dplyr)
library(stddiff.spark)
# connect to Spark
sc <- spark_connect(master = "local")
# create example local data
my_data <- data.frame(
  treatment = c(1, 0, 1, 0, 1, 0),
  age = c(34, 28, 45, 30, 50, 33),
  bmi = c(22.1, 24.3, 27.8, 23.5, 28.2, 25.0),
  weight = c(70, 65, 85, 68, 90, 72)
)
# copy data to Spark
spark_df <- copy_to(sc, my_data, overwrite = TRUE)
# compute standardized differences for numeric variables
stddiff.numeric(spark_df, gcol = 1, vcol = 2:4)
# disconnect Spark
spark_disconnect(sc)
```