split and combine_smsm

Introduction

The Introduction to synthACS briefly mentions split and combine_smsm in Sections 3.2 and 3.4, respectively. There, we note that deriving the sample synthetic micro data is a memory-intensive process and advise running synthACS on a high-performance machine. Such a machine is not always available, and that is when split and combine_smsm are needed.

A brief illustration of these two functions is provided in this vignette. The same example data is used as in the introductory vignette:

library(data.table)
library(acs)
library(synthACS)
library(retry)

ca_geo <- geo.make(state = "CA", county = "*")
ca_dat_SMSM <- pull_synth_data(2014, 5, ca_geo)

Overview of `split()` and `combine_smsm()`

The split and combine_smsm functions are used, respectively, to break a large spatial microsimulation task into a set of smaller, less demanding tasks and to recombine the results. They enable the well-known “split-apply-combine” strategy for data analysis (Wickham 2011). In this case, the “apply” step is intentionally performed sequentially, and not inside another function, in order to minimize RAM usage and to allow a garbage-collection step between intensive in-memory function calls.

The syntax for both is straightforward:

split(<object>, n_splits= N)
combine_smsm(<object1>, <object2>, ..., <objectk>)

split takes a larger macroACS class object and splits it into n_splits smaller macroACS objects. Similarly, combine_smsm takes several smaller smsm_set objects and combines them into a single, larger smsm_set class object.
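
The split-apply-combine pattern described in this section can be sketched in base R on a toy data frame. This is only an illustration of the pattern, not of synthACS internals: the data frame `geogs`, the chunking rule, and the placeholder computation are all assumptions made for the example, and base `split()` on a data frame stands in for the macroACS method.

```r
# Toy stand-in for a large input: 12 "geographies" with one value each.
# (Hypothetical data; a real workflow would start from a macroACS object.)
geogs <- data.frame(id = 1:12, value = (1:12) * 10)

# Split: assign each row to one of n_chunks roughly equal groups.
n_chunks <- 3
chunk_id <- sort(rep(seq_len(n_chunks), length.out = nrow(geogs)))
chunks <- split(geogs, chunk_id)

# Apply: process one chunk at a time in a plain loop, so that only one
# chunk's intermediate objects exist at any moment.
results <- vector("list", length = n_chunks)
for (i in seq_len(n_chunks)) {
  tmp <- chunks[[i]]
  tmp$scaled <- tmp$value / sum(tmp$value)  # placeholder for the heavy step
  results[[i]] <- tmp
  rm(tmp)
  invisible(gc())  # release memory before the next chunk
}

# Combine: stack the per-chunk results back into a single object.
combined <- do.call(rbind, results)
```

Keeping the loop at the top level, rather than wrapping it in `lapply()` or another function, mirrors the design choice above: intermediate objects can be removed and garbage-collected before the next chunk is processed.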

Example Code

An example of this is provided below:

# split()
n_splits <- 20
split_ca_dat <- split(ca_dat_SMSM, n_splits = n_splits)
tmp_opts <- vector("list", length= n_splits)

for (i in 1:n_splits) {
    # Section 3.3 of introduction: SMSM via simulated annealing
    # derive synthetic datasets  
    tmp_synth <- derive_synth_datasets(split_ca_dat[[i]], leave_cores = 0)
    
    # create constraints for simulated annealing
    a <- all_geog_constraint_age(tmp_synth, method = "macro.table")
    g <- all_geog_constraint_gender(tmp_synth, method = "macro.table")
    m <- all_geog_constraint_marital_status(tmp_synth, method = "macro.table")
    r <- all_geog_constraint_race(tmp_synth, method = "synthetic")
    e <- all_geog_constraint_edu(tmp_synth, method = "synthetic")
    
    cll <- all_geogs_add_constraint(attr_name = "age", attr_total_list = a, 
                                    macro_micro = tmp_synth)
    cll <- all_geogs_add_constraint(attr_name = "gender", attr_total_list = g, 
                                    macro_micro = tmp_synth, constraint_list_list = cll)
    cll <- all_geogs_add_constraint(attr_name = "marital_status", attr_total_list = m, 
                                    macro_micro = tmp_synth, constraint_list_list = cll)
    cll <- all_geogs_add_constraint(attr_name = "race", attr_total_list = r, 
                                    macro_micro = tmp_synth, constraint_list_list = cll)
    cll <- all_geogs_add_constraint(attr_name = "edu_attain", attr_total_list = e, 
                                    macro_micro = tmp_synth, constraint_list_list = cll)
    
    # anneal
    tmp_opts[[i]] <- all_geog_optimize_microdata(tmp_synth, seed = 6550L, verbose = TRUE,
                                          constraint_list_list = cll, p_accept = 0.4, max_iter = 10000L)
}

# create the string needed for combine_smsm().
# paste0() has no `sep` argument, so the separator is supplied via `collapse`
paste0("tmp_opts[[", 1:n_splits, "]]", collapse= ", ")
# [1] "tmp_opts[[1]], tmp_opts[[2]], tmp_opts[[3]], tmp_opts[[4]], tmp_opts[[5]], 
# tmp_opts[[6]], tmp_opts[[7]], tmp_opts[[8]], tmp_opts[[9]], tmp_opts[[10]], 
# tmp_opts[[11]], tmp_opts[[12]], tmp_opts[[13]], tmp_opts[[14]], tmp_opts[[15]], 
# tmp_opts[[16]], tmp_opts[[17]], tmp_opts[[18]], tmp_opts[[19]], tmp_opts[[20]]"

# copy and paste the resulting string into the call to combine_smsm()
opt_ca <- combine_smsm(tmp_opts[[1]], tmp_opts[[2]], tmp_opts[[3]], tmp_opts[[4]], tmp_opts[[5]], 
                       tmp_opts[[6]], tmp_opts[[7]], tmp_opts[[8]], tmp_opts[[9]], tmp_opts[[10]], 
                       tmp_opts[[11]], tmp_opts[[12]], tmp_opts[[13]], tmp_opts[[14]], 
                       tmp_opts[[15]], tmp_opts[[16]], tmp_opts[[17]], tmp_opts[[18]], 
                       tmp_opts[[19]], tmp_opts[[20]])
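
As a possible alternative to writing out every argument by hand, base R's `do.call()` passes each element of a list as a separate argument to a function that accepts `...`. The sketch below demonstrates this with a hypothetical stand-in, `combine_demo()`, and toy inputs. Whether `do.call(combine_smsm, tmp_opts)` behaves the same way depends on how combine_smsm captures its arguments, which is not confirmed here, so treat that as an assumption to verify before relying on it.

```r
# Hypothetical stand-in for a variadic combiner such as combine_smsm():
# it collects its inputs from `...` and concatenates them into one list.
combine_demo <- function(...) {
  parts <- list(...)
  unlist(parts, recursive = FALSE)
}

n_demo <- 4
# Toy stand-in for the per-split results stored in tmp_opts.
tmp_demo <- lapply(seq_len(n_demo), function(i) list(paste0("geog_", i)))

# Each list element becomes one argument; no copy and paste is needed.
combined_demo <- do.call(combine_demo, tmp_demo)
```

Here `combined_demo` is a single list holding one entry per toy split, in the original order.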

References

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. https://www.jstatsoft.org/v40/i01/.

