
Collapsing

2019-12-30

In some situations, you may want to use encodefrom() to collapse values, that is, group unique raw values into a smaller set of clean values / labels. For example, say you have the following data set, which gives each state’s census division number and name:

Data

id  state  cendiv  cendiv_name
 1  AL     6       East South Central
 2  AK     9       Pacific
 3  AZ     8       Mountain
 4  AR     7       West South Central
 5  CA     9       Pacific
 6  CO     8       Mountain
 7  CT     1       New England
 8  DE     5       South Atlantic
10  FL     5       South Atlantic
12  HI     9       Pacific
14  IL     3       East North Central
15  IN     3       East North Central
16  IA     4       West North Central
31  NJ     2       Middle Atlantic
33  NY     2       Middle Atlantic

Instead of using the nine census divisions, you would rather group states by the four census regions. You have the following crosswalk, which maps each division to its region number and region name:

Crosswalk

cendiv  cenreg  cenregnm
1       1       Northeast
2       1       Northeast
3       2       Midwest
4       2       Midwest
5       3       South
6       3       South
7       3       South
8       4       West
9       4       West

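As long as

  1. raw values are unique in the crosswalk, and
  2. clean and label columns have a 1:1 match (each clean value pairs with exactly one label, and each label with exactly one clean value),

you can use encodefrom() to collapse categories as you move from raw to clean values. Several raw values may share the same clean value; that sharing is what collapses the categories.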
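Both conditions can be checked before encoding. The following is a minimal base R sketch, not part of crosswalkr: it rebuilds the crosswalk from the table above as a plain data.frame, and the names raw_unique, pairs, and one_to_one are made up for illustration.

```r
## crosswalk values copied from the table above (plain data.frame, base R only)
cw <- data.frame(cendiv = 1:9,
                 cenreg = c(1,1,2,2,3,3,3,4,4),
                 cenregnm = c('Northeast','Northeast','Midwest','Midwest',
                              'South','South','South','West','West'))

## condition 1: each raw value (cendiv) appears only once
## anyDuplicated() returns 0 when there are no duplicates
raw_unique <- !anyDuplicated(cw$cendiv)

## condition 2: clean values and labels pair off 1:1
## keep the distinct (clean, label) pairs, then make sure neither
## column repeats within those pairs
pairs <- unique(cw[, c('cenreg', 'cenregnm')])
one_to_one <- !anyDuplicated(pairs$cenreg) && !anyDuplicated(pairs$cenregnm)

raw_unique
one_to_one
```

If either value is FALSE, the crosswalk likely needs cleaning before it is passed to encodefrom().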

library(crosswalkr)
library(dplyr)
library(haven)
## data
df <- tibble(id = c(1:8,10,12,14:16,31,33),
             state = c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','HI',
                       'IL','IN','IA','NJ','NY'),
             cendiv = c(6,9,8,7,9,8,1,5,5,9,3,3,4,2,2),
             cendiv_name = c('East South Central','Pacific','Mountain',
                             'West South Central','Pacific','Mountain','New England',
                             'South Atlantic','South Atlantic','Pacific',
                             'East North Central','East North Central',
                             'West North Central','Middle Atlantic','Middle Atlantic'))
             
## crosswalk
cw <- tibble(cendiv = 1:9,
             cenreg = c(1,1,2,2,3,3,3,4,4),
             cenregnm = c('Northeast','Northeast','Midwest','Midwest',
                          'South','South','South','West','West'))
## encode new column
df <- df %>%
    mutate(cenreg = encodefrom(., var = cendiv, cw_file = cw, raw = cendiv,
                               clean = cenreg, label = cenregnm))
df
## # A tibble: 15 x 5
##       id state cendiv cendiv_name               cenreg
##    <dbl> <chr>  <dbl> <chr>                  <dbl+lbl>
##  1     1 AL         6 East South Central 3 [South]    
##  2     2 AK         9 Pacific            4 [West]     
##  3     3 AZ         8 Mountain           4 [West]     
##  4     4 AR         7 West South Central 3 [South]    
##  5     5 CA         9 Pacific            4 [West]     
##  6     6 CO         8 Mountain           4 [West]     
##  7     7 CT         1 New England        1 [Northeast]
##  8     8 DE         5 South Atlantic     3 [South]    
##  9    10 FL         5 South Atlantic     3 [South]    
## 10    12 HI         9 Pacific            4 [West]     
## 11    14 IL         3 East North Central 2 [Midwest]  
## 12    15 IN         3 East North Central 2 [Midwest]  
## 13    16 IA         4 West North Central 2 [Midwest]  
## 14    31 NJ         2 Middle Atlantic    1 [Northeast]
## 15    33 NY         2 Middle Atlantic    1 [Northeast]
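
To see what the collapse amounts to, the same lookup can be written by hand with match(). This is only a conceptual base R sketch under the assumption that a plain lookup is enough for illustration; it is not how encodefrom() is implemented and it returns ordinary vectors rather than a labelled column. The names idx, cenreg, and cenregnm below are local to this example.

```r
## raw values from the data set above
cendiv <- c(6,9,8,7,9,8,1,5,5,9,3,3,4,2,2)

## crosswalk (same values as cw above, as a plain data.frame)
cw <- data.frame(cendiv = 1:9,
                 cenreg = c(1,1,2,2,3,3,3,4,4),
                 cenregnm = c('Northeast','Northeast','Midwest','Midwest',
                              'South','South','South','West','West'))

## find each raw value's row in the crosswalk
idx <- match(cendiv, cw$cendiv)

## pull the clean value and its label from that row
cenreg <- cw$cenreg[idx]
cenregnm <- cw$cenregnm[idx]

## nine possible divisions collapse to four regions
table(cenregnm)
```

The tabulation is a quick way to confirm that every row received a region and that no raw value was left unmatched (unmatched values would show up as NA in idx).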
