Collapsing

2025-04-08

In some situations, you may want to use encodefrom() to collapse values, that is, to group unique raw values into a smaller set of clean values and labels. For example, say you have the following data set, which gives the census division number and name for a sample of states:

Data

id	state	cendiv	cendiv_name
1	AL	6	East South Central
2	AK	9	Pacific
3	AZ	8	Mountain
4	AR	7	West South Central
5	CA	9	Pacific
6	CO	8	Mountain
7	CT	1	New England
8	DE	5	South Atlantic
10	FL	5	South Atlantic
12	HI	9	Pacific
14	IL	3	East North Central
15	IN	3	East North Central
16	IA	4	West North Central
31	NJ	2	Middle Atlantic
33	NY	2	Middle Atlantic

Instead of using the nine census divisions, you would rather group states by census region. You have the following crosswalk, which maps each division to its region:

Crosswalk

cendiv	cenreg	cenregnm
1	1	Northeast
2	1	Northeast
3	2	Midwest
4	2	Midwest
5	3	South
6	3	South
7	3	South
8	4	West
9	4	West

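You can use encodefrom() to collapse categories as you move from raw to clean values, as long as two conditions hold: (1) each raw value appears only once in the crosswalk, and (2) the clean and label columns have a 1:1 match, so that every clean value pairs with exactly one label and every label with exactly one clean value.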
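Before encoding, it can help to confirm that a crosswalk meets both conditions. The following is a minimal base R sketch, not part of crosswalkr; the object names raw_unique, pairs, and one_to_one are chosen here for illustration only.

```r
## crosswalk from the example above, rebuilt so the block runs on its own
cw <- data.frame(cendiv = 1:9,
                 cenreg = c(1,1,2,2,3,3,3,4,4),
                 cenregnm = c('Northeast','Northeast','Midwest','Midwest',
                              'South','South','South','West','West'),
                 stringsAsFactors = FALSE)

## condition 1: no raw value is repeated in the crosswalk
raw_unique <- !any(duplicated(cw$cendiv))

## condition 2: among the distinct (clean, label) pairs, no clean value
## and no label appears more than once
pairs <- unique(cw[, c('cenreg', 'cenregnm')])
one_to_one <- !any(duplicated(pairs$cenreg)) &&
    !any(duplicated(pairs$cenregnm))

## stop with an error if either condition fails
stopifnot(raw_unique, one_to_one)
```

If the crosswalk listed a division twice, or gave the same region number two different names, stopifnot() would halt before any encoding took place.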

library(crosswalkr)
library(dplyr)
library(haven)

## data
df <- tibble(id = c(1:8,10,12,14:16,31,33),
             state = c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','HI',
                       'IL','IN','IA','NJ','NY'),
             cendiv = c(6,9,8,7,9,8,1,5,5,9,3,3,4,2,2),
             cendiv_name = c('East South Central','Pacific','Mountain',
                             'West South Central','Pacific','Mountain','New England',
                             'South Atlantic','South Atlantic','Pacific',
                             'East North Central','East North Central',
                             'West North Central','Middle Atlantic','Middle Atlantic'))
             
## crosswalk
cw <- tibble(cendiv = 1:9,
             cenreg = c(1,1,2,2,3,3,3,4,4),
             cenregnm = c('Northeast','Northeast','Midwest','Midwest',
                          'South','South','South','West','West'))

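## encode new column: look up each cendiv value in the crosswalk's raw column,
## return the matching region code (clean), and attach region names (label)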
df <- df %>%
    mutate(cenreg = encodefrom(., var = cendiv, cw_file = cw, raw = cendiv,
                               clean = cenreg, label = cenregnm))

df

## # A tibble: 15 × 5
##       id state cendiv cendiv_name        cenreg       
##    <dbl> <chr>  <dbl> <chr>              <dbl+lbl>    
##  1     1 AL         6 East South Central 3 [South]    
##  2     2 AK         9 Pacific            4 [West]     
##  3     3 AZ         8 Mountain           4 [West]     
##  4     4 AR         7 West South Central 3 [South]    
##  5     5 CA         9 Pacific            4 [West]     
##  6     6 CO         8 Mountain           4 [West]     
##  7     7 CT         1 New England        1 [Northeast]
##  8     8 DE         5 South Atlantic     3 [South]    
##  9    10 FL         5 South Atlantic     3 [South]    
## 10    12 HI         9 Pacific            4 [West]     
## 11    14 IL         3 East North Central 2 [Midwest]  
## 12    15 IN         3 East North Central 2 [Midwest]  
## 13    16 IA         4 West North Central 2 [Midwest]  
## 14    31 NJ         2 Middle Atlantic    1 [Northeast]
## 15    33 NY         2 Middle Atlantic    1 [Northeast]
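Conceptually, the collapse is a lookup of each raw value in the crosswalk. The sketch below reproduces the region codes with base R's match(). It is meant only as an illustration of the many-to-one mapping, not as a description of how encodefrom() is implemented, and it returns plain vectors rather than a labelled column.

```r
## raw division codes from the example data, in row order
cendiv <- c(6,9,8,7,9,8,1,5,5,9,3,3,4,2,2)

## crosswalk from the example above
cw <- data.frame(cendiv = 1:9,
                 cenreg = c(1,1,2,2,3,3,3,4,4),
                 cenregnm = c('Northeast','Northeast','Midwest','Midwest',
                              'South','South','South','West','West'),
                 stringsAsFactors = FALSE)

## position of each raw value in the crosswalk (NA if a value is missing)
idx <- match(cendiv, cw$cendiv)

## pull the clean value and its label from the matched rows
cenreg <- cw$cenreg[idx]
cenregnm <- cw$cenregnm[idx]

## count states per region to see the nine divisions collapse to four
table(cenregnm)
```

With this data the fifteen states fall into four regions: three each in the Northeast and Midwest, four in the South, and five in the West. A raw value absent from the crosswalk would come back as NA in this sketch.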
