
Using clean_strings

library(fedmatch)

Using clean_strings

clean_strings is the way to prepare strings for name matching, either within tier_match (see the Using-tier-match vignette) or as a standalone function. It has several options that allow for flexible string cleaning.

Here’s the example string we’ll be using:

name_vec <- corp_data1[, Company]
name_vec
#>  [1] "Walmart"            "Bershire Hataway"   "Apple"             
#>  [4] "Exxon Mobile"       "McKesson "          "UnitedHealth Group"
#>  [7] "CVS Health"         "General Motors"     "AT&T"              
#> [10] "Ford Motor Company"

First, we can use the basic string cleaning defaults:

clean_strings(name_vec)
#>  [1] "walmart"            "bershire hataway"   "apple"             
#>  [4] "exxon mobile"       "mckesson"           "unitedhealth group"
#>  [7] "cvs health"         "general motors"     "atandt"            
#> [10] "ford motor company"

Without any additional arguments, clean_strings does the following:

- makes everything lowercase
- replaces the special characters in sp_char_words with their word equivalents (for example, ‘&’ becomes ‘and’, which is why ‘AT&T’ became ‘atandt’ above)
- removes punctuation
- removes extra white space (note that the trailing space in ‘McKesson ’ was dropped)
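
As a quick illustration of those defaults on a messier string (a sketch; the output shown is what we would expect from the rules above):

clean_strings("  Smith & Sons' Grocery  ")
#> [1] "smith and sons grocery"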

Then, we have a few different options we can use.

sp_char_words

sp_char_words is a data.frame with 2 columns: the first column holds the symbols to replace, and the second holds their replacements. fedmatch has a built-in set of symbols:

print(sp_char_words)
#>    character replacement
#>       <char>      <char>
#> 1:       \\&         and
#> 2:       \\$      dollar
#> 3:       \\%     percent
#> 4:       \\@          at

But, you can supply any data.frame you’d like, to make whatever replacements you need:

new_sp_char <- data.table::data.table(character = c("o"), replacement = c("apple"))
clean_strings(name_vec, sp_char_words = new_sp_char)
#>  [1] "walmart"                            "bershire hataway"                  
#>  [3] "apple"                              "exxapplen mapplebile"              
#>  [5] "mckessapplen"                       "unitedhealth grappleup"            
#>  [7] "cvs health"                         "general mappletapplers"            
#>  [9] "at t"                               "fapplerd mappletappler capplempany"
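
If you want to keep the built-in substitutions and simply add to them, one option (a sketch; the ‘+’ entry is hypothetical, and it assumes the built-in table is the exported sp_char_words shown above, with its first column treated as a regex pattern) is to append a row:

my_sp_char <- rbind(sp_char_words,
                    data.table::data.table(character = "\\+", replacement = "plus"))
clean_strings("A+B Industries", sp_char_words = my_sp_char)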

common_words

common_words is similar, but it respects word boundaries (so that, for example, expanding ‘corp’ to ‘corporation’ only affects ‘corp’ as a standalone word, not as part of a longer word). fedmatch has a built-in set of 54 words and their replacements:

print(corporate_words[1:5])
#>      abbr     long.names
#>    <char>         <char>
#> 1:  accep     acceptance
#> 2:   amer        america
#> 3:  assoc     associates
#> 4:     cl company listed
#> 5:  cmnty      community

But, you can use whatever words you’d like:

clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "almart"),
                                                              replacement = c("bananas", "oranges")))
#>  [1] "walmart"            "bershire hataway"   "apple"             
#>  [4] "exxon mobile"       "mckesson"           "unitedhealth group"
#>  [7] "cvs health"         "bananas motors"     "atandt"            
#> [10] "ford motor company"

(bananas motors sounds like a lovely place to work). Note that the ‘almart’ in ‘walmart’ didn’t get replaced, because common_words respects word boundaries.
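
You can also pass the built-in corporate_words table itself as the common_words argument to expand common corporate abbreviations (a sketch, assuming common_words accepts any two-column table of words and their replacements, as corporate_words is). Given the replacements shown above, ‘assoc’ should become ‘associates’ and ‘amer’ should become ‘america’:

clean_strings(c("Smith Assoc", "Amer Motors"), common_words = corporate_words)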

You can also use a related function, word_frequency, to look for the most common strings in your data:

word_frequency(sample(c("hi", "Hello", "bye    "), 1e4, replace = TRUE))
#>      Word Count
#>    <char> <int>
#> 1:  hello  3376
#> 2:    bye  3323
#> 3:     hi  3301
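
On real data, one natural workflow (a sketch) is to clean your names first and then inspect the most frequent words; the result can help you decide what to put in common_words or remove_words. With only ten names here most counts will just be 1, but on a full dataset the frequent words stand out:

name_freq <- word_frequency(clean_strings(name_vec))
name_freq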

Remove characters and words

remove_words is a boolean that lets you simply remove the words in common_words rather than replacing them, and remove_char lets you specify a set of characters (rather than words) to remove:

clean_strings(name_vec, sp_char_words = new_sp_char, remove_char = c("a", "c"))
#>  [1] "w lm rt"                           "bershire h t w y"                 
#>  [3] "pple"                              "exxapplen mapplebile"             
#>  [5] "m kessapplen"                      "unitedhe lth grappleup"           
#>  [7] "vs he lth"                         "gener l mappletapplers"           
#>  [9] "t t"                               "fapplerd mappletappler applemp ny"
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "company"),
                                                              replacement = c("bananas", "oranges")),
              remove_words = TRUE)
#>  [1] "walmart"            "bershire hataway"   "apple"             
#>  [4] "exxon mobile"       "mckesson"           "unitedhealth group"
#>  [7] "cvs health"         "motors"             "atandt"            
#> [10] "ford motor"

stem

stem is a boolean that lets you stem words, using SnowballC::wordStem. ‘stemming’ words means removing common suffixes:

clean_strings(c("call", "calling", "called"), stem = TRUE)
#> [1] "call" "call" "call"

See the documentation in SnowballC::wordStem for details.
