clean_strings prepares strings for name matching, including for use within tier_match (see the Using-tier-match vignette). Its arguments let you customize which characters and words are replaced or removed.
Here’s the example vector of company names we’ll be using:
name_vec <- corp_data1[, Company]
name_vec
#> [1] "Walmart" "Bershire Hataway" "Apple"
#> [4] "Exxon Mobile" "McKesson " "UnitedHealth Group"
#> [7] "CVS Health" "General Motors" "AT&T"
#> [10] "Ford Motor Company"
First, we can use the basic string cleaning defaults:
clean_strings(name_vec)
#> [1] "walmart" "bershire hataway" "apple"
#> [4] "exxon mobile" "mckesson" "unitedhealth group"
#> [7] "cvs health" "general motors" "atandt"
#> [10] "ford motor company"
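Without any additional arguments, clean_strings does the following, judging by the output above: it converts everything to lowercase, replaces special characters with words (the & in ‘AT&T’ becomes ‘and’), and strips stray whitespace (the trailing space in ‘McKesson ’ is gone).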
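The default behavior visible in the output above (lowercasing, spelling out &, trimming whitespace) can be approximated in base R. This is only a sketch inferred from that output, not fedmatch's actual implementation: clean_basic is a hypothetical helper name, and turning any other symbol into a space is an assumption.

```r
# Hypothetical helper approximating the default cleaning steps seen above.
# This is NOT fedmatch's implementation, only a base R illustration.
clean_basic <- function(x) {
  x <- tolower(x)                         # make everything lowercase
  x <- gsub("&", "and", x, fixed = TRUE)  # spell out the ampersand
  x <- gsub("[^a-z0-9 ]", " ", x)         # assumption: other symbols become spaces
  x <- gsub(" +", " ", x)                 # collapse repeated spaces
  trimws(x)                               # drop leading/trailing whitespace
}

cleaned <- clean_basic(c("AT&T", "McKesson ", "Ford Motor Company"))
print(cleaned)
```

Here ‘AT&T’ comes back as ‘atandt’ and ‘McKesson ’ as ‘mckesson’, matching the clean_strings output for those names.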
Beyond the defaults, there are a few options we can use.
sp_char_words is a data.frame with two columns: the first column holds the symbols to replace, and the second holds their replacements. fedmatch has a built-in set of symbols:
print(sp_char_words)
#> character replacement
#> <char> <char>
#> 1: \\& and
#> 2: \\$ dollar
#> 3: \\% percent
#> 4: \\@ at
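You can also supply any data.frame you like to make whatever replacements you want. Note that the custom table appears to replace the built-in one, since & is no longer mapped to ‘and’ in the output below: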
new_sp_char <- data.table::data.table(character = c("o"), replacement = c("apple"))
clean_strings(name_vec, sp_char_words = new_sp_char)
#> [1] "walmart" "bershire hataway"
#> [3] "apple" "exxapplen mapplebile"
#> [5] "mckessapplen" "unitedhealth grappleup"
#> [7] "cvs health" "general mappletapplers"
#> [9] "at t" "fapplerd mappletappler capplempany"
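The replacement above ignores word boundaries, which is why every ‘o’ turns into ‘apple’. A minimal base R sketch of that idea follows. It is an illustration only: replace_chars is a hypothetical helper, and it matches literally with fixed = TRUE, whereas the built-in table stores regex-escaped symbols.

```r
# Hypothetical helper: apply each (character, replacement) pair everywhere,
# with no regard for word boundaries. Not fedmatch's implementation.
replace_chars <- function(x, table) {
  for (i in seq_len(nrow(table))) {
    # fixed = TRUE treats the symbol as a literal string, not a regex
    x <- gsub(table$character[i], table$replacement[i], x, fixed = TRUE)
  }
  x
}

new_sp_char <- data.frame(character = "o", replacement = "apple",
                          stringsAsFactors = FALSE)
replaced <- replace_chars(c("exxon mobile", "cvs health"), new_sp_char)
print(replaced)
```

‘exxon mobile’ becomes ‘exxapplen mapplebile’, as in the output above, while ‘cvs health’ contains no ‘o’ and is left alone.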
common_words is similar, but it respects word boundaries (so that, for example, ‘Corp’ is replaced with ‘Corporation’ only where it appears as a whole word). fedmatch has a built-in set of 54 words and their replacements:
print(corporate_words[1:5])
#> abbr long.names
#> <char> <char>
#> 1: accep acceptance
#> 2: amer america
#> 3: assoc associates
#> 4: cl company listed
#> 5: cmnty community
You can also use whatever words you’d like:
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "almart"),
replacement = c("bananas", "oranges")))
#> [1] "walmart" "bershire hataway" "apple"
#> [4] "exxon mobile" "mckesson" "unitedhealth group"
#> [7] "cvs health" "bananas motors" "atandt"
#> [10] "ford motor company"
(Bananas Motors sounds like a lovely place to work.) Note that the ‘almart’ in ‘walmart’ didn’t get replaced, because common_words respects word boundaries.
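Whole-word replacement of this kind can be sketched in base R with the \b word-boundary anchor. This is an assumption about how one might reproduce the behavior, not a description of fedmatch internals, and replace_words is a hypothetical helper.

```r
# Hypothetical helper: replace each word only where it stands alone.
# Not fedmatch's implementation; words are assumed to be plain letters.
replace_words <- function(x, table) {
  for (i in seq_len(nrow(table))) {
    # \\b anchors the match at word boundaries on both sides
    pattern <- paste0("\\b", table$word[i], "\\b")
    x <- gsub(pattern, table$replacement[i], x, perl = TRUE)
  }
  x
}

word_table <- data.frame(word = c("general", "almart"),
                         replacement = c("bananas", "oranges"),
                         stringsAsFactors = FALSE)
swapped <- replace_words(c("walmart", "general motors"), word_table)
print(swapped)
```

‘general motors’ becomes ‘bananas motors’, while ‘walmart’ is unchanged because ‘almart’ does not start at a word boundary.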
You can also use a related function, word_frequency, to look for the most common words in your data.
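As a rough stand-in for that kind of count, word frequencies can be tallied in base R. The sketch below does not call word_frequency and makes no claim about its arguments or output; the names in example_names are hypothetical and already cleaned.

```r
# Hypothetical, already-cleaned names used only for illustration.
example_names <- c("acme corp", "globex corp", "initech inc",
                   "acme holdings corp")

# Split each name on spaces and count how often each word occurs.
words <- unlist(strsplit(example_names, " ", fixed = TRUE))
word_counts <- sort(table(words), decreasing = TRUE)

# The most frequent words are candidates for a custom common_words table.
print(head(word_counts, 3))
```

In this toy input ‘corp’ appears three times and ‘acme’ twice, so those words come first.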
remove_words is a boolean that removes the words in common_words instead of replacing them, and remove_char is a character vector of characters to remove rather than replace.
clean_strings(name_vec, sp_char_words = new_sp_char, remove_char = c("a", "c"))
#> [1] "w lm rt" "bershire h t w y"
#> [3] "pple" "exxapplen mapplebile"
#> [5] "m kessapplen" "unitedhe lth grappleup"
#> [7] "vs he lth" "gener l mappletapplers"
#> [9] "t t" "fapplerd mappletappler applemp ny"
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "company"),
replacement = c("bananas", "oranges")),
remove_words = TRUE)
#> [1] "walmart" "bershire hataway" "apple"
#> [4] "exxon mobile" "mckesson" "unitedhealth group"
#> [7] "cvs health" "motors" "atandt"
#> [10] "ford motor"
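Both kinds of removal can be mimicked in base R. The sketch below is inferred from the outputs above rather than taken from fedmatch: squish, remove_chars, and remove_whole_words are hypothetical helpers, and replacing removed characters with a space is an assumption based on ‘walmart’ becoming ‘w lm rt’.

```r
# Collapse repeated spaces and trim the ends.
squish <- function(x) trimws(gsub(" +", " ", x))

# Hypothetical helper: blank out each listed character.
# Assumes plain letters (no regex metacharacters such as ] or ^).
remove_chars <- function(x, chars) {
  pattern <- paste0("[", paste(chars, collapse = ""), "]")
  squish(gsub(pattern, " ", x))
}

# Hypothetical helper: delete whole words only, using \\b word boundaries.
remove_whole_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  squish(gsub(pattern, "", x, perl = TRUE))
}

chars_removed <- remove_chars("walmart", c("a", "c"))
words_removed <- remove_whole_words(c("general motors", "ford motor company"),
                                    c("general", "company"))
print(chars_removed)
print(words_removed)
```

This gives ‘w lm rt’ for the character case, and ‘motors’ and ‘ford motor’ for the word case, in line with the clean_strings results above.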