The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Example of stringdist_inner_join: Correcting misspellings against a dictionary

David Robinson

2020-05-14

Often you find yourself with a set of words that you want to combine with a “dictionary”- it could be a literal dictionary (as in this case) or a domain-specific category system. But you want to allow for small differences in spelling or punctuation.

The fuzzyjoin package comes with a set of common misspellings (from Wikipedia):

library(dplyr)
library(fuzzyjoin)
data(misspellings)

misspellings
## # A tibble: 4,505 x 2
##    misspelling correct   
##    <chr>       <chr>     
##  1 abandonned  abandoned 
##  2 aberation   aberration
##  3 abilties    abilities 
##  4 abilty      ability   
##  5 abondon     abandon   
##  6 abbout      about     
##  7 abotu       about     
##  8 abouta      about a   
##  9 aboutit     about it  
## 10 aboutthe    about the 
## # … with 4,495 more rows
# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus.
library(qdapDictionaries)
words <- tbl_df(DICTIONARY)
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
words
## # A tibble: 20,137 x 2
##    word  syllables
##    <chr>     <dbl>
##  1 hm            1
##  2 hmm           1
##  3 hmmm          1
##  4 hmph          1
##  5 mmhmm         2
##  6 mmhm          2
##  7 mm            1
##  8 mmm           1
##  9 mmmm          1
## 10 pff           1
## # … with 20,127 more rows

As an example, we’ll pick 1000 of these words (you could try it on all of them though), and use stringdist_inner_join to join them against our dictionary.

set.seed(2016)
sub_misspellings <- misspellings %>%
  sample_n(1000)
joined <- sub_misspellings %>%
  stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 1)

By default, stringdist_inner_join uses optimal string alignment (Damerau–Levenshtein distance), and we’re setting a maximum distance of 1 for a join. Notice that they’ve been joined in cases where misspelling is close to (but not equal to) word:

joined
## # A tibble: 760 x 4
##    misspelling correct    word       syllables
##    <chr>       <chr>      <chr>          <dbl>
##  1 cyclinder   cylinder   cylinder           3
##  2 beastiality bestiality bestiality         5
##  3 affilate    affiliate  affiliate          4
##  4 supress     suppress   suppress           2
##  5 intevene    intervene  intervene          3
##  6 resaurant   restaurant restaurant         3
##  7 univesity   university university         5
##  8 allegedely  allegedly  allegedly          4
##  9 emiting     emitting   smiting            2
## 10 probaly     probably   probably           3
## # … with 750 more rows

Note that there are some redundancies; words that could be multiple items in the dictionary. These end up with one row per “guess” in the output. How many words did we classify?

joined %>%
  count(misspelling, correct)
## # A tibble: 462 x 3
##    misspelling correct         n
##    <chr>       <chr>       <int>
##  1 abilty      ability         1
##  2 accademic   academic        1
##  3 accademy    academy         1
##  4 accension   accession       2
##  5 acceptence  acceptance      1
##  6 acedemic    academic        1
##  7 achive      achieve         4
##  8 acommodate  accommodate     1
##  9 acuracy     accuracy        1
## 10 addmission  admission       1
## # … with 452 more rows

So we found a match in the dictionary for about half of the misspellings. In how many of the ones we classified did we get at least one of our guesses right?

which_correct <- joined %>%
  group_by(misspelling, correct) %>%
  summarize(guesses = n(), one_correct = any(correct == word))

which_correct
## # A tibble: 462 x 4
## # Groups:   misspelling [453]
##    misspelling correct     guesses one_correct
##    <chr>       <chr>         <int> <lgl>      
##  1 abilty      ability           1 TRUE       
##  2 accademic   academic          1 TRUE       
##  3 accademy    academy           1 TRUE       
##  4 accension   accession         2 TRUE       
##  5 acceptence  acceptance        1 TRUE       
##  6 acedemic    academic          1 TRUE       
##  7 achive      achieve           4 TRUE       
##  8 acommodate  accommodate       1 TRUE       
##  9 acuracy     accuracy          1 TRUE       
## 10 addmission  admission         1 TRUE       
## # … with 452 more rows
# percentage of guesses getting at least one right
mean(which_correct$one_correct)
## [1] 0.8246753
# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)
## [1] 290

Not bad.

Note that stringdist_inner_join is not the only function we can use. If we’re interested in including the words that we couldn’t classify, we could have used stringdist_left_join:

left_joined <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)

left_joined
## # A tibble: 1,298 x 4
##    misspelling   correct       word       syllables
##    <chr>         <chr>         <chr>          <dbl>
##  1 Sanhedrim     Sanhedrin     <NA>              NA
##  2 cyclinder     cylinder      cylinder           3
##  3 beastiality   bestiality    bestiality         5
##  4 consicousness consciousness <NA>              NA
##  5 affilate      affiliate     affiliate          4
##  6 repubicans    republicans   <NA>              NA
##  7 comitted      committed     <NA>              NA
##  8 emmisions     emissions     <NA>              NA
##  9 acquited      acquitted     <NA>              NA
## 10 decompositing decomposing   <NA>              NA
## # … with 1,288 more rows
left_joined %>%
  filter(is.na(word))
## # A tibble: 538 x 4
##    misspelling   correct       word  syllables
##    <chr>         <chr>         <chr>     <dbl>
##  1 Sanhedrim     Sanhedrin     <NA>         NA
##  2 consicousness consciousness <NA>         NA
##  3 repubicans    republicans   <NA>         NA
##  4 comitted      committed     <NA>         NA
##  5 emmisions     emissions     <NA>         NA
##  6 acquited      acquitted     <NA>         NA
##  7 decompositing decomposing   <NA>         NA
##  8 decieved      deceived      <NA>         NA
##  9 asociated     associated    <NA>         NA
## 10 commonweath   commonwealth  <NA>         NA
## # … with 528 more rows

(To get just the ones without matches immediately, we could have used stringdist_anti_join). If we increase our distance threshold, we’ll increase the fraction with a correct guess, but also get more false positive guesses:

left_joined2 <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)

left_joined2
## # A tibble: 8,721 x 4
##    misspelling   correct       word       syllables
##    <chr>         <chr>         <chr>          <dbl>
##  1 Sanhedrim     Sanhedrin     <NA>              NA
##  2 cyclinder     cylinder      cylinder           3
##  3 beastiality   bestiality    bestiality         5
##  4 consicousness consciousness <NA>              NA
##  5 affilate      affiliate     affiliate          4
##  6 repubicans    republicans   <NA>              NA
##  7 comitted      committed     committee          3
##  8 emmisions     emissions     <NA>              NA
##  9 acquited      acquitted     acquire            2
## 10 acquited      acquitted     acquit             2
## # … with 8,711 more rows
left_joined2 %>%
  filter(is.na(word))
## # A tibble: 286 x 4
##    misspelling   correct        word  syllables
##    <chr>         <chr>          <chr>     <dbl>
##  1 Sanhedrim     Sanhedrin      <NA>         NA
##  2 consicousness consciousness  <NA>         NA
##  3 repubicans    republicans    <NA>         NA
##  4 emmisions     emissions      <NA>         NA
##  5 commonweath   commonwealth   <NA>         NA
##  6 supressed     suppressed     <NA>         NA
##  7 aproximately  approximately  <NA>         NA
##  8 Missisippi    Mississippi    <NA>         NA
##  9 lazyness      laziness       <NA>         NA
## 10 constituional constitutional <NA>         NA
## # … with 276 more rows

Most of the missing words here simply aren’t in our dictionary.

You can try other distance thresholds, other dictionaries, and other distance metrics (see stringdist-metrics for more). This function is especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of responses.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.