Example of stringdist_inner_join: Correcting misspellings against a dictionary

David Robinson

2017-06-19

Often you find yourself with a set of words that you want to combine with a “dictionary”: it could be a literal dictionary (as in this case) or a domain-specific category system. In either case, you want to allow for small differences in spelling or punctuation.

The fuzzyjoin package comes with a set of common misspellings (from Wikipedia):

library(dplyr)
library(fuzzyjoin)
data(misspellings)

misspellings
## # A tibble: 4,505 x 2
##    misspelling    correct
##          <chr>      <chr>
##  1  abandonned  abandoned
##  2   aberation aberration
##  3    abilties  abilities
##  4      abilty    ability
##  5     abondon    abandon
##  6      abbout      about
##  7       abotu      about
##  8      abouta    about a
##  9     aboutit   about it
## 10    aboutthe  about the
## # ... with 4,495 more rows
# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus.
library(qdapDictionaries)
words <- as_tibble(DICTIONARY)

words
## # A tibble: 20,137 x 2
##     word syllables
##  * <chr>     <dbl>
##  1    hm         1
##  2   hmm         1
##  3  hmmm         1
##  4  hmph         1
##  5 mmhmm         2
##  6  mmhm         2
##  7    mm         1
##  8   mmm         1
##  9  mmmm         1
## 10   pff         1
## # ... with 20,127 more rows

As an example, we’ll sample 1,000 of these misspellings (you could try it on all of them, though), and use stringdist_inner_join to join them against our dictionary.

set.seed(2016)
sub_misspellings <- misspellings %>%
  sample_n(1000)
joined <- sub_misspellings %>%
  stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 1)

By default, stringdist_inner_join uses optimal string alignment (a restricted form of Damerau-Levenshtein distance), and here we’re setting a maximum distance of 1 for the join. Notice that they’ve been joined in cases where misspelling is close to (but not equal to) word:

joined
## # A tibble: 728 x 4
##    misspelling correct    word syllables
##          <chr>   <chr>   <chr>     <dbl>
##  1       sould  should   could         1
##  2       sould  should  should         1
##  3       sould  should    sold         1
##  4       sould  should    soul         1
##  5       sould  should   sound         1
##  6       sould  should   would         1
##  7       fiels   feels   field         1
##  8       fiels   feels    fils         1
##  9    conscent consent consent         2
## 10       fleed   freed   bleed         1
## # ... with 718 more rows
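
If you’d like to see how close each guess was, the stringdist joins can also record the computed distance in an extra column. A minimal sketch, assuming a fuzzyjoin version with the distance_col argument:

# keep the string distance for each matched pair in a "distance" column
joined_dist <- sub_misspellings %>%
  stringdist_inner_join(words, by = c(misspelling = "word"),
                        max_dist = 1, distance_col = "distance")

With max_dist = 1 the recorded distances can only be 0 or 1, but with a larger threshold this column is handy for keeping just the closest guess for each misspelling (for example, by grouping by misspelling and filtering on the minimum distance).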

Note that there is some redundancy: some misspellings were close to more than one word in the dictionary, and these end up with one row per “guess” in the output. How many misspellings did we classify?

joined %>%
  count(misspelling, correct)
## # A tibble: 455 x 3
##    misspelling      correct     n
##          <chr>        <chr> <int>
##  1   abritrary    arbitrary     1
##  2   accademic     academic     1
##  3   accension    ascension     2
##  4  accessable   accessible     1
##  5    accidant     accident     1
##  6 accidentaly accidentally     1
##  7   accordeon    accordion     1
##  8      addopt        adopt     1
##  9   addtional   additional     1
## 10  admendment    amendment     1
## # ... with 445 more rows

So we found a match in the dictionary for about half of the misspellings. In how many of the ones we classified did we get at least one of our guesses right?

which_correct <- joined %>%
  group_by(misspelling, correct) %>%
  summarize(guesses = n(), one_correct = any(correct == word))

which_correct
## # A tibble: 455 x 4
## # Groups:   misspelling [?]
##    misspelling      correct guesses one_correct
##          <chr>        <chr>   <int>       <lgl>
##  1   abritrary    arbitrary       1        TRUE
##  2   accademic     academic       1        TRUE
##  3   accension    ascension       2        TRUE
##  4  accessable   accessible       1        TRUE
##  5    accidant     accident       1        TRUE
##  6 accidentaly accidentally       1       FALSE
##  7   accordeon    accordion       1        TRUE
##  8      addopt        adopt       1        TRUE
##  9   addtional   additional       1        TRUE
## 10  admendment    amendment       1        TRUE
## # ... with 445 more rows
# proportion of classified misspellings with at least one correct guess
mean(which_correct$one_correct)
## [1] 0.8527473
# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)
## [1] 294

Not bad.
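
For context, we can also put these numbers over the full sample of 1,000 misspellings rather than just the ones we matched. A quick sketch using the objects above:

# share of the 1,000 sampled misspellings that got any guess at all
nrow(which_correct) / nrow(sub_misspellings)

# share of the 1,000 where at least one guess was right
sum(which_correct$one_correct) / nrow(sub_misspellings)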

Note that stringdist_inner_join is not the only function we can use. If we’re interested in including the words that we couldn’t classify, we could have used stringdist_left_join:

left_joined <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)

left_joined
## # A tibble: 1,273 x 4
##     misspelling      correct   word syllables
##           <chr>        <chr>  <chr>     <dbl>
##  1   charactors   characters   <NA>        NA
##  2   Brasillian    Brazilian   <NA>        NA
##  3        sould       should  could         1
##  4        sould       should should         1
##  5        sould       should   sold         1
##  6        sould       should   soul         1
##  7        sould       should  sound         1
##  8        sould       should  would         1
##  9  belligerant  belligerent   <NA>        NA
## 10 incorportaed incorporated   <NA>        NA
## # ... with 1,263 more rows
left_joined %>%
  filter(is.na(word))
## # A tibble: 545 x 4
##     misspelling      correct  word syllables
##           <chr>        <chr> <chr>     <dbl>
##  1   charactors   characters  <NA>        NA
##  2   Brasillian    Brazilian  <NA>        NA
##  3  belligerant  belligerent  <NA>        NA
##  4 incorportaed incorporated  <NA>        NA
##  5         awya         away  <NA>        NA
##  6     occuring    occurring  <NA>        NA
##  7  surveilence surveillance  <NA>        NA
##  8    abondoned    abandoned  <NA>        NA
##  9     alledges      alleges  <NA>        NA
## 10  deliberatly deliberately  <NA>        NA
## # ... with 535 more rows

(To get just the ones without matches immediately, we could have used stringdist_anti_join; see the short sketch near the end of this section.) If we increase our distance threshold, we’ll increase the fraction with a correct guess, but we’ll also get more false-positive guesses:

left_joined2 <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)

left_joined2
## # A tibble: 7,691 x 4
##    misspelling    correct       word syllables
##          <chr>      <chr>      <chr>     <dbl>
##  1  charactors characters  character         3
##  2  charactors characters charactery         4
##  3  Brasillian  Brazilian       <NA>        NA
##  4       sould     should       auld         1
##  5       sould     should       bold         1
##  6       sould     should      bound         1
##  7       sould     should       cold         1
##  8       sould     should      could         1
##  9       sould     should       fold         1
## 10       sould     should       foul         1
## # ... with 7,681 more rows
left_joined2 %>%
  filter(is.na(word))
## # A tibble: 264 x 4
##      misspelling       correct  word syllables
##            <chr>         <chr> <chr>     <dbl>
##  1    Brasillian     Brazilian  <NA>        NA
##  2   belligerant   belligerent  <NA>        NA
##  3      occuring     occurring  <NA>        NA
##  4     abondoned     abandoned  <NA>        NA
##  5  correponding corresponding  <NA>        NA
##  6 archeaologist archaeologist  <NA>        NA
##  7   emmediately   immediately  <NA>        NA
##  8    possessess     possesses  <NA>        NA
##  9       unahppy       unhappy  <NA>        NA
## 10        Guilio        Giulio  <NA>        NA
## # ... with 254 more rows

Most of the missing words here simply aren’t in our dictionary.
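
As mentioned above, stringdist_anti_join returns just the misspellings without a match in a single step; a minimal sketch at the original threshold:

# misspellings with no dictionary word within a distance of 1
unmatched <- sub_misspellings %>%
  stringdist_anti_join(words, by = c(misspelling = "word"), max_dist = 1)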

You can try other distance thresholds, other dictionaries, and other distance metrics (see the stringdist-metrics documentation in the stringdist package for more). These joins are especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of responses.
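
For instance, to match free-form survey responses against a handful of canonical answers, you might switch metrics through the method argument (which accepts the distance methods from stringdist), such as "jw" for a Jaro-Winkler style distance. A hypothetical sketch: the responses and categories below are made up, and the 0.1 threshold would need tuning for real data.

library(tibble)

# made-up free-form responses and the canonical categories we expect
responses <- tibble(response = c("stronly agree", "nuetral", "disagre"))
categories <- tibble(category = c("strongly agree", "agree", "neutral",
                                  "disagree", "strongly disagree"))

# "jw" distances are scaled between 0 and 1, so max_dist is a fraction here
responses %>%
  stringdist_left_join(categories, by = c(response = "category"),
                       method = "jw", max_dist = 0.1)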