Often you find yourself with a set of words that you want to combine with a “dictionary”- it could be a literal dictionary (as in this case) or a domain-specific category system. But you want to allow for small differences in spelling or punctuation.

The fuzzyjoin package comes with a set of common misspellings (from Wikipedia):

library(dplyr)
library(fuzzyjoin)
data(misspellings)

misspellings
## Source: local data frame [4,505 x 2]
## 
##    misspelling    correct
##          (chr)      (chr)
## 1   abandonned  abandoned
## 2    aberation aberration
## 3     abilties  abilities
## 4       abilty    ability
## 5      abondon    abandon
## 6       abbout      about
## 7        abotu      about
## 8       abouta    about a
## 9      aboutit   about it
## 10    aboutthe  about the
## ..         ...        ...
# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus.
library(qdapDictionaries)
words <- tbl_df(DICTIONARY)

words
## Source: local data frame [20,137 x 2]
## 
##     word syllables
##    (chr)     (dbl)
## 1     hm         1
## 2    hmm         1
## 3   hmmm         1
## 4   hmph         1
## 5  mmhmm         2
## 6   mmhm         2
## 7     mm         1
## 8    mmm         1
## 9   mmmm         1
## 10   pff         1
## ..   ...       ...

As an example, we’ll pick 1000 of these words (you could try it on all of them though), and use stringdist_inner_join to join them against our dictionary.

set.seed(2016)
sub_misspellings <- misspellings %>%
  sample_n(1000)
joined <- sub_misspellings %>%
  stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 1)

By default, stringdist_inner_join uses optimal string alignment (Damerau–Levenshtein distance), and we’re setting a maximum distance of 1 for a join. Notice that they’ve been joined in cases where misspelling is close to (but not equal to) word:

joined
## Source: local data frame [728 x 4]
## 
##    misspelling correct    word syllables
##          (chr)   (chr)   (chr)     (dbl)
## 1        sould  should   could         1
## 2        sould  should  should         1
## 3        sould  should    sold         1
## 4        sould  should    soul         1
## 5        sould  should   sound         1
## 6        sould  should   would         1
## 7        fiels   feels   field         1
## 8        fiels   feels    fils         1
## 9     conscent consent consent         2
## 10       fleed   freed   bleed         1
## ..         ...     ...     ...       ...

Note that there are some redundancies; words that could be multiple items in the dictionary. These end up with one row per “guess” in the output. How many words did we classify?

joined %>%
  count(misspelling, correct)
## Source: local data frame [455 x 3]
## Groups: misspelling [?]
## 
##    misspelling      correct     n
##          (chr)        (chr) (int)
## 1    abritrary    arbitrary     1
## 2    accademic     academic     1
## 3    accension    ascension     2
## 4   accessable   accessible     1
## 5     accidant     accident     1
## 6  accidentaly accidentally     1
## 7    accordeon    accordion     1
## 8       addopt        adopt     1
## 9    addtional   additional     1
## 10  admendment    amendment     1
## ..         ...          ...   ...

So we found a match in the dictionary for about half of the misspellings. In how many of the ones we classified did we get at least one of our guesses right?

which_correct <- joined %>%
  group_by(misspelling, correct) %>%
  summarize(guesses = n(), one_correct = any(correct == word))

which_correct
## Source: local data frame [455 x 4]
## Groups: misspelling [?]
## 
##    misspelling      correct guesses one_correct
##          (chr)        (chr)   (int)       (lgl)
## 1    abritrary    arbitrary       1        TRUE
## 2    accademic     academic       1        TRUE
## 3    accension    ascension       2        TRUE
## 4   accessable   accessible       1        TRUE
## 5     accidant     accident       1        TRUE
## 6  accidentaly accidentally       1       FALSE
## 7    accordeon    accordion       1        TRUE
## 8       addopt        adopt       1        TRUE
## 9    addtional   additional       1        TRUE
## 10  admendment    amendment       1        TRUE
## ..         ...          ...     ...         ...
# percentage of guesses getting at least one right
mean(which_correct$one_correct)
## [1] 0.8527473
# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)
## [1] 294

Not bad.

Note that stringdist_inner_join is not the only function we can use. If we’re interested in including the words that we couldn’t classify, we could have use stringdiststringdist_left_join:

left_joined <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)

left_joined
## Source: local data frame [1,273 x 4]
## 
##     misspelling      correct   word syllables
##           (chr)        (chr)  (chr)     (dbl)
## 1    charactors   characters     NA        NA
## 2    Brasillian    Brazilian     NA        NA
## 3         sould       should  could         1
## 4         sould       should should         1
## 5         sould       should   sold         1
## 6         sould       should   soul         1
## 7         sould       should  sound         1
## 8         sould       should  would         1
## 9   belligerant  belligerent     NA        NA
## 10 incorportaed incorporated     NA        NA
## ..          ...          ...    ...       ...
left_joined %>%
  filter(is.na(word))
## Source: local data frame [545 x 4]
## 
##     misspelling      correct  word syllables
##           (chr)        (chr) (chr)     (dbl)
## 1    charactors   characters    NA        NA
## 2    Brasillian    Brazilian    NA        NA
## 3   belligerant  belligerent    NA        NA
## 4  incorportaed incorporated    NA        NA
## 5          awya         away    NA        NA
## 6      occuring    occurring    NA        NA
## 7   surveilence surveillance    NA        NA
## 8     abondoned    abandoned    NA        NA
## 9      alledges      alleges    NA        NA
## 10  deliberatly deliberately    NA        NA
## ..          ...          ...   ...       ...

(To get just the ones without matches immediately, we could have used stringdist_anti_join). If we increase our distance threshold, we’ll increase the fraction with a correct guess, but also get more false positive guesses:

left_joined2 <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)

left_joined2
## Source: local data frame [7,691 x 4]
## 
##    misspelling    correct       word syllables
##          (chr)      (chr)      (chr)     (dbl)
## 1   charactors characters  character         3
## 2   charactors characters charactery         4
## 3   Brasillian  Brazilian         NA        NA
## 4        sould     should       auld         1
## 5        sould     should       bold         1
## 6        sould     should      bound         1
## 7        sould     should       cold         1
## 8        sould     should      could         1
## 9        sould     should       fold         1
## 10       sould     should       foul         1
## ..         ...        ...        ...       ...
left_joined2 %>%
  filter(is.na(word))
## Source: local data frame [264 x 4]
## 
##      misspelling       correct  word syllables
##            (chr)         (chr) (chr)     (dbl)
## 1     Brasillian     Brazilian    NA        NA
## 2    belligerant   belligerent    NA        NA
## 3       occuring     occurring    NA        NA
## 4      abondoned     abandoned    NA        NA
## 5   correponding corresponding    NA        NA
## 6  archeaologist archaeologist    NA        NA
## 7    emmediately   immediately    NA        NA
## 8     possessess     possesses    NA        NA
## 9        unahppy       unhappy    NA        NA
## 10        Guilio        Giulio    NA        NA
## ..           ...           ...   ...       ...

Most of the missing words here simply aren’t in our dictionary.

You can try other distance thresholds, other dictionaries, and other distance metrics (see [stringdist-metrics] for more). This function is especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of responses.