The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
stringdist_inner_join
: Correcting misspellings against a
dictionaryOften you find yourself with a set of words that you want to combine with a “dictionary”- it could be a literal dictionary (as in this case) or a domain-specific category system. But you want to allow for small differences in spelling or punctuation.
The fuzzyjoin package comes with a set of common misspellings (from Wikipedia):
## # A tibble: 4,505 × 2
## misspelling correct
## <chr> <chr>
## 1 abandonned abandoned
## 2 aberation aberration
## 3 abilties abilities
## 4 abilty ability
## 5 abondon abandon
## 6 abbout about
## 7 abotu about
## 8 abouta about a
## 9 aboutit about it
## 10 aboutthe about the
## # ℹ 4,495 more rows
# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus.
library(qdapDictionaries)
words <- tbl_df(DICTIONARY)
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 20,137 × 2
## word syllables
## <chr> <dbl>
## 1 hm 1
## 2 hmm 1
## 3 hmmm 1
## 4 hmph 1
## 5 mmhmm 2
## 6 mmhm 2
## 7 mm 1
## 8 mmm 1
## 9 mmmm 1
## 10 pff 1
## # ℹ 20,127 more rows
As an example, we’ll pick 1000 of these words (you could try it on
all of them though), and use stringdist_inner_join
to join
them against our dictionary.
joined <- sub_misspellings %>%
stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 1)
By default, stringdist_inner_join
uses optimal string
alignment (Damerau–Levenshtein distance), and we’re setting a maximum
distance of 1 for a join. Notice that they’ve been joined in cases where
misspelling
is close to (but not equal to)
word
:
## # A tibble: 760 × 4
## misspelling correct word syllables
## <chr> <chr> <chr> <dbl>
## 1 cyclinder cylinder cylinder 3
## 2 beastiality bestiality bestiality 5
## 3 affilate affiliate affiliate 4
## 4 supress suppress suppress 2
## 5 intevene intervene intervene 3
## 6 resaurant restaurant restaurant 3
## 7 univesity university university 5
## 8 allegedely allegedly allegedly 4
## 9 emiting emitting smiting 2
## 10 probaly probably probably 3
## # ℹ 750 more rows
Note that there are some redundancies; words that could be multiple items in the dictionary. These end up with one row per “guess” in the output. How many words did we classify?
## # A tibble: 462 × 3
## misspelling correct n
## <chr> <chr> <int>
## 1 abilty ability 1
## 2 accademic academic 1
## 3 accademy academy 1
## 4 accension accession 2
## 5 acceptence acceptance 1
## 6 acedemic academic 1
## 7 achive achieve 4
## 8 acommodate accommodate 1
## 9 acuracy accuracy 1
## 10 addmission admission 1
## # ℹ 452 more rows
So we found a match in the dictionary for about half of the misspellings. In how many of the ones we classified did we get at least one of our guesses right?
which_correct <- joined %>%
group_by(misspelling, correct) %>%
summarize(guesses = n(), one_correct = any(correct == word))
which_correct
## # A tibble: 462 × 4
## # Groups: misspelling [453]
## misspelling correct guesses one_correct
## <chr> <chr> <int> <lgl>
## 1 abilty ability 1 TRUE
## 2 accademic academic 1 TRUE
## 3 accademy academy 1 TRUE
## 4 accension accession 2 TRUE
## 5 acceptence acceptance 1 TRUE
## 6 acedemic academic 1 TRUE
## 7 achive achieve 4 TRUE
## 8 acommodate accommodate 1 TRUE
## 9 acuracy accuracy 1 TRUE
## 10 addmission admission 1 TRUE
## # ℹ 452 more rows
## [1] 0.8246753
# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)
## [1] 290
Not bad.
Note that stringdist_inner_join
is not the only function
we can use. If we’re interested in including the words that we
couldn’t classify, we could have used
stringdist_left_join
:
left_joined <- sub_misspellings %>%
stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)
left_joined
## # A tibble: 1,298 × 4
## misspelling correct word syllables
## <chr> <chr> <chr> <dbl>
## 1 Sanhedrim Sanhedrin <NA> NA
## 2 cyclinder cylinder cylinder 3
## 3 beastiality bestiality bestiality 5
## 4 consicousness consciousness <NA> NA
## 5 affilate affiliate affiliate 4
## 6 repubicans republicans <NA> NA
## 7 comitted committed <NA> NA
## 8 emmisions emissions <NA> NA
## 9 acquited acquitted <NA> NA
## 10 decompositing decomposing <NA> NA
## # ℹ 1,288 more rows
## # A tibble: 538 × 4
## misspelling correct word syllables
## <chr> <chr> <chr> <dbl>
## 1 Sanhedrim Sanhedrin <NA> NA
## 2 consicousness consciousness <NA> NA
## 3 repubicans republicans <NA> NA
## 4 comitted committed <NA> NA
## 5 emmisions emissions <NA> NA
## 6 acquited acquitted <NA> NA
## 7 decompositing decomposing <NA> NA
## 8 decieved deceived <NA> NA
## 9 asociated associated <NA> NA
## 10 commonweath commonwealth <NA> NA
## # ℹ 528 more rows
(To get just the ones without matches immediately, we could
have used stringdist_anti_join
). If we increase our
distance threshold, we’ll increase the fraction with a correct guess,
but also get more false positive guesses:
left_joined2 <- sub_misspellings %>%
stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)
left_joined2
## # A tibble: 8,721 × 4
## misspelling correct word syllables
## <chr> <chr> <chr> <dbl>
## 1 Sanhedrim Sanhedrin <NA> NA
## 2 cyclinder cylinder cylinder 3
## 3 beastiality bestiality bestiality 5
## 4 consicousness consciousness <NA> NA
## 5 affilate affiliate affiliate 4
## 6 repubicans republicans <NA> NA
## 7 comitted committed committee 3
## 8 emmisions emissions <NA> NA
## 9 acquited acquitted acquire 2
## 10 acquited acquitted acquit 2
## # ℹ 8,711 more rows
## # A tibble: 286 × 4
## misspelling correct word syllables
## <chr> <chr> <chr> <dbl>
## 1 Sanhedrim Sanhedrin <NA> NA
## 2 consicousness consciousness <NA> NA
## 3 repubicans republicans <NA> NA
## 4 emmisions emissions <NA> NA
## 5 commonweath commonwealth <NA> NA
## 6 supressed suppressed <NA> NA
## 7 aproximately approximately <NA> NA
## 8 Missisippi Mississippi <NA> NA
## 9 lazyness laziness <NA> NA
## 10 constituional constitutional <NA> NA
## # ℹ 276 more rows
Most of the missing words here simply aren’t in our dictionary.
You can try other distance thresholds, other dictionaries, and other distance metrics (see stringdist-metrics for more). This function is especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of responses.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.