The spell.replacer package provides probabilistic spelling correction for character vectors in R. It uses the Jaro-Winkler string distance metric combined with word frequency data from the Corpus of Contemporary American English (COCA) to automatically correct misspelled words.
The main function is spell_replace(), which takes a character vector and returns it with corrected spellings:
# Example text with misspellings
text <- c("This is a smple text with some mispelled words.",
          "We can corect them automaticaly.")
# Apply spell correction
corrected_text <- spell_replace(text)
print(corrected_text)
#> [1] "This is a simple text with some spelled words."
#> [2] "We can correct them automatically."
The package uses a two-step process:

1. The hunspell package identifies words not found in standard dictionaries.
2. Each flagged word is replaced with its closest match in the COCA word list, combining Jaro-Winkler string distance with corpus frequency to choose the most likely replacement.

You can adjust the correction behavior with several parameters:
# More restrictive threshold (fewer corrections)
conservative <- spell_replace(text, threshold = 0.08)
# Ignore potential proper names
text_with_names <- "John went to Bostan yesterday."
corrected_names <- spell_replace(text_with_names, ignore_names = TRUE)
print(corrected_names)
#> [1] "John went to Boston yesterday."
You can also correct individual words using the correct() function:
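The exact signature is not shown above; the following is a minimal sketch, assuming correct() takes a single misspelled word and returns its most likely replacement:

```r
library(spell.replacer)

# Correct individual words (results depend on the COCA frequency list)
correct("recieve")
correct("definately")
```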
One of the main benefits of spell.replacer is that it integrates seamlessly with tidyverse workflows. You can easily apply spell correction to entire columns of text data:
library(dplyr)
# Example dataframe with text column
docs <- data.frame(
  id = 1:3,
  text = c("This docment has misspellings.",
           "Anothr exmple with erors.",
           "The finl text sampel.")
)
# Apply spell correction using tidy syntax
docs %>%
  mutate(text = spell_replace(text))
The package processes approximately 1,000 words per second, making it suitable for large-scale text processing tasks.
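At that rate, the cost of correcting a corpus is easy to estimate with a quick timing sketch. The numbers below are illustrative assumptions, not measured benchmarks; actual throughput depends on hardware and input text:

```r
library(spell.replacer)

# ~60,000 words of deliberately misspelled text: at roughly
# 1,000 words/second this should take on the order of a minute
words <- rep("Ths sentnce has sevral mispelled words.", 10000)
system.time(corrected <- spell_replace(words))
```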
This makes spell.replacer practical for preprocessing large text datasets before analysis.
The package includes the coca_list dataset, containing the 100,000 most frequent words from COCA.
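A quick way to get a feel for the bundled word list is to inspect it directly. This sketch assumes coca_list is a vector ordered from most to least frequent, as frequency lists typically are:

```r
library(spell.replacer)

# Peek at the most frequent entries and confirm the list's size
head(coca_list)
length(coca_list)  # 100,000 words per the documentation
```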