Getting Started with spell.replacer

library(spell.replacer)

Introduction

The spell.replacer package provides probabilistic spelling correction for character vectors in R. It uses the Jaro-Winkler string distance metric combined with word frequency data from the Corpus of Contemporary American English (COCA) to automatically correct misspelled words.

Basic Usage

The main function is spell_replace(), which takes a character vector and returns it with corrected spellings:

# Example text with misspellings
text <- c("This is a smple text with some mispelled words.",
          "We can corect them automaticaly.")

# Apply spell correction
corrected_text <- spell_replace(text)
print(corrected_text)
#> [1] "This is a simple text with some spelled words."
#> [2] "We can correct them automatically."

How It Works

The package uses a two-step process:

  1. Identify misspelled words: Uses the hunspell package to identify words not found in standard dictionaries
  2. Find corrections: For each misspelled word, calculates Jaro-Winkler distance to words in the COCA frequency list and selects the best match
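The two steps above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `toy_dictionary` stands in for the hunspell dictionary lookup, `toy_vocab` stands in for the COCA frequency list, and `jaro_winkler_distance()` and `suggest()` are hypothetical helpers written for this example.

```r
# Jaro-Winkler distance in base R: 0 means identical, 1 means no similarity.
# Illustrative helper, not the package's own code.
jaro_winkler_distance <- function(a, b, p = 0.1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  if (n == 0 || m == 0) return(as.numeric(n != m))

  # Characters match if equal and no further apart than `window`.
  window <- max(0, max(n, m) %/% 2 - 1)
  x_hit <- logical(n)
  y_hit <- logical(m)
  for (i in seq_len(n)) {
    lo <- max(1, i - window)
    hi <- min(m, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_hit[j] && x[i] == y[j]) {
        x_hit[i] <- TRUE
        y_hit[j] <- TRUE
        break
      }
    }
  }
  matches <- sum(x_hit)
  if (matches == 0) return(1)

  # Matched characters that appear in a different order count as half
  # a transposition each.
  transpositions <- sum(x[x_hit] != y[y_hit]) / 2
  jaro <- (matches / n + matches / m +
             (matches - transpositions) / matches) / 3

  # Winkler bonus for a shared prefix of up to four characters.
  limit <- min(4, n, m)
  same <- x[seq_len(limit)] == y[seq_len(limit)]
  prefix <- if (all(same)) limit else which(!same)[1] - 1
  1 - (jaro + prefix * p * (1 - jaro))
}

# Assumption: tiny stand-ins for the dictionary and the frequency list.
toy_dictionary <- c("we", "can", "correct", "them", "simple", "text")
toy_vocab <- c("the", "can", "them", "text", "simple", "correct", "correctly")

words <- c("we", "can", "corect", "them")

# Step 1: flag words that are not in the dictionary.
is_misspelled <- !words %in% toy_dictionary
misspelled <- words[is_misspelled]

# Step 2: choose the vocabulary entry with the smallest distance.
suggest <- function(word, vocab) {
  d <- vapply(vocab, function(v) jaro_winkler_distance(word, v), numeric(1))
  vocab[which.min(d)]
}

fixed <- words
fixed[is_misspelled] <- vapply(misspelled, suggest, character(1),
                               vocab = toy_vocab, USE.NAMES = FALSE)
fixed
#> [1] "we"      "can"     "correct" "them"
```

Because `which.min()` returns the first minimum, ties in this sketch go to the entry that appears earlier in the vocabulary, which in a frequency-ordered list is the more frequent word. Whether the package breaks ties the same way is not stated here.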

Customizing Correction

You can adjust the correction behavior with several parameters:

# More restrictive threshold (fewer corrections)
conservative <- spell_replace(text, threshold = 0.08)

# Ignore potential proper names
text_with_names <- "John went to Bostan yesterday."
corrected_names <- spell_replace(text_with_names, ignore_names = TRUE)
print(corrected_names)
#> [1] "John went to Boston yesterday."
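
To see why a smaller threshold leads to fewer corrections, here is a minimal sketch that assumes the threshold is an upper bound on the Jaro-Winkler distance of an accepted match. The package's actual rule may differ; `pick_candidate()` is a hypothetical helper, and the distances are made-up illustrative values, not output from spell.replacer.

```r
# Illustrative distances from one misspelled word to three candidates
# (made-up values for demonstration only).
candidate_distance <- c(correct = 0.03, collect = 0.10, carrot = 0.21)

# Hypothetical helper: accept the closest candidate only if its distance
# is within `threshold`; otherwise return the original word unchanged.
pick_candidate <- function(word, distances, threshold) {
  best <- which.min(distances)
  if (distances[[best]] <= threshold) names(distances)[best] else word
}

pick_candidate("corect", candidate_distance, threshold = 0.08)
#> [1] "correct"
pick_candidate("corect", candidate_distance, threshold = 0.02)
#> [1] "corect"
```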

Single Word Correction

You can also correct individual words using the correct() function:

# Correct a single word
corrected_word <- correct("recieve", coca_list)
print(corrected_word)
#> [1] "receive"

Working with Dataframes

Because spell_replace() takes and returns a character vector, it can be used directly inside tidyverse verbs such as dplyr::mutate() to correct an entire column of text data:

library(dplyr)

# Example dataframe with text column
docs <- data.frame(
  id = 1:3,
  text = c("This docment has misspellings.",
           "Anothr exmple with erors.",
           "The finl text sampel.")
)

# Apply spell correction using tidy syntax
docs %>%
  mutate(text = spell_replace(text))

Performance

The package processes approximately 1,000 words per second, making it suitable for large-scale text processing tasks. For example:

  • A 100,000-word corpus would take about 1.7 minutes
  • A 1,000,000-word corpus would take about 17 minutes

This makes spell.replacer practical for preprocessing large text datasets before analysis.
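
These estimates follow directly from the quoted rate. A small sketch of the arithmetic, where `estimate_minutes()` is a hypothetical helper and the default of 1,000 words per second is the approximate figure given above (actual speed will depend on the machine and the text):

```r
# Hypothetical helper: running time in minutes at a given processing rate.
estimate_minutes <- function(n_words, words_per_second = 1000) {
  n_words / words_per_second / 60
}

round(estimate_minutes(c(1e5, 1e6)), 1)
#> [1]  1.7 16.7
```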

Word Frequency Data

The package includes the coca_list dataset with the 100,000 most frequent words from COCA:

# Most frequent words
head(coca_list, 10)
#>  [1] "the"  "and"  "of"   "to"   "a"    "in"   "that" "is"   "i"    "for"

# Check if a word is in the list
"hello" %in% coca_list
#> [1] TRUE

# Find the frequency rank of a word
which(coca_list == "hello")
#> [1] 2579
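
To look up the ranks of several words at once, base R's match() is a convenient alternative to which(). The sketch below uses `toy_list`, a made-up stand-in for coca_list, so that it runs without the package installed; it assumes the list is a character vector ordered from most to least frequent, as the output above suggests.

```r
# Toy stand-in for coca_list, most frequent word first.
toy_list <- c("the", "and", "of", "to", "a")

# match() returns each word's position, or NA if the word is absent.
match(c("of", "the", "zebra"), toy_list)
#> [1]  3  1 NA
```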
