
fuzzystring


fuzzystring provides fast, flexible fuzzy string joins for data.frame and data.table objects using approximate string matching. It combines stringdist-based matching with a data.table backend and compiled C++ result assembly to reduce overhead in large joins while preserving standard join semantics.

Why fuzzystring?

Real-world identifiers rarely line up exactly. fuzzystring is designed for workloads where such identifiers still have to be joined.

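The package includes inner, left, right, full, semi, and anti joins, a choice of distance methods, optional case-insensitive matching, multiple-column joins, and the misspellings example dataset.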

Installation

# Install from CRAN
install.packages("fuzzystring")

# Development version from GitHub
# pak::pak("PaulESantos/fuzzystring")
# remotes::install_github("PaulESantos/fuzzystring")

Quick start

library(fuzzystring)

x <- data.frame(
  name = c("Idea", "Premiom", "Very Good"),
  id = 1:3
)

y <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood"),
  grp = c("A", "B", "C")
)

fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
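
To see which pairs a threshold of max_dist = 2 admits for this data, the pairwise distances can be approximated in base R. This is only a sketch: utils::adist() computes generalized Levenshtein distance, used here as a stand-in for whatever method fuzzystring applies by default, and the names d, hits, and sketch are illustrative rather than part of the package.

```r
x <- data.frame(name = c("Idea", "Premiom", "Very Good"), id = 1:3)
y <- data.frame(approx_name = c("Ideal", "Premium", "VeryGood"),
                grp = c("A", "B", "C"))

# Pairwise edit distances: rows follow x$name, columns follow y$approx_name.
# Assumption: Levenshtein distance stands in for the package's own method.
d <- utils::adist(x$name, y$approx_name)

# Candidate pairs are the cells at or below the threshold.
max_dist <- 2
hits <- which(d <= max_dist, arr.ind = TRUE)

# Assemble a result shaped like an inner join with a distance column.
sketch <- cbind(
  x[hits[, "row"], , drop = FALSE],
  y[hits[, "col"], , drop = FALSE],
  distance = d[hits]
)
rownames(sketch) <- NULL
sketch
```

Each name in x is a single edit away from its counterpart in y, so only those three pairs fall within the threshold; every other combination is further apart.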

Join families

fuzzystring_inner_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_left_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_right_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_full_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_semi_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_anti_join(x, y, by = c(name = "approx_name"), max_dist = 2)

Distance methods

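# Optimal string alignment: edits plus adjacent transpositions
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "osa")
# Full Damerau-Levenshtein distance
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "dl")
# Jaro / Jaro-Winkler distance (values between 0 and 1)
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "jw")
# Soundex phonetic codes (distance is 0 or 1)
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "soundex")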
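
The choice of method changes what counts as a single edit, which in turn changes which pairs fit under a given max_dist. As an illustration, here is a minimal base-R sketch of optimal string alignment ("osa") distance next to utils::adist(), which computes plain Levenshtein distance. The helper osa_distance() is hypothetical and written for readability; it is not how fuzzystring or stringdist computes distances internally.

```r
# Hypothetical helper: optimal string alignment distance between two strings.
osa_distance <- function(a, b) {
  s <- strsplit(a, "", fixed = TRUE)[[1]]
  t <- strsplit(b, "", fixed = TRUE)[[1]]
  n <- length(s)
  m <- length(t)

  # d[i + 1, j + 1] holds the distance between the first i characters of a
  # and the first j characters of b.
  d <- matrix(0L, n + 1L, m + 1L)
  d[, 1] <- 0:n
  d[1, ] <- 0:m

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(s[i] != t[j])
      best <- min(
        d[i, j + 1L] + 1L,   # deletion
        d[i + 1L, j] + 1L,   # insertion
        d[i, j] + cost       # substitution (or match)
      )
      # Adjacent transposition, the step plain Levenshtein lacks.
      if (i > 1L && j > 1L && s[i] == t[j - 1L] && s[i - 1L] == t[j]) {
        best <- min(best, d[i - 1L, j - 1L] + 1L)
      }
      d[i + 1L, j + 1L] <- best
    }
  }
  d[n + 1L, m + 1L]
}

osa_maira <- osa_distance("Maira", "Maria")        # swap of "i" and "r": one edit
lev_maira <- utils::adist("Maira", "Maria")[1, 1]  # two substitutions
c(osa = osa_maira, levenshtein = lev_maira)
```

A swapped pair of letters therefore passes a threshold of 1 under "osa" but would need a threshold of 2 under plain Levenshtein distance.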

Case-insensitive matching

fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  ignore_case = TRUE,
  max_dist = 1
)
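
The effect of ignoring case can be sketched with base R, again using utils::adist() as a stand-in. The uppercase value below is a made-up example, and how fuzzystring implements ignore_case internally is not shown in this README, so treat the case-folding step as an assumption.

```r
a <- "Premiom"
b <- "PREMIUM"   # hypothetical value that differs in case as well as spelling

# Case-sensitive: every letter after "P" differs in case, so the distance is large.
d_sensitive <- utils::adist(a, b)[1, 1]

# Case-insensitive: only the "o"/"u" substitution remains.
d_insensitive <- utils::adist(a, b, ignore.case = TRUE)[1, 1]

# Folding case up front gives the same comparison.
d_folded <- utils::adist(tolower(a), tolower(b))[1, 1]

c(sensitive = d_sensitive, insensitive = d_insensitive, folded = d_folded)
```

With max_dist = 1, a pair like this matches only when case is ignored.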

Included example data

The package ships with misspellings, a dataset of common misspellings adapted from Wikipedia for examples and testing.

data(misspellings)
head(misspellings)

Performance

fuzzystring keeps more of the join execution on a compiled path than the original fuzzyjoin implementation. In practice, the package combines stringdist-based matching, a data.table backend, and compiled C++ result assembly.

The benchmark article summarizes a precomputed comparison against fuzzyjoin::stringdist_join() using the same methods and sample sizes.

Multiple-column joins

fuzzystring_join() can match across more than one string column by applying the same distance method and threshold to each mapped column.

x_multi <- data.frame(
  first = c("Jon", "Maira"),
  last = c("Smyth", "Gonzales")
)

y_multi <- data.frame(
  first_ref = c("John", "Maria"),
  last_ref = c("Smith", "Gonzalez"),
  id = 1:2
)

fuzzystring_inner_join(
  x_multi, y_multi,
  by = c(first = "first_ref", last = "last_ref"),
  method = "osa",
  max_dist = 1
)
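
One way to picture a multiple-column match is as one distance matrix per mapped column, combined so that a pair survives only if every column is within the threshold. The base-R sketch below works under that assumption; the README does not spell out how the package combines columns. It also uses utils::adist() (Levenshtein) rather than "osa", so the threshold is set to 2 because Levenshtein counts the swapped letters in "Maira" as two edits.

```r
x_multi <- data.frame(first = c("Jon", "Maira"), last = c("Smyth", "Gonzales"))
y_multi <- data.frame(first_ref = c("John", "Maria"),
                      last_ref = c("Smith", "Gonzalez"),
                      id = 1:2)

# One distance matrix per mapped column pair (rows: x_multi, columns: y_multi).
d_first <- utils::adist(x_multi$first, y_multi$first_ref)
d_last  <- utils::adist(x_multi$last,  y_multi$last_ref)

# Assumption: a pair is kept only when every column is within the threshold.
max_dist <- 2
keep <- (d_first <= max_dist) & (d_last <= max_dist)

# Turn the surviving cells into matched rows.
pairs <- which(keep, arr.ind = TRUE)
matched_multi <- data.frame(
  first = x_multi$first[pairs[, "row"]],
  last  = x_multi$last[pairs[, "row"]],
  id    = y_multi$id[pairs[, "col"]]
)
matched_multi
```

Under these assumptions each row of x_multi pairs with exactly one row of y_multi, since both the first and last names are close for the intended person and far apart for the other.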

Credits

fuzzystring builds on ideas popularized by fuzzyjoin, while reinterpreting the join pipeline around data.table and compiled C++ result assembly.
