fuzzystring provides fast, flexible fuzzy string
joins for data.frame and data.table objects
using approximate string matching. It combines
stringdist-based matching with a data.table
backend and compiled C++ result assembly to reduce overhead in large
joins while preserving standard join semantics.
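
As a rough illustration of what approximate matching under a distance threshold means, the sketch below uses base R's `adist()` (generalized Levenshtein distance) instead of fuzzystring itself. The `threshold` variable and the sample vectors are made up for this example, and nothing here reflects how the package computes matches internally.

```r
# Illustrative only: base R, not the fuzzystring API.
left  <- c("Idea", "Premiom", "Very Good")
right <- c("Ideal", "Premium", "VeryGood")

# Pairwise Levenshtein distances: rows index `left`, columns index `right`.
d <- adist(left, right)

# Hypothetical threshold: keep pairs whose distance is at most 2.
threshold <- 2
hits <- which(d <= threshold, arr.ind = TRUE)

# Assemble the matched pairs together with their distances.
matches <- data.frame(
  left     = left[hits[, "row"]],
  right    = right[hits[, "col"]],
  distance = d[hits]
)
matches
```

A fuzzy join does the same kind of thresholding, then carries the remaining columns of both tables along with each matched pair.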
Real-world identifiers rarely line up exactly.
fuzzystring is designed for workloads such as:
The package includes:
- inner, left, right, full, semi, and anti joins
- stringdist methods, including OSA, Levenshtein,
  Damerau-Levenshtein, Jaro-Winkler, q-gram, cosine, Jaccard,
  and Soundex
- support for data.table, tibble, or base data.frame objects

# Install from CRAN
install.packages("fuzzystring")
# Development version from GitHub
# pak::pak("PaulESantos/fuzzystring")
# remotes::install_github("PaulESantos/fuzzystring")

library(fuzzystring)

x <- data.frame(
  name = c("Idea", "Premiom", "Very Good"),
  id = 1:3
)
y <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood"),
  grp = c("A", "B", "C")
)

fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)

fuzzystring_inner_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_left_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_right_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_full_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_semi_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_anti_join(x, y, by = c(name = "approx_name"), max_dist = 2)

fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "osa")
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "dl")
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "jw")
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "soundex")

fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  ignore_case = TRUE,
  max_dist = 1
)

The package ships with misspellings, a dataset of common
misspellings adapted from Wikipedia for examples and testing.
data(misspellings)
head(misspellings)

fuzzystring keeps more of the join execution on a
compiled path than the original fuzzyjoin implementation.
In practice, the package combines:
- data.table grouping and candidate planning

The benchmark article summarizes a precomputed comparison against
fuzzyjoin::stringdist_join() using the same methods and
sample sizes:
fuzzystring_join() can match across more than one string
column by applying the same distance method and threshold to each mapped
column.
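
To make the per-column rule concrete, here is a minimal base R sketch: it computes one distance matrix per mapped column with `adist()` and keeps only the row pairs that fall within the threshold on every column. Requiring all columns to match is an assumption made for this sketch, as are the variable names; it is not the package's implementation.

```r
# Illustrative only: base R, not the fuzzystring API.
x_first <- c("Jon", "Maira")
x_last  <- c("Smyth", "Gonzales")
y_first <- c("John", "Maria")
y_last  <- c("Smith", "Gonzalez")

max_dist <- 1  # same threshold applied to every column

# One distance matrix per mapped column pair.
# adist() computes Levenshtein distance, so the transposition
# "Maira" -> "Maria" costs 2 here (OSA counts it as 1).
d_first <- adist(x_first, y_first)
d_last  <- adist(x_last,  y_last)

# Assumption: a pair of rows matches only if every column is within max_dist.
keep <- (d_first <= max_dist) & (d_last <= max_dist)
which(keep, arr.ind = TRUE)
```

Under plain Levenshtein distance only the first row pair survives, because the swapped letters in "Maira" count as two edits. A transposition-aware method such as OSA treats that swap as a single edit.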
x_multi <- data.frame(
  first = c("Jon", "Maira"),
  last = c("Smyth", "Gonzales")
)
y_multi <- data.frame(
  first_ref = c("John", "Maria"),
  last_ref = c("Smith", "Gonzalez"),
  id = 1:2
)

fuzzystring_inner_join(
  x_multi, y_multi,
  by = c(first = "first_ref", last = "last_ref"),
  method = "osa",
  max_dist = 1
)

fuzzystring builds on ideas popularized by
fuzzyjoin, while reinterpreting the join pipeline around
data.table and compiled C++ result assembly.