comparator implements comparison functions for clustering and record linkage applications. It includes functions for comparing strings, sequences and numeric vectors. Where possible, comparators are implemented in C/C++ to ensure fast performance.
Levenshtein(): Levenshtein distance/similarity
DamerauLevenshtein(): Damerau-Levenshtein distance/similarity
Hamming(): Hamming distance/similarity
OSA(): Optimal String Alignment distance/similarity
LCS(): Longest Common Subsequence distance/similarity
Jaro(): Jaro distance/similarity
JaroWinkler(): Jaro-Winkler distance/similarity

Not yet implemented.

MongeElkan(): Monge-Elkan similarity
FuzzyTokenSet(): Fuzzy Token Set distance

InVocabulary(): Compares strings using a reference vocabulary. Useful for comparing names.
Lookup(): Retrieves distances/similarities from a lookup table
BinaryComp(): Compares strings based on whether they agree/disagree exactly.

Euclidean(): Euclidean (L-2) distance
Manhattan(): Manhattan (L-1) distance
Chebyshev(): Chebyshev (L-∞) distance
Minkowski(): Minkowski (L-p) distance

You can install the latest release from CRAN by entering:
install.packages("comparator")
The development version can be installed from GitHub using devtools:
# install.packages("devtools")
devtools::install_github("ngmarchant/comparator")
A comparator is instantiated by calling its constructor function. For example, we can instantiate a Levenshtein similarity comparator that ignores differences in upper/lowercase characters as follows:
comparator <- Levenshtein(similarity = TRUE, normalize = TRUE, ignore_case = TRUE)
We can apply the comparator to character vectors element-wise as follows:
x <- c("John Doe", "Jane Doe")
y <- c("jonathon doe", "jane doe")
elementwise(comparator, x, y)
#> [1] 0.6666667 1.0000000
# shorthand for above
comparator(x, y)
#> [1] 0.6666667 1.0000000
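To see where a value such as 0.6666667 can come from, the sketch below recomputes the similarities using only base R. It assumes the normalized distance is 2d / (|x| + |y| + d), where d is the Levenshtein distance, and that the similarity is one minus that value. This formula is an assumption inferred from the outputs printed in this document, not a description of the package's internals. The helper `norm_lev_sim()` is hypothetical and not part of comparator.

```r
# Hypothetical helper (not part of comparator): normalized Levenshtein
# similarity built on base R's utils::adist().
norm_lev_sim <- function(x, y) {
  # Edit distance for every pair, ignoring case; rows index x, columns index y
  d <- utils::adist(x, y, ignore.case = TRUE)
  # Combined length |x| + |y| for every pair
  len_sum <- outer(nchar(x), nchar(y), "+")
  # Assumed normalization: 2d / (|x| + |y| + d); similarity is its complement
  sim <- 1 - 2 * d / (len_sum + d)
  # Two empty strings would give 0/0; treat them as identical
  sim[len_sum == 0] <- 1
  sim
}

x <- c("John Doe", "Jane Doe")
y <- c("jonathon doe", "jane doe")
sim <- norm_lev_sim(x, y)

# The diagonal holds the element-wise similarities
diag(sim)
#> [1] 0.6666667 1.0000000
```

For the first pair, four insertions turn "john doe" into "jonathon doe", so d = 4 and the similarity is 1 - 8 / (8 + 12 + 4) = 2/3.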
This comparator is also defined on sequences:
x_seq <- list(c(1, 2, 1, 1), c(1, 2, 3, 4))
y_seq <- list(c(4, 3, 2, 1), c(1, 2, 3, 1))
elementwise(comparator, x_seq, y_seq)
#> [1] 0.4545455 0.7777778
# shorthand for above
comparator(x_seq, y_seq)
#> [1] 0.4545455 0.7777778
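The sequence case can be illustrated with a small dynamic-programming sketch in base R, where each vector element plays the role of a character. The functions `seq_lev()` and `seq_sim()` are hypothetical helpers written for this illustration, with unit costs for insertion, deletion and substitution. The normalization 2d / (|x| + |y| + d) is again an assumption inferred from the printed outputs rather than taken from the package source.

```r
# Hypothetical helper: Levenshtein distance between two vectors,
# using the standard dynamic-programming table with unit costs.
seq_lev <- function(a, b) {
  n <- length(a)
  m <- length(b)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- 0:n  # cost of deleting the first i elements of a
  d[1, ] <- 0:m  # cost of inserting the first j elements of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0 else 1
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1,  # deletion
        d[i + 1, j] + 1,  # insertion
        d[i, j] + cost    # match or substitution
      )
    }
  }
  d[n + 1, m + 1]
}

# Hypothetical helper: similarity under the assumed normalization
seq_sim <- function(a, b) {
  d <- seq_lev(a, b)
  total <- length(a) + length(b)
  if (total == 0) return(1)  # two empty sequences are identical
  1 - 2 * d / (total + d)
}

x_seq <- list(c(1, 2, 1, 1), c(1, 2, 3, 4))
y_seq <- list(c(4, 3, 2, 1), c(1, 2, 3, 1))
sims <- mapply(seq_sim, x_seq, y_seq)
sims
#> [1] 0.4545455 0.7777778
```

The first pair has distance 3, giving 1 - 6 / (4 + 4 + 3) = 5/11, and the second has distance 1, giving 1 - 2 / (4 + 4 + 1) = 7/9.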
Pairwise comparisons are also supported using the following syntax:
# compare each string in x with each string in y and return a similarity matrix
pairwise(comparator, x, y, return_matrix = TRUE)
#>           [,1]      [,2]
#> [1,] 0.6666667 0.6842105
#> [2,] 0.5384615 1.0000000
# compare the strings in x pairwise and return a similarity matrix
pairwise(comparator, x, return_matrix = TRUE)
#>           [,1]      [,2]
#> [1,] 1.0000000 0.6842105
#> [2,] 0.6842105 1.0000000