Merge records

A common setup is the following: x is a dataset of names with mistakes and abbreviations, while y is a dataset of ids and their possible names (the “true” or “master” dataset). The goal is to find the id of each observation in x.

Pre-cleaning

The package includes two functions that can be applied to the “true” dataset y before using fuzzy_join. They help control both type I and type II errors when matching on names.

id <- c(1, 1, 2, 2)
name <- c("coca cola company", "coca cola incorporated", "apple incorporated", "apple corp")
count_combinations(name, id = id)
#>   id         name count_within count_across
#> 1  1         coca            2            1
#> 2  1         cola            2            1
#> 3  1      company            1            1
#> 4  1 incorporated            1            2
#> 5  2        apple            2            1
#> 6  2         corp            1            1
#> 7  2 incorporated            1            2

Words with high count_within and low count_across are good identifiers, since they are specific to some id. On the other hand, words with low count_within and high count_across are not good identifiers, and one may want to delete these words from x and y.
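To make the two counts concrete, here is a minimal base R sketch of one reading that is consistent with the output above: count_within as the number of names of a given id that contain the word, and count_across as the number of distinct ids in which the word appears. This is an assumption about the semantics, not the package's implementation, and the intermediate names (tokens, within, across, out) are made up for illustration.

```r
id <- c(1, 1, 2, 2)
name <- c("coca cola company", "coca cola incorporated",
          "apple incorporated", "apple corp")

# Split each name into words and keep track of which name and id they come from
words <- strsplit(name, " ", fixed = TRUE)
tokens <- data.frame(
  id = rep(id, lengths(words)),
  name_index = rep(seq_along(name), lengths(words)),
  word = unlist(words),
  stringsAsFactors = FALSE
)
tokens <- unique(tokens)  # count a word at most once per name

# Assumed meaning of count_within: names of this id that contain the word
within <- aggregate(name_index ~ id + word, data = tokens, FUN = length)
names(within)[3] <- "count_within"

# Assumed meaning of count_across: distinct ids in which the word appears
across <- aggregate(id ~ word, data = unique(tokens[c("id", "word")]), FUN = length)
names(across)[2] <- "count_across"

out <- merge(within, across, by = "word")
out <- out[order(out$id, out$word), c("id", "word", "count_within", "count_across")]
print(out, row.names = FALSE)
```

Under this reading, "incorporated" appears in two ids and is therefore a weak identifier, while "coca" and "apple" each appear in every name of exactly one id.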

Fuzzy joins

For each row in x, fuzzy_join finds the closest row(s) in y under a given metric. The distance between two rows is a weighted average of string distances computed over multiple columns. Both the weights and the string distance can be specified by the user. By default, fuzzy_join uses the Jaro-Winkler distance with a Winkler adjustment of 0.1 (which treats strings that share a common prefix as closer).

library(data.table)  # provides data.table()
x <- data.table(a = c("france", "franc"), b = c("arras", "dijon"))
y <- data.table(a = c("franc", "france"), b = c("arvars", "dijjon"))
fuzzy_join(x, y, fuzzy = c("a", "b"), w = c(0.1, 0.9))
#>      distance    a.x   b.x    a.y    b.y
#> 1: 0.09133333 france arras  franc arvars
#> 2: 0.03833333  franc dijon france dijjon
fuzzy_join(x, y, exact = "a", fuzzy = "b")
#>   distance    a.x   b.x    a.y    b.y
#>          0  franc dijon  franc arvars
#>          0 france arras france dijjon
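The weighted distance can be sketched by hand with the stringdist package. The block below is an illustration under stated assumptions, not the internals of fuzzy_join: it assumes the per-column distances are Jaro-Winkler with p = 0.1 and that they are combined as a plain weighted average. The variable names (mats, d, best, best_dist) are hypothetical.

```r
library(stringdist)

x <- data.frame(a = c("france", "franc"), b = c("arras", "dijon"),
                stringsAsFactors = FALSE)
y <- data.frame(a = c("franc", "france"), b = c("arvars", "dijjon"),
                stringsAsFactors = FALSE)
cols <- c("a", "b")
w <- c(0.1, 0.9)

# One distance matrix per column: rows index x, columns index y
mats <- lapply(cols, function(col) {
  stringdistmatrix(x[[col]], y[[col]], method = "jw", p = 0.1)
})

# Weighted average of the per-column matrices (assumed combination rule)
d <- Reduce(`+`, Map(`*`, mats, w)) / sum(w)

# For each row of x, the closest row of y and the corresponding distance
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_len(nrow(x)), best)]

data.frame(distance = best_dist, x, y[best, ], row.names = NULL)
```

With these inputs the sketch pairs each row of x with the same row of y as the first fuzzy_join call, and the distances agree with the reported 0.0913 and 0.0383.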

The function corresponds roughly to the Stata command reclink. The type of distance between strings can be chosen freely, thanks to the package stringdist.
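As a brief illustration of what stringdist offers, the sketch below compares the same pair of strings under several of its methods. It calls stringdist directly; how a method is passed through to fuzzy_join is not shown here, since that argument is not described above.

```r
library(stringdist)

a <- "arras"
b <- "arvars"

# A few of the distances implemented by stringdist
methods <- c("lv", "osa", "lcs", "jw", "cosine")
dists <- sapply(methods, function(m) stringdist(a, b, method = m))
print(dists)

# Jaro-Winkler with a prefix adjustment, the default described above
stringdist(a, b, method = "jw", p = 0.1)
```

Edit-based distances such as "lv" return integer-valued counts, while "jw" and "cosine" are scaled to lie between 0 and 1, so weights may need adjusting when switching between them.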