Record linkage using machine learning

In this example we show how reclin2 can be used in combination with machine learning to perform record linkage. We use the same example as in the introduction vignette and skip over some of the initial steps of the linkage project. We use plain logistic regression: not the most sophisticated machine learning algorithm, but more than enough for this simple example. Other algorithms are easily substituted.

When performing record linkage, we compare combinations of records from both datasets. After comparison we end up with a large dataset of pairs with properties of these pairs (the comparison vectors). The goal of record linkage is to divide these pairs into two groups: the matching set, with pairs in which both records belong to the same object, and the non-matching set, with pairs in which the records belong to different objects. Record linkage is, therefore, a classification problem: when we know for some of the pairs whether they belong to the matching or the non-matching set, we can use those pairs to train a supervised classification method.

Generate the pairs and compare

First we have to generate all pairs and compare them. This is similar to regular probabilistic linkage.

> library(reclin2)
> library(data.table)
> data("linkexample1", "linkexample2")
> print(linkexample1)
  id lastname firstname    address sex postcode
1  1    Smith      Anna 12 Mainstr   F  1234 AB
2  2    Smith    George 12 Mainstr   M  1234 AB
3  3  Johnson      Anna 61 Mainstr   F  1234 AB
4  4  Johnson   Charles 61 Mainstr   M  1234 AB
5  5  Johnson    Charly 61 Mainstr   M  1234 AB
6  6 Schwartz       Ben  1 Eaststr   M  6789 XY
> print(linkexample2)
  id lastname firstname       address  sex postcode
1  2    Smith    Gearge 12 Mainstreet <NA>  1234 AB
2  3   Jonson        A. 61 Mainstreet    F  1234 AB
3  4  Johnson   Charles    61 Mainstr    F  1234 AB
4  6 Schwartz       Ben        1 Main    M  6789 XY
5  7 Schwartz      Anna     1 Eaststr    F  6789 XY
> pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"), 
+   inplace = TRUE, comparators = list(lastname = cmp_jarowinkler(), 
+   firstname = cmp_jarowinkler(), address = cmp_jarowinkler()))
> print(pairs)
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname   address    sex
    <int> <int>    <num>     <num>     <num> <lgcl>
 1:     1     1 1.000000 0.4722222 0.9230769     NA
 2:     1     2 0.000000 0.5833333 0.8641026   TRUE
 3:     1     3 0.447619 0.4642857 0.9333333   TRUE
 4:     2     1 1.000000 0.8888889 0.9230769     NA
 5:     2     2 0.000000 0.0000000 0.8641026  FALSE
 6:     2     3 0.447619 0.5396825 0.9333333  FALSE
 7:     3     1 0.447619 0.4722222 0.8641026     NA
 8:     3     2 0.952381 0.5833333 0.9230769   TRUE
 9:     3     3 1.000000 0.4642857 1.0000000   TRUE
10:     4     1 0.447619 0.6428571 0.8641026     NA
11:     4     2 0.952381 0.0000000 0.9230769  FALSE
12:     4     3 1.000000 1.0000000 1.0000000  FALSE
13:     5     1 0.447619 0.5555556 0.8641026     NA
14:     5     2 0.952381 0.0000000 0.9230769  FALSE
15:     5     3 1.000000 0.8492063 1.0000000  FALSE
16:     6     4 1.000000 1.0000000 0.6111111   TRUE
17:     6     5 1.000000 0.5277778 1.0000000  FALSE
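
To see where the 17 pairs come from, the effect of blocking can be reproduced in base R. The sketch below only illustrates the idea and is not how pair_blocking is implemented; the helper data frames and the column names x_row and y_row are made up for this example.

```r
# Postcodes of the records in linkexample1 and linkexample2, copied from the
# printed data above.
x <- data.frame(x_row = 1:6,
  postcode = c(rep("1234 AB", 5), "6789 XY"))
y <- data.frame(y_row = 1:5,
  postcode = c(rep("1234 AB", 3), rep("6789 XY", 2)))

# Blocking keeps only pairs of records that agree on the blocking variable,
# which is what an inner join on that variable produces.
blocked <- merge(x, y, by = "postcode")

n_blocked <- nrow(blocked)       # number of pairs after blocking
n_all     <- nrow(x) * nrow(y)   # number of pairs without blocking
print(c(blocked = n_blocked, all = n_all))
```

Blocking on postcode reduces the 30 possible pairs to the 17 pairs shown above.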

One of the things we run into is that the variable sex has missing values. We could set these to FALSE (this is what is done when calling problink_em during estimation of the model), but with machine learning we could also include these as a separate category. For that we first need to define a custom comparison function.

> na_as_class <- function(x, y) {
+   factor(
+     ifelse(is.na(x) | is.na(y), 2L, (y == x)*1L),
+     levels = 0:2, labels = c("eq", "uneq", "mis"))
+ }

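We then remove the old variable sex (otherwise compare_pairs will complain that we cannot assign a factor to a logical vector) and compare the pairs again with the new comparison function. Note that, as written, na_as_class labels code 0 (values differ) as eq and code 1 (values agree) as uneq, so in the output below eq marks pairs whose values differ; this can be checked against the logical sex column above. This only affects the labels, not the model fitted later on.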

> pairs[, sex := NULL]
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname   address
    <int> <int>    <num>     <num>     <num>
 1:     1     1 1.000000 0.4722222 0.9230769
 2:     1     2 0.000000 0.5833333 0.8641026
 3:     1     3 0.447619 0.4642857 0.9333333
 4:     2     1 1.000000 0.8888889 0.9230769
 5:     2     2 0.000000 0.0000000 0.8641026
 6:     2     3 0.447619 0.5396825 0.9333333
 7:     3     1 0.447619 0.4722222 0.8641026
 8:     3     2 0.952381 0.5833333 0.9230769
 9:     3     3 1.000000 0.4642857 1.0000000
10:     4     1 0.447619 0.6428571 0.8641026
11:     4     2 0.952381 0.0000000 0.9230769
12:     4     3 1.000000 1.0000000 1.0000000
13:     5     1 0.447619 0.5555556 0.8641026
14:     5     2 0.952381 0.0000000 0.9230769
15:     5     3 1.000000 0.8492063 1.0000000
16:     6     4 1.000000 1.0000000 0.6111111
17:     6     5 1.000000 0.5277778 1.0000000
> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"), 
+   inplace = TRUE, comparators = list(lastname = cmp_jarowinkler(), 
+   firstname = cmp_jarowinkler(), address = cmp_jarowinkler(), sex = na_as_class))
> print(pairs)
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname   address    sex
    <int> <int>    <num>     <num>     <num> <fctr>
 1:     1     1 1.000000 0.4722222 0.9230769    mis
 2:     1     2 0.000000 0.5833333 0.8641026   uneq
 3:     1     3 0.447619 0.4642857 0.9333333   uneq
 4:     2     1 1.000000 0.8888889 0.9230769    mis
 5:     2     2 0.000000 0.0000000 0.8641026     eq
 6:     2     3 0.447619 0.5396825 0.9333333     eq
 7:     3     1 0.447619 0.4722222 0.8641026    mis
 8:     3     2 0.952381 0.5833333 0.9230769   uneq
 9:     3     3 1.000000 0.4642857 1.0000000   uneq
10:     4     1 0.447619 0.6428571 0.8641026    mis
11:     4     2 0.952381 0.0000000 0.9230769     eq
12:     4     3 1.000000 1.0000000 1.0000000     eq
13:     5     1 0.447619 0.5555556 0.8641026    mis
14:     5     2 0.952381 0.0000000 0.9230769     eq
15:     5     3 1.000000 0.8492063 1.0000000     eq
16:     6     4 1.000000 1.0000000 0.6111111   uneq
17:     6     5 1.000000 0.5277778 1.0000000     eq
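
A comparison function receives the two vectors of values to compare and returns one value per pair, so it can be tried out on a few hand-made values before it is passed to compare_pairs. The sketch below does this for a variant of the function used above; the name na_as_class_alt and the test vectors are hypothetical, and the variant labels the codes 0, 1 and 2 as uneq, eq and mis.

```r
# Hypothetical variant of na_as_class: code 0 (values differ) is labelled
# "uneq", code 1 (values agree) "eq" and code 2 (a value is missing) "mis".
na_as_class_alt <- function(x, y) {
  code <- ifelse(is.na(x) | is.na(y), 2L, (y == x) * 1L)
  factor(code, levels = 0:2, labels = c("uneq", "eq", "mis"))
}

# Hand-made values: equal, unequal, missing, equal.
sex_x <- c("F", "M", "F", "M")
sex_y <- c("F", "F", NA, "M")

res <- na_as_class_alt(sex_x, sex_y)
print(res)
```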

Estimate the model and use the model to classify the pairs

In order to estimate the model we need some pairs for which we know the truth. One way of obtaining this information is by reviewing some of the pairs. The number of pairs generally grows as O(N²), with N the size of the smallest dataset, while the number of matches among these pairs is usually O(N). The fraction of matches among the pairs is therefore O(1/N), which is usually very small. When sampling pairs for review it is therefore usually a good idea not to sample completely at random but, for example, to oversample pairs that agree on more variables.
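
One possible way to oversample pairs that agree on more variables is to sample with weights that increase with the number of agreeing variables. The sketch below uses simulated comparison results; the data, the object names and the form of the weights are all assumptions made for illustration.

```r
set.seed(1)

# Simulated comparison results for 1000 pairs on three variables; most pairs
# disagree on most variables, as is typical for non-matches.
pairs_demo <- data.frame(
  lastname  = runif(1000) < 0.10,
  firstname = runif(1000) < 0.10,
  address   = runif(1000) < 0.05
)

# Number of variables on which each pair agrees (0 to 3).
pairs_demo$n_agree <- rowSums(pairs_demo[, c("lastname", "firstname", "address")])

# Sampling weight that grows with the level of agreement. The factor 10 is an
# arbitrary choice for this illustration.
w <- 10^pairs_demo$n_agree

# Draw 50 pairs for review, without replacement.
review <- sample(nrow(pairs_demo), size = 50, prob = w)

# Average agreement among all pairs and among the pairs selected for review.
print(c(all = mean(pairs_demo$n_agree),
  review = mean(pairs_demo$n_agree[review])))
```

Because such a sample is no longer representative of all pairs, it is worth keeping the weights alongside the reviewed pairs.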

Another way of getting a training dataset is to use additional information when it is available. For example, when linking a dataset to a population register, an official id might be available for some of the records in the dataset. For these records the true match status can be determined. This is what we will simulate in the example below. Let’s assume we know the id for three of the records in linkexample2:

> linkexample2$known_id <- linkexample2$id
> linkexample2$known_id[c(2,5)] <- NA
> setDT(linkexample2)

We then know the true match status for the pairs that involve these records. Below we add this to the pairs:

> compare_vars(pairs, "y", on_x = "id", on_y = "known_id", y = linkexample2, inplace = TRUE)

Note that we supply y = linkexample2 in the call. This is needed as the copy of linkexample2 stored with pairs does not contain the known_id column. We can also add the true status for all records to measure the performance of the linkage in the end:

> compare_vars(pairs, "y_true", on_x = "id", on_y = "id", inplace = TRUE)
> print(pairs)
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname   address    sex      y y_true
    <int> <int>    <num>     <num>     <num> <fctr> <lgcl> <lgcl>
 1:     1     1 1.000000 0.4722222 0.9230769    mis  FALSE  FALSE
 2:     1     2 0.000000 0.5833333 0.8641026   uneq     NA  FALSE
 3:     1     3 0.447619 0.4642857 0.9333333   uneq  FALSE  FALSE
 4:     2     1 1.000000 0.8888889 0.9230769    mis   TRUE   TRUE
 5:     2     2 0.000000 0.0000000 0.8641026     eq     NA  FALSE
 6:     2     3 0.447619 0.5396825 0.9333333     eq  FALSE  FALSE
 7:     3     1 0.447619 0.4722222 0.8641026    mis  FALSE  FALSE
 8:     3     2 0.952381 0.5833333 0.9230769   uneq     NA   TRUE
 9:     3     3 1.000000 0.4642857 1.0000000   uneq  FALSE  FALSE
10:     4     1 0.447619 0.6428571 0.8641026    mis  FALSE  FALSE
11:     4     2 0.952381 0.0000000 0.9230769     eq     NA  FALSE
12:     4     3 1.000000 1.0000000 1.0000000     eq   TRUE   TRUE
13:     5     1 0.447619 0.5555556 0.8641026    mis  FALSE  FALSE
14:     5     2 0.952381 0.0000000 0.9230769     eq     NA  FALSE
15:     5     3 1.000000 0.8492063 1.0000000     eq  FALSE  FALSE
16:     6     4 1.000000 1.0000000 0.6111111   uneq   TRUE   TRUE
17:     6     5 1.000000 0.5277778 1.0000000     eq     NA  FALSE
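
The values in the y column can be understood as a comparison of the id of the first record with the known id of the second record. The base R sketch below reproduces this for a few of the pairs; it illustrates the idea only and is not how compare_vars is implemented, and the names id_x, known_id_y, x_row and y_row are made up for this example.

```r
# Ids as printed above; known_id is NA for records 2 and 5 of linkexample2.
id_x       <- c(1, 2, 3, 4, 5, 6)
known_id_y <- c(2, NA, 4, 6, NA)

# A few of the pairs (row numbers in linkexample1 and linkexample2).
p <- data.frame(x_row = c(1, 1, 2, 4, 6, 6), y_row = c(1, 2, 1, 3, 4, 5))

# A pair is a known match when both ids are equal; when the id of the second
# record is unknown the comparison yields NA.
p$y <- id_x[p$x_row] == known_id_y[p$y_row]
print(p)
```

With R's default na.action, glm leaves out the pairs for which y is NA, so only the pairs with a known status are used for fitting.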

We now have all of the information needed to estimate our (machine learning) model. Note that this will give a number of warnings, as we are estimating six parameters with only eleven observations, and the parameters will not be estimated reliably.

> m <- glm(y ~ lastname + firstname + address + sex, data = pairs, family = binomial())

> pairs[, prob := predict(m, type = "response", newdata = pairs)]
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname   address    sex      y y_true         prob
    <int> <int>    <num>     <num>     <num> <fctr> <lgcl> <lgcl>        <num>
 1:     1     1 1.000000 0.4722222 0.9230769    mis  FALSE  FALSE 2.220446e-16
 2:     1     2 0.000000 0.5833333 0.8641026   uneq     NA  FALSE 1.000000e+00
 3:     1     3 0.447619 0.4642857 0.9333333   uneq  FALSE  FALSE 7.317210e-12
 4:     2     1 1.000000 0.8888889 0.9230769    mis   TRUE   TRUE 1.000000e+00
 5:     2     2 0.000000 0.0000000 0.8641026     eq     NA  FALSE 2.220446e-16
 6:     2     3 0.447619 0.5396825 0.9333333     eq  FALSE  FALSE 2.220446e-16
 7:     3     1 0.447619 0.4722222 0.8641026    mis  FALSE  FALSE 2.220446e-16
 8:     3     2 0.952381 0.5833333 0.9230769   uneq     NA   TRUE 2.214629e-12
 9:     3     3 1.000000 0.4642857 1.0000000   uneq  FALSE  FALSE 2.220446e-16
10:     4     1 0.447619 0.6428571 0.8641026    mis  FALSE  FALSE 1.665098e-11
11:     4     2 0.952381 0.0000000 0.9230769     eq     NA  FALSE 2.220446e-16
12:     4     3 1.000000 1.0000000 1.0000000     eq   TRUE   TRUE 1.000000e+00
13:     5     1 0.447619 0.5555556 0.8641026    mis  FALSE  FALSE 2.220446e-16
14:     5     2 0.952381 0.0000000 0.9230769     eq     NA  FALSE 2.220446e-16
15:     5     3 1.000000 0.8492063 1.0000000     eq  FALSE  FALSE 4.477438e-11
16:     6     4 1.000000 1.0000000 0.6111111   uneq   TRUE   TRUE 1.000000e+00
17:     6     5 1.000000 0.5277778 1.0000000     eq     NA  FALSE 2.220446e-16
> pairs[, select := prob > 0.5]
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname   address    sex      y y_true         prob
    <int> <int>    <num>     <num>     <num> <fctr> <lgcl> <lgcl>        <num>
 1:     1     1 1.000000 0.4722222 0.9230769    mis  FALSE  FALSE 2.220446e-16
 2:     1     2 0.000000 0.5833333 0.8641026   uneq     NA  FALSE 1.000000e+00
 3:     1     3 0.447619 0.4642857 0.9333333   uneq  FALSE  FALSE 7.317210e-12
 4:     2     1 1.000000 0.8888889 0.9230769    mis   TRUE   TRUE 1.000000e+00
 5:     2     2 0.000000 0.0000000 0.8641026     eq     NA  FALSE 2.220446e-16
 6:     2     3 0.447619 0.5396825 0.9333333     eq  FALSE  FALSE 2.220446e-16
 7:     3     1 0.447619 0.4722222 0.8641026    mis  FALSE  FALSE 2.220446e-16
 8:     3     2 0.952381 0.5833333 0.9230769   uneq     NA   TRUE 2.214629e-12
 9:     3     3 1.000000 0.4642857 1.0000000   uneq  FALSE  FALSE 2.220446e-16
10:     4     1 0.447619 0.6428571 0.8641026    mis  FALSE  FALSE 1.665098e-11
11:     4     2 0.952381 0.0000000 0.9230769     eq     NA  FALSE 2.220446e-16
12:     4     3 1.000000 1.0000000 1.0000000     eq   TRUE   TRUE 1.000000e+00
13:     5     1 0.447619 0.5555556 0.8641026    mis  FALSE  FALSE 2.220446e-16
14:     5     2 0.952381 0.0000000 0.9230769     eq     NA  FALSE 2.220446e-16
15:     5     3 1.000000 0.8492063 1.0000000     eq  FALSE  FALSE 4.477438e-11
16:     6     4 1.000000 1.0000000 0.6111111   uneq   TRUE   TRUE 1.000000e+00
17:     6     5 1.000000 0.5277778 1.0000000     eq     NA  FALSE 2.220446e-16
    select
    <lgcl>
 1:  FALSE
 2:   TRUE
 3:  FALSE
 4:   TRUE
 5:  FALSE
 6:  FALSE
 7:  FALSE
 8:  FALSE
 9:  FALSE
10:  FALSE
11:  FALSE
12:   TRUE
13:  FALSE
14:  FALSE
15:  FALSE
16:   TRUE
17:  FALSE
> table(pairs$select, pairs$y_true)
       
        FALSE TRUE
  FALSE    12    1
  TRUE      1    3

Given the small size of the dataset we have to estimate the model on, this is not too bad: one non-match is selected and one true match is missed.
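
The table can be summarised with the usual precision and recall measures. The sketch below retypes the counts from the table by hand, so that it runs on its own; the object names are chosen for this example.

```r
# Confusion table from above: rows are the selection, columns the true status.
tab <- matrix(c(12, 1, 1, 3), nrow = 2,
  dimnames = list(select = c("FALSE", "TRUE"), y_true = c("FALSE", "TRUE")))

# Precision: fraction of the selected pairs that are true matches.
precision <- tab["TRUE", "TRUE"] / sum(tab["TRUE", ])

# Recall: fraction of the true matches that are selected.
recall <- tab["TRUE", "TRUE"] / sum(tab[, "TRUE"])

print(c(precision = precision, recall = recall))
```

Both measures equal 3/4 here: three of the four selected pairs are true matches, and three of the four true matches are selected.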

Create the linked data set

We now know which pairs are to be linked, but we still have to actually link them. The function link does that (the optional arguments all_x and all_y control the type of linkage):

> linked_data_set <- link(pairs, selection = "select", all_y = TRUE)
> print(linked_data_set)
  Total number of pairs: 5 pairs

Key: <.y>
      .y    .x  id.x lastname.x firstname.x  address.x  sex.x postcode.x  id.y
   <int> <int> <int>     <fctr>      <fctr>     <fctr> <fctr>     <fctr> <int>
1:     1     2     2      Smith      George 12 Mainstr      M    1234 AB     2
2:     2     1     1      Smith        Anna 12 Mainstr      F    1234 AB     3
3:     3     4     4    Johnson     Charles 61 Mainstr      M    1234 AB     4
4:     4     6     6   Schwartz         Ben  1 Eaststr      M    6789 XY     6
5:     5    NA    NA       <NA>        <NA>       <NA>   <NA>       <NA>     7
   lastname.y firstname.y     address.y  sex.y postcode.y
       <fctr>      <fctr>        <fctr> <fctr>     <fctr>
1:      Smith      Gearge 12 Mainstreet   <NA>    1234 AB
2:     Jonson          A. 61 Mainstreet      F    1234 AB
3:    Johnson     Charles    61 Mainstr      F    1234 AB
4:   Schwartz         Ben        1 Main      M    6789 XY
5:   Schwartz        Anna     1 Eaststr      F    6789 XY
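
The effect of all_y = TRUE in the output above resembles that of all.y = TRUE in base R's merge: records from the second dataset without a selected pair are kept, with missing values for the columns of the first dataset. The sketch below shows this with merge on a cut-down version of the data; the analogy and the column names x_row, y_row and id_y are assumptions made for illustration.

```r
# Selected pairs from the example (row numbers in both datasets).
selected <- data.frame(y_row = 1:4, x_row = c(2L, 1L, 4L, 6L))

# Minimal stand-in for linkexample2: row number and id only.
y <- data.frame(y_row = 1:5, id_y = c(2, 3, 4, 6, 7))

# all.y = TRUE keeps the record from y that has no selected pair; its x_row
# is NA.
linked <- merge(selected, y, by = "y_row", all.y = TRUE)
print(linked)
```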
