Multistage deterministic linkage in R

10 November 2019

Introduction

Linking multiple datasets to consolidate information is a common task in research, particularly in those involving the use of “big data”. Deterministic record linkage is the simplest and most common method for record linkage however, its accuracy relies on data quality. Too many incorrect or missing values will often provide an unacceptable number of false matches or mismatches.

This function aims to provide a simple, multistage and flexible implementation of deterministic record linkage that tries to improve the linkage of datasets containing missing or incorrect group identifiers e.g. customer, patient or admission codes. In such instances, alternative identifiers like dates, names, height or other attributes are used in a specified order of preference.

Uses

Arguments in record_group() control separate aspects of the linkage process. Different combinations of each argument can be used to link datasets in a variety of ways. Examples of these include;

Implementation

Record linkage is done in stages. Each stage is considered more relevant than the subsequent one i.e. a match at stage 1 is taken to be more relevant than one at stage 2.

Records are assigned a unique group ID if they match on a criteria. The group ID is essentially the record ID (sn) of one of the matching records. Therefore, if you use a familiar record ID (sn), you can link the results back to the original dataset.

The criteria should be provided as column names of the attributes to be compared. This argument takes advantage of dplyr quasiquotation.

One or more sub_criteria can be used at each stage to include additional conditions for a match. This is provided as a list of named vectors. Each named vector contains column names. When a sub_criteria is used, records will only be assigned a group ID when they match on the criteria, and at least one named column in each sub_criteria.

Each sub_criteria should be paired with a corresponding criteria. To do this, the vector name for each sub_criteria should be prefixed with "s" and the corresponding criteria number e.g. "s1" or "s4". When a criteria is paired to more than one sub_criteria, include a suffix after the criteria number e.g. "s1a", "s1b" and "s1c". See examples. Any sub_criteria not paired to a criteria will be ignored. The sub_criteria argument does not support quasiquotation.

At each stage, the function prints the number of records that have been assigned a group ID and how many groups have only one record.

NOTE; to_s4 and to_s4() changes the function’s output from a data.frame (current default) to pid objects. pid objects will be the default output in the next release.

library(diyar); library(dplyr)
data(patient_list); 
dbs <- patient_list[c("forename","surname","sex")]; dbs
#> # A tibble: 6 x 3
#>   forename surname sex  
#>   <chr>    <chr>   <chr>
#> 1 James    Green   M    
#> 2 ESTHER   Kulmar  F    
#> 3 ""       OBI     F    
#> 4 Jamey    Green   M    
#> 5 Daniel   Kulmar  M    
#> 6 Henry    OBI     M

# 1 stage <- Matching surname only
dbs$pids_a <-record_group(dbs, criteria = surname, to_s4 = TRUE)
#> 
#> Group criteria 1 - `surname`
#> 6 of 6 record(s) have been assigned a group ID. 0 record(s) not yet grouped.
#> 0 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 0.
#> 
#> Record grouping complete - 0 record(s) assigned a group unique ID.

# 2 stage - Matching surname, then matching sex
dbs$pids_b <- record_group(dbs, criteria = c(surname, sex), display = FALSE, to_s4 = TRUE)
#> Record grouping complete - 0 record(s) assigned a group unique ID.

dbs
#> # A tibble: 6 x 5
#>   forename surname sex   pids_a       pids_b      
#>   <chr>    <chr>   <chr> <pid>        <pid>       
#> 1 James    Green   M     P-1 (CRI 01) P-1 (CRI 01)
#> 2 ESTHER   Kulmar  F     P-2 (CRI 01) P-2 (CRI 01)
#> 3 ""       OBI     F     P-3 (CRI 01) P-3 (CRI 01)
#> 4 Jamey    Green   M     P-1 (CRI 01) P-1 (CRI 01)
#> 5 Daniel   Kulmar  M     P-2 (CRI 01) P-2 (CRI 01)
#> 6 Henry    OBI     M     P-3 (CRI 01) P-3 (CRI 01)
# Note that exact matching is case sensitive. See range matching.

Record matching

The choice and ordering of column names in criteria and sub_criteria determines the relevance of matches at each stage. This flexibility allows you link records in different ways however, you should consider a practical combination which would yield more “true” matches than “false” matches.

For example, in patient_list above, linking on surnames and then sex (pid_b) leads to a different result compared to when linking on sex before surnames (pid_c). See Record group expansion.

dbs$pids_c <- record_group(dbs, criteria =  c(sex, surname), display = FALSE, to_s4 = TRUE)
#> Record grouping complete - 0 record(s) assigned a group unique ID.

dbs
#> # A tibble: 6 x 6
#>   forename surname sex   pids_a       pids_b       pids_c      
#>   <chr>    <chr>   <chr> <pid>        <pid>        <pid>       
#> 1 James    Green   M     P-1 (CRI 01) P-1 (CRI 01) P-1 (CRI 01)
#> 2 ESTHER   Kulmar  F     P-2 (CRI 01) P-2 (CRI 01) P-2 (CRI 01)
#> 3 ""       OBI     F     P-3 (CRI 01) P-3 (CRI 01) P-2 (CRI 01)
#> 4 Jamey    Green   M     P-1 (CRI 01) P-1 (CRI 01) P-1 (CRI 01)
#> 5 Daniel   Kulmar  M     P-2 (CRI 01) P-2 (CRI 01) P-1 (CRI 01)
#> 6 Henry    OBI     M     P-3 (CRI 01) P-3 (CRI 01) P-1 (CRI 01)

Both result are logically correct considering the criteria used however, a two stage linkage on surnames followed by sex is not the most practical option given the dataset. For instance, based on pid_b, records 3 and 6 have been linked together due to matching surnames, when they could actually be cousins not the same individual. A better combination would be forename at stage 1, followed by surname and sex at stage 2. See pid_d below;

dbs_2 <- patient_list; dbs_2
#> # A tibble: 6 x 4
#>   rd_id forename surname sex  
#>   <int> <chr>    <chr>   <chr>
#> 1     1 James    Green   M    
#> 2     2 ESTHER   Kulmar  F    
#> 3     3 ""       OBI     F    
#> 4     4 Jamey    Green   M    
#> 5     5 Daniel   Kulmar  M    
#> 6     6 Henry    OBI     M

dbs_2$cri_2 <- paste(dbs_2$surname, dbs_2$sex,sep="-")
dbs_2$pid_d <- record_group(dbs_2, rd_id, c(forename, cri_2), display = FALSE, to_s4 = TRUE)
#> Record grouping complete - 4 record(s) assigned a group unique ID.

dbs_2
#> # A tibble: 6 x 6
#>   rd_id forename surname sex   cri_2    pid_d       
#>   <int> <chr>    <chr>   <chr> <chr>    <pid>       
#> 1     1 James    Green   M     Green-M  P-1 (CRI 02)
#> 2     2 ESTHER   Kulmar  F     Kulmar-F P-2 (No Hit)
#> 3     3 ""       OBI     F     OBI-F    P-3 (No Hit)
#> 4     4 Jamey    Green   M     Green-M  P-1 (CRI 02)
#> 5     5 Daniel   Kulmar  M     Kulmar-M P-5 (No Hit)
#> 6     6 Henry    OBI     M     OBI-M    P-6 (No Hit)

As mentioned earlier, at each stage of record linkage, a sub_criteria can be used to include additional conditions for a match. Just like criteria, any column in the dataset can be used as a sub_criteria.

Below are examples of record linkage using different combinations of the same criteria and sub_criteria

library(tidyr)
data(Opes); Opes
#> # A tibble: 8 x 8
#>   rd_id name  department  hair_colour date_of_birth db_pt1 db_pt2  db_pt3 
#>   <int> <chr> <chr>       <chr>       <chr>         <chr>  <chr>   <chr>  
#> 1     1 Ope   Procurement Brown       23/03/1986    23/03  23/1986 03/1986
#> 2     2 Ope   Security    Brown       23/03/1986    23/03  23/1986 03/1986
#> 3     3 Ope   Security    Brown       23/03/1968    23/03  23/1968 03/1968
#> 4     4 Ope   Publishing  Green       01/02/1985    01/02  01/1985 02/1985
#> 5     5 Ope   Publishing  Teal        02/01/1985    02/01  02/1985 01/1985
#> 6     6 Ope   Publishing  Grey        11/03/1964    11/03  11/1964 03/1964
#> 7     7 Ope   Publishing  White       11/03/1964    11/03  11/1964 03/1964
#> 8     8 Ope   Procurement Black       11/10/1985    11/10  11/1985 10/1985

# 1 stage linkage
  # sub_criteria - name, and either department, hair colour or date of birth
Opes$pids_a <- record_group(Opes, criteria = name, 
                            sub_criteria = list("s1a"=c("department","hair_colour","date_of_birth")),
                            display = FALSE, to_s4 = TRUE)
#> Record grouping complete - 0 record(s) assigned a group unique ID.

Opes[c("name","department","hair_colour","date_of_birth","pids_a")]
#> # A tibble: 8 x 5
#>   name  department  hair_colour date_of_birth pids_a      
#>   <chr> <chr>       <chr>       <chr>         <pid>       
#> 1 Ope   Procurement Brown       23/03/1986    P-1 (CRI 01)
#> 2 Ope   Security    Brown       23/03/1986    P-1 (CRI 01)
#> 3 Ope   Security    Brown       23/03/1968    P-1 (CRI 01)
#> 4 Ope   Publishing  Green       01/02/1985    P-4 (CRI 01)
#> 5 Ope   Publishing  Teal        02/01/1985    P-4 (CRI 01)
#> 6 Ope   Publishing  Grey        11/03/1964    P-4 (CRI 01)
#> 7 Ope   Publishing  White       11/03/1964    P-4 (CRI 01)
#> 8 Ope   Procurement Black       11/10/1985    P-1 (CRI 01)
  
# 1 stage linkage 
  # sub_criteria - name, and either department or hair colour, and date of birth 
Opes$pids_b <- record_group(Opes, criteria = name, 
                            sub_criteria = list(
                              "s1a"=c("department","hair_colour"),
                              "s1b"=c("date_of_birth")),
                            display = FALSE, to_s4 = TRUE)
#> Record grouping complete - 4 record(s) assigned a group unique ID.

Opes[c("name","department","hair_colour","date_of_birth","pids_b")]
#> # A tibble: 8 x 5
#>   name  department  hair_colour date_of_birth pids_b      
#>   <chr> <chr>       <chr>       <chr>         <pid>       
#> 1 Ope   Procurement Brown       23/03/1986    P-1 (CRI 01)
#> 2 Ope   Security    Brown       23/03/1986    P-1 (CRI 01)
#> 3 Ope   Security    Brown       23/03/1968    P-3 (No Hit)
#> 4 Ope   Publishing  Green       01/02/1985    P-4 (No Hit)
#> 5 Ope   Publishing  Teal        02/01/1985    P-5 (No Hit)
#> 6 Ope   Publishing  Grey        11/03/1964    P-6 (CRI 01)
#> 7 Ope   Publishing  White       11/03/1964    P-6 (CRI 01)
#> 8 Ope   Procurement Black       11/10/1985    P-8 (No Hit)

# 1 stage linkage 
  # sub_criteria - name, and either department or hair colour, and either day and month of birth, day and year of birth or month and year of birth
Opes$pids_c <- record_group(Opes, criteria = name, 
                            sub_criteria = list(
                              "s1a"=c("department","hair_colour"), 
                              "s1b"=c("db_pt1","db_pt2","db_pt3")),
                            display = FALSE, to_s4 =TRUE)
#> Record grouping complete - 3 record(s) assigned a group unique ID.

Opes[c("name","department","hair_colour","date_of_birth","pids_c")]
#> # A tibble: 8 x 5
#>   name  department  hair_colour date_of_birth pids_c      
#>   <chr> <chr>       <chr>       <chr>         <pid>       
#> 1 Ope   Procurement Brown       23/03/1986    P-1 (CRI 01)
#> 2 Ope   Security    Brown       23/03/1986    P-1 (CRI 01)
#> 3 Ope   Security    Brown       23/03/1968    P-1 (CRI 01)
#> 4 Ope   Publishing  Green       01/02/1985    P-4 (No Hit)
#> 5 Ope   Publishing  Teal        02/01/1985    P-5 (No Hit)
#> 6 Ope   Publishing  Grey        11/03/1964    P-6 (CRI 01)
#> 7 Ope   Publishing  White       11/03/1964    P-6 (CRI 01)
#> 8 Ope   Procurement Black       11/10/1985    P-8 (No Hit)

# 1 stage linkage 
  # sub_criteria - name, and department, and hair colour, and either day and month of birth, day and year of birth or month and year of birth
Opes$pids_d <- record_group(Opes, criteria =name, 
               sub_criteria = list(
                 "s1a"=c("department"),
                 "s1c"=c("hair_colour"),
                 "s1b"=c("db_pt1","db_pt2","db_pt3")),  
               display = FALSE, to_s4 = TRUE)
#> Record grouping complete - 6 record(s) assigned a group unique ID.

Opes[c("name","department","hair_colour","date_of_birth","pids_d")]
#> # A tibble: 8 x 5
#>   name  department  hair_colour date_of_birth pids_d      
#>   <chr> <chr>       <chr>       <chr>         <pid>       
#> 1 Ope   Procurement Brown       23/03/1986    P-1 (No Hit)
#> 2 Ope   Security    Brown       23/03/1986    P-2 (CRI 01)
#> 3 Ope   Security    Brown       23/03/1968    P-2 (CRI 01)
#> 4 Ope   Publishing  Green       01/02/1985    P-4 (No Hit)
#> 5 Ope   Publishing  Teal        02/01/1985    P-5 (No Hit)
#> 6 Ope   Publishing  Grey        11/03/1964    P-6 (No Hit)
#> 7 Ope   Publishing  White       11/03/1964    P-7 (No Hit)
#> 8 Ope   Procurement Black       11/10/1985    P-8 (No Hit)

Note that using sub_criteria costs additional processing time, so it should be avoided when unnecessary. For example, the two implementations below (pids_e and pids_f) will yield the same result. However, the second one will take less time to complete. This time difference is more noticeable with very large datasets.

# 1 stage linkage 
  # stage 1 - name, and date of birth, and department and hair colour 
Opes$pids_e <- record_group(Opes, criteri = name, 
                            sub_criteria = list(
                              "s1a"=c("department"), 
                              "s1b"=c("hair_colour"), 
                              "s1c"=c("date_of_birth")),
                            display = TRUE, to_s4 = TRUE)
#> 
#> Group criteria 1 - `name`
#> Matching criteria 1: iteration 2
#> Matching criteria 1: iteration 3
#> Matching criteria 1: iteration 4
#> Matching criteria 1: iteration 5
#> Matching criteria 1: iteration 6
#> Matching criteria 1: iteration 7
#> Matching criteria 1: iteration 8
#> 8 of 8 record(s) have been assigned a group ID. 0 record(s) not yet grouped.
#> 8 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 8.
#> 
#> Record grouping complete - 8 record(s) assigned a group unique ID.

Opes$cri <- paste(Opes$name, Opes$date_of_birth, Opes$department, Opes$hair_colour, sep="-")

Opes$pids_f <- record_group(Opes, criteria = cri,  display = TRUE, to_s4 =TRUE)
#> 
#> Group criteria 1 - `cri`
#> 8 of 8 record(s) have been assigned a group ID. 0 record(s) not yet grouped.
#> 8 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 8.
#> 
#> Record grouping complete - 8 record(s) assigned a group unique ID.

Opes[c("name","department","hair_colour","date_of_birth","pids_e","pids_f")]
#> # A tibble: 8 x 6
#>   name  department  hair_colour date_of_birth pids_e       pids_f      
#>   <chr> <chr>       <chr>       <chr>         <pid>        <pid>       
#> 1 Ope   Procurement Brown       23/03/1986    P-1 (No Hit) P-1 (No Hit)
#> 2 Ope   Security    Brown       23/03/1986    P-2 (No Hit) P-2 (No Hit)
#> 3 Ope   Security    Brown       23/03/1968    P-3 (No Hit) P-3 (No Hit)
#> 4 Ope   Publishing  Green       01/02/1985    P-4 (No Hit) P-4 (No Hit)
#> 5 Ope   Publishing  Teal        02/01/1985    P-5 (No Hit) P-5 (No Hit)
#> 6 Ope   Publishing  Grey        11/03/1964    P-6 (No Hit) P-6 (No Hit)
#> 7 Ope   Publishing  White       11/03/1964    P-7 (No Hit) P-7 (No Hit)
#> 8 Ope   Procurement Black       11/10/1985    P-8 (No Hit) P-8 (No Hit)

Range matching

Records can be matched in two ways; exact matches as in the examples above, or matching a range of values. The latter is done by converting the range of values to a number_line() object, and the gid argument/slot set to the actual value. This number_line object is then used as a sub_criteria. number_line objects are considered a match if they overlap. See the example below.

library(lubridate)
Opes_c <- Opes["date_of_birth"]
Opes_c
#> # A tibble: 8 x 1
#>   date_of_birth
#>   <chr>        
#> 1 23/03/1986   
#> 2 23/03/1986   
#> 3 23/03/1968   
#> 4 01/02/1985   
#> 5 02/01/1985   
#> 6 11/03/1964   
#> 7 11/03/1964   
#> 8 11/10/1985

# Match on date of birth + 3 months
Opes_c$date_of_birth <- dmy(Opes_c$date_of_birth)
Opes_c$range_a <- expand_number_line(as.number_line(Opes_c$date_of_birth), period(2, "years"), "end")
Opes_c$range_a@gid <- as.numeric(Opes_c$date_of_birth)

Opes_c$pids_a <- record_group(Opes_c, criteria = range_a, to_s4 =TRUE)
#> 
#> Group criteria 1 - `dmvr_d1`
#> Matching criteria 1: iteration 2
#> Matching criteria 1: iteration 3
#> Matching criteria 1: iteration 4
#> Matching criteria 1: iteration 5
#> Matching criteria 1: iteration 6
#> Matching criteria 1: iteration 7
#> Matching criteria 1: iteration 8
#> Matching criteria 1: iteration 9
#> 8 of 8 record(s) have been assigned a group ID. 0 record(s) not yet grouped.
#> 2 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 2.
#> 
#> Record grouping complete - 2 record(s) assigned a group unique ID.

Opes_c[c("date_of_birth","range_a","pids_a")]
#> # A tibble: 8 x 3
#>   date_of_birth range_a                  pids_a      
#>   <date>        <numbr_ln>               <pid>       
#> 1 1986-03-23    1986-03-23 -> 1988-03-23 P-4 (CRI 01)
#> 2 1986-03-23    1986-03-23 -> 1988-03-23 P-4 (CRI 01)
#> 3 1968-03-23    1968-03-23 -> 1970-03-23 P-3 (No Hit)
#> 4 1985-02-01    1985-02-01 -> 1987-02-01 P-4 (CRI 01)
#> 5 1985-01-02    1985-01-02 -> 1987-01-02 P-5 (No Hit)
#> 6 1964-03-11    1964-03-11 -> 1966-03-11 P-6 (CRI 01)
#> 7 1964-03-11    1964-03-11 -> 1966-03-11 P-6 (CRI 01)
#> 8 1985-10-11    1985-10-11 -> 1987-10-11 P-4 (CRI 01)

# Match on age +/- 5 years
Opes_c$age <- as.numeric(round((Sys.Date() - Opes_c$date_of_birth)/365.5)) # approximate age
Opes_c$range_b <- expand_number_line(as.number_line(Opes_c$age), 5, "both")
Opes_c$range_b@gid <- Opes_c$age

Opes_c$pids_b <- record_group(Opes_c, criteria = range_b, to_s4 =TRUE)
#> 
#> Group criteria 1 - `dmvr_d1`
#> Matching criteria 1: iteration 2
#> 8 of 8 record(s) have been assigned a group ID. 0 record(s) not yet grouped.
#> 0 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 0.
#> 
#> Record grouping complete - 0 record(s) assigned a group unique ID.

Opes_c[c("age","range_b","pids_b")]
#> # A tibble: 8 x 3
#>     age range_b    pids_b      
#>   <dbl> <numbr_ln> <pid>       
#> 1    34 29 -> 39   P-1 (CRI 01)
#> 2    34 29 -> 39   P-1 (CRI 01)
#> 3    52 47 -> 57   P-3 (CRI 01)
#> 4    35 30 -> 40   P-1 (CRI 01)
#> 5    35 30 -> 40   P-1 (CRI 01)
#> 6    56 51 -> 61   P-3 (CRI 01)
#> 7    56 51 -> 61   P-3 (CRI 01)
#> 8    34 29 -> 39   P-1 (CRI 01)

Record group expansion

At each stage of record linkage, records are either assigned a new group ID or inherit an existing one. The following scenario explain how these happen;

It’s worth reiterating that record_group() expects the criteria to be listed in order of decreasing relevance. Therefore, existing group IDs can be inherited but will not be overwritten. This is because groups formed at earlier stages are considered more “certain” than those formed at subsequent stages.

The example below with patient_list demonstrates this behaviour.

data(patient_list_2); patient_list_2
#> # A tibble: 5 x 4
#>   rd_id forename surname     sex   
#>   <int> <chr>    <chr>       <chr> 
#> 1     1 ""       Jefferson   Male  
#> 2     2 ""       Jefferson   Female
#> 3     3 Tomi     Abdul       Male  
#> 4     4 Tomi     Abdulkareem Female
#> 5     5 <NA>     <NA>        Male

patient_list_2$pids_a <- record_group(patient_list_2, rd_id, c(forename, surname, sex), to_s4 = TRUE)
#> 
#> Group criteria 1 - `forename`
#> 2 of 5 record(s) have been assigned a group ID. 3 record(s) not yet grouped.
#> 0 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 3.
#> 
#> Group criteria 2 - `surname`
#> 2 of 3 record(s) have been assigned a group ID. 1 record(s) not yet grouped.
#> 0 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 1.
#> 
#> Group criteria 3 - `sex`
#> 1 of 1 record(s) have been assigned a group ID. 0 record(s) not yet grouped.
#> 0 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 0.
#> 
#> Record grouping complete - 0 record(s) assigned a group unique ID.

patient_list_2
#> # A tibble: 5 x 5
#>   rd_id forename surname     sex    pids_a      
#>   <int> <chr>    <chr>       <chr>  <pid>       
#> 1     1 ""       Jefferson   Male   P-1 (CRI 02)
#> 2     2 ""       Jefferson   Female P-1 (CRI 02)
#> 3     3 Tomi     Abdul       Male   P-3 (CRI 01)
#> 4     4 Tomi     Abdulkareem Female P-3 (CRI 01)
#> 5     5 <NA>     <NA>        Male   P-3 (CRI 03)

Handling missing values

Records with missing values for a particular criteria are excluded from that stage of record linkage. If a record has missing values for every listed criteria, it’s assigned a unique group ID at the end of record linkage.

It’s common for databases to use specific characters or numbers to represent missing or unknown data e.g. N/A, Nil, 01/01/1100, 111111 etc. These pseudo-missing values will need to be recoded to one of the two recognised by record_group() - NA or an empty string (""). If this is not done, the function will assume the pseudo-missing values are valid values and therefore group them together. This can cause a continuous cascade of false matches as seen below.

patient_list_2$forename <- ifelse(patient_list_2$rd_id %in% 1:3, "Nil", patient_list_2$forename)
# 2 stage linkage
    # Stage 1 - forename
    # Stage 2 - Surname
patient_list_2$pids_b <- record_group(patient_list_2, criteria = c(forename, surname), 
                                      display = FALSE, to_s4 =TRUE)
#> Record grouping complete - 2 record(s) assigned a group unique ID.

# 2 stage linkage
    # Stage 1 - forename
    # Stage 2 - Surname and sex
patient_list_2 <- mutate(patient_list_2, cri_3 = paste(surname,sex,sep=""))

patient_list_2$pids_c <- record_group(patient_list_2, criteria = c(forename, cri_3), 
                                      display = FALSE, to_s4 = TRUE)
#> Record grouping complete - 2 record(s) assigned a group unique ID.

patient_list_2[c("forename","surname","sex","pids_b","pids_c")]
#> # A tibble: 5 x 5
#>   forename surname     sex    pids_b       pids_c      
#>   <chr>    <chr>       <chr>  <pid>        <pid>       
#> 1 Nil      Jefferson   Male   P-1 (CRI 01) P-1 (CRI 01)
#> 2 Nil      Jefferson   Female P-1 (CRI 01) P-1 (CRI 01)
#> 3 Nil      Abdul       Male   P-1 (CRI 01) P-1 (CRI 01)
#> 4 Tomi     Abdulkareem Female P-4 (No Hit) P-4 (No Hit)
#> 5 <NA>     <NA>        Male   P-5 (No Hit) P-5 (No Hit)

In the example above, records 1-3 are assigned a single group ID based on matching forenames ("Nil"). Then records 4-6 are assigned the same group ID because they having matching surnames with either records 1-3. Even when stage 2 is changed to matching surnames and sex, records 2 and 3 are still “incorrectly” grouped together. Although, this is likely not the desired outcome, it’s the expected result given the parameters supplied to the function. This issue can be addressed by recoding "Nil" to NA or "".

# Using NA as the proxy for missing value
patient_list_2 <- mutate(patient_list_2,forename = ifelse(forename=="Nil",NA,forename))

patient_list_2$pids_d <- record_group(patient_list_2, rd_id, c(forename, surname), 
                                      display = FALSE, to_s4 = TRUE)
#> Record grouping complete - 3 record(s) assigned a group unique ID.

# Using "" as the proxy for missing value
patient_list_2 <- mutate(patient_list_2,forename = ifelse(is.na(forename),"",forename))  

patient_list_2$pids_e <- record_group(patient_list_2, rd_id, c(forename, surname), 
                                      display = FALSE, to_s4 = TRUE)
#> Record grouping complete - 3 record(s) assigned a group unique ID.

patient_list_2[c("forename","surname","sex","pids_d","pids_e")]
#> # A tibble: 5 x 5
#>   forename surname     sex    pids_d       pids_e      
#>   <chr>    <chr>       <chr>  <pid>        <pid>       
#> 1 ""       Jefferson   Male   P-1 (CRI 02) P-1 (CRI 02)
#> 2 ""       Jefferson   Female P-1 (CRI 02) P-1 (CRI 02)
#> 3 ""       Abdul       Male   P-3 (No Hit) P-3 (No Hit)
#> 4 Tomi     Abdulkareem Female P-4 (No Hit) P-4 (No Hit)
#> 5 ""       <NA>        Male   P-5 (No Hit) P-5 (No Hit)

Conclusion

As a general rule, the more unique a criteria, the earlier it should be listed in criteria. Also, the set and ordering of criteria is a personal choice but should also be practical given the dataset. For example, when linking a vehicular database with no existing identifier, vehicle colour alone is less practical than colour and brand name, which in turn is less practical than colour, brand name, make and model. However colour, brand name, make and model and 10 other parameters might be too strict and would need to be relaxed. On the other hand, the dataset could be so small that vehicle colour alone is a sufficient criteria. record_group() aims to minimize false mismatches due to random errors in data collection or missing values. The choice and ordering of criteria and sub_criteria should balance the availability of alternative identifiers with their practicality as group identifiers.