Linking multiple datasets to consolidate information is a common task in research, particularly in studies that use “big data”. Deterministic record linkage is the simplest and most common method of record linkage; however, its accuracy relies on data quality. Too many incorrect or missing values will often produce an unacceptable number of false matches or mismatches.
record_group() aims to provide a simple, multistage and flexible implementation of deterministic record linkage that tries to maximise successful linkage of datasets with missing or incorrect group identifiers, e.g. customer, patient or event ID. In such instances, alternative identifiers like dates, names, height or other attributes are used in a specified order of preference.
Arguments in record_group() control separate aspects of the linkage process. Different combinations of each argument can be used to link datasets in a variety of ways. Examples of these include linking on a criteria, and linking on a criteria and one (or more) matching sub_criteria. See record matching.
Record linkage is done in stages. Each stage is considered more certain than the subsequent one, i.e. a match at stage 1 is considered more certain than one at stage 2.
Records are assigned a unique group ID if they match on a criteria. The group ID is essentially the record ID (sn) of one of the matching records. As a result, if you use a familiar record ID (sn), you can link the results back to the original dataset.
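As a rough illustration of linking results back, the sketch below uses base R only. The `groups` data frame is typed by hand and is hypothetical; it only mimics the `sn` and `pid` columns shown in the outputs in this document and is not produced by record_group() here.

```r
# Original dataset with a familiar record ID (rd_id)
patients <- data.frame(
  rd_id = c(101, 102, 103),
  forename = c("James", "Jamey", "Esther"),
  stringsAsFactors = FALSE
)

# Hypothetical linkage result, typed by hand for illustration.
# It mimics the sn and pid columns shown in this document.
groups <- data.frame(sn = c(101, 102, 103), pid = c(101, 101, 103))

# Because sn holds the familiar record ID, a plain merge links
# the group IDs back to the original records
linked <- merge(patients, groups, by.x = "rd_id", by.y = "sn")
linked
```

The merge only works this cleanly when the record ID is unique per row, which is an assumption of the sketch.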
The criteria should be provided as column names of the attributes to be compared. This argument takes advantage of dplyr quasiquotation.
One or more sub_criteria can be used at each stage to include additional conditions for a match. This is provided as a list of column names as named vectors. If a sub_criteria is used, records will only be assigned a group ID when they match on the criteria, and at least one named column in each sub_criteria.
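The rule above ("at least one named column in each sub_criteria") can be sketched in base R for a single pair of records. This is only an illustration of the logic, not how record_group() is implemented; `meets_sub_criteria()` is a hypothetical helper, and it assumes the compared values are not missing. The three records are copied from the Opes dataset used later in this document.

```r
# TRUE when two records agree on at least one column in every sub_criteria
meets_sub_criteria <- function(a, b, sub_criteria) {
  all(vapply(sub_criteria, function(cols) {
    any(unlist(a[cols]) == unlist(b[cols]))
  }, logical(1)))
}

# Records 1 to 3 of the Opes dataset
rec1 <- list(department = "Procurement", hair_colour = "Brown", date_of_birth = "23/03/1986")
rec2 <- list(department = "Security", hair_colour = "Brown", date_of_birth = "23/03/1986")
rec3 <- list(department = "Security", hair_colour = "Brown", date_of_birth = "23/03/1968")

sub_criteria <- list(s1a = c("department", "hair_colour"), s1b = "date_of_birth")

meets_sub_criteria(rec1, rec2, sub_criteria)  # TRUE: same hair colour and date of birth
meets_sub_criteria(rec2, rec3, sub_criteria)  # FALSE: dates of birth differ
```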
Each sub_criteria should be paired with a corresponding criteria. To do this, the vector name for each sub_criteria should be prefixed with "s" and the corresponding criteria number, e.g. "s1" or "s4". When a criteria has more than one sub_criteria, include a suffix after the criteria number, e.g. "s1a", "s1b", "s1c" or "s2a". See examples. Any sub_criteria not paired to a criteria will be ignored. The sub_criteria argument does not support quasiquotation.
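To make the naming convention concrete, here is a small base R sketch. `paired_stage()` is a hypothetical helper written for this illustration and is not part of the package; the `surname` and `sex` entries are placeholder column names. It simply reads the criteria number out of each name and returns NA for names that do not follow the pattern.

```r
sub_criteria <- list(
  s1a = c("department", "hair_colour"),  # first sub_criteria for criteria 1
  s1b = "date_of_birth",                 # second sub_criteria for criteria 1
  s2a = "surname",                       # sub_criteria for criteria 2
  extra = "sex"                          # name does not follow the pattern
)

# Hypothetical helper: read the criteria number out of each name
paired_stage <- function(x) {
  nm <- names(x)
  ok <- grepl("^s[0-9]+[a-z]?$", nm)
  stage <- rep(NA_integer_, length(nm))
  stage[ok] <- as.integer(sub("^s([0-9]+)[a-z]?$", "\\1", nm[ok]))
  setNames(stage, nm)
}

paired_stage(sub_criteria)
```

Checking names this way before a long run is one way to catch a sub_criteria that would otherwise be silently ignored.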
At each stage, the function prints the number of records that have been assigned a group ID and how many groups have only one record.
Below are two implementations of a single-stage record linkage. One is based on matching forenames, and the other is based on matching forenames and surnames.
library(diyar)
library(dplyr)
data(patient_list); patient_list
#> # A tibble: 6 x 4
#> rd_id forename surname sex
#> <int> <chr> <chr> <chr>
#> 1 1 James Green M
#> 2 2 ESTHER Kulmar F
#> 3 3 "" OBI F
#> 4 4 Jamey Green M
#> 5 5 Daniel Kulmar M
#> 6 6 Henry OBI M
# Matching forename only
cbind(patient_list, record_group(patient_list, rd_id, forename))
#>
#> Group criteria 1 - `forename`
#> 5 of 6 record(s) have been assigned a group ID. 1 record(s) not yet grouped.
#> 5 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 6.
#>
#> Record grouping complete - 6 record(s) assigned a group unique ID.
#> rd_id forename surname sex sn pid pid_cri
#> 1 1 James Green M 1 1 None
#> 2 2 ESTHER Kulmar F 2 2 None
#> 3 3 OBI F 3 3 None
#> 4 4 Jamey Green M 4 4 None
#> 5 5 Daniel Kulmar M 5 5 None
#> 6 6 Henry OBI M 6 6 None
# Matching forename and surname
patient_list <- mutate(patient_list, cri_1 = paste(forename, surname,sep="-") )
cbind(patient_list, record_group(patient_list, rd_id, cri_1, display = FALSE))
#> Record grouping complete - 6 record(s) assigned a group unique ID.
#> rd_id forename surname sex cri_1 sn pid pid_cri
#> 1 1 James Green M James-Green 1 1 None
#> 2 2 ESTHER Kulmar F ESTHER-Kulmar 2 2 None
#> 3 3 OBI F -OBI 3 3 None
#> 4 4 Jamey Green M Jamey-Green 4 4 None
#> 5 5 Daniel Kulmar M Daniel-Kulmar 5 5 None
#> 6 6 Henry OBI M Henry-OBI 6 6 None
# Note that exact matching is case sensitive. See range matching.
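Because exact matching is case sensitive, one common preparation step is to normalise text before using it as a criteria. The sketch below uses base R only; the last value (" james ") is a made-up entry added for illustration and is not in patient_list.

```r
forename <- c("James", "ESTHER", "", "Jamey", "Daniel", "Henry", " james ")

# trimws() drops stray spaces and toupper() removes case differences
forename_clean <- toupper(trimws(forename))
forename_clean

# Exact comparison now treats "James" and " james " as the same value
forename_clean[1] == forename_clean[7]
```

Empty strings stay empty, so they can still be treated as missing values.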
The choice and ordering of criteria and sub_criteria directly impacts the linkage. Before using this function, review the dataset and decide which combinations of criteria and sub_criteria would be appropriate. record_group() can use any combination available from the dataset; however, you should consider a practical combination which would yield more “true” matches than “false” matches.
For example, in patient_list above, linking on forenames only, or on forenames and surnames, does not yield any match. However, linking in two stages, forename followed by surname, will pair records 1 and 4, 2 and 5, and 3 and 6. See Record group expansion.
cbind(patient_list, record_group(patient_list, rd_id, c(forename, surname), display = FALSE))
#> Record grouping complete - 0 record(s) assigned a group unique ID.
#> rd_id forename surname sex cri_1 sn pid pid_cri
#> 1 1 James Green M James-Green 1 1 Criteria 2
#> 2 2 ESTHER Kulmar F ESTHER-Kulmar 2 2 Criteria 2
#> 3 3 OBI F -OBI 3 3 Criteria 2
#> 4 4 Jamey Green M Jamey-Green 4 1 Criteria 2
#> 5 5 Daniel Kulmar M Daniel-Kulmar 5 2 Criteria 2
#> 6 6 Henry OBI M Henry-OBI 6 3 Criteria 2
Although this result is logically correct, a two-stage linkage on forenames followed by surnames is not the most practical option given the dataset. For instance, records 3 and 6 could be cousins and not the same individual. A better combination would be forename at stage 1, followed by surname and sex at stage 2. See below.
patient_list <- mutate(patient_list, cri_2 = paste(surname, sex,sep="-") )
cbind(patient_list, record_group(patient_list, rd_id, c(forename, cri_2), display = FALSE))
#> Record grouping complete - 4 record(s) assigned a group unique ID.
#> rd_id forename surname sex cri_1 cri_2 sn pid pid_cri
#> 1 1 James Green M James-Green Green-M 1 1 Criteria 2
#> 2 2 ESTHER Kulmar F ESTHER-Kulmar Kulmar-F 2 2 None
#> 3 3 OBI F -OBI OBI-F 3 3 None
#> 4 4 Jamey Green M Jamey-Green Green-M 4 1 Criteria 2
#> 5 5 Daniel Kulmar M Daniel-Kulmar Kulmar-M 5 5 None
#> 6 6 Henry OBI M Henry-OBI OBI-M 6 6 None
As mentioned earlier, at each stage of record linkage, a sub_criteria can be used to include additional conditions for a match. Just like criteria, any column in the dataset can be used as a sub_criteria, although a practical combination for the given dataset is recommended.
Below are examples of record linkage using different combinations of the same criteria and sub_criteria.
library(tidyr)
data(Opes); Opes
#> # A tibble: 8 x 8
#> rd_id name department hair_colour date_of_birth db_pt1 db_pt2 db_pt3
#> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 Ope Procurement Brown 23/03/1986 23/03 23/1986 03/1986
#> 2 2 Ope Security Brown 23/03/1986 23/03 23/1986 03/1986
#> 3 3 Ope Security Brown 23/03/1968 23/03 23/1968 03/1968
#> 4 4 Ope Publishing Green 01/02/1985 01/02 01/1985 02/1985
#> 5 5 Ope Publishing Teal 02/01/1985 02/01 02/1985 01/1985
#> 6 6 Ope Publishing Grey 11/03/1964 11/03 11/1964 03/1964
#> 7 7 Ope Publishing White 11/03/1964 11/03 11/1964 03/1964
#> 8 8 Ope Procurement Black 11/10/1985 11/10 11/1985 10/1985
# 1 stage linkage
# stage 1 - name, and either department, hair colour or date of birth
cbind(
Opes,
record_group(Opes, rd_id, name, list("s1a"=c("department","hair_colour","date_of_birth")), display = FALSE)
) %>% select(-starts_with("db_"), -sn)
#> Record grouping complete - 0 record(s) assigned a group unique ID.
#> rd_id name department hair_colour date_of_birth pid pid_cri
#> 1 1 Ope Procurement Brown 23/03/1986 1 Criteria 1
#> 2 2 Ope Security Brown 23/03/1986 1 Criteria 1
#> 3 3 Ope Security Brown 23/03/1968 1 Criteria 1
#> 4 4 Ope Publishing Green 01/02/1985 4 Criteria 1
#> 5 5 Ope Publishing Teal 02/01/1985 4 Criteria 1
#> 6 6 Ope Publishing Grey 11/03/1964 4 Criteria 1
#> 7 7 Ope Publishing White 11/03/1964 4 Criteria 1
#> 8 8 Ope Procurement Black 11/10/1985 1 Criteria 1
# 1 stage linkage
# stage 1 - name, and either department or hair colour, and date of birth
cbind(
Opes,
record_group(Opes, rd_id, c(name),
list("s1a"=c("department","hair_colour"),
"s1b"=c("date_of_birth")), display = FALSE)
) %>% select(-starts_with("db_"), -sn)
#> Record grouping complete - 4 record(s) assigned a group unique ID.
#> rd_id name department hair_colour date_of_birth pid pid_cri
#> 1 1 Ope Procurement Brown 23/03/1986 1 Criteria 1
#> 2 2 Ope Security Brown 23/03/1986 1 Criteria 1
#> 3 3 Ope Security Brown 23/03/1968 3 None
#> 4 4 Ope Publishing Green 01/02/1985 4 None
#> 5 5 Ope Publishing Teal 02/01/1985 5 None
#> 6 6 Ope Publishing Grey 11/03/1964 6 Criteria 1
#> 7 7 Ope Publishing White 11/03/1964 6 Criteria 1
#> 8 8 Ope Procurement Black 11/10/1985 8 None
# 1 stage linkage
# stage 1 - name, and either department or hair colour, and either day and month of birth, day and year of birth or month and year of birth
cbind(
Opes,
record_group(Opes, rd_id, c(name),
list("s1a"=c("department","hair_colour"),
"s1b"=c("db_pt1","db_pt2","db_pt3")), display = FALSE)
) %>% select(-date_of_birth, -sn)
#> Record grouping complete - 3 record(s) assigned a group unique ID.
#> rd_id name department hair_colour db_pt1 db_pt2 db_pt3 pid pid_cri
#> 1 1 Ope Procurement Brown 23/03 23/1986 03/1986 1 Criteria 1
#> 2 2 Ope Security Brown 23/03 23/1986 03/1986 1 Criteria 1
#> 3 3 Ope Security Brown 23/03 23/1968 03/1968 1 Criteria 1
#> 4 4 Ope Publishing Green 01/02 01/1985 02/1985 4 None
#> 5 5 Ope Publishing Teal 02/01 02/1985 01/1985 5 None
#> 6 6 Ope Publishing Grey 11/03 11/1964 03/1964 6 Criteria 1
#> 7 7 Ope Publishing White 11/03 11/1964 03/1964 6 Criteria 1
#> 8 8 Ope Procurement Black 11/10 11/1985 10/1985 8 None
# 1 stage linkage
# stage 1 - name, and department, and hair colour, and either day and month of birth, day and year of birth or month and year of birth
cbind(
Opes,
record_group(Opes, rd_id, c(name),
list("s1a"=c("department"),
"s1c"=c("hair_colour"),
"s1b"=c("db_pt1","db_pt2","db_pt3")), display = FALSE)
) %>% select(-starts_with("db_"), -sn)
#> Record grouping complete - 6 record(s) assigned a group unique ID.
#> rd_id name department hair_colour date_of_birth pid pid_cri
#> 1 1 Ope Procurement Brown 23/03/1986 1 None
#> 2 2 Ope Security Brown 23/03/1986 2 Criteria 1
#> 3 3 Ope Security Brown 23/03/1968 2 Criteria 1
#> 4 4 Ope Publishing Green 01/02/1985 4 None
#> 5 5 Ope Publishing Teal 02/01/1985 5 None
#> 6 6 Ope Publishing Grey 11/03/1964 6 None
#> 7 7 Ope Publishing White 11/03/1964 7 None
#> 8 8 Ope Procurement Black 11/10/1985 8 None
Note that using sub_criteria costs additional processing time, so it should be avoided when unnecessary. For example, the two implementations below will yield the same result; however, the second will take less time. This is indicated in the displayed messages, where the first needs several matching iterations and the second needs none. This time difference is more noticeable with very large datasets.
# 1 stage linkage
# stage 1 - name, and date of birth, and department and hair colour
cbind(
Opes,
record_group(Opes, rd_id, name,
list("s1a"=c("department"),
"s1b"=c("hair_colour"),
"s1c"=c("date_of_birth")), display = TRUE)
) %>% select(-starts_with("db_"))
#>
#> Group criteria 1 - `name`
#> Matching criteria 1: iteration 2
#> Matching criteria 1: iteration 3
#> Matching criteria 1: iteration 4
#> Matching criteria 1: iteration 5
#> Matching criteria 1: iteration 6
#> Matching criteria 1: iteration 7
#> Matching criteria 1: iteration 8
#> 8 of 8 record(s) have been assigned a group ID. 0 record(s) not yet grouped.
#> 8 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 8.
#>
#> Record grouping complete - 8 record(s) assigned a group unique ID.
#> rd_id name department hair_colour date_of_birth sn pid pid_cri
#> 1 1 Ope Procurement Brown 23/03/1986 1 1 None
#> 2 2 Ope Security Brown 23/03/1986 2 2 None
#> 3 3 Ope Security Brown 23/03/1968 3 3 None
#> 4 4 Ope Publishing Green 01/02/1985 4 4 None
#> 5 5 Ope Publishing Teal 02/01/1985 5 5 None
#> 6 6 Ope Publishing Grey 11/03/1964 6 6 None
#> 7 7 Ope Publishing White 11/03/1964 7 7 None
#> 8 8 Ope Procurement Black 11/10/1985 8 8 None
# 1 stage linkage
# stage 1 - name, and date of birth, and department and hair colour
Opes_b <- unite(Opes, cri, c(name, date_of_birth, department, hair_colour))
cbind(
Opes_b,
record_group(Opes_b, rd_id, c(cri), display = TRUE)
) %>% select(-starts_with("db_"))
#>
#> Group criteria 1 - `cri`
#> 8 of 8 record(s) have been assigned a group ID. 0 record(s) not yet grouped.
#> 8 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 8.
#>
#> Record grouping complete - 8 record(s) assigned a group unique ID.
#> rd_id cri sn pid pid_cri
#> 1 1 Ope_23/03/1986_Procurement_Brown 1 1 None
#> 2 2 Ope_23/03/1986_Security_Brown 2 2 None
#> 3 3 Ope_23/03/1968_Security_Brown 3 3 None
#> 4 4 Ope_01/02/1985_Publishing_Green 4 4 None
#> 5 5 Ope_02/01/1985_Publishing_Teal 5 5 None
#> 6 6 Ope_11/03/1964_Publishing_Grey 6 6 None
#> 7 7 Ope_11/03/1964_Publishing_White 7 7 None
#> 8 8 Ope_11/10/1985_Procurement_Black 8 8 None
Records can be matched in two ways: exact matches as in the examples above, or matching a range of values. The latter is done by converting the range of values to a number_line object, with the gid argument/slot set to the actual value. This number_line object is then used as a sub_criteria argument. number_line objects are considered a match if they overlap. See the example below.
library(lubridate)
Opes_c <- select(Opes, date_of_birth)
Opes_c$dummy_cri <- 1
Opes_c
#> # A tibble: 8 x 2
#> date_of_birth dummy_cri
#> <chr> <dbl>
#> 1 23/03/1986 1
#> 2 23/03/1986 1
#> 3 23/03/1968 1
#> 4 01/02/1985 1
#> 5 02/01/1985 1
#> 6 11/03/1964 1
#> 7 11/03/1964 1
#> 8 11/10/1985 1
# Match records using a range that extends 2 years after each date
Opes_c$range <- expand_number_line(as.number_line(dmy(Opes_c$date_of_birth)), period(2, "years"), "end")
Opes_c$range@gid <- as.numeric(dmy(Opes_c$date_of_birth))
bind_cols(Opes_c,
record_group(Opes_c, criteria = dummy_cri, sub_criteria = list(s1="range")))
#>
#> Group criteria 1 - `dummy_cri`
#> Matching criteria 1: iteration 2
#> Matching criteria 1: iteration 3
#> Matching criteria 1: iteration 4
#> Matching criteria 1: iteration 5
#> Matching criteria 1: iteration 6
#> Matching criteria 1: iteration 7
#> Matching criteria 1: iteration 8
#> Matching criteria 1: iteration 9
#> 8 of 8 record(s) have been assigned a group ID. 0 record(s) not yet grouped.
#> 2 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 2.
#>
#> Record grouping complete - 2 record(s) assigned a group unique ID.
#> # A tibble: 8 x 6
#> date_of_birth dummy_cri range sn pid pid_cri
#> <chr> <dbl> <numbr_ln> <int> <dbl> <chr>
#> 1 23/03/1986 1 1986-03-23 -> 1988-03-23 1 4 Criteria 1
#> 2 23/03/1986 1 1986-03-23 -> 1988-03-23 2 4 Criteria 1
#> 3 23/03/1968 1 1968-03-23 -> 1970-03-23 3 3 None
#> 4 01/02/1985 1 1985-02-01 -> 1987-02-01 4 4 Criteria 1
#> 5 02/01/1985 1 1985-01-02 -> 1987-01-02 5 5 None
#> 6 11/03/1964 1 1964-03-11 -> 1966-03-11 6 6 Criteria 1
#> 7 11/03/1964 1 1964-03-11 -> 1966-03-11 7 6 Criteria 1
#> 8 11/10/1985 1 1985-10-11 -> 1987-10-11 8 4 Criteria 1
# Match records using a range that extends 5 years above each age
Opes_c$age <- as.numeric(round((Sys.Date() - dmy(Opes_c$date_of_birth))/365.5)) # approximate age
Opes_c$range <- as.number_line(Opes_c$age)
Opes_c$range@gid <- Opes_c$age
Opes_c$range <- expand_number_line(Opes_c$range, 5, "end")
bind_cols(Opes_c,
record_group(Opes_c, criteria = dummy_cri, sub_criteria = list(s1="range")))
#>
#> Group criteria 1 - `dummy_cri`
#> Matching criteria 1: iteration 2
#> 8 of 8 record(s) have been assigned a group ID. 0 record(s) not yet grouped.
#> 0 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 0.
#>
#> Record grouping complete - 0 record(s) assigned a group unique ID.
#> # A tibble: 8 x 7
#> date_of_birth dummy_cri range age sn pid pid_cri
#> <chr> <dbl> <numbr_ln> <dbl> <int> <dbl> <chr>
#> 1 23/03/1986 1 34 -> 39 34 1 1 Criteria 1
#> 2 23/03/1986 1 34 -> 39 34 2 1 Criteria 1
#> 3 23/03/1968 1 51 -> 56 51 3 3 Criteria 1
#> 4 01/02/1985 1 35 -> 40 35 4 1 Criteria 1
#> 5 02/01/1985 1 35 -> 40 35 5 1 Criteria 1
#> 6 11/03/1964 1 56 -> 61 56 6 3 Criteria 1
#> 7 11/03/1964 1 56 -> 61 56 7 3 Criteria 1
#> 8 11/10/1985 1 34 -> 39 34 8 1 Criteria 1
Only use number_line objects as a sub_criteria. Do not use number_line objects directly as a criteria. Instead, create a dummy criteria (e.g. 1 for every row), and then use the range as a sub_criteria for the dummy criteria.
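The idea of overlapping ranges can be sketched with plain numbers. This is a generic interval test in base R, not the way number_line objects are compared internally; `ranges_overlap()` is a hypothetical helper, and the values are taken from the age ranges in the output above.

```r
# Two closed ranges overlap when each one starts before the other ends
ranges_overlap <- function(start_a, end_a, start_b, end_b) {
  start_a <= end_b & start_b <= end_a
}

ranges_overlap(34, 39, 35, 40)  # TRUE: 35 to 39 is shared
ranges_overlap(34, 39, 51, 56)  # FALSE: the ranges are disjoint
```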
At each stage of record linkage, records are either assigned a new group ID or inherit an existing one. The scenario further below explains how this happens.
It’s worth reiterating that record_group() expects the criteria to be listed in order of decreasing certainty. Therefore, existing group IDs can be inherited but will not be overwritten. This is because groups formed at earlier stages are considered more “certain” than those formed at subsequent stages.
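The "inherit but never overwrite" rule can be sketched in base R. This is a simplified illustration under stated assumptions, not the algorithm used by record_group(): `staged_groups()` is a hypothetical function, the row number stands in for the record ID, and only NA and "" are treated as missing. The data are typed in from patient_list_2.

```r
staged_groups <- function(df, criteria) {
  n <- nrow(df)
  pid <- rep(NA_integer_, n)    # group ID, NA until assigned
  stage <- rep(NA_integer_, n)  # stage at which the group ID was assigned
  for (s in seq_along(criteria)) {
    value <- df[[criteria[s]]]
    usable <- !is.na(value) & value != ""   # missing values skip this stage
    for (v in unique(value[usable])) {
      idx <- which(usable & value == v)
      if (length(idx) < 2) next             # nothing to match with
      grouped <- idx[!is.na(pid[idx])]
      new <- idx[is.na(pid[idx])]
      if (length(new) == 0) next            # existing IDs are never overwritten
      if (length(grouped) > 0) {
        # inherit from the match that was grouped at the earliest stage
        donor <- grouped[which.min(stage[grouped])]
        pid[new] <- pid[donor]
      } else {
        pid[new] <- min(idx)                # new group: ID of the first record
      }
      stage[new] <- s
    }
  }
  pid[is.na(pid)] <- which(is.na(pid))      # unmatched records keep their own ID
  data.frame(sn = seq_len(n), pid = pid, stage = stage)
}

patient_list_2 <- data.frame(
  forename = c("", "", "Tomi", "Tomi", "Obi"),
  surname = c("Jefferson", "Jefferson", "Abdul", "Abdulkareem", "Nelson"),
  sex = c("Male", "Female", "Male", "Female", "Male"),
  stringsAsFactors = FALSE
)

res <- staged_groups(patient_list_2, c("forename", "surname", "sex"))
res
```

In this sketch, record 5 matches records 1 and 3 on sex at stage 3 and takes the group ID of record 3, because that group was formed at the earlier stage.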
The example below with patient_list_2 demonstrates this behaviour, using forename, surname and sex as the criteria.
Stage 1 - records 3 and 4 are assigned a group ID (3). Records 1 and 2 are excluded from grouping at this stage because of missing values. Record 5 is not assigned a group ID because it doesn’t match any other record.
Stage 2 - records 1 and 2 are assigned a group ID (1). Records 3 and 4 do not match on surnames but remain grouped together since they’ve previously been matched on forenames, which is more “certain” as listed in criteria. Record 5 is still not assigned a group ID since it doesn’t match any other record on surnames.
Stage 3 - record 5 inherits a group ID (3) because that was assigned at stage 1 ("Criteria 1") as opposed to record 2’s group ID which was assigned at stage 2 ("Criteria 2").
data(patient_list_2); patient_list_2
#> # A tibble: 5 x 4
#> rd_id forename surname sex
#> <int> <chr> <chr> <chr>
#> 1 1 "" Jefferson Male
#> 2 2 "" Jefferson Female
#> 3 3 Tomi Abdul Male
#> 4 4 Tomi Abdulkareem Female
#> 5 5 Obi Nelson Male
cbind(
patient_list,
record_group(patient_list, rd_id, c(forename, surname, sex))
)
#>
#> Group criteria 1 - `forename`
#> 5 of 6 record(s) have been assigned a group ID. 1 record(s) not yet grouped.
#> 5 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 6.
#>
#> Group criteria 2 - `surname`
#> 6 of 6 record(s) have been assigned a group ID. 0 record(s) not yet grouped.
#> 0 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 0.
#>
#> Group criteria 3 - `sex`
#> 0 of 0 record(s) have been assigned a group ID. 0 record(s) not yet grouped.
#> 0 record(s) with unique group IDs untagged for possible matching in the next stage. The number of records not yet grouped is now 0.
#>
#> Record grouping complete - 0 record(s) assigned a group unique ID.
#> rd_id forename surname sex cri_1 cri_2 sn pid pid_cri
#> 1 1 James Green M James-Green Green-M 1 1 Criteria 2
#> 2 2 ESTHER Kulmar F ESTHER-Kulmar Kulmar-F 2 2 Criteria 2
#> 3 3 OBI F -OBI OBI-F 3 3 Criteria 2
#> 4 4 Jamey Green M Jamey-Green Green-M 4 1 Criteria 2
#> 5 5 Daniel Kulmar M Daniel-Kulmar Kulmar-M 5 2 Criteria 2
#> 6 6 Henry OBI M Henry-OBI OBI-M 6 3 Criteria 2
Records with missing values for a particular criteria are excluded from that stage of record linkage. If a record has missing values for every listed criteria, it’s assigned a unique group ID at the end of record linkage.
It’s common for databases to use specific characters or numbers to represent missing or unknown data, e.g. N/A, Nil, 01/01/1100, 111111 etc. These pseudo-missing values will need to be recoded to one of the two recognised by record_group(): NA or an empty string (""). If this is not done, the function will assume the pseudo-missing values are valid values and therefore group them together. This can cause a continuous cascade of false matches as seen below.
patient_list_b <- patient_list_2
patient_list_b <- mutate(patient_list_b, forename =
ifelse(rd_id %in% 1:3, "Nil", forename))
# 2 stage linkage
# Stage 1 - forename
# Stage 2 - Surname
cbind(
patient_list_b,
record_group(patient_list_b, rd_id, c(forename, surname), display = FALSE)
)
#> Record grouping complete - 2 record(s) assigned a group unique ID.
#> rd_id forename surname sex sn pid pid_cri
#> 1 1 Nil Jefferson Male 1 1 Criteria 1
#> 2 2 Nil Jefferson Female 2 1 Criteria 1
#> 3 3 Nil Abdul Male 3 1 Criteria 1
#> 4 4 Tomi Abdulkareem Female 4 4 None
#> 5 5 Obi Nelson Male 5 5 None
# 2 stage linkage
# Stage 1 - forename
# Stage 2 - Surname and sex
patient_list_b <- mutate(patient_list_b, cri_2 = paste(surname,sex,sep=""))
cbind(
patient_list_b,
record_group(patient_list_b, rd_id, c(forename, cri_2), display = FALSE)
)
#> Record grouping complete - 2 record(s) assigned a group unique ID.
#> rd_id forename surname sex cri_2 sn pid pid_cri
#> 1 1 Nil Jefferson Male JeffersonMale 1 1 Criteria 1
#> 2 2 Nil Jefferson Female JeffersonFemale 2 1 Criteria 1
#> 3 3 Nil Abdul Male AbdulMale 3 1 Criteria 1
#> 4 4 Tomi Abdulkareem Female AbdulkareemFemale 4 4 None
#> 5 5 Obi Nelson Male NelsonMale 5 5 None
In the example above, records 1-3 are assigned a single group ID based on matching forenames ("Nil"), even though record 3 has a different surname from records 1 and 2. Records 4 and 5 match no other record and remain ungrouped. Even when stage 2 is changed to matching surnames and sex, records 2 and 3 are still “incorrectly” grouped together. Although this is likely not the desired outcome, it’s the expected result given the parameters supplied to the function. This issue can be addressed by recoding "Nil" to NA or "".
# Using NA as the proxy for missing value
patient_list_b <- mutate(patient_list_b,forename = ifelse(forename=="Nil",NA,forename))
cbind(
patient_list_b,
record_group(patient_list_b, rd_id, c(forename, surname), display = FALSE)
)
#> Record grouping complete - 3 record(s) assigned a group unique ID.
#> rd_id forename surname sex cri_2 sn pid pid_cri
#> 1 1 <NA> Jefferson Male JeffersonMale 1 1 Criteria 2
#> 2 2 <NA> Jefferson Female JeffersonFemale 2 1 Criteria 2
#> 3 3 <NA> Abdul Male AbdulMale 3 3 None
#> 4 4 Tomi Abdulkareem Female AbdulkareemFemale 4 4 None
#> 5 5 Obi Nelson Male NelsonMale 5 5 None
# Using "" as the proxy for missing value
patient_list_b <- mutate(patient_list_b,forename = ifelse(is.na(forename),"",forename))
cbind(
patient_list_b,
record_group(patient_list_b, rd_id, c(forename, surname), display = FALSE)
)
#> Record grouping complete - 3 record(s) assigned a group unique ID.
#> rd_id forename surname sex cri_2 sn pid pid_cri
#> 1 1 Jefferson Male JeffersonMale 1 1 Criteria 2
#> 2 2 Jefferson Female JeffersonFemale 2 1 Criteria 2
#> 3 3 Abdul Male AbdulMale 3 3 None
#> 4 4 Tomi Abdulkareem Female AbdulkareemFemale 4 4 None
#> 5 5 Obi Nelson Male NelsonMale 5 5 None
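The recoding above handles one column and one code. When several pseudo-missing codes appear across several columns, a small helper can apply the same idea. The sketch below uses base R only; `recode_missing()` and the `records` data frame are hypothetical and written for this illustration.

```r
# Replace every listed pseudo-missing code with NA in the chosen columns
recode_missing <- function(df, columns, codes) {
  for (col in columns) {
    df[[col]][df[[col]] %in% codes] <- NA
  }
  df
}

records <- data.frame(
  forename = c("Nil", "N/A", "Tomi"),
  date_of_birth = c("01/01/1100", "23/03/1986", "111111"),
  stringsAsFactors = FALSE
)

codes <- c("N/A", "Nil", "01/01/1100", "111111")
cleaned <- recode_missing(records, c("forename", "date_of_birth"), codes)
cleaned
```

The list of codes is dataset specific, so it should be reviewed against the data rather than copied as is.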
As a general rule, the more unique a criteria, the earlier it should be listed in criteria. The set and ordering of criteria is a personal choice but should be practical given the dataset. For example, when linking a vehicular database with no existing identifier, vehicle colour alone is less practical than colour and brand name, which in turn is less practical than colour, brand name, make and model. However, colour, brand name, make and model and 10 other parameters might be too strict and would need to be relaxed. On the other hand, the dataset could be so small that vehicle colour alone is sufficient as a criteria. record_group() aims to minimize false mismatches due to random errors in data entry or collection, or missing values. The choice and ordering of criteria and sub_criteria should balance the availability of alternative identifiers with their practicality as group identifiers.
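One rough way to compare candidate criteria is to look at the share of distinct values in each column. The sketch below uses base R and a made-up `vehicles` data frame; it is a heuristic for review only, since a column where every value is distinct will not match anything on its own.

```r
# Hypothetical vehicle records, typed by hand for illustration
vehicles <- data.frame(
  colour = c("Red", "Red", "Blue", "Red", "Blue", "Red"),
  brand = c("A", "B", "A", "A", "B", "C"),
  model = c("A1", "B1", "A2", "A3", "B2", "C1"),
  stringsAsFactors = FALSE
)

# Share of distinct values per column: higher means more unique
distinct_share <- vapply(vehicles, function(x) length(unique(x)) / length(x), numeric(1))
sort(distinct_share, decreasing = TRUE)
```

Here colour has the lowest share, which fits the point above that colour alone is rarely a practical criteria.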