The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Blocking records for deduplication

Maciej Beręsewicz

1 Setup

Read required packages.

library(blocking)
library(data.table)

Read the RLdata500 data (taken from the RecordLinkage package).

data(RLdata500)
head(RLdata500)
fname_c1 fname_c2 lname_c1 lname_c2 by bm bd rec_id ent_id
CARSTEN MEIER 1949 7 22 1 34
GERD BAUER 1968 7 27 2 51
ROBERT HARTMANN 1930 4 30 3 115
STEFAN WOLFF 1957 9 2 4 189
RALF KRUEGER 1966 1 13 5 72
JUERGEN FRANKE 1929 7 4 6 142

This dataset contains 500 rows with 450 entities.

2 Blocking for deduplication

Now we create a new column that concatenates the information in each row.

RLdata500[, id_count :=.N, ent_id] ## how many times given unit occurs
RLdata500[, bm:=sprintf("%02d", bm)] ## add leading zeros to month
RLdata500[, bd:=sprintf("%02d", bd)] ## add leading zeros to day
RLdata500[, txt:=tolower(paste0(fname_c1,fname_c2,lname_c1,lname_c2,by,bm,bd))]
head(RLdata500)
fname_c1 fname_c2 lname_c1 lname_c2 by bm bd rec_id ent_id id_count txt
CARSTEN MEIER 1949 07 22 1 34 1 carstenmeier19490722
GERD BAUER 1968 07 27 2 51 2 gerdbauer19680727
ROBERT HARTMANN 1930 04 30 3 115 1 roberthartmann19300430
STEFAN WOLFF 1957 09 02 4 189 1 stefanwolff19570902
RALF KRUEGER 1966 01 13 5 72 1 ralfkrueger19660113
JUERGEN FRANKE 1929 07 04 6 142 1 juergenfranke19290704

In the next step we use the newly created column in the blocking function. If we specify verbose, we get information about the progress.

df_blocks <- blocking(x = RLdata500$txt, ann = "nnd", verbose = 1, graph = TRUE, seed = 2024)
#> ===== creating tokens =====
#> ===== starting search (nnd, x, y: 500, 500, t: 429) =====
#> ===== creating graph =====

Results are as follows:

df_blocks
#> ========================================================
#> Blocking based on the nnd method.
#> Number of blocks: 133.
#> Number of columns used for blocking: 429.
#> Reduction ratio: 0.9917.
#> ========================================================
#> Distribution of the size of the blocks:
#>  2  3  4  5  6  7  8  9 10 11 17 
#> 47 34 18 12  8  5  3  3  1  1  1

Structure of the object is as follows:

str(df_blocks,1)
#> List of 8
#>  $ result        :Classes 'data.table' and 'data.frame': 367 obs. of  4 variables:
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>  $ method        : chr "nnd"
#>  $ deduplication : logi TRUE
#>  $ representation: chr "shingles"
#>  $ metrics       : NULL
#>  $ confusion     : NULL
#>  $ colnames      : chr [1:429] "86" "ap" "av" "bf" ...
#>  $ graph         :Class 'igraph'  hidden list of 10
#>  - attr(*, "class")= chr "blocking"

Plot connections.

plot(df_blocks$graph, vertex.size=1, vertex.label = NA)

The resulting data.table has four columns:

head(df_blocks$result)
x y block dist
1 64 33 0.4737987
2 43 1 0.0807453
2 486 1 0.4102322
3 450 88 0.4326335
4 234 12 0.5256584
5 128 2 0.5133357

Create long data.table with information on blocks and units from original dataset.

df_block_melted <- melt(df_blocks$result, id.vars = c("block", "dist"))
df_block_melted_rec_block <- unique(df_block_melted[, .(rec_id=value, block)])
head(df_block_melted_rec_block)
rec_id block
1 33
2 1
3 88
4 12
5 2
6 33

We add block information to the final dataset.

RLdata500[df_block_melted_rec_block, on = "rec_id", block_id := i.block]
head(RLdata500)
fname_c1 fname_c2 lname_c1 lname_c2 by bm bd rec_id ent_id id_count txt block_id
CARSTEN MEIER 1949 07 22 1 34 1 carstenmeier19490722 33
GERD BAUER 1968 07 27 2 51 2 gerdbauer19680727 1
ROBERT HARTMANN 1930 04 30 3 115 1 roberthartmann19300430 88
STEFAN WOLFF 1957 09 02 4 189 1 stefanwolff19570902 12
RALF KRUEGER 1966 01 13 5 72 1 ralfkrueger19660113 2
JUERGEN FRANKE 1929 07 04 6 142 1 juergenfranke19290704 33

We can check in how many blocks the same entities (ent_id) are observed. In our example, all the same entities are in the same blocks.

RLdata500[, .(uniq_blocks = uniqueN(block_id)), .(ent_id)][, .N, uniq_blocks]
uniq_blocks N
1 450

We can visualise the distances between units stored in the df_blocks$result data set. Clearly we have a mixture of two groups: matches (close to 0) and non-matches (close to 1).

hist(df_blocks$result$dist, xlab = "Distances", ylab = "Frequency", breaks = "fd",
     main = "Distances calculated between units")

Finally, we can visualise the result based on the information whether block contains matches or not.

df_for_density <- copy(df_block_melted[block %in% RLdata500$block_id])
df_for_density[, match:= block %in% RLdata500[id_count == 2]$block_id]

plot(density(df_for_density[match==FALSE]$dist), col = "blue", xlim = c(0, 0.8), 
     main = "Distribution of distances between\nclusters type (match=red, non-match=blue)")
lines(density(df_for_density[match==TRUE]$dist), col = "red", xlim = c(0, 0.8))

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.