The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Blocking records for deduplication

Maciej Beręsewicz

1 Setup

Read required packages.

library(blocking)
library(data.table)

Read the RLdata500 data (taken from the RecordLinkage package).

data(RLdata500)
head(RLdata500)

fname_c1	lname_c1	by	bm	bd	rec_id	ent_id
CARSTEN	MEIER	1949	7	22	1	34
GERD	BAUER	1968	7	27	2	51
ROBERT	HARTMANN	1930	4	30	3	115
STEFAN	WOLFF	1957	9	2	4	189
RALF	KRUEGER	1966	1	13	5	72
JUERGEN	FRANKE	1929	7	4	6	142

This dataset contains 500 rows with 450 entities.

2 Blocking for deduplication

Now we create a new column that concatenates the information in each row.

RLdata500[, id_count :=.N, ent_id] ## how many times given unit occurs
RLdata500[, bm:=sprintf("%02d", bm)] ## add leading zeros to month
RLdata500[, bd:=sprintf("%02d", bd)] ## add leading zeros to day
RLdata500[, txt:=tolower(paste0(fname_c1,fname_c2,lname_c1,lname_c2,by,bm,bd))]
head(RLdata500)

fname_c1	lname_c1	by	bm	bd	rec_id	ent_id	id_count	txt
CARSTEN	MEIER	1949	07	22	1	34	1	carstenmeier19490722
GERD	BAUER	1968	07	27	2	51	2	gerdbauer19680727
ROBERT	HARTMANN	1930	04	30	3	115	1	roberthartmann19300430
STEFAN	WOLFF	1957	09	02	4	189	1	stefanwolff19570902
RALF	KRUEGER	1966	01	13	5	72	1	ralfkrueger19660113
JUERGEN	FRANKE	1929	07	04	6	142	1	juergenfranke19290704

In the next step we use the newly created column in the blocking function. If we specify verbose, we get information about the progress.

df_blocks <- blocking(x = RLdata500$txt, ann = "nnd", verbose = 1, graph = TRUE, seed = 2024)
#> ===== creating tokens =====
#> ===== starting search (nnd, x, y: 500, 500, t: 429) =====
#> ===== creating graph =====

Results are as follows:

based on rnndescent we have created 133 blocks,
it was based on 429 columns (2 character shingles),
we have 45 blocks of 2 elements, 33 blocks of 3 elements, …, 1 block of 17 elements.

df_blocks
#> ========================================================
#> Blocking based on the nnd method.
#> Number of blocks: 133.
#> Number of columns used for blocking: 429.
#> Reduction ratio: 0.9917.
#> ========================================================
#> Distribution of the size of the blocks:
#>  2  3  4  5  6  7  8  9 10 11 17 
#> 47 34 18 12  8  5  3  3  1  1  1

Structure of the object is as follows:

result – a data.table with identifiers and block IDs,
method – the method used,
deduplication – whether deduplication was applied,
representation – whether shingles or vectors were used,
metrics – standard metrics and based on the igraph::compare methods for comparing graphs (here NULL),
confusion – confusion matrix (here NULL),
colnames – column names used for the comparison,
graph – an igraph object mainly for visualisation.

str(df_blocks,1)
#> List of 8
#>  $ result        :Classes 'data.table' and 'data.frame': 367 obs. of  4 variables:
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>  $ method        : chr "nnd"
#>  $ deduplication : logi TRUE
#>  $ representation: chr "shingles"
#>  $ metrics       : NULL
#>  $ confusion     : NULL
#>  $ colnames      : chr [1:429] "86" "ap" "av" "bf" ...
#>  $ graph         :Class 'igraph'  hidden list of 10
#>  - attr(*, "class")= chr "blocking"

Plot connections.

plot(df_blocks$graph, vertex.size=1, vertex.label = NA)

The resulting data.table has four columns:

x – reference dataset (i.e. RLdata500) – this may not contain all units of RLdata500,
y - query (each row of RLdata500) – this may not contain all units of RLdata500,
block – the block ID,
dist – distance between objects.

head(df_blocks$result)

x	y	block	dist
1	64	33	0.4737987
2	43	1	0.0807453
2	486	1	0.4102322
3	450	88	0.4326335
4	234	12	0.5256584
5	128	2	0.5133357

Create long data.table with information on blocks and units from original dataset.

df_block_melted <- melt(df_blocks$result, id.vars = c("block", "dist"))
df_block_melted_rec_block <- unique(df_block_melted[, .(rec_id=value, block)])
head(df_block_melted_rec_block)

rec_id	block
1	33
2	1
3	88
4	12
5	2
6	33

We add block information to the final dataset.

RLdata500[df_block_melted_rec_block, on = "rec_id", block_id := i.block]
head(RLdata500)

fname_c1	lname_c1	by	bm	bd	rec_id	ent_id	id_count	txt	block_id
CARSTEN	MEIER	1949	07	22	1	34	1	carstenmeier19490722	33
GERD	BAUER	1968	07	27	2	51	2	gerdbauer19680727	1
ROBERT	HARTMANN	1930	04	30	3	115	1	roberthartmann19300430	88
STEFAN	WOLFF	1957	09	02	4	189	1	stefanwolff19570902	12
RALF	KRUEGER	1966	01	13	5	72	1	ralfkrueger19660113	2
JUERGEN	FRANKE	1929	07	04	6	142	1	juergenfranke19290704	33

We can check in how many blocks the same entities (ent_id) are observed. In our example, all the same entities are in the same blocks.

RLdata500[, .(uniq_blocks = uniqueN(block_id)), .(ent_id)][, .N, uniq_blocks]

uniq_blocks	N
1	450

We can visualise the distances between units stored in the df_blocks$result data set. Clearly we have a mixture of two groups: matches (close to 0) and non-matches (close to 1).

hist(df_blocks$result$dist, xlab = "Distances", ylab = "Frequency", breaks = "fd",
     main = "Distances calculated between units")

Finally, we can visualise the result based on the information whether block contains matches or not.

df_for_density <- copy(df_block_melted[block %in% RLdata500$block_id])
df_for_density[, match:= block %in% RLdata500[id_count == 2]$block_id]

plot(density(df_for_density[match==FALSE]$dist), col = "blue", xlim = c(0, 0.8), 
     main = "Distribution of distances between\nclusters type (match=red, non-match=blue)")
lines(density(df_for_density[match==TRUE]$dist), col = "red", xlim = c(0, 0.8))

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.