Load the required packages.
Read the RLdata500 data (taken from the RecordLinkage
package).
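A minimal sketch of these two steps follows. It assumes that the blocking and data.table packages are installed and that RLdata500 is bundled with blocking as a data.table; both are assumptions rather than facts stated in the text. The guard only keeps the snippet safe to run where the packages are missing.

```r
# Assumption: 'blocking' and 'data.table' are installed, and RLdata500 is
# bundled with 'blocking' (the data originate from RecordLinkage).
have_pkgs <- requireNamespace("blocking", quietly = TRUE) &&
  requireNamespace("data.table", quietly = TRUE)

if (have_pkgs) {
  library(blocking)
  library(data.table)
  data(RLdata500, package = "blocking")  # load the example data set
  print(head(RLdata500))                 # first six records
}
```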
| fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id |
|---|---|---|---|---|---|---|---|---|
| CARSTEN | | MEIER | | 1949 | 7 | 22 | 1 | 34 |
| GERD | | BAUER | | 1968 | 7 | 27 | 2 | 51 |
| ROBERT | | HARTMANN | | 1930 | 4 | 30 | 3 | 115 |
| STEFAN | | WOLFF | | 1957 | 9 | 2 | 4 | 189 |
| RALF | | KRUEGER | | 1966 | 1 | 13 | 5 | 72 |
| JUERGEN | | FRANKE | | 1929 | 7 | 4 | 6 | 142 |
This dataset contains 500 rows with 450 entities.
Now we create a new column that concatenates the information in each row.
RLdata500[, id_count := .N, by = ent_id] ## how many times given unit occurs
RLdata500[, bm := sprintf("%02d", bm)] ## add leading zeros to month
RLdata500[, bd := sprintf("%02d", bd)] ## add leading zeros to day
RLdata500[, txt := tolower(paste0(fname_c1, fname_c2, lname_c1, lname_c2, by, bm, bd))]
head(RLdata500)
| fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id | id_count | txt |
|---|---|---|---|---|---|---|---|---|---|---|
| CARSTEN | | MEIER | | 1949 | 07 | 22 | 1 | 34 | 1 | carstenmeier19490722 |
| GERD | | BAUER | | 1968 | 07 | 27 | 2 | 51 | 2 | gerdbauer19680727 |
| ROBERT | | HARTMANN | | 1930 | 04 | 30 | 3 | 115 | 1 | roberthartmann19300430 |
| STEFAN | | WOLFF | | 1957 | 09 | 02 | 4 | 189 | 1 | stefanwolff19570902 |
| RALF | | KRUEGER | | 1966 | 01 | 13 | 5 | 72 | 1 | ralfkrueger19660113 |
| JUERGEN | | FRANKE | | 1929 | 07 | 04 | 6 | 142 | 1 | juergenfranke19290704 |
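To make the key construction concrete, here is a small base R sketch on two hand-typed records that mirror the rows above. The toy data frame is illustrative and is not the package data. The sketch also shows why empty name components matter: paste0() turns NA into the literal text "NA", so missing values are assumed here to be stored as empty strings.

```r
# Toy records mirroring the first two rows shown above (illustrative only)
toy <- data.frame(
  fname_c1 = c("CARSTEN", "GERD"),
  fname_c2 = c("", ""),
  lname_c1 = c("MEIER", "BAUER"),
  lname_c2 = c("", ""),
  by = c(1949L, 1968L),
  bm = c(7L, 7L),
  bd = c(22L, 27L),
  stringsAsFactors = FALSE
)

# Zero-pad month and day so that every date contributes exactly 8 characters
toy$bm <- sprintf("%02d", toy$bm)
toy$bd <- sprintf("%02d", toy$bd)

# Concatenate all fields into one lower-case key
toy$txt <- tolower(with(toy, paste0(fname_c1, fname_c2, lname_c1, lname_c2, by, bm, bd)))
print(toy$txt)

# Caveat: paste0() converts NA to the string "NA", which would pollute the key
key_with_na <- tolower(paste0("CARSTEN", NA, "MEIER"))
print(key_with_na)
```

Without zero-padding, dates such as 1949-7-22 and 1949-72-2 would produce the same digits in the key, so padding keeps the date part unambiguous.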
In the next step we use the newly created column in the
blocking function. If we specify verbose, we get
information about the progress.
df_blocks <- blocking(x = RLdata500$txt, ann = "nnd", verbose = 1, graph = TRUE, seed = 2024)
#> ===== creating tokens =====
#> ===== starting search (nnd, x, y: 500, 500, t: 429) =====
#> ===== creating graph =====
The results are as follows: based on rnndescent, we have created 133 blocks.
df_blocks
#> ========================================================
#> Blocking based on the nnd method.
#> Number of blocks: 133.
#> Number of columns used for blocking: 429.
#> Reduction ratio: 0.9917.
#> ========================================================
#> Distribution of the size of the blocks:
#> 2 3 4 5 6 7 8 9 10 11 17
#> 47 34 18 12 8 5 3 3 1 1 1
Structure of the object is as follows:
- result – a data.table with identifiers and block IDs,
- method – the method used,
- deduplication – whether deduplication was applied,
- representation – whether shingles or vectors were used,
- metrics – standard metrics and metrics based on the igraph::compare methods for comparing graphs (here NULL),
- confusion – confusion matrix (here NULL),
- colnames – column names used for the comparison,
- graph – an igraph object mainly for visualisation.

str(df_blocks, 1)
#> List of 8
#> $ result :Classes 'data.table' and 'data.frame': 367 obs. of 4 variables:
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> $ method : chr "nnd"
#> $ deduplication : logi TRUE
#> $ representation: chr "shingles"
#> $ metrics : NULL
#> $ confusion : NULL
#> $ colnames : chr [1:429] "86" "ap" "av" "bf" ...
#> $ graph :Class 'igraph' hidden list of 10
#> - attr(*, "class")= chr "blocking"
Plot connections.
The resulting data.table has four columns:
- x – reference dataset (i.e. RLdata500) – this may not contain all units of RLdata500,
- y – query (each row of RLdata500) – this may not contain all units of RLdata500,
- block – the block ID,
- dist – distance between objects.

| x | y | block | dist |
|---|---|---|---|
| 1 | 64 | 33 | 0.4737987 |
| 2 | 43 | 1 | 0.0807453 |
| 2 | 486 | 1 | 0.4102322 |
| 3 | 450 | 88 | 0.4326335 |
| 4 | 234 | 12 | 0.5256584 |
| 5 | 128 | 2 | 0.5133357 |
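The colnames entries such as "ap" and "av" suggest that each txt value is represented by 2-character shingles, and dist is a distance between those representations. The base R sketch below illustrates the idea with binary shingle sets and cosine distance. Both choices are assumptions made for illustration, the helper names char_shingles() and cosine_dist() are hypothetical, and this is not the package's internal code.

```r
# Split a string into its unique overlapping k-character shingles
char_shingles <- function(s, k = 2L) {
  n <- nchar(s)
  if (n < k) return(character(0))
  unique(substring(s, seq_len(n - k + 1L), seq.int(k, n)))
}

# Cosine distance between the binary shingle sets of two strings:
# 1 - |A and B| / sqrt(|A| * |B|)
cosine_dist <- function(a, b, k = 2L) {
  sa <- char_shingles(a, k)
  sb <- char_shingles(b, k)
  if (length(sa) == 0L || length(sb) == 0L) return(NA_real_)
  1 - length(intersect(sa, sb)) / sqrt(length(sa) * length(sb))
}

# Identical keys, a key with one duplicated letter, and an unrelated key
d_same <- cosine_dist("gerdbauer19680727", "gerdbauer19680727")
d_typo <- cosine_dist("gerdbauer19680727", "gerdbauerr19680727")
d_diff <- cosine_dist("gerdbauer19680727", "stefanwolff19570902")
print(c(same = d_same, typo = d_typo, different = d_diff))
```

In this sketch a single typo changes only a few shingles, so the distance stays small, while keys of different people share almost no shingles and the distance is large.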
Create a long data.table with information on blocks and
units from the original dataset.
df_block_melted <- melt(df_blocks$result, id.vars = c("block", "dist"))
df_block_melted_rec_block <- unique(df_block_melted[, .(rec_id=value, block)])
head(df_block_melted_rec_block)
| rec_id | block |
|---|---|
| 1 | 33 |
| 2 | 1 |
| 3 | 88 |
| 4 | 12 |
| 5 | 2 |
| 6 | 33 |
We add block information to the final dataset.
| fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id | id_count | txt | block_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CARSTEN | | MEIER | | 1949 | 07 | 22 | 1 | 34 | 1 | carstenmeier19490722 | 33 |
| GERD | | BAUER | | 1968 | 07 | 27 | 2 | 51 | 2 | gerdbauer19680727 | 1 |
| ROBERT | | HARTMANN | | 1930 | 04 | 30 | 3 | 115 | 1 | roberthartmann19300430 | 88 |
| STEFAN | | WOLFF | | 1957 | 09 | 02 | 4 | 189 | 1 | stefanwolff19570902 | 12 |
| RALF | | KRUEGER | | 1966 | 01 | 13 | 5 | 72 | 1 | ralfkrueger19660113 | 2 |
| JUERGEN | | FRANKE | | 1929 | 07 | 04 | 6 | 142 | 1 | juergenfranke19290704 | 33 |
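The join that adds block_id can be sketched with data.table on toy data. The toy tables and their values are hypothetical, and the update join is one reasonable way to do this rather than necessarily the code that produced the table above. The last step counts, for each entity, the number of distinct blocks its records fall into.

```r
library(data.table)

# Hypothetical toy records: entity 10 occurs twice, entities 20 and 30 once
records <- data.table(rec_id = 1:4, ent_id = c(10L, 20L, 10L, 30L))

# Hypothetical record-to-block lookup, shaped like df_block_melted_rec_block
rec_block <- data.table(rec_id = 1:4, block = c(1L, 2L, 1L, 2L))

# Update join: copy 'block' from the lookup into a new 'block_id' column
records[rec_block, on = "rec_id", block_id := i.block]

# For each entity count its distinct blocks, then tabulate those counts
check <- records[, .(uniq_blocks = uniqueN(block_id)), by = ent_id][, .N, by = uniq_blocks]
print(records)
print(check)
```

A result with a single row where uniq_blocks equals 1 means that no entity is split across blocks.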
We can check how many blocks each entity (ent_id)
appears in. In our example, all records of the same
entity fall into the same block.
| uniq_blocks | N |
|---|---|
| 1 | 450 |
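This check covers one side of blocking quality: no true pair is lost. The other side is how many comparisons are avoided, which the printed summary reports as a reduction ratio of 0.9917. The base R sketch below reproduces that figure from the block-size distribution in the summary. It assumes the common definition of the reduction ratio as one minus the share of within-block pairs among all possible pairs; that definition is an assumption here and is not taken from the package documentation.

```r
# Block sizes and their frequencies, copied from the printed summary
block_size <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 17)
n_blocks   <- c(47, 34, 18, 12, 8, 5, 3, 3, 1, 1, 1)

n_units        <- sum(block_size * n_blocks)            # units covered by blocks
pairs_in_block <- sum(n_blocks * choose(block_size, 2)) # comparisons left after blocking
pairs_total    <- choose(n_units, 2)                    # comparisons without blocking

reduction_ratio <- 1 - pairs_in_block / pairs_total
print(round(reduction_ratio, 4))
```

Under this definition only 1030 of the 124750 possible pairs remain to be compared.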
We can visualise the distances between units stored in the
df_blocks$result data set. Clearly we have a mixture of two
groups: matches (close to 0) and non-matches (close to 1).
hist(df_blocks$result$dist, xlab = "Distances", ylab = "Frequency", breaks = "fd",
     main = "Distances calculated between units")
Finally, we can visualise the result according to whether a block contains matches or not.
df_for_density <- copy(df_block_melted[block %in% RLdata500$block_id])
df_for_density[, match:= block %in% RLdata500[id_count == 2]$block_id]
plot(density(df_for_density[match==FALSE]$dist), col = "blue", xlim = c(0, 0.8),
main = "Distribution of distances between\nclusters type (match=red, non-match=blue)")
lines(density(df_for_density[match==TRUE]$dist), col = "red", xlim = c(0, 0.8))