Load the required packages.
Read the RLdata500 data (taken from the RecordLinkage package).
fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id |
---|---|---|---|---|---|---|---|---|
CARSTEN |  | MEIER |  | 1949 | 7 | 22 | 1 | 34 |
GERD |  | BAUER |  | 1968 | 7 | 27 | 2 | 51 |
ROBERT |  | HARTMANN |  | 1930 | 4 | 30 | 3 | 115 |
STEFAN |  | WOLFF |  | 1957 | 9 | 2 | 4 | 189 |
RALF |  | KRUEGER |  | 1966 | 1 | 13 | 5 | 72 |
JUERGEN |  | FRANKE |  | 1929 | 7 | 4 | 6 | 142 |
This dataset contains 500 rows with 450 entities.
Now we create a new column that concatenates the information in each row.
RLdata500[, id_count := .N, by = ent_id] ## number of records per entity
RLdata500[, bm := sprintf("%02d", bm)] ## add leading zero to month
RLdata500[, bd := sprintf("%02d", bd)] ## add leading zero to day
RLdata500[, txt := tolower(paste0(fname_c1, fname_c2, lname_c1, lname_c2, by, bm, bd))]
head(RLdata500)
fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id | id_count | txt |
---|---|---|---|---|---|---|---|---|---|---|
CARSTEN |  | MEIER |  | 1949 | 07 | 22 | 1 | 34 | 1 | carstenmeier19490722 |
GERD |  | BAUER |  | 1968 | 07 | 27 | 2 | 51 | 2 | gerdbauer19680727 |
ROBERT |  | HARTMANN |  | 1930 | 04 | 30 | 3 | 115 | 1 | roberthartmann19300430 |
STEFAN |  | WOLFF |  | 1957 | 09 | 02 | 4 | 189 | 1 | stefanwolff19570902 |
RALF |  | KRUEGER |  | 1966 | 01 | 13 | 5 | 72 | 1 | ralfkrueger19660113 |
JUERGEN |  | FRANKE |  | 1929 | 07 | 04 | 6 | 142 | 1 | juergenfranke19290704 |
In the next step we use the newly created column in the blocking function. If we specify verbose, we get information about the progress.
df_blocks <- blocking(x = RLdata500$txt, ann = "nnd", verbose = 1, graph = TRUE, seed = 2024)
#> ===== creating tokens =====
#> ===== starting search (nnd, x, y: 500, 500, t: 429) =====
#> ===== creating graph =====
Results are as follows: using rnndescent (the nnd method), we have created 133 blocks.
df_blocks
#> ========================================================
#> Blocking based on the nnd method.
#> Number of blocks: 133.
#> Number of columns used for blocking: 429.
#> Reduction ratio: 0.9917.
#> ========================================================
#> Distribution of the size of the blocks:
#> 2 3 4 5 6 7 8 9 10 11 17
#> 47 34 18 12 8 5 3 3 1 1 1
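The reported reduction ratio can be checked by hand from the block-size distribution printed above. The sketch below assumes the usual definition: one minus the share of all possible record pairs that remain as within-block candidate pairs. The package's own formula is not shown in this text, and the variable names are illustrative.

```r
# Block sizes and how many blocks have each size (copied from the summary above)
block_size <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 17)
n_blocks   <- c(47, 34, 18, 12, 8, 5, 3, 3, 1, 1, 1)

# Every record falls into exactly one block, so the sizes add up to 500
n_records <- sum(block_size * n_blocks)

# Candidate pairs left after blocking: all pairs within each block
pairs_kept <- sum(choose(block_size, 2) * n_blocks)

# Pairs that would be compared without any blocking
pairs_total <- choose(n_records, 2)

# Assumed definition: share of comparisons avoided
reduction_ratio <- 1 - pairs_kept / pairs_total
round(reduction_ratio, 4)
#> [1] 0.9917
```

Under this assumption, blocking leaves 1030 of the 124750 possible pairs, which agrees with the printed value of 0.9917.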
Structure of the object is as follows:
- result – a data.table with identifiers and block IDs,
- method – the method used,
- deduplication – whether deduplication was applied,
- representation – whether shingles or vectors were used,
- metrics – standard metrics and metrics based on the igraph::compare methods for comparing graphs (here NULL),
- confusion – confusion matrix (here NULL),
- colnames – column names used for the comparison,
- graph – an igraph object mainly for visualisation.
str(df_blocks, 1)
#> List of 8
#> $ result :Classes 'data.table' and 'data.frame': 367 obs. of 4 variables:
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> $ method : chr "nnd"
#> $ deduplication : logi TRUE
#> $ representation: chr "shingles"
#> $ metrics : NULL
#> $ confusion : NULL
#> $ colnames : chr [1:429] "86" "ap" "av" "bf" ...
#> $ graph :Class 'igraph' hidden list of 10
#> - attr(*, "class")= chr "blocking"
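The two-character entries in colnames ("86", "ap", "av", ...) together with representation = "shingles" suggest that each string is split into overlapping character 2-grams before the search. The package's own tokeniser is not shown in this text, so the following is only a base R illustration of that idea; char_shingles is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: unique overlapping character n-grams ("shingles") of a string
char_shingles <- function(s, n = 2) {
  len <- nchar(s)
  if (len < n) return(character(0))        # string too short for any n-gram
  starts <- seq_len(len - n + 1)           # start position of every n-gram
  unique(substring(s, starts, starts + n - 1))
}

# Shingles for one of the concatenated keys shown earlier
tokens <- char_shingles("gerdbauer19680727")
tokens
```

Each distinct shingle across all 500 strings would then correspond to one column of a sparse record-by-shingle matrix, which is consistent with the 429 columns reported above.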
Plot connections.
The resulting data.table has four columns:
- x – the reference dataset (here RLdata500); this may not contain all units of RLdata500,
- y – the query (each row of RLdata500); this may not contain all units of RLdata500,
- block – the block ID,
- dist – distance between objects.
x | y | block | dist |
---|---|---|---|
1 | 64 | 33 | 0.4737987 |
2 | 43 | 1 | 0.0807453 |
2 | 486 | 1 | 0.4102322 |
3 | 450 | 88 | 0.4326335 |
4 | 234 | 12 | 0.5256584 |
5 | 128 | 2 | 0.5133357 |
Create a long data.table with information on blocks and units from the original dataset.
df_block_melted <- melt(df_blocks$result, id.vars = c("block", "dist"))
df_block_melted_rec_block <- unique(df_block_melted[, .(rec_id=value, block)])
head(df_block_melted_rec_block)
rec_id | block |
---|---|
1 | 33 |
2 | 1 |
3 | 88 |
4 | 12 |
5 | 2 |
6 | 33 |
We add block information to the final dataset.
fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id | id_count | txt | block_id |
---|---|---|---|---|---|---|---|---|---|---|---|
CARSTEN |  | MEIER |  | 1949 | 07 | 22 | 1 | 34 | 1 | carstenmeier19490722 | 33 |
GERD |  | BAUER |  | 1968 | 07 | 27 | 2 | 51 | 2 | gerdbauer19680727 | 1 |
ROBERT |  | HARTMANN |  | 1930 | 04 | 30 | 3 | 115 | 1 | roberthartmann19300430 | 88 |
STEFAN |  | WOLFF |  | 1957 | 09 | 02 | 4 | 189 | 1 | stefanwolff19570902 | 12 |
RALF |  | KRUEGER |  | 1966 | 01 | 13 | 5 | 72 | 1 | ralfkrueger19660113 | 2 |
JUERGEN |  | FRANKE |  | 1929 | 07 | 04 | 6 | 142 | 1 | juergenfranke19290704 | 33 |
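The code that attaches the block IDs is not shown in this text. One way to obtain a table like the one above is a data.table update join on rec_id. The sketch below uses two small stand-in tables (records and rec_block are illustrative names) filled with the six rows displayed above, and is not necessarily the original code.

```r
library(data.table)

# Stand-in for RLdata500: record IDs and concatenated keys from the table above
records <- data.table(
  rec_id = 1:6,
  txt = c("carstenmeier19490722", "gerdbauer19680727", "roberthartmann19300430",
          "stefanwolff19570902", "ralfkrueger19660113", "juergenfranke19290704")
)

# Stand-in for df_block_melted_rec_block: one block per record
rec_block <- data.table(rec_id = 1:6, block = c(33, 1, 88, 12, 2, 33))

# Update join: look up each rec_id in rec_block and copy its block into block_id
records[rec_block, on = "rec_id", block_id := i.block]
records
```

The update join modifies records in place and leaves block_id as NA for any record that has no entry in the lookup table.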
We can check in how many blocks the records of the same entity (ent_id) appear. In our example, all records of each entity fall into a single block.
uniq_blocks | N |
---|---|
1 | 450 |
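A count like the one above can be produced by first counting the distinct blocks per entity and then tabulating those counts. The sketch below shows this on a small illustrative table; toy and its values are assumptions made for the example, and the original code for this step is not shown in this text.

```r
library(data.table)

# Illustrative data: entity 51 has two records, both assigned to block 1
toy <- data.table(
  ent_id   = c(34, 51, 51, 115, 189),
  block_id = c(33, 1, 1, 88, 12)
)

# Step 1: number of distinct blocks in which each entity appears
per_entity <- toy[, .(uniq_blocks = uniqueN(block_id)), by = ent_id]

# Step 2: how many entities appear in 1, 2, ... distinct blocks
block_check <- per_entity[, .N, by = uniq_blocks]
block_check
```

A single row with uniq_blocks equal to 1 means that no entity has been split across blocks.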
We can visualise the distances between units stored in the df_blocks$result data set. Clearly we have a mixture of two groups: matches (close to 0) and non-matches (close to 1).
hist(df_blocks$result$dist, xlab = "Distances", ylab = "Frequency", breaks = "fd",
main = "Distances calculated between units")
Finally, we can visualise the distances according to whether a block contains matches or not.
## keep the pairs whose block was assigned to a record
df_for_density <- copy(df_block_melted[block %in% RLdata500$block_id])
## flag blocks that contain an entity occurring twice
df_for_density[, match := block %in% RLdata500[id_count == 2]$block_id]
plot(density(df_for_density[match == FALSE]$dist), col = "blue", xlim = c(0, 0.8),
main = "Distribution of distances by\nblock type (match=red, non-match=blue)")
## lines() draws on the existing axes, so xlim is not passed here
lines(density(df_for_density[match == TRUE]$dist), col = "red")