The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Pairwise comparisons for document similarity

Lincoln Mullen

2020-05-15

The most straightforward way to compare documents within a corpus is to compare each document to every other document.

First we will load the corpus and tokenize it with shingled n-grams.

library(textreuse)
dir <- system.file("extdata/ats", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          progress = FALSE)

We can use any of the comparison functions to compare two documents in the corpus. (Note that these functions, when applied to documents, compare their hashed tokens and not the tokens directly.)

jaccard_similarity(corpus[["remember00palm"]], 
                   corpus[["remembermeorholy00palm"]])
## [1] 0.7006189

The pairwise_compare() function applies a comparison function to each pair of documents in a corpus. The result is a matrix with the scores for each comparison.

comparisons <- pairwise_compare(corpus, jaccard_similarity, progress = FALSE)
comparisons[1:4, 1:4]
##                        calltounconv00baxt gospeltruth00whit
## calltounconv00baxt                     NA             0.001
## gospeltruth00whit                      NA                NA
## lifeofrevrichard00baxt                 NA                NA
##                        lifeofrevrichard00baxt
## calltounconv00baxt                      0.281
## gospeltruth00whit                       0.000
## lifeofrevrichard00baxt                     NA

If you prefer, you can convert the matrix of all comparisons to a data frame of pairs and scores. Here we create the data frame and keep only the pairs with scores above a significant value.

candidates <- pairwise_candidates(comparisons)
candidates[candidates$score > 0.1, ]
## # A tibble: 3 x 3
##   a                     b                      score
##   <chr>                 <chr>                  <dbl>
## 1 calltounconv00baxt    lifeofrevrichard00baxt 0.281
## 2 practicalthought00nev thoughtsonpopery00nevi 0.463
## 3 remember00palm        remembermeorholy00palm 0.701

The pairwise comparison method is inadequate for a corpus of any size, however. For a corpus of size \(n\), the number of comparisons (assuming the comparisons are commutative) is \(\frac{n^2 - n}{2}\). A corpus of 100 documents would require 4,950 comparisons; a corpus of 1,000 documents would require 499,500 comparisons. A better approach for corpora of any appreciable size is to use the minhash/LSH algorithms described in another vignette:

vignette("minhash", package = "textreuse")

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.