The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
{rbm25}
is a light wrapper around the rust bm25
crate. It
provides a simple interface to the Okapi BM25 algorithm
for text search.
Note the package does not provide any text preprocessing, this needs to be done before using the package.
You can install the development version of rbm25 like so:
# Development Version
# devtools::install_github("DavZim/rbm25")
# CRAN release
install.packages("rbm25")
The package exposes an R6 class BM25
that can be used to
query a text corpus. For simplicity, there is also a
bm25_score()
function that wraps the BM25
class.
library(rbm25)
# create a text corpus, where we want to find the closest matches for a query
<- c(
corpus_original "The rabbit munched the orange carrot.",
"The snake hugged the green lizard.",
"The hedgehog impaled the orange orange.",
"The squirrel buried the brown nut."
)
# text preprocessing: tolower, remove punctuation, remove stopwords
# note this is just an example and not the best way for larger amounts of text
<- c("the", "a", "an", "and")
stopwords <- corpus_original |>
corpus tolower() |>
gsub(pattern = "[[:punct:]]", replacement = "") |>
gsub(pattern = paste0("\\b(", paste(stopwords, collapse = "|"), ") *\\b"),
replacement = "") |>
trimws()
# define some metadata for the text corpus, e.g., the original text and the source
<- data.frame(
metadata text_original = corpus_original,
source = c("book1", "book2", "book3", "book4")
)
Using the BM25 Class
<- BM25$new(data = corpus, metadata = metadata)
bm
bm#> <BM25 (k1: 1.20, b: 0.75)> with 4 documents (language: 'Detect')
#> - Data & Metadata
#> text metadata.text_original
#> 1 rabbit munched orange carrot The rabbit munched the orange carrot.
#> 2 snake hugged green lizard The snake hugged the green lizard.
#> 3 hedgehog impaled orange orange The hedgehog impaled the orange orange.
#> 4 squirrel buried brown nut The squirrel buried the brown nut.
#> metadata.source
#> 1 book1
#> 2 book2
#> 3 book3
#> 4 book4
# note that query returns the values sorted by rank
$query(query = "orange", max_n = 2)
bm#> id score rank text
#> 1 3 0.4904281 1 hedgehog impaled orange orange
#> 2 1 0.3566750 2 rabbit munched orange carrot
#> text_original source
#> 1 The hedgehog impaled the orange orange. book3
#> 2 The rabbit munched the orange carrot. book1
Using the bm25_score()
function
# note that bm25_score returns the score in the order of the input data
<- bm25_score(data = corpus, query = "orange")
scores data.frame(text = corpus, scores_orange = scores)
#> text scores_orange
#> 1 rabbit munched orange carrot 0.3566750
#> 2 snake hugged green lizard 0.0000000
#> 3 hedgehog impaled orange orange 0.4904281
#> 4 squirrel buried brown nut 0.0000000
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.