kgrams provides tools for training and evaluating k-gram language
models, including several probability smoothing methods, perplexity
computations, random text generation and more. It is based on a C++
back-end, which makes kgrams fast, coupled with an accessible R API
that aims to streamline the process of model building. It is suitable
for small- and medium-sized NLP experiments, baseline model building,
and pedagogical purposes.
If you have no idea about what k-gram models are and didn’t get here by accident, you can check out my hands-on tutorial post on k-gram language models using R at DataScience+.
You can install the latest release of kgrams
from CRAN with:
install.packages("kgrams")
You can install the development version from my R-universe with:
install.packages("kgrams", repos = "https://vgherard.r-universe.dev/")
This example shows how to train a modified Kneser-Ney 4-gram model on
Shakespeare’s play “Much Ado About Nothing” using kgrams.
library(kgrams)
# Get k-gram frequency counts from text, for k = 1:4
freqs <- kgram_freqs(kgrams::much_ado, N = 4)
# Build modified Kneser-Ney 4-gram model, with discount parameters D1, D2, D3.
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)
We can now use this language_model to compute sentence and word
continuation probabilities:
# Compute sentence probabilities
probability(c("did he break out into tears ?",
              "we are predicting sentence probabilities ."
              ),
            model = mkn
            )
#> [1] 2.466856e-04 1.184963e-20
# Compute word continuation probabilities
probability(c("tears", "pieces") %|% "did he break out into", model = mkn)
#> [1] 9.389238e-01 3.834498e-07
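The same model can also be scored on held-out text through its perplexity, one of the features listed in the package description. The following is a minimal sketch, not part of the original example: it assumes that perplexity() accepts a character vector of text together with a model argument, and that the midsummer dataset ("A Midsummer Night's Dream") ships with kgrams alongside much_ado. Check both assumptions against the reference page before relying on them.

```r
library(kgrams)

# Rebuild the model from the example above so this block is self-contained
freqs <- kgram_freqs(kgrams::much_ado, N = 4)
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)

# Perplexity of the model on a different play (assumed dataset: midsummer).
# Lower values mean the model finds the text less surprising.
pp <- perplexity(kgrams::midsummer, model = mkn)
pp
```

Perplexity on text the model was not trained on is the usual way to compare smoothers or discount parameters, since in-sample probabilities tend to favor the least smoothed model.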
Here are some sentences sampled from the language model’s
distribution at temperatures t = c(1, 0.1, 10):
# Sample sentences from the language model at different temperatures
set.seed(840)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 1)
#> [1] "i have studied eight or nine truly by your office [...] (truncated output)"
#> [2] "ere you go : <EOS>"
#> [3] "don pedro welcome signior : <EOS>"
sample_sentences(model = mkn, n = 3, max_length = 10, t = 0.1)
#> [1] "i will not be sworn but love may transform me [...] (truncated output)"
#> [2] "i will not fail . <EOS>"
#> [3] "i will go to benedick and counsel him to fight [...] (truncated output)"
sample_sentences(model = mkn, n = 3, max_length = 10, t = 10)
#> [1] "july cham's incite start ancientry effect torture tore pains endings [...] (truncated output)"
#> [2] "lastly gallants happiness publish margaret what by spots commodity wake [...] (truncated output)"
#> [3] "born all's 'fool' nest praise hurt messina build afar dancing [...] (truncated output)"
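The effect of the temperature on the samples above can be illustrated with a few lines of base R. This is only a sketch under an assumption: it uses the common convention in which each continuation probability is raised to the power 1/t and the result is renormalized, which may differ in detail from what kgrams does internally. The function apply_temperature() and the toy probabilities are hypothetical and are not taken from the package or the model.

```r
# Toy next-word distribution (hypothetical values, not computed by the model)
p <- c(tears = 0.90, pieces = 0.07, song = 0.03)

# Rescale a probability vector by temperature t:
# the result is proportional to p^(1/t), renormalized to sum to 1.
apply_temperature <- function(p, t) {
  stopifnot(t > 0)
  w <- p^(1 / t)
  w / sum(w)
}

apply_temperature(p, t = 1)    # leaves the distribution unchanged
apply_temperature(p, t = 0.1)  # concentrates mass on the most likely word
apply_temperature(p, t = 10)   # flattens the distribution toward uniform
```

This matches the qualitative pattern in the sampled sentences: low temperatures give repetitive, high-probability text, while high temperatures give nearly random word sequences.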
For further help, you can consult the reference page of the kgrams
website or open an issue on the GitHub repository of kgrams. A vignette
is available on the website, illustrating the process of building
language models in-depth.