The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This repository contains an R package which is an Rcpp wrapper around the sentencepiece C++ library
The R package allows you to
install.packages("sentencepiece")
remotes::install_github("bnosac/sentencepiece")
Look to the documentation of the functions
help(package = "sentencepiece")
library(sentencepiece)
dl <- sentencepiece_download_model("English", vocab_size = 50000)
model <- sentencepiece_load_model(dl$file_model)
model
Sentencepiece model
size of the vocabulary: 50000
model stored at: C:/Users/Jan/Documents/R/win-library/3.5/sentencepiece/models/en.wiki.bpe.vs50000.model
txt <- c("Give me back my Money or I'll call the police.",
"Talk to the hand because the face don't want to hear it any more.")
txt <- tolower(txt)
sentencepiece_encode(model, txt, type = "subwords")
[[1]]
[1] "▁give" "▁me" "▁back" "▁my" "▁money" "▁or" "▁i" "'" "ll" "▁call" "▁the" "▁police" "."
[[2]]
[1] "▁talk" "▁to" "▁the" "▁hand" "▁because" "▁the" "▁face" "▁don" "'" "t" "▁want" "▁to" "▁hear" "▁it" "▁any" "▁more" "."
sentencepiece_encode(model, txt, type = "ids")
[[1]]
[1] 3090 352 810 1241 2795 127 386 49937 1188 612 7 2142 49935
[[2]]
[1] 4252 42 7 1197 936 7 3227 1616 49937 49915 4451 42 6800 107 756 407 49935
library(tokenizers.bpe)
library(sentencepiece)
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")
writeLines(text = x$text, con = "traindata.txt")
model <- sentencepiece("traindata.txt", type = "bpe", coverage = 0.999, vocab_size = 5000,
model_dir = getwd(), verbose = FALSE)
model
Sentencepiece model
size of the vocabulary: 5000
model stored at: sentencepiece.model
str(model$vocabulary)
'data.frame': 5000 obs. of 2 variables:
$ id : int 0 1 2 3 4 5 6 7 8 9 ...
$ subword: chr "<unk>" "<s>" "</s>" "es" ...
text <- c("L'appartement est grand & vraiment bien situe en plein centre",
"Proportion de femmes dans les situations de famille monoparentale.")
sentencepiece_encode(model, x = text, type = "subwords")
[[1]]
[1] "▁L" "'" "app" "ar" "tement" "▁est" "▁grand" "▁" "&" "▁v" "r" "ai" "ment" "▁bien" "▁situe" "▁en" "▁plein" "▁centre"
[[2]]
[1] "▁Pro" "por" "tion" "▁de" "▁femmes" "▁dans" "▁les" "▁situations" "▁de" "▁famille" "▁mon" "op" "ar" "ent" "ale" "."
sentencepiece_encode(model, x = text, type = "ids")
[[1]]
[1] 75 4951 252 31 461 109 960 4934 0 49 4941 34 32 585 4225 44 3356 1915
[[2]]
[1] 1362 4159 25 9 2060 93 40 3825 9 2923 705 247 31 19 116 4953
x <- sentencepiece_encode(model, x = text, type = "ids")
sentencepiece_decode(model, x)
[[1]]
[1] "L'appartement est grand ⁇ vraiment bien situe en plein centre"
[[2]]
[1] "Proportion de femmes dans les situations de famille monoparentale."
Need support in text mining? Contact BNOSAC: http://www.bnosac.be
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.