tok provides bindings to the 🤗tokenizers library. It uses the same Rust libraries that power the Python implementation.
We don’t yet provide the full tokenizers API. Please open an issue if there’s a feature you are missing.
You can install tok from CRAN using:
install.packages("tok")
Installing tok from source requires a working Rust toolchain. We recommend using rustup.
On Windows, you’ll also have to add the i686-pc-windows-gnu and x86_64-pc-windows-gnu targets:
rustup target add x86_64-pc-windows-gnu
rustup target add i686-pc-windows-gnu
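Before installing from source, it can help to confirm from within R that the Rust toolchain is visible. The following is a minimal sketch using only base R; the helper name has_tool is hypothetical and not part of tok, and the check assumes that rustc and cargo are the executables the build needs on the PATH.

```r
# Hypothetical helper (not part of tok): TRUE when an executable with this
# name is found on the PATH. Sys.which() returns "" for missing programs.
has_tool <- function(name) unname(nzchar(Sys.which(name)))

tools <- c("rustc", "cargo")
not_found <- tools[!has_tool(tools)]

if (length(not_found) > 0) {
  message("Not found on PATH: ", paste(not_found, collapse = ", "),
          ". Install Rust (for example with rustup) before building from source.")
} else {
  message("Rust toolchain found: ", Sys.which("cargo"))
}
```

If the tools are reported as missing right after installing Rust, restarting the R session usually picks up the updated PATH.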
Once Rust is working, you can install this package via:
remotes::install_github("dfalbel/tok")
We still don’t have complete support for the 🤗tokenizers API. Please open an issue if you need a feature that is currently not implemented.
tok can be used to load and use tokenizers that have been previously serialized. For example, HuggingFace model weights are usually accompanied by a ‘tokenizer.json’ file that can be loaded with this library.
To load a pre-trained tokenizer from a JSON file, use:
path <- testthat::test_path("assets/tokenizer.json")
tok <- tok::tokenizer$from_file(path)
Use the encode method to tokenize sentences and decode to transform the resulting token ids back into text.
enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"
You can also load any tokenizer available on the HuggingFace Hub by using the from_pretrained static method. For example, let’s load the GPT2 tokenizer with:
tok <- tok::tokenizer$from_pretrained("gpt2")
enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"
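The encode and decode steps can be wrapped into a small round-trip check, which is a handy sanity test after loading a tokenizer. This is only a sketch: round_trip is a hypothetical helper, not part of tok, and it relies solely on the encode, decode and ids members used in the examples above. The second sentence is an arbitrary illustration.

```r
# Hypothetical helper (not part of tok): encode a string, then decode the ids.
round_trip <- function(tokenizer, text) {
  enc <- tokenizer$encode(text)
  tokenizer$decode(enc$ids)
}

# Run the live example only when tok is installed. from_pretrained() fetches
# the tokenizer from the HuggingFace Hub, so it needs network access.
if (requireNamespace("tok", quietly = TRUE)) {
  gpt2 <- tok::tokenizer$from_pretrained("gpt2")
  sentences <- c("hello world", "Tokenizers are fast.")
  for (s in sentences) {
    unchanged <- identical(round_trip(gpt2, s), s)
    cat(sprintf("%-25s unchanged after round trip: %s\n", s, unchanged))
  }
}
```

Whether a round trip reproduces the input exactly depends on the tokenizer (for instance, some normalize case or whitespace), so a FALSE result is not necessarily a bug.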