Title: Data for Wordpiece-Style Tokenization
Version: 2.0.0
Description: Provides data to be used by the wordpiece algorithm in order to tokenize text into somewhat meaningful chunks. Included vocabularies were retrieved from https://huggingface.co/bert-base-cased/resolve/main/vocab.txt and https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt and parsed into an R-friendly format.
License: Apache License (≥ 2)
Encoding: UTF-8
RoxygenNote: 7.1.2
URL: https://github.com/macmillancontentscience/wordpiece.data
BugReports: https://github.com/macmillancontentscience/wordpiece.data/issues
Depends: R (≥ 3.5.0)
Suggests: testthat (≥ 3.0.0)
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2022-03-03 15:50:03 UTC; jonth
Author: Jonathan Bratt [aut], Jon Harmon [aut, cre], Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph], Google, Inc [cph] (original BERT vocabularies)
Maintainer: Jon Harmon <jonthegeek@gmail.com>
Repository: CRAN
Date/Publication: 2022-03-03 16:20:02 UTC

Generate the inst path

Description

Generate the path to a data file in the package's inst directory.

Usage

.get_path(filetype, n_tokens)

Arguments

filetype

Character scalar; the type of file, like "uncased".

n_tokens

Integer scalar; the number of tokens used for that file.

Value

Character scalar; the path to the file.


Load an RDS from inst Dir

Description

Load an RDS file from the package's inst directory.

Usage

.load_inst_rds(filetype, n_tokens)

Arguments

filetype

Character scalar; the type of file, like "uncased".

n_tokens

Integer scalar; the number of tokens used for that file.

Value

The R object.


Load a wordpiece Vocabulary

Description

A wordpiece vocabulary is a named integer vector with class "wordpiece_vocabulary". The names of the vector are the tokens, and the values are the integer identifiers of those tokens. The vocabulary is 0-indexed for compatibility with Python implementations.

Usage

wordpiece_vocab(cased = FALSE)

Arguments

cased

Logical; if TRUE, load the cased vocabulary; if FALSE (the default), load the uncased vocabulary.

Value

A wordpiece_vocabulary.

Examples

head(wordpiece_vocab())
head(wordpiece_vocab(cased = TRUE))
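
Because the vocabulary is a named integer vector with 0-based identifiers, looking up a token by its id in R needs a shift of one. The following is a minimal sketch of that idea, not part of the package: `toy_vocab`, `token_to_id()`, and `id_to_token()` are hypothetical names, the tokens and ids are made up, and the id-to-token lookup assumes the ids are stored in ascending order with no gaps.

```r
# Toy stand-in for a wordpiece vocabulary (hypothetical tokens and ids; the
# real vocabularies are far larger).
toy_vocab <- structure(
  c("[PAD]" = 0L, "[UNK]" = 1L, "hello" = 2L, "##ing" = 3L),
  class = "wordpiece_vocabulary"
)

# Token -> id: index by name. unclass() keeps the lookup independent of any
# methods defined for the class.
token_to_id <- function(vocab, token) {
  unclass(vocab)[[token]]
}

# Id -> token: ids start at 0 but R positions start at 1, so shift by one.
# Assumption: ids are stored in ascending order with no gaps.
id_to_token <- function(vocab, id) {
  names(vocab)[id + 1L]
}

token_to_id(toy_vocab, "hello")
id_to_token(toy_vocab, 2L)
```

If the same shape holds for the object returned by `wordpiece_vocab()`, these helpers should work on it unchanged, but that is an assumption worth checking against the loaded vocabulary.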
