
Title: A Byte-Pair-Encoding (BPE) Tokenizer for OpenAI's Large Language Models
Version: 0.0.7
Description: A thin wrapper around the tiktoken-rs crate, allowing you to encode text into Byte-Pair-Encoding (BPE) tokens and decode tokens back into text. This is useful for understanding how Large Language Models (LLMs) perceive text.
License: MIT + file LICENSE
URL: https://davzim.github.io/rtiktoken/, https://github.com/DavZim/rtiktoken/
BugReports: https://github.com/DavZim/rtiktoken/issues
Suggests: testthat (≥ 3.0.0)
SystemRequirements: Cargo (Rust's package manager), rustc >= 1.65.0
Encoding: UTF-8
RoxygenNote: 7.3.2
Config/rextendr/version: 0.3.1.9001
Config/testthat/edition: 3
Config/rtiktoken/MSRV: 1.65.0
Depends: R (≥ 4.2)
NeedsCompilation: yes
Packaged: 2025-04-14 20:20:47 UTC; david
Author: David Zimmermann-Kollenda [aut, cre], Roger Zurawicki [aut] (tiktoken-rs Rust library), Authors of the dependent Rust crates [aut] (see AUTHORS file)
Maintainer: David Zimmermann-Kollenda <david_j_zimmermann@hotmail.com>
Repository: CRAN
Date/Publication: 2025-04-14 22:50:02 UTC

Decodes tokens back to text

Description

Decodes tokens back to text

Usage

decode_tokens(tokens, model)

Arguments

tokens

a vector of tokens to decode, or a list of tokens

model

a model to use for tokenization, either a model name, e.g., gpt-4o, or a tokenizer, e.g., o200k_base. See also available tokenizers.

Value

a character string of the decoded tokens, or a vector of strings

See Also

model_to_tokenizer(), get_tokens()

Examples

tokens <- get_tokens("Hello World", "gpt-4o")
tokens
decode_tokens(tokens, "gpt-4o")

tokens <- get_tokens(c("Hello World", "Alice Bob Charlie"), "gpt-4o")
tokens
decode_tokens(tokens, "gpt-4o")
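As a quick sanity check (a sketch; assumes the rtiktoken package is installed and attached), encoding and then decoding should round-trip the original text:

```r
library(rtiktoken)

text <- "Hello World"
tokens <- get_tokens(text, "gpt-4o")

# decoding the tokens should recover the original string
identical(decode_tokens(tokens, "gpt-4o"), text)
```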

Returns the number of tokens in a text

Description

Returns the number of tokens in a text

Usage

get_token_count(text, model)

Arguments

text

a character string to encode to tokens, can be a vector

model

a model to use for tokenization, either a model name, e.g., gpt-4o, or a tokenizer, e.g., o200k_base. See also available tokenizers.

Value

the number of tokens in the text, vector of integers

See Also

model_to_tokenizer(), get_tokens()

Examples

get_token_count("Hello World", "gpt-4o")
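Since text can be a vector, token counts for several strings can be computed in one call (a sketch; assumes the rtiktoken package is installed and attached):

```r
library(rtiktoken)

texts <- c("Hello World", "Alice Bob Charlie")

# returns one integer per input string
get_token_count(texts, "gpt-4o")
```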

Converts text to tokens

Description

Converts text to tokens

Usage

get_tokens(text, model)

Arguments

text

a character string to encode to tokens, can be a vector

model

a model to use for tokenization, either a model name, e.g., gpt-4o, or a tokenizer, e.g., o200k_base. See also available tokenizers.

Value

a vector of tokens for the given text, as integers

See Also

model_to_tokenizer(), decode_tokens()

Examples

get_tokens("Hello World", "gpt-4o")
get_tokens("Hello World", "o200k_base")
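Because gpt-4o uses the o200k_base tokenizer, passing either the model name or the tokenizer name should produce identical tokens (a sketch; assumes the rtiktoken package is installed and attached):

```r
library(rtiktoken)

# both calls resolve to the same tokenizer, so the tokens match
identical(
  get_tokens("Hello World", "gpt-4o"),
  get_tokens("Hello World", "o200k_base")
)
```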

Gets the name of the tokenizer used by a model

Description

Gets the name of the tokenizer used by a model

Usage

model_to_tokenizer(model)

Arguments

model

the model to use, e.g., gpt-4o

Value

the tokenizer used by the model

Examples

model_to_tokenizer("gpt-4o")
model_to_tokenizer("gpt-4-1106-preview")
model_to_tokenizer("text-davinci-002")
model_to_tokenizer("text-embedding-ada-002")
model_to_tokenizer("text-embedding-3-small")
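The returned tokenizer name can be passed directly to get_tokens() or get_token_count() (a sketch; assumes the rtiktoken package is installed and attached):

```r
library(rtiktoken)

# look up the tokenizer once, then reuse it for encoding
tok <- model_to_tokenizer("gpt-4o")
get_tokens("Hello World", tok)
```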
