Fast, Consistent Tokenization of Natural Language Text [R package tokenizers version 0.3.0]

Lincoln Mullen

tokenizers: Fast, Consistent Tokenization of Natural Language Text

Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, and regular expressions. Also provides functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
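As a rough illustration of the consistent interface described above, the sketch below calls a few of the package's exported functions (tokenize_words, tokenize_ngrams, count_words) on a single string. The sample sentence and the variable names are made up for this example; it assumes the package is already installed and relies on the default arguments (lowercasing, punctuation stripping).

```r
# Minimal sketch: assumes the 'tokenizers' package is installed.
library(tokenizers)

# Hypothetical sample input, chosen only for illustration.
text <- "The quick brown fox."

# Each tokenizer takes a character vector and returns a list
# with one element per input document.
words   <- tokenize_words(text)          # lowercased, punctuation stripped by default
bigrams <- tokenize_ngrams(text, n = 2)  # shingled n-grams of exactly two words

# Counting helpers return one count per input document.
n_words <- count_words(text)

print(words[[1]])
print(bigrams[[1]])
print(n_words)
```

Because every tokenizer returns a list indexed by document, the same downstream code can usually be reused when switching from words to n-grams or another token type.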

Version:	0.3.0
Depends:	R (≥ 3.1.3)
Imports:	stringi (≥ 1.0.1), Rcpp (≥ 0.12.3), SnowballC (≥ 0.5.1)
LinkingTo:	Rcpp
Suggests:	covr, knitr, rmarkdown, stopwords (≥ 0.9.0), testthat
Published:	2022-12-22
DOI:	10.32614/CRAN.package.tokenizers
Author:	Lincoln Mullen [aut, cre], Os Keyes [ctb], Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb], Kenneth Benoit [ctb]
Maintainer:	Lincoln Mullen <lincoln at lincolnmullen.com>
BugReports:	https://github.com/ropensci/tokenizers/issues
License:	MIT + file LICENSE
URL:	https://docs.ropensci.org/tokenizers/, https://github.com/ropensci/tokenizers
NeedsCompilation:	yes
Citation:	tokenizers citation info
Materials:	README, NEWS
In views:	NaturalLanguageProcessing
CRAN checks:	tokenizers results

Documentation:

Reference manual:	tokenizers.html, tokenizers.pdf
Vignettes:	Introduction to the tokenizers Package (source, R code); The Text Interchange Formats and the tokenizers Package (source, R code)

Downloads:

Package source:	tokenizers_0.3.0.tar.gz
Windows binaries:	r-devel: tokenizers_0.3.0.zip, r-release: tokenizers_0.3.0.zip, r-oldrel: tokenizers_0.3.0.zip
macOS binaries:	r-release (arm64): tokenizers_0.3.0.tgz, r-oldrel (arm64): tokenizers_0.3.0.tgz, r-release (x86_64): tokenizers_0.3.0.tgz, r-oldrel (x86_64): tokenizers_0.3.0.tgz
Old sources:	tokenizers archive

Reverse dependencies:

Reverse imports:	blocking, covfefe, deeplr, DeepPINCS, pdfsearch, pkgmatch, proustr, rslp, rwig, textrecipes, tidytext, ttgsea, wactor, WhatsR
Reverse suggests:	edgarWebR, llmshieldr, sumup, torchdatasets
Reverse enhances:	quanteda

Linking:

Please use the canonical form https://CRAN.R-project.org/package=tokenizers to link to this page.
