tokenizers: Fast, Consistent Tokenization of Natural Language Text

Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
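
The consistent interface described above can be illustrated with a minimal sketch. It assumes the package is already installed (for example via `install.packages("tokenizers")`) and uses a made-up sample string; the function names `tokenize_words`, `tokenize_ngrams`, `tokenize_sentences`, and `count_words` are taken from the package's documented API, while the variable names are illustrative only.

```r
# Minimal sketch; assumes the tokenizers package is installed.
library(tokenizers)

# Hypothetical sample input, not taken from the package documentation
text <- "The quick brown fox jumps. It lands softly."

# Word tokens: lowercased with punctuation stripped by default
words <- tokenize_words(text)

# Shingled (overlapping) word n-grams, here bigrams
bigrams <- tokenize_ngrams(text, n = 2)

# Sentence tokens: case and punctuation are kept by default
sentences <- tokenize_sentences(text)

# Word count for each input document
n_words <- count_words(text)

# Each tokenizer returns a list with one element per input document
print(words[[1]])
print(bigrams[[1]])
print(sentences[[1]])
print(n_words)
```

Because every tokenizer takes a character vector and returns a list of the same length, the calls can be swapped for one another without changing the surrounding code.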

Version: 0.3.0
Depends: R (≥ 3.1.3)
Imports: stringi (≥ 1.0.1), Rcpp (≥ 0.12.3), SnowballC (≥ 0.5.1)
LinkingTo: Rcpp
Suggests: covr, knitr, rmarkdown, stopwords (≥ 0.9.0), testthat
Published: 2022-12-22
Author: Lincoln Mullen [aut, cre], Os Keyes [ctb], Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb], Kenneth Benoit [ctb]
Maintainer: Lincoln Mullen <lincoln at lincolnmullen.com>
BugReports: https://github.com/ropensci/tokenizers/issues
License: MIT + file LICENSE
URL: https://docs.ropensci.org/tokenizers/, https://github.com/ropensci/tokenizers
NeedsCompilation: yes
Citation: tokenizers citation info
Materials: README, NEWS
In views: NaturalLanguageProcessing
CRAN checks: tokenizers results

Documentation:

Reference manual: tokenizers.pdf
Vignettes: Introduction to the tokenizers Package; The Text Interchange Formats and the tokenizers Package

Downloads:

Package source: tokenizers_0.3.0.tar.gz
Windows binaries: r-devel: tokenizers_0.3.0.zip, r-release: tokenizers_0.3.0.zip, r-oldrel: tokenizers_0.3.0.zip
macOS binaries: r-release (arm64): tokenizers_0.3.0.tgz, r-oldrel (arm64): tokenizers_0.3.0.tgz, r-release (x86_64): tokenizers_0.3.0.tgz, r-oldrel (x86_64): tokenizers_0.3.0.tgz
Old sources: tokenizers archive

Reverse dependencies:

Reverse imports: covfefe, deeplr, DeepPINCS, DramaAnalysis, pdfsearch, proustr, rslp, textrecipes, tidypmc, tidytext, ttgsea, wactor, WhatsR
Reverse suggests: cwbtools, edgarWebR, torchdatasets
Reverse enhances: quanteda

Linking:

Please use the canonical form https://CRAN.R-project.org/package=tokenizers to link to this page.
