Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
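As a rough illustration of the consistent interface described above, the sketch below runs a few of the tokenizers and counting helpers on a short sample string. The sample text, the variable names, and the `doc_id` value are invented for this example. The argument choices reflect the documented interface as of version 0.3.0 and should be checked against the reference manual before relying on them.

```r
# Illustrative sketch only; assumes the tokenizers package is installed.
library(tokenizers)

# Hypothetical sample text, not taken from the package documentation.
text <- "The quick brown fox jumps over the lazy dog. It barked back!"

# Word tokens: by default lowercased, with punctuation stripped.
# Each tokenizer returns a list with one element per input document.
words <- tokenize_words(text)

# Shingled n-grams: every run of n consecutive words (here, bigrams).
bigrams <- tokenize_ngrams(text, n = 2)

# Skip n-grams: bigrams that may skip up to k intervening words.
skip_bigrams <- tokenize_skip_ngrams(text, n = 2, n_min = 2, k = 1)

# Sentence tokens keep their original case and punctuation by default.
sentences <- tokenize_sentences(text)

# Counting helpers return one count per input document.
n_words <- count_words(text)
n_sentences <- count_sentences(text)

# Split a longer text into documents of (at most) chunk_size words each.
chunks <- chunk_text(text, chunk_size = 5, doc_id = "example")

print(words[[1]])
print(bigrams[[1]])
print(sentences[[1]])
```

Because every tokenizer takes a character vector and returns a list of character vectors, the same calls work unchanged on a vector of many documents. Passing `simplify = TRUE` returns a plain character vector when the input is a single document.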
Version: | 0.3.0 |
Depends: | R (≥ 3.1.3) |
Imports: | stringi (≥ 1.0.1), Rcpp (≥ 0.12.3), SnowballC (≥ 0.5.1) |
LinkingTo: | Rcpp |
Suggests: | covr, knitr, rmarkdown, stopwords (≥ 0.9.0), testthat |
Published: | 2022-12-22 |
DOI: | 10.32614/CRAN.package.tokenizers |
Author: | Lincoln Mullen [aut, cre], Os Keyes [ctb], Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb], Kenneth Benoit [ctb] |
Maintainer: | Lincoln Mullen <lincoln at lincolnmullen.com> |
BugReports: | https://github.com/ropensci/tokenizers/issues |
License: | MIT + file LICENSE |
URL: | https://docs.ropensci.org/tokenizers/, https://github.com/ropensci/tokenizers |
NeedsCompilation: | yes |
Citation: | tokenizers citation info |
Materials: | README, NEWS |
In views: | NaturalLanguageProcessing |
CRAN checks: | tokenizers results |
Reference manual: | tokenizers.pdf |
Vignettes: | Introduction to the tokenizers Package, The Text Interchange Formats and the tokenizers Package |
Package source: | tokenizers_0.3.0.tar.gz |
Windows binaries: | r-devel: tokenizers_0.3.0.zip, r-release: tokenizers_0.3.0.zip, r-oldrel: tokenizers_0.3.0.zip |
macOS binaries: | r-release (arm64): tokenizers_0.3.0.tgz, r-oldrel (arm64): tokenizers_0.3.0.tgz, r-release (x86_64): tokenizers_0.3.0.tgz, r-oldrel (x86_64): tokenizers_0.3.0.tgz |
Old sources: | tokenizers archive |
Reverse imports: | covfefe, deeplr, DeepPINCS, DramaAnalysis, pdfsearch, proustr, rslp, textrecipes, tidypmc, tidytext, ttgsea, wactor, WhatsR |
Reverse suggests: | edgarWebR, torchdatasets |
Reverse enhances: | quanteda |
Please use the canonical form https://CRAN.R-project.org/package=tokenizers to link to this page.