- […] tokenize_tweets() function, which is no longer supported.
- […] tokenize_ptb() function for Penn Treebank tokenizations (@jrnold) (#12).
- […] chunk_text() to split long documents into pieces (#30).
- […] tokenize_tweets() preserves usernames, hashtags, and URLs (@kbenoit) (#44).
- […] stopwords() function has been removed in favor of using the stopwords package (#46).
- […] tif package. (#49)
- tokenize_skip_ngrams has been improved to generate unigrams and bigrams, according to the skip definition (#24).
- […] tokenizers supports (@ironholds) (#26).
- tokenize_skip_ngrams now supports stopwords (#31).
- […] NA consistently (#33).
- tokenize_words() gains arguments to preserve or strip punctuation and numbers (#48).
- […] tokenize_skip_ngrams() and tokenize_ngrams() to return properly marked UTF8 strings on Windows (@patperry) (#58).
- tokenize_tweets() now removes stopwords prior to stripping punctuation, making its behavior more consistent with tokenize_words() (#76).
- […] tokenize_character_shingles() tokenizer.
- […] tokenize_words() and tokenize_word_stems().