- Documentation for `tidy()` methods for all steps has been improved to describe the return value more accurately. (#262)
- Calling `?tidy.step_*()` now sends you to the documentation for `step_*()`, where the outcome is documented. (#261)
- `step_textfeatures()` has been made faster and more robust. (#265)
- Fixed bug in `step_clean_levels()` where it would produce NAs for character columns. (#274)
- textfeatures has been removed from Suggests. (#255)
- `step_textfeatures()` no longer returns a politeness feature. (#254)
- `step_untokenize()` and `step_normalization()` now return factors instead of strings. (#247)
- `step_clean_names()` now throws an informative error if needed non-standard role columns are missing during `bake()`. (#235)
- The `keep_original_cols` argument has been added to `step_tokenmerge()`. This change should mean that every step that produces new columns has the `keep_original_cols` argument. (#242)
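  A minimal sketch of the new argument (the `title`/`abstract` columns are made up for illustration):

  ```r
  library(recipes)
  library(textrecipes)

  df <- tibble::tibble(
    title    = c("deep learning", "linear models"),
    abstract = c("neural networks for text", "regression with penalties")
  )

  rec <- recipe(~ ., data = df) |>
    step_tokenize(title, abstract) |>
    # keep_original_cols = TRUE retains the individual token columns
    # alongside the merged `tokenmerge` column
    step_tokenmerge(title, abstract, keep_original_cols = TRUE)

  bake(prep(rec), new_data = NULL)
  ```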
- Many internal changes to improve consistency and provide slight speed increases.
- Fixed bug where `step_dummy_hash()` and `step_texthash()` would add new columns before old columns. (#235)
- Fixed bug where `vocabulary_size` wasn't tunable in `step_tokenize_bpe()`. (#239)
- Steps with tunable arguments now have those arguments listed in the documentation.
- All steps that add new columns now throw an informative error if a name collision occurs.
- Fixed bug where `step_tf()` wasn't tunable for the `weight` argument.
- Setting `token = "tweets"` in `step_tokenize()` has been deprecated because `tokenizers::tokenize_tweets()` is deprecated. (#209)
- `step_sequence_onehot()`, `step_dummy_hash()`, and `step_dummy_texthash()` now return integers. `step_tf()` returns integers when `weight_scheme` is `"binary"` or `"raw count"`.
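  For example, with a made-up `text` column, the term-frequency columns now come back as integers:

  ```r
  library(recipes)
  library(textrecipes)

  df <- tibble::tibble(text = c("a b b", "a c"))

  rec <- recipe(~ text, data = df) |>
    step_tokenize(text) |>
    # with weight_scheme "binary" or "raw count" the tf_* columns
    # are integers instead of doubles
    step_tf(text, weight_scheme = "raw count")

  bake(prep(rec), new_data = NULL)
  ```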
- All steps now have `required_pkgs()` methods.
- Examples no longer include `if (require(...))` code.
- Removed use of okc_text in vignette.
- Fix bug in printing of tokenlists.
- `step_tfidf()` now correctly saves the idf values and applies them to the testing data set.
- `tidy.step_tfidf()` now returns calculated IDF weights.
- `step_dummy_hash()` generates binary indicators (possibly signed) from simple factor or character vectors.
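  A minimal sketch with a made-up `animal` column:

  ```r
  library(recipes)
  library(textrecipes)

  df <- tibble::tibble(animal = c("cat", "dog", "cat", "bird"))

  rec <- recipe(~ animal, data = df) |>
    # hashes the factor/character values into num_terms indicator
    # columns; signed = TRUE allows values of -1 as well as 0/1
    step_dummy_hash(animal, num_terms = 8, signed = TRUE)

  bake(prep(rec), new_data = NULL)
  ```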
- `step_tokenize()` has gotten a couple of cousin functions: `step_tokenize_bpe()`, `step_tokenize_sentencepiece()`, and `step_tokenize_wordpiece()`, which wrap {tokenizers.bpe}, {sentencepiece}, and {wordpiece} respectively (#147).
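  For instance, a recipe can now learn a byte-pair-encoding vocabulary directly; a sketch assuming {tokenizers.bpe} is installed, with a small `vocabulary_size` chosen only for this toy corpus:

  ```r
  library(recipes)
  library(textrecipes)

  df <- tibble::tibble(text = c("This is a sentence.", "Another sentence here."))

  rec <- recipe(~ text, data = df) |>
    # learns a byte-pair-encoding vocabulary from the training text
    step_tokenize_bpe(text, vocabulary_size = 100) |>
    step_tf(text)

  bake(prep(rec), new_data = NULL)
  ```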
- Added `all_tokenized()` and `all_tokenized_predictors()` to more easily select tokenized columns (#132).
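  For example, the selectors let downstream steps pick up every token column without naming each one; a sketch with a made-up outcome `y`:

  ```r
  library(recipes)
  library(textrecipes)

  df <- tibble::tibble(
    text = c("hello world", "goodbye all", "hello again"),
    y    = c(1, 2, 3)
  )

  rec <- recipe(y ~ text, data = df) |>
    step_tokenize(text) |>
    # selects every tokenized predictor, however many there are
    step_tokenfilter(all_tokenized_predictors(), max_tokens = 4) |>
    step_tf(all_tokenized_predictors())
  ```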
- Use `show_tokens()` to more easily debug a recipe involving tokenization.
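  A quick sketch of the helper; it shows the tokens a recipe produces for a given column:

  ```r
  library(recipes)
  library(textrecipes)

  df <- tibble::tibble(text = c("The quick brown fox", "jumps over the lazy dog"))

  # displays the tokens produced for each row, so the effect of the
  # tokenization steps can be inspected before prep()/bake()
  recipe(~ text, data = df) |>
    step_tokenize(text) |>
    show_tokens(text)
  ```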
- Reorganize documentation for all recipe step `tidy` methods (#126).
- Steps now have a dedicated subsection detailing what happens when `tidy()` is applied. (#163)
- All recipe steps now officially support empty selections to be more aligned with dplyr and other packages that use tidyselect (#141).
- `step_ngram()` has been given a speed increase to put it in line with the performance of other packages.
- `step_tokenize()` will now try to error if vocabulary size is too low when using `engine = "tokenizers.bpe"` (#119).
- Warning given by `step_tokenfilter()` when filtering failed to apply now correctly refers to the right argument name (#137).
- `step_tf()` now returns 0 instead of NaN when there aren't any tokens present (#118).
- `step_tokenfilter()` now has a new argument `filter_fun`, which takes a function that can be used to filter tokens. (#164)
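  A sketch of the new argument; the filtering function here (keep only tokens longer than four characters) is just an illustration:

  ```r
  library(recipes)
  library(textrecipes)

  df <- tibble::tibble(text = c("a tiny example of token filtering"))

  rec <- recipe(~ text, data = df) |>
    step_tokenize(text) |>
    # the function takes a character vector of tokens and returns a
    # logical vector of the same length; TRUE values are kept
    step_tokenfilter(text, filter_fun = function(x) nchar(x) > 4)

  bake(prep(rec), new_data = NULL)
  ```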
- `tidy.step_stem()` now correctly shows whether a custom stemmer was used.
- Added `keep_original_cols` argument to `step_lda()`, `step_texthash()`, `step_tf()`, `step_tfidf()`, `step_word_embeddings()`, `step_dummy_hash()`, `step_sequence_onehot()`, and `step_textfeatures()` (#139).
- The `prefix` argument now creates names according to the pattern `prefix_variablename_name/number`. (#124)
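  For example, `step_tf()` with its default prefix `"tf"` applied to a column named `text` yields names of the form `tf_text_<token>` (a sketch with made-up data):

  ```r
  library(recipes)
  library(textrecipes)

  df <- tibble::tibble(text = c("a b", "b c"))

  rec <- recipe(~ text, data = df) |>
    step_tokenize(text) |>
    step_tf(text)

  # column names follow prefix_variablename_name:
  # "tf_text_a" "tf_text_b" "tf_text_c"
  names(bake(prep(rec), new_data = NULL))
  ```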
- Fixed bug in `step_tokenfilter()` and `step_sequence_onehot()` that sometimes caused crashes in R 4.1.0.
- `step_lda()` now takes a tokenlist instead of a character variable. See readme for more detail.
- `step_sequence_onehot()` now takes tokenlists as input.
- `step_tokenize()`.
- Added `step_clean_names()` and `step_clean_levels()`. (#101)
- `step_ngram()` gained an argument `min_num_tokens` to be able to return multiple n-grams together. (#90)
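  A sketch: with `min_num_tokens = 1` and `num_tokens = 3`, unigrams, bigrams, and trigrams are all returned together:

  ```r
  library(recipes)
  library(textrecipes)

  df <- tibble::tibble(text = c("a b c d"))

  rec <- recipe(~ text, data = df) |>
    step_tokenize(text) |>
    # returns all n-grams from length 1 up to length 3
    step_ngram(text, min_num_tokens = 1, num_tokens = 3)

  bake(prep(rec), new_data = NULL)
  ```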
- Added `step_text_normalization()` to perform Unicode normalization on character vectors. (#86)
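  For example, assuming the default NFC normalization form, two different Unicode spellings of "café" become identical after the step:

  ```r
  library(recipes)
  library(textrecipes)

  # precomposed "é" versus "e" + combining acute accent;
  # both render identically but compare as different strings
  df <- tibble::tibble(text = c("caf\u00e9", "cafe\u0301"))

  rec <- recipe(~ text, data = df) |>
    # applies Unicode normalization to the character column
    step_text_normalization(text)

  bake(prep(rec), new_data = NULL)
  ```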
- `step_word_embeddings()` got an argument `aggregation_default` to specify the value used in cases where no words match the embedding.
- `step_tokenize()` got an `engine` argument to specify packages other than tokenizers to use for tokenization.
- spacyr has been added as an engine to `step_tokenize()`.
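  A sketch of the `engine` argument; the `"spacyr"` engine additionally needs the spacyr package and a working spaCy installation, so only the recipe specification is shown:

  ```r
  library(recipes)
  library(textrecipes)

  df <- tibble::tibble(text = c("The quick brown fox."))

  # engine selects the tokenization backend instead of the
  # default tokenizers package
  rec <- recipe(~ text, data = df) |>
    step_tokenize(text, engine = "spacyr")
  ```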
- `step_lemma()` has been added to extract lemma attribute from tokenlists.
- `step_pos_filter()` has been added to allow filtering of tokens based on their part-of-speech tags.
- `step_ngram()` has been added to generate n-grams from tokenlists.
- `step_stem()` now correctly uses the `options` argument. (Thanks to @grayskripko for finding the bug, #64)
- `step_word2vec()` has been changed to `step_lda()` to reflect what is actually happening.
- `step_word_embeddings()` has been added. Allows for use of pre-trained word embeddings to convert token columns to vectors in a high-dimensional "meaning" space. (@jonthegeek, #20)
- `step_tfidf()` calculations are slightly changed due to a flaw in the original implementation: https://github.com/dselivanov/text2vec/issues/280.
- `step_textfeatures()` has been added; allows for multiple numerical features to be pulled from text.
- `step_sequence_onehot()` has been added; allows for one-hot encoding of sequences of fixed width.
- `step_word2vec()` has been added; calculates word2vec dimensions.
- `step_tokenmerge()` has been added; combines multiple list columns into one list column.
- `step_texthash()` now correctly accepts the `signed` argument.
- `step_tf()` and `step_tfidf()`.
- First CRAN version.