textrecipes contains extra steps for the recipes package for preprocessing text data.
You can install the released version of textrecipes from CRAN with:
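For a package published on CRAN, this is the standard `install.packages()` call, shown here as a minimal sketch:

```r
# Install the released version of textrecipes from CRAN.
install.packages("textrecipes")
```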
Install the development version from GitHub with:
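A minimal sketch for the development version. It assumes that the remotes package is used as the installer and that the source lives in the tidymodels/textrecipes repository on GitHub; neither is stated in the text above.

```r
# Assumption: remotes is the installer; uncomment to install it first.
# install.packages("remotes")

# Assumption: the development repository is tidymodels/textrecipes.
remotes::install_github("tidymodels/textrecipes")
```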
The following example walks through the steps needed to convert a character variable to the TF-IDF of its tokenized words, after removing stop words and keeping only the 10 most frequent tokens. The preprocessing is applied to the variables medium and artist.
library(recipes)
library(textrecipes)
library(modeldata)
data("tate_text")
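okc_rec <- recipe(~ medium + artist, data = tate_text) %>%
  step_tokenize(medium, artist) %>% # split each text column into tokens
  step_stopwords(medium, artist) %>% # drop stop words
  step_tokenfilter(medium, artist, max_tokens = 10) %>% # keep the 10 most frequent tokens
  step_tfidf(medium, artist) # weight the remaining tokens by TF-IDF
okc_obj <- okc_rec %>%
  prep() # estimate the steps on the training data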
str(bake(okc_obj, tate_text))
#> tibble [4,284 × 20] (S3: tbl_df/tbl/data.frame)
#> $ tfidf_medium_colour : num [1:4284] 2.31 0 0 0 0 ...
#> $ tfidf_medium_etching : num [1:4284] 0 0.86 0.86 0.86 0 ...
#> $ tfidf_medium_gelatin : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_medium_lithograph : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_medium_paint : num [1:4284] 0 0 0 0 2.35 ...
#> $ tfidf_medium_paper : num [1:4284] 0 0.422 0.422 0.422 0 ...
#> $ tfidf_medium_photograph : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_medium_print : num [1:4284] 0 0 0 0 0 ...
#> $ tfidf_medium_screenprint: num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_medium_silver : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_akram : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_beuys : num [1:4284] 0 0 0 0 0 ...
#> $ tfidf_artist_ferrari : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_john : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_joseph : num [1:4284] 0 0 0 0 0 ...
#> $ tfidf_artist_león : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_richard : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_schütte : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_thomas : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tfidf_artist_zaatari : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
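To make the four steps in the recipe more concrete, here is a rough base R sketch of the same idea on a toy corpus: tokenize, drop stop words, keep the most frequent tokens, then compute TF-IDF. The corpus, the stop word list, and the textbook weighting `tf * log(N / df)` are all assumptions made for illustration. `step_tfidf()` has its own options (such as smoothing and normalization), so its numbers will generally differ.

```r
# Toy corpus standing in for a text column (hypothetical data).
docs <- c(
  "etching on paper",
  "oil paint on canvas",
  "etching and aquatint on paper"
)

# 1. Tokenize: lower-case, then split on runs of non-letters.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# 2. Remove stop words (a tiny hand-picked list, for illustration only).
stop_words <- c("on", "and")
tokens <- lapply(tokens, function(tok) tok[!tok %in% stop_words])

# 3. Keep only the `max_tokens` most frequent tokens across the corpus.
max_tokens <- 2
freq <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- names(freq)[seq_len(min(max_tokens, length(freq)))]

# 4. Count the retained tokens in each document (rows = documents).
counts <- do.call(rbind, lapply(tokens, function(tok) {
  as.integer(table(factor(tok, levels = vocab)))
}))
colnames(counts) <- vocab

# 5. TF-IDF: within-document term frequency times log inverse document frequency.
tf <- counts / pmax(rowSums(counts), 1) # pmax() avoids 0/0 for documents with no retained tokens
idf <- log(nrow(counts) / colSums(counts > 0))
tfidf <- sweep(tf, 2, idf, `*`)

print(round(tfidf, 3))
```

As in the baked output above, each retained token becomes one numeric column, and a document that contains none of the retained tokens gets a row of zeros.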
As of version 0.4.0, step_lda() no longer accepts character variables and instead takes tokenlist variables. A recipe that previously passed a character variable directly to step_lda() can be replaced with the following recipe to achieve the same results:
lda_tokenizer <- function(x) text2vec::word_tokenizer(tolower(x))
recipe(~text_var, data = data) %>%
  step_tokenize(text_var,
    custom_token = lda_tokenizer
  ) %>%
  step_lda(text_var)
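The tokenizer passed as `custom_token` is, as far as the `step_tokenize()` documentation describes it, a function that takes a character vector and returns a list of character vectors, one per input element. The sketch below shows that contract with a hypothetical base R helper, `simple_tokenizer()`. It is not part of textrecipes or text2vec and may split text differently from `text2vec::word_tokenizer()`.

```r
# Hypothetical stand-in for a custom tokenizer: lower-case the text and
# split on runs of non-alphanumeric characters.
simple_tokenizer <- function(x) {
  pieces <- strsplit(tolower(x), "[^[:alnum:]]+")
  # Drop empty strings left behind by leading punctuation.
  lapply(pieces, function(tok) tok[nzchar(tok)])
}

# One list element per input string, each holding that string's tokens.
tokens <- simple_tokenizer(c("Oil paint on canvas", "Etching; aquatint"))
str(tokens)
```

A function of this shape could be supplied in place of `lda_tokenizer` when text2vec is not available, at the cost of possibly different token boundaries.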
This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
For questions and discussions about tidymodels packages, modeling, and machine learning, please post on RStudio Community.
If you think you have encountered a bug, please submit an issue.
Either way, learn how to create and share a reprex (a minimal, reproducible example), to clearly communicate about your code.
Check out further details on contributing guidelines for tidymodels packages and how to get help.