In this vignette, we use the gutenbergr package to download “Little Women” by Louisa May Alcott, and Monkeylearn public modules to learn a bit about its contents without reading it.
Note that the Monkeylearn modules we use here were not tested on books, so the results are not optimal.
library("monkeylearn")
library("gutenbergr")
library("dplyr")
little_women <- gutenberg_download(c(514),
                                   meta_fields = "title")
We will now use the tidytext package to split the text into whole paragraphs.
We then paste the paragraphs together into one string and split that string on the word “chapter” to get a reasonable number of text fragments. We hope to send each fragment to the Monkeylearn API as a single text, which is possible if each string is smaller than 50kB.
library("tidytext")
# Tokenize into paragraphs, then collapse everything into a single string
little_women <- little_women %>%
  unnest_tokens(paragraph, text, token = "paragraphs") %>%
  summarize(whole_text = paste(paragraph, collapse = " "))
# Split the string wherever "chapter" or "Chapter" appears
chapters <- strsplit(little_women$whole_text, "[Cc]hapter")[[1]]
little_women_chapters <- tibble::tibble(
  chapter = seq_along(chapters),
  text = chapters
)
all(nchar(little_women_chapters$text, type = "bytes") < 50000)
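The split-and-check step above can be tried on a small string before running it on the whole book. The following is only a sketch: `toy_text` is made-up content, and `max_bytes` simply restates the 50kB limit mentioned above as a variable.

```r
# Toy text standing in for the book (hypothetical content)
toy_text <- "Preface. Chapter one. Playing pilgrims. chapter two. A merry Christmas."

# Split on "Chapter" or "chapter"; the matched word itself is dropped
fragments <- strsplit(toy_text, "[Cc]hapter")[[1]]

# Size of each fragment in bytes, which is what matters for the size limit
fragment_bytes <- nchar(fragments, type = "bytes")

# Limit taken from the text above
max_bytes <- 50000

# Indices of fragments that would be too large to send as a single text
oversized <- which(fragment_bytes >= max_bytes)

length(fragments)
length(oversized)
```

If `oversized` were not empty, those fragments would need to be split further before being submitted.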
All chapters have the right size to be sent to the API. The API accepts 20 texts per call, but monkeylearn
functions split a vector of texts automatically, so we can submit the whole vector little_women_chapters$text
without further ado.
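To give an idea of what splitting a vector of texts into calls of at most 20 texts looks like, here is a minimal base R sketch. It is not the code used inside the monkeylearn package; `texts`, `batch_size` and `batch_id` are hypothetical names chosen for illustration.

```r
# 45 placeholder texts (hypothetical input)
texts <- paste("text", seq_len(45))

# Maximum number of texts per call, as stated above
batch_size <- 20

# Assign each text to a batch: 1 for texts 1-20, 2 for texts 21-40, and so on
batch_id <- ceiling(seq_along(texts) / batch_size)

# List with one character vector per batch
batches <- split(texts, batch_id)

# Number of texts in each batch
lengths(batches)
```

Each element of `batches` could then be sent in its own request, and the results bound back together in the original order.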
A first question we could ask ourselves about the book is who its main characters are, and where it takes place.
entities <- monkeylearn_extract(request = little_women_chapters$text,
                                extractor_id = "ex_isnnZRbS",
                                verbose = TRUE)
entities %>%
  group_by(entity, tag) %>%
  summarize(n_occurences = n()) %>%
  arrange(desc(n_occurences)) %>%
  filter(n_occurences > 5) %>%
  knitr::kable()
keywords <- monkeylearn_extract(request = little_women_chapters$text,
                                extractor_id = "ex_y7BPYzNG",
                                params = list(max_keywords = 3))
keywords %>%
  group_by(keyword) %>%
  summarize(n_occurences = sum(count)) %>%
  arrange(desc(n_occurences)) %>%
  filter(n_occurences > 10) %>%
  knitr::kable()
Interestingly, the keyword extraction is better than the entity extraction at finding who the main characters are (yes, I have read the book).
In this table, the number of occurrences is the total count for the keyword in the book.
topics <- monkeylearn_classify(little_women_chapters$text,
                               classifier_id = "cl_5icAVzKR")
topics %>%
  group_by(label) %>%
  summarize(n_occurences = n()) %>%
  filter(n_occurences > 1) %>%
  arrange(desc(n_occurences)) %>%
  knitr::kable()
Here, the number of occurrences is the number of times the topic appears in the classification results.
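The tables in this vignette count occurrences in two different ways: `n()` counts rows of the results, while `sum(count)` adds up a per-row count. The sketch below shows the difference on a toy data frame. The values and the `text_id` column are invented for illustration and are not real output from the API.

```r
library("dplyr")

# Toy stand-in for keyword extraction results (hypothetical values)
toy_keywords <- data.frame(
  text_id = c(1, 1, 2, 2, 3),
  keyword = c("jo", "meg", "jo", "amy", "jo"),
  count   = c(12, 7, 9, 4, 15)
)

# n_rows: number of result rows containing the keyword
# total_count: sum of the per-row counts for the keyword
toy_summary <- toy_keywords %>%
  group_by(keyword) %>%
  summarize(n_rows = n(), total_count = sum(count)) %>%
  arrange(desc(total_count))

toy_summary
```

In this toy example, "jo" appears in three rows but has a total count of 36, so the two ways of counting can rank and filter keywords quite differently.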
In summary, these three modules reminded me of the book and of the movie, but I doubt I could have talked about the book using only these results without having read it.