The digital library Wikisource, a sister project of Wikipedia, hosts public domain books in almost every language. More than 100,000 books are accessible in English, Spanish, French, German, Russian, and Chinese.
The wikisourcer R package helps you download any book or page from Wikisource. The text is returned as a tidy data frame, so it can be analyzed within the tidyverse ecosystem.
To download Voltaire’s philosophical novel Candide, simply paste the URL of its table of contents into the wikisource_book function. Note that the book is already split into chapters through the page variable.
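A minimal call looks like this, using the English table of contents (the same URL is reused below):
library(wikisourcer)
wikisource_book("https://en.wikisource.org/wiki/Candide")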
## # A tibble: 894 x 5
## text page language url title
## <chr> <int> <chr> <chr> <chr>
## 1 "" 1 en https://en.wikisour… Cand…
## 2 In the country of Westphalia… 1 en https://en.wikisour… Cand…
## 3 "The Baron was one of the mo… 1 en https://en.wikisour… Cand…
## 4 My Lady Baroness, who weighe… 1 en https://en.wikisour… Cand…
## 5 Master Pangloss taught the m… 1 en https://en.wikisour… Cand…
## 6 "\"It is demonstrable,\" sai… 1 en https://en.wikisour… Cand…
## 7 Candide listened attentively… 1 en https://en.wikisour… Cand…
## 8 One day when Miss Cunegund w… 1 en https://en.wikisour… Cand…
## 9 On her way back she happened… 1 en https://en.wikisour… Cand…
## 10 "" 1 en https://en.wikisour… Cand…
## # ... with 884 more rows
Multiple books can easily be downloaded using the purrr package. For example, we can download Candide in French, English, Spanish, and Italian.
library(purrr)
fr <- "https://fr.wikisource.org/wiki/Candide,_ou_l%E2%80%99Optimisme/Garnier_1877"
en <- "https://en.wikisource.org/wiki/Candide"
es <- "https://es.wikisource.org/wiki/C%C3%A1ndido,_o_el_optimismo"
it <- "https://it.wikisource.org/wiki/Candido"
urls <- c(fr, en, es, it)
candide <- purrr::map_df(urls, wikisource_book)
Before analyzing the text, we should remove the remaining Wikisource metadata.
library(stringr)
library(dplyr)
candide_cleaned <- candide %>%
  filter(!str_detect(text, "CHAPITRE|↑")) %>% # clean French
  filter(!str_detect(text, "CAPITULO")) %>% # clean Spanish
  filter(!str_detect(text, "../|IncludiIntestazione|Romanzi|^\\d+")) # clean Italian
We can now compare the number of words in each chapter by language.
library(tidytext)
library(ggplot2)
candide_cleaned %>%
  tidytext::unnest_tokens(word, text) %>%
  count(page, language, sort = TRUE) %>%
  ggplot(aes(x = as.factor(page), y = n, fill = language)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(x = "chapter", y = "number of words",
       title = "Multilingual text analysis of Voltaire's Candide")
The wikisource_book function sometimes fails. This happens when the main URL path differs from the paths of the linked pages. The issue can easily be worked around with the wikisource_page function.
The wikisource_page function takes two arguments: the Wikisource URL and an optional title for the page. For example, we can download William Shakespeare’s Sonnet 18.
library(wikisourcer)
wikisource_page("https://en.wikisource.org/wiki/Sonnet_18_(Shakespeare)", "Sonnet 18")
## # A tibble: 26 x 4
## text page language url
## <chr> <chr> <chr> <chr>
## 1 "" Sonnet… en https://en.wikisource.org/…
## 2 "" Sonnet… en https://en.wikisource.org/…
## 3 Shall I compare thee to a… Sonnet… en https://en.wikisource.org/…
## 4 Thou art more lovely and … Sonnet… en https://en.wikisource.org/…
## 5 Rough winds do shake the … Sonnet… en https://en.wikisource.org/…
## 6 And summer's lease hath a… Sonnet… en https://en.wikisource.org/…
## 7 Sometime too hot the eye … Sonnet… en https://en.wikisource.org/…
## 8 And often is his gold com… Sonnet… en https://en.wikisource.org/…
## 9 And every fair from fair … Sonnet… en https://en.wikisource.org/…
## 10 By chance, or nature's ch… Sonnet… en https://en.wikisource.org/…
## # ... with 16 more rows
Let’s try to download William Shakespeare’s 154 sonnets using wikisource_book:
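wikisource_book("https://en.wikisource.org/wiki/The_Sonnets")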
## Warning in wikisource_book("https://en.wikisource.org/wiki/The_Sonnets"):
## Could not download a book at https://en.wikisource.org/wiki/The_Sonnets
## # A tibble: 0 x 1
## # ... with 1 variable: title <chr>
The download failed because the main wiki URL, wiki/The_Sonnets, differs from the wiki path of the individual sonnet pages, wiki/Sonnet_.
We have to use the wikisource_page function instead to download the 154 sonnets. Note that the base R function paste0 is very useful for building the list of URLs. We also use paste0 to name the pages via the second argument of wikisource_page.
urls <- paste0("https://en.wikisource.org/wiki/Sonnet_", 1:154, "_(Shakespeare)") #154 urls
sonnets <- purrr::map2_df(urls, paste0("Sonnet ", 1:154), wikisource_page)
sonnets
## # A tibble: 3,275 x 4
## text page language url
## <chr> <chr> <chr> <chr>
## 1 "" Sonnet… en https://en.wikisource.org…
## 2 "" Sonnet… en https://en.wikisource.org…
## 3 From fairest creatures we … Sonnet… en https://en.wikisource.org…
## 4 That thereby beauty's rose… Sonnet… en https://en.wikisource.org…
## 5 But as the riper should by… Sonnet… en https://en.wikisource.org…
## 6 His tender heir might bear… Sonnet… en https://en.wikisource.org…
## 7 But thou, contracted to th… Sonnet… en https://en.wikisource.org…
## 8 Feed'st thy light's flame … Sonnet… en https://en.wikisource.org…
## 9 Making a famine where abun… Sonnet… en https://en.wikisource.org…
## 10 Thyself thy foe, to thy sw… Sonnet… en https://en.wikisource.org…
## # ... with 3,265 more rows
We can now run a text similarity analysis: which sonnets are closest to each other in terms of the words they use?
library(widyr)
library(SnowballC)
library(igraph)
library(ggraph)
sonnets_similarity <- sonnets %>%
  filter(!str_detect(text, "public domain|Public domain")) %>% # clean licensing notes
  tidytext::unnest_tokens(word, text) %>%
  anti_join(tidytext::get_stopwords("en"), by = "word") %>%
  anti_join(tibble(word = c("thy", "thou", "thee")), by = "word") %>% # early Modern English stopwords
  mutate(wordStem = SnowballC::wordStem(word)) %>% # stemming
  count(page, wordStem) %>%
  widyr::pairwise_similarity(page, wordStem, n) %>%
  filter(similarity > 0.3)
# themes by sonnet
theme <- tibble(page = unique(sonnets$page),
                theme = c(rep("Procreation", times = 17), rep("Fair Youth", times = 60),
                          rep("Rival Poet", times = 9), rep("Fair Youth", times = 12),
                          rep("Irregular", times = 1), rep("Fair Youth", times = 26),
                          rep("Irregular", times = 1), rep("Dark Lady", times = 28))) %>%
  filter(page %in% sonnets_similarity$item1 |
           page %in% sonnets_similarity$item2)
set.seed(1234)
sonnets_similarity %>%
  graph_from_data_frame(vertices = theme) %>%
  ggraph() +
  geom_edge_link(aes(edge_alpha = similarity)) +
  geom_node_point(aes(color = theme), size = 3) +
  geom_node_text(aes(label = name), size = 3.5, check_overlap = TRUE, vjust = 1) +
  theme_void() +
  labs(title = "Shakespeare's sonnets closest to each other in terms of words used")
The wikisourcer functions also work with other wiki-based websites. For example, the website Bibliowiki hosts texts and images that are in the public domain under Canadian copyright law, which gives us access to George Orwell’s novel Nineteen Eighty-Four.
Let’s run a sentiment analysis of Orwell’s dystopian novel 1984.
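The orwell data frame used below comes from pointing wikisource_book at the novel’s table of contents on Bibliowiki; the exact URL shown here is our assumption and may need checking:
orwell <- wikisource_book("https://biblio.wiki/wiki/Nineteen_Eighty-Four") # assumed Bibliowiki URL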
library(tidyr)
orwell_sent <- orwell %>%
  filter(page != 25) %>% # remove appendix page
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  anti_join(get_stopwords("en"), by = "word") %>%
  count(page, sentiment) %>%
  spread(key = sentiment, value = n) %>%
  mutate(sentiment = positive - negative)
ggplot(orwell_sent, aes(page, sentiment)) +
  geom_col() +
  geom_smooth(method = "loess", se = FALSE) +
  scale_x_continuous(breaks = 1:24) +
  theme_minimal() +
  labs(title = "Sentiment Analysis of Orwell's 1984",
       subtitle = "Difference between positive and negative words, by chapter",
       x = "chapter", y = "sentiment score")
The overall negative sentiment score plainly reflects the novel’s dark and pessimistic tone.