The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. It includes tools for downloading books (and stripping header/footer information) as well as a complete dataset of Project Gutenberg metadata that can be used to find works of interest.
The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, the book “Frankenstein; or The Modern Prometheus”, by Mary Wollstonecraft Shelley, has ID 84 (the ID appears in the book's Project Gutenberg URL), so gutenberg_download(84) downloads this text.
library(dplyr)
library(gutenbergr)
frankenstein <- gutenberg_download(84)
frankenstein
## Source: local data frame [7,244 x 2]
##
## gutenberg_id text
## (int) (chr)
## 1 84 Frankenstein,
## 2 84
## 3 84 or the Modern Prometheus
## 4 84
## 5 84
## 6 84 by
## 7 84
## 8 84 Mary Wollstonecraft (Godwin) Shelley
## 9 84
## 10 84
## .. ... ...
Notice that the result is a tbl_df (a type of data frame) with two variables: gutenberg_id (useful if multiple books are returned) and text, a character vector with one row per line of the book. Notice also that the header and footer added by Project Gutenberg (visible in the raw file on the Project Gutenberg site) have been stripped away.
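Because each row is one line of text, ordinary dplyr verbs are enough for simple cleanup. The sketch below uses a small hand-made tibble (an assumption, standing in for the output of gutenberg_download() so that it runs without a network connection) to number the lines of each book and then drop the empty ones.

```r
library(dplyr)

# Toy stand-in for the output of gutenberg_download(): one row per line of text.
book <- tibble(
  gutenberg_id = 84L,
  text = c("Frankenstein,", "", "or the Modern Prometheus", "", "", "by")
)

non_blank <- book %>%
  group_by(gutenberg_id) %>%
  mutate(line = row_number()) %>%  # line number within each book
  ungroup() %>%
  filter(text != "")               # drop empty lines, keeping the numbering

non_blank
```

Numbering before filtering keeps the original line positions, which can be handy when you later want to locate a passage in the source text.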
Provide a vector of IDs to download multiple books. For example, to download Dracula (book 345) along with Frankenstein, do:
frankenstein_dracula <- gutenberg_download(c(84, 345), meta_fields = "title")
frankenstein_dracula
## Source: local data frame [22,812 x 3]
##
## gutenberg_id text
## (int) (chr)
## 1 84 Frankenstein,
## 2 84
## 3 84 or the Modern Prometheus
## 4 84
## 5 84
## 6 84 by
## 7 84
## 8 84 Mary Wollstonecraft (Godwin) Shelley
## 9 84
## 10 84
## .. ... ...
## title
## (chr)
## 1 Frankenstein; Or, The Modern Prometheus
## 2 Frankenstein; Or, The Modern Prometheus
## 3 Frankenstein; Or, The Modern Prometheus
## 4 Frankenstein; Or, The Modern Prometheus
## 5 Frankenstein; Or, The Modern Prometheus
## 6 Frankenstein; Or, The Modern Prometheus
## 7 Frankenstein; Or, The Modern Prometheus
## 8 Frankenstein; Or, The Modern Prometheus
## 9 Frankenstein; Or, The Modern Prometheus
## 10 Frankenstein; Or, The Modern Prometheus
## .. ...
Notice that the meta_fields argument allows us to add one or more additional fields from the gutenberg_metadata dataset to the downloaded text, such as title or author.
frankenstein_dracula %>%
  count(title)
## Source: local data frame [2 x 2]
##
## title n
## (chr) (int)
## 1 Dracula 15568
## 2 Frankenstein; Or, The Modern Prometheus 7244
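One way to think about meta_fields is as a join between the downloaded lines and the metadata table, keyed on gutenberg_id. The following is a conceptual sketch on toy data, not a description of the package's internals; text_lines and toy_metadata are hypothetical stand-ins for the downloaded text and for gutenberg_metadata.

```r
library(dplyr)

# Toy stand-ins (assumptions): a few downloaded lines and a two-row metadata table.
text_lines <- tibble(
  gutenberg_id = c(84L, 84L, 345L),
  text = c("Frankenstein,", "or the Modern Prometheus", "DRACULA")
)
toy_metadata <- tibble(
  gutenberg_id = c(84L, 345L),
  title = c("Frankenstein; Or, The Modern Prometheus", "Dracula"),
  author = c("Shelley, Mary Wollstonecraft", "Stoker, Bram")
)

# Keep only the requested field, then attach it to every line by ID.
with_title <- text_lines %>%
  left_join(select(toy_metadata, gutenberg_id, title), by = "gutenberg_id")

count(with_title, title)
```

The same pattern works after the fact: if you downloaded text without meta_fields, you can still join any metadata column on gutenberg_id.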
This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.
The dataset gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc.:
gutenberg_metadata
## Source: local data frame [51,877 x 8]
##
## gutenberg_id
## (int)
## 1 0
## 2 1
## 3 2
## 4 3
## 5 4
## 6 5
## 7 6
## 8 7
## 9 8
## 10 9
## .. ...
## title
## (chr)
## 1 NA
## 2 The Declaration of Independence of the United States of America
## 3 The United States Bill of Rights\r\nThe Ten Original Amendments to the Constitution of the United States
## 4 John F. Kennedy's Inaugural Address
## 5 Lincoln's Gettysburg Address\r\nGiven November 19, 1863 on the battlefield near Gettysburg, Pennsylvania, USA
## 6 The United States Constitution
## 7 Give Me Liberty or Give Me Death
## 8 The Mayflower Compact
## 9 Abraham Lincoln's Second Inaugural Address
## 10 Abraham Lincoln's First Inaugural Address
## .. ...
## Variables not shown: author (chr), gutenberg_author_id (int), language
## (chr), gutenberg_bookshelf (chr), rights (chr), has_text (lgl)
For example, you could find the Gutenberg ID of Wuthering Heights by doing:
gutenberg_metadata %>%
  filter(title == "Wuthering Heights")
## Source: local data frame [1 x 8]
##
## gutenberg_id title author gutenberg_author_id
## (int) (chr) (chr) (int)
## 1 768 Wuthering Heights Brontë, Emily 405
## language gutenberg_bookshelf
## (chr) (chr)
## 1 en Gothic Fiction/Best Books Ever Listings/Movie Books
## Variables not shown: rights (chr), has_text (lgl)
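An exact match on title misses works whose titles carry a subtitle or different capitalization. A hedged alternative is a case-insensitive pattern match with grepl(). The toy table below is an assumption standing in for gutenberg_metadata; only ID 768 comes from the output above, and the other rows are invented placeholders.

```r
library(dplyr)

# Toy stand-in for gutenberg_metadata (the last two rows are made up).
toy_metadata <- tibble(
  gutenberg_id = c(768L, 900001L, 900002L),
  title = c("Wuthering Heights", "WUTHERING HEIGHTS: A Study Guide", NA)
)

# grepl() returns FALSE for NA titles, so rows without a title are dropped.
matches <- toy_metadata %>%
  filter(grepl("wuthering heights", title, ignore.case = TRUE))

matches
```

A pattern match can return several editions or related works, so it is worth inspecting the result before passing the IDs on to a download.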
In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The gutenberg_works() function does this pre-filtering:
gutenberg_works()
## Source: local data frame [40,278 x 8]
##
## gutenberg_id
## (int)
## 1 0
## 2 1
## 3 2
## 4 3
## 5 4
## 6 5
## 7 6
## 8 7
## 9 8
## 10 9
## .. ...
## title
## (chr)
## 1 NA
## 2 The Declaration of Independence of the United States of America
## 3 The United States Bill of Rights\r\nThe Ten Original Amendments to the Constitution of the United States
## 4 John F. Kennedy's Inaugural Address
## 5 Lincoln's Gettysburg Address\r\nGiven November 19, 1863 on the battlefield near Gettysburg, Pennsylvania, USA
## 6 The United States Constitution
## 7 Give Me Liberty or Give Me Death
## 8 The Mayflower Compact
## 9 Abraham Lincoln's Second Inaugural Address
## 10 Abraham Lincoln's First Inaugural Address
## .. ...
## Variables not shown: author (chr), gutenberg_author_id (int), language
## (chr), gutenberg_bookshelf (chr), rights (chr), has_text (lgl)
gutenberg_works() also accepts filter conditions directly as arguments:
gutenberg_works(author == "Austen, Jane")
## Source: local data frame [10 x 8]
##
## gutenberg_id
## (int)
## 1 105
## 2 121
## 3 141
## 4 158
## 5 161
## 6 946
## 7 1212
## 8 1342
## 9 31100
## 10 42078
## title
## (chr)
## 1 Persuasion
## 2 Northanger Abbey
## 3 Mansfield Park
## 4 Emma
## 5 Sense and Sensibility
## 6 Lady Susan
## 7 Love and Freindship [sic]
## 8 Pride and Prejudice
## 9 The Complete Project Gutenberg Works of Jane Austen\nA Linked Index of all PG Editions of Jane Austen
## 10 The Letters of Jane Austen\r\nSelected from the compilation of her great nephew, Edward, Lord Bradbourne
## author
## (chr)
## 1 Austen, Jane
## 2 Austen, Jane
## 3 Austen, Jane
## 4 Austen, Jane
## 5 Austen, Jane
## 6 Austen, Jane
## 7 Austen, Jane
## 8 Austen, Jane
## 9 Austen, Jane
## 10 Austen, Jane
## Variables not shown: gutenberg_author_id (int), language (chr),
## gutenberg_bookshelf (chr), rights (chr), has_text (lgl)
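To make the pre-filtering done by gutenberg_works() concrete, here is a rough dplyr approximation on toy data. It assumes that duplicates are identified by title and author and ignores any other criteria the package may apply (such as rights), so treat it as an illustration rather than the package's exact logic. The IDs are placeholders.

```r
library(dplyr)

# Toy stand-in for gutenberg_metadata (IDs are placeholders, not real Gutenberg IDs).
toy_metadata <- tibble(
  gutenberg_id = 1:5,
  title = c("Emma", "Emma", "Candide", "Persuasion", "Lady Susan"),
  author = c("Austen, Jane", "Austen, Jane", "Voltaire", "Austen, Jane", "Austen, Jane"),
  language = c("en", "en", "fr", "en", "en"),
  has_text = c(TRUE, TRUE, TRUE, FALSE, TRUE)
)

works <- toy_metadata %>%
  filter(language == "en", has_text) %>%          # English works with downloadable text
  distinct(title, author, .keep_all = TRUE) %>%   # keep the first row per title/author pair
  filter(author == "Austen, Jane")                # the extra condition passed as an argument

works
```

Here the second copy of Emma, the French work, and the entry without text are all dropped before the author condition is applied.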
The dataset gutenberg_subjects contains pairings of works with Library of Congress subjects and topics. “lcc” means Library of Congress Classification, while “lcsh” means Library of Congress Subject Headings:
gutenberg_subjects
## Source: local data frame [139,914 x 3]
##
## gutenberg_id subject_type
## (int) (chr)
## 1 1 lcc
## 2 1 lcc
## 3 1 lcsh
## 4 1 lcsh
## 5 2 lcc
## 6 2 lcsh
## 7 2 lcc
## 8 2 lcsh
## 9 3 lcc
## 10 3 lcsh
## .. ... ...
## subject
## (chr)
## 1 JK
## 2 E201
## 3 United States. Declaration of Independence
## 4 United States -- History -- Revolution, 1775-1783 -- Sources
## 5 JK
## 6 United States. Constitution. 1st-10th Amendments
## 7 KF
## 8 Civil rights -- United States -- Sources
## 9 E838
## 10 United States -- Foreign relations -- 1961-1963
## .. ...
This is useful for extracting texts from a particular topic or genre, such as horror, or about a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link with other metadata.
gutenberg_subjects %>%
  filter(subject == "Horror tales")
## Source: local data frame [91 x 3]
##
## gutenberg_id subject_type subject
## (int) (chr) (chr)
## 1 42 lcsh Horror tales
## 2 43 lcsh Horror tales
## 3 84 lcsh Horror tales
## 4 175 lcsh Horror tales
## 5 345 lcsh Horror tales
## 6 355 lcsh Horror tales
## 7 389 lcsh Horror tales
## 8 601 lcsh Horror tales
## 9 696 lcsh Horror tales
## 10 792 lcsh Horror tales
## .. ... ... ...
gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject))
## Source: local data frame [47 x 3]
##
## gutenberg_id subject_type
## (int) (chr)
## 1 108 lcsh
## 2 221 lcsh
## 3 244 lcsh
## 4 834 lcsh
## 5 1661 lcsh
## 6 2097 lcsh
## 7 2343 lcsh
## 8 2344 lcsh
## 9 2345 lcsh
## 10 2346 lcsh
## .. ... ...
## subject
## (chr)
## 1 Holmes, Sherlock (Fictitious character) -- Fiction
## 2 Holmes, Sherlock (Fictitious character) -- Fiction
## 3 Holmes, Sherlock (Fictitious character) -- Fiction
## 4 Holmes, Sherlock (Fictitious character) -- Fiction
## 5 Holmes, Sherlock (Fictitious character) -- Fiction
## 6 Holmes, Sherlock (Fictitious character) -- Fiction
## 7 Holmes, Sherlock (Fictitious character) -- Fiction
## 8 Holmes, Sherlock (Fictitious character) -- Fiction
## 9 Holmes, Sherlock (Fictitious character) -- Fiction
## 10 Holmes, Sherlock (Fictitious character) -- Fiction
## .. ...
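Linking subjects to other metadata comes down to a join on gutenberg_id. The sketch below uses semi_join() on toy tables (assumed stand-ins for gutenberg_subjects and gutenberg_metadata) to keep only the works tagged "Horror tales". IDs 84 and 345 appear in the output above; the "Love stories" row is invented for contrast.

```r
library(dplyr)

# Toy stand-ins (assumptions) for gutenberg_subjects and gutenberg_metadata.
toy_subjects <- tibble(
  gutenberg_id = c(84L, 345L, 768L),
  subject_type = "lcsh",
  subject = c("Horror tales", "Horror tales", "Love stories")
)
toy_metadata <- tibble(
  gutenberg_id = c(84L, 345L, 768L),
  title = c("Frankenstein; Or, The Modern Prometheus", "Dracula", "Wuthering Heights")
)

horror_ids <- toy_subjects %>%
  filter(subject == "Horror tales")

# semi_join() keeps metadata rows that have a match, without adding columns.
horror_works <- toy_metadata %>%
  semi_join(horror_ids, by = "gutenberg_id")

horror_works
```

With the real datasets, a vector such as horror_works$gutenberg_id could then be passed to gutenberg_download().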
The dataset gutenberg_authors contains information about each author, such as aliases and birth/death year:
gutenberg_authors
## Source: local data frame [16,218 x 7]
##
## gutenberg_author_id author
## (int) (chr)
## 1 1 United States
## 2 3 Lincoln, Abraham
## 3 4 Henry, Patrick
## 4 5 Adam, Paul
## 5 7 Carroll, Lewis
## 6 8 United States. Central Intelligence Agency
## 7 9 Melville, Herman
## 8 10 Barrie, J. M. (James Matthew)
## 9 12 Smith, Joseph, Jr.
## 10 14 Madison, James
## .. ... ...
## alias birthdate deathdate
## (chr) (int) (int)
## 1 NA NA NA
## 2 NA 1809 1865
## 3 NA 1736 1799
## 4 NA NA NA
## 5 Dodgson, Charles Lutwidge 1832 1898
## 6 NA NA NA
## 7 Melville, Hermann 1819 1891
## 8 Barrie, James Matthew 1860 1937
## 9 Smith, Joseph 1805 1844
## 10 United States President (1809-1817) 1751 1836
## .. ... ... ...
## Variables not shown: wikipedia (chr), aliases (chr)
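Birth and death years are missing for some entries, which matters when filtering by period. The sketch below copies a few rows from the output above into a toy tibble (an assumption standing in for gutenberg_authors) and selects authors born after 1815. Note that filter() silently drops rows where the condition evaluates to NA.

```r
library(dplyr)

# A few rows copied from the printed output above.
toy_authors <- tibble(
  gutenberg_author_id = c(1L, 3L, 7L, 9L),
  author = c("United States", "Lincoln, Abraham", "Carroll, Lewis", "Melville, Herman"),
  birthdate = c(NA, 1809L, 1832L, 1819L),
  deathdate = c(NA, 1865L, 1898L, 1891L)
)

later_authors <- toy_authors %>%
  filter(birthdate > 1815) %>%                 # rows with NA birthdate are dropped
  mutate(lifespan = deathdate - birthdate)     # age at death in years (approximate)

later_authors
```

The gutenberg_author_id column plays the same linking role here that gutenberg_id plays for works, so a result like this can be joined back to the metadata.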
What’s next after retrieving a book’s text? Having the book as a data frame, with one row per line, is especially useful for working with the tidytext package for text analysis.
library(tidytext)
words <- frankenstein_dracula %>%
  unnest_tokens(word, text)
words
## Source: local data frame [237,752 x 3]
##
## gutenberg_id title word
## (int) (chr) (chr)
## 1 84 Frankenstein; Or, The Modern Prometheus frankenstein
## 2 84 Frankenstein; Or, The Modern Prometheus or
## 3 84 Frankenstein; Or, The Modern Prometheus the
## 4 84 Frankenstein; Or, The Modern Prometheus modern
## 5 84 Frankenstein; Or, The Modern Prometheus prometheus
## 6 84 Frankenstein; Or, The Modern Prometheus by
## 7 84 Frankenstein; Or, The Modern Prometheus mary
## 8 84 Frankenstein; Or, The Modern Prometheus wollstonecraft
## 9 84 Frankenstein; Or, The Modern Prometheus godwin
## 10 84 Frankenstein; Or, The Modern Prometheus shelley
## .. ... ... ...
word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)
word_counts
## Source: local data frame [15,638 x 3]
## Groups: title [2]
##
## title word n
## (chr) (chr) (int)
## 1 Dracula time 390
## 2 Dracula van 323
## 3 Dracula night 310
## 4 Dracula helsing 301
## 5 Dracula dear 224
## 6 Dracula lucy 223
## 7 Dracula day 220
## 8 Dracula hand 210
## 9 Dracula mina 210
## 10 Dracula door 200
## .. ... ... ...
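Sorting by count lets one long book dominate the top of the table. To compare books, it can help to take the top few words within each title. The sketch below does this with slice_max() on a toy version of word_counts; the Dracula counts are copied from the output above, while the Frankenstein rows are made-up placeholders, not real counts.

```r
library(dplyr)

# Toy stand-in for word_counts (Frankenstein values are invented for illustration).
toy_counts <- tibble(
  title = c(rep("Dracula", 3), rep("Frankenstein; Or, The Modern Prometheus", 3)),
  word = c("time", "van", "night", "placeholder_a", "placeholder_b", "placeholder_c"),
  n = c(390L, 323L, 310L, 30L, 20L, 10L)
)

top_words <- toy_counts %>%
  group_by(title) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%  # two most frequent words per title
  ungroup()

top_words
```

Setting with_ties = FALSE guarantees exactly two rows per title even when counts are tied; leave it at the default if you would rather keep all tied words.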