The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. It includes tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest.

Downloading books by ID

The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, the book “Frankenstein; or The Modern Prometheus”, by Mary Wollstonecraft Shelley, has ID 84 (the ID appears in the book’s Project Gutenberg URL), so gutenberg_download(84) downloads this text.

library(dplyr)
library(gutenbergr)

frankenstein <- gutenberg_download(84)

frankenstein
## Source: local data frame [7,244 x 2]
## 
##    gutenberg_id                                 text
##           (int)                                (chr)
## 1            84                        Frankenstein,
## 2            84                                     
## 3            84             or the Modern Prometheus
## 4            84                                     
## 5            84                                     
## 6            84                                   by
## 7            84                                     
## 8            84 Mary Wollstonecraft (Godwin) Shelley
## 9            84                                     
## 10           84                                     
## ..          ...                                  ...

The result is a tbl_df (a type of data frame) with two variables: gutenberg_id (useful if multiple books are returned) and text, a character vector holding one line of the book per row. The header and footer added by Project Gutenberg have been stripped away.
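
Because each row is one line of the book, blank lines in the source show up as empty strings. The following is a minimal sketch of numbering the lines and dropping the blank ones. To run without a download, it uses a small hand-built tibble, frankenstein_head (a hypothetical name, not part of the package), whose values are copied from the first eight rows printed above.

```r
library(dplyr)

# Hypothetical stand-in for the first rows of gutenberg_download(84);
# the values are copied from the printed output above.
frankenstein_head <- tibble(
  gutenberg_id = 84L,
  text = c(
    "Frankenstein,", "", "or the Modern Prometheus", "", "",
    "by", "", "Mary Wollstonecraft (Godwin) Shelley"
  )
)

# Record each line's original position, then keep only non-empty lines.
non_blank <- frankenstein_head %>%
  mutate(line = row_number()) %>%
  filter(text != "")

non_blank
```

Keeping the line column makes it possible to refer back to a line's position in the book after the blank rows are gone.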

Provide a vector of IDs to download multiple books. For example, to download Dracula (book 345) along with Frankenstein, do:

frankenstein_dracula <- gutenberg_download(c(84, 345), meta_fields = "title")

frankenstein_dracula
## Source: local data frame [22,812 x 3]
## 
##    gutenberg_id                                 text
##           (int)                                (chr)
## 1            84                        Frankenstein,
## 2            84                                     
## 3            84             or the Modern Prometheus
## 4            84                                     
## 5            84                                     
## 6            84                                   by
## 7            84                                     
## 8            84 Mary Wollstonecraft (Godwin) Shelley
## 9            84                                     
## 10           84                                     
## ..          ...                                  ...
##                                      title
##                                      (chr)
## 1  Frankenstein; Or, The Modern Prometheus
## 2  Frankenstein; Or, The Modern Prometheus
## 3  Frankenstein; Or, The Modern Prometheus
## 4  Frankenstein; Or, The Modern Prometheus
## 5  Frankenstein; Or, The Modern Prometheus
## 6  Frankenstein; Or, The Modern Prometheus
## 7  Frankenstein; Or, The Modern Prometheus
## 8  Frankenstein; Or, The Modern Prometheus
## 9  Frankenstein; Or, The Modern Prometheus
## 10 Frankenstein; Or, The Modern Prometheus
## ..                                     ...

The meta_fields argument adds one or more fields from the gutenberg_metadata dataset, such as title or author, to the downloaded text. This makes it easy to summarize by book, for example by counting the lines in each:

frankenstein_dracula %>%
  count(title)
## Source: local data frame [2 x 2]
## 
##                                     title     n
##                                     (chr) (int)
## 1                                 Dracula 15568
## 2 Frankenstein; Or, The Modern Prometheus  7244
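
The result of meta_fields resembles what you would get by joining the downloaded text to the metadata on gutenberg_id. The sketch below illustrates that idea on toy data; it is not a description of how the package works internally. The tibbles toy_text and toy_metadata are hypothetical miniatures, and the Dracula text line is a placeholder rather than real book text.

```r
library(dplyr)

# Hypothetical miniature of a downloaded text table.
# The line for ID 345 is a placeholder, not actual text from the book.
toy_text <- tibble(
  gutenberg_id = c(84L, 84L, 345L),
  text = c("Frankenstein,", "or the Modern Prometheus", "(placeholder line)")
)

# Hypothetical miniature of gutenberg_metadata with only three columns.
toy_metadata <- tibble(
  gutenberg_id = c(84L, 345L),
  title = c("Frankenstein; Or, The Modern Prometheus", "Dracula"),
  author = c("Shelley, Mary Wollstonecraft", "Stoker, Bram")
)

# Attach title and author to every line of text by matching on the ID.
toy_with_meta <- toy_text %>%
  left_join(toy_metadata, by = "gutenberg_id")

toy_with_meta
```

A manual join like this is also an option when you want a metadata column after the text has already been downloaded.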

Project Gutenberg Metadata

This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.

The dataset gutenberg_metadata contains information about each work, pairing each Gutenberg ID with its title, author, language, and other fields:

gutenberg_metadata
## Source: local data frame [51,877 x 8]
## 
##    gutenberg_id
##           (int)
## 1             0
## 2             1
## 3             2
## 4             3
## 5             4
## 6             5
## 7             6
## 8             7
## 9             8
## 10            9
## ..          ...
##                                                                                                            title
##                                                                                                            (chr)
## 1                                                                                                             NA
## 2                                                The Declaration of Independence of the United States of America
## 3       The United States Bill of Rights\r\nThe Ten Original Amendments to the Constitution of the United States
## 4                                                                            John F. Kennedy's Inaugural Address
## 5  Lincoln's Gettysburg Address\r\nGiven November 19, 1863 on the battlefield near Gettysburg, Pennsylvania, USA
## 6                                                                                 The United States Constitution
## 7                                                                               Give Me Liberty or Give Me Death
## 8                                                                                          The Mayflower Compact
## 9                                                                     Abraham Lincoln's Second Inaugural Address
## 10                                                                     Abraham Lincoln's First Inaugural Address
## ..                                                                                                           ...
## Variables not shown: author (chr), gutenberg_author_id (int), language
##   (chr), gutenberg_bookshelf (chr), rights (chr), has_text (lgl)

For example, you could find the Gutenberg ID of Wuthering Heights by doing:

gutenberg_metadata %>%
  filter(title == "Wuthering Heights")
## Source: local data frame [1 x 8]
## 
##   gutenberg_id             title        author gutenberg_author_id
##          (int)             (chr)         (chr)               (int)
## 1          768 Wuthering Heights Brontë, Emily                 405
##   language                                 gutenberg_bookshelf
##      (chr)                                               (chr)
## 1       en Gothic Fiction/Best Books Ever Listings/Movie Books
## Variables not shown: rights (chr), has_text (lgl)
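
Exact matching with == only works when the title has no subtitle. Several titles in the metadata carry a subtitle after a line break (for example the Gettysburg Address entry above), so a pattern match with grepl() can be more forgiving. The sketch below uses toy_metadata, a hypothetical ten-row copy of the IDs and titles printed earlier, so that it runs without the package data.

```r
library(dplyr)

# Hypothetical stand-in for the first ten rows of gutenberg_metadata;
# IDs and titles are copied from the printed output above.
toy_metadata <- tibble(
  gutenberg_id = 0:9,
  title = c(
    NA,
    "The Declaration of Independence of the United States of America",
    "The United States Bill of Rights\r\nThe Ten Original Amendments to the Constitution of the United States",
    "John F. Kennedy's Inaugural Address",
    "Lincoln's Gettysburg Address\r\nGiven November 19, 1863 on the battlefield near Gettysburg, Pennsylvania, USA",
    "The United States Constitution",
    "Give Me Liberty or Give Me Death",
    "The Mayflower Compact",
    "Abraham Lincoln's Second Inaugural Address",
    "Abraham Lincoln's First Inaugural Address"
  )
)

# Exact match: finds nothing, because the stored title includes a subtitle.
exact <- toy_metadata %>%
  filter(title == "Lincoln's Gettysburg Address")

# Pattern match, ignoring case: finds every title containing the phrase.
inaugural <- toy_metadata %>%
  filter(grepl("inaugural address", title, ignore.case = TRUE))

inaugural
```

On this toy table the exact match returns no rows, while the pattern match returns IDs 3, 8, and 9.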

In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The gutenberg_works() function does this pre-filtering:

gutenberg_works()
## Source: local data frame [40,278 x 8]
## 
##    gutenberg_id
##           (int)
## 1             0
## 2             1
## 3             2
## 4             3
## 5             4
## 6             5
## 7             6
## 8             7
## 9             8
## 10            9
## ..          ...
##                                                                                                            title
##                                                                                                            (chr)
## 1                                                                                                             NA
## 2                                                The Declaration of Independence of the United States of America
## 3       The United States Bill of Rights\r\nThe Ten Original Amendments to the Constitution of the United States
## 4                                                                            John F. Kennedy's Inaugural Address
## 5  Lincoln's Gettysburg Address\r\nGiven November 19, 1863 on the battlefield near Gettysburg, Pennsylvania, USA
## 6                                                                                 The United States Constitution
## 7                                                                               Give Me Liberty or Give Me Death
## 8                                                                                          The Mayflower Compact
## 9                                                                     Abraham Lincoln's Second Inaugural Address
## 10                                                                     Abraham Lincoln's First Inaugural Address
## ..                                                                                                           ...
## Variables not shown: author (chr), gutenberg_author_id (int), language
##   (chr), gutenberg_bookshelf (chr), rights (chr), has_text (lgl)
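
To make the three criteria concrete (English only, text available, no duplicates), here is a rough sketch of the same kind of filtering done by hand on toy data. It is an illustration under stated assumptions, not the package's implementation: the table toy_metadata and its contents are invented for the example, and treating rows with the same title and author as duplicates is an assumption that the text above does not spell out.

```r
library(dplyr)

# Invented toy metadata; titles and authors are placeholders.
toy_metadata <- tibble(
  gutenberg_id = 1:5,
  title = c("Work A", "Work A", "Work B", "Work C", "Work D"),
  author = c("Author X", "Author X", "Author Y", "Author Z", "Author W"),
  language = c("en", "en", "fr", "en", "en"),
  has_text = c(TRUE, TRUE, TRUE, FALSE, TRUE)
)

toy_works <- toy_metadata %>%
  # Keep English works that have downloadable text.
  filter(language == "en", has_text) %>%
  # Assumption: duplicates are rows sharing both title and author;
  # the first occurrence is kept.
  distinct(title, author, .keep_all = TRUE)

toy_works
```

Here ID 2 is dropped as a duplicate of ID 1, ID 3 is dropped for being in French, and ID 4 is dropped for lacking text.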

It also accepts filtering conditions as arguments:

gutenberg_works(author == "Austen, Jane")
## Source: local data frame [10 x 8]
## 
##    gutenberg_id
##           (int)
## 1           105
## 2           121
## 3           141
## 4           158
## 5           161
## 6           946
## 7          1212
## 8          1342
## 9         31100
## 10        42078
##                                                                                                       title
##                                                                                                       (chr)
## 1                                                                                                Persuasion
## 2                                                                                          Northanger Abbey
## 3                                                                                            Mansfield Park
## 4                                                                                                      Emma
## 5                                                                                     Sense and Sensibility
## 6                                                                                                Lady Susan
## 7                                                                                 Love and Freindship [sic]
## 8                                                                                       Pride and Prejudice
## 9     The Complete Project Gutenberg Works of Jane Austen\nA Linked Index of all PG Editions of Jane Austen
## 10 The Letters of Jane Austen\r\nSelected from the compilation of her great nephew, Edward, Lord Bradbourne
##          author
##           (chr)
## 1  Austen, Jane
## 2  Austen, Jane
## 3  Austen, Jane
## 4  Austen, Jane
## 5  Austen, Jane
## 6  Austen, Jane
## 7  Austen, Jane
## 8  Austen, Jane
## 9  Austen, Jane
## 10 Austen, Jane
## Variables not shown: gutenberg_author_id (int), language (chr),
##   gutenberg_bookshelf (chr), rights (chr), has_text (lgl)

Other meta-datasets

gutenberg_subjects contains pairings of works with Library of Congress subjects and topics. In the subject_type column, “lcc” means Library of Congress Classification, while “lcsh” means Library of Congress Subject Headings:

gutenberg_subjects
## Source: local data frame [139,914 x 3]
## 
##    gutenberg_id subject_type
##           (int)        (chr)
## 1             1          lcc
## 2             1          lcc
## 3             1         lcsh
## 4             1         lcsh
## 5             2          lcc
## 6             2         lcsh
## 7             2          lcc
## 8             2         lcsh
## 9             3          lcc
## 10            3         lcsh
## ..          ...          ...
##                                                         subject
##                                                           (chr)
## 1                                                            JK
## 2                                                          E201
## 3                    United States. Declaration of Independence
## 4  United States -- History -- Revolution, 1775-1783 -- Sources
## 5                                                            JK
## 6              United States. Constitution. 1st-10th Amendments
## 7                                                            KF
## 8                      Civil rights -- United States -- Sources
## 9                                                          E838
## 10              United States -- Foreign relations -- 1961-1963
## ..                                                          ...

This is useful for extracting texts on a particular topic or in a particular genre, such as horror, or texts featuring a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link with other metadata.

gutenberg_subjects %>%
  filter(subject == "Horror tales")
## Source: local data frame [91 x 3]
## 
##    gutenberg_id subject_type      subject
##           (int)        (chr)        (chr)
## 1            42         lcsh Horror tales
## 2            43         lcsh Horror tales
## 3            84         lcsh Horror tales
## 4           175         lcsh Horror tales
## 5           345         lcsh Horror tales
## 6           355         lcsh Horror tales
## 7           389         lcsh Horror tales
## 8           601         lcsh Horror tales
## 9           696         lcsh Horror tales
## 10          792         lcsh Horror tales
## ..          ...          ...          ...

gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject))
## Source: local data frame [47 x 3]
## 
##    gutenberg_id subject_type
##           (int)        (chr)
## 1           108         lcsh
## 2           221         lcsh
## 3           244         lcsh
## 4           834         lcsh
## 5          1661         lcsh
## 6          2097         lcsh
## 7          2343         lcsh
## 8          2344         lcsh
## 9          2345         lcsh
## 10         2346         lcsh
## ..          ...          ...
##                                               subject
##                                                 (chr)
## 1  Holmes, Sherlock (Fictitious character) -- Fiction
## 2  Holmes, Sherlock (Fictitious character) -- Fiction
## 3  Holmes, Sherlock (Fictitious character) -- Fiction
## 4  Holmes, Sherlock (Fictitious character) -- Fiction
## 5  Holmes, Sherlock (Fictitious character) -- Fiction
## 6  Holmes, Sherlock (Fictitious character) -- Fiction
## 7  Holmes, Sherlock (Fictitious character) -- Fiction
## 8  Holmes, Sherlock (Fictitious character) -- Fiction
## 9  Holmes, Sherlock (Fictitious character) -- Fiction
## 10 Holmes, Sherlock (Fictitious character) -- Fiction
## ..                                                ...
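
Linking subjects to other metadata comes down to a join on gutenberg_id. The sketch below shows one way to do it on toy data: toy_subjects and toy_titles are hypothetical tables whose rows are copied from the gutenberg_subjects and gutenberg_metadata output printed earlier, so the example runs without the package data.

```r
library(dplyr)

# Hypothetical copy of the first ten rows of gutenberg_subjects.
toy_subjects <- tibble(
  gutenberg_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  subject_type = c("lcc", "lcc", "lcsh", "lcsh", "lcc",
                   "lcsh", "lcc", "lcsh", "lcc", "lcsh"),
  subject = c(
    "JK",
    "E201",
    "United States. Declaration of Independence",
    "United States -- History -- Revolution, 1775-1783 -- Sources",
    "JK",
    "United States. Constitution. 1st-10th Amendments",
    "KF",
    "Civil rights -- United States -- Sources",
    "E838",
    "United States -- Foreign relations -- 1961-1963"
  )
)

# Hypothetical copy of the titles for IDs 1 to 3 from gutenberg_metadata.
toy_titles <- tibble(
  gutenberg_id = 1:3,
  title = c(
    "The Declaration of Independence of the United States of America",
    "The United States Bill of Rights\r\nThe Ten Original Amendments to the Constitution of the United States",
    "John F. Kennedy's Inaugural Address"
  )
)

# Find subject headings ending in "Sources", then look up each work's title.
source_works <- toy_subjects %>%
  filter(subject_type == "lcsh", grepl("Sources$", subject)) %>%
  inner_join(toy_titles, by = "gutenberg_id")

source_works
```

The same pattern should carry over to the full datasets by swapping in gutenberg_subjects and gutenberg_metadata for the toy tables.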

gutenberg_authors contains information about each author, such as aliases and birth/death year:

gutenberg_authors
## Source: local data frame [16,218 x 7]
## 
##    gutenberg_author_id                                     author
##                  (int)                                      (chr)
## 1                    1                              United States
## 2                    3                           Lincoln, Abraham
## 3                    4                             Henry, Patrick
## 4                    5                                 Adam, Paul
## 5                    7                             Carroll, Lewis
## 6                    8 United States. Central Intelligence Agency
## 7                    9                           Melville, Herman
## 8                   10              Barrie, J. M. (James Matthew)
## 9                   12                         Smith, Joseph, Jr.
## 10                  14                             Madison, James
## ..                 ...                                        ...
##                                  alias birthdate deathdate
##                                  (chr)     (int)     (int)
## 1                                   NA        NA        NA
## 2                                   NA      1809      1865
## 3                                   NA      1736      1799
## 4                                   NA        NA        NA
## 5            Dodgson, Charles Lutwidge      1832      1898
## 6                                   NA        NA        NA
## 7                    Melville, Hermann      1819      1891
## 8                Barrie, James Matthew      1860      1937
## 9                        Smith, Joseph      1805      1844
## 10 United States President (1809-1817)      1751      1836
## ..                                 ...       ...       ...
## Variables not shown: wikipedia (chr), aliases (chr)
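
Since birthdate and deathdate hold years, they can be used to select authors by period. The following sketch filters for authors born in the 1800s and computes an approximate lifespan from the two years. It uses toy_authors, a hypothetical copy of four columns from the ten rows printed above; lifespan_years is a name introduced for the example.

```r
library(dplyr)

# Hypothetical copy of the first ten rows of gutenberg_authors,
# restricted to four columns; values are taken from the printed output above.
toy_authors <- tibble(
  gutenberg_author_id = c(1L, 3L, 4L, 5L, 7L, 8L, 9L, 10L, 12L, 14L),
  author = c(
    "United States", "Lincoln, Abraham", "Henry, Patrick", "Adam, Paul",
    "Carroll, Lewis", "United States. Central Intelligence Agency",
    "Melville, Herman", "Barrie, J. M. (James Matthew)",
    "Smith, Joseph, Jr.", "Madison, James"
  ),
  birthdate = c(NA, 1809L, 1736L, NA, 1832L, NA, 1819L, 1860L, 1805L, 1751L),
  deathdate = c(NA, 1865L, 1799L, NA, 1898L, NA, 1891L, 1937L, 1844L, 1836L)
)

born_1800s <- toy_authors %>%
  # Rows with a missing birthdate are dropped by filter().
  filter(birthdate >= 1800, birthdate < 1900) %>%
  # Difference of the two years; only approximate, since months are unknown.
  mutate(lifespan_years = deathdate - birthdate)

born_1800s
```

Joining the filtered authors back to the works metadata on gutenberg_author_id would then give the works written by authors of that period.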

Analysis

What’s next after retrieving a book’s text? Having the book as a data frame, with one line per row, is especially useful for working with the tidytext package for text analysis.

library(tidytext)

words <- frankenstein_dracula %>%
  unnest_tokens(word, text)

words
## Source: local data frame [237,752 x 3]
## 
##    gutenberg_id                                   title           word
##           (int)                                   (chr)          (chr)
## 1            84 Frankenstein; Or, The Modern Prometheus   frankenstein
## 2            84 Frankenstein; Or, The Modern Prometheus             or
## 3            84 Frankenstein; Or, The Modern Prometheus            the
## 4            84 Frankenstein; Or, The Modern Prometheus         modern
## 5            84 Frankenstein; Or, The Modern Prometheus     prometheus
## 6            84 Frankenstein; Or, The Modern Prometheus             by
## 7            84 Frankenstein; Or, The Modern Prometheus           mary
## 8            84 Frankenstein; Or, The Modern Prometheus wollstonecraft
## 9            84 Frankenstein; Or, The Modern Prometheus         godwin
## 10           84 Frankenstein; Or, The Modern Prometheus        shelley
## ..          ...                                     ...            ...

word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)

word_counts
## Source: local data frame [15,638 x 3]
## Groups: title [2]
## 
##      title    word     n
##      (chr)   (chr) (int)
## 1  Dracula    time   390
## 2  Dracula     van   323
## 3  Dracula   night   310
## 4  Dracula helsing   301
## 5  Dracula    dear   224
## 6  Dracula    lucy   223
## 7  Dracula     day   220
## 8  Dracula    hand   210
## 9  Dracula    mina   210
## 10 Dracula    door   200
## ..     ...     ...   ...
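
To get a feel for what the tokenization step does, here is a rough base R approximation: lowercase each line, split on anything that is not a letter, and count the pieces. This is only a sketch of the idea and not how tidytext is implemented; the real tokenizer handles cases such as apostrophes and non-ASCII characters that this regular expression does not. The input lines are copied from the Frankenstein output earlier in this document.

```r
# First eight lines of Frankenstein, copied from the output printed earlier.
lines <- c(
  "Frankenstein,", "", "or the Modern Prometheus", "", "",
  "by", "", "Mary Wollstonecraft (Godwin) Shelley"
)

# Lowercase each line, then split on runs of non-letter characters.
tokens <- unlist(strsplit(tolower(lines), "[^a-z]+"))

# Drop any empty strings produced by separators at the start of a line.
tokens <- tokens[tokens != ""]

# Count occurrences of each token, most frequent first.
token_counts <- sort(table(tokens), decreasing = TRUE)

tokens
```

The resulting tokens match the first ten rows of the words table shown above, from "frankenstein" through "shelley".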