The epubr package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame.
E-book formatting is not standardized enough across all literature for any single function to curate parsed content from an arbitrary collection of e-books into one consistently formatted output containing all the same variables.
EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package. There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with epubr.
Text is read ‘as is’. Additional text cleaning should be performed by the user at their discretion, such as with functions from packages like tm or qdap.
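As a minimal sketch of what such cleaning might look like, the base R snippet below normalizes whitespace and case in a short string that resembles parsed section text. The clean_text helper and the sample string are hypothetical and not part of epubr; packages like tm or qdap offer far more complete tools.

```r
# Hypothetical helper (not part of epubr): light cleanup of parsed section text.
clean_text <- function(x) {
  x <- gsub("[\r\n\t]+", " ", x)  # replace line breaks and tabs with a space
  x <- gsub(" {2,}", " ", x)      # collapse runs of spaces into one
  x <- trimws(x)                  # drop leading and trailing whitespace
  tolower(x)                      # normalize case
}

# Sample string made up for illustration, loosely imitating raw e-book text.
raw <- " \nLater.--Dr. Van Helsing   has returned.\n"
cleaned <- clean_text(raw)
cleaned
```

Whether lowercasing or whitespace collapsing is appropriate depends on the analysis, so treat each step as optional.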
Bram Stoker’s novel Dracula, sourced from Project Gutenberg, is a good example of an EPUB file with unfortunate formatting. The first thing that stands out is that the naming convention, item followed by some ordered digits, does not differentiate sections like the book preamble from the chapters. The numbering also starts in an odd place (the first section is item6). But it is actually worse than this. Notice that sections are not broken at chapter boundaries; they can begin and end in the middle of chapters!
These annoyances aside, the metadata and contents can still be read into a convenient table. Text mining analyses can still be performed on the overall book, if not so easily on individual chapters.
Here a single file is read with epub. The returned primary data frame and the book text data frame nested within its data column are shown.
library(epubr)
file <- system.file("dracula.epub", package = "epubr")
(x <- epub(file))
#> # A tibble: 1 x 9
#> rights identifier creator title language subject date source data
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lis>
#> 1 Public~ http://www.~ Bram St~ Drac~ en Horror~ 1995~ http:/~ <tib~
x$data[[1]]
#> # A tibble: 15 x 4
#> section text nword nchar
#> <chr> <chr> <int> <int>
#> 1 item6 "The Project Gutenberg EBook of Dracula,~ 11252 60972
#> 2 item7 "But I am not in heart to describe beaut~ 13740 71798
#> 3 item8 "“ ‘Lucy, you are an honest-hearted girl~ 12356 65522
#> 4 item9 "CHAPTER VIIIMINA MURRAY’S JOURNAL\nSame~ 12042 62724
#> 5 item10 "CHAPTER X\nLetter, Dr. Seward to Hon. A~ 12599 66678
#> 6 item11 "Once again we went through that ghastly~ 11919 62949
#> 7 item12 "CHAPTER XIVMINA HARKER’S JOURNAL\n23 Se~ 12003 62234
#> 8 item13 "CHAPTER XVIDR. SEWARD’S DIARY—continued~ 13812 72903
#> 9 item14 "“Thus when we find the habitation of th~ 13201 69779
#> 10 item15 "“I see,” I said. “You want big things t~ 12706 66921
#> 11 item16 "CHAPTER XXIIIDR. SEWARD’S DIARY\n3 Octo~ 11818 61550
#> 12 item17 "CHAPTER XXVDR. SEWARD’S DIARY\n11 Octob~ 12989 68564
#> 13 item18 " \nLater.—Dr. Van Helsing has returned.~ 8356 43464
#> 14 item19 "End of the Project Gutenberg EBook of D~ 2669 18541
#> 15 coverpage-wrapper "" 0 0
The file argument may be a vector of EPUB files. The returned data frame has one row for each book.
The above examples jump right in, but it can be helpful to inspect file metadata before reading a large number of books into memory. Formatting may differ across books. It can be helpful to know what fields to expect, the degree of consistency, and what content you may want to drop during the file reading process. epub_meta strictly parses file metadata and does not read the e-book text.
epub_meta(file)
#> # A tibble: 1 x 8
#> rights identifier creator title language subject date source
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Public d~ http://www.g~ Bram St~ Drac~ en Horror~ 1995~ http://ww~
This provides the big picture, though it will not reveal the internal breakdown of book section naming conventions that were seen in the first epub example.
file can also be a vector for epub_meta. Whenever file is a vector, the fields (columns) returned are the union of all fields detected across all EPUB files. Any book (row) that lacks a field found in another book returns NA for that row and column.
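The union-of-fields behavior can be illustrated with a small base R sketch. The two mock books, their field values, and the fill_fields helper are all hypothetical; this imitates the shape of the result described above and is not how epubr is implemented internally.

```r
# Mock metadata for two hypothetical books with different fields.
book1 <- data.frame(title = "Book A", creator = "Author A",
                    stringsAsFactors = FALSE)
book2 <- data.frame(title = "Book B", publisher = "Publisher B",
                    stringsAsFactors = FALSE)

# Union of all fields detected across the books.
all_fields <- union(names(book1), names(book2))

# Hypothetical helper: add any missing field as NA, then use a common column order.
fill_fields <- function(d, fields) {
  for (f in setdiff(fields, names(d))) d[[f]] <- NA_character_
  d[fields]
}

# One row per book; fields absent from a book are NA in that row.
meta <- rbind(fill_fields(book1, all_fields), fill_fields(book2, all_fields))
meta
```

Checking colSums(is.na(meta)) on real metadata is one quick way to see how consistently a field is populated across a collection.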
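There are three optional arguments that can be provided to epub to select fields, or columns of the primary data frame (fields); to drop unwanted sections, or rows of the nested data frame (drop_sections); and to attempt to detect which sections are book chapters (chapter_pattern).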
Unless you have a collection of well-formatted and similarly formatted EPUB files, these arguments may not be helpful and can be ignored, especially chapter detection.
Selecting fields is straightforward. All fields found are returned unless a vector of fields is provided.
epub(file, fields = c("title", "creator", "file"))
#> # A tibble: 1 x 4
#> title creator file data
#> <chr> <chr> <chr> <list>
#> 1 Dracula Bram Stoker dracula.epub <tibble [15 x 4]>
Note that file was not a field identified in the metadata. This is a special case. Including file will include the basename of the input file. This is helpful when you want to retain file names and source is included in the metadata but may represent something else. Some fields like data are always returned and do not need to be specified in fields.
Filtering out unwanted sections, or rows of the nested data frame, uses a regular expression pattern. Matched rows are dropped. This is where knowing the naming conventions used in the e-books in file, or at least knowing they are satisfactorily consistent and predictable for a collection, helps with removing extraneous clutter.
One section that can be discarded is the cover. For many books it can be helpful to use a pattern like "^(C|c)ov" to drop any sections whose IDs begin with Cov or cov, whether the ID uses that abbreviation or the full word. For this book, cov suffices. The nested data frame has one fewer row than before.
epub(file, drop_sections = "cov")$data[[1]]
#> # A tibble: 14 x 4
#> section text nword nchar
#> <chr> <chr> <int> <int>
#> 1 item6 "The Project Gutenberg EBook of Dracula, by Bram S~ 11252 60972
#> 2 item7 "But I am not in heart to describe beauty, for whe~ 13740 71798
#> 3 item8 "“ ‘Lucy, you are an honest-hearted girl, I know. ~ 12356 65522
#> 4 item9 "CHAPTER VIIIMINA MURRAY’S JOURNAL\nSame day, 11 o~ 12042 62724
#> 5 item10 "CHAPTER X\nLetter, Dr. Seward to Hon. Arthur Holm~ 12599 66678
#> 6 item11 "Once again we went through that ghastly operation~ 11919 62949
#> 7 item12 "CHAPTER XIVMINA HARKER’S JOURNAL\n23 September.—J~ 12003 62234
#> 8 item13 "CHAPTER XVIDR. SEWARD’S DIARY—continued\nIT was j~ 13812 72903
#> 9 item14 "“Thus when we find the habitation of this man-tha~ 13201 69779
#> 10 item15 "“I see,” I said. “You want big things that you ca~ 12706 66921
#> 11 item16 "CHAPTER XXIIIDR. SEWARD’S DIARY\n3 October.—The t~ 11818 61550
#> 12 item17 "CHAPTER XXVDR. SEWARD’S DIARY\n11 October, Evenin~ 12989 68564
#> 13 item18 " \nLater.—Dr. Van Helsing has returned. He has go~ 8356 43464
#> 14 item19 "End of the Project Gutenberg EBook of Dracula, by~ 2669 18541
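To see how the two patterns differ before applying them to a collection, the sketch below tests them against a vector of section IDs with base R's grepl. The first three IDs come from the Dracula output above; "Cover" and "discover-more" are hypothetical IDs added for contrast. It assumes only that matching behaves like an ordinary regular expression match on section IDs, as described above.

```r
# Three IDs from the Dracula output plus two hypothetical IDs for contrast.
ids <- c("item6", "item19", "coverpage-wrapper", "Cover", "discover-more")

# Unanchored and case-sensitive: also matches "cov" inside "discover-more".
loose <- grepl("cov", ids)

# Anchored at the start, either case: matches only IDs beginning Cov or cov.
strict <- grepl("^(C|c)ov", ids)

# IDs that would survive dropping with the anchored pattern.
keep <- ids[!strict]
data.frame(ids, loose, strict)
```

The anchored pattern is the safer default when section IDs are not known in advance, since an unanchored pattern can match in the middle of an unrelated ID.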
This e-book unfortunately does not have great formatting. For the sake of example, pretend that chapters are known to be sections beginning with item and followed by two digits, using the pattern ^item\\d\\d. This does two things. It adds a new metadata column to the primary data frame called nchap giving the estimated number of chapters in the book. In the nested data frame containing the parsed e-book text, the section column is conditionally mutated to reflect a new, consistent chapter naming convention for the identified chapters and a logical is_chapter column is added.
x <- epub(file, drop_sections = "cov", chapter_pattern = "^item\\d\\d")
x
#> # A tibble: 1 x 10
#> rights identifier creator title language subject date source nchap
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 Public~ http://www.~ Bram S~ Drac~ en Horror~ 1995~ http://~ 10
#> # ... with 1 more variable: data <list>
x$data[[1]]
#> # A tibble: 14 x 5
#> section text is_chapter nword nchar
#> <chr> <chr> <lgl> <int> <int>
#> 1 item6 "The Project Gutenberg EBook of Dracula~ FALSE 11252 60972
#> 2 item7 "But I am not in heart to describe beau~ FALSE 13740 71798
#> 3 item8 "“ ‘Lucy, you are an honest-hearted gir~ FALSE 12356 65522
#> 4 item9 "CHAPTER VIIIMINA MURRAY’S JOURNAL\nSam~ FALSE 12042 62724
#> 5 ch01 "CHAPTER X\nLetter, Dr. Seward to Hon. ~ TRUE 12599 66678
#> 6 ch02 "Once again we went through that ghastl~ TRUE 11919 62949
#> 7 ch03 "CHAPTER XIVMINA HARKER’S JOURNAL\n23 S~ TRUE 12003 62234
#> 8 ch04 "CHAPTER XVIDR. SEWARD’S DIARY—continue~ TRUE 13812 72903
#> 9 ch05 "“Thus when we find the habitation of t~ TRUE 13201 69779
#> 10 ch06 "“I see,” I said. “You want big things ~ TRUE 12706 66921
#> 11 ch07 "CHAPTER XXIIIDR. SEWARD’S DIARY\n3 Oct~ TRUE 11818 61550
#> 12 ch08 "CHAPTER XXVDR. SEWARD’S DIARY\n11 Octo~ TRUE 12989 68564
#> 13 ch09 " \nLater.—Dr. Van Helsing has returned~ TRUE 8356 43464
#> 14 ch10 "End of the Project Gutenberg EBook of ~ TRUE 2669 18541
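The result above can be reproduced in outline with base R, which may help when testing a candidate pattern before passing it as chapter_pattern. This is only a sketch that mimics the outcome shown above on the section IDs alone; it makes no claim about how epub performs the detection or renaming internally.

```r
# Section IDs as in the Dracula output above (cover already dropped).
ids <- paste0("item", 6:19)

# Sections whose ID is "item" followed by at least two digits.
is_chapter <- grepl("^item\\d\\d", ids)

# Estimated number of chapters.
nchap <- sum(is_chapter)

# Rename matched sections with a consistent, zero-padded chapter label.
section <- ids
section[is_chapter] <- sprintf("ch%02d", seq_len(nchap))
data.frame(section, is_chapter)
```

Here item6 through item9 fail the two-digit requirement, which is why sections that visibly contain chapter text are still flagged FALSE.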
Also note that not all books have chapters. Make sure an optional argument makes sense to use with a given e-book.
Some e-books have formatting that puts chapter sections completely out of order, even when they are easily separable from other book sections. This can be another roadblock: you may correctly distinguish chapters from other sections like the cover, title, copyright and acknowledgements pages, yet still number the chapters incorrectly.
There are some developmental options that can get around issues like this in certain edge cases and where certain reasonable conditions can be met. For example, a second pass can be attempted internally in a call to epub to cross reference guessed chapter sections with the presence of something like CHAPTER (or some other secondary regular expression pattern) appearing at the beginning of the actual section text. All is not necessarily lost when file metadata formatting is not useful for a given e-book.
These developmental arguments are currently undocumented, though they can be explored if you are inclined to read the package source code and pass additional arguments to .... They have been tested successfully on many e-books, but certainly not a representative sample of all e-books. The approaches these arguments use may also change before they are formally supported and explicitly added to a future version of the package.
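The idea of a second pass can be sketched outside the package with base R. The example below is one possible interpretation, not the undocumented epubr arguments themselves: it keeps a section as a chapter only if its ID matches the first pattern and its text begins with CHAPTER. The text values are abbreviated from the Dracula output above, and combining the two passes with a logical AND is an assumption made for illustration.

```r
# Four sections with text abbreviated from the Dracula output above.
sec <- data.frame(
  section = c("item9", "item10", "item11", "item19"),
  text = c("CHAPTER VIIIMINA MURRAY",
           "CHAPTER X\nLetter, Dr. Seward",
           "Once again we went through",
           "End of the Project Gutenberg EBook"),
  stringsAsFactors = FALSE
)

# First pass: guess chapters from the section ID.
first_pass <- grepl("^item\\d\\d", sec$section)

# Second pass: check whether the section text starts with CHAPTER.
second_pass <- grepl("^CHAPTER", sec$text)

# Assumed combination rule: both passes must agree.
sec$is_chapter <- first_pass & second_pass
sec[, c("section", "is_chapter")]
```

For this book the cross-check removes item11 and item19 but also shows the limits of the approach: item9 starts with CHAPTER yet fails the ID pattern, and sections that begin mid-chapter fail the text check.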
Ultimately though, everything depends on the quality of the EPUB file. Some publishers are better than others. Formatting standards may also change over time.
Separate from using epub_meta and epub, you can call epub_unzip directly if all you want to do is extract the files from the .epub file archive. By default the archive files are extracted to tempdir(), so you may want to change this with the exdir argument.
bookdir <- file.path(tempdir(), "dracula")
epub_unzip(file, exdir = bookdir)
list.files(bookdir, recursive = TRUE)
#> [1] "META-INF/container.xml"
#> [2] "OEBPS/0.css"
#> [3] "OEBPS/1.css"
#> [4] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-0.htm.html"
#> [5] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-1.htm.html"
#> [6] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-10.htm.html"
#> [7] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-11.htm.html"
#> [8] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-12.htm.html"
#> [9] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-13.htm.html"
#> [10] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-2.htm.html"
#> [11] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-3.htm.html"
#> [12] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-4.htm.html"
#> [13] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-5.htm.html"
#> [14] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-6.htm.html"
#> [15] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-7.htm.html"
#> [16] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-8.htm.html"
#> [17] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-9.htm.html"
#> [18] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@images@colophon.png"
#> [19] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@images@cover.jpg"
#> [20] "OEBPS/content.opf"
#> [21] "OEBPS/pgepub.css"
#> [22] "OEBPS/toc.ncx"
#> [23] "OEBPS/wrap0000.html"
#> [24] "mimetype"