Introduction to jstor

Thomas Klebel

2018-07-15

The tool Data for Research (DfR) by JSTOR is a valuable source for citation analysis and text mining. jstor provides functions and suggests workflows for importing datasets from DfR.

When using DfR, requests for datasets can be made for small excerpts (max. 25,000 records) or large ones, which require an agreement between the researcher and JSTOR. jstor was developed to deal with very large datasets which require an agreement, but can be used with smaller ones as well.

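The most important set of functions is the jst_get_* family: jst_get_article, jst_get_authors, jst_get_references, jst_get_footnotes, jst_get_full_text, jst_get_book and jst_get_chapters.

I will demonstrate their usage with the sample dataset which JSTOR provides on their website.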

General Concept

All functions from the jst_get_* family which are concerned with metadata operate along the same lines:

  1. The file is read with xml2::read_xml().
  2. Content of the file is extracted via XPath or CSS expressions.
  3. The resulting data is returned in a tidy tibble.
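The three steps above can be sketched outside of jstor. The XML string below is a hypothetical, heavily simplified stand-in for a DfR metadata file, and extract_first() is an illustrative helper, not one of jstor's internal functions:

```r
library(xml2)
library(tibble)

# Hypothetical, simplified stand-in for a DfR metadata file
xml_string <- '<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>On the Protozoa Parasitic in Frogs</article-title>
      </title-group>
      <volume>41</volume>
      <fpage>59</fpage>
    </article-meta>
  </front>
</article>'

# 1. Read the file (here: literal XML) with xml2::read_xml()
doc <- read_xml(xml_string)

# 2. Extract content via XPath; a missing node yields NA
extract_first <- function(doc, xpath) {
  xml_text(xml_find_first(doc, xpath))
}

# 3. Return the result as a tidy tibble with one row
sketch <- tibble(
  article_title = extract_first(doc, ".//article-title"),
  volume        = extract_first(doc, ".//volume"),
  first_page    = extract_first(doc, ".//fpage"),
  issue         = extract_first(doc, ".//issue")  # not present in the sketch
)

sketch
```

Elements that are absent from the file simply end up as NA, which mirrors the missing values visible in the outputs further down.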

The functions are similar in that each operates on a single file (article, book, research report or pamphlet). Depending on the content of the file, the output might have one or multiple rows. jst_get_article always returns a tibble with one row: the core metadata (like title, id, or first page of the article) are single items, and only one article is processed at a time. Running jst_get_authors for the same article might give you a tibble with one or multiple rows, depending on the number of authors the article has. The same is true for jst_get_references and jst_get_footnotes. If a file has no data on references (they might still exist, but JSTOR might not have parsed them), the output is a single row with a missing value for references. If there is data on references, each entry gets its own row. Note, however, that the number of rows does not necessarily equal the number of references: the entries usually start with a heading like “References”, which is obviously not a reference to another article. Be sure to think carefully about your assumptions and to check the content of your data before you make inferences.

Books work a bit differently. Searching for data on https://www.jstor.org/dfr/results lets you filter for books, which are actually book chapters. If you receive data from DfR on a book chapter, you always get one XML file with the whole book, including data on all chapters. Ngram or full-text data for the same entry, however, is processed only from single chapters [1]. Thus, the output of jst_get_book for a single file is similar to that of jst_get_article: it is one row with general data about the book. jst_get_chapters gives you data on all chapters, and the resulting tibble therefore might have multiple rows.

The following sections showcase the different functions separately.

Application

Apart from jstor we only need to load dplyr for matching records and knitr for printing nice tables.

library(jstor)
library(dplyr)
library(knitr)

jst_get_article

The basic usage of the jst_get_* functions is very simple. They take only one argument, the path to the file to import:
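A minimal sketch of such a call, assuming the sample article ships with the package as sample_with_references.xml (the name is inferred from the file_name column in the output and may differ between package versions):

```r
library(jstor)

# Assumption: the example file is named "sample_with_references.xml";
# jst_example() resolves the name to the full path inside the package
meta_data <- jst_get_article(jst_example("sample_with_references.xml"))

# Number of rows and columns of the resulting tibble
dim(meta_data)
```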

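The resulting object is a tibble with one row and 19 columns. The columns correspond to most of the elements documented here: http://www.jstor.org/dfr/about/technical-specifications.

The columns are file_name, journal_doi, journal_jcode, journal_pub_id, journal_title, article_doi, article_pub_id, article_jcode, article_type, article_title, volume, issue, language, pub_day, pub_month, pub_year, first_page, last_page and page_range.

Since all functions return tibbles, the result is nicely formatted: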

| file_name | journal_doi | journal_jcode | journal_pub_id | journal_title | article_doi | article_pub_id | article_jcode | article_type | article_title | volume | issue | language | pub_day | pub_month | pub_year | first_page | last_page | page_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sample_with_references | NA | tranamermicrsoci | NA | Transactions of the American Microscopical Society | 10.2307/3221896 | NA | NA | research-article | On the Protozoa Parasitic in Frogs | 41 | 2 | eng | 1 | 4 | 1922 | 59 | 76 | 59-76 |

jst_get_authors

Extracting the authors works in similar fashion:
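As a sketch, again assuming the example file is named sample_with_references.xml (inferred from the file_name column, not confirmed by the text):

```r
library(jstor)

# Assumption: same sample file as for jst_get_article
authors <- jst_get_authors(jst_example("sample_with_references.xml"))

# Print as a table
knitr::kable(authors)
```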

| file_name | prefix | given_name | surname | string_name | suffix | author_number |
|---|---|---|---|---|---|---|
| sample_with_references | NA | R. | Kudo | NA | NA | 1 |

Here we have the following columns: file_name, prefix, given_name, surname, string_name, suffix and author_number.

The number of rows matches the number of authors: each author gets its own row.

jst_get_references
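The call follows the same pattern as before. This is a sketch that assumes the sample file is named sample_with_references.xml, as the file_name column suggests:

```r
library(jstor)

# Assumption: example file name inferred from the file_name column
references <- jst_get_references(jst_example("sample_with_references.xml"))

# Show the first five rows as a table
knitr::kable(head(references, 5))
```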

We have two columns: file_name and references.

Here I display the first 5 rows:

| file_name | references |
|---|---|
| sample_with_references | Bibliography: Entamoeba ranarumn |
| sample_with_references | DOBELL, C.C.1909 Researches on the intestinal Protozoa of frogs and toads. Quart. Jour. Micros.Sc., 53:201-276, 4 pl. and 1 textfig. |
| sample_with_references | 1918 Are Entamoeba histolytica and Entamoeba ranarum the same species? An experi-mental study. Parasit., 10:294-310. |
| sample_with_references | References: Leptotheca ohilmacheri |
| sample_with_references | KUDO, R.1920 Studies on Myxosporidia. A Synopsis of Genera and Species of Myxosporidia.ill. Biol. Monogr., 5:243-503, 25 pl. and 2 textfig. |

This example shows several things: file_name is identical among rows, since it identifies the article and all references came from one article.

The content of the references column is in quite a raw state, often the result of digitising scans via OCR. Very often, the first entry in this column is something like “Bibliography” or “References”, which is simply the heading within the article. In the above example there are several headings, because the sample file doesn’t follow a typical convention (it was published in 1922).

Note that other content, such as endnotes, might be present if the article used endnotes rather than footnotes.

jst_get_footnotes

| file_name | footnotes |
|---|---|
| sample_with_references | NA |

Very commonly, articles either have footnotes or references. The sample file used here does not have footnotes, therefore a simple tibble with missing footnotes is returned.

I will use another file to demonstrate footnotes.
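A sketch of the call, assuming the second example file is named sample_with_footnotes.xml (inferred from the file_name column). It prints the first rows and will not necessarily reproduce the exact selection of rows displayed in the table:

```r
library(jstor)

# Assumption: example file name inferred from the file_name column
footnotes <- jst_get_footnotes(jst_example("sample_with_footnotes.xml"))

# Inspect the first few entries
knitr::kable(head(footnotes))
```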

| file_name | footnotes |
|---|---|
| sample_with_footnotes | [Footnotes] |
| sample_with_footnotes | 9Quarterly, vol. XIII, no. 1,entries for April 19 and 21. |
| sample_with_footnotes | 10Quarterly, vol. XIII,no. 1, p. 8. |
| sample_with_footnotes | 14Quarterly, vol. VIII, no. 1.Olympia Columbian, Sept. 11, 1852, |
| sample_with_footnotes | 26Quarterly, vol. XII,no. 2, p. 141. |
| sample_with_footnotes | 32Dr. David S. Maynard, later (March 31, 1852) |
| sample_with_footnotes | 34Thomas Linklater, Shepherd, since October 6, 1849, |

In general, you might need to combine jst_get_footnotes() with jst_get_references() to get all available information on citation data.
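One possible way to do this is to stack both outputs into a single long table. The sketch below is only one option: the column names entry and source are hypothetical names introduced for illustration, and the example file name is an assumption inferred from the outputs above.

```r
library(jstor)
library(dplyr)

# Assumption: example file name inferred from the file_name column
file <- jst_example("sample_with_references.xml")

# Give both outputs a common column name and record where each entry came from
refs <- jst_get_references(file) %>%
  rename(entry = references) %>%
  mutate(source = "references")

notes <- jst_get_footnotes(file) %>%
  rename(entry = footnotes) %>%
  mutate(source = "footnotes")

# Stack both tables and drop rows without any content
citations <- bind_rows(refs, notes) %>%
  filter(!is.na(entry))

count(citations, source)
```

Keeping the source column makes it possible to check later whether an entry was parsed as a reference or as a footnote.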

jst_get_full_text

The function to extract full texts can’t be demonstrated with proper data, since the full texts are only supplied upon special request with DfR. The function guesses the encoding of the specified file via readr::guess_encoding(), reads the whole file and returns a tibble with file_name, full_text and encoding.

I created a file that looks similar to files supplied by DfR with sample text:
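A sketch of the call, assuming the sample text is shipped as sample_full_text.txt (both the name and the .txt extension are assumptions based on the file_name column):

```r
library(jstor)

# Assumption: example file name and extension are inferred, not documented here
full_text <- jst_get_full_text(jst_example("sample_full_text.txt"))

knitr::kable(full_text)
```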

| file_name | full_text | encoding |
|---|---|---|
| sample_full_text | Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laborisnisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sintobcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit animid est laborum. | ASCII |

Combining results

Different parts of the metadata can be combined with dplyr::left_join(), since every output carries the file_name column as an identifier.

Matching with authors
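A sketch of the join, assuming the sample file name used throughout (sample_with_references.xml). Joining explicitly on file_name avoids relying on dplyr's guess about the common column:

```r
library(jstor)
library(dplyr)

# Assumption: example file name inferred from the file_name column
file <- jst_example("sample_with_references.xml")

meta_data <- jst_get_article(file)
authors <- jst_get_authors(file)

# One row per author, with the article-level metadata repeated
article_authors <- meta_data %>%
  left_join(authors, by = "file_name") %>%
  select(file_name, article_title, pub_year, given_name, surname)

knitr::kable(article_authors)
```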

| file_name | article_title | pub_year | given_name | surname |
|---|---|---|---|---|
| sample_with_references | On the Protozoa Parasitic in Frogs | 1922 | R. | Kudo |

Matching with references
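The same pattern works for references. Again a sketch under the assumption that the sample file is named sample_with_references.xml:

```r
library(jstor)
library(dplyr)

# Assumption: example file name inferred from the file_name column
file <- jst_example("sample_with_references.xml")

meta_data <- jst_get_article(file)
references <- jst_get_references(file)

# One row per reference entry, with the article-level metadata repeated
article_refs <- meta_data %>%
  left_join(references, by = "file_name") %>%
  select(file_name, article_title, volume, pub_year, references)

knitr::kable(head(article_refs, 5))
```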

| file_name | article_title | volume | pub_year | references |
|---|---|---|---|---|
| sample_with_references | On the Protozoa Parasitic in Frogs | 41 | 1922 | Bibliography: Entamoeba ranarumn |
| sample_with_references | On the Protozoa Parasitic in Frogs | 41 | 1922 | DOBELL, C.C.1909 Researches on the intestinal Protozoa of frogs and toads. Quart. Jour. Micros.Sc., 53:201-276, 4 pl. and 1 textfig. |
| sample_with_references | On the Protozoa Parasitic in Frogs | 41 | 1922 | 1918 Are Entamoeba histolytica and Entamoeba ranarum the same species? An experi-mental study. Parasit., 10:294-310. |
| sample_with_references | On the Protozoa Parasitic in Frogs | 41 | 1922 | References: Leptotheca ohilmacheri |
| sample_with_references | On the Protozoa Parasitic in Frogs | 41 | 1922 | KUDO, R.1920 Studies on Myxosporidia. A Synopsis of Genera and Species of Myxosporidia.ill. Biol. Monogr., 5:243-503, 25 pl. and 2 textfig. |

Books

Quite recently DfR added book chapters to their stack. To import metadata about the books and chapters, jstor supplies jst_get_book and jst_get_chapters.

jst_get_book is very similar to jst_get_article. We obtain general information about the complete book:

jst_get_book(jst_example("sample_book.xml")) %>% knitr::kable()

| book_id | file_name | discipline | book_title | book_subtitle | pub_day | pub_month | pub_year | isbn | publisher_name | publisher_location | n_pages | language |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| j.ctt24hdz7 | sample_book | Political Science | The 2006 Military Takeover in Fiji | A Coup to End All Coups? | 30 | 4 | 2009 | 9781921536502; 9781921536519 | ANU E Press | Canberra | NA | eng |

A single book might contain many chapters. jst_get_chapters extracts all of them. Due to this, the function is a bit slower than most of jstor’s other functions.

chapters <- jst_get_chapters(jst_example("sample_book.xml"))

str(chapters)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    36 obs. of  9 variables:
#>  $ book_id        : chr  "j.ctt24hdz7" "j.ctt24hdz7" "j.ctt24hdz7" "j.ctt24hdz7" ...
#>  $ file_name      : chr  "sample_book" "sample_book" "sample_book" "sample_book" ...
#>  $ part_id        : chr  "j.ctt24hdz7.1" "j.ctt24hdz7.2" "j.ctt24hdz7.3" "j.ctt24hdz7.4" ...
#>  $ part_label     : chr  NA NA NA NA ...
#>  $ part_title     : chr  "Front Matter" "Table of Contents" "Acronyms and abbreviations" "Authors’ biographies" ...
#>  $ part_subtitle  : chr  NA NA NA NA ...
#>  $ authors        : chr  NA NA NA NA ...
#>  $ abstract       : chr  NA NA NA NA ...
#>  $ part_first_page: chr  "i" "v" "vii" "xi" ...

Without the abstracts (they are rather long) the first 10 chapters look like this:

chapters %>% 
  select(-abstract) %>% 
  head(10) %>% 
  kable()

| book_id | file_name | part_id | part_label | part_title | part_subtitle | authors | part_first_page |
|---|---|---|---|---|---|---|---|
| j.ctt24hdz7 | sample_book | j.ctt24hdz7.1 | NA | Front Matter | NA | NA | i |
| j.ctt24hdz7 | sample_book | j.ctt24hdz7.2 | NA | Table of Contents | NA | NA | v |
| j.ctt24hdz7 | sample_book | j.ctt24hdz7.3 | NA | Acronyms and abbreviations | NA | NA | vii |
| j.ctt24hdz7 | sample_book | j.ctt24hdz7.4 | NA | Authors’ biographies | NA | NA | xi |
| j.ctt24hdz7 | sample_book | j.ctt24hdz7.5 | 1. | The enigmas of Fiji’s good governance coup | NA | NA | 3 |
| j.ctt24hdz7 | sample_book | j.ctt24hdz7.6 | 2. | ‘Anxiety, uncertainty and fear in our land’: | Fiji’s road to military coup, 2006 | NA | 21 |
| j.ctt24hdz7 | sample_book | j.ctt24hdz7.7 | 3. | Fiji’s December 2006 coup: | Who, what, where and why? | NA | 43 |
| j.ctt24hdz7 | sample_book | j.ctt24hdz7.8 | 4. | ‘This process of political readjustment’: | The aftermath of the 2006 Fiji Coup | NA | 67 |
| j.ctt24hdz7 | sample_book | j.ctt24hdz7.9 | 5. | The changing role of the Great Council of Chiefs | NA | NA | 97 |
| j.ctt24hdz7 | sample_book | j.ctt24hdz7.10 | 6. | The Fiji military and ethno-nationalism: | Analyzing the paradox | NA | 117 |

Since extracting all authors for all chapters needs considerably more time, by default authors are not extracted. You can import them like so:

author_chap <- jst_get_chapters(jst_example("sample_book.xml"), authors = TRUE) 

The authors are supplied in a list column:

class(author_chap$authors)
#> [1] "list"

You can expand this list with tidyr::unnest:

# Name the list column explicitly; newer tidyr versions warn if it is omitted
author_chap %>%
  tidyr::unnest(authors) %>%
  select(part_id, given_name, surname) %>%
  head(10) %>%
  kable()

| part_id | given_name | surname |
|---|---|---|
| j.ctt24hdz7.1 | NA | NA |
| j.ctt24hdz7.2 | NA | NA |
| j.ctt24hdz7.3 | NA | NA |
| j.ctt24hdz7.4 | NA | NA |
| j.ctt24hdz7.5 | Jon | Fraenkel |
| j.ctt24hdz7.5 | Stewart | Firth |
| j.ctt24hdz7.6 | Brij V. | Lal |
| j.ctt24hdz7.7 | Jon | Fraenkel |
| j.ctt24hdz7.8 | Brij V. | Lal |
| j.ctt24hdz7.9 | Robert | Norton |

You can learn more about the concept of list-columns in Hadley Wickham’s book R for Data Science.


  1. See the technical specifications for more detail.