How a quanteda corpus works

A corpus object is designed to store the original texts unchanged. The only processing applied to them is conversion of their encoding to a common format.

A quanteda corpus can store settings, metadata, and document variables, and it can be indexed. It can also be linked to dictionaries, collocation lists, and custom stop words.

Currently available corpus sources

quanteda has tools for getting texts into a corpus from a variety of sources:

From a character object already in memory

The simplest case is to create a corpus from a vector of texts already in memory in R. This gives the advanced R user complete flexibility in the choice of text inputs, as there are almost endless ways to get a vector of texts into R.
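As a minimal illustration of what such a vector looks like, the sketch below builds one by hand in base R. The three texts and the names doc1 to doc3 are made up for this example and are not part of quanteda.

```r
# Hypothetical example texts; any character vector works the same way
myTexts <- c(doc1 = "This is the first document.",
             doc2 = "Here is a second, slightly longer document.",
             doc3 = "And a third.")

is.character(myTexts)   # confirms this is a plain character vector
length(myTexts)         # one element per document
names(myTexts)          # element names, usable as document labels
nchar(myTexts)          # number of characters in each text
```

A named vector is convenient because the element names travel with the texts, as the names of inaugTexts do in the example that follows.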

If we already have the texts in this form, we can call the corpus constructor function directly. We can demonstrate this on the built-in character vector of 57 US presidential inaugural speeches called inaugTexts.

str(inaugTexts)  # this gives us some information about the object
#>  Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
#>  - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
myCorpus <- corpus(inaugTexts)  # build the corpus
summary(myCorpus, n=5)
#> Corpus consisting of 57 documents, showing 5 documents.
#> 
#>             Text Types Tokens Sentences
#>  1789-Washington   594   1429        23
#>  1793-Washington    90    135         4
#>       1797-Adams   794   2318        37
#>   1801-Jefferson   681   1726        41
#>   1805-Jefferson   775   2166        45
#> 
#> Source:  /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/RtmpWCKmvH/Rbuild1744472712f61/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Tue Jun  2 10:47:00 2015.
#> Notes:   .

If we wanted, we could add some document-level variables (what quanteda calls docvars) to this corpus. We can do this using R's substring() function to extract characters from the names of the character vector inaugTexts. Fixed starting and ending positions work with substring() here because these names follow a very regular format of YYYY-PresidentName.

docvars(myCorpus, "President") <- substring(names(inaugTexts), 6)
docvars(myCorpus, "Year") <- as.integer(substring(names(inaugTexts), 1, 4))
summary(myCorpus, n=5)
#> Corpus consisting of 57 documents, showing 5 documents.
#> 
#>             Text Types Tokens Sentences  President Year
#>  1789-Washington   594   1429        23 Washington 1789
#>  1793-Washington    90    135         4 Washington 1793
#>       1797-Adams   794   2318        37      Adams 1797
#>   1801-Jefferson   681   1726        41  Jefferson 1801
#>   1805-Jefferson   775   2166        45  Jefferson 1805
#> 
#> Source:  /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/RtmpWCKmvH/Rbuild1744472712f61/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Tue Jun  2 10:47:00 2015.
#> Notes:   .
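To see what the two substring() calls return on their own, here is a small base R sketch. The vector docnames is a hypothetical stand-in that uses the same YYYY-PresidentName format as names(inaugTexts).

```r
# Hypothetical names in the same YYYY-PresidentName format
docnames <- c("1789-Washington", "1797-Adams", "1801-Jefferson")

# Characters 1 to 4 hold the year; convert from character to integer
year <- as.integer(substring(docnames, 1, 4))

# With no end position, substring() runs from character 6 to the end
president <- substring(docnames, 6)

data.frame(docname = docnames, Year = year, President = president)
```

This only works because the year always occupies exactly four characters followed by a single separator. Names with a less regular layout would need a different approach.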

We may also want to tag each document with additional metadata that is not a document variable of interest for analysis, but rather an attribute of the document that we need to keep track of. Such document-level metadata can also be added to the corpus.

language(myCorpus) <- "english"
metadoc(myCorpus, "docsource")  <- paste("inaugTexts", 1:ndoc(myCorpus), sep="_")
summary(myCorpus, n=5, showmeta=TRUE)
#> Corpus consisting of 57 documents, showing 5 documents.
#> 
#>             Text Types Tokens Sentences  President Year _language
#>  1789-Washington   594   1429        23 Washington 1789   english
#>  1793-Washington    90    135         4 Washington 1793   english
#>       1797-Adams   794   2318        37      Adams 1797   english
#>   1801-Jefferson   681   1726        41  Jefferson 1801   english
#>   1805-Jefferson   775   2166        45  Jefferson 1805   english
#>    _docsource
#>  inaugTexts_1
#>  inaugTexts_2
#>  inaugTexts_3
#>  inaugTexts_4
#>  inaugTexts_5
#> 
#> Source:  /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/RtmpWCKmvH/Rbuild1744472712f61/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Tue Jun  2 10:47:00 2015.
#> Notes:   .

The last command, metadoc, allows you to define your own document metadata fields. The two document metadata fields language and encoding are so common that quanteda provides shortcut accessor and replacement functions for manipulating them: encoding() and language(). Note that in assigning just the single value of "english", R has recycled the value until it matches the number of documents in the corpus. In creating a simple tag for our custom metadoc field docsource, we used the quanteda function ndoc() to retrieve the number of documents in our corpus. This function is deliberately designed to work in a way similar to functions you may already use in R, such as nrow() and ncol().
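The recycling behaviour described above is ordinary R behaviour and can be seen without a corpus. The following base R sketch uses a hypothetical data frame called meta, with n standing in for the value that ndoc() would return.

```r
# A stand-in for five documents (hypothetical; a corpus would report ndoc())
n <- 5

# A single value is recycled to fill a whole data frame column
meta <- data.frame(doc = seq_len(n))
meta$language <- "english"

# paste() is vectorised too: the fixed prefix is recycled against 1:n
meta$docsource <- paste("inaugTexts", seq_len(n), sep = "_")

meta
nrow(meta)   # plays the role for a data frame that ndoc() plays for a corpus
```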

Tools for handling corpus objects

Adding two corpus objects together

The + operator provides a simple method for concatenating two corpus objects. If they contain different sets of document-level variables, these will be stitched together in a fashion that guarantees that no information is lost. Corpus-level metadata is also concatenated.

library(quanteda)
mycorpus1 <- corpus(inaugTexts[1:5], note="First five inaug speeches")
mycorpus2 <- corpus(inaugTexts[6:10], note="Next five inaug speeches")
mycorpus3 <- mycorpus1 + mycorpus2
summary(mycorpus3)
#> Corpus consisting of 10 documents.
#> 
#>             Text Types Tokens Sentences
#>  1789-Washington   594   1429        23
#>  1793-Washington    90    135         4
#>       1797-Adams   794   2318        37
#>   1801-Jefferson   681   1726        41
#>   1805-Jefferson   775   2166        45
#>     1809-Madison   520   1175        21
#>     1813-Madison   518   1210        33
#>      1817-Monroe   980   3370       122
#>      1821-Monroe  1190   4457       131
#>       1825-Adams   962   2915        74
#> 
#> Source:  Combination of corpuses mycorpus1 and mycorpus2.
#> Created: Tue Jun  2 10:47:00 2015.
#> Notes:   First five inaug speeches Next five inaug speeches.
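To give an idea of what stitching together different sets of document-level variables can mean, here is a base R sketch using plain data frames. It is an illustration of the general idea only, not quanteda's implementation; the data frames vars1 and vars2 and the helper fill() are hypothetical.

```r
# Hypothetical document variables for two small corpora
vars1 <- data.frame(Year = c(1789L, 1793L),
                    President = c("Washington", "Washington"),
                    stringsAsFactors = FALSE)
vars2 <- data.frame(Year = c(1809L, 1813L),
                    Term = c(1L, 2L),
                    stringsAsFactors = FALSE)

# Add each missing column as NA so that both tables share the same columns
allCols <- union(names(vars1), names(vars2))
fill <- function(df) {
    df[setdiff(allCols, names(df))] <- NA
    df[allCols]
}

# Stack the two tables; no column from either input is dropped
combined <- rbind(fill(vars1), fill(vars2))
combined
```

Variables that exist in only one of the inputs are kept, with NA for the documents that never had them.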

Extracting a subset of a corpus

subset

Indexing a corpus

Coming soon

Managing settings in a corpus

Coming soon

Redefining document units

segment

changeunits

Methods for analyzing a corpus directly

Getting simple information

print

summary

ndoc and nfeature

Extracting data

texts docvars metacorpus metadoc

Exploring a corpus

kwic

Dispersion plots – coming soon.

Operations on the corpus texts

Creating a corpus fr

Often, texts aren’t available as pre-made R character vectors, and we need to load them from an external source. To do this, we first create a source for the documents, which defines how they are loaded into the corpus. The source may be a character vector, a directory of text files, a zip file, a Twitter search, or an object from an external package, such as tm’s VCorpus.

Once a source has been defined, we make a new corpus by calling the corpus constructor with the source as the first argument. The corpus constructor also accepts arguments which can set some corpus metadata, and define how the document variables are set.

From a directory of files

A very common source of files for creating a corpus is a set of text files in a local (or remote) directory. To load texts in this way, we first define a source for the directory and then pass this source as an argument to the corpus constructor. In the example below, the source is created by calling the textfile() function with a file-name pattern.

# Basic file import from directory
d <- textfile('~/Dropbox/QUANTESS/corpora/inaugural/*.txt')
myCorpus <- corpus(d)
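For readers who want to see what loading a set of files by wildcard pattern involves, here is a base R sketch. It does not show how textfile() is implemented. The directory inaugural_demo and the two files written to it are hypothetical and exist only for this example.

```r
# Write two small hypothetical files to a temporary directory
txtDir <- file.path(tempdir(), "inaugural_demo")
dir.create(txtDir, showWarnings = FALSE)
writeLines("First example text.", file.path(txtDir, "1789-Washington.txt"))
writeLines("Second example text.", file.path(txtDir, "1797-Adams.txt"))

# Expand the wildcard pattern into a vector of matching file paths
files <- Sys.glob(file.path(txtDir, "*.txt"))

# Read each file into a single string, one element per document
texts <- vapply(files,
                function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
                character(1))

# Label each text with its filename rather than its full path
names(texts) <- basename(files)
texts
```

The result is a named character vector of the kind that the corpus constructor accepts directly.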

If the document variables are specified in the filenames of the texts, we can read them by setting the docvarsfrom argument (docvarsfrom = "filenames") and specifying how the filenames are formatted with the sep argument. For example, if the inaugural address texts were stored on disk in the format Year-President.txt (e.g. 1973-Nixon.txt), then we can load them and automatically populate the document variables. The docvarnames argument sets the names of the document variables. It must contain one name for each part of the filename.

# File import reading document variables from filenames
d <- textfile('~/Dropbox/QUANTESS/corpora/inaugural/*.txt')

# In this example the format of the filenames is `Year-President.txt`.
# Because there are two variables in the filename, docvarnames must contain two names
myCorpus <- corpus(d, docvarsfrom = "filenames", sep = "-", docvarnames = c("Year", "President"))
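The logic of splitting filenames into document variables can be sketched in base R as follows. This shows the idea only and is not quanteda's own code; the filenames vector is hypothetical, and docvarnames here is an ordinary variable that mirrors the argument of the same name.

```r
# Hypothetical filenames in Year-President.txt format
filenames <- c("1969-Nixon.txt", "1973-Nixon.txt", "1977-Carter.txt")
docvarnames <- c("Year", "President")

# Drop the extension, then split each name on the separator
parts <- strsplit(tools::file_path_sans_ext(filenames), "-", fixed = TRUE)

# Every filename must yield exactly one part per variable name
stopifnot(all(lengths(parts) == length(docvarnames)))

# Bind the parts into a data frame, one row per document
vars <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(vars) <- docvarnames
vars$Year <- as.integer(vars$Year)
vars
```

The length check makes the requirement explicit: a filename with more or fewer parts than there are variable names cannot be mapped onto the document variables.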