The purpose of the package polmineR is to facilitate the interactive analysis of corpora using R. Apart from performance and usability, key considerations for developing the package are:
To provide a library with standard tasks such as concordances/keyword-in-context, cooccurrence statistics, or keyword extraction.
To keep the original text accessible and to offer a seamless integration of qualitative and quantitative steps in corpus analysis that facilitates validation.
To create a package that makes the creation and analysis of subcorpora (‘partitions’) easy. A particular strength of the package is to support research on synchronic and diachronic change of language use.
To offer performance for users with a standard infrastructure. The package picks up the idea of a three-tier software design. Corpus data are managed and indexed using the Corpus Workbench (CWB).
To support sharing consolidated and documented data, following the ideas of reproducible research.
The ‘polmineR’ package supplements R packages that are already widely used for text mining. The CRAN task view is a good place to learn about relevant packages, see CRAN. The polmineR is intended to be an interface between the Corpus Workbench (CWB), an efficient system for storing and querying large corpora, and existing packages for mining text with advanced statistical methods.
Apart from the speed of text processing, the Corpus Query Processor (CQP) and the CQP syntax provide a powerful and widely used syntax to query corpora. This is not an unique idea. Using a combination of R and the CWB implies a software architecture you will also find in the TXM project, or with CQPweb. The ‘polmineR’ package offers a library with the grammer of corpus analysis below a graphical user interface (GUI). It is a toolset to perform simple tasts efficiently as well as to implement complex workflows.
Advanced user will need a good understanding of the Corpus Workbench. The Corpus Encoding Tutorial is an authoritative text for that. The vignette of the rcqp package includes an excellent explanation of the CWB data-model.
The most important thing users need to now is the difference between “s” and “p” attributes. The CWB distinguishes structural attributes (s-attributes) that will contain the metainformation that can be used to generate subcorpora, and positional attributes (p-attributes). Typically, the p-attributes will be ‘word’, ‘pos’ (for part-of-speech) and ‘lemma’ (for the lemmatized word form).
The annex of the vignette includes a detailed explanation how to install polmineR on Windows, MacOS, and Linux. Once you have installed polmineR, check that the environment variable CORPUS_REGISTRY is set.
Sys.getenv("CORPUS_REGISTRY")
The CORPUS_REGISTRY environment variable supplies the directory with registry files that describe where the CWB will find the files of an indexed corpus, and the s- and p-attributes. See the annex for an explanation how to set the CORPUS_REGISTRY environment variable for the current R session, or permanently.
If the CORPUS_REGISTRY variable is set correctly, i.e. pointing to the directory with the registry files describing the corpora, load the polmineR.
library(polmineR)
If you want to use a CWB corpus packaged in a R data package, you can call ‘use’ with the name of the R package. To access the corpus in the data package, the CORPUS_REGISTRY environment variable will be reset. In the followings examples, the CWB encoded English version of the Europarl wrapped into a data package will be used.
use("europarl.en")
Note that the use-function will call the resetRegistry-function that can also be used to set again the original path to the directory with registry files. If you want to use the English Europarl corpus, you can download it from a repository at the PolMine server.
install.corpus("europarl.en", repo = "http://polmine.sowi.uni-due.de/packages")
This package may serve as an example how CWB indexed corpora can be shared using an adapted version of standard R functions. Data packages with corpora have a version number which may be important for reproducing results, they can include a vignette documenting the data, and functions to perform specialized tasks.
The standard interface used by the polmineR package to extract information from CWB indexed corpora is the package ‘rcqp’. The interface is defined in a class called ‘CQI’. To check which interface is used:
class(CQI)
## [1] "CQI.rcqp" "CQI.super" "R6"
If you see “CQI.perl” leading the character vector that is returned, something went wrong. Accessing the corpora using perl scripts incurs an incredible performence loss. Reset the interface as follows:
unlockBinding(env = getNamespace("polmineR"), sym = "CQI")
assign("CQI", CQI.rcqp$new(), envir = getNamespace("polmineR"))
lockBinding(env = getNamespace("polmineR"), sym = "CQI")
}
An alternative interface is provided by the package ‘polmineR.Rcpp’, [available at GitHub][https://github.com/PolMine/polmineR.Rcpp] so far. This (experimental) package includes a few functions that speed up tasks such as counting terms, or preparing partitions. To install and use ‘polmineR.Rcpp’:
devtools::install_github("PolMine/polmineR.Rcpp")
setCorpusWorkbenchInterface("Rcpp")
To switch to the interface offered by polmineR.Rcpp, proceed as follows:
unlockBinding(env = getNamespace("polmineR"), sym = "CQI")
assign("CQI", CQI.Rcpp$new(), envir = getNamespace("polmineR"))
lockBinding(env = getNamespace("polmineR"), sym = "CQI")
}
The checks performed when submitting a package at CRAN issue a note when the ‘unlockBinding’ function appears in the source code of the package. This is why a function for reset the interface is not included in the package.
Use the corpus-method to check whether which corpora are accessible. It should be the EUROPARL-EN corpus in our case (the names of CWB corpora are always written upper case).
corpus()
## corpus size template
## 1 EUROPARL-EN 39431862 FALSE
Many functions in the polmineR package use settings that are stored in the general options settings. You can see these settings as follows:
options()[grep("polmineR", names(options()))]
Several methods (such as kwic, or cooccurrences) will use these settings, if no explicit other value is provided. Here are a few examples how to change settings.
options("polmineR.left" = 15)
options("polmineR.right" = 15)
options("polmineR.mc" = FALSE)
To speed up computations, the polmineR package will sometimes try to use alternative, faster ways to access CWB corpora than the rcqp package. A few computations that are performance critical (for setting up partitions, or to count term frequencies) are implemented in the plugin package polmineR.Rcpp that has been mentioned before. It can be installed from GitHub. Note that it requires an installation of the Corpus Workbench to be present on your system (see installation instructions in the annex).
devtools::install_github("PolMineR/polmineR")
When loading the polmineR package, it is checked whether that plugin is present.
getOption("polmineR.Rcpp")
If you want to suppress using the polmineR.Rcpp functionality:
options("polmineR.Rcpp" = FALSE)
Core analytical tasks are implemented as methods (S4 class system), i.e. the bevaviour of the methods changes depending on the object that is supplied. Almost all methods can be applied to corpora as well as partitions (subcorpora). As an easy entry, methods applied to corpora are explained first.
The kwic method applied to the name of a corpus will return a KWIC object, output will be shown in the viewer pane of RStudio. You can include metadata from the corpus using the ‘meta’ parameter.
kwic("EUROPARL-EN", "Islam")
kwic("EUROPARL-EN", "Islam", meta = c("text_date", "speaker_name"))
You can also use the CQP query syntax for formulating queries. That way, you can find multi-word expressions, or match in a manner you may know from using regular expressions.
kwic("EUROPARL-EN", '"Geneva" "Convention"')
kwic("EUROPARL-EN", '"[Ss]ocial" "justice"')
Explaining the CQP syntax goes beyon this vignette. Consult the CQP tutorial to learn more about the CQP syntax.
You can count one or several hits in a corpus.
count("EUROPARL-EN", "France")
## query count freq
## 1: France 5517 0.0001399122
count("EUROPARL-EN", c("France", "Germany", "Britain", "Spain", "Italy", "Denmark", "Poland"))
## Warning in if (cqp == FALSE) {: Bedingung hat Länge > 1 und nur das erste
## Element wird benutzt
## query count freq
## 1: France 5517 1.399122e-04
## 2: Germany 4196 1.064114e-04
## 3: Britain 1708 4.331523e-05
## 4: Spain 3378 8.566676e-05
## 5: Italy 3209 8.138089e-05
## 6: Denmark 1615 4.095673e-05
## 7: Poland 1820 4.615557e-05
count("EUROPARL-EN", '"[pP]opulism"')
## query count freq
## 1: "[pP]opulism" 107 2.713542e-06
… get dispersions of counts accross one (or two) dimensions …
pop <- dispersion("EUROPARL-EN", "populism", sAttribute = "text_year", progress = FALSE)
popRegex <- dispersion("EUROPARL-EN", '"[pP]opulism"', sAttribute = "text_year", cqp = TRUE, progress = FALSE)
Note that it is a data.table that is returned. Visualising the result as a barplot …
barplot(height = popRegex[,count], names.arg = popRegex[,text_year], las = 2)
… get cooccurrence statistics …
br <- cooccurrences("EUROPARL-EN", query = "Brussels")
eu <- cooccurrences("EUROPARL-EN", query = '"European" "Union"', left = 10, right = 10)
subset(eu, rank_ll <= 100)@stat[["word"]][1:15]
## [1] "the" "of" "within" "States" "The"
## [6] "its" "countries" "in" "between" "Member"
## [11] "s" "relations" "'s" "policy" "citizens"
Easily creating partitions (i.e. subcorpora) based on s-attributes is a strength of the ‘polmineR’ package. So if we want to work with the speeches given in the European Parliament in 2006:
ep2006 <- partition("EUROPARL-EN", text_year = "2006")
To get some basic information about the partition that has been set up, the ‘show’-method can be used. It is also called when you simply type the name of the partition object.
ep2006
## ** partition object **
## corpus: EUROPARL-EN
## name:
## sAttributes: text_year = 2006
## cpos: 46 pairs of corpus positions
## size: 3100529 tokens
## count: not available
To evaluate s-attributes, regular expressions can be used.
barroso <- partition("EUROPARL-EN", speaker_name = "Barroso", regex = TRUE)
sAttributes(barroso, "speaker_name")
## [1] "Barroso" "José Manuel Barroso"
If you work with a flat XML structure, the order of the provided s-attributes may be relevant for speeding up the set up of the partition. For a nested XML, it is important that with the order, you move from ancestors to childs. For further information, see the documentation of the partition-function.
The cooccurrences-method can be applied to partition-objects.
ep2002 <- partition("EUROPARL-EN", text_year = "2006")
terror <- cooccurrences(ep2002, "terrorism", pAttribute = "lemma", left = 10, right = 10)
Note that is is possible to provide a query that uses the full CQP syntax. The statistical analysis of collocations to the query can be accessed as the slot “stat” of the context object.
terror@stat[1:10,][,.(lemma, count_partition, rank_ll)]
## lemma count_partition rank_ll
## 1: fight 1047 1
## 2: combat 623 2
## 3: crime 533 3
## 4: organise 441 4
## 5: war 449 5
## 6: threat 445 6
## 7: immigration 601 7
## 8: international 1876 8
## 9: human 2733 9
## 10: right 5711 10
To understand the occurance of a phenomenon, the distribution of query results across one or two dimensions will often be interesing. This is done via the ‘distribution’ function. The query may use the CQP syntax.
# one query / one dimension
oneQuery <- dispersion(ep2002, query = 'terrorism', "text_date", progress = FALSE)
# # multiple queries / one dimension
twoQueries <- dispersion(ep2002, query= c("war", "peace"), "text_date", progress = FALSE)
To identify the specific vocabulary of a corpus of interest, a statistical test based (chi square, or log likelihood) can be performed.
ep2002 <- partition("EUROPARL-EN", text_year = "2002")
ep2002 <- enrich(ep2002, pAttribute = "word")
epPre911 <- partition("EUROPARL-EN", text_year = as.character(1997:2001))
epPre911 <- enrich(epPre911, pAttribute = "word")
F <- features(ep2002, epPre911, included = FALSE)
subset(F, rank_chisquare <= 50)@stat[["word"]]
## [1] "2002" "Johannesburg" "Seville" "Barcelona"
## [5] "'s" "2003" "Copenhagen" "terrorism"
## [9] "02" "candidate" "Spanish" "2002/"
## [13] "Palestinian" "Kaliningrad" "," "Aznar"
## [17] "enlargement" "asbestos" "Danish" "Sharon"
## [21] "Prestige" "biofuels" "2004" "Convention"
## [25] "2003." "Galicia" "2001/" "137(1"
## [29] "Summit" "Iraq" "Criminal" "GM"
## [33] "Sky" "Presidency" "competences" "ICC"
## [37] "Galileo" "Cotonou" "Haarder" "Israel"
## [41] "sustainable" "however" "cod" "Valencia"
## [45] "Arafat" "mid-term" "COM(2001" "Afghanistan"
## [49] "Explanation" "medicinal"
For many applications, term-document matrices are the point of departure. The tm class TermDocumentMatrix serves as an input to several R packages implementing advanced text mining techniques. Obtaining this input from a corpus imported to the CWB will usually involve setting up a partitionBundle and then applying a method to get the matrix.
speakers <- partitionBundle(
"EUROPARL-EN", sAttribute = "speaker_id",
progress = FALSE, verbose = FALSE
)
speakers <- enrich(speakers, pAttribute = "word")
tdm <- as.TermDocumentMatrix(speakers, col = "count")
class(tdm) # to see what it is
show(tdm)
m <- as.matrix(tdm) # turn it into an ordinary matrix
m[c("Barroso", "Schulz"),]
The package includes many features that go beyond this vignette. It is a key aim in the project to develop respective documentation in the vignette and the man pages for the individual functions further. Feedback is very welcome!
At this stage, an easy way to install polmineR is available only for 32bit R. Usually, an R installation will include both 32bit and 64bit R. So if you want to keep things simple, make sure that you work with 32bit version. If you work with RStudio (highly recommended), the menu Tools > Global Options will open a dialogue where you can choose 32bit R.
Before installing polmineR, the package ‘rcqp’ needs to be installed. In turn, rcqp requires plyr, which should be installed first.
install.packages("plyr")
To avoid compiling C code in a package, packages with compiled binaries are very handy. Windows binaries for the rcqp package are not available at CRAN, but can be installed from a repository of packages entertained at the server of the PolMine project:
install.packages("rcqp", repos = "http://polmine.sowi.uni-due.de/packages", type = "win.binary")
To explain: Compiling the C code in the rcqp package on a windows machine is not yet possible. The package we offer uses a cross-compilation of these C libraries, i.e. binaries that have been prepared for windows on a MacOS/Linux machine.
Before proceeding to install polmineR, we install dependencies that are not installed automatically.
install.packages(pkgs = c("htmltools", "htmlwidgets", "magrittr", "iterators", "NLP"))
install.packages("rcqp")
The latest stable version of polmineR can now be installed from CRAN. Several other packages that polmineR depends on, or that dependencies depend on may be installed automatically.
install.packages("polmineR")
The development version of the package, which may include the most recent updates and features, can be installed from GitHub. The easiest way to do this is to use a mechanism offered by the package devtools.
install.packages("devtools")
devtools::install_github("PolMine/polmineR", ref = "dev")
The installation may throw warnings. There are three warnings you can ignore at this stage:
The configure script is for Linux/MacOS installation, its sole purpose is to pass tests for uploading the package to CRAN. As mentioned, windows binaries are not yet available for 64bit R at present, so that can be ignored. The environment variable “CORPUS_REGISTRY” can be set as follows in R:
Sys.setenv(CORPUS_REGISTRY = "C:/PATH/TO/YOUR/REGISTRY")
To set the environment variable CORPUS_REGISTRY permanently, see the instructions R offer how to find the file ‘.Renviron’ or ‘.Renviron.site’ when calling the help for the startup process(?Startup
).
Two important notes concerning problems with the CORPUS_REGISTRY environment variable that may cause serious headaches:
The path can not be processed, if there is any whitespace in the path pointing to the registry. Whitespace may occur in the user name (“C:/Users/Donald Duck/Documents”), for instance. We do not yet know any workaround to make rcqp/CWB process whitespace. The recommendation is to create a directory at a path without whitespace to keep the registry and the indexed_corpora (a directory such as “C:/cwb”).
If you keep data on another volume than your system files, your R packages etc. (eg. volume ‘C:’ for system files, and ‘D:’ for data and user files), make sure to set the working directory (setwd()
) is set to any directory on the volume with the directory defined via CORPUS_REGISTRY. CWB/rcqp will assume that the CORPUS_REGISTRY directory is on the same volume as the current working directory (which can be identified by calling getwd()
).
Finally: polmineR if optimized for working with RStudio. It you work with 32bit R, you may have to check in the settings of RStudio that it will call 32bit R. To be sure, check the startup message.
If everything works, check whether polmineR can be loaded.
library(polmineR)
corpus() # to see corpora available at your system
At this stage, 64 bit support is still experimental. Apart from an installation of 64 bit R, you will need to install Rtools, available here. Rtools is a collection of tools necessary to build and compile R packages on a Windows machine.
To interface to a core C library of the Corpus Workbench (CWB), you will need an installation of a 64 bit AND a 32 bit version of the CWB.
The “official” 32 bit version of the CWB is available here. Installation instructions are available at the CWB Website. The 32 bit version should be installed in the directory “C:Files”, with admin rights.
The 64 bit version, prepared by Andreas Blaette, is available here. Install this 64 bit CWB version to “C:Files (x86)”. In the unzipped downloaded zip file, you will find a bat file that will do the installation. Take care that you run the file with administrator rights. Without these rights, no files will be copied.
The interface to the Corpus Workbench is the package polmineR.Rcpp, available at GitHub. If you use git, you can clone that repository, otherwise, you can download a zip file.
The downloaded zip file needs to be unzipped again. Then, in the directory with the ‘polmineR.Rcpp’-directory, run:
R CMD build polmineR.Rcpp
R CMD INSTALL polmineR.Rcpp_0.1.0.tar.gz
If you read closely what is going on during the compilation, you will see a few warnings that libraries are not found. If creating the package is not aborted, nothing is wrong. R CMD build will look for the 64 bit files in the directory with the 32 bit dlls first and discover that they do not work for 64 bit, only then will it move to the correct location.
One polmineR.Rcpp is installed, proceed with the instructions for installing polmineR in a 32 bit context. Future binary releases of the polmineR.Rcpp package may make things easier. Anyway, the proof of concept is there that polmineR will work on a 64 bit Windows machine too.
Finally, you need to make sure that polmineR will interface to CWB indexed corpora using polmineR.Rcpp, and not with rcqp (the default). To set the interface accordingly:
setCorpusWorkbenchInterface("Rcpp")
To test whether corpora are available:
corpus()
## corpus size template
## 1 EUROPARL-EN 39431862 FALSE
If R is installed on your system, the easiest way to get and install polmineR is to download binaries for Mac from the repository we host at the PolMine server
install.packages("polmineR", repos = "http://polmine.sowi.uni-due.de", type = "binary")
If you want to work with the most recent development version, or if you want to to compile the package yourself, the procedure is as follows.
First, you will need an installation of Xcode, which you can get it via the Mac App Store.
To compile the C library in the rcqp package, there are system requirements that need to be fulfilled. Using a package manager such as Macports or brew makes things much easier.
If you want to use Macports, get it from https://www.macports.org/. After installing Macports, it is necessary to restart the computer.
We will also need the Command Line Tool for Xcode. It can be installed from a terminal with:
xcode-select -- install
Next, an update of Macports is necessary.
sudo port -v selfupdate
Now we can install the libraries rcqp will require. Again, from the terminal.
sudo port install glib2
sudo port install pkgconfig
sudo port install pcre
Alternatively, if you prefer brew (which we do), after installing brew:
brew -v install pkg-config
brew -v install glib --universal
brew -v install pcre --universal
brew -v install readline
One the system requirements are there, the next steps can be done from R. Before installing rcqp, and then polmineR, we install a few packages manually. In the R console:
install.packages(pkgs = c("RUnit", "devtools", "plyr", "tm“))
Now rcqp can be installed, and then polmineR
install.packages("rcqp")
install.packages("polmineR")
If you like to work with the development version, that can be installed from GitHub.
devtools::install_github("PolMine/polmineR", ref = "dev")
The pcre, glib and pkg-config libraries can be installed using apt-get.
sudo apt-get install libglib2.0-dev
sudo apt-get install libssl-dev
sudo apt-get install libcurl4-openssl-dev
The system requirements will now be fulfilled, to install dependencies for rcqp/polmineR, and then the two packages from within R.
install.packages("RUnit", "devtools", "plyr", "tm")
install.packages("rcqp")
install.packages("polmineR")
Indexed corpora can be stored in two different locations. The conventional way is to keep CWB corpora in a directory with two subdirectories, a ‘registry’ directory, and an ‘indexed_corpora’ directory. The files in the registry directory (‘registry’ in short) describe the main features of a corpus, and where it is stored. It is necessary to inform rcqp, the package used by polmineR to access corpora, about the registry directory. That is done using the CORPUS_REGISTRY environment variable. It needs be defined before loading rcqp and polmineR.
The CORPUS_REGISTRY environment variable can be set manually from the R console:
Sys.setenv(CORPUS_REGISTRY "/PATH/TO/YOUR/REGISTRY/DIRECTORY")
To check whether and how the environment variable is set:
Sys.getenv("CORPUS_REGISTRY")
## [1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/europarl.en/extdata/cwb/registry"
You can set the environment variable permanently to avoid having to set it each time before you want to use polmineR. A good way is to inlude the following line in the file .Renviron in your home directory:
CORPUS_REGISTRY="/PATH/TO/YOUR/REGISTRY/DIRECTORY"
There are a few other options to have environment variables set at every time you launch polmineR. To learn about these, use the help for the R startup procedure.
?Startup
library(polmineR)
CQI <- CQI.rcqp$new()
use("polmineR.sampleCorpus")