OpenCPU

Andreas Blaette (andreas.blaette@uni-due.de)

2023-10-29

Objective

Sometimes it is not possible, for practical or legal reasons, to move corpus data to a local machine. This vignette explains how to use CWB corpora that are hosted on an OpenCPU server.

library(polmineR)
## polmineR is throttled to use 2 cores as required by CRAN Repository Policy. To get full performance:
## * Use `n_cores <- parallel::detectCores()` to detect the number of cores available on your machine
## * Set number of cores using `options('polmineR.cores' = n_cores - 1)` and `data.table::setDTthreads(n_cores - 1)`
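The startup message above suggests how to lift the two-core limit. A minimal sketch of that hint follows, assuming the parallel package (shipped with R) is available. The fallback to 2 cores when detection fails and the floor of 1 core are assumptions added here, not part of the polmineR message; the name use_cores is purely illustrative.

```r
# Detect available cores; detectCores() can return NA on some platforms.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 2L  # assumption: fall back to the throttled default

# Leave one core free, but never go below 1 (relevant on single-core machines).
use_cores <- max(1L, n_cores - 1L)

# Apply the setting as suggested by the startup message.
options("polmineR.cores" = use_cores)

# data.table is only touched if it is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::setDTthreads(use_cores)
}
```

The floor matters because `n_cores - 1` evaluates to 0 on a single-core machine, and data.table interprets 0 threads as "use all available".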

Remote Corpora

Publicly Available Corpora

The GermaParl corpus is hosted on an OpenCPU server with the IP 132.252.238.66 (subject to change). To use the corpus, call the corpus()-method as you would for a local corpus. The only difference is that you need to supply the server address (an IP address or, as in the following call, a URL) using the argument server.

gparl <- corpus("GERMAPARL", server = "http://opencpu.politik.uni-due.de")

The gparl object has the class remote_corpus.

is(gparl)

Using polmineR core functionality

At this stage, polmineR exposes only a limited set of its functionality for remote corpora. Simple investigations of a remote corpus are possible.

Get corpus size

size(gparl)

Get structural annotation (metadata)

s_attributes(gparl)

Subsetting

gparl2006 <- subset(gparl, year == "2006")

The returned object has the class remote_subcorpus.

is(gparl2006)

Simple count

count(gparl, query = "Integration")

The count()-method works for remote_subcorpus objects, too.

count(gparl2006, query = "Integration")

KWIC

kwic(gparl, query = "Islam", left = 15, right = 15, meta = c("speaker", "party", "date"))

The kwic()-method works for remote_subcorpus objects, too.

kwic(gparl2006, query = "Islam", left = 15, right = 15, meta = c("speaker", "party", "date"))

Restricted Corpora

  1. Create a directory for files with credentials, written in the style of registry files

  2. Create a file with the credentials for your corpus in this directory

Note: The filename is the corpus ID in lowercase.

##
## registry entry for corpus GERMAPARLSAMPLE
##

# long descriptive name for the corpus
NAME "GermaParlSample"
# corpus ID (must be lowercase in registry!)
ID   germaparlsample
# path to binary data files
HOME http://localhost:8005
# optional info file (displayed by ",info;" command in CQP)
INFO https://zenodo.org/record/3823245#.XsrU-8ZCT_Q 

# corpus properties provide additional information about the corpus:
##:: user = "YOUR_USER_NAME"
##:: password = "YOUR_PASSWORD"

  3. Set the environment variable “OPENCPU_REGISTRY” in .Renviron to the directory created in the first step

  4. Get the server address and pass it to corpus() together with restricted = TRUE

x <- corpus("MIGPRESS_FAZ", server = "YOURSERVER", restricted = TRUE)
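The file-related steps above can be sketched in base R. This is an illustration under stated assumptions, not the package's prescribed setup: the directory location (a temporary directory) is a placeholder, the file content is abbreviated from the sample entry shown above, and Sys.setenv() is a session-only stand-in for the .Renviron entry.

```r
# Directory for the credential files (tempdir() is a placeholder location).
registry_dir <- file.path(tempdir(), "opencpu_registry")
dir.create(registry_dir, showWarnings = FALSE, recursive = TRUE)

# The filename is the corpus ID in lowercase.
corpus_id <- "GERMAPARLSAMPLE"
registry_file <- file.path(registry_dir, tolower(corpus_id))

# Write an abbreviated version of the sample entry, with placeholder credentials.
writeLines(
  c(
    "NAME \"GermaParlSample\"",
    sprintf("ID   %s", tolower(corpus_id)),
    "HOME http://localhost:8005",
    "##:: user = \"YOUR_USER_NAME\"",
    "##:: password = \"YOUR_PASSWORD\""
  ),
  con = registry_file
)

# Session-only stand-in for the OPENCPU_REGISTRY entry in .Renviron.
Sys.setenv(OPENCPU_REGISTRY = registry_dir)
```

For a persistent setup, add a line of the form `OPENCPU_REGISTRY=/path/to/your/directory` to .Renviron instead, as the list above describes. Whether a value set with Sys.setenv() after polmineR has been loaded is picked up is not confirmed here, so the .Renviron route is the safer one.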

Next steps

Upcoming versions of polmineR will expose further functionality. This is a simple proof-of-concept!
