Adding an annotation of sentences as a structural attribute to an existing corpus is a frequent scenario. This vignette offers a basic recipe for a corpus that already includes a part-of-speech annotation. The GermaParlSample corpus serves as an example.
In addition to the cwbtools package, we use the functionality of the RcppCWB package to decode the p-attribute ‘pos’. The same could be achieved using the higher-level get_token_stream()
method of the polmineR package, but we want to avoid creating an additional dependency of this package.
library(cwbtools)
library(RcppCWB)
For the purposes of this vignette, we will work with a temporary copy of the corpus we wish to augment, so we create temporary registry and corpus directories.
registry_dir_tmp <- fs::path(tempdir(), "registry_dir_tmp")
corpus_dir_tmp <- fs::path(tempdir(), "corpus_dir_tmp")

dir.create(path = registry_dir_tmp)
dir.create(path = corpus_dir_tmp)

regdir_envvar <- Sys.getenv("CORPUS_REGISTRY")
Sys.setenv(CORPUS_REGISTRY = registry_dir_tmp)
The GermaParlSample corpus can be downloaded from Zenodo as follows.
is_corpus_available <- cwbtools::corpus_install(
  doi = "10.5281/zenodo.3823245",
  registry_dir = registry_dir_tmp, corpus_dir = corpus_dir_tmp,
  verbose = FALSE
)
list.files(file.path(corpus_dir_tmp, "germaparlsample"))
## [1] "date.avs" "date.avx" "date.rng" "interjection.avs"
## [5] "interjection.avx" "interjection.rng" "party.avs" "party.avx"
## [9] "party.rng" "pos.corpus.cnt" "pos.crc" "pos.crx"
## [13] "pos.hcd" "pos.huf" "pos.huf.syn" "pos.lexicon"
## [17] "pos.lexicon.idx" "pos.lexicon.srt" "speaker.avs" "speaker.avx"
## [21] "speaker.rng" "template.json" "word.corpus.cnt" "word.crc"
## [25] "word.crx" "word.hcd" "word.huf" "word.huf.syn"
## [29] "word.lexicon" "word.lexicon.idx" "word.lexicon.srt"
We generate the data for the sentence annotation from the part-of-speech annotation that is already present.
First, we decode the p-attribute “pos”.
germaparl_size <- cl_attribute_size(
  corpus = "GERMAPARLSAMPLE",
  attribute = "word", attribute_type = "p"
)
cpos_vec <- seq.int(from = 0L, to = germaparl_size - 1L)
ids <- cl_cpos2id(corpus = "GERMAPARLSAMPLE", p_attribute = "pos", cpos = cpos_vec)
pos <- cl_id2str(corpus = "GERMAPARLSAMPLE", p_attribute = "pos", id = ids)
In the Stuttgart-Tübingen Tag Set (STTS), sentence-ending punctuation carries the pos tag “$.”. From this information, we can easily generate a region matrix with the start and end corpus positions of sentences.
sentence_end <- grep("\\$\\.", pos)
sentence_factor <- cut(x = cpos_vec, breaks = c(0L, sentence_end), include.lowest = TRUE, right = FALSE)
sentences_cpos <- unname(split(x = cpos_vec, f = sentence_factor))
region_matrix <- do.call(rbind, lapply(sentences_cpos, function(cpos) c(cpos[1L], cpos[length(cpos)])))
So let us see what this looks like …
head(region_matrix)
## [,1] [,2]
## [1,] 0 9
## [2,] 10 25
## [3,] 26 35
## [4,] 36 52
## [5,] 53 64
## [6,] 65 86
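The cut()/split() idiom used above can be tried independently of the corpus. The following is a minimal sketch with a hypothetical five-token pos vector containing two sentence-ending “$.” tags; it uses only base R and none of the corpus objects.

```r
# Toy pos vector: two sentences, ending at (0-based) corpus positions 2 and 4.
pos <- c("PPER", "VVFIN", "$.", "ADV", "$.")
cpos_vec <- seq.int(from = 0L, to = length(pos) - 1L)

# grep() returns the 1-based indices of the "$." tags (here: 3 and 5), which
# serve as breaks: each interval [break_i, break_i+1) ends right at a "$." token.
sentence_end <- grep("\\$\\.", pos)
sentence_factor <- cut(x = cpos_vec, breaks = c(0L, sentence_end), include.lowest = TRUE, right = FALSE)
sentences_cpos <- unname(split(x = cpos_vec, f = sentence_factor))
region_matrix <- do.call(rbind, lapply(sentences_cpos, function(cpos) c(cpos[1L], cpos[length(cpos)])))

region_matrix  # rbind(c(0, 2), c(3, 4)): sentence 1 covers cpos 0..2, sentence 2 covers cpos 3..4
```

Note that `right = FALSE` together with `include.lowest = TRUE` makes the intervals left-closed and keeps the highest break, so each region ends exactly on its sentence-final punctuation mark.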
And this is how the new annotation layer is written back to the corpus.
s_attribute_encode(
values = as.character(seq.int(from = 0L, to = nrow(region_matrix) - 1L)),
data_dir = registry_file_parse(corpus = "GERMAPARLSAMPLE")[["home"]],
s_attribute = "s",
corpus = "GERMAPARLSAMPLE",
region_matrix = region_matrix,
method = "R",
registry_dir = Sys.getenv("CORPUS_REGISTRY"),
encoding = registry_file_parse(corpus = "GERMAPARLSAMPLE")[["properties"]][["charset"]],
delete = TRUE,
verbose = TRUE
)
## ... adding s-attribute 's' to registry
## Corpus to delete (ID): GERMAPARLSAMPLE
## Corpus name: GermaParlSample: Sample subset of the GermaParl corpus of plenary protocols of the German Bundestag
## Number of loads before reset: 5
## Number of loads resetted: 1
To see whether everything has worked, we get the left and right boundaries of the sentence with corpus position 60.
left <- cl_cpos2lbound("GERMAPARLSAMPLE", cpos = 60, s_attribute = "s")
right <- cl_cpos2rbound("GERMAPARLSAMPLE", cpos = 60, s_attribute = "s")
ids <- cl_cpos2id("GERMAPARLSAMPLE", cpos = left:right, p_attribute = "word")
cl_id2str("GERMAPARLSAMPLE", p_attribute = "word", id = ids)
## [1] "Ich" "bin" "am" "Sonntag" "," "dem"
## [7] "1." "Dezember" "1935" "," "geboren" "."
As a matter of housekeeping, we remove the temporary directories and restore the value of the environment variable CORPUS_REGISTRY.
unlink(corpus_dir_tmp, recursive = TRUE)
unlink(registry_dir_tmp, recursive = TRUE)
Sys.setenv(CORPUS_REGISTRY = regdir_envvar)
Using the part-of-speech annotation is a basic approach to obtain the data for annotating sentences. An alternative would be to use the NLP annotation machinery of an integrated tool such as Stanford CoreNLP or OpenNLP. But that is a different story to be told.