The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
In order to get the most out of the package, we will show how to use the outcome of the annotation to improve topic modelling. The main use cases of using the outcome of the annotation is to
Let’s show this.
Let’s start by annotating some text in French. The annotated data.frame can next be used for basic text analytics.
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "fr")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback, doc_id = comments$id)
x <- as.data.frame(x)
The resulting data.frame has a field called upos
which is the Universal Parts of Speech tag and also a field called lemma
which is the root form of each token in the text. These 2 fields give us a broad range of topic modelling possibilities.
'data.frame': 30081 obs. of 14 variables:
$ doc_id : chr "47860059" "47860059" "47860059" "47860059" ...
$ paragraph_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ sentence_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ sentence : chr "Quelle excellent week end - Merci a David pour sa confiance, merci a son appart d etre aussi chouette, merci au"| __truncated__ "Quelle excellent week end - Merci a David pour sa confiance, merci a son appart d etre aussi chouette, merci au"| __truncated__ "Quelle excellent week end - Merci a David pour sa confiance, merci a son appart d etre aussi chouette, merci au"| __truncated__ "Quelle excellent week end - Merci a David pour sa confiance, merci a son appart d etre aussi chouette, merci au"| __truncated__ ...
$ token_id : chr "1" "2" "3" "4" ...
$ token : chr "Quelle" "excellent" "week" "end" ...
$ lemma : chr "quel" "excellent" "weekend" "end" ...
$ upos : chr "DET" "ADJ" "NOUN" "NOUN" ...
$ xpos : chr NA NA NA NA ...
$ feats : chr "Gender=Fem|Number=Sing" "Gender=Masc|Number=Sing" "Gender=Masc|Number=Sing" "Gender=Fem|Number=Sing" ...
$ head_token_id: chr "3" "3" "6" "3" ...
$ dep_rel : chr "det" "amod" "nsubj" "nmod" ...
$ deps : chr NA NA NA NA ...
$ misc : chr NA NA NA NA ...
You can easily go from this annotated data.frame to a document-term-matrix which is used by a lot of other text mining R packages. In this case, we will build topics at the sentence level.
The advantage of this package over other packages is that
If you want to do the same on adjectives + nouns, just change the above code where it says upos %in% c("NOUN")
in upos %in% c("NOUN", "ADJ")
and you are ready.
## Build document term matrix on nouns/adjectives only
dtf <- subset(x, upos %in% c("NOUN", "ADJ") &
!lemma %in% c("appartement", "appart", "eter", "tres"))
dtf <- document_term_frequencies(dtf, document = "topic_level_id", term = "lemma")
dtm <- document_term_matrix(x = dtf)
dtm_clean <- dtm_remove_lowfreq(dtm, minfreq = 5)
## Build topic model + get topic terminology
m <- LDA(dtm_clean, k = 4, method = "Gibbs",
control = list(nstart = 5, burnin = 2000, best = TRUE, seed = 1:5))
topicterminology <- predict(m, type = "terms", min_posterior = 0.025, min_terms = 5)
scores <- predict(m, newdata = dtm, type = "topics")
Once you have topics, visualising these can also be easily done with the igraph and ggraph packages. Below one possible plot is shown. It shows for a certain topic the co-occurrence of terms
library(igraph)
library(ggraph)
library(ggplot2)
x_topics <- merge(x, scores, by.x="topic_level_id", by.y="doc_id")
wordnetwork <- subset(x_topics, topic %in% 1 & lemma %in% topicterminology[[1]]$term)
wordnetwork <- cooccurrence(wordnetwork, group = c("topic_level_id"), term = "lemma")
wordnetwork <- graph_from_data_frame(wordnetwork)
ggraph(wordnetwork, layout = "fr") +
geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
geom_node_text(aes(label = name), col = "darkgreen", size = 4) +
theme_graph(base_family = "Arial Narrow") +
labs(title = "Words in topic 1 ", subtitle = "Nouns & Adjective cooccurrence")
Another possibility is showing correlations among the terms driving the topic for only documents of that topic.
topicterminology <- predict(m, type = "terms", min_posterior = 0.05, min_terms = 10)
termcorrs <- subset(x_topics, topic %in% 1 & lemma %in% topicterminology[[1]]$term)
termcorrs <- document_term_frequencies(termcorrs, document = "topic_level_id", term = "lemma")
termcorrs <- document_term_matrix(termcorrs)
termcorrs <- dtm_cor(termcorrs)
termcorrs[lower.tri(termcorrs)] <- NA
diag(termcorrs) <- NA
library(qgraph)
qgraph(termcorrs, layout = "spring", labels = colnames(termcorrs), directed = FALSE,
borders = FALSE, label.scale = FALSE, label.cex = 1, node.width = 0.5)
You mostly get better, more interpretable results in topic models if you include compound keywords in the model. Let’s show the steps how you can accomplish this.
keywords_rake
, keywords_phrases
, keywords_collocation
functions or with functionality from the textrank R package.txt_recode_ngram
to recode words to keywords. This will replace a sequence of words with its compound multi-word expression by first starting with words which contain more terms.In the below example, we are building a topic model on all nouns, all compound keywords which consists of nouns and adjectives and on all identified noun phrases.
## Find keywords with RAKE
keyw_rake <- keywords_rake(x,
term = "token", group = c("doc_id", "paragraph_id", "sentence_id"),
relevant = x$upos %in% c("NOUN", "ADJ"),
ngram_max = 3, n_min = 5)
## Find simple noun phrases
x$phrase_tag <- as_phrasemachine(x$upos, type = "upos")
keyw_nounphrases <- keywords_phrases(x$phrase_tag, term = x$token,
pattern = "(A|N)*N(P+D*(A|N)*N)*", is_regex = TRUE,
detailed = FALSE)
keyw_nounphrases <- subset(keyw_nounphrases, ngram > 1)
## Recode terms to keywords
x$term <- x$token
x$term <- txt_recode_ngram(x$term,
compound = keyw_rake$keyword, ngram = keyw_rake$ngram)
x$term <- txt_recode_ngram(x$term,
compound = keyw_nounphrases$keyword, ngram = keyw_nounphrases$ngram)
## Keep keyword or just plain nouns
x$term <- ifelse(x$upos %in% "NOUN", x$term,
ifelse(x$term %in% c(keyw_rake$keyword, keyw_nounphrases$keyword), x$term, NA))
## Build document/term/matrix
dtm <- document_term_frequencies(x, document = "topic_level_id", term = "term")
dtm <- document_term_matrix(x = dtm)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 5)
Once we have our document/term/matrix, topic modelling is simple. Keep in mind that you need to tune your topic model, which is not done below. See the topicmodels and ldatuning R package which show you how to do that.
m <- LDA(dtm, k = 3, method = "Gibbs",
control = list(nstart = 5, burnin = 2000, best = TRUE, seed = 1:5))
You’ll see that the topic model now includes keywords
topicterminology <- predict(m, type = "terms", min_posterior = 0.10, min_terms = 3)
topicterminology
$topic_001
term prob
1 tout 0.08531097
2 passe 0.06280687
3 agreable 0.04439444
$topic_002
term prob
1 tres 0.18747444
2 pas 0.09615647
3 ete 0.04504566
$topic_003
term prob
1 appartement 0.15585207
2 situe 0.09286101
3 tres agreable 0.05560824
In the above exercise, nouns which were part of a compound multi-word-expression (mwe) were replaced by the multi-word-expression. Sometimes however, you want to keep the noun as well as the multi-word expression in the topic model even if the noun is always part of a multi-word expression. You can do this as follows.
## Recode tokens to keywords, if it is not in the list of tokens, set to NA
x$mwe <- txt_recode_ngram(x$token, compound = keyw_rake$keyword, ngram = keyw_rake$ngram)
x$mwe <- ifelse(x$mwe %in% keyw_rake$keyword, x$mwe, NA)
## nouns
x$term_noun <- ifelse(x$upos %in% "NOUN", x$token, NA)
## Build document/term/matrix
dtm <- document_term_frequencies(x, document = "topic_level_id", term = c("term_noun", "mwe"))
dtm <- document_term_matrix(x = dtm)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 3)
m <- LDA(dtm, k = 3, method = "Gibbs",
control = list(nstart = 5, burnin = 2000, best = TRUE, seed = 1:5))
The textrank R package allows you to easily summarise text, it integrates well with this udpipe R package. It’s especially suited for finding a the most relevant sentences of documents of a certain (LDA) topic. More details on that package, see https://CRAN.R-project.org/package=textrank.
Need support in text mining. Contact BNOSAC: http://www.bnosac.be
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.