Install the R package.
install.packages("udpipe")
Get your language model and start annotating.
library(udpipe)
udmodel <- udpipe_download_model(language = "dutch")
udmodel <- udpipe_load_model(file = udmodel$file_model)
x <- udpipe_annotate(udmodel,
x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.")
x <- as.data.frame(x, detailed = TRUE)
Or just do as follows.
library(udpipe)
x <- udpipe(x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.",
object = "dutch")
The annotation returns paragraphs, sentences, tokens, the location of each token in the original text, morphological elements such as the lemma, the universal part-of-speech tag, the treebank-specific part-of-speech tag and morphosyntactic features, as well as the dependency relationships. More information is available at https://universaldependencies.org/guidelines.html
str(x)
'data.frame': 18 obs. of 17 variables:
$ doc_id : chr "doc1" "doc1" "doc1" "doc1" ...
$ paragraph_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ sentence_id : int 1 1 1 1 1 1 1 1 1 2 ...
$ sentence : chr "Ik ging op reis en ik nam mee:" "Ik ging op reis en ik nam mee:" "Ik ging op reis en ik nam mee:" "Ik ging op reis en ik nam mee:" ...
$ start : int 1 4 9 12 17 20 23 27 30 32 ...
$ end : int 2 7 10 15 18 21 25 29 30 35 ...
$ term_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ token_id : chr "1" "2" "3" "4" ...
$ token : chr "Ik" "ging" "op" "reis" ...
$ lemma : chr "ik" "gaan" "op" "reis" ...
$ upos : chr "PRON" "VERB" "ADP" "NOUN" ...
$ xpos : chr "VNW|pers|pron|nomin|vol|1|ev" "WW|pv|verl|ev" "VZ|init" "N|soort|ev|basis|zijd|stan" ...
$ feats : chr "Case=Nom|Person=1|PronType=Prs" "Number=Sing|Tense=Past|VerbForm=Fin" NA "Gender=Com|Number=Sing" ...
$ head_token_id: chr "2" "0" "4" "2" ...
$ dep_rel : chr "nsubj" "root" "case" "obl" ...
$ deps : chr NA NA NA NA ...
$ misc : chr NA NA NA NA ...
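As a rough illustration of how the token_id, head_token_id and dep_rel columns relate tokens to each other, the sketch below looks up the head token of each token. It does not call udpipe: the data frame ann is a hand-built stand-in containing only the first four rows shown in the output above, and the column name head_token is introduced here for illustration.

```r
# Hand-built stand-in for the first four rows of the annotation shown above
# (an assumption for illustration; a real result comes from udpipe_annotate).
ann <- data.frame(
  token_id      = c("1", "2", "3", "4"),
  token         = c("Ik", "ging", "op", "reis"),
  lemma         = c("ik", "gaan", "op", "reis"),
  upos          = c("PRON", "VERB", "ADP", "NOUN"),
  head_token_id = c("2", "0", "4", "2"),
  dep_rel       = c("nsubj", "root", "case", "obl"),
  stringsAsFactors = FALSE
)

# Look up the token that each head_token_id points to.
# A head_token_id of "0" marks the root, so match() finds nothing and yields NA.
ann$head_token <- ann$token[match(ann$head_token_id, ann$token_id)]

print(ann[, c("token", "upos", "dep_rel", "head_token")])
```

Token ids restart within each sentence, so on a full annotation this lookup would have to be done per doc_id and sentence_id rather than over the whole data frame at once.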
Note that the x argument to udpipe_annotate must be in UTF-8 encoding. You can check the encoding of your text with Encoding('your text'). You can convert your text to UTF-8 using standard R utilities, as in iconv('your text', from = 'latin1', to = 'UTF-8'), where you replace the from part with whichever encoding your text is in, possibly your computer's default as defined in localeToCharset(). So annotation would look something like this if your text is not already in UTF-8 encoding:
udpipe_annotate(udmodel, x = iconv('your text', to = 'UTF-8'))
if your text is in the encoding of the current locale of your computer.
udpipe_annotate(udmodel, x = iconv('your text', from = 'latin1', to = 'UTF-8'))
if your text is in latin1 encoding.
udpipe_annotate(udmodel, x = iconv('your text', from = 'CP949', to = 'UTF-8'))
if your text is in CP949 encoding.
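The encoding check and conversion described above can be tried in base R without a model. The following is a minimal sketch, assuming a latin1 source text; the sample bytes and the variable names txt_latin1 and txt_utf8 are made up for illustration.

```r
# Build a latin1-encoded string from raw bytes: "fa\xe7ile", where the
# byte 0xE7 is the latin1 code for c-cedilla. The bytes are illustrative.
txt_latin1 <- rawToChar(as.raw(c(0x66, 0x61, 0xe7, 0x69, 0x6c, 0x65)))

# The raw latin1 bytes are not a valid UTF-8 sequence.
print(validUTF8(txt_latin1))

# Convert to UTF-8, naming the source encoding explicitly.
txt_utf8 <- iconv(txt_latin1, from = "latin1", to = "UTF-8")

# After conversion the string is valid UTF-8 and is marked as such.
print(validUTF8(txt_utf8))
print(Encoding(txt_utf8))
```

The converted txt_utf8 is the kind of value that can be passed as the x argument to udpipe_annotate. Note that Encoding() reports "unknown" for plain ASCII strings, which are valid UTF-8 as they are.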