jiebaR is a package for Chinese text segmentation, keyword extraction, and part-of-speech tagging.
You can use worker() to initialize a worker, and then use [] or segment() to do the segmentation.
library(jiebaR)
## Loading required package: jiebaRD
## Using default settings to initialize a worker.
cutter = worker()
### Note: Chinese characters cannot be displayed here.
segment("This is a good day!", cutter)
## [1] "This" "is" "a" "good" "day"
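The bracket form mentioned above can be sketched as follows. This assumes the jiebaR package and its jiebaRD dictionaries are installed; the variable names are illustrative. The `<=` operator is a third spelling that jiebaR offers for the same call.

```r
library(jiebaR)

# Initialize a worker with the default settings.
cutter = worker()
text = "This is a good day!"

# Three equivalent ways to run the segmentation.
by_function = segment(text, cutter)
by_bracket = cutter[text]
by_operator = cutter <= text

by_function
```

All three calls should return the same character vector of words.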
You can use a file path as input.
## [1] "temp" "dat"
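A minimal sketch of file input, assuming the path points to an existing text file. The temporary file and variable names are made up for illustration. The output shown above looks like what is returned when the path does not exist and the string itself is segmented, though that reading is an assumption. With the default write = TRUE, the segmented text appears to be written to a new file next to the input, with the path of that file returned.

```r
library(jiebaR)

cutter = worker()

# Create a small input file (hypothetical content, for illustration only).
input_path = tempfile(fileext = ".txt")
writeLines("This is a good day!", input_path)

# Pass the path instead of the text itself.
result = segment(input_path, cutter)
result
```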
You can initialize multiple engines simultaneously.
cutter2 = worker(type = "mix",
                 dict = "some_path/jieba.dict.utf8",
                 hmm = "some_path/hmm_model.utf8",
                 user = "some_path/test.dict.utf8",
                 detect = TRUE, symbol = FALSE,
                 lines = 1e+05, output = NULL
)
cutter2 ### Print information of worker
Worker Type: Mix Segment
Detect Encoding : TRUE
Default Encoding: UTF-8
Keep Symbols : FALSE
Output Path :
Write File : TRUE
Max Read Lines : 1e+05
Fixed Model Components:
$dict
[1] "dict/jieba.dict.utf8"
$hmm
[1] "dict/hmm_model.utf8"
$user
[1] "dict/test.dict.utf8"
$detect $encoding $symbol $output $write $lines can be reset.
The public settings of the model can be modified with $, for example cutter$symbol = T. Private settings are fixed when the engine is initialized, and you can read them with cutter$PrivateVarible.
## [1] "UTF-8"
## [1] TRUE
## [1] FALSE
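Reading and resetting the public settings might look like the sketch below. It assumes a worker created with the default settings; the variable names other than the `$` fields are illustrative.

```r
library(jiebaR)

cutter = worker()

# Read public settings.
encoding_now = cutter$encoding
detect_now = cutter$detect
symbol_before = cutter$symbol

# Reset a public setting: keep symbols in the output from now on.
cutter$symbol = TRUE
with_symbols = segment("This is a good day!", cutter)

# Private settings (dictionary paths) are read-only after initialization.
fixed = cutter$PrivateVarible
names(fixed)
```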
You can use a custom dictionary. jiebaR is able to identify new words, but adding your own words can improve accuracy. imewlconverter is a good tool for dictionary construction.
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/jiebaRD/dict"
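A sketch of locating the bundled dictionaries and loading a user dictionary. Several things here are assumptions rather than confirmed by the text above: that DICTPATH is exported by the package, that a user dictionary holds one entry per line in the form "word tag", and the temporary file with its single entry, which is invented for illustration.

```r
library(jiebaR)

# Print the directory that holds the bundled dictionaries.
show_dictpath()

# DICTPATH is assumed to be the exported default for worker(dict = ...).
default_dict = DICTPATH

# Hypothetical user dictionary with a single entry: a word and its tag.
user_dict = tempfile(fileext = ".utf8")
writeLines(enc2utf8("\u4e91\u8ba1\u7b97 n"), user_dict, useBytes = TRUE)

# Initialize a worker that loads the custom dictionary.
cutter_user = worker(user = user_dict)
```

Keeping the dictionary file in UTF-8 avoids encoding mismatches with the bundled dictionaries.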
The part-of-speech tagging functions [.tagger and tagging() tag each word in a sentence after segmentation, using labels compatible with ICTCLAS.
## eng eng
## "hello" "world"
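Output like the above can be produced with a tagging worker, as in this sketch. The input string is an assumption chosen to match the output; the result is a character vector of words whose names are the tags.

```r
library(jiebaR)

# A worker of type "tag" segments the text and labels each word.
tagger = worker("tag")
tagged = tagging("hello world", tagger)

# Words are the values, tags are the names.
tagged
names(tagged)
```

The bracket form, tagger["hello world"], is the other entry point named in the text.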
The keyword extraction worker uses the MixSegment model to cut words and the TF-IDF algorithm to find the keywords.
## 11.7392
## "fun"
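A minimal sketch of keyword extraction. The input sentence is an assumption for illustration, and topn = 1 limits the result to the single highest-scoring keyword. The result is a character vector whose names are the TF-IDF weights.

```r
library(jiebaR)

# A worker of type "keywords" returns the topn highest-weighted words.
keys = worker("keywords", topn = 1)
top = keywords("words of fun", keys)

# The keyword is the value, its weight is the name.
top
names(top)
```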
The simhash worker extracts keywords from an input and computes its simhash value. Given two inputs, it finds the keywords of each and then computes the Hamming distance between their simhash values.
## $simhash
## [1] "3804341492420753273"
##
## $keyword
## 11.7392
## "hello"
## $distance
## [1] 0
##
## $lhs
## 11.7392
## "hello"
##
## $rhs
## 11.7392
## "hello"
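The two results above can be sketched as follows. The inputs are assumptions chosen for illustration; both inputs to distance() are identical here, so the distance should be 0. The small hamming() helper is not part of jiebaR and only illustrates what a Hamming distance counts: the number of positions at which two equal-length bit vectors differ.

```r
library(jiebaR)

simhasher = worker("simhash", topn = 1)

# Simhash of a single input: returns $simhash and $keyword.
fingerprint = simhash("hello world", simhasher)
names(fingerprint)

# Distance between two inputs: returns $distance, $lhs and $rhs.
compared = distance("hello world", "hello world", simhasher)
compared$distance

# Illustrative helper (hypothetical, not from jiebaR): count differing bits.
hamming = function(a, b) sum(a != b)
hamming(c(1, 0, 1, 1), c(1, 1, 1, 0))  # 2
```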