Quick Start Guide - jiebaR

This is a package for Chinese text segmentation, keyword extraction and part-of-speech tagging.

Example

Text Segmentation

You can use worker() to initialize a worker, and then use [] or segment() to do the segmentation.

library(jiebaR)

## Loading required package: jiebaRD

##  Using default settings to initialize a worker.
cutter = worker()

###  Note: Can not display Chinese characters here.

segment( "This is a good day!" , cutter )

## [1] "This" "is"   "a"    "good" "day"

## OR cutter["This is a good day!"]

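You can also pass a file path as input. In the output below the path itself is split into "temp" and "dat", which suggests that no file existed at that path when the example was run, so the string was segmented as ordinary text.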

segment( "./temp.dat" , cutter ) ### Auto encoding detection.

## [1] "temp" "dat"

You can initialize multiple engines simultaneously.

cutter2 = worker(type   = "mix",
                 dict   = "some_path/jieba.dict.utf8",
                 hmm    = "some_path/hmm_model.utf8",
                 user   = "some_path/test.dict.utf8",
                 detect = TRUE,  symbol = FALSE,
                 lines  = 1e+05, output = NULL)
cutter2   ### Print information of worker

Worker Type:  Mix Segment

Detect Encoding :  TRUE
Default Encoding:  UTF-8
Keep Symbols    :  FALSE
Output Path     :
Write File      :  TRUE
Max Read Lines  :  1e+05

Fixed Model Components:

$dict
[1] "dict/jieba.dict.utf8"

$hmm
[1] "dict/hmm_model.utf8"

$user
[1] "dict/test.dict.utf8"

$detect $encoding $symbol $output $write $lines can be reset.

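The public settings of the model can be modified with $, for example cutter$symbol = T. Private settings are fixed when the engine is initialized; you can read them with cutter$PrivateVariable, where PrivateVariable stands for the name of the setting, such as the $dict, $hmm and $user components listed above.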

cutter$encoding

## [1] "UTF-8"

cutter$detect

## [1] TRUE

cutter$detect = F
cutter$detect

## [1] FALSE
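
As a small illustration of changing a public setting, the sketch below segments the same sentence twice, before and after switching cutter$symbol on. It assumes jiebaR is installed and is guarded so that it does nothing otherwise; the variable names without_symbols and with_symbols are made up for this example, and no particular output is claimed.

```r
# Guarded sketch: runs only when jiebaR is available.
if (requireNamespace("jiebaR", quietly = TRUE)) {
  library(jiebaR)

  cutter <- worker()
  text <- "This is a good day!"

  # Default setting: symbols are not kept in the result.
  without_symbols <- segment(text, cutter)

  # Public settings can be changed on an existing worker.
  cutter$symbol <- TRUE
  with_symbols <- segment(text, cutter)

  print(without_symbols)
  print(with_symbols)
}
```

Comparing the two printed vectors shows what the symbol setting changes for your own input.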

You can use a custom dictionary. jiebaR is able to identify new words, but adding your own words ensures higher accuracy. imewlconverter is a good tool for building dictionaries.

show_dictpath() ### Show path

## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/jiebaRD/dict"

?edit_dict()   ### For more information

Part-of-Speech Tagging

The part-of-speech tagging functions [.tagger and tagging() tag each word in a sentence after segmentation, using labels compatible with ICTCLAS.

words = "hello world"
tagger = worker("tag")
tagger[words]

##     eng     eng 
## "hello" "world"

Keyword Extraction

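The keyword extraction worker uses the MixSegment model to cut words and the TF-IDF algorithm to find keywords. In the example below, <= is an operator that jiebaR overloads to pass text to a worker; it is not a comparison.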

keys = worker("keywords", topn = 1)
keys <= "words of fun"

## 11.7392 
##   "fun"
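
To make the TF-IDF idea concrete, here is a minimal base-R sketch. It is not jiebaR's internal code: the tokens stand in for the output of a segmentation step, and the IDF table is hypothetical (jiebaR ships its own IDF dictionary, whose values are not reproduced here).

```r
# Toy TF-IDF scoring in base R (illustrative only).
# `tokens` stands in for the result of segmenting a text.
tokens <- c("words", "of", "fun", "fun")

# Hypothetical IDF table: rarer words get larger values.
idf <- c(words = 2.0, of = 0.1, fun = 5.0)

# Term frequency: how often each token occurs in the input.
tf <- table(tokens)

# TF-IDF score = term frequency * inverse document frequency.
scores <- as.numeric(tf) * idf[names(tf)]

# Keep the top-n keywords, highest score first.
topn <- 1
top_keywords <- head(sort(scores, decreasing = TRUE), topn)
print(top_keywords)
```

With these made-up numbers, "fun" wins because it is both frequent in the input and rare according to the IDF table, while a common word such as "of" scores low.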

Simhash Distance

The simhash worker extracts keywords from an input and computes its simhash value. distance() does this for two inputs and then computes the Hamming distance between them.

words = "hello world"
simhasher = worker("simhash", topn = 1)
simhasher[words]

## $simhash
## [1] "3804341492420753273"
## 
## $keyword
## 11.7392 
## "hello"

distance("hello world" , "hello world!" , simhasher)

## $distance
## [1] 0
## 
## $lhs
## 11.7392 
## "hello" 
## 
## $rhs
## 11.7392 
## "hello"
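
The Hamming distance itself is simply the number of bit positions in which two fingerprints differ. The base-R sketch below shows the idea on hypothetical 8-bit fingerprints; real simhash values are much wider, and hamming_distance is a helper written for this example, not a jiebaR function.

```r
# Hamming distance between two equal-length bit vectors (base R sketch).
hamming_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a != b)  # count the positions where the bits differ
}

# Hypothetical 8-bit fingerprints, used only for illustration.
fp1 <- c(1, 0, 1, 1, 0, 0, 1, 0)
fp2 <- c(1, 0, 0, 1, 0, 1, 1, 0)

d_same <- hamming_distance(fp1, fp1)  # identical fingerprints
d_diff <- hamming_distance(fp1, fp2)  # differ in positions 3 and 6
print(c(d_same, d_diff))
```

A distance of 0, as in the distance() output above, means the two fingerprints are identical; larger values mean the inputs share fewer weighted features.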

More Docs

See https://jiebaR.qinwf.com/

More Information and Issues

https://github.com/qinwf/jiebaR

https://github.com/yanyiwu/cppjieba
