Former kgram_freqs class is now called
sbo_kgram_freqs. The constructor kgram_freqs()
is still available as an alias to
sbo_kgram_freqs().
The former sbo_preds class has been replaced by two
classes:
- `sbo_predictor`: for interactive use
- `sbo_predtable`: for storing text predictors out of memory (e.g.
`save()` to file)

sbo_predictor and sbo_predtable objects
are obtained by the homonymous constructors, which are now S3 generics
accepting character input, as well as
sbo_kgram_freqs and sbo_predtable (for the
sbo_predictor() constructor) class objects. In particular,
these make it possible to train a text predictor directly, without
storing the intermediate sbo_dictionary and kgram_freqs
objects.
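
A minimal sketch of this workflow, assuming the `twitter_train` example corpus and the `preprocess()` helper shipped with sbo. The argument names used here (`object`, `N`, `dict`, `.preprocess`, `EOS`) follow the package documentation of later releases and are an assumption for the version this entry describes; check `?sbo_predictor` before relying on them.

```r
library(sbo)

# A slice of the bundled example corpus, to keep training quick
corpus <- head(sbo::twitter_train, 5000)

# Train a predictor directly from character input: no intermediate
# dictionary or k-gram frequency objects are stored by the user.
p <- sbo_predictor(object = corpus, N = 3, dict = target ~ 0.75,
                   .preprocess = sbo::preprocess, EOS = ".?!:;")
predict(p, "i love")

# For out-of-memory storage, build a prediction table instead,
# save it to file, and rebuild the interactive predictor from it.
tbl <- sbo_predtable(object = corpus, N = 3, dict = target ~ 0.75,
                     .preprocess = sbo::preprocess, EOS = ".?!:;")
path <- tempfile(fileext = ".rda")
save(tbl, file = path)
load(path)
p2 <- sbo_predictor(tbl)
predict(p2, "i love")
```

The split mirrors the two classes described above: `sbo_predtable` is the object to persist, `sbo_predictor` is the one to call `predict()` on.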
The behaviour of the dict argument in
kgram_freqs() and kgram_freqs_fast() has
changed, now accepting either a sbo_dictionary, a
character or a formula (see also ‘New
features’).
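
The three accepted forms of `dict` might look as follows. This is a sketch only: the formula shapes `max_size ~ V` and `target ~ f`, and the argument names `corpus`, `.preprocess` and `EOS`, are taken from the package documentation of later releases and are assumptions for the version described here.

```r
library(sbo)

corpus <- head(sbo::twitter_train, 5000)

# 1. A formula: keep the 1000 most frequent words ...
f_size <- kgram_freqs(corpus, N = 3, dict = max_size ~ 1000,
                      .preprocess = sbo::preprocess, EOS = ".?!:;")

# ... or enough words to cover 75% of the training corpus
f_cov <- kgram_freqs(corpus, N = 3, dict = target ~ 0.75,
                     .preprocess = sbo::preprocess, EOS = ".?!:;")

# 2. A character vector listing the dictionary words explicitly
f_chr <- kgram_freqs(corpus, N = 3, dict = c("i", "love", "you"),
                     .preprocess = sbo::preprocess, EOS = ".?!:;")

# 3. A previously built sbo_dictionary object
d <- sbo_dictionary(corpus, max_size = 1000,
                    .preprocess = sbo::preprocess, EOS = ".?!:;")
f_dict <- kgram_freqs(corpus, N = 3, dict = d,
                      .preprocess = sbo::preprocess, EOS = ".?!:;")
```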
The sbo_predictor implementation dramatically
improves the speed of predict() (by a factor of 10). A
single call to predict() now allocates a few kB of RAM
(whereas it previously allocated a few MB, cf. issue #10).
Metadata of sbo_kgram_freqs and
sbo_pred* objects is now stored via attributes
(#11).
- sbo_dictionary.
- word_coverage with generic constructors
and a preconfigured plot() method.
- kgram_freqs() and
sbo_pred*() can now also be built with a fixed target
coverage fraction of the training corpus.
- prune() generic function for reducing the k-gram order
of kgram_freqs and sbo_predtable objects.
- summary() methods for
sbo_kgram_freqs and sbo_pred* objects;
correspondingly, the output of print() has been simplified
considerably (#5).
- sbo_kgram_freqs,
sbo_dictionary, sbo_predictor and
sbo_predtable can be constructed either through the
homonymous constructors, or through the aliases
kgram_freqs(), dictionary(),
predictor(), predtable().
- sbo now has SystemRequirements: C++11,
for correct integration with C++11 code (in particular
std::unordered_map).
Model training (with sbo_predictor()) is now
considerably faster, due to optimizations in the algorithm for building
Stupid Back-Off prediction tables.
The Stupid Back-Off algorithm is now thoroughly tested, and small
inconsistencies between the predict.kgram_freqs() and
predict.sbo_predictor() methods have been fixed,
including:
- Proper handling of unknown words
- Consistent handling of ties in prediction probabilities.

Model evaluation in eval_sbo_predictor() is now
carried out by sampling a single sentence from each document in the test
corpus.
Removed unnecessary dependencies from Depends and
Imports package fields.
- erase argument
in preprocess() and kgram_freqs_fast(), cf.
issue #17.
- kgramFreqs class, as per §1.6.4 of the
“Writing R extensions” guide.
- kgram_freqs_fast() for fast and memory efficient
k-gram tokenization using the default text preprocessing utility.
- kgram_freqs(),
get_word_freqs(), preprocess(), and
predict.sbo_preds() has been entirely rewritten in
C++.
- tokenize_sentences() function for sentence level
tokenization.
- kgram_freqs() now accepts any user defined single
character EOS token, through the EOS argument.
- preproc argument to kgram_freqs()
and get_word_freqs(), for custom training corpus
preprocessing.
- dict argument of kgram_freqs() now
also accepts numeric values, allowing a dictionary to be built directly
from the training corpus.
- predict method for sbo_kgram_freqs
class.