Former `kgram_freqs` class is now called `sbo_kgram_freqs`. The constructor `kgram_freqs()` is still available as an alias to `sbo_kgram_freqs()`.

Former `sbo_preds` class is now replaced by two classes:
- `sbo_predictor`: for interactive use
- `sbo_predtable`: for storing text predictors out of memory (e.g. `save()` to file)

`sbo_predictor` and `sbo_predtable` objects are obtained through the homonymous constructors, which are now S3 generics accepting `character` input, as well as `sbo_kgram_freqs` and `sbo_predtable` (for the `sbo_predictor()` constructor) class objects. In particular, these make it possible to train a text predictor directly, without storing the intermediate `sbo_dictionary` and `kgram_freqs` objects.

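As a rough illustration of the two routes described above, the sketch below trains a predictor directly from character input, then goes through a storable `sbo_predtable`. It assumes the `twitter_train` example corpus and the `preprocess()` utility shipped with sbo, and the argument names `N`, `dict`, `.preprocess`, `EOS` and `L` as documented for this release; check `?sbo_predictor` for the installed version before relying on it.

```r
# Sketch only: the body runs when the sbo package is installed.
if (requireNamespace("sbo", quietly = TRUE)) {
  library(sbo)
  corpus <- head(sbo::twitter_train, 5000)  # subset, to keep the sketch fast

  # 1. Train an interactive predictor directly from character input.
  p <- sbo_predictor(corpus, N = 3, dict = target ~ 0.75,
                     .preprocess = sbo::preprocess, EOS = ".?!:;", L = 3L)
  preds <- predict(p, "i love")

  # 2. Train a storable prediction table, save it, reload it,
  #    and turn it into a predictor for interactive use.
  tbl <- sbo_predtable(corpus, N = 3, dict = target ~ 0.75,
                       .preprocess = sbo::preprocess, EOS = ".?!:;", L = 3L)
  path <- tempfile(fileext = ".rda")
  save(tbl, file = path)
  load(path)
  p2 <- sbo_predictor(tbl)
  preds2 <- predict(p2, "i love")
}
```

The second route is the one to use when the model has to survive across R sessions, since only the `sbo_predtable` is meant to be written to file.
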
The behaviour of the `dict` argument in `kgram_freqs()` and `kgram_freqs_fast()` has changed: it now accepts either a `sbo_dictionary`, a `character` or a `formula` (see also ‘New features’).

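To make the three accepted forms concrete, here is a minimal base R sketch of how an argument like `dict` could be told apart by type. `describe_dict_spec()` is a hypothetical helper written for this illustration, not an sbo function, and the `target ~ 0.75` formula form is an assumption about how a coverage-based dictionary is requested.

```r
# Hypothetical helper (not part of sbo): classify a `dict`-like argument.
describe_dict_spec <- function(dict) {
  if (inherits(dict, "sbo_dictionary")) {
    "existing dictionary object"
  } else if (inherits(dict, "formula")) {
    # Two-sided formula: the left side names the criterion,
    # the right side gives its value.
    criterion <- as.character(dict[[2]])
    value <- eval(dict[[3]])
    sprintf("build from corpus: %s = %s", criterion, format(value))
  } else if (is.character(dict)) {
    sprintf("explicit word list of %d words", length(dict))
  } else {
    stop("unsupported `dict` specification")
  }
}

describe_dict_spec(c("the", "a", "of"))
describe_dict_spec(target ~ 0.75)
```

The class check comes first because a dictionary object may itself be built on top of a character vector.
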
The `sbo_predictor` implementation dramatically improves the speed of `predict()` (by a factor of 10). A single call to `predict()` now allocates a few kB of RAM, whereas it previously allocated a few MB (cf. issue #10).

Metadata of `sbo_kgram_freqs` and `sbo_pred*` objects is now stored via attributes (#11).

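For readers unfamiliar with the pattern, storing metadata in attributes looks roughly like this in base R. The class name `toy_kgram_freqs` and the attribute names `N` and `EOS` are made up for the example and are not claimed to match the attributes sbo actually sets.

```r
# Illustrative only: a toy object with its class and metadata stored as attributes.
freqs <- structure(
  list(unigrams = c(the = 10L, cat = 3L)),
  N = 2L,        # hypothetical metadata: k-gram order
  EOS = ".?!",   # hypothetical metadata: end-of-sentence characters
  class = "toy_kgram_freqs"
)

attr(freqs, "N")           # read a single metadata field
names(attributes(freqs))   # list everything attached to the object
```

The data itself stays a plain list, so code that only needs the counts does not have to know about the metadata.
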
- `sbo_dictionary`.
- `word_coverage`, with generic constructors and a preconfigured `plot()` method.
- `kgram_freqs()` and `sbo_pred*()` can now also be built with a fixed target coverage fraction of the training corpus.
- `prune()` generic function for reducing the k-gram order of `kgram_freqs` and `sbo_predtable` objects.
- `summary()` methods for `sbo_kgram_freqs` and `sbo_pred*` objects; correspondingly, the output of `print()` has been simplified considerably (#5).
- `sbo_kgram_freqs`, `sbo_dictionary`, `sbo_predictor` and `sbo_predtable` can be constructed either through the homonymous constructors, or through the aliases `kgram_freqs()`, `dictionary()`, `predictor()`, `predtable()`.
- `sbo` now has `SystemRequirements: C++11`, for correct integration with C++11 code (in particular `std::unordered_map`).

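The idea of a dictionary built to a target coverage fraction can be sketched in a few lines of base R: keep the most frequent words until their occurrences account for the requested share of the corpus. `dict_by_coverage()` is a hypothetical helper for illustration and is not the sbo implementation.

```r
# Hypothetical helper (not the sbo implementation): smallest set of most
# frequent words whose occurrences cover at least `target` of the corpus.
dict_by_coverage <- function(words, target) {
  stopifnot(target > 0, target <= 1)
  freqs <- sort(table(words), decreasing = TRUE)
  coverage <- cumsum(as.numeric(freqs)) / length(words)
  n_keep <- which(coverage >= target)[1]
  names(freqs)[seq_len(n_keep)]
}

words <- c("the", "cat", "the", "dog", "the", "cat", "sat", "the")
dict_by_coverage(words, target = 0.75)  # "the" "cat": 6 of 8 tokens covered
```

Here "the" alone covers 4/8 of the tokens, and adding "cat" brings the coverage to 6/8, which meets the 0.75 target.
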
Model training (with `sbo_predictor()`) is now considerably faster, due to optimizations in the algorithm for building Stupid Back-Off prediction tables.

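For reference, the Stupid Back-Off scoring rule itself is simple: use the relative frequency of the full k-gram if it was observed, otherwise drop the oldest context word and multiply by a fixed penalty. The base R sketch below applies that textbook rule to toy counts. `sbo_score()`, the toy data and the penalty value 0.4 are assumptions for the example; it does not reproduce how sbo builds its prediction tables.

```r
# Toy k-gram counts keyed by space-separated words (illustrative data only).
counts <- c(
  "i" = 3, "love" = 2, "you" = 2, "cats" = 1,
  "i love" = 2, "love you" = 1, "love cats" = 1,
  "i love you" = 1, "i love cats" = 1
)
n_tokens <- 8  # total unigram count: 3 + 2 + 2 + 1

# Hypothetical helper: Stupid Back-Off score of `word` given `context`.
sbo_score <- function(word, context, counts, n_tokens, lambda = 0.4) {
  penalty <- 1
  repeat {
    if (length(context) == 0) {
      # Unigram level: relative frequency, or 0 for a word never seen.
      count <- counts[word]
      if (is.na(count)) return(0)
      return(penalty * unname(count) / n_tokens)
    }
    prefix <- paste(context, collapse = " ")
    kgram <- paste(prefix, word)
    if (!is.na(counts[kgram])) {
      # Seen k-gram: relative frequency given the context.
      return(penalty * unname(counts[kgram] / counts[prefix]))
    }
    # Unseen k-gram: drop the oldest context word and apply the penalty.
    context <- context[-1]
    penalty <- penalty * lambda
  }
}

sbo_score("you", c("i", "love"), counts, n_tokens)     # 0.5
sbo_score("you", c("cats", "love"), counts, n_tokens)  # 0.2  = 0.4 * 1/2
sbo_score("cats", c("you", "you"), counts, n_tokens)   # 0.02 = 0.4^2 * 1/8
```

The scores are not normalized probabilities, which is what makes the method cheap to compute.
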
The Stupid Back-Off algorithm is now thoroughly tested, and small inconsistencies between the `predict.kgram_freqs()` and `predict.sbo_predictor()` methods have been fixed, including:
- Proper handling of unknown words.
- Consistent handling of ties in prediction probabilities.

Model evaluation in `eval_sbo_predictor()` is now carried out by sampling a single sentence from each document in the test corpus.

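The sampling step described above can be pictured with the following base R sketch, which splits each document into sentences and draws one at random. `sample_one_sentence()` and the set of end-of-sentence characters are assumptions for the illustration; the actual evaluation code in sbo may differ.

```r
# Hypothetical helper: pick one sentence at random from each document.
sample_one_sentence <- function(docs, eos = "[.?!:;]+") {
  vapply(docs, function(doc) {
    sentences <- trimws(strsplit(doc, eos)[[1]])
    sentences <- sentences[nzchar(sentences)]
    if (length(sentences) == 0) return(NA_character_)
    # sample.int() avoids the surprising behaviour of sample() on length-1 input.
    sentences[sample.int(length(sentences), 1)]
  }, character(1), USE.NAMES = FALSE)
}

set.seed(1)
test_docs <- c("I love cats. Do you? Me too!", "Hello world.")
sample_one_sentence(test_docs)
```

Drawing one sentence per document keeps long documents from dominating the evaluation.
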
Removed unnecessary dependencies from the `Depends` and `Imports` package fields.

- `erase` argument in `preprocess()` and `kgram_freqs_fast()`, cf. issue #17.
- `kgramFreqs` class, as per §1.6.4 of the “Writing R extensions” guide.
- `kgram_freqs_fast()` for fast and memory efficient kgram tokenization using the default text preprocessing utility.
- `kgram_freqs()`, `get_word_freqs()`, `preprocess()`, and `predict.sbo_preds()` have been entirely rewritten in C++.
- `tokenize_sentences()` function for sentence level tokenization.
- `kgram_freqs()` now accepts any user defined single character EOS token, through the `EOS` argument.
- `preproc` argument to `kgram_freqs()` and `get_word_freqs()`, for custom training corpus preprocessing.
- `dict` argument of `kgram_freqs()` now also accepts numeric values, allowing a dictionary to be built directly from the training corpus.
- `predict` method for `sbo_kgram_freqs` class.

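The combination of custom preprocessing and k-gram tokenization mentioned in the items above can be pictured with a small base R sketch. `count_kgrams()` is a hypothetical helper, not an sbo function, and its `preproc` parameter only mimics the idea of a user-supplied preprocessing function; the package itself does this work in C++.

```r
# Hypothetical helper: count k-grams within each sentence after applying a
# user-supplied preprocessing function.
count_kgrams <- function(sentences, k, preproc = identity) {
  grams <- unlist(lapply(strsplit(preproc(sentences), "\\s+"), function(words) {
    words <- words[nzchar(words)]
    if (length(words) < k) return(character(0))
    vapply(
      seq_len(length(words) - k + 1),
      function(i) paste(words[i:(i + k - 1)], collapse = " "),
      character(1)
    )
  }))
  table(grams)
}

bigrams <- count_kgrams(c("I love you", "I love cats"), k = 2, preproc = tolower)
bigrams[["i love"]]  # 2
```

Counting within sentences, rather than across the whole text, prevents k-grams from spanning a sentence boundary.
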