text2vec 0.6.5 (2023-10-16)
- fix test discovered with Matrix==1.6-2 release
text2vec 0.6.4 (2023-02-15)
- update dependency Matrix>=1.5-2, fixes #338
text2vec 0.6.2 (2022-09-11)
- removed test which is not needed with Matrix package v 1.5
text2vec 0.6
- 2019-12-17
- breaking change - removed construction of a vocabulary in parallel on windows
- use rsparse package for SVD and GloVe factorizations
- updated RWMD implementation (hopefully bug free)
- 2018-09-10
- breaking change - changed IDF formula - see #280 for details.
- 2018-05-28
- Added postag_lemma_tokenizer() (wrapper around udpipe::udpipe_annotate). Can be used as a drop-in replacement for simpler tokenizers in text2vec.
- 2018-05-25
- Made combine_vocabularies() part of public API - see #260 for details.
- 2018-05-10
- Added coherence() function for comprehensive coherence metrics. Thanks to Manuel Bickel (@manuelbickel) for the contribution.
- 2018-05-02
- Fixed bug in LSA model - document embeddings are now calculated as left singular vectors multiplied by singular values (not by the square root of the singular values as before). Thanks to Sloane Simmons (@singularperturbation).
- Now fit_transform and transform methods in LDA model produce the same results. Thanks to @jiunsiew for reporting. LDA also now has an n_iter_inference parameter. It controls the number of samples drawn from the converged distribution for document-topic inference. This leads to more robust document-topic probabilities (reduced variance). Default value is 10.
- 2018-01-17
- more numerically robust PMI, LFMD - thanks to @andland. Also adds iteration number iter to collocation_stat. iter shows the iteration number when collocation stats (and counters) were calculated.
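As an illustration of the PMI score referred to in the collocation entry above, here is a minimal Python sketch. It is not the text2vec implementation: the function name pmi_scores and the toy token list are assumptions made for this example, and computing the score as a difference of logarithms is just one common way to keep the arithmetic stable.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Return {(w1, w2): PMI} for every adjacent token pair.

    PMI(w1, w2) = log2 p(w1, w2) - log2 p(w1) - log2 p(w2)
    (hypothetical helper, not part of text2vec)
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        # Work with logs of counts instead of multiplying small probabilities.
        log_p12 = math.log2(c12) - math.log2(n_bi)
        log_p1 = math.log2(unigrams[w1]) - math.log2(n_uni)
        log_p2 = math.log2(unigrams[w2]) - math.log2(n_uni)
        scores[(w1, w2)] = log_p12 - log_p1 - log_p2
    return scores

tokens = "new york is big and new york is old".split()
scores = pmi_scores(tokens)
```

Pairs whose words rarely occur apart receive higher scores, which is why PMI is a common heuristic for spotting multi-word phrases.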
text2vec 0.5.1 [2018-01-10]
- 2018-01-10
- removed rank* columns from collocation_stat - were never used internally. Users can easily calculate ranks themselves
- 2018-01-09
- Added Bi-Normal Separation transformation, thanks to Pavel Shashkin ( @pshashk )
- Added Dunning’s log-likelihood ratio for collocations, thanks to Chris Lee ( @Chrisss93 )
- Early stopping for collocations learning
- 2017-12-18
- fixed several bugs #219 #217 #205
- decreased number of dependencies - no more magrittr, uuid, tokenizers
- removed distributed LDA which didn’t work correctly
- 2017-10-18
- Now tokenization is based on the tokenizers and stringi packages.
- models API follows the mlapi package. No API changes on text2vec side - we just put abstract scikit-learn-like classes into a separate package in order to make them more reusable.
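The Bi-Normal Separation (BNS) transformation listed in this release can be sketched as follows. This is a standalone Python illustration of the commonly cited definition (the absolute difference of the inverse normal CDF at the true-positive and false-positive rates), not the package's code. The function name bns, its arguments, and the clipping constant eps are assumptions for the example.

```python
from statistics import NormalDist

def bns(tp, fp, n_pos, n_neg, eps=0.0005):
    """Bi-Normal Separation score for one term (hypothetical helper).

    tp, fp       - number of positive / negative documents containing the term
    n_pos, n_neg - total number of positive / negative documents
    eps          - rates are clipped to [eps, 1 - eps] because the inverse
                   normal CDF is undefined at exactly 0 and 1
    """
    inv_cdf = NormalDist().inv_cdf

    def clip(rate):
        return min(max(rate, eps), 1.0 - eps)

    tpr = clip(tp / n_pos)
    fpr = clip(fp / n_neg)
    return abs(inv_cdf(tpr) - inv_cdf(fpr))

# A term present in 90% of positive and 10% of negative documents
# separates the classes far better than one present in 50% of each.
strong = bns(90, 10, 100, 100)
weak = bns(50, 50, 100, 100)
```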
text2vec 0.5.0
- 2017-06-12
- Add additional filters to prune_vocabulary - filter by document counts
- Clean up LSA, fixed transform method. Added option to use randomized SVD algorithm from irlba.
- 2017-05-17
- API breaking change - vocabulary format change - now plain data.frame with meta-information in attributes (stopwords, ngram, number of docs, etc).
- 2017-03-25
- No longer relies on RcppModules
- API breaking change - removed lda_c from formats in DTM construction
- added ifiles_parallel, itoken_parallel high-level functions for parallel computing
- API breaking change - chunks_numer parameter renamed to n_chunks
- 2017-01-02
- API breaking change - removed create_corpus from public API, moved co-occurrence related options to create_tcm from vectorizers
- add ability to add custom weights for co-occurrence statistics calculations
- 2016-12-30
- Noticeable speedup (1.5x) and even more noticeable improvement on memory usage (2x less!) for create_dtm, create_tcm. Now package relies on sparsepp library for underlying hash maps.
- 2016-10-30
- Collocations - detection of multi-word phrases using different heuristics - PMI, gensim, LFMD.
- 2016-10-20
- Fixed bug in as.lda_c() function
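The idea behind filtering a vocabulary by document counts, mentioned for prune_vocabulary in this release, can be shown with a small standalone Python sketch. It is not the text2vec API: the function prune_by_doc_count and its parameter names are chosen for illustration only.

```python
from collections import Counter

def prune_by_doc_count(docs, doc_count_min=1, doc_count_max=None):
    """Keep terms that occur in an allowed number of documents.

    docs is a list of token lists; the result maps each kept term to the
    number of documents it appears in (hypothetical helper).
    """
    doc_counts = Counter()
    for tokens in docs:
        # set() ensures a term is counted once per document.
        doc_counts.update(set(tokens))
    upper = len(docs) if doc_count_max is None else doc_count_max
    return {
        term: count
        for term, count in doc_counts.items()
        if doc_count_min <= count <= upper
    }

docs = [["a", "b", "a"], ["a", "c"], ["a", "b"]]
frequent = prune_by_doc_count(docs, doc_count_min=2)
rare = prune_by_doc_count(docs, doc_count_min=1, doc_count_max=2)
```

A lower bound drops terms that appear in too few documents to be informative, while an upper bound drops terms that appear almost everywhere.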
text2vec 0.4.0
2016-10-03. See 0.4 milestone tags.
- Now under GPL (>= 2) Licence
- “immutable” iterators - no need to reinitialize them
- unified models interface
- New models: LSA, LDA, GloVe with L1 regularization
- Fast similarity and distances calculation: Cosine, Jaccard, Relaxed Word Mover’s Distance, Euclidean
- Better handling of UTF-8 strings, thanks to @qinwf
- iterators and models rely on R6 package
text2vec 0.3.0
- 2016-01-13 fix for #46, thanks to @buhrmann for reporting
- 2016-01-16 format of vocabulary changed.
- do not keep doc_proportions. see #52.
- add stop_words argument to prune_vocabulary. signature also was changed.
- 2016-01-17 fix for #51. if iterator over tokens returns list with names, these names will be:
- stored as attr(corpus, 'ids')
- rownames in dtm
- names for dtm list in lda_c format
- 2016-02-02 high level function for corpus and vocabulary construction.
- construction of vocabulary from list of itoken.
- construction of dtm from list of itoken.
- 2016-02-10 rename transformers
- now all transformers start with transform_*
- more intuitive + simpler usage with autocompletion
- 2016-03-29 (accumulated since 2016-02-10)
- rename vocabulary to create_vocabulary.
- new functions create_dtm, create_tcm.
- All core functions are able to benefit from multicore machines (users have to register a parallel backend themselves)
- Fix for progress bars. Now they are able to reach 100% and ticks increased after computation.
- ids argument to itoken. Simplifies assignment of ids to rows of DTM
- create_vocabulary now can handle stopwords
- see all updates here
- 2016-03-30 more robust split_into() util.
text2vec 0.2.0 (2016-01-10)
First CRAN release of text2vec.
- Fast text vectorization with stable streaming API on arbitrary n-grams.
- Functions for vocabulary extraction and management
- Hash vectorizer (based on digest murmurhash3)
- Vocabulary vectorizer
- GloVe algorithm word embeddings.
- Fast term-co-occurrence matrix factorization via parallel async AdaGrad.
- All core functions written in C++.
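To illustrate the hash vectorizer listed above, the following Python sketch shows the general feature-hashing idea: each token is mapped straight to a column index by a hash function, so no vocabulary has to be stored. This is an assumption-laden illustration rather than the package's code. In particular, MD5 from the Python standard library stands in for murmurhash3 here, and the function name hash_vectorize and the default n_features are hypothetical.

```python
import hashlib
from collections import Counter

def hash_vectorize(tokens, n_features=2**18):
    """Return a sparse row {column_index: count} for one document.

    Hypothetical helper; MD5 is used only as a stand-in hash function.
    """
    row = Counter()
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Reduce the first 8 bytes of the digest to a column index.
        column = int.from_bytes(digest[:8], "little") % n_features
        row[column] += 1
    return dict(row)

row = hash_vectorize(["fast", "text", "fast"], n_features=16)
```

Different tokens can land in the same column (a collision), which is the usual trade-off for not having to build and keep a vocabulary.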