text2vec is a package whose main goal is to provide an efficient framework with a concise API for text analysis and natural language processing (NLP) in R.
At the moment it covers the following two topics:
Historically, most text-mining and NLP modelling has been based on bag-of-words or bag-of-ngrams models. Despite their simplicity, these models usually demonstrate good performance on text categorization and classification tasks. However, in contrast to their theoretical simplicity and practical efficiency, building bag-of-words models involves technical challenges, especially in R, because of its copy-on-modify semantics.
Let's briefly review some details of a typical analysis pipeline:
Here we will mostly discuss the first stage. The underlying texts can take a lot of space, but their vectorized representations usually do not, because they are stored as sparse matrices. For the reason mentioned above (copy-on-modify semantics), it is not easy in R to grow a DTM iteratively, so constructing such objects, even for small collections of documents, can become a serious headache for analysts and researchers. It usually involves reading the whole collection of text documents into RAM and processing it as a single vector, which can easily increase memory consumption by a factor of 2 to 4 (and, to be honest, that is an optimistic estimate). Fortunately, there is a better, text2vec way. Let's see how it works on a simple example.
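As a quick aside, here is a tiny sketch of why sparse storage keeps vectorized texts small. It only assumes the Matrix package (which sparse DTMs in R are typically built on) and is not part of the text2vec pipeline:

library(Matrix)
# a mostly-zero 100 x 5000 "DTM", stored densely vs. sparsely
dense  <- matrix(0, nrow = 100, ncol = 5000)
sparse <- sparseMatrix(i = c(1, 2, 3), j = c(10, 20, 30), x = 1,
                       dims = c(100, 5000))
format(object.size(dense),  units = "Mb")   # roughly 3.8 Mb, almost all of it zeros
format(object.size(sparse), units = "Kb")   # a few Kb: only non-zero cells are stored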
text2vec provides the movie_review dataset. It consists of 5000 movie reviews, each of which is marked as positive or negative.
library(text2vec)
data("movie_review")
set.seed(42L)
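As a quick sanity check, we can inspect the columns used throughout this vignette, review and sentiment, and look at the class balance of the labels:

# a quick look at the data and the balance of the sentiment labels
str(movie_review, vec.len = 1)
table(movie_review[['sentiment']])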
To represent documents in vector space, we first have to create term -> term_id
mappings. We use the term term rather than word, because a term can be an arbitrary ngram, not just a single word. Given a set of documents, we want to represent them as a sparse matrix, where each row corresponds to a document and each column corresponds to a term. This can be done in two ways: using a vocabulary, or with feature hashing (the hashing trick).
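To make the vocabulary-based mapping concrete, here is a toy sketch for two tiny documents. It uses plain base R plus the Matrix package rather than the text2vec API, which does the same job in a streaming, memory-friendly way:

library(Matrix)
docs      <- list(c("good", "movie"), c("bad", "movie", "acting"))
# the term -> term_id mapping: a term's id is simply its position in the vocabulary
vocab_toy <- unique(unlist(docs))   # "good" "movie" "bad" "acting"
dtm_toy   <- sparseMatrix(i = rep(seq_along(docs), lengths(docs)),  # row    = document id
                          j = match(unlist(docs), vocab_toy),       # column = term_id
                          x = 1,
                          dims = c(length(docs), length(vocab_toy)),
                          dimnames = list(NULL, vocab_toy))
as.matrix(dtm_toy)   # each row is a document, each column a term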
Let's examine the first option. Here we collect unique terms from all documents and mark each of them with a unique id. The vocabulary()
function is designed specifically for this purpose.
it <- itoken(movie_review[['review']], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10, progessbar = F)
# using unigrams here
t1 <- Sys.time()
vocab <- vocabulary(src = it, ngram = c(1L, 1L))
print( difftime( Sys.time(), t1, units = 'sec'))
## Time difference of 0.8582668 secs
Now we can construct the Document-Term Matrix (DTM). Again, since all functions related to corpus construction have a streaming API, we have to create an iterator and provide it to the create_vocab_corpus
function:
it <- itoken(movie_review[['review']], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10, progessbar = F)
corpus <- create_vocab_corpus(it, vocabulary = vocab)
dtm <- get_dtm(corpus)
We have got the DTM. Let's check its dimensions:
dim(dtm)
## [1] 5000 42652
As you can see, it has 5000 rows (equal to the number of documents) and 42652 columns (equal to the number of unique terms). Now we are ready to fit our first model. Here we will use the glmnet
package to fit a logistic regression with an L1 penalty.
library(glmnet)
t1 <- Sys.time()
fit <- cv.glmnet(x = dtm, y = movie_review[['sentiment']],
                 family = 'binomial',
                 # lasso (L1) penalty
                 alpha = 1,
                 # we are interested in the area under the ROC curve
                 type.measure = "auc",
                 # 5-fold cross-validation
                 nfolds = 5,
                 # a high convergence threshold is less accurate, but gives faster training
                 thresh = 1e-3,
                 # again, a lower number of iterations for faster training
                 # in this vignette
                 maxit = 1e3)
print( difftime( Sys.time(), t1, units = 'sec'))
## Time difference of 4.386506 secs
plot(fit)
print (paste("max AUC = ", round(max(fit$cvm), 4)))
## [1] "max AUC = 0.9199"
Note that the training time is quite high. We can reduce it, and also significantly improve accuracy, by pruning the vocabulary.
For example, words like “a”, “the”, and “in” appear in almost all documents, but they don't carry any useful information; such words are usually called stop words. At the other extreme, the corpus also contains very uncommon terms that appear in only a few documents. These terms are also useless, because we don't have sufficient statistics for them. Here we will filter both kinds out:
# remove very common and uncommon words
pruned_vocab <- prune_vocabulary(vocab, term_count_min = 10,
                                 doc_proportion_max = 0.5, doc_proportion_min = 0.001)
it <- itoken(movie_review[['review']], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10, progessbar = F)
corpus <- create_vocab_corpus(it, vocabulary = pruned_vocab)
dtm <- get_dtm(corpus)
We can also (and usually should!) apply a TF-IDF transformation, which will increase the weight of document-specific terms and decrease the weight of widely used terms:
dtm <- dtm %>% tfidf_transformer
## idf scaling matrix not provided, calculating it form input matrix
dim(dtm)
## [1] 5000 7663
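Conceptually, the weighting applied here corresponds roughly to the following sketch; it is a hedged approximation that ignores edge cases (such as empty documents), and the exact scaling used by tfidf_transformer may differ in its details:

# rough manual TF-IDF on a sparse DTM m (requires the Matrix package)
manual_tfidf <- function(m) {
  tf  <- m / rowSums(m)                  # term frequency, normalised per document
  idf <- log(nrow(m) / colSums(m > 0))   # inverse document frequency per term
  t(t(tf) * idf)                         # rescale every column (term) by its idf
}

Terms that occur in many documents get a small idf and hence a small weight, while document-specific terms are up-weighted.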
Now, let's fit our model again:
t1 <- Sys.time()
fit <- cv.glmnet(x = dtm, y = movie_review[['sentiment']],
                 family = 'binomial',
                 # lasso (L1) penalty
                 alpha = 1,
                 # we are interested in the area under the ROC curve
                 type.measure = "auc",
                 # 5-fold cross-validation
                 nfolds = 5,
                 # a high convergence threshold is less accurate, but gives faster training
                 thresh = 1e-3,
                 # again, a lower number of iterations for faster training
                 # in this vignette
                 maxit = 1e3)
print( difftime( Sys.time(), t1, units = 'sec'))
## Time difference of 2.457911 secs
plot(fit)
print (paste("max AUC = ", round(max(fit$cvm), 4)))
## [1] "max AUC = 0.9201"
As you can see, we obtained faster training and a larger AUC.
We can also try to use ngrams instead of single words; here we will use ngrams of length up to 3, as sketched below.
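As a quick illustration (plain base R, not the text2vec API), these are the 1- to 3-grams of a short tokenized sentence; text2vec uses a similar representation, joining ngram components with an underscore:

tokens <- c("this", "movie", "was", "great")
# all 1-, 2- and 3-grams, each joined with "_":
# "this" "movie" "was" "great" "this_movie" "movie_was" "was_great" "this_movie_was" "movie_was_great"
unlist(lapply(1:3, function(n) {
  # slide a window of length n over the tokens and paste each window together
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"))
}))

Now we build the vocabulary, this time with ngram = c(1L, 3L):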
it <- itoken(movie_review[['review']], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10, progessbar = F)
t1 <- Sys.time()
vocab <- vocabulary(src = it, ngram = c(1L, 3L))
print( difftime( Sys.time(), t1, units = 'sec'))
## Time difference of 4.119684 secs
vocab <- vocab %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.5, doc_proportion_min = 0.001)
it <- itoken(movie_review[['review']], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10, progessbar = F)
corpus <- create_vocab_corpus(it, vocabulary = vocab)
print( difftime( Sys.time(), t1, units = 'sec'))
## Time difference of 6.042079 secs
dtm <- corpus %>%
  get_dtm %>%
  tfidf_transformer
## idf scaling matrix not provided, calculating it form input matrix
dim(dtm)
## [1] 5000 27226
t1 <- Sys.time()
fit <- cv.glmnet(x = dtm, y = movie_review[['sentiment']],
                 family = 'binomial',
                 # lasso (L1) penalty
                 alpha = 1,
                 # we are interested in the area under the ROC curve
                 type.measure = "auc",
                 # 5-fold cross-validation
                 nfolds = 5,
                 # a high convergence threshold is less accurate, but gives faster training
                 thresh = 1e-3,
                 # again, a lower number of iterations for faster training
                 # in this vignette
                 maxit = 1e3)
print( difftime( Sys.time(), t1, units = 'sec'))
## Time difference of 4.567166 secs
plot(fit)
print (paste("max AUC = ", round(max(fit$cvm), 4)))
## [1] "max AUC = 0.9199"
So we have improved our model a little bit more. I leave further tuning to the reader.
If you have not heard about feature hashing (also known as the hashing trick), I recommend starting with the Wikipedia article and then reviewing the original paper by the Yahoo! research team. This technique is very fast, because we don't perform a lookup in an associative array. Another benefit is a very low memory footprint: we can map an arbitrary number of features into a much more compact space. This method was popularized by Yahoo! and is widely used in Vowpal Wabbit.
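To illustrate the idea, here is a minimal sketch of the hashing trick. It assumes the digest package purely for a string hash; text2vec uses its own internal hash function, so the actual bucket indices will differ:

library(digest)
hash_size <- 2 ** 16
term_to_bucket <- function(term) {
  # hash the term, keep 7 hex digits to stay inside R's integer range,
  # and map the result into a fixed number of buckets - no vocabulary lookup is needed
  h <- strtoi(substr(digest(term, algo = "murmur32"), 1, 7), base = 16L)
  (h %% hash_size) + 1L
}
term_to_bucket("great")      # bucket (column) index assigned to "great"
term_to_bucket("terrible")   # different terms may occasionally collide in the same bucket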
Here I will demonstrate how to use feature hashing in text2vec:
t1 <- Sys.time()
it <- itoken(movie_review[['review']], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10, progessbar = F)
fh <- feature_hasher(hash_size = 2**16, ngram = c(1L, 3L))
corpus <- create_hash_corpus(it, feature_hasher = fh)
print( difftime( Sys.time(), t1, units = 'sec'))
## Time difference of 2.190547 secs
dtm <- corpus %>%
  get_dtm %>%
  tfidf_transformer
## idf scaling matrix not provided, calculating it form input matrix
dim(dtm)
## [1] 5000 65536
t1 <- Sys.time()
fit <- cv.glmnet(x = dtm, y = movie_review[['sentiment']],
                 family = 'binomial',
                 # lasso (L1) penalty
                 alpha = 1,
                 # we are interested in the area under the ROC curve
                 type.measure = "auc",
                 # 5-fold cross-validation
                 nfolds = 5,
                 # a high convergence threshold is less accurate, but gives faster training
                 thresh = 1e-3,
                 # again, a lower number of iterations for faster training
                 # in this vignette
                 maxit = 1e3)
print( difftime( Sys.time(), t1, units = 'sec'))
## Time difference of 8.823554 secs
plot(fit)
print (paste("max AUC = ", round(max(fit$cvm), 4)))
## [1] "max AUC = 0.9027"
As you can see, we got a slightly worse AUC, but the DTM construction time was considerably lower. On large collections of documents, this can become a serious argument in favour of feature hashing.