How to use CountVectorizer in R?

Manish Saraswat

2020-02-19

In this tutorial, we’ll look at how to create a bag-of-words model (a token occurrence count matrix) in R in two simple steps with superml. Superml gains speed from parallel computation and from optimised functions in the data.table R package. The bag-of-words model is often used to analyse text patterns based on word occurrences in a given text.

Install

You can install the latest CRAN version using (recommended):

install.packages("superml")

You can install the development version directly from GitHub using:

devtools::install_github("saraswatmks/superml")

Sample Data

First, we’ll create some sample data. Feel free to run the code alongside on your laptop and check the computation.

library(superml)
#> Loading required package: R6

# should be a vector of texts
sents <-  c('i am going home and home',
          'where are you going.? //// ',
          'how does it work',
          'transform your work and go work again',
          'home is where you go from to work')

# generate more sentences
n <- 10
sents <- rep(sents, n) 
length(sents)
#> [1] 50

For this example, we’ve generated 50 documents. Let’s create the features now. For ease of use, superml follows an API layout similar to Python’s scikit-learn.

# initialise the class, set parallel to TRUE for fast computation
cfv <- CountVectorizer$new(max_features = 10, remove_stopwords = FALSE, parallel = FALSE)

# generate the matrix
cf_mat <- cfv$fit_transform(sents)

head(cf_mat, 3)
#>      work home going and where you go i am are
#> [1,]    0    2     1   1     0   0  0 1  1   0
#> [2,]    0    0     1   0     1   1  0 0  0   1
#> [3,]    1    0     0   0     0   0  0 0  0   0

A few observations:

- With max_features = 10, only the 10 most frequent tokens across the corpus are kept as columns.
- Punctuation such as .? and //// is stripped before counting.
- Each row is a document, and each cell holds the number of times that token appears in it (for example, home appears twice in the first sentence).
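Since cf_mat is just a base R matrix, you can inspect overall token frequencies with plain base R (a quick check on the result, not part of the superml API):

# total occurrences of each retained token across all 50 documents,
# sorted from most to least frequent
sort(colSums(cf_mat), decreasing = TRUE)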

Now, let’s generate the matrix using its ngram_range parameter.

# initialise the class, set parallel to TRUE for fast computation
cfv <- CountVectorizer$new(max_features = 10, remove_stopwords = FALSE, ngram_range = c(1, 3), parallel = FALSE)

# generate the matrix
cf_mat <- cfv$fit_transform(sents)

head(cf_mat, 3)
#>      home and where you work go i i am i am going am
#> [1,]    2   1     0   0    0  0 1    1          1  1
#> [2,]    0   0     1   1    0  0 0    0          0  0
#> [3,]    0   0     0   0    1  0 0    0          0  0

A few observations:

- With ngram_range = c(1, 3), the features now include unigrams, bigrams, and trigrams, such as i am and i am going.
- max_features = 10 still limits the matrix to the 10 most frequent of these n-grams.
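If you want only bigrams (or only trigrams), the same constructor arguments should work with both ends of ngram_range set to the same value; a minimal sketch, assuming ngram_range = c(2, 2) is accepted (output omitted):

# keep only bigrams by fixing both ends of the n-gram range at 2
cfv_bi <- CountVectorizer$new(max_features = 10, remove_stopwords = FALSE,
                              ngram_range = c(2, 2), parallel = FALSE)
bi_mat <- cfv_bi$fit_transform(sents)
head(bi_mat, 3)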

Usage for a Machine Learning Model

When using CountVectorizer as an input for a machine learning model, it can be confusing which method — fit_transform, fit, or transform — should be used to generate features for the given data. Here’s the way to do it:


library(data.table)
library(superml)

# use sents from above
sents <-  c('i am going home and home',
          'where are you going.? //// ',
          'how does it work',
          'transform your work and go work again',
          'home is where you go from to work',
          'how does it work')

# create dummy data
train <- data.table(text = sents, target = rep(c(0,1), 3))
test <- data.table(text = sample(sents), target = rep(c(0,1), 3))

Let’s see what the data looks like:

head(train, 3)
#>                           text target
#> 1:    i am going home and home      0
#> 2: where are you going.? ////       1
#> 3:            how does it work      0
head(test, 3)
#>                                     text target
#> 1: transform your work and go work again      0
#> 2:           where are you going.? ////       1
#> 3:                      how does it work      0

Now, we generate features for the train and test data:

# initialise the class, set parallel to TRUE for fast computation
cfv <- CountVectorizer$new(max_features = 12, remove_stopwords = FALSE, ngram_range = c(1,3), parallel = FALSE)

# we fit on train data
cfv$fit(train$text)

train_cf_features <- cfv$transform(train$text)
test_cf_features <- cfv$transform(test$text)

dim(train_cf_features)
#> [1]  6 12
dim(test_cf_features)
#> [1]  6 12

We generated 12 features for each dataset. Let’s see how they look:

head(train_cf_features, 3)
#>      home and where you how how does how does it does does it does it work it
#> [1,]    2   1     0   0   0        0           0    0       0            0  0
#> [2,]    0   0     1   1   0        0           0    0       0            0  0
#> [3,]    0   0     0   0   1        1           1    1       1            1  1
#>      it work
#> [1,]       0
#> [2,]       0
#> [3,]       1
head(test_cf_features, 3)
#>      home and where you how how does how does it does does it does it work it
#> [1,]    0   1     0   0   0        0           0    0       0            0  0
#> [2,]    0   0     1   1   0        0           0    0       0            0  0
#> [3,]    0   0     0   0   1        1           1    1       1            1  1
#>      it work
#> [1,]       0
#> [2,]       0
#> [3,]       1
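Note that both matrices carry the same 12 columns in the same order, because the vocabulary was learned once on the train text and merely applied to the test text. You can verify this with base R:

# the test features are mapped onto the vocabulary learned from train
identical(colnames(train_cf_features), colnames(test_cf_features))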

Finally, to train a machine learning model on this, you can simply do:


# ensure the input to the classifier is a data.table or data.frame object
x_train <- data.table(cbind(train_cf_features, target = train$target))
x_test <- data.table(test_cf_features)


# train a random forest classifier on the count features
rf <- RFTrainer$new(n_estimators = 10)
rf$fit(x_train, "target")

# predict the target class for the test features
predictions <- rf$predict(x_test)
predictions
#> [1] 1 0 1 0 1 0
#> Levels: 0 1
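The targets in this example are dummy labels, but on real data you would compare the predictions against the test labels, for example with base R:

# confusion matrix and accuracy on the (dummy) test labels
table(predicted = predictions, actual = test$target)
mean(as.character(predictions) == as.character(test$target))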

Summary

In this tutorial, we discussed how to use superml’s CountVectorizer (a bag-of-words model) to create a word-count matrix and train a machine learning model on it.