The goal of this document is to introduce kms (as in keras_model_sequential()), a regression-style function which allows users to call keras neural nets with R formula objects (hence, library(kerasformula)).

First, make sure that keras is properly configured:

install.packages("keras")
library(keras)
install_keras() # see https://keras.rstudio.com/ for details. 

kms splits the data into (sparse) training and test matrices. kms also auto-detects whether the dependent variable is categorical or binary. kms accepts the major parameters found in library(keras) as inputs (loss function, batch size, number of epochs, etc.) and allows users to customize basic neural nets (dense neural nets of various input shapes and dropout rates). The final example below also shows how to pass a compiled keras_model_sequential to kms (preferable for more complex models).
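For orientation, a call might look like the sketch below (illustrative only: df, x1, and x2 are placeholder names; loss and batch size are the keras parameters mentioned above, and Nepochs and seed appear in the examples that follow):

out <- kms("y ~ x1 + x2", data = df,
           Nepochs = 10,                  # number of training epochs
           batch_size = 32,               # keras batch size
           loss = "binary_crossentropy",  # keras loss function
           seed = 2017)                   # for reproducibility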

IMDB Movie Reviews

This example works with some of the imdb movie review data that comes with library(keras). Specifically, it compares the default dense model that kms generates to the LSTM model described here. To expedite package building and installation, the code below is not actually run, but it can be run in under six minutes on a 2017 MacBook Pro with 16 GB of RAM (the majority of that time is spent on the LSTM).

max_features <- 5000 # 5,000 words (ranked by popularity) found in movie reviews
maxlen <- 50  # Cut texts after 50 words (among top max_features most common words) 
Nsample <- 1000 

cat('Loading data...\n')
imdb <- keras::dataset_imdb(num_words = max_features)
imdb_df <- as.data.frame(cbind(c(imdb$train$y, imdb$test$y),
                               pad_sequences(c(imdb$train$x, imdb$test$x),
                                             maxlen = maxlen)))

set.seed(2017)   # can also set kms(..., seed = 2017)

demo_sample <- sample(nrow(imdb_df), Nsample)
P <- ncol(imdb_df) - 1
colnames(imdb_df) <- c("y", paste0("x", 1:P))

out_dense <- kms("y ~ .", data = imdb_df[demo_sample, ], Nepochs = 10)

plot(out_dense$history)  # plot loss and accuracy by epoch;
# choose Nepochs to maximize out-of-sample accuracy

out_dense$confusion
    1
  0 107
  1 105
cat('Test accuracy:', out_dense$evaluations$acc, "\n")
Test accuracy: 0.495283 

Pretty bad: the model predicts 1 for every review, so its 49.5% accuracy is just the base rate (a 'broken clock' model). Suppose we want to add some more layers.

out_dense <- kms("y ~ .", data = imdb_df[demo_sample, ], Nepochs = 10, 
                 layers = list(units = c(512, 256, 128, NA),   # NA: final layer sized to the outcome
                               activation = c("relu", "relu", "relu", "softmax"), 
                               dropout = c(0.5, 0.4, 0.3, NA)))  # no dropout after the final layer
out_dense$confusion
     1
  0 113
  1 105
cat('Test accuracy:', out_dense$evaluations$acc, "\n")
Test accuracy: 0.4816514

No progress. Suppose we want to build an LSTM model and pass it to kms.

k <- keras_model_sequential()
k %>%
  layer_embedding(input_dim = max_features, output_dim = 128) %>%   # learn word embeddings
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>% 
  layer_dense(units = 1, activation = 'sigmoid')                    # binary output

k %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)
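Optionally, the compiled model can be inspected before handing it to kms:

summary(k)  # layer-by-layer output shapes and parameter counts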
out_lstm <- kms(input_formula = "y ~ .", data = imdb_df[demo_sample, ], 
                keras_model_seq = k, Nepochs = 5, seed = 12345)
out_lstm$confusion
    0  1
  0 80 17
  1 25 77
cat('Test accuracy:', out_lstm$evaluations$acc, "\n")
Test accuracy: 0.7889447 

78.9% out-of-sample accuracy. That's a marked improvement!
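As a quick sanity check, that figure can be recovered from the confusion matrix above:

sum(diag(out_lstm$confusion)) / sum(out_lstm$confusion)
# (80 + 77) / (80 + 17 + 25 + 77) = 157 / 199 = 0.7889447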

For another worked example, which starts with raw data (from rtweet), visit here.