A Short Introduction to the MAKL Package

For a better understanding of the MAKL library, we build a simple example in this document. We first create a synthetic dataset consisting of 1000 rows and 6 features, with entries drawn from a standard Gaussian distribution.

library(MAKL)
set.seed(64327) #midas
df <- matrix(rnorm(6000, 0, 1), nrow = 1000)
colnames(df) <- c("F1", "F2", "F3", "F4", "F5", "F6")

For the membership argument of makl_train(), we prepare a list of two groups: the first contains the features F1, F5, and F6; the second contains the remaining features, F2, F3, and F4. Note that the column names of the input dataset must be a superset of the union of all feature names in the groups list.

# check colnames(df) for them to be matching with group members
groups <- list()
groups[[1]] <- c("F1", "F5", "F6")
groups[[2]] <- c("F2", "F3", "F4")
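 
The comment above can also be made explicit: a quick check in base R (not part of the MAKL API) confirms that every feature name used in the groups list appears among the columns of df.

# verify that all group members are column names of df
all(unlist(groups) %in% colnames(df))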

We then create the response vector y so that it depends on the second, third, and fourth features, namely F2, F3, and F4: if, for a data instance, the sum of the entries in these three columns is positive, the corresponding response is assigned +1; otherwise, it is assigned -1.

y <- c()
for(i in 1:nrow(df)) {
  if((df[i, 2] + df[i, 3] + df[i, 4]) > 0) {
    y[i] <- +1
  } else {
    y[i] <- -1
  }
}
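 
Because the features are drawn from a distribution that is symmetric around zero, the two classes should be roughly balanced. A quick check of the label distribution (plain base R, independent of MAKL):

# distribution of the two class labels
table(y)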

We use the synthetic dataset df and the response vector y as the training data and training response vector in makl_train(). We set the number of random features D to 2, which is reasonable given that our training dataset is 6-dimensional. We set sigma_N, the number of rows used for the distance matrix calculation, to 1000, and choose a lambda_set of 0.9, 0.8, 0.7, and 0.6 to obtain sparse solutions. As the membership list, we use the groups list created above.

makl_model <- makl_train(X = df, y = y, D = 2, sigma_N = 1000, CV = 1, membership = groups, lambda_set = c(0.9, 0.8, 0.7, 0.6))
#> Lambda: 155.0901  nr.var: 5 
#> Lambda: 137.8579  nr.var: 5 
#> Lambda: 120.6257  nr.var: 5 
#> Lambda: 103.3934  nr.var: 5
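 
Note that the lambda values printed by makl_train() are not the elements of lambda_set themselves. Dividing each printed value by the corresponding element of lambda_set gives the same constant (about 172.32), which suggests that lambda_set is interpreted as fractions of a maximal lambda computed internally; this is an inference from the printed output, not documented behaviour.

# ratios of the printed lambda values to the corresponding lambda_set entries
c(155.0901, 137.8579, 120.6257, 103.3934) / c(0.9, 0.8, 0.7, 0.6)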

When we check the coefficients of our model, we see that the kernel chosen for prediction by makl_train() was the kernel of the second group. This is the expected result, since we created the response vector y to depend on the members of the second group in the groups list.

makl_model$model$coefficients
#>       155.090126229481 137.857889981761 120.625653734041 103.39341748632
#>  [1,]       0.00000000        0.0000000        0.0000000       0.0000000
#>  [2,]       0.00000000        0.0000000        0.0000000       0.0000000
#>  [3,]       0.00000000        0.0000000        0.0000000       0.0000000
#>  [4,]       0.00000000        0.0000000        0.0000000       0.0000000
#>  [5,]      -0.29314353       -0.5938544       -0.9106226      -1.2539243
#>  [6,]       0.06703617        0.1352210        0.2057486       0.2799665
#>  [7,]       0.24539658        0.4973664        0.7630398       1.0509792
#>  [8,]      -0.36108294       -0.7320709       -1.1246002      -1.5535840
#>  [9,]       0.12450233        0.1542956        0.1858601       0.2195980
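 
To read this matrix programmatically rather than by eye, we can check which coefficient rows are nonzero across the lambda values. This is only a sanity check on the matrix printed above; the mapping from coefficient rows to kernels follows makl_train()'s internal ordering, which we do not rely on here.

# rows of the coefficient matrix that are nonzero for at least one lambda
which(rowSums(abs(makl_model$model$coefficients)) != 0)
# number of active coefficients per lambda (compare with "nr.var" printed above)
colSums(makl_model$model$coefficients != 0)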

Now, let us create a synthetic test dataset df_test and a test response vector y_test, generated in the same way as the training data, and use them in makl_test() to check the results.

df_test <- matrix(rnorm(600, 0, 1), nrow = 100)
colnames(df_test) <- c("F1", "F2", "F3", "F4", "F5", "F6")
y_test <- c()
for(i in 1:nrow(df_test)) {
  if((df_test[i, 2] + df_test[i, 3] + df_test[i, 4]) > 0) {
    y_test[i] <- +1
  } else {
    y_test[i] <- -1
  }
}
result <- makl_test(X = df_test, y = y_test, makl_model = makl_model)

The list result contains two elements: (1) the predictions for the test response vector y_test, and (2) the area under the ROC curve (AUROC) together with the number of selected kernels, reported for each element of lambda_set if CV is not applied, or only for the best lambda in lambda_set if CV is applied.

result$auroc_kernel_number
#>     auroc_array n_selected_kernels
#> 0.9   0.9494179                  1
#> 0.8   0.9494179                  1
#> 0.7   0.9498193                  1
#> 0.6   0.9498193                  1
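 
The predictions themselves (the first element of result) are not printed in this vignette; since we do not want to assume a particular element name, the easiest way to locate them is to inspect the returned list directly.

# names and top-level structure of the list returned by makl_test()
names(result)
str(result, max.level = 1)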
