NIMAA-vignette

Cheng Chen, Mohieddin Jafari

2021-09-09

Introduction

When dealing with nominal data, usually only a few methods, such as simple frequency statistics, can be applied. The NIMAA package proposes a pipeline for nominal data mining that can effectively find special relationships within the data.

It uses bipartite graphs to represent the relationship between two different types of nominal data, and organizes them in an incidence matrix in order to find the largest sub-matrices, along different dimensions, that do not contain any missing values.

Then, two one-partite graphs are obtained from the sub-matrix by projection, for which several methods are available. For either of them, NIMAA can apply a variety of clustering methods, and users can draw on external prior knowledge, where available, to select the result of the best clustering method as the ‘reference cluster’.

After that, we can perform multiple numerical imputations on the matrix with missing data, cluster each imputed matrix with the same method that produced the ‘reference cluster’, and select the best imputation method, namely the one whose clustering result is closest to the ‘reference cluster’.

library(NIMAA)

1 Explore the data

Here we use the ‘beatAML’ dataset as an example. It is composed of three columns: the first two columns are nominal data, and the third column is numerical data.

beatAML dataset samples

inhibitor                    patient_id   median
Alisertib (MLN8237)          11-00261      81.00097
Barasertib (AZD1152-HQPA)    11-00261      60.69244
Bortezomib (Velcade)         11-00261      81.00097
Canertinib (CI-1033)         11-00261      87.03067
Crenolanib                   11-00261      68.13586
CYT387                       11-00261      69.66083
Dasatinib                    11-00261      66.13318
Doramapimod (BIRB 796)       11-00261     101.52120
Dovitinib (CHIR-258)         11-00261      33.48040
Erlotinib                    11-00261      56.11189

Read the data from the package:

# read the data
beatAML_data <- NIMAA::beatAML

1.1 Plot the original data:

The function plotInput() prints the incidence matrix plot of the input data and returns that matrix.

NB: To keep the size of vignette small enough for CRAN rules, we won’t output the interactive figure here.

beatAML_incidence_matrix <- plotInput(
  x = beatAML_data, # original data with 3 columns
  index_nominal = c(2,1), # the first two columns are nominal data
  index_numeric = 3,  # the third column is numeric data
  print_skim = FALSE, # if you want to check the skim output, set this to TRUE (default)
  plot_weight = TRUE, # when plotting the figure, show the weights
  verbose = FALSE # do NOT save the figures to a local folder
  )
Na/missing values Proportion: 0.2603
beatAML dataset as incidence matrix
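
To make the idea of an incidence matrix concrete, the following minimal sketch builds one by hand in base R. The toy data frame and its values are invented for illustration, and this is not necessarily how plotInput() works internally; it only mirrors the kind of output reported above, a matrix plus the proportion of missing values.

```r
# Toy long-format data (hypothetical values, not taken from beatAML)
toy <- data.frame(
  inhibitor  = c("A", "B", "A", "C", "B"),
  patient_id = c("p1", "p1", "p2", "p2", "p3"),
  median     = c(81.0, 60.7, 55.2, 33.5, 70.1)
)

# Rows are patients, columns are inhibitors; pairs that were never
# measured stay NA (the default fill value of tapply)
inc <- tapply(toy$median, list(toy$patient_id, toy$inhibitor), FUN = mean)

# Proportion of missing cells in the incidence matrix
na_prop <- mean(is.na(inc))
print(inc)
print(na_prop)
```

Here 5 of the 9 possible patient-inhibitor pairs are observed, so the remaining 4 cells are missing.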


1.2 Plot the bipartite graph of the original data:

Once we have the incidence matrix, we can easily use it to form a bipartite graph. There are two ways to visualize the bipartite graph: static or interactive.

1.2.1 static:

The function plotBipartite() prints the bipartite graph based on the igraph package and returns that igraph graph object.

graph <- plotBipartite(inc_mat = beatAML_incidence_matrix, vertex.label.display = TRUE)


# show the igraph graph object
graph
#> IGRAPH ad6c4b9 UNWB 650 47636 -- 
#> + attr: name (v/c), type (v/l), shape (v/c), color (v/c), weight (e/n)
#> + edges from ad6c4b9 (vertex names):
#>  [1] Alisertib (MLN8237)      --11-00261 Barasertib (AZD1152-HQPA)--11-00261
#>  [3] Bortezomib (Velcade)     --11-00261 Canertinib (CI-1033)     --11-00261
#>  [5] Crenolanib               --11-00261 CYT387                   --11-00261
#>  [7] Dasatinib                --11-00261 Doramapimod (BIRB 796)   --11-00261
#>  [9] Dovitinib (CHIR-258)     --11-00261 Erlotinib                --11-00261
#> [11] Flavopiridol             --11-00261 GDC-0941                 --11-00261
#> [13] Gefitinib                --11-00261 Go6976                   --11-00261
#> [15] GW-2580                  --11-00261 Idelalisib               --11-00261
#> + ... omitted several edges
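
In the igraph summary line above, the code UNWB stands for undirected, named, weighted and bipartite, followed by the number of vertices (650) and edges (47636). The sketch below shows, on an invented toy matrix and in base R only, how these two counts relate to an incidence matrix: every row and column becomes a vertex, and every non-missing cell becomes a weighted edge. It is an illustration of the idea, not the code used by plotBipartite().

```r
# Toy incidence matrix (hypothetical values); NA marks an unmeasured pair
inc <- matrix(c(81.0,   NA, 55.2,
                60.7, 70.1,   NA),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("p1", "p2"), c("A", "B", "C")))

# Positions of the observed cells, as (row, col) index pairs
idx <- which(!is.na(inc), arr.ind = TRUE)

# One edge per observed cell, weighted by the cell value
edges <- data.frame(
  patient_id = rownames(inc)[idx[, "row"]],
  inhibitor  = colnames(inc)[idx[, "col"]],
  weight     = inc[idx]
)

n_vertices <- nrow(inc) + ncol(inc)  # both parts of the bipartite graph
n_edges    <- nrow(edges)
print(edges)
```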

1.2.2 interactive:

The function plotBipartiteInteractive() prints the interactive bipartite graph based on the visNetwork package.

NB: To keep the size of the vignette small enough, we won’t output the interactive figure here; a screenshot is used instead.

plotBipartiteInteractive(inc_mat = beatAML_incidence_matrix)

1.3 Analysis of the network(graph)

analysis_result <- analyseNetwork(graph)

2 Extract the sub-matrices without missing data

The function extractSubMatrix() extracts the sub-matrices that either have no missing values or have a specified proportion of missing values (the latter is not available for the elements-max matrix), depending on the user’s input. The result is also shown as a plotly figure.

The extraction process has two types of data preprocessing, the difference is that the first one directly uses the original input matrix (row-wise), while the second one uses the transposed matrix (column-wise).

After preprocessing, the matrix goes through a “three-step arrangement”.

The function then looks for the largest possible sub-matrix (with no missing values, or with a specified proportion of missing values) in the four dimensions (the shapes selected through the shape argument), outputs the result, and prints the visualization.
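
As a rough intuition for what "arranging the matrix and then looking for a complete block" means, here is a deliberately simplified base R sketch. It is an assumption-laden illustration, not the arrangement used by extractSubMatrix(): it sorts rows and columns by how many observed values they hold and then grows a square block from the top-left corner until a missing value appears.

```r
# Toy matrix with missing values (hypothetical data)
m <- matrix(c( 1,  2,  3, NA,
               4,  5, NA, NA,
               6,  7,  8,  9,
              NA, 10, NA, NA),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("r", 1:4), paste0("c", 1:4)))

# Put the most complete rows and columns first
m_ord <- m[order(rowSums(!is.na(m)), decreasing = TRUE),
           order(colSums(!is.na(m)), decreasing = TRUE)]

# Largest k such that the top-left k x k block has no missing value
k <- 0
for (i in seq_len(min(dim(m_ord)))) {
  if (anyNA(m_ord[seq_len(i), seq_len(i), drop = FALSE])) break
  k <- i
}
square <- m_ord[seq_len(k), seq_len(k), drop = FALSE]
print(square)
```

A greedy search like this is not guaranteed to find the largest complete block; it only shows why reordering the matrix makes complete blocks easier to locate.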

2.1 Extract the sub-matrices without missing data

Here we extract two sub-matrices of the beatAML_incidence_matrix:

sub_matrices <- extractSubMatrix(
  x = beatAML_incidence_matrix,
  shape = c("Square", "Rectangular_element_max"), # the shapes wanted
  row.vars = "patient_id",
  col.vars = "inhibitor",
  plot_weight = TRUE,
  verbose = FALSE,
  print_skim = FALSE # skip the skim output, just to reduce the length of the vignette
  )

We can see that there is an output called binmatnest2.temperature. This is the nestedness measure of the matrix. If the input is highly nested (nestedness temperature less than 1), we suggest dividing the data into different parts.

Row-wise arrangement


Column-wise arrangement


3 Find the cluster in one-partite graph

The function findCluster() first performs optional pre-processing on the input incidence matrix, such as normalization. It then uses the matrix to perform the bipartite graph projection and applies optional pre-processing to one of the specified parts, such as removing edges with lower weights, that is, weak edges.

The removal method and the threshold can also be specified, and for the remaining edges you can choose to keep the original weights or set all of them to 1. On the graph obtained after processing, several clustering methods from igraph are applied to obtain the classification results.
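
The projection and weak-edge steps described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the toy matrix is invented, the projection is taken to be a simple co-occurrence product, and "weak" means strictly below the median edge weight. The projection actually used by findCluster() may differ.

```r
# Toy binary incidence matrix: patients (rows) by inhibitors (columns)
inc <- matrix(c(1, 0, 1,
                1, 1, 0,
                0, 1, 1,
                1, 1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("p", 1:4), c("A", "B", "C")))

# One-partite projection onto patients: weight = number of shared inhibitors
proj <- inc %*% t(inc)
diag(proj) <- 0  # no self-loops

# Weak edges are those below the median of all edge weights
thr <- median(proj[upper.tri(proj)])

adj <- proj
adj[adj < thr] <- 0  # delete the weak edges
adj[adj > 0]  <- 1   # set the weights of the remaining edges to 1
print(adj)
```

In this toy case the median weight is 1.5, so only the three edges attached to p4 survive.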

3.1 Do clustering based on sub-matrices

First we can do clustering on one part (patient_id), to find the clusters.

cls <- findCluster(
  sub_matrices$Rectangular_element_max,
  dim = 1,
  method = "all", # clustering method
  normalization = TRUE, # normalize the input matrix
  rm_weak_edges = TRUE, # remove the weak edges in graph
  rm_method = 'delete', # removing method is deleting the edges
  threshold = 'median', # edges with weights under the median of all edges' weight are weak edges
  set_remaining_to_1 = TRUE # set the weights of remaining edges to 1
  )

In addition, if external features (prior knowledge) are provided as input, the function will also compare the obtained clustering results with the external features in terms of similarity.

For example, let’s generate some random features for the patient_id part.
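
A minimal sketch of how such a random feature table could be generated is given below. The patient identifiers (apart from 11-00261, which appears in the sample data), the group labels, and the two-column layout are all assumptions made for illustration; the exact structure expected by the extra_feature argument should be checked in the findCluster() documentation.

```r
set.seed(2021)  # make the random assignment reproducible

# Hypothetical patient identifiers; in practice these would be the
# row names of the sub-matrix being clustered
patient_ids <- c("11-00261", "patient-B", "patient-C",
                 "patient-D", "patient-E", "patient-F")
groups <- c("group 1", "group 2", "group 3")

# One randomly assigned group label per patient
external_feature <- data.frame(
  patient_id = patient_ids,
  membership = sample(groups, length(patient_ids), replace = TRUE),
  stringsAsFactors = FALSE
)
print(external_feature)
```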

Then we can do clustering again on one part (patient_id) to see the difference.

cls <- findCluster(
  sub_matrices$Rectangular_element_max,
  dim = 1,
  method = "all", # clustering method
  normalization = TRUE, # normalize the input matrix
  rm_weak_edges = TRUE, # remove the weak edges in graph
  rm_method = 'delete', # removing method is deleting the edges
  threshold = 'median', # edges with weights under the median of all edges' weight are weak edges
  set_remaining_to_1 = TRUE, # set the weights of remaining edges to 1
  extra_feature = external_feature # ADD AN EXTRA FEATURE REFERENCE HERE!
  )

Some new indices are included (jaccard_similarity, corrected.rand), which show the similarity between the different clustering results and the reference feature.

We can also do clustering on the other part (inhibitors); we just need to change dim to 2.

cls2 <- findCluster(
  sub_matrices$Rectangular_element_max, # the same sub-matrix
  dim = 2 # set to 2 to use the other part of graph
  )

4 Explore the clusters

In this part we explore the clustering results, focusing mainly on the visualization of clusters.

4.1 plotCluster

plotCluster() will output an interactive network figure, in which nodes belonging to the same group will be given the same color, and nodes belonging to different groups will have different colors.

plotCluster(graph = cls2$graph, cluster = cls2$louvain)

4.2 Visualize the clusters in sankey figure format

We use a Sankey diagram to represent the bipartite graph. The difference is that, within each of the two parts, nodes that belong to the same cluster are grouped into ‘summary’ nodes, and the result is output as an interactive figure.

visualClusterInBipartite(
  data = beatAML_data,
  community_left = cls$leading_eigen,
  community_right = cls2$fast_greedy,
  name_left = 'patient_id',
  name_right = 'inhibitor')
Interactive plot for function visualClusterInBipartite()


4.3 Score clusters

Once we have a clustering, we can score it. scoreCluster() mainly relies on the fpc package, and additionally calculates the coverage, an indicator described in Almeida, Hélio, et al. “Is there a best quality metric for graph clusters?”. The coverage of a clustering is the fraction of the weight of all intra-cluster edges with respect to the total weight of all edges in the whole graph.

scoreCluster(community = cls2$infomap,
             graph = cls2$graph,
             distance_matrix = cls2$distance_matrix)
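
The coverage definition quoted above can be written out directly. The sketch below uses an invented weighted adjacency matrix and membership vector in base R; it follows the verbal definition only and is not the code inside scoreCluster().

```r
# Toy weighted, undirected graph on four nodes (hypothetical weights)
adj <- matrix(0, 4, 4, dimnames = list(letters[1:4], letters[1:4]))
adj["a", "b"] <- adj["b", "a"] <- 3  # intra-cluster edge
adj["c", "d"] <- adj["d", "c"] <- 2  # intra-cluster edge
adj["b", "c"] <- adj["c", "b"] <- 1  # edge between the two clusters

# Cluster label of each node
membership <- c(a = 1, b = 1, c = 2, d = 2)

# Count each undirected edge once by using the upper triangle only
ut <- upper.tri(adj)
same_cluster <- outer(membership, membership, "==")

# Coverage: intra-cluster edge weight divided by total edge weight
coverage <- sum(adj[ut & same_cluster]) / sum(adj[ut])
print(coverage)
```

A coverage of 1 would mean that no edge crosses a cluster boundary.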

4.4 Validate clusters

validateCluster() calculates the similarity between the given clustering result and the reference feature, that is, corrected.rand and jaccard_similarity.

validateCluster(dist_mat = cls$distance_matrix,
                extra_feature = external_feature,
                community = cls$leading_eigen)
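
For readers unfamiliar with the corrected Rand index, the sketch below computes the widely used Hubert-Arabie adjusted Rand index from a contingency table in base R. It is assumed here that corrected.rand refers to this chance-corrected index; the helper name adjusted_rand and the two toy label vectors are hypothetical.

```r
# Adjusted (chance-corrected) Rand index between two label vectors
adjusted_rand <- function(x, y) {
  n_pairs <- function(n) n * (n - 1) / 2   # number of pairs among n items
  tab <- table(x, y)                       # contingency table of the labels
  sum_ij <- sum(n_pairs(tab))              # pairs together in both partitions
  sum_a  <- sum(n_pairs(rowSums(tab)))     # pairs together in x
  sum_b  <- sum(n_pairs(colSums(tab)))     # pairs together in y
  expected  <- sum_a * sum_b / n_pairs(length(x))
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

clustering <- c(1, 1, 1, 2, 2, 2)  # toy clustering result
reference  <- c(1, 1, 2, 2, 3, 3)  # toy reference feature

ari <- adjusted_rand(clustering, reference)
print(ari)
```

The index equals 1 for identical partitions and is close to 0 when the agreement is no better than chance.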

5 Imputation

5.1 Impute missing data

The imputeMissingValue() function imputes the missing values in the matrix; we only need to select which methods to use. It performs the requested numerical imputations and returns a list in which each element is a matrix with no missing values.

imputations <- imputeMissingValue(
  inc_mat = beatAML_incidence_matrix,
  method = c('svd','median','als','CA')
  )
# show the result format
names(imputations)
#> [1] "median" "svd"    "als"    "CA"
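
As an illustration of the simplest of these methods, the sketch below performs a median imputation in base R on an invented matrix. It assumes that missing cells are replaced by the median of their column; whether the 'median' method of imputeMissingValue() works column-wise or otherwise is not stated here, so treat this as a sketch of the general idea.

```r
# Toy matrix with missing values (hypothetical data)
m <- matrix(c( 1, NA,  3,
               4,  5, NA,
               7,  8,  9),
            nrow = 3, byrow = TRUE)

# Replace each missing cell with the median of the observed
# values in the same column
imputed <- apply(m, 2, function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})
print(imputed)
```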

5.2 Validate imputation

validateImputation() will calculate the following measures between the clustering of each imputed matrix and the reference community:

  1. Jaccard similarity

  2. Dice similarity coefficient

  3. Rand index

  4. Minkowski (inversed)

  5. Fowlkes–Mallows index

It then plots the ranking, so that users can see which imputation method performs relatively better.

validation_of_imputation <- validateImputation(
  imputation = imputations,
  refer_community = cls$fast_greedy,
  clustering_args = cls$clustering_args
  )
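
Several of the measures listed above are pair-counting indices: they compare two clusterings by checking, for every pair of items, whether the pair is grouped together in each of them. The base R sketch below computes four of them from their standard definitions on invented label vectors. The helper name pair_indices is hypothetical, the Minkowski score is left out, and this is not the implementation used by validateImputation().

```r
# Pair-counting similarity indices between two label vectors
pair_indices <- function(x, y) {
  ut <- upper.tri(matrix(0, length(x), length(x)))  # each pair once
  same_x <- outer(x, x, "==")[ut]  # pair grouped together in x?
  same_y <- outer(y, y, "==")[ut]  # pair grouped together in y?
  n11 <- sum(same_x & same_y)      # together in both
  n10 <- sum(same_x & !same_y)     # together in x only
  n01 <- sum(!same_x & same_y)     # together in y only
  n00 <- sum(!same_x & !same_y)    # separated in both
  c(jaccard = n11 / (n11 + n10 + n01),
    dice = 2 * n11 / (2 * n11 + n10 + n01),
    rand = (n11 + n00) / (n11 + n10 + n01 + n00),
    fowlkes_mallows = n11 / sqrt((n11 + n10) * (n11 + n01)))
}

imputed_clusters   <- c(1, 1, 1, 2, 2, 2)  # toy clustering after imputation
reference_clusters <- c(1, 1, 2, 2, 3, 3)  # toy reference community

scores <- pair_indices(imputed_clusters, reference_clusters)
print(round(scores, 4))
```

All four indices reach 1 when the two clusterings are identical, which is why a higher value indicates an imputation that better preserves the reference structure.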

6 Using the data with imputation to explore again

From the result of the previous section, we can see that ‘als’ is the best method.

There are no missing values in the imputed matrix:

NA %in% imputations$als

And the imputed matrix has the same dimensions as the original matrix:

dim(imputations$als) == dim(beatAML_incidence_matrix)

Thus, users can take this imputed matrix as the input matrix and redo the first 5 steps (the extraction step can be skipped because this matrix already has no missing values) to mine the relationships inside the original dataset.