With nominal data, usually only simple methods such as frequency statistics can be applied. The NIMAA package proposes a pipeline for nominal data mining that can effectively find special relationships in the data.
It uses bipartite graphs to represent the relationship between two different types of nominal data, and organizes them in an incidence matrix in order to find sub-matrices that are as large as possible in different dimensions and contain no missing values.
Then, two one-partite graphs are obtained from the sub-matrix using a variety of projection methods. For either of them, NIMAA can apply a variety of clustering methods, and users can use any available external prior knowledge to select the result of the best clustering method as the ‘reference cluster’.
After that, we can perform multiple numerical imputations on the matrix with missing data, cluster each imputed matrix with the same method as the ‘reference cluster’, and select the best imputation method, which is the one whose result is closest to the ‘reference cluster’.
library(NIMAA)
Here we use the ‘beatAML’ dataset as an example. It has three columns: the first two contain nominal data, and the third contains numerical data.
inhibitor | patient_id | median |
---|---|---|
Alisertib (MLN8237) | 11-00261 | 81.00097 |
Barasertib (AZD1152-HQPA) | 11-00261 | 60.69244 |
Bortezomib (Velcade) | 11-00261 | 81.00097 |
Canertinib (CI-1033) | 11-00261 | 87.03067 |
Crenolanib | 11-00261 | 68.13586 |
CYT387 | 11-00261 | 69.66083 |
Dasatinib | 11-00261 | 66.13318 |
Doramapimod (BIRB 796) | 11-00261 | 101.52120 |
Dovitinib (CHIR-258) | 11-00261 | 33.48040 |
Erlotinib | 11-00261 | 56.11189 |
Read the data from the package:
# read the data
beatAML_data <- NIMAA::beatAML
Function plotInput() will print the incidence matrix plot of input data, and return that matrix.
NB: To keep the size of vignette small enough for CRAN rules, we won’t output the interactive figure here.
beatAML_incidence_matrix <- plotInput(
  x = beatAML_data, # original data with 3 columns
  index_nominal = c(2,1), # the first two columns are nominal data
  index_numeric = 3, # the third column is numeric data
  print_skim = FALSE, # set to TRUE (the default) to check the skim output
  plot_weight = TRUE, # when plotting the figure, show the weights
  verbose = FALSE # do NOT save the figures to a local folder
)
beatAML dataset as incidence matrix
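To make the idea of an incidence matrix concrete, here is a minimal base R sketch that builds one by hand from a small long-format table. The toy values and the names `toy` and `inc` are made up for illustration, and the layout (patients as rows, inhibitors as columns) is an assumption based on the arguments used above; this is not how plotInput() is implemented.

```r
# Toy long-format table (hypothetical values, not taken from beatAML)
toy <- data.frame(
  inhibitor  = c("A", "B", "A", "C"),
  patient_id = c("p1", "p1", "p2", "p2"),
  median     = c(81.0, 60.7, 66.1, 33.5),
  stringsAsFactors = FALSE
)

rows <- sort(unique(toy$patient_id))
cols <- sort(unique(toy$inhibitor))

# Start from an all-NA matrix so that unobserved pairs stay missing
inc <- matrix(NA_real_, nrow = length(rows), ncol = length(cols),
              dimnames = list(rows, cols))

# Fill each observed (patient, inhibitor) cell with its numeric value
inc[cbind(toy$patient_id, toy$inhibitor)] <- toy$median
inc
```

Pairs that never occur in the table (for example patient p1 with inhibitor C) remain `NA`, which is what the later extraction and imputation steps deal with.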
Now that we have the incidence matrix, we can easily use it to form a bipartite graph. There are two ways to visualize the bipartite graph: static or interactive.
Function plotBipartite() will print the bipartite graph based on igraph package, and return that igraph graph object.
graph <- plotBipartite(inc_mat = beatAML_incidence_matrix, vertex.label.display = TRUE)
# show the igraph graph object
graph
#> IGRAPH ad6c4b9 UNWB 650 47636 --
#> + attr: name (v/c), type (v/l), shape (v/c), color (v/c), weight (e/n)
#> + edges from ad6c4b9 (vertex names):
#> [1] Alisertib (MLN8237) --11-00261 Barasertib (AZD1152-HQPA)--11-00261
#> [3] Bortezomib (Velcade) --11-00261 Canertinib (CI-1033) --11-00261
#> [5] Crenolanib --11-00261 CYT387 --11-00261
#> [7] Dasatinib --11-00261 Doramapimod (BIRB 796) --11-00261
#> [9] Dovitinib (CHIR-258) --11-00261 Erlotinib --11-00261
#> [11] Flavopiridol --11-00261 GDC-0941 --11-00261
#> [13] Gefitinib --11-00261 Go6976 --11-00261
#> [15] GW-2580 --11-00261 Idelalisib --11-00261
#> + ... omitted several edges
Function plotBipartiteInteractive() will print the interactive bipartite graph based on visNetwork package.
NB: To keep the size of the vignette small enough, we won’t output the interactive figure here; a screenshot is shown instead.
plotBipartiteInteractive(inc_mat = beatAML_incidence_matrix)
analysis_result <- analyseNetwork(graph)
The extractSubMatrix() function extracts sub-matrices that either contain no missing values or contain a specified proportion of missing values (the latter is not available for the elements-max matrix), depending on the user’s input. The result is also shown as a plotly figure.
The extraction process has two types of data preprocessing: the first uses the original input matrix directly (row-wise), while the second uses the transposed matrix (column-wise).
After preprocessing, the matrix undergoes a “three-step arrangement”:
the first step is row arranging;
the second step is column arranging;
the third step is total rearranging.
It then looks for the largest possible matrix (with no missing values, or with the specified proportion of missing values) in each of the four dimensions, outputs the result, and prints the visualization.
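The idea of arranging rows and columns before searching for a complete block can be illustrated with a deliberately simplified base R sketch. It is an assumption-laden toy, not the algorithm used by extractSubMatrix(): it only sorts rows and columns by their number of observed values and then grows a square from the top-left corner until a missing value is hit. The matrix `m` and all variable names are hypothetical.

```r
# Toy matrix with missing values (hypothetical data)
m <- matrix(c( 1, NA,  3,  4,
               5,  6,  7, NA,
              NA, NA,  9, 10,
              11, 12, 13, 14),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("r", 1:4), paste0("c", 1:4)))

observed <- !is.na(m)

# Arrange rows and columns so that the most complete ones come first
row_order <- order(rowSums(observed), decreasing = TRUE)
col_order <- order(colSums(observed), decreasing = TRUE)
arranged  <- observed[row_order, col_order]

# Grow a square from the top-left corner while it contains no missing value
k <- 0
for (i in seq_len(min(dim(arranged)))) {
  if (all(arranged[seq_len(i), seq_len(i), drop = FALSE])) k <- i else break
}

# The complete square sub-matrix, taken from the original values
square <- m[row_order[seq_len(k)], col_order[seq_len(k)], drop = FALSE]
square
```

A greedy search like this is not guaranteed to find the largest complete block; it only shows why ordering by completeness tends to push usable data into one corner of the matrix.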
Here we extract two sub-matrices of beatAML_incidence_matrix:
sub_matrices <- extractSubMatrix(
  x = beatAML_incidence_matrix,
  shape = c("Square", "Rectangular_element_max"), # the shapes wanted
  row.vars = "patient_id",
  col.vars = "inhibitor",
  plot_weight = TRUE,
  verbose = FALSE,
  print_skim = TRUE # just to reduce the length of vignette
)
The output includes a value called binmatnest2.temperature, which is the nestedness measure of the matrix. If the input is highly nested (nestedness temperature less than 1), we suggest dividing the data into different parts.
Row-wise arrangement
Column-wise arrangement
The findCluster() function performs optional pre-processing on the input incidence matrix, such as normalization. It then uses the matrix to perform bipartite graph projection and applies optional pre-processing to one of the specified parts, such as removing edges with lower weights, that is, weak edges.
The removal method and the threshold can also be specified, and for the remaining edges you can choose to keep the original weights or set all of them to 1. Clustering methods from igraph are then applied to the resulting graph to obtain the classification results.
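The projection and weak-edge steps described above can be sketched in base R. This is only an illustration under stated assumptions: the projection weight is taken to be the number of shared columns (one common choice, not necessarily the one findCluster() uses), normalization is skipped, and the toy matrix and variable names are hypothetical.

```r
# Toy binary incidence matrix: 4 patients (rows) x 4 inhibitors (columns)
inc <- matrix(c(1, 1, 1, 0,
                1, 1, 1, 1,
                0, 1, 1, 1,
                1, 0, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("p", 1:4), c("A", "B", "C", "D")))

# Projection onto the row part: weight = number of shared columns
proj <- tcrossprod(inc)   # same as inc %*% t(inc)
diag(proj) <- 0           # drop self-loops

# Edge weights are the positive entries of the upper triangle
w   <- proj[upper.tri(proj)]
w   <- w[w > 0]
thr <- median(w)

# Delete edges whose weight is under the median, set the remaining weights to 1
adj <- (proj >= thr & proj > 0) * 1
adj
```

In this toy example the edges from p4 fall under the median weight and are removed, so p4 ends up isolated while p1, p2 and p3 stay connected.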
First we can do clustering on one part (patient_id), to find the clusters.
cls <- findCluster(
  sub_matrices$Rectangular_element_max,
  dim = 1,
  method = "all", # clustering method
  normalization = TRUE, # normalize the input matrix
  rm_weak_edges = TRUE, # remove the weak edges in graph
  rm_method = 'delete', # removing method is deleting the edges
  threshold = 'median', # edges with weights under the median of all edges' weight are weak edges
  set_remaining_to_1 = TRUE # set the weights of remaining edges to 1
)
In addition, if external features (prior knowledge) are provided, the function will also compare the clustering results with the external features in terms of similarity.
For example, let’s generate some random features for the patient_id part; here are the samples:
Then we can do clustering again on one part (patient_id) to see the difference.
cls <- findCluster(
  sub_matrices$Rectangular_element_max,
  dim = 1,
  method = "all", # clustering method
  normalization = TRUE, # normalize the input matrix
  rm_weak_edges = TRUE, # remove the weak edges in graph
  rm_method = 'delete', # removing method is deleting the edges
  threshold = 'median', # edges with weights under the median of all edges' weight are weak edges
  set_remaining_to_1 = TRUE, # set the weights of remaining edges to 1
  extra_feature = external_feature # ADD AN EXTRA FEATURE REFERENCE HERE!
)
Some new indices are included (jaccard_similarity, corrected), which show the similarity between the different clustering results and the reference feature.
We can also do clustering on the other part (inhibitors); we just need to change dim to 2.
cls2 <- findCluster(
  sub_matrices$Rectangular_element_max, # the same sub-matrix
  dim = 2 # set to 2 to use the other part of graph
)
In this part we explore the clustering results, focusing mainly on the visualization of clusters.
plotCluster() will output an interactive network figure, in which nodes belonging to the same group will be given the same color, and nodes belonging to different groups will have different colors.
plotCluster(graph=cls2$graph,cluster = cls2$louvain)
We use a Sankey graph to represent the bipartite graph. The difference is that, within each of the two parts, nodes that belong to the same cluster are grouped into a single ‘summary’ node. The output is an interactive figure.
visualClusterInBipartite(
data = beatAML_data,
community_left = cls$leading_eigen,
community_right = cls2$fast_greedy,
name_left = 'patient_id',
name_right = 'inhibitor')
Interactive plot for function visualClusterInBipartite()
Once we have a clustering, we can score it. The scoring mainly uses the fpc package. In addition, we calculate the coverage, an indicator taken from Almeida, Hélio, et al. “Is there a best quality metric for graph clusters?”. The coverage of a clustering is the fraction of the weight of all intra-cluster edges with respect to the total weight of all edges in the whole graph.
scoreCluster(community = cls2$infomap,
graph = cls2$graph,
distance_matrix = cls2$distance_matrix)
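The coverage definition above can be written out directly. The following base R sketch computes it from a weighted adjacency matrix and a membership vector; the function name `coverage` and the toy graph are hypothetical, and this is not the code used inside scoreCluster().

```r
# Coverage = weight of intra-cluster edges / weight of all edges
coverage <- function(adj, membership) {
  same <- outer(membership, membership, "==")  # TRUE where two nodes share a cluster
  ut   <- upper.tri(adj)                       # count each undirected edge once
  sum(adj[ut & same]) / sum(adj[ut])
}

# Toy weighted graph on 5 nodes: edges 1-2 (2), 2-3 (1), 3-4 (1), 4-5 (3)
adj   <- matrix(0, 5, 5)
edges <- rbind(c(1, 2, 2), c(2, 3, 1), c(3, 4, 1), c(4, 5, 3))
for (e in seq_len(nrow(edges))) {
  adj[edges[e, 1], edges[e, 2]] <- edges[e, 3]
  adj[edges[e, 2], edges[e, 1]] <- edges[e, 3]
}

membership <- c(1, 1, 1, 2, 2)
coverage(adj, membership)
```

Here only the edge between nodes 3 and 4 crosses the two clusters, so the coverage is 6/7. A value close to 1 means that most of the edge weight lies inside clusters.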
validateCluster() will calculate the similarity of the given clustering result and the reference feature, that is, corrected.rand and jaccard_similarity
validateCluster(dist_mat = cls$distance_matrix,
extra_feature = external_feature,
community = cls$leading_eigen)
The imputeMissingValue() function imputes the missing values in the matrix; we only need to select which methods to use.
It performs the numerical imputations chosen by the user and returns a list of matrices, one per method, none of which contains any missing values.
‘median’ will replace the missing values with the median of each row (observation)
‘knn’ is the method in package
‘als’ and ‘svd’ are methods from package
‘CA’, ‘PCA’ and ‘FAMD’ are from package
others are from the famous package.
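As an illustration of the simplest option in the list above, row-wise median imputation can be sketched in a few lines of base R. The helper `impute_row_median` is a hypothetical name written for this example and is not the package's implementation.

```r
# Replace each missing value with the median of the observed values in its row
impute_row_median <- function(m) {
  for (i in seq_len(nrow(m))) {
    miss <- is.na(m[i, ])
    # If a row is entirely missing, the median is NA and the row stays NA
    m[i, miss] <- median(m[i, ], na.rm = TRUE)
  }
  m
}

m <- matrix(c(1, NA,  3,
              4,  6, NA),
            nrow = 2, byrow = TRUE)

filled <- impute_row_median(m)
filled
```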
imputations <- imputeMissingValue(
  inc_mat = beatAML_incidence_matrix,
  method = c('svd','median','als','CA')
)
# show the result format
names(imputations)
#> [1] "median" "svd" "als" "CA"
validateImputation() will calculate the following measures between each imputation and the reference community:
Jaccard similarity
Dice similarity coefficient
Rand index
Minkowski (inversed)
Fowlkes–Mallows index
It then plots the ranking, so that the user can see which imputation method is relatively better.
validation_of_imputation <- validateImputation(
  imputation = imputations,
  refer_community = cls$fast_greedy,
  clustering_args = cls$clustering_args
)
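Several of the measures used for validation are pair-counting indices: they compare two clusterings by checking, for every pair of items, whether the pair is placed together or apart in each clustering. The base R sketch below uses the standard textbook definitions; the function name `pair_indices` and the toy memberships are hypothetical, and the exact variants computed by validateImputation() (for example the inverted Minkowski score) are not reproduced here.

```r
# Pair-counting comparison of two membership vectors of equal length
pair_indices <- function(a, b) {
  ut     <- upper.tri(diag(length(a)))   # each unordered pair once
  same_a <- outer(a, a, "==")[ut]
  same_b <- outer(b, b, "==")[ut]
  n11 <- sum(same_a & same_b)     # together in both clusterings
  n10 <- sum(same_a & !same_b)    # together only in a
  n01 <- sum(!same_a & same_b)    # together only in b
  n00 <- sum(!same_a & !same_b)   # apart in both
  c(jaccard         = n11 / (n11 + n10 + n01),
    dice            = 2 * n11 / (2 * n11 + n10 + n01),
    rand            = (n11 + n00) / (n11 + n10 + n01 + n00),
    fowlkes_mallows = n11 / sqrt((n11 + n10) * (n11 + n01)))
}

reference <- c(1, 1, 1, 2, 2, 2)
candidate <- c(1, 1, 2, 2, 2, 2)
pair_indices(reference, candidate)
```

All four values equal 1 when the two clusterings are identical, which is why an imputation whose clustering scores higher against the reference community is ranked as better.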
From the result of the previous section, we can see that ‘als’ is the best method.
There are no missing values in the imputed matrix:
NA %in% imputations$als
The imputed matrix also has the same dimensions as the original matrix:
dim(imputations$als) == dim(beatAML_incidence_matrix)
Users can therefore use this imputed matrix as the input matrix and redo the first 5 steps (extraction is not needed, because this matrix already has no missing values) to mine the relationships inside the original dataset.