Clustering Package

Luis Alfonso Pérez Martos

2020-04-21

Clustering can be viewed as a concise model of the data: given a set of observations, we partition them into groups whose members are as similar as possible. A review of the clustering algorithms implemented in R shows a large number of packages that implement or improve an algorithm or a piece of functionality.

The Clustering package contains multiple implementations of algorithms such as: gmm, kmeans-arma, kmeans-rcpp, fuzzy_cm, fuzzy_gg, fuzzy_gk, hclust, apclusterk, aggExcluster, clara, daisy, diana, fanny, gama, mona, pam, pvclust, pvpick.

It can also use different similarity measures to calculate the distance between points, such as: Euclidean, Manhattan, Jaccard, Gower, Mahalanobis, Correlation and Minkowski.
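
As a small illustration of how some of these measures differ, the sketch below uses base R's stats::dist() on two toy points. It does not call the Clustering package itself, and the points are made up for the example.

```r
# Two toy points in the plane (illustrative values, not taken from the package).
pts <- rbind(a = c(0, 0), b = c(3, 4))

# Distances computed with base R's stats::dist().
d_euclidean <- as.numeric(dist(pts, method = "euclidean"))        # sqrt(3^2 + 4^2) = 5
d_manhattan <- as.numeric(dist(pts, method = "manhattan"))        # |3| + |4| = 7
d_minkowski <- as.numeric(dist(pts, method = "minkowski", p = 3)) # (3^3 + 4^3)^(1/3)

print(c(euclidean = d_euclidean, manhattan = d_manhattan, minkowski = d_minkowski))
```

The same pair of points gets a different distance under each measure, which is why the choice of measure can change the clusters an algorithm finds.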

Furthermore, the package offers functions to:

Clustering

This is the main method of the package. The clustering method runs a set of clustering algorithms. To get information about the parameters of the method, use ?function or help(function). Datasets can be loaded in two different ways:

Once the method has been executed, we obtain the results divided into four parts:


df <- Clustering::clustering(df = Clustering::basketball,  
                             packages = c("clusterr"), min = 4, max = 6)

Here we have a dataframe with the result of the execution. In it you can see all the algorithms, the similarity measures used, the variables classified in order of importance, the execution time of the algorithms and the evaluation metrics.

Algorithm Distance Clusters Dataset Ranking timeExternal entropy variation_information precision recall f_measure fowlkes_mallows_index connectivity dunn silhouette timeInternal
gmm gmm_euclidean 4 dataframe 1 0.0218 0 0 0 0 0 0 34.09 0.1646 0.23 0.0064
gmm gmm_euclidean 4 dataframe 2 0.0242 0 0 0 0 0 0 34.09 0.1646 0.23 0.0078
gmm gmm_euclidean 4 dataframe 3 0.2115 0 0 0 0 0 0 34.09 0.1646 0.23 0.0101
gmm gmm_euclidean 4 dataframe 4 0.2985 0 0 0 0 0 0 34.09 0.1646 0.23 0.0103
gmm gmm_euclidean 4 dataframe 5 0.5018 0 0 0 0 0 0 34.09 0.1646 0.23 0.0103
gmm gmm_euclidean 5 dataframe 1 0.0214 0 0 0 0 0 0 42.08 0.1619 0.25 0.0064
gmm gmm_euclidean 5 dataframe 2 0.0227 0 0 0 0 0 0 42.08 0.1619 0.25 0.0064
gmm gmm_euclidean 5 dataframe 3 0.1612 0 0 0 0 0 0 42.08 0.1619 0.25 0.0066
gmm gmm_euclidean 5 dataframe 4 0.1617 0 0 0 0 0 0 42.08 0.1619 0.25 0.0094
gmm gmm_euclidean 5 dataframe 5 0.1967 0 0 0 0 0 0 42.08 0.1619 0.25 0.0141
gmm gmm_euclidean 6 dataframe 1 0.0245 0 0 0 0 0 0 51.46 0.1619 0.23 0.0065
gmm gmm_euclidean 6 dataframe 2 0.0285 0 0 0 0 0 0 51.46 0.1619 0.23 0.0066
gmm gmm_euclidean 6 dataframe 3 0.1529 0 0 0 0 0 0 51.46 0.1619 0.23 0.0066
gmm gmm_euclidean 6 dataframe 4 0.2573 0 0 0 0 0 0 51.46 0.1619 0.23 0.0075
gmm gmm_euclidean 6 dataframe 5 0.3114 0 0 0 0 0 0 51.46 0.1619 0.23 0.02
gmm gmm_manhattan 4 dataframe 1 0.0151 0 0 0 0 0 0 35.59 0.1348 0.23 0.0064
gmm gmm_manhattan 4 dataframe 2 0.0241 0 0 0 0 0 0 35.59 0.1348 0.23 0.0066
gmm gmm_manhattan 4 dataframe 3 0.1482 0 0 0 0 0 0 35.59 0.1348 0.23 0.0069
gmm gmm_manhattan 4 dataframe 4 0.1494 0 0 0 0 0 0 35.59 0.1348 0.23 0.0073
gmm gmm_manhattan 4 dataframe 5 0.1626 0 0 0 0 0 0 35.59 0.1348 0.23 0.0084
gmm gmm_manhattan 5 dataframe 1 0.0189 0 0 0 0 0 0 46.83 0.1322 0.26 0.0064
gmm gmm_manhattan 5 dataframe 2 0.0217 0 0 0 0 0 0 46.83 0.1322 0.26 0.0064
gmm gmm_manhattan 5 dataframe 3 0.1468 0 0 0 0 0 0 46.83 0.1322 0.26 0.0065
gmm gmm_manhattan 5 dataframe 4 0.1583 0 0 0 0 0 0 46.83 0.1322 0.26 0.0066
gmm gmm_manhattan 5 dataframe 5 0.1642 0 0 0 0 0 0 46.83 0.1322 0.26 0.0068
gmm gmm_manhattan 6 dataframe 1 0.0221 0 0 0 0 0 0 54.87 0.1467 0.25 0.0064
gmm gmm_manhattan 6 dataframe 2 0.0258 0 0 0 0 0 0 54.87 0.1467 0.25 0.0064
gmm gmm_manhattan 6 dataframe 3 0.143 0 0 0 0 0 0 54.87 0.1467 0.25 0.0067
gmm gmm_manhattan 6 dataframe 4 0.1491 0 0 0 0 0 0 54.87 0.1467 0.25 0.0067
gmm gmm_manhattan 6 dataframe 5 0.155 0 0 0 0 0 0 54.87 0.1467 0.25 0.0068
kmeans_arma kmeans_arma 4 dataframe 1 0.0006 0 0 0 0 0 0 44.21 0.1495 0.23 0.0065
kmeans_arma kmeans_arma 4 dataframe 2 0.0007 0 0 0 0 0 0 44.21 0.1495 0.23 0.0066
kmeans_arma kmeans_arma 4 dataframe 3 0.0007 0 0 0 0 0 0 44.21 0.1495 0.23 0.0067
kmeans_arma kmeans_arma 4 dataframe 4 0.0008 0 0 0 0 0 0 44.21 0.1495 0.23 0.0069
kmeans_arma kmeans_arma 4 dataframe 5 0.0016 0 0 0 0 0 0 44.21 0.1495 0.23 0.0078
kmeans_arma kmeans_arma 5 dataframe 1 0.0007 0 0 0 0 0 0 49.22 0.1538 0.26 0.0065
kmeans_arma kmeans_arma 5 dataframe 2 0.0007 0 0 0 0 0 0 49.22 0.1538 0.26 0.0067
kmeans_arma kmeans_arma 5 dataframe 3 0.0007 0 0 0 0 0 0 49.22 0.1538 0.26 0.0068
kmeans_arma kmeans_arma 5 dataframe 4 0.0007 0 0 0 0 0 0 49.22 0.1538 0.26 0.0068
kmeans_arma kmeans_arma 5 dataframe 5 0.0007 0 0 0 0 0 0 49.22 0.1538 0.26 0.0078
kmeans_arma kmeans_arma 6 dataframe 1 0.0007 0 0 0 0 0 0 57.63 0.1619 0.24 0.007
kmeans_arma kmeans_arma 6 dataframe 2 0.0007 0 0 0 0 0 0 57.63 0.1619 0.24 0.0071
kmeans_arma kmeans_arma 6 dataframe 3 0.0007 0 0 0 0 0 0 57.63 0.1619 0.24 0.0071
kmeans_arma kmeans_arma 6 dataframe 4 0.0007 0 0 0 0 0 0 57.63 0.1619 0.24 0.0071
kmeans_arma kmeans_arma 6 dataframe 5 0.0008 0 0 0 0 0 0 57.63 0.1619 0.24 0.0072
kmeans_rcpp kmeans_rcpp 4 dataframe 1 0.0135 0 0 0 0 0 0 51.04 0.1741 0.23 0.0062
kmeans_rcpp kmeans_rcpp 4 dataframe 2 0.0181 0 0 0 0 0 0 51.04 0.1741 0.23 0.0062
kmeans_rcpp kmeans_rcpp 4 dataframe 3 0.1401 0 0 0 0 0 0 51.04 0.1741 0.23 0.0063
kmeans_rcpp kmeans_rcpp 4 dataframe 4 0.1434 0 0 0 0 0 0 51.04 0.1741 0.23 0.0064
kmeans_rcpp kmeans_rcpp 4 dataframe 5 0.1499 0 0 0 0 0 0 51.04 0.1741 0.23 0.0066
kmeans_rcpp kmeans_rcpp 5 dataframe 1 0.0216 0 0 0 0 0 0 66.85 0.152 0.19 0.0063
kmeans_rcpp kmeans_rcpp 5 dataframe 2 0.0261 0 0 0 0 0 0 66.85 0.152 0.19 0.0065
kmeans_rcpp kmeans_rcpp 5 dataframe 3 0.1503 0 0 0 0 0 0 66.85 0.152 0.19 0.0065
kmeans_rcpp kmeans_rcpp 5 dataframe 4 0.1607 0 0 0 0 0 0 66.85 0.152 0.19 0.0066
kmeans_rcpp kmeans_rcpp 5 dataframe 5 0.1836 0 0 0 0 0 0 66.85 0.152 0.19 0.0073
kmeans_rcpp kmeans_rcpp 6 dataframe 1 0.0164 0 0 0 0 0 0 74.78 0.1522 0.19 0.0062
kmeans_rcpp kmeans_rcpp 6 dataframe 2 0.021 0 0 0 0 0 0 74.78 0.1522 0.19 0.0063
kmeans_rcpp kmeans_rcpp 6 dataframe 3 0.1445 0 0 0 0 0 0 74.78 0.1522 0.19 0.0064
kmeans_rcpp kmeans_rcpp 6 dataframe 4 0.1492 0 0 0 0 0 0 74.78 0.1522 0.19 0.0064
kmeans_rcpp kmeans_rcpp 6 dataframe 5 0.1503 0 0 0 0 0 0 74.78 0.1522 0.19 0.0072
mini_kmeans mini_kmeans 4 dataframe 1 0.0008 0 0 0 0 0 0 50.35 0.1571 0.21 0.0062
mini_kmeans mini_kmeans 4 dataframe 2 0.0008 0 0 0 0 0 0 50.35 0.1571 0.21 0.0065
mini_kmeans mini_kmeans 4 dataframe 3 0.0008 0 0 0 0 0 0 50.35 0.1571 0.21 0.0065
mini_kmeans mini_kmeans 4 dataframe 4 0.0009 0 0 0 0 0 0 50.35 0.1571 0.21 0.0066
mini_kmeans mini_kmeans 4 dataframe 5 0.0014 0 0 0 0 0 0 50.35 0.1571 0.21 0.0067
mini_kmeans mini_kmeans 5 dataframe 1 0.0008 0 0 0 0 0 0 76.4 0.1216 0.17 0.0066
mini_kmeans mini_kmeans 5 dataframe 2 0.0009 0 0 0 0 0 0 76.4 0.1216 0.17 0.0066
mini_kmeans mini_kmeans 5 dataframe 3 0.001 0 0 0 0 0 0 76.4 0.1216 0.17 0.0069
mini_kmeans mini_kmeans 5 dataframe 4 0.0011 0 0 0 0 0 0 76.4 0.1216 0.17 0.0072
mini_kmeans mini_kmeans 5 dataframe 5 0.0015 0 0 0 0 0 0 76.4 0.1216 0.17 0.0079
mini_kmeans mini_kmeans 6 dataframe 1 0.0009 0 0 0 0 0 0 76.53 0.15 0.17 0.0067
mini_kmeans mini_kmeans 6 dataframe 2 0.0009 0 0 0 0 0 0 76.53 0.15 0.17 0.0071
mini_kmeans mini_kmeans 6 dataframe 3 0.0009 0 0 0 0 0 0 76.53 0.15 0.17 0.0072
mini_kmeans mini_kmeans 6 dataframe 4 0.001 0 0 0 0 0 0 76.53 0.15 0.17 0.0073
mini_kmeans mini_kmeans 6 dataframe 5 0.001 0 0 0 0 0 0 76.53 0.15 0.17 0.0078

This property tells us whether an internal evaluation of the clusters has been performed:

#> [1] TRUE

This property tells us whether an external evaluation of the clusters has been performed:

#> [1] TRUE

Algorithms executed

#> [1] "gmm"         "kmeans_arma" "kmeans_rcpp" "mini_kmeans"

Similarity Metrics

#> [1] "gmm_euclidean" "gmm_manhattan" "kmeans_arma"   "kmeans_rcpp"  
#> [5] "mini_kmeans"

To obtain the ranked variables instead of the metric values, set the variables parameter to TRUE:


df_variable <- Clustering::clustering(df = Clustering::basketball,
                                      packages = c("clusterr"), min = 4, max = 6,
                                      variables = TRUE)

Algorithm Distance Clusters Dataset Ranking timeExternal entropy variation_information precision recall f_measure fowlkes_mallows_index connectivity dunn silhouette timeInternal
gmm gmm_euclidean 4 dataframe 1 5 1 1 1 1 1 1 1 1 1 5
gmm gmm_euclidean 4 dataframe 2 1 2 2 2 2 2 2 2 2 2 4
gmm gmm_euclidean 4 dataframe 3 4 3 3 3 3 3 3 3 3 3 2
gmm gmm_euclidean 4 dataframe 4 2 4 4 4 4 4 4 4 4 4 3
gmm gmm_euclidean 4 dataframe 5 3 5 5 5 5 5 5 5 5 5 1
gmm gmm_euclidean 5 dataframe 1 4 1 1 1 1 1 1 1 1 1 2
gmm gmm_euclidean 5 dataframe 2 1 2 2 2 2 2 2 2 2 2 5
gmm gmm_euclidean 5 dataframe 3 3 3 3 3 3 3 3 3 3 3 4
gmm gmm_euclidean 5 dataframe 4 2 4 4 4 4 4 4 4 4 4 1
gmm gmm_euclidean 5 dataframe 5 5 5 5 5 5 5 5 5 5 5 3
gmm gmm_euclidean 6 dataframe 1 5 1 1 1 1 1 1 1 1 1 4
gmm gmm_euclidean 6 dataframe 2 1 2 2 2 2 2 2 2 2 2 5
gmm gmm_euclidean 6 dataframe 3 3 3 3 3 3 3 3 3 3 3 3
gmm gmm_euclidean 6 dataframe 4 2 4 4 4 4 4 4 4 4 4 1
gmm gmm_euclidean 6 dataframe 5 4 5 5 5 5 5 5 5 5 5 2
gmm gmm_manhattan 4 dataframe 1 4 1 1 1 1 1 1 1 1 1 5
gmm gmm_manhattan 4 dataframe 2 1 2 2 2 2 2 2 2 2 2 1
gmm gmm_manhattan 4 dataframe 3 3 3 3 3 3 3 3 3 3 3 3
gmm gmm_manhattan 4 dataframe 4 2 4 4 4 4 4 4 4 4 4 2
gmm gmm_manhattan 4 dataframe 5 5 5 5 5 5 5 5 5 5 5 4
gmm gmm_manhattan 5 dataframe 1 3 1 1 1 1 1 1 1 1 1 1
gmm gmm_manhattan 5 dataframe 2 1 2 2 2 2 2 2 2 2 2 2
gmm gmm_manhattan 5 dataframe 3 4 3 3 3 3 3 3 3 3 3 3
gmm gmm_manhattan 5 dataframe 4 2 4 4 4 4 4 4 4 4 4 4
gmm gmm_manhattan 5 dataframe 5 5 5 5 5 5 5 5 5 5 5 5
gmm gmm_manhattan 6 dataframe 1 5 1 1 1 1 1 1 1 1 1 4
gmm gmm_manhattan 6 dataframe 2 2 2 2 2 2 2 2 2 2 2 3
gmm gmm_manhattan 6 dataframe 3 4 3 3 3 3 3 3 3 3 3 1
gmm gmm_manhattan 6 dataframe 4 1 4 4 4 4 4 4 4 4 4 5
gmm gmm_manhattan 6 dataframe 5 3 5 5 5 5 5 5 5 5 5 2
kmeans_arma kmeans_arma 4 dataframe 1 1 1 1 1 1 1 1 1 1 1 5
kmeans_arma kmeans_arma 4 dataframe 2 3 2 2 2 2 2 2 2 2 2 2
kmeans_arma kmeans_arma 4 dataframe 3 2 3 3 3 3 3 3 3 3 3 3
kmeans_arma kmeans_arma 4 dataframe 4 5 4 4 4 4 4 4 4 4 4 1
kmeans_arma kmeans_arma 4 dataframe 5 4 5 5 5 5 5 5 5 5 5 4
kmeans_arma kmeans_arma 5 dataframe 1 5 1 1 1 1 1 1 1 1 1 4
kmeans_arma kmeans_arma 5 dataframe 2 1 2 2 2 2 2 2 2 2 2 2
kmeans_arma kmeans_arma 5 dataframe 3 4 3 3 3 3 3 3 3 3 3 5
kmeans_arma kmeans_arma 5 dataframe 4 2 4 4 4 4 4 4 4 4 4 1
kmeans_arma kmeans_arma 5 dataframe 5 3 5 5 5 5 5 5 5 5 5 3
kmeans_arma kmeans_arma 6 dataframe 1 3 1 1 1 1 1 1 1 1 1 2
kmeans_arma kmeans_arma 6 dataframe 2 5 2 2 2 2 2 2 2 2 2 4
kmeans_arma kmeans_arma 6 dataframe 3 2 3 3 3 3 3 3 3 3 3 3
kmeans_arma kmeans_arma 6 dataframe 4 4 4 4 4 4 4 4 4 4 4 5
kmeans_arma kmeans_arma 6 dataframe 5 1 5 5 5 5 5 5 5 5 5 1
kmeans_rcpp kmeans_rcpp 4 dataframe 1 3 1 1 1 1 1 1 1 1 1 4
kmeans_rcpp kmeans_rcpp 4 dataframe 2 1 2 2 2 2 2 2 2 2 2 3
kmeans_rcpp kmeans_rcpp 4 dataframe 3 4 3 3 3 3 3 3 3 3 3 5
kmeans_rcpp kmeans_rcpp 4 dataframe 4 2 4 4 4 4 4 4 4 4 4 2
kmeans_rcpp kmeans_rcpp 4 dataframe 5 5 5 5 5 5 5 5 5 5 5 1
kmeans_rcpp kmeans_rcpp 5 dataframe 1 3 1 1 1 1 1 1 1 1 1 1
kmeans_rcpp kmeans_rcpp 5 dataframe 2 1 2 2 2 2 2 2 2 2 2 4
kmeans_rcpp kmeans_rcpp 5 dataframe 3 4 3 3 3 3 3 3 3 3 3 5
kmeans_rcpp kmeans_rcpp 5 dataframe 4 2 4 4 4 4 4 4 4 4 4 2
kmeans_rcpp kmeans_rcpp 5 dataframe 5 5 5 5 5 5 5 5 5 5 5 3
kmeans_rcpp kmeans_rcpp 6 dataframe 1 3 1 1 1 1 1 1 1 1 1 1
kmeans_rcpp kmeans_rcpp 6 dataframe 2 1 2 2 2 2 2 2 2 2 2 5
kmeans_rcpp kmeans_rcpp 6 dataframe 3 4 3 3 3 3 3 3 3 3 3 2
kmeans_rcpp kmeans_rcpp 6 dataframe 4 2 4 4 4 4 4 4 4 4 4 4
kmeans_rcpp kmeans_rcpp 6 dataframe 5 5 5 5 5 5 5 5 5 5 5 3
mini_kmeans mini_kmeans 4 dataframe 1 3 1 1 1 1 1 1 1 1 1 1
mini_kmeans mini_kmeans 4 dataframe 2 4 2 2 2 2 2 2 2 2 2 2
mini_kmeans mini_kmeans 4 dataframe 3 2 3 3 3 3 3 3 3 3 3 5
mini_kmeans mini_kmeans 4 dataframe 4 5 4 4 4 4 4 4 4 4 4 4
mini_kmeans mini_kmeans 4 dataframe 5 1 5 5 5 5 5 5 5 5 5 3
mini_kmeans mini_kmeans 5 dataframe 1 2 1 1 1 1 1 1 1 1 1 5
mini_kmeans mini_kmeans 5 dataframe 2 5 2 2 2 2 2 2 2 2 2 2
mini_kmeans mini_kmeans 5 dataframe 3 4 3 3 3 3 3 3 3 3 3 1
mini_kmeans mini_kmeans 5 dataframe 4 3 4 4 4 4 4 4 4 4 4 3
mini_kmeans mini_kmeans 5 dataframe 5 1 5 5 5 5 5 5 5 5 5 4
mini_kmeans mini_kmeans 6 dataframe 1 2 1 1 1 1 1 1 1 1 1 1
mini_kmeans mini_kmeans 6 dataframe 2 3 2 2 2 2 2 2 2 2 2 5
mini_kmeans mini_kmeans 6 dataframe 3 4 3 3 3 3 3 3 3 3 3 3
mini_kmeans mini_kmeans 6 dataframe 4 1 4 4 4 4 4 4 4 4 4 2
mini_kmeans mini_kmeans 6 dataframe 5 5 5 5 5 5 5 5 5 5 5 4

To obtain only the best-ranked variables or values for the external metrics, call the following method:


df_best_ranked_external <- Clustering::best_ranked_external_metrics(df$result)

Algorithm Distance Clusters Dataset Ranking timeExternal entropy variation_information precision recall f_measure fowlkes_mallows_index
gmm gmm_euclidean 4 dataframe 1 0.0218 0 0 0 0 0 0
gmm gmm_euclidean 5 dataframe 1 0.0214 0 0 0 0 0 0
gmm gmm_euclidean 6 dataframe 1 0.0245 0 0 0 0 0 0
gmm gmm_manhattan 4 dataframe 1 0.0151 0 0 0 0 0 0
gmm gmm_manhattan 5 dataframe 1 0.0189 0 0 0 0 0 0
gmm gmm_manhattan 6 dataframe 1 0.0221 0 0 0 0 0 0
kmeans_arma kmeans_arma 4 dataframe 1 0.0006 0 0 0 0 0 0
kmeans_arma kmeans_arma 5 dataframe 1 0.0007 0 0 0 0 0 0
kmeans_arma kmeans_arma 6 dataframe 1 0.0007 0 0 0 0 0 0
kmeans_rcpp kmeans_rcpp 4 dataframe 1 0.0135 0 0 0 0 0 0
kmeans_rcpp kmeans_rcpp 5 dataframe 1 0.0216 0 0 0 0 0 0
kmeans_rcpp kmeans_rcpp 6 dataframe 1 0.0164 0 0 0 0 0 0
mini_kmeans mini_kmeans 4 dataframe 1 0.0008 0 0 0 0 0 0
mini_kmeans mini_kmeans 5 dataframe 1 0.0008 0 0 0 0 0 0
mini_kmeans mini_kmeans 6 dataframe 1 0.0009 0 0 0 0 0 0
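
The rows returned above all have Ranking equal to 1. A minimal base R sketch of that kind of filter, on a toy data frame whose values are copied from the first result table, might look as follows. It is only an illustration and makes no claim about how best_ranked_external_metrics is implemented.

```r
# Toy subset of the result table (values copied from the output shown earlier).
result <- data.frame(
  Algorithm    = c("gmm", "gmm", "kmeans_arma", "kmeans_arma"),
  Distance     = c("gmm_euclidean", "gmm_euclidean", "kmeans_arma", "kmeans_arma"),
  Clusters     = c(4, 4, 4, 4),
  Ranking      = c(1, 2, 1, 2),
  timeExternal = c(0.0218, 0.0242, 0.0006, 0.0007),
  stringsAsFactors = FALSE
)

# Keep only the top-ranked row of each algorithm / cluster-count combination.
top_ranked <- result[result$Ranking == 1, ]
print(top_ranked)
```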

Likewise, we can obtain the best-ranked values for the internal evaluation:


df_best_ranked_internal <- Clustering::best_ranked_internal_metrics(df$result)

Algorithm Distance Clusters Dataset Ranking timeInternal connectivity dunn silhouette
gmm gmm_euclidean 4 dataframe 1 0.0064 34.09 0.1646 0.23
gmm gmm_euclidean 5 dataframe 1 0.0064 42.08 0.1619 0.25
gmm gmm_euclidean 6 dataframe 1 0.0065 51.46 0.1619 0.23
gmm gmm_manhattan 4 dataframe 1 0.0064 35.59 0.1348 0.23
gmm gmm_manhattan 5 dataframe 1 0.0064 46.83 0.1322 0.26
gmm gmm_manhattan 6 dataframe 1 0.0064 54.87 0.1467 0.25
kmeans_arma kmeans_arma 4 dataframe 1 0.0065 44.21 0.1495 0.23
kmeans_arma kmeans_arma 5 dataframe 1 0.0065 49.22 0.1538 0.26
kmeans_arma kmeans_arma 6 dataframe 1 0.007 57.63 0.1619 0.24
kmeans_rcpp kmeans_rcpp 4 dataframe 1 0.0062 51.04 0.1741 0.23
kmeans_rcpp kmeans_rcpp 5 dataframe 1 0.0063 66.85 0.152 0.19
kmeans_rcpp kmeans_rcpp 6 dataframe 1 0.0062 74.78 0.1522 0.19
mini_kmeans mini_kmeans 4 dataframe 1 0.0062 50.35 0.1571 0.21
mini_kmeans mini_kmeans 5 dataframe 1 0.0066 76.4 0.1216 0.17
mini_kmeans mini_kmeans 6 dataframe 1 0.0067 76.53 0.15 0.17

To obtain the best evaluation for each algorithm:


df_best_validation_external <- Clustering::evaluate_best_validation_external_by_metrics(df$result)

Algorithm Distance timeExternal entropy variation_information precision recall f_measure fowlkes_mallows_index
gmm gmm_euclidean 0.0245 0 0 0 0 0 0
gmm gmm_manhattan 0.0221 0 0 0 0 0 0
kmeans_arma kmeans_arma 0.0007 0 0 0 0 0 0
kmeans_rcpp kmeans_rcpp 0.0216 0 0 0 0 0 0
mini_kmeans mini_kmeans 0.0009 0 0 0 0 0 0

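Based on these results, the gmm algorithm is selected here as the one that behaves best. In this run every external metric is 0 for all algorithms, so the rows differ only in timeExternal.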

From the algorithm with the best rating we can select the most appropriate number of clusters.


df_result_external <- Clustering::result_external_algorithm_by_metric(df$result, "gmm")

Algorithm Clusters timeExternal entropy variation_information precision recall f_measure fowlkes_mallows_index
gmm 4 0.0218 0 0 0 0 0 0
gmm 5 0.0214 0 0 0 0 0 0
gmm 6 0.0245 0 0 0 0 0 0

The same checks performed for the external evaluation metrics can also be performed for the internal evaluation.


df_best_validation_internal <-
  Clustering::evaluate_best_validation_internal_by_metrics(df$result)

Algorithm Distance timeInternal connectivity dunn silhouette
gmm gmm_euclidean 0.0065 51.46 0.1646 0.25
gmm gmm_manhattan 0.0064 54.87 0.1467 0.26
kmeans_arma kmeans_arma 0.007 57.63 0.1619 0.26
kmeans_rcpp kmeans_rcpp 0.0063 74.78 0.1741 0.23
mini_kmeans mini_kmeans 0.0067 76.53 0.1571 0.21

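In this case, the algorithm chosen depends on which evaluation metric we prioritize. For example, kmeans_rcpp has the highest dunn value (0.1741), while gmm with the Manhattan distance and kmeans_arma share the highest silhouette value (0.26).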
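
In the internal summary above, each value matches the largest value of that column across the 4, 5 and 6 cluster rows of the best-ranked internal table. The base R sketch below reproduces that observation for two of the algorithms with aggregate(). It is an assumption about how to read the printed output, not a description of the package internals.

```r
# Best-ranked internal metrics for two algorithms (values copied from the table above).
best_internal <- data.frame(
  Algorithm    = rep(c("gmm", "kmeans_rcpp"), each = 3),
  Distance     = rep(c("gmm_euclidean", "kmeans_rcpp"), each = 3),
  Clusters     = rep(4:6, times = 2),
  connectivity = c(34.09, 42.08, 51.46, 51.04, 66.85, 74.78),
  dunn         = c(0.1646, 0.1619, 0.1619, 0.1741, 0.152, 0.1522),
  silhouette   = c(0.23, 0.25, 0.23, 0.23, 0.19, 0.19),
  stringsAsFactors = FALSE
)

# Column-wise maximum over the cluster counts, per algorithm and distance.
per_algorithm <- aggregate(
  cbind(connectivity, dunn, silhouette) ~ Algorithm + Distance,
  data = best_internal,
  FUN = max
)
print(per_algorithm)
```

Because each column is summarized independently, the values in one summary row can come from different numbers of clusters.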

To plot any metric as a function of the number of clusters and the algorithm, use the plotting function that matches the type of evaluation metric (internal or external):


Clustering::plot_external_validation(df,"variation_information")