The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
tidyclust provides a unified, tidy interface to clustering models, following the same design patterns as parsnip. It lets you swap clustering algorithms by changing a single line, and integrates seamlessly with the rest of the tidymodels ecosystem (recipes, workflows, tune).
Every tidyclust analysis follows the same four steps:
set.seed(1234)
kmeans_fit <- fit(kmeans_spec, ~., data = mtcars)
kmeans_fit
#> tidyclust cluster object
#>
#> K-means clustering with 3 clusters of sizes 7, 11, 14
#>
#> Cluster means:
#> mpg cyl disp hp drat wt qsec vs
#> 1 19.74286 6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
#> 3 26.66364 4 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909
#> 2 15.10000 8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
#> am gear carb
#> 1 0.4285714 3.857143 3.428571
#> 3 0.7272727 4.090909 1.545455
#> 2 0.1428571 3.285714 3.500000
#>
#> Clustering vector:
#> Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
#> 1 1 2 1
#> Hornet Sportabout Valiant Duster 360 Merc 240D
#> 3 1 3 2
#> Merc 230 Merc 280 Merc 280C Merc 450SE
#> 2 1 1 3
#> Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> 3 3 3 3
#> Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
#> 3 2 2 2
#> Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
#> 2 3 3 3
#> Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
#> 3 2 2 2
#> Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> 3 1 3 2
#>
#> Within cluster sum of squares by cluster:
#> [1] 13954.34 11848.37 93643.90
#> (between_SS / total_SS = 80.8 %)
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
#> [6] "betweenss" "size" "iter" "ifault"extract_cluster_assignment() returns the cluster label
for each training observation:
extract_cluster_assignment(kmeans_fit)
#> # A tibble: 32 × 1
#> .cluster
#> <fct>
#> 1 Cluster_1
#> 2 Cluster_1
#> 3 Cluster_2
#> 4 Cluster_1
#> 5 Cluster_3
#> 6 Cluster_1
#> 7 Cluster_3
#> 8 Cluster_2
#> 9 Cluster_2
#> 10 Cluster_1
#> # ℹ 22 more rowsextract_centroids() returns the location (mean) of each
cluster:
extract_centroids(kmeans_fit)
#> # A tibble: 3 × 12
#> .cluster mpg cyl disp hp drat wt qsec vs am gear carb
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Cluster_1 19.7 6 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43
#> 2 Cluster_2 26.7 4 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
#> 3 Cluster_3 15.1 8 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5predict() assigns new observations to clusters:
predict(kmeans_fit, new_data = mtcars[1:5, ])
#> # A tibble: 5 × 1
#> .pred_cluster
#> <fct>
#> 1 Cluster_1
#> 2 Cluster_1
#> 3 Cluster_2
#> 4 Cluster_1
#> 5 Cluster_3augment() appends the cluster assignment to the original
data:
augment(kmeans_fit, new_data = mtcars)
#> # A tibble: 32 × 12
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
#> # ℹ 1 more variable: .pred_cluster <fct>tidyclust provides several cluster quality metrics:
sse_within_total(kmeans_fit, mtcars)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 sse_within_total standard 119447.
sse_ratio(kmeans_fit, mtcars)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 sse_ratio standard 0.192
silhouette_avg(kmeans_fit, mtcars)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 silhouette_avg standard 0.439Lower sse_within_total() and sse_ratio()
indicate tighter clusters. Higher silhouette_avg() (maximum
1) indicates better-separated clusters.
The same workflow applies to hier_clust(). The number of
clusters is cut from the dendrogram at fit time using
num_clusters:
hclust_spec <- hier_clust(num_clusters = 3) |>
set_engine("stats")
hclust_fit <- fit(hclust_spec, ~., data = mtcars)
extract_cluster_assignment(hclust_fit)
#> # A tibble: 32 × 1
#> .cluster
#> <fct>
#> 1 Cluster_1
#> 2 Cluster_1
#> 3 Cluster_1
#> 4 Cluster_2
#> 5 Cluster_3
#> 6 Cluster_2
#> 7 Cluster_3
#> 8 Cluster_1
#> 9 Cluster_1
#> 10 Cluster_1
#> # ℹ 22 more rows
extract_centroids(hclust_fit)
#> # A tibble: 3 × 12
#> .cluster mpg cyl disp hp drat wt qsec vs am gear carb
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Cluster_1 24.5 4.62 122. 96.9 4.00 2.52 18.5 0.75 0.688 4.12 2.44
#> 2 Cluster_2 17.0 7.43 276. 151. 2.99 3.60 18.1 0.286 0 3 2.14
#> 3 Cluster_3 14.6 8 388. 232. 3.34 4.16 16.4 0 0.222 3.44 4tidyclust works with the broader tidymodels ecosystem. For example, you can preprocess data with a recipe and bundle it with a model in a workflow:
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(workflows)
rec <- recipe(~., data = mtcars) |>
step_normalize(all_predictors())
wf <- workflow() |>
add_recipe(rec) |>
add_model(k_means(num_clusters = 3))
wf_fit <- fit(wf, data = mtcars)
augment(wf_fit, new_data = mtcars)
#> # A tibble: 32 × 12
#> .pred_cluster mpg cyl disp hp drat wt qsec vs am gear
#> * <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Cluster_1 21 6 160 110 3.9 2.62 16.5 0 1 4
#> 2 Cluster_1 21 6 160 110 3.9 2.88 17.0 0 1 4
#> 3 Cluster_1 22.8 4 108 93 3.85 2.32 18.6 1 1 4
#> 4 Cluster_2 21.4 6 258 110 3.08 3.22 19.4 1 0 3
#> 5 Cluster_3 18.7 8 360 175 3.15 3.44 17.0 0 0 3
#> 6 Cluster_2 18.1 6 225 105 2.76 3.46 20.2 1 0 3
#> 7 Cluster_3 14.3 8 360 245 3.21 3.57 15.8 0 0 3
#> 8 Cluster_2 24.4 4 147. 62 3.69 3.19 20 1 0 4
#> 9 Cluster_2 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
#> 10 Cluster_2 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
#> # ℹ 22 more rows
#> # ℹ 1 more variable: carb <dbl>vignette("tuning_and_metrics", package = "tidyclust").vignette("k_means", package = "tidyclust").vignette("hier_clust", package = "tidyclust").These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.