Comparing Feature Engineering Approaches

All combinations

This article walks through the compare_methods() function, which runs every combination of feature method and clustering method, then unpacks what the results mean.

We use the bundled steel_industry dataset: one full year of 15-minute energy measurements from a Korean steel plant, including reactive power, power factor, CO2 emissions, and time-of-day indicators.

library(cyclicwave)
data(steel_industry)

Preparing the data

Three preprocessing steps:

Thinning: keep every 10th row, cutting the 35,040 samples down to 3,504. The clustering runs faster and, because consecutive 15-minute readings tend to be very similar, is less dominated by near-duplicate points.
Select numeric columns: discard date and categorical columns.
Z-score normalization: puts every feature on the same scale, so that no single column dominates a distance-based method.

data_thin    <- thin_data(steel_industry, step = 10)
numeric_data <- select_numeric_columns(data_thin)
data_scaled  <- normalize_features(numeric_data, method = "zscore")

dim(data_scaled)
#> [1] 3504    7
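The three steps can be approximated in base R. This is only a sketch on a hypothetical toy data frame (toy, with made-up columns), and it assumes that thinning starts from the first row; the package helpers may behave differently in detail.

```r
# Hypothetical stand-in data; this is NOT the steel_industry dataset.
set.seed(1)
toy <- data.frame(
  timestamp = as.character(seq_len(100)),  # non-numeric column, to be dropped
  usage     = rnorm(100, mean = 50, sd = 10),
  co2       = runif(100)
)

# Thinning: keep every 10th row (assumption: counting from row 1).
toy_thin <- toy[seq(1, nrow(toy), by = 10), ]

# Select numeric columns only.
toy_num <- toy_thin[, sapply(toy_thin, is.numeric), drop = FALSE]

# Z-score normalization: subtract each column mean, divide by each column sd.
toy_scaled <- scale(toy_num)

dim(toy_scaled)
```

After scaling, every column has mean 0 and standard deviation 1, which is what puts the features on a comparable footing for distance calculations.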

Ground-truth labels

steel_industry doesn’t ship with explicit class labels, but Usage_kWh gives us natural ones: low, medium, and high consumption regimes, defined by tertile cutoffs. We use these as a yardstick for evaluating how meaningful each clustering result is.

true_labels <- label_by_quantile(data_thin$Usage_kWh,
                                 probs = c(1/3, 2/3))
table(true_labels)
#> true_labels
#>    1    2    3 
#> 1184 1152 1168

Each class has roughly N/3 ≈ 1,168 observations.
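For intuition, tertile labelling can be sketched with base R's quantile() and cut(). The vector x below is a hypothetical stand-in for Usage_kWh, and the handling of values that fall exactly on a cutoff is an assumption; label_by_quantile() may treat boundaries differently.

```r
# Hypothetical stand-in for a skewed consumption variable.
set.seed(42)
x <- rexp(300)

# Tertile cutoffs at the 1/3 and 2/3 quantiles.
cutoffs <- quantile(x, probs = c(1/3, 2/3))

# Integer codes 1 (low), 2 (medium), 3 (high); intervals are closed on the right.
labels <- cut(x, breaks = c(-Inf, cutoffs, Inf), labels = FALSE)

table(labels)
```

With real measurements, tied values at a cutoff all land in the same class, which is one reason the class sizes can end up only approximately equal.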

Defining the feature methods

compare_methods() takes a named list of feature extractors. Each is just a function that takes the data matrix passed to compare_methods() and returns a numeric feature matrix with one row per observation.

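feature_methods <- list(
  # Baseline: the first three principal components.
  pca_only = function(d) {
    # The input is already z-scored, so skip centering and scaling.
    pca <- prcomp(d, center = FALSE, scale. = FALSE)
    pca$x[, 1:3]
  },
  # The same three components, plus circular features derived from the phase.
  pca_circular = function(d) {
    pca <- prcomp(d, center = FALSE, scale. = FALSE)
    phase <- compute_phase(d, axis = "feature")
    circ <- extract_circular_features(phase)
    cbind(pca$x[, 1:3], circ)
  }
)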
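Before handing a new extractor to compare_methods(), it can help to check that it honours this contract. The sketch below uses only base R; pca_two and toy are hypothetical names introduced for illustration, and the checks reflect the contract as described here rather than any validation the package performs.

```r
# Hypothetical extractor: first two principal components of pre-scaled data.
pca_two <- function(d) {
  prcomp(d, center = FALSE, scale. = FALSE)$x[, 1:2]
}

# Hypothetical toy input: 50 observations, 4 z-scored features.
set.seed(7)
toy <- scale(matrix(rnorm(200), nrow = 50, ncol = 4))

feats <- pca_two(toy)

# Contract: a numeric matrix with one row per observation and no missing values.
stopifnot(
  is.matrix(feats),
  is.numeric(feats),
  nrow(feats) == nrow(toy),
  !anyNA(feats)
)
```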

Defining the clustering methods

We try DBSCAN with two different parameter settings: one with a larger neighborhood radius (loose) and one with a smaller one (tight). This is a parameter sweep disguised as a method comparison.

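cluster_methods <- list(
  # Loose: larger neighborhood radius, more neighbors required.
  dbscan_loose = list(fn = run_dbscan, params = list(eps = 0.5, min_pts = 8)),
  # Tight: smaller neighborhood radius, fewer neighbors required.
  dbscan_tight = list(fn = run_dbscan, params = list(eps = 0.3, min_pts = 5))
)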
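To see how the two parameters interact, recall that DBSCAN grows clusters from core points: points with at least min_pts neighbors within distance eps. The base-R sketch below counts core points on hypothetical toy data. It assumes the common convention that a point counts as its own neighbor; run_dbscan() may count differently, and count_core is a helper invented for this illustration.

```r
# Hypothetical toy data: 100 points in two dimensions.
set.seed(3)
x <- matrix(rnorm(200), ncol = 2)

# Full pairwise Euclidean distance matrix.
d <- as.matrix(dist(x))

# Number of core points for a given (eps, min_pts) setting.
# Assumption: each point is included in its own neighborhood.
count_core <- function(eps, min_pts) {
  sum(rowSums(d <= eps) >= min_pts)
}

count_core(eps = 0.5, min_pts = 8)  # loose radius, stricter count
count_core(eps = 0.3, min_pts = 5)  # tight radius, looser count
```

Growing eps with min_pts held fixed can only add core points, while raising min_pts with eps held fixed can only remove them. Because the two settings above change both parameters at once, neither is guaranteed to yield more clusters or less noise than the other.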

One call to rule them all

compare_methods() runs every combination, evaluates each with the requested metrics, and returns a single comparison table.

comparison <- compare_methods(
  data            = data_scaled,
  feature_methods = feature_methods,
  cluster_methods = cluster_methods,
  metrics         = c("dbi", "accuracy", "n_clusters", "n_noise"),
  true_labels     = true_labels,
  normalize       = NULL,
  verbose         = FALSE
)

print(comparison)
#>   feature_method cluster_method       dbi  accuracy n_clusters n_noise
#> 1       pca_only   dbscan_loose 0.9056772 0.5907534          4      15
#> 2       pca_only   dbscan_tight 0.6312147 0.7285959         13      56
#> 3   pca_circular   dbscan_loose 0.5957794 0.7619863         14     126
#> 4   pca_circular   dbscan_tight 0.8596812 0.7796804         38     254

The comparison table has four rows, one per feature/clustering combination, with four metrics each. Now to read it.
