A modular toolkit for clustering time series data and detecting anomalies using classical, wavelet-based, Hilbert-based, and circular feature extraction methods. It supports DBSCAN and OPTICS clustering with consistent output formats, and provides a comparison function that lets users compare multiple feature/algorithm combinations in a single call.
We use the bundled power_consumption dataset, recorded
at 10-minute intervals across three urban zones in Tetouan, Morocco.
Each row is a single time point. The last three columns are the zone-wise power consumption signals; the rest are weather variables we will ignore in this example.
dim(power_consumption)
#> [1] 13906 9
head(power_consumption, 3)
#> Datetime Temperature Humidity WindSpeed GeneralDiffuseFlows DiffuseFlows
#> 1 1/1/2017 0:00 6.559 73.8 0.083 0.051 0.119
#> 2 1/1/2017 0:10 6.414 74.5 0.083 0.070 0.085
#> 3 1/1/2017 0:20 6.313 74.5 0.080 0.062 0.100
#> PowerConsumption_Zone1 PowerConsumption_Zone2 PowerConsumption_Zone3
#> 1 34055.70 16128.88 20240.96
#> 2 29814.68 19375.08 20131.08
#> 3 29128.10 19006.69 19668.43
For this walkthrough we will work with a 1000-row slice to keep everything fast. The exact same code runs on the full dataset; it just takes longer.
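The slicing step can be sketched in base R. Because the bundled dataset is only available with the package loaded, this sketch builds a synthetic stand-in that shares only the three zone column names with the real data. The name demo_data, the random values, and the choice of the first 1000 rows are assumptions for illustration, not the package's own code.

```r
# Synthetic stand-in for power_consumption (assumption: values are random,
# only the three zone column names match the real dataset).
set.seed(1)
n <- 1500
demo_data <- data.frame(
  Datetime = seq_len(n),
  PowerConsumption_Zone1 = rnorm(n, mean = 30000, sd = 3000),
  PowerConsumption_Zone2 = rnorm(n, mean = 18000, sd = 2000),
  PowerConsumption_Zone3 = rnorm(n, mean = 20000, sd = 2500)
)

# Keep the first 1000 rows and only the zone columns, as a numeric matrix.
zone_cols <- grep("^PowerConsumption_Zone", names(demo_data), value = TRUE)
zones_matrix <- as.matrix(demo_data[1:1000, zone_cols])

dim(zones_matrix)
```

With the real data, the same two lines at the end would be applied to power_consumption instead of demo_data.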
DBSCAN expects a 2D matrix where each row is one observation. We flatten zones_matrix (the 1000-row slice, one column per zone) into a single long vector and attach a zone identifier to each value.
flat <- flatten_with_zones(zones_matrix)
length(flat$values)
#> [1] 3000
table(flat$zones)
#>
#> 1 2 3
#> 1000 1000 1000
After this step we have a single long vector with 3000 values and a matching zones vector of identifiers.
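To make the shape of this output concrete, here is a minimal base R sketch of the same idea. The helper flatten_sketch is hypothetical and is not the package's implementation; it assumes one column per zone and column-by-column flattening.

```r
# Hypothetical base R equivalent of the flattening step (not the package's code).
flatten_sketch <- function(m) {
  m <- as.matrix(m)
  list(
    # as.vector() reads a matrix column by column, so zone 1 comes first.
    values = as.vector(m),
    # One zone id per value, repeated for every row of that column.
    zones = rep(seq_len(ncol(m)), each = nrow(m))
  )
}

# Toy input: 3 time points, 2 zones.
toy <- matrix(c(10, 11, 12, 20, 21, 22), nrow = 3)
flat_toy <- flatten_sketch(toy)
flat_toy$values
flat_toy$zones
```

Keeping the zone ids in a separate vector means the values can be clustered together while each point can still be traced back to its zone.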
Each observation needs more than a single value to be informative. We compute the rolling mean and standard deviation over a 10-point window.
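As a rough illustration of what rolling statistics look like, the sketch below computes them in base R for a single series. The helper rolling_sketch is hypothetical; its centered window and truncation at the edges are assumptions, and the package's rolling_stats may align and pad its windows differently.

```r
# Rolling mean and sd over a centered window, truncated at the series edges.
# Assumption: window alignment and edge handling may differ from rolling_stats().
rolling_sketch <- function(x, window = 10) {
  n <- length(x)
  half <- window %/% 2
  idx <- seq_len(n)
  starts <- pmax(1, idx - half)
  ends <- pmin(n, idx + window - half - 1)
  list(
    mean = vapply(idx, function(i) mean(x[starts[i]:ends[i]]), numeric(1)),
    sd   = vapply(idx, function(i) sd(x[starts[i]:ends[i]]), numeric(1))
  )
}

x <- c(1, 2, 3, 4, 5, 6)
r <- rolling_sketch(x, window = 3)
r$mean
r$sd
```

For a matrix with one column per zone, the same helper could be applied column by column with apply().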
rolling_stats returns a list of matrices. We flatten
each to align with our long-format values.
raw_features <- cbind(
zone = flat$zones,
value = flat$values,
mavg = as.vector(rolling$mean),
sd = as.vector(rolling$sd)
)
head(raw_features, 3)
#> zone value mavg sd
#> [1,] 1 34055.70 29712.61 2601.235
#> [2,] 1 29814.68 29197.97 2646.171
#> [3,] 1 29128.10 28740.98 2701.318
The first column is the zone identifier; it is metadata, not a feature. We will exclude it from clustering and normalization.
DBSCAN is distance-based, so feature scales matter.
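One common way to put features on a comparable scale is z-score standardization with base R's scale(). The sketch below is an assumption about the kind of normalization meant here; the package may provide its own normalizer or use a different scheme such as min-max scaling. The toy feature values are made up.

```r
# Toy feature matrix with a metadata column (zone) and two features on
# very different scales.
features <- cbind(
  zone  = c(1, 1, 2, 2),
  value = c(30000, 31000, 18000, 19000),
  sd    = c(2.5, 3.0, 1.0, 1.5)
)

# Standardize only the feature columns; the zone id is left untouched.
feature_cols <- setdiff(colnames(features), "zone")
scaled <- scale(features[, feature_cols])

round(colMeans(scaled), 10)   # each column is centered on zero
apply(scaled, 2, sd)          # and has unit standard deviation
```

Without this step, the Euclidean distance would be dominated by the column with the largest numeric range.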
DBSCAN needs an eps parameter: the neighborhood radius.
The k-distance plot is the standard visual heuristic. We look for an
elbow in the sorted distances curve.
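The curve itself is easy to compute by hand. The following base R sketch uses a hypothetical helper, k_distances, and synthetic points; it is not the package's plotting function, and the choice of k = 4 is only an example.

```r
# Sorted distance from each point to its k-th nearest neighbor.
k_distances <- function(x, k) {
  d <- as.matrix(dist(x))          # full pairwise Euclidean distances
  # Each row sorted ascending: position 1 is the point itself (distance 0),
  # so the k-th neighbor sits at position k + 1.
  kth <- apply(d, 1, function(row) sort(row)[k + 1])
  sort(unname(kth))
}

set.seed(42)
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),  # dense cluster
  matrix(runif(10, min = 2, max = 4), ncol = 2)      # a few scattered points
)
kd <- k_distances(pts, k = 4)

# In an interactive session, look for the elbow in this curve.
if (interactive()) {
  plot(kd, type = "l", xlab = "Points sorted by distance",
       ylab = "4-NN distance")
}
```

Points in dense regions sit on the flat part of the curve, and the distance at which the curve bends upward is a candidate for eps. The full distance matrix grows quadratically with the number of points, so this sketch is only practical for small inputs.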
result <- run_dbscan(raw_features[, 2:4],
eps = 0.3,
min_pts = 7)
result$n_clusters
#> [1] 3
result$n_noise
#> [1] 63
The result is a list with a standardized structure.
The Davies-Bouldin Index summarizes how compact and separated the clusters are. Lower values are better.
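For readers who want to see what the index measures, here is a minimal base R sketch of the standard Davies-Bouldin definition: for every cluster, take the worst ratio of combined within-cluster scatter to centroid separation, then average over clusters. The function davies_bouldin is hypothetical, and treating label 0 as noise is an assumption rather than something taken from the package.

```r
davies_bouldin <- function(x, labels) {
  x <- as.matrix(x)
  keep <- labels != 0            # assumption: label 0 marks noise points
  x <- x[keep, , drop = FALSE]
  labels <- labels[keep]
  ids <- sort(unique(labels))
  k <- length(ids)
  if (k < 2) return(NA_real_)    # the index needs at least two clusters

  # Centroid of each cluster (one row per cluster).
  centroids <- do.call(rbind, lapply(ids, function(id) {
    colMeans(x[labels == id, , drop = FALSE])
  }))

  # Scatter: mean Euclidean distance from each member to its centroid.
  scatter <- vapply(seq_len(k), function(i) {
    members <- x[labels == ids[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(members, 2, centroids[i, ])^2)))
  }, numeric(1))

  sep <- as.matrix(dist(centroids))   # centroid-to-centroid distances

  # For each cluster, the largest similarity ratio to any other cluster.
  worst <- vapply(seq_len(k), function(i) {
    max((scatter[i] + scatter[-i]) / sep[i, -i])
  }, numeric(1))
  mean(worst)
}

# Two tight clusters 10 units apart, plus one noise point (label 0).
x <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2), c(50, 50))
labels <- c(1, 1, 2, 2, 0)
davies_bouldin(x, labels)
#> [1] 0.2
```

Each toy cluster has a scatter of 1 and the centroids are 10 apart, so the ratio is (1 + 1) / 10 = 0.2 for both clusters.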
We can visualize the partition by projecting onto the first two principal components and coloring by cluster.
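A projection of this kind can be sketched with base R's prcomp(). The feature matrix and cluster labels below are synthetic stand-ins, and the coloring scheme is an assumption; the package may ship its own plotting helper.

```r
set.seed(7)
feats <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 3),
  matrix(rnorm(60, mean = 5), ncol = 3)
)
cluster <- rep(c(1, 2), each = 20)   # stand-in labels; 0 would mean noise

# Center and scale the features, then keep the first two components.
pca <- prcomp(feats, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]               # coordinates on PC1 and PC2

# Color by cluster; adding 1 maps a noise label of 0 to black.
if (interactive()) {
  plot(scores, col = cluster + 1, pch = 19,
       xlab = "PC1", ylab = "PC2", main = "Clusters in PCA space")
}
```

The projection is only a viewing aid: clusters that overlap in the first two components may still be well separated in the full feature space.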
For function-level reference, see the help pages,
e.g. ?run_dbscan.