The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Currently sparklyr.flint
supports a number of commonly used summarizers (e.g., count, sum, average, etc) that are implemented in the Flint time series library. Each summarizer can be either applied to a moving time window (e.g., in_past(5s)
) or groups of rows within a TimeSeriesRDD
having the same timestamps (which is known as a “cycle” in Flint nomenclature).
The following is a quick example of applying the sum summarizer to a moving time window:
library(sparklyr)
library(sparklyr.flint)
# Step 0: decide which Spark version to use, how to connect to Spark, etc
<- "3.0.0"
spark_version <- spark_connect(master = "local", version = spark_version)
sc
<- data.frame(
example_time_series t = c(1, 3, 4, 6, 7, 10, 15, 16, 18, 19),
v = c(4, -2, NA, 5, NA, 1, -4, 5, NA, 3)
)
# Step 1: import example time series data into a Spark dataframe
<- copy_to(sc, example_time_series, overwrite = TRUE)
sdf
# Step 2: specify how the Spark dataframe should be interpreted as a time series by Flint
<- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
ts_rdd
# Step 3: apply a Flint summarizer to the time series above
<- summarize_sum(ts_rdd, column = "v", window = in_past("3s"))
sum
# Step 4: collect summarized result from Spark to R
<- ts_sum %>% collect()
res
print(res)
## # A tibble: 10 x 3
## time v v_sum
## <dttm> <dbl> <dbl>
## 1 1970-01-01 00:00:01 4 4
## 2 1970-01-01 00:00:03 -2 2
## 3 1970-01-01 00:00:04 NaN 2
## 4 1970-01-01 00:00:06 5 3
## 5 1970-01-01 00:00:07 NaN 5
## 6 1970-01-01 00:00:10 1 1
## 7 1970-01-01 00:00:15 -4 -4
## 8 1970-01-01 00:00:16 5 1
## 9 1970-01-01 00:00:18 NaN 1
## 10 1970-01-01 00:00:19 3 8
From the result above, one can see as a result of specifying window = in_past("3s")
, for each time point t
from example_time_series
(i.e., t = 1
, t = 3
, t = 4
, t = 6
, and so on), Flint has created a row containing t
and the summation of all v
value(s) occurring within the time window of [t - 3, t]
, and the sums are stored in a new column named v_sum
.
Given a timestamp t
, the subset of rows in a TimeSeriesRDD
having that timestamp is known as a “cycle” in Flint.
If the window = "<time window specification>"
argument is omitted, then the summarizer function will look at all cycles in the TimeSeriesRDD
. In other words, it will group all rows by their timestamps and perform aggregation within each group.
For example:
ts_sum <- summarize_sum(ts_rdd, column = "v")
will return a TimeSeriesRDD
with a timestamp column named time
and a summation column named v_sum
. For each timestamp t
present in ts_rdd
, ts_sum
will contain a row with timestamp t
and v_sum
value equal to summation of all v
values occurring at t
.
Because all rows from ts_rdd
are already ordered internally by timestamps, aggregations on cycles can be performed efficiently in Flint without re-shuffling rows in the input TimeSeriesRDD
.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.