ruler offers a set of tools for creating tidy data
validation reports using dplyr
grammar of data manipulation. It is structured to be flexible and
extendable in terms of creating rules and using their output.
To fully use this package a solid knowledge of dplyr is
required. The key idea behind ruler’s design is to validate
data by modifying regular dplyr code with as little
overhead as possible.
Some functionality is powered by the keyholder package.
It is highly recommended to use its supported functions during rule
construction. All one- and two-table dplyr verbs applied to
local data frames are supported and considered the most appropriate way
to create rules.
This README is structured as follows:
- ruler for exploration of obeying user-defined rules and its automatic validation.
- ruler’s capabilities in more detail.

You can install the current stable version from CRAN with:

install.packages("ruler")

You can also install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("echasnovski/ruler")

# Packages used by the examples below
library(dplyr)
library(ruler)

# Utility functions
is_integerish <- function(x) {
  all(x == as.integer(x))
}
z_score <- function(x) {
  abs(x - mean(x)) / sd(x)
}
# Define rule packs
my_packs <- list(
  data_packs(
    dims = . %>% summarise(nrow_low = nrow(.) >= 10, nrow_high = nrow(.) <= 15,
      ncol_low = ncol(.) >= 20, ncol_high = ncol(.) <= 30)
  ),
  group_packs(
    vs_am_num = . %>% group_by(vs, am) %>% summarise(vs_am_low = n() >= 7),
    .group_vars = c("vs", "am")
  ),
  col_packs(
    enough_col_sum = . %>%
      summarise_if(is_integerish, rules(is_enough = sum(.) >= 14))
  ),
  row_packs(
    enough_row_sum = . %>%
      filter(vs == 1) %>%
      transmute(is_enough = rowSums(.) >= 200)
  ),
  cell_packs(
    dbl_not_outlier = . %>%
      transmute_if(is.numeric, rules(is_not_out = z_score(.) < 1)) %>%
      slice(-(1:5))
  )
)
# Expose data to rules
mtcars_exposed <- mtcars %>% as_tibble() %>%
  expose(my_packs)
# View exposure
mtcars_exposed %>% get_exposure()
#>   Exposure
#> 
#> Packs info:
#> # A tibble: 5 × 4
#>   name            type       fun        remove_obeyers
#>   <chr>           <chr>      <list>     <lgl>         
#> 1 dims            data_pack  <data_pck> TRUE          
#> 2 vs_am_num       group_pack <grop_pck> TRUE          
#> 3 enough_col_sum  col_pack   <col_pack> TRUE          
#> 4 enough_row_sum  row_pack   <row_pack> TRUE          
#> 5 dbl_not_outlier cell_pack  <cell_pck> TRUE          
#> 
#> Tidy data validation report:
#> # A tibble: 117 × 5
#>   pack            rule       var      id value
#>   <chr>           <chr>      <chr> <int> <lgl>
#> 1 dims            nrow_high  .all      0 FALSE
#> 2 dims            ncol_low   .all      0 FALSE
#> 3 vs_am_num       vs_am_low  0.1       0 FALSE
#> 4 enough_col_sum  is_enough  am        0 FALSE
#> 5 enough_row_sum  is_enough  .all     19 FALSE
#> 6 dbl_not_outlier is_not_out mpg      15 FALSE
#> # ℹ 111 more rows
# Assert any breaker
invisible(mtcars_exposed %>% assert_any_breaker())
#>   Breakers report
#> Tidy data validation report:
#> # A tibble: 117 × 5
#>   pack            rule       var      id value
#>   <chr>           <chr>      <chr> <int> <lgl>
#> 1 dims            nrow_high  .all      0 FALSE
#> 2 dims            ncol_low   .all      0 FALSE
#> 3 vs_am_num       vs_am_low  0.1       0 FALSE
#> 4 enough_col_sum  is_enough  am        0 FALSE
#> 5 enough_row_sum  is_enough  .all     19 FALSE
#> 6 dbl_not_outlier is_not_out mpg      15 FALSE
#> # ℹ 111 more rows
#> Error: assert_any_breaker: Some breakers found in exposure.

A rule is a function which converts a data unit of interest (data, group, column, row, cell) to a logical value indicating whether this object satisfies a certain condition.
A rule pack is a function which combines several rules
into one functional block. The recommended way of creating rules is to
create packs right away using dplyr and magrittr’s pipe operator.
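
To make the "pack is a function" idea concrete, here is a minimal sketch that uses only dplyr. The name `dims_pack` is made up for illustration, and the pack is shown bare, before it would be wrapped with a constructor such as data_packs().

```r
library(dplyr)

# A pack written as a magrittr functional sequence: it is an ordinary function
dims_pack <- . %>% summarise(
  nrow_low = nrow(.) >= 10,
  nrow_high = nrow(.) <= 15
)

# Calling it directly on data returns one logical column per rule
dims_result <- dims_pack(mtcars)
dims_result
```

Each output column corresponds to one rule, which is why rule names are taken from the column names of the result.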
Exposing data to rules means applying rules to data,
collecting the results in a common format and attaching them to the data as an
exposure attribute. This way the actual exposure can be done
in multiple steps and can also be part of a general data preparation
pipeline.
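
A minimal sketch of exposing in more than one step, assuming that a second expose() call adds its results to the exposure already attached by the first. The pack names are invented for this illustration.

```r
library(dplyr)
library(ruler)

# Two illustrative data packs (names are made up for this sketch)
row_count_pack <- data_packs(
  row_count = . %>% summarise(nrow_is_32 = nrow(.) == 32)
)
col_count_pack <- data_packs(
  col_count = . %>% summarise(ncol_is_11 = ncol(.) == 11)
)

# Expose in two separate steps; obeyers are kept so the report is not empty
exposed_twice <- mtcars %>%
  expose(row_count_pack, .remove_obeyers = FALSE) %>%
  expose(col_count_pack, .remove_obeyers = FALSE)

# Inspect which packs were recorded and what they reported
exposed_twice %>% get_packs_info()
exposed_twice %>% get_report()
```

Ordinary data preparation code could sit between the two expose() calls, as long as it keeps the attributes of the data intact.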
Exposure is a format designed to contain uniform information about validation of different data units. For reproducibility it also saves information about applied packs. Basically, an exposure is a list with two elements:
- Packs info: a tibble with the columns name, type, fun and remove_obeyers, as printed under "Packs info" in the output above.
- Tidy data validation report: a tibble with the columns pack, rule, var, id and value.

There are four basic combinations of var and id values which define five basic data units:
- var == '.all' and id == 0: Data as a whole.
- var != '.all' and id == 0: Group (var shouldn’t be an actual column name) or column (var should be an actual column name) as a whole.
- var == '.all' and id != 0: Row as a whole.
- var != '.all' and id != 0: Described cell.

With an exposure attached to data one can perform different kinds of actions: exploration, assertion, imputation and so on.
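
The var and id combinations listed above can be checked mechanically. The sketch below uses base R only; `classify_unit()` is a hypothetical helper, not part of ruler, and the report rows are typed in by hand from the output printed earlier.

```r
# Hypothetical helper (not part of ruler): label each report row with the
# data unit it describes, following the var/id combinations
classify_unit <- function(report, col_names) {
  is_all <- report$var == ".all"
  is_whole <- report$id == 0
  ifelse(
    is_all & is_whole, "data",
    ifelse(
      is_all, "row",
      ifelse(
        !is_whole, "cell",
        # var != '.all' and id == 0: column if var is a column name, else group
        ifelse(report$var %in% col_names, "column", "group")
      )
    )
  )
}

# Hand-written rows copied from the validation report shown earlier
report <- data.frame(
  pack = c("dims", "vs_am_num", "enough_col_sum", "enough_row_sum",
           "dbl_not_outlier"),
  rule = c("nrow_high", "vs_am_low", "is_enough", "is_enough", "is_not_out"),
  var = c(".all", "0.1", "am", ".all", "mpg"),
  id = c(0L, 0L, 0L, 19L, 15L),
  value = FALSE,
  stringsAsFactors = FALSE
)

report$unit <- classify_unit(report, names(mtcars))
report[, c("var", "id", "unit")]
```

Distinguishing a group from a column needs the column names of the validated data, which is why the helper takes them as a second argument.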
# List of two rule packs for checking data properties
my_data_packs <- data_packs(
  # data_dims is a pack name
  data_dims = . %>% summarise(
    # ncol and nrow are rule names
    ncol = ncol(.) == 12,
    nrow = nrow(.) == 32
  ),
  # Data after subsetting should have number of rows in between 10 and 30
  # Rules are applied separately
  vs_1 = . %>% filter(vs == 1) %>%
    summarise(
      nrow_low = nrow(.) > 10,
      nrow_high = nrow(.) < 30
    )
)

# List of one nameless rule pack for checking group property
my_group_packs <- group_packs(
  # Name will be imputed during exposure
  . %>% group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),
  # One should supply grouping variables for correct interpretation of output
  .group_vars = c("vs", "am")
)

# rules() defines function predicates with necessary name imputations
# List of two rule packs for checking certain columns' properties
my_col_packs <- col_packs(
  sum_bounds = . %>% summarise_at(
    # Check only columns with names starting with 'c'
    vars(starts_with("c")),
    rules(sum_low = sum(.) > 300, sum_high = sum(.) < 400)
  ),
  # In the edge case of checking one column with one rule there is a need
  # for forcing inclusion of names in the output of summarise_at().
  # This is done with naming argument in vars()
  vs_mean = . %>% summarise_at(vars(vs = vs), rules(mean(.) > 0.5))
)

z_score <- function(x) {
  (x - mean(x)) / sd(x)
}
# List of one rule pack checking certain rows' property
my_row_packs <- row_packs(
  row_mean = . %>% mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    # Check only rows 10-15
    # Values in 'id' column of report will be based on input data (i.e. 10-15)
    # and not on output data (1-6)
    slice(10:15)
)

is_integerish <- function(x) {
  all(x == as.integer(x))
}
# List of two cell packs checking certain cells' property
my_cell_packs <- cell_packs(
  my_cell_pack_1 = . %>% transmute_if(
    # Check only integer-like columns
    is_integerish,
    rules(is_common = abs(z_score(.)) < 1)
  ) %>%
    # Check only rows 20-30
    slice(20:30),
  # The same edge case as in column rule pack
  vs_side = . %>% transmute_at(vars(vs = "vs"), rules(. > mean(.)))
)

By default, exposing removes obeyers, i.e. report entries whose value is TRUE.
mtcars %>%
  expose(my_data_packs, my_group_packs) %>%
  get_exposure()
#>   Exposure
#> 
#> Packs info:
#> # A tibble: 3 × 4
#>   name          type       fun        remove_obeyers
#>   <chr>         <chr>      <list>     <lgl>         
#> 1 data_dims     data_pack  <data_pck> TRUE          
#> 2 vs_1          data_pack  <data_pck> TRUE          
#> 3 group_pack__1 group_pack <grop_pck> TRUE          
#> 
#> Tidy data validation report:
#> # A tibble: 3 × 5
#>   pack          rule      var      id value
#>   <chr>         <chr>     <chr> <int> <lgl>
#> 1 data_dims     ncol      .all      0 FALSE
#> 2 group_pack__1 any_cyl_6 0.0       0 FALSE
#> 3 group_pack__1 any_cyl_6 1.1       0 FALSE

One can keep obeyers by setting .remove_obeyers to FALSE.
mtcars %>%
  expose(my_data_packs, my_group_packs, .remove_obeyers = FALSE) %>%
  get_exposure()
#>   Exposure
#> 
#> Packs info:
#> # A tibble: 3 × 4
#>   name          type       fun        remove_obeyers
#>   <chr>         <chr>      <list>     <lgl>         
#> 1 data_dims     data_pack  <data_pck> FALSE         
#> 2 vs_1          data_pack  <data_pck> FALSE         
#> 3 group_pack__1 group_pack <grop_pck> FALSE         
#> 
#> Tidy data validation report:
#> # A tibble: 8 × 5
#>   pack          rule      var      id value
#>   <chr>         <chr>     <chr> <int> <lgl>
#> 1 data_dims     ncol      .all      0 FALSE
#> 2 data_dims     nrow      .all      0 TRUE 
#> 3 vs_1          nrow_low  .all      0 TRUE 
#> 4 vs_1          nrow_high .all      0 TRUE 
#> 5 group_pack__1 any_cyl_6 0.0       0 FALSE
#> 6 group_pack__1 any_cyl_6 0.1       0 TRUE 
#> # ℹ 2 more rows

By default expose() guesses the pack type if a ‘not-pack’
function is supplied. This behaviour has some edge cases but is useful
for interactive use.
mtcars %>%
  expose(
    some_data_pack = . %>% summarise(nrow = nrow(.) == 10),
    some_col_pack = . %>% summarise_at(vars(vs = "vs"), rules(is.character(.)))
  ) %>%
  get_exposure()
#>   Exposure
#> 
#> Packs info:
#> # A tibble: 2 × 4
#>   name           type      fun        remove_obeyers
#>   <chr>          <chr>     <list>     <lgl>         
#> 1 some_data_pack data_pack <data_pck> TRUE          
#> 2 some_col_pack  col_pack  <col_pack> TRUE          
#> 
#> Tidy data validation report:
#> # A tibble: 2 × 5
#>   pack           rule    var      id value
#>   <chr>          <chr>   <chr> <int> <lgl>
#> 1 some_data_pack nrow    .all      0 FALSE
#> 2 some_col_pack  rule__1 vs        0 FALSE

To write strict and robust code one can set .guess to FALSE.
mtcars %>%
  expose(
    some_data_pack = . %>% summarise(nrow = nrow(.) == 10),
    some_col_pack = . %>% summarise_at(vars(vs = "vs"), rules(is.character(.))),
    .guess = FALSE
  ) %>%
  get_exposure()
#> Error in expose_single.default(X[[i]], ...): There is unsupported class of rule pack.

General actions are recommended to be done with
act_after_exposure(). It takes two arguments:
- .trigger - a function which takes the data with attached exposure and returns TRUE if some action should be made.
- .actor - a function which takes the same argument as .trigger and performs some action.

If the trigger didn’t notify, the input data is returned untouched.
Otherwise the output of .actor() is returned.
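
As a sketch of an actor that returns modified data, the example below drops the rows reported as breakers. It assumes that any_breaker() can be passed as the trigger and that the id column of a row pack's report holds row positions in the input data. The pack, the threshold of 15 and `drop_breaking_rows()` are made up for illustration.

```r
library(dplyr)
library(ruler)

# Row pack: a row obeys the rule when its mpg is at least 15 (illustrative)
mpg_pack <- row_packs(
  mpg_ok = . %>% transmute(mpg_high = mpg >= 15)
)

# Hypothetical actor: remove rows that the report marks as breakers
drop_breaking_rows <- function(.tbl) {
  bad_ids <- .tbl %>%
    get_report() %>%
    filter(!value, id != 0) %>%
    pull(id)
  .tbl %>% slice(-unique(bad_ids))
}

cleaned <- mtcars %>%
  expose(mpg_pack) %>%
  act_after_exposure(.trigger = any_breaker, .actor = drop_breaking_rows)

nrow(cleaned)
```

Because the actor only runs when the trigger fires, it does not need to handle the case of an empty set of breakers.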
Note that act_after_exposure() is often
used for creating side effects (printing, throwing error etc.) and in
that case should invisibly return its input (to be able to use it with
pipe).
trigger_one_pack <- function(.tbl) {
  packs_number <- .tbl %>%
    get_packs_info() %>%
    nrow()
  packs_number > 1
}
actor_one_pack <- function(.tbl) {
  cat("More than one pack was applied.\n")
  invisible(.tbl)
}
mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  act_after_exposure(
    .trigger = trigger_one_pack,
    .actor = actor_one_pack
  ) %>%
  invisible()
#> More than one pack was applied.

ruler has a function assert_any_breaker()
which can notify about the presence of any breaker in the exposure.
mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  assert_any_breaker()
#>   Breakers report
#> Tidy data validation report:
#> # A tibble: 4 × 5
#>   pack       rule               var      id value
#>   <chr>      <chr>              <chr> <int> <lgl>
#> 1 sum_bounds sum_low            cyl       0 FALSE
#> 2 sum_bounds sum_low            carb      0 FALSE
#> 3 vs_mean    rule__1            vs        0 FALSE
#> 4 row_mean   is_common_row_mean .all     15 FALSE
#> Error: assert_any_breaker: Some breakers found in exposure.

More leaned towards assertions:
More leaned towards validation: