Using table.express

The goal of this package is to offer an alternative way of expressing common operations with data.table without sacrificing the performance optimizations that it offers. The data manipulation verbs are modeled on the dplyr package, which also advocates the pipe operator from the magrittr package. The rlang package powers most of this package’s functionality, which means that tidy evaluation can also be supported. Other resources describe these packages comprehensively, so they will not be explained here.

Even though using data manipulation verbs can improve expressiveness in some cases, this is not always true, so using the traditional data.table syntax might still be preferable in many situations.

In order to resemble SQL syntax more closely, a couple of verb aliases are also defined: where can be used in place of filter, and order_by in place of arrange.

The examples here will be working with the mtcars data:

data("mtcars")

DT <- mtcars %>%
  as.data.table %T>%
  print
#>      mpg cyl disp  hp drat    wt  qsec vs am gear carb
#>  1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#>  2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> ---                                                   
#> 31: 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
#> 32: 21.4   4  121 109 4.11 2.780 18.60  1  1    4    2

Expression delimiters

The foundation for this package is building expressions that are almost entirely delegated to data.table. These expressions are built by parsing the input of the different verbs. In order to explicitly show when an expression is being built and subsequently evaluated, we use 3 delimiters: start_expr, end_expr, and chain.

These also serve as visual reminders that we are not dealing directly with data.tables during the process. We capture the input data.table and start the process with start_expr, and evaluate the final expression with end_expr. Using chain is equivalent to calling end_expr immediately followed by start_expr.

Arranging rows

The arrange/order_by verbs add an expression with order to the frame, and let data.table handle it as usual:

DT %>%
  start_expr %>%
  order_by(mpg, -cyl) %T>%
  print %>%
  end_expr
#> .DT_[order(mpg, -cyl)]
#>      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#>  1: 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#>  2: 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> ---                                                    
#> 31: 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> 32: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

We see here that the built expression includes a .DT_ pronoun. When the expression is evaluated, the captured data.table is assigned to the evaluation environment as said pronoun.
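
The pronoun mechanism can be illustrated with base R alone. The following is only a minimal sketch of the idea, not the package's actual implementation: the names built_expr and eval_env are hypothetical, and the expression is written by hand instead of being produced by a verb.

```r
library(data.table)

DT <- as.data.table(mtcars)

# An expression like the one order_by builds; nothing is evaluated yet.
built_expr <- quote(.DT_[order(mpg, -cyl)])

# Bind the captured data.table to the .DT_ pronoun in a fresh environment
# (hypothetical stand-in for the package's evaluation environment).
eval_env <- new.env()
assign(".DT_", DT, envir = eval_env)

# Evaluating the expression hands the actual work over to data.table.
result <- eval(built_expr, envir = eval_env)
print(head(result, 2L))
```

Because the expression only refers to .DT_, the same expression could be evaluated against a different table by binding another object to the pronoun.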

Selecting columns

Even though selecting a subset of columns is a common operation, it may be undesirable when working with data.tables because it leads to data copies.

x <- 1:2
tracemem(x)
#> [1] "<00000000141B1828>"
df <- data.frame(x=x)
x2 <- df[, "x"]
x2[1L] <- 0L
#> tracemem[0x00000000141b1828 -> 0x00000000178d45b0]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> doTryCatch tryCatchOne tryCatchList tryCatch <Anonymous>

With this normal data frame, only the last assignment triggered a copy.

dt <- data.table(x=x)
#> tracemem[0x00000000141b1828 -> 0x0000000017b4d818]: data.table eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> doTryCatch tryCatchOne tryCatchList tryCatch <Anonymous>
x3 <- dt[, x]
#> tracemem[0x0000000017b4d818 -> 0x0000000017bec920]: copy [.data.table [ eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> doTryCatch tryCatchOne tryCatchList tryCatch <Anonymous>

In this case with data.table, more copies were triggered. Given that data.table supports modification by reference, these copies are necessary: without them, modifying one table in place could silently alter data shared with another object.

With that said, the select verb can be used as usual:

DT %>%
  start_expr %>%
  select(mpg, am) %T>%
  print %>%
  end_expr
#> .DT_[, list(mpg, am)]
#>      mpg am
#>  1: 21.0  1
#>  2: 21.0  1
#> ---        
#> 31: 15.0  1
#> 32: 21.4  1

To maintain consistency, even single columns are kept as data.tables:

DT %>%
  start_expr %>%
  select(mpg) %T>%
  print %>%
  end_expr
#> .DT_[, list(mpg)]
#>      mpg
#>  1: 21.0
#>  2: 21.0
#> ---     
#> 31: 15.0
#> 32: 21.4

In the case of single expressions in select, calls to tidyselect’s helpers or to : are handled specially internally:

DT %>%
  start_expr %>%
  select(mpg:cyl) %>%
  end_expr
#>      mpg cyl
#>  1: 21.0   6
#>  2: 21.0   6
#> ---         
#> 31: 15.0   8
#> 32: 21.4   4
DT %>%
  start_expr %>%
  select(contains("M", ignore.case = TRUE)) %>%
  end_expr
#>      mpg am
#>  1: 21.0  1
#>  2: 21.0  1
#> ---        
#> 31: 15.0  1
#> 32: 21.4  1

Tidy evaluation and the .parse argument can also aid in cases where certain parts of the frame were computed programmatically:

selected <- c("mpg", "am")
DT %>%
  start_expr %>%
  select(!!!selected, .parse = TRUE) %>%
  end_expr
#>      mpg am
#>  1: 21.0  1
#>  2: 21.0  1
#> ---        
#> 31: 15.0  1
#> 32: 21.4  1
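
For comparison, plain data.table offers several ways to select columns whose names are stored in a character vector. The sketch below uses only data.table itself and makes no claim about which form, if any, the package builds internally; the result names are hypothetical.

```r
library(data.table)

DT <- as.data.table(mtcars)
selected <- c("mpg", "am")

# Option 1: the .. prefix looks up `selected` in the calling scope.
res_prefix <- DT[, ..selected]

# Option 2: with = FALSE treats j as a vector of column names.
res_with <- DT[, selected, with = FALSE]

# Option 3: restrict .SD to the chosen columns.
res_sd <- DT[, .SD, .SDcols = selected]

print(names(res_prefix))
```

All three return a data.table with the columns mpg and am, like the select call above.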

Transmuting columns

Given the way data.table handles the j part of the frame, creating and keeping only new columns (like with dplyr’s transmute) can be done with select, so transmute is simply an alias during expression building.

DT %>%
  start_expr %>%
  select(foo = mpg * 2, bar = exp(cyl)) %>%
  end_expr
#>      foo        bar
#>  1: 42.0  403.42879
#>  2: 42.0  403.42879
#> ---                
#> 31: 30.0 2980.95799
#> 32: 42.8   54.59815
DT %>%
  start_expr %>%
  transmute(foo = mpg * 2, bar = exp(cyl)) %>%
  end_expr
#>      foo        bar
#>  1: 42.0  403.42879
#>  2: 42.0  403.42879
#> ---                
#> 31: 30.0 2980.95799
#> 32: 42.8   54.59815

Mutating columns

The mutate verb builds an expression with := in order to perform assignment by reference by default. This can be avoided by passing .by_ref = FALSE to end_expr, which will use data.table::copy before assigning .DT_:

DT %>%
  start_expr %>%
  mutate(mpg = mpg / 2, hp = log(hp))
#> .DT_[, `:=`(mpg = mpg/2, hp = log(hp))]
DT %>%
  start_expr %>%
  mutate(mpg = mpg / 2, hp = log(hp)) %>%
  end_expr(.by_ref = FALSE) %>% {
    invisible(print(.))
  }
#>      mpg cyl disp       hp drat    wt  qsec vs am gear carb
#>  1: 10.5   6  160 4.700480 3.90 2.620 16.46  0  1    4    4
#>  2: 10.5   6  160 4.700480 3.90 2.875 17.02  0  1    4    4
#> ---                                                        
#> 31:  7.5   8  301 5.814131 3.54 3.570 14.60  0  1    5    8
#> 32: 10.7   4  121 4.691348 4.11 2.780 18.60  1  1    4    2
print(DT)
#>      mpg cyl disp  hp drat    wt  qsec vs am gear carb
#>  1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#>  2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> ---                                                   
#> 31: 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
#> 32: 21.4   4  121 109 4.11 2.780 18.60  1  1    4    2
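
In plain data.table terms, the effect of .by_ref = FALSE is comparable to copying the table before assigning by reference. The sketch below assumes that reading of the text above and uses only data.table; modified is a hypothetical name.

```r
library(data.table)

DT <- as.data.table(mtcars)

# Copy first, then assign by reference on the copy only.
modified <- copy(DT)[, `:=`(mpg = mpg / 2, hp = log(hp))]

# The original table keeps its values; only the copy changed.
print(DT$mpg[1L])
print(modified$mpg[1L])
```

Without the copy, the same := call would overwrite mpg and hp in DT itself.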

Filtering rows

The where/filter verbs work with the i part of the frame:

DT %>%
  start_expr %>%
  filter(vs == 1L, carb > 2L) %T>%
  print %>%
  end_expr
#> .DT_[vs == 1L & carb > 2L]
#>     mpg cyl  disp  hp drat   wt qsec vs am gear carb
#> 1: 19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
#> 2: 17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
DT %>%
  start_expr %>%
  select(mean_mpg = mean(mpg)) %>%
  where(vs == 1L, carb > 2L, .collapse = `|`) %T>%
  print %>%
  end_expr
#> .DT_[vs == 1L | carb > 2L, list(mean_mpg = mean(mpg))]
#>    mean_mpg
#> 1: 20.30741

The helper verb filter_sd can be used to apply the same conditions to many columns. It can use a special pronoun .COL while specifying the expression, and it accepts tidyselect helpers to choose .SDcols (with caveats, see eager verbs):

DT %>%
  start_expr %>%
  filter_sd(`>`, 20, .SDcols = c("mpg", "qsec")) %T>%
  print %>%
  end_expr
#> .DT_[mpg > 20 & qsec > 20]
#>     mpg cyl  disp hp drat    wt  qsec vs am gear carb
#> 1: 22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
#> 2: 21.5   4 120.1 97 3.70 2.465 20.01  1  0    3    1
DT %>%
  start_expr %>%
  filter_sd(.COL > 20, .SDcols = c("mpg", "qsec")) %>%
  end_expr
#>     mpg cyl  disp hp drat    wt  qsec vs am gear carb
#> 1: 22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
#> 2: 21.5   4 120.1 97 3.70 2.465 20.01  1  0    3    1
DT %>%
  start_expr %>%
  filter_sd(.COL > 0, .SDcols = contains("m"))
#> .DT_[mpg > 0 & am > 0]
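
To see what the built expression .DT_[mpg > 20 & qsec > 20] amounts to, the same filter can be written by hand in data.table by applying one predicate to each column and combining the results with &. This is only an illustrative sketch; cols, keep and filtered are hypothetical names, and the package builds an expression instead of computing a logical vector like this.

```r
library(data.table)

DT <- as.data.table(mtcars)
cols <- c("mpg", "qsec")

# Apply the same predicate to each chosen column,
# then combine the logical vectors element-wise with &.
keep <- DT[, Reduce(`&`, lapply(.SD, function(col) col > 20)), .SDcols = cols]

# Use the combined logical vector in the i part of the frame.
filtered <- DT[keep]
print(filtered)
```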

Using keys or secondary indices

The filter_on verb can be used to build an expression that specifies the on argument of the frame. It accepts key-value pairs where each key is a column in the data, and each value is the corresponding value that the column should have to match:

#> .DT_[list(6, 0), on = c("cyl", "am")]
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 2: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 3: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 4: 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#>     mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

Modifying subset of data

In order to support functionality similar to data.table’s DT[, lapply(.SD, fun), .SDcols = c("...")] syntax, 2 data.table-specific verbs are provided: mutate_sd and transmute_sd.

The mutate_sd verb modifies the columns in .SDcols by reference, and keeps the columns that are not part of .SDcols:

DT %>%
  start_expr %>%
  mutate_sd(exp, .SDcols = c("mpg", "cyl")) %>%
  end_expr

print(DT)
#>            mpg        cyl disp  hp drat    wt  qsec vs am gear carb
#>  1: 1318815734  403.42879  160 110 3.90 2.620 16.46  0  1    4    4
#>  2: 1318815734  403.42879  160 110 3.90 2.875 17.02  0  1    4    4
#> ---                                                                
#> 31:    3269017 2980.95799  301 335 3.54 3.570 14.60  0  1    5    8
#> 32: 1967441884   54.59815  121 109 4.11 2.780 18.60  1  1    4    2

Additionally, mutate_sd supports the special .COL pronoun that symbolizes the column that should be modified, and can be used to express the mutation expression:

DT %>%
  start_expr %>%
  mutate_sd(log(.COL), .SDcols = c("mpg", "cyl")) %>%
  end_expr

print(DT)
#>      mpg cyl disp  hp drat    wt  qsec vs am gear carb
#>  1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#>  2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> ---                                                   
#> 31: 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
#> 32: 21.4   4  121 109 4.11 2.780 18.60  1  1    4    2

On the other hand, transmute_sd never modifies by reference, and supports special expressions to “build” what is chosen as .SDcols. These expressions can use tidyselect helpers, as well as another special pronoun: .COLNAME:

DT %>%
  start_expr %>%
  transmute_sd(.COL * 2, .SDcols = starts_with("d")) %>%
  end_expr
#>     disp drat
#>  1:  320 7.80
#>  2:  320 7.80
#> ---          
#> 31:  602 7.08
#> 32:  242 8.22
DT %>%
  start_expr %>%
  transmute_sd(.COL * 2, .SDcols = grepl("^d", .COLNAME)) %>%
  end_expr
#>     disp drat
#>  1:  320 7.80
#>  2:  320 7.80
#> ---          
#> 31:  602 7.08
#> 32:  242 8.22
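
For reference, the lapply(.SD, fun) idiom that these verbs are modeled on looks as follows in plain data.table. This is a sketch of the data.table side only; d_cols and doubled are hypothetical names, and the column names are computed up front with grep.

```r
library(data.table)

DT <- as.data.table(mtcars)

# Choose the columns whose names start with "d".
d_cols <- grep("^d", names(DT), value = TRUE)

# Apply the same function to every chosen column; DT itself is not modified.
doubled <- DT[, lapply(.SD, function(col) col * 2), .SDcols = d_cols]
print(head(doubled, 2L))
```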

Data manipulation by group

Since data.table already supports this by means of its by parameter, the group_by verb simply parses its input and assigns it accordingly:

DT %>%
  start_expr %>%
  select(.N) %>%
  group_by(gear) %T>%
  print %>%
  end_expr
#> .DT_[, list(.N), by = list(gear)]
#>    gear  N
#> 1:    4 12
#> 2:    3 15
#> 3:    5  5

The key_by verb does the same but sets the key of the result, so the output is sorted by the grouping columns:

DT %>%
  start_expr %>%
  select(.N) %>%
  key_by(gear) %T>%
  print %>%
  end_expr
#> .DT_[, list(.N), keyby = list(gear)]
#>    gear  N
#> 1:    3 15
#> 2:    4 12
#> 3:    5  5
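
The difference between the two built expressions can be checked directly in data.table. The short sketch below runs both forms; grouped and keyed are hypothetical names.

```r
library(data.table)

DT <- as.data.table(mtcars)

# by keeps groups in order of first appearance and sets no key.
grouped <- DT[, list(.N), by = list(gear)]

# keyby sorts the result by the grouping column and sets it as the key.
keyed <- DT[, list(.N), keyby = list(gear)]

print(key(grouped))
print(key(keyed))
```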

Automatic expression chaining

A data.table’s frame has 3 main elements: i, j, and by. By default, a verb defined in this package adds to the current frame if the element it needs to define is still unspecified there; if the current expression’s frame has already specified that element, the verb automatically starts a new frame. More complex expressions are thus supported by automatically chaining data.table frames:

DT %>%
  start_expr %>%
  select(mean_mpg = mean(mpg)) %>%
  where(hp > 50L) %>%
  group_by(vs, am, gear) %>%
  order_by(gear, -vs, am) %>%
  filter(mean_mpg > 20) %T>%
  print %>%
  end_expr %>% {
    invisible(print(., nrows = 10L))
  }
#> .DT_[hp > 50L, list(mean_mpg = mean(mpg)), by = list(vs, am, 
#>     gear)][order(gear, -vs, am)][mean_mpg > 20]
#>    vs am gear mean_mpg
#> 1:  1  0    3 20.33333
#> 2:  1  0    4 21.05000
#> 3:  1  1    4 28.03333
#> 4:  0  1    4 21.00000
#> 5:  1  1    5 30.40000

If we wanted to be explicit about chaining whenever possible (see below), we could set options(table.express.chain = FALSE), which would lead to a warning being shown whenever a part of the query is replaced.

Verbs’ effects in the frame

Explicit chaining

The automatic chaining mentioned above is not a problem in most situations. For example, the following two expressions lead to the same result, and therefore have the same semantics:

DT[mpg > 20, mpg * 2]
#>  [1] 42.0 42.0 45.6 42.8 48.8 45.6 64.8 60.8 67.8 43.0 54.6 52.0 60.8 42.8
DT[mpg > 20][, mpg * 2]
#>  [1] 42.0 42.0 45.6 42.8 48.8 45.6 64.8 60.8 67.8 43.0 54.6 52.0 60.8 42.8

However, these two chains have different semantics:

DT[, .(mpg = mpg * 2)][mpg > 40]
#>      mpg
#>  1: 42.0
#>  2: 42.0
#> ---     
#> 13: 60.8
#> 14: 42.8
DT[mpg > 40, .(mpg = mpg * 2)]
#> Empty data.table (0 rows and 1 cols): mpg

As mentioned above, chain can be used to chain expressions by evaluating the current one with end_expr, and immediately capturing the resulting data.table to start building a new expression. This can be helpful in situations where automatic chaining (or lack thereof) can lead to a change in the expression’s semantics:

DT %>%
  start_expr %>%
  transmute(mpg = mpg * 2) %>%
  filter(mpg > 40) %T>%
  print %>%
  end_expr
#> .DT_[mpg > 40, list(mpg = mpg * 2)]
#> Empty data.table (0 rows and 1 cols): mpg
DT %>%
  start_expr %>%
  transmute(mpg = mpg * 2) %>%
  chain %>%
  filter(mpg > 40) %>%
  end_expr
#>      mpg
#>  1: 42.0
#>  2: 42.0
#> ---     
#> 13: 60.8
#> 14: 42.8

Eager verbs

In the following cases, the mentioned verbs use the captured data.table eagerly during expression building:

This can lead to unexpected results if we don’t keep in mind the expression that is built:

#>      mpg cyl disp
#>  1: 21.0   6  160
#>  2: 21.0   6  160
#> ---              
#> 31: 15.0   8  301
#> 32: 21.4   4  121
#> .DT_[, mpg:disp][mpg > 0 & am > 0, list(ans = sqrt(mpg))]

The select gets rid of am, but filter_sd sees the columns of DT before any expression has been evaluated. Explicit chaining can help in these cases, capturing intermediate results:

#> .DT_[mpg > 0, list(ans = sqrt(mpg))]