Parametric Programming in R

John Mount

2017-04-13

Consider the problem of “parametric programming.” That is simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future.

Suppose, for example, your task was to build a new advisory column that tells you which values in a column of a data.frame are missing or NA. We will illustrate this in R using the example data given below:

d <- data.frame(x = c(1, NA))
print(d)
 #     x
 #  1  1
 #  2 NA

Performing an ad hoc analysis is trivial in R: we would just directly write:

d$x_isNA <- is.na(d$x)

We used the fact that we are looking at the data interactively to note the only column is “x”, and then picked “x_isNA” as our result name. If we want to use dplyr the notation remains straightforward:

library("dplyr")
packageVersion("dplyr")
 #  [1] '0.5.0'
d %>% mutate(x_isNA = is.na(x))
 #     x x_isNA
 #  1  1  FALSE
 #  2 NA   TRUE

Now suppose, as is common in actual data science and data wrangling work, we are not the ones picking the column names. Instead suppose we are trying to produce reusable code to perform this task again and again on many data sets. In that case we would then expect the column names to be given to us as values inside other variables (i.e., as parameters).

cname <- "x"                            # column we are examining
rname <- paste(cname, "isNA", sep= '_') # where to land results
print(rname)
 #  [1] "x_isNA"

And writing the matching code is again trivial:

d[[rname]] <- is.na(d[[cname]])

We are now programming at a slightly higher level, or automating tasks. We don’t need to type in new code each time a new data set with a different column name comes in. It is now easy to write a for-loop or lapply over a list of columns to analyze many columns in a single data set. It is an absolute travesty when something that is purely virtual (such as formulas and data) can not be automated over. So the slightly clunkier “[[]]” notation (which can be automated) is a necessary complement to the more convenient “$” notation (which is too specific to be easily automated over).
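As a minimal sketch of the automation described above, the loop below applies the “[[]]” pattern to several columns at once. The helper name add_na_indicators and the second column “y” are illustrative assumptions, not part of the original example.

```r
# Hypothetical helper: add a "<col>_isNA" indicator for each named column.
add_na_indicators <- function(d, cnames) {
  for (cname in cnames) {
    rname <- paste(cname, "isNA", sep = "_")  # where to land results
    d[[rname]] <- is.na(d[[cname]])
  }
  d
}

# Example data with a second, assumed column "y".
d2 <- data.frame(x = c(1, NA), y = c(NA, "b"), stringsAsFactors = FALSE)
res <- add_na_indicators(d2, c("x", "y"))
print(res)
```

Because the column names arrive as ordinary character values, the same function can be reused on any data set without editing its body.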

Using dplyr directly (when you know all the names) is deliberately straightforward, but programming over dplyr can become a challenge.

Standard practice

The standard parametric dplyr practice is to use dplyr::mutate_ (the standard evaluation or parametric variation of dplyr::mutate). Unfortunately the notation in using such an “underbar form” is currently cumbersome.

You have the choice of building up your formula through variations of one of: a formula, quote(), or a string.

(source: dplyr Non-standard evaluation vignette “nse”, for additional theory and upcoming official solutions please see here).

Let us try a few of these to emphasize that we are proposing a new solution not because we do not know of the current solutions, but because we are familiar with them.

Formula interface

The formula interface is a nice option, as it is R’s common way of holding names unevaluated. The code looks like the following (using lazyeval::interp to fill in the column name):

if  (requireNamespace("lazyeval")) {
  print(d %>%
    mutate_(.dots = stats::setNames(list(
      lazyeval::interp(~ is.na(VAR),
                       VAR = as.name(cname))), rname)))
}
 #     x x_isNA
 #  1  1  FALSE
 #  2 NA   TRUE

The extra layer of list()-wrapping seems to be needed to successfully control both the name of the input and result (please see here for the original solution).

Currently mutate_ does not take “two-sided formulas” so we can not write:

if  (requireNamespace("lazyeval")) {
  print(d %>% mutate_(RCOL = lazyeval::interp(RES ~ is.na(VAR),
                                              VAR= as.name(cname),
                                              RES= as.name(rname))))
}
 #  Error: Must use one-sided formula.

Trying quote() / substitute()

quote() can delay evaluation, but isn’t the right tool for parameterizing (what the linked NSE reference called “mixing constants and variable”). We can use the related substitute() method as shown below (notice the column must be passed as a name via as.name(); substituting the bare string would test the constant “x” rather than the column, and we still need stats::setNames() to control the result name).

d %>% mutate_(.dots =
    stats::setNames(list(substitute(is.na(XVAR),
                                    list(XVAR = as.name(cname)))),
                    rname))
 #     x x_isNA
 #  1  1  FALSE
 #  2 NA   TRUE
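
The difference between substituting a string and substituting a name can be checked in base R, without dplyr. The following is a small sketch under that assumption; the variable names e_string, e_name, r_string, and r_name are illustrative only.

```r
d <- data.frame(x = c(1, NA))
cname <- "x"

# Substituting the string yields is.na("x"): a test of the constant "x".
e_string <- substitute(is.na(XVAR), list(XVAR = cname))
# Substituting a name yields is.na(x): a test of the column x.
e_name <- substitute(is.na(XVAR), list(XVAR = as.name(cname)))

print(e_string)
print(e_name)

# Evaluate both expressions with the data frame's columns in scope.
r_string <- eval(e_string, d)  # one value: the string "x" is not NA
r_name <- eval(e_name, d)      # one value per row of column x
```

The string version returns a single FALSE, which would be recycled down the whole result column; only the name version inspects the data.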

My point is: even if this is something that you know how to accomplish, this is evidence we are really trying to swim upstream with this notation.

String solutions

String-based solutions can involve using paste to get parameter values into the strings. Here is an example:

# dplyr mutate_ paste stats::setNames solution
d %>% mutate_(.dots =
                stats::setNames(paste0('is.na(', cname, ')'),
                rname))
 #     x x_isNA
 #  1  1  FALSE
 #  2 NA   TRUE
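
To make the pasted string concrete, here is a base R sketch of what the string contains and how such a string can be turned into code. It illustrates the general idea only and is not a description of how mutate_ is implemented; code_text and expr are illustrative names.

```r
d <- data.frame(x = c(1, NA))
cname <- "x"
rname <- paste(cname, "isNA", sep = "_")

code_text <- paste0("is.na(", cname, ")")  # the pasted string
print(code_text)

expr <- parse(text = code_text)[[1]]       # string -> unevaluated call
d[[rname]] <- eval(expr, d)                # evaluate with columns in scope
print(d)
```

One caveat of pasting: a column name that is not a syntactic R name (for example, one containing a space) would have to be backquoted before being pasted into the string.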

Or just using strings as an interface to control lazyeval::interp (without the formula interface):

# dplyr mutate_ lazyeval::interp solution
if  (requireNamespace("lazyeval")) {
  print(d %>% mutate_(.dots =
                        stats::setNames(list(lazyeval::interp("is.na(cname)",
                                         cname = as.name(cname))), rname)))
}
 #     x x_isNA
 #  1  1  FALSE
 #  2 NA   TRUE

Our advice

Our advice is to give wrapr::let a try. wrapr::let takes a name mapping list (called “alias”) and a code-block (called “expr”). The code-block is re-written so that names in expr appearing on the left hand sides of the alias map are replaced with names appearing on the right hand side of the alias map.

The code looks like this:

# wrapr::let solution
wrapr::let(alias = list(cname = cname, rname = rname),
            expr  = {
            d %>% mutate(rname = is.na(cname))
            })
 #     x x_isNA
 #  1  1  FALSE
 #  2 NA   TRUE

Notice we are able to use dplyr::mutate instead of needing to invoke dplyr::mutate_. The expression block can be arbitrarily long and contain deep pipelines. We now have a useful separation of concerns: the mapping code is a wrapper completely outside of the user pipeline (the two are no longer commingled). For complicated tasks the ratio of wrapr::let boilerplate to actual useful work goes down quickly.
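
The re-writing idea can be approximated in base R. The sketch below is an illustration under stated assumptions, not wrapr’s actual implementation: the name let_sketch is hypothetical, it takes an already-quoted expression, and it only replaces symbols (so, unlike wrapr::let, it would not rename the left-hand side of an argument such as rname = ... inside mutate()).

```r
# Hypothetical base-R sketch of symbol replacement driven by an alias map.
let_sketch <- function(alias, expr, envir = parent.frame()) {
  # Turn the string-to-string map into a list of symbols.
  replacements <- lapply(alias, as.name)
  # Re-write the quoted expression by swapping in the new symbols.
  code <- do.call("substitute", list(expr, replacements))
  # Evaluate in a child environment so the caller's variables are untouched.
  eval(code, envir = new.env(parent = envir))
}

d <- data.frame(x = c(1, NA))
cname <- "x"
rname <- paste(cname, "isNA", sep = "_")

res <- let_sketch(
  alias = list(CNAME = cname, RNAME = rname),
  expr = quote({
    d$RNAME <- is.na(d$CNAME)
    d
  })
)
print(res)
```

The block is written against the stand-in names CNAME and RNAME, and the concrete names are supplied only through the alias map.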

We also have a variation for piping into, replyr::letp (though to save such pipes for later, use wrapr::let, not replyr::letp):

# replyr::letp solution
d %>% replyr::letp(alias = list(cname = cname, rname = rname),
                   expr  = {
                   . %>% mutate(rname = is.na(cname))
                   })
 #     x x_isNA
 #  1  1  FALSE
 #  2 NA   TRUE

The alias map is deliberately only allowed to be a string to string map (no environments, as.name, formula, expressions, or values) so wrapr::let itself is easy to use in automation or to program over. I’ll repeat that for emphasis: externally wrapr::let is completely controllable through standard (or parametric) evaluation interfaces. Also notice the code we wrote never directly mentions “x” or “x_isNA”, as it pulls these names out of its execution environment.

All of these solutions have consequences and corner cases. Our (biased) opinion is: we dislike wrapr::let the least.

More reading

Our group has been writing a lot on wrapr::let. It is new code, yet something we think analysts should try. Some of our recent notes include: