This package allows you to monitor changes in data as it is processed. It implements an easy-to-use, extensible logging framework and ships with a few ready-made data loggers.
This vignette will show you how to get started and what the default loggers do. The extending lumberjack vignette explains how to build your own loggers.
The basic version can be installed with
install.packages("lumberjack")
So you want to know who does what to your data as it flows through your process. Here’s the workflow that allows you to do it using the lumberjack package. (Note the use of %>>%!)
out <- women %>>%
  start_log() %>>%
  identity() %>>%
  head() %>>%
  dump_log(stop=TRUE)
## Dumped a log at /tmp/RtmpcnsPBw/Rbuild316977a77a36/lumberjack/vignettes/simple_log.csv
read.csv("simple_log.csv")
## step time expression changed
## 1 1 2017-06-14 11:39:37 identity() FALSE
## 2 2 2017-06-14 11:39:37 head() TRUE
Let’s go through this step by step to see what happened. The start of the script defines an output variable out and passes women to the lumberjack (%>>%). Next, the function start_log makes sure that logging starts from there. We are now ready to start performing logged transformations on our dataset. First, we apply the identity function, which does exactly nothing. Then, the head function selects the first six rows of the dataset, and dump_log(stop=TRUE) writes the log to a csv file, which we then read in. The optional argument stop=TRUE signals that no logging is necessary for subsequent operations.
The resulting data frame contains a step number, a timestamp, the expression evaluated to transform the data, and an indicator of whether the data changed at all. As expected, the identity function hasn’t changed anything and the head function cuts off all records below the sixth row.
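Because the dumped log is an ordinary csv file, it can be filtered like any other data frame. The sketch below repeats the workflow above and keeps only the steps that changed the data. It assumes lumberjack is installed; the temporary file and the variable names run_log and changed_steps are choices made for this illustration, not part of the package.

```r
library(lumberjack)

# Write the log to a temporary file instead of the default location.
logfile <- tempfile(fileext = ".csv")

out <- women %>>%
  start_log(log = simple$new(verbose = FALSE)) %>>%
  identity() %>>%
  head() %>>%
  dump_log(file = logfile, stop = TRUE)

# Read the log back and keep only the steps flagged as having changed the data.
run_log <- read.csv(logfile)
changed_steps <- run_log[run_log$changed, ]
changed_steps$expression
```

For the run shown above, this should leave only the head() step, since identity() is logged with changed equal to FALSE.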
You have now seen the most important functions of the package. Let’s summarize them.
- start_log(data, log): start logging, possibly using a custom logger (see next section).
- %>>%: the lumberjack, a logging-aware function composition operator (‘pipe’).
- dump_log(data, stop, ...): dump the log for data (if present).
- stop_log(data): stop logging.
All these functions are data-in, data-out. You are probably used to this from using dplyr or one of its siblings. However, the lumberjack functions are not limited to data.frame-like objects. In principle, changes to any object type can be logged, but it depends on the logger whether that will actually work – most will expect a particular data structure.
To use a different logger, just tell start_log() which logger to use. In the example below we use the built-in cellwise logger. This logger requires a key column that uniquely identifies the rows, so we add that first.
logfile <- tempfile(fileext = ".csv")
women$a_key <- seq_len(nrow(women))
out <- women %>>%
  start_log( log = cellwise$new(key="a_key") ) %>>%
  {.$height[1] <- 60; .} %>>%
  head(13) %>>%
  dump_log(file=logfile, stop=TRUE)
## Dumped a log at /tmp/RtmpxXXEYy/file317e7af793d1.csv
read.csv(logfile)
## step time expression key
## 1 1 2017-06-14 11:39:37 CEST {\n .$height[1] <- 60\n .\n} 1
## 2 2 2017-06-14 11:39:37 CEST head(13) 14
## 3 2 2017-06-14 11:39:37 CEST head(13) 14
## 4 2 2017-06-14 11:39:37 CEST head(13) 15
## 5 2 2017-06-14 11:39:37 CEST head(13) 15
## variable old new
## 1 height 58 60
## 2 height 71 NA
## 3 weight 159 NA
## 4 height 72 NA
## 5 weight 164 NA
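A cellwise log has one row per changed cell, so it lends itself to simple summaries. The sketch below uses a hand-typed copy of the log printed above (with the time and expression columns left out) rather than a fresh run, so it needs only base R. The names cell_log, changes and removed_keys are made up for the illustration.

```r
# Hand-typed copy of the cellwise log shown above (time and expression omitted).
cell_log <- data.frame(
  step     = c(1, 2, 2, 2, 2),
  key      = c(1, 14, 14, 15, 15),
  variable = c("height", "height", "weight", "height", "weight"),
  old      = c(58, 71, 159, 72, 164),
  new      = c(60, NA, NA, NA, NA)
)

# Number of logged cell changes per step (rows) and variable (columns).
changes <- table(cell_log$step, cell_log$variable)
changes

# Keys of records whose cells became NA, i.e. the rows dropped by head(13).
removed_keys <- unique(cell_log$key[is.na(cell_log$new)])
removed_keys
```

Step 1 accounts for the single edit to height, while step 2 accounts for the four cells that disappeared with rows 14 and 15.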
There are two ways to change how a logger behaves: by setting options at initialization and by setting options when dumping a log.
The start_log function adds a logging object as an attribute to its input data. By default, this is the simple logger, which only checks whether data has changed at all. The behavior of this logger can be changed by passing options when it is created. To see this, have a look at the complete call, as it is executed by default.
dat <- start_log(women, log = simple$new())
The expression simple$new() creates a new logging object, and start_log makes sure it is attached as an attribute to the copy of the women dataset stored in dat. The simple logger has one option, called verbose, that can be set when calling $new. The default is TRUE; here we set it to FALSE.
dat <- start_log(women, log=simple$new(verbose=FALSE))
The effect is that no message is printed when the log is dumped to file.
out <- dat %>>% identity() %>>% dump_log()
read.csv("simple_log.csv")
## step time expression changed
## 1 1 2017-06-14 11:39:37 identity() FALSE
Note that the available options depend on the logger you use. Look at the logger’s help file (?simple, ?cellwise) to see all options.
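One way to check that start_log really attaches a logging object to the data is to compare attribute names before and after the call. This is only a sketch: it assumes lumberjack is installed, and it deliberately does not rely on the name of the attribute, which is an internal detail of the package. The variable name extra_attributes is made up for the illustration.

```r
library(lumberjack)

dat <- start_log(women, log = simple$new(verbose = FALSE))

# Attribute names present on the logged copy but not on the original data.
extra_attributes <- setdiff(names(attributes(dat)), names(attributes(women)))
extra_attributes

# The values themselves are untouched; only the attributes differ.
identical(dat$height, women$height)
```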
For the simple logger, the default output file is simple_log.csv. This can be changed when calling dump_log.
out <- dat %>>%
  start_log() %>>%
  identity() %>>%
  dump_log(file="log_all_day.csv")
## Dumped a log at /tmp/RtmpcnsPBw/Rbuild316977a77a36/lumberjack/vignettes/log_all_day.csv
read.csv("log_all_day.csv")
## step time expression changed
## 1 1 2017-06-14 11:39:37 identity() FALSE
The function dump_log passes most of its arguments to the logger’s $dump() method. See the help file of the logger for the options (?simple, ?cellwise).
Loggers can come in different forms. In principle, authors are free to use R6 classes (as is done here), Reference classes, or anything else that follows the lumberjack API. This means that the way logging objects are initialized may vary from logger to logger. Check the documentation of a logger to see how to operate it. Maintainers of packages that offer loggers that work with lumberjack are kindly requested to list lumberjack in the Enhances field of the DESCRIPTION file, so they can be found through lumberjack’s CRAN page.
There are several function composition (‘pipe’) operators in the R community, including magrittr, pipeR and yapo. All have different behavior.
The lumberjack operator behaves as a simplified version of the magrittr pipe operator. Here are some examples.
# pass the first argument to a function
1:3 %>>% mean()
# pass arguments using "."
TRUE %>>% mean(c(1,NA,3), na.rm = .)
# pass arguments to an expression, using "."
1:3 %>>% { 3 * .}
# in a more complicated expression, return "." explicitly
women %>>% { .$height <- 2*.$height; . }
The main differences are:
- there is no assignment-pipe like %<>%;
- it does not allow you to define functions in the magrittr style: a <- . %>% sin(.)
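Since magrittr-style function definitions are not available, a reusable processing step can instead be written as an ordinary function that uses the operator in its body. The sketch below assumes lumberjack is installed and loaded; the name half_sine is made up for illustration.

```r
library(lumberjack)

# Hypothetical helper: an ordinary function wrapping a short %>>% chain.
half_sine <- function(x) {
  x %>>% sin() %>>% { . / 2 }
}

result <- half_sine(c(0, pi / 2))
result
```

The function can then be used as a single step in a longer chain, for example x %>>% half_sine().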
Logging changes to objects that are not data frames is possible, but the logger has to support it. The simple logger works for any object, but the cellwise logger works on data.frame-like objects only.
out <- 1:3 %>>%
  start_log() %>>%
  {.*2} %>>%
  dump_log(file="foo.csv", stop=TRUE)
## Dumped a log at /tmp/RtmpcnsPBw/Rbuild316977a77a36/lumberjack/vignettes/foo.csv
print(out)
## [1] 2 4 6
read.csv("foo.csv")
## step time expression changed
## 1 1 2017-06-14 11:39:37 {\n . * 2\n} TRUE