1. Introduction to pivottabler

Chris Bailiss

2017-03-27

pivottabler Development Status

The pivottabler package has undergone rapid development in a relatively short period. Some rough edges exist, and very likely some bugs too.

The latest version of the pivottabler package can be obtained directly from the package repository. Please log bug reports, and any questions not answered by the vignettes, in the repository.

Pivot Tables

Definition

Pivot tables are a common technique for summarising large tables of data into smaller and more easily understood summary tables to answer specific questions.

Starting from a specific question that requires answering, the variables relevant to the question are identified. The distinct values of the fixed variables[1] are rendered as a mixture of row and column headings in the summary table. One or more aggregations of the (numerical) measured variables are added into the body of the table. The summary table should then yield a concise answer to the original question.
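
As a rough sketch of this definition, the base R snippet below builds a tiny summary table without pivottabler. The data frame, its column names (operator, category, delay_mins) and its values are made up for illustration and are not part of the sample data used later.

```r
# Hypothetical data: one row per train (values invented for illustration)
trains <- data.frame(
  operator   = c("A", "A", "B", "B", "B"),
  category   = c("Express", "Ordinary", "Express", "Express", "Ordinary"),
  delay_mins = c(2, 0, 5, 1, 3)
)

# Fixed variables (operator, category) become the row and column headings;
# each cell counts the trains that fall into that combination
counts <- xtabs(~ operator + category, data = trains)

# addmargins() appends the row and column totals
counts_with_totals <- addmargins(counts)
print(counts_with_totals)

# A measured variable (delay_mins) aggregated into the same layout
mean_delay <- tapply(trains$delay_mins,
                     list(trains$operator, trains$category), mean)
print(mean_delay)
```

The first table answers "how many trains per operator and category?", the second "what was the mean delay per operator and category?". Both have the same row and column headings; only the aggregation in the body differs.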

In reality

The definition above is probably more difficult to understand than just looking at some examples - several are presented in this vignette. An extended definition is also provided by Wikipedia.

Pivot tables can be found in everyday use within many commercial and non-commercial organisations. They feature prominently in applications such as Microsoft Excel, OpenOffice, etc. More advanced forms are found in Business Intelligence (BI) and Online Analytical Processing (OLAP) tools.

The pivottabler package:

Since pivot tables are primarily visualisation tools, the pivottabler package offers several custom styling options as well as conditional/custom formatting capabilities.

Output is rendered as HTML via the htmlwidgets framework. The generated HTML can also be easily retrieved, e.g. to be used outside of R.

Sample Data: Trains in Birmingham

To build a series of example pivot tables, we will use the bhmtrains data frame. This contains all 83,710 trains that arrived into and/or departed from Birmingham New Street railway station between 1st December 2016 and 28th February 2017. As an example, the following are four trains that arrived into Birmingham New Street at the very start of this time period - note the data has been transposed (otherwise the table would be very wide).
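
A transposed view along these lines can be produced with base R, assuming the pivottabler package (which supplies bhmtrains) is installed. This is only a sketch: it shows the first four rows of the data frame, which are not guaranteed to be the same four trains described here.

```r
library(pivottabler)   # supplies the bhmtrains sample data

# Take the first four rows and transpose, so each train becomes a column
first_four <- t(head(bhmtrains, 4))
print(first_four)

# Number of trains in the full data set
print(nrow(bhmtrains))
```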

GbttArrival and GbttDeparture are the scheduled arrival and departure times of the trains at Birmingham New Street, as advertised in the Great Britain Train Timetable (GBTT). Also given are the actual arrival and departure times of the trains at Birmingham New Street. Note that all four of the trains above terminated at New Street, hence they have arrival times but no departure times. The origin and destination stations of each train are also included, in the form of three-letter station codes, e.g. BHM = Birmingham New Street. The trainstations data frame (used later in this vignette) includes a lookup from the code to the full station name for all stations.

The first train above:

Basic Pivot Table

Suppose we want to answer the question: How many ordinary/express passenger trains did each train operating company (TOC) operate in the three month period?

The following code will generate the relevant pivot table:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

Each line above works as follows:

  1. Load the pivottabler package.
  2. Create a new pivot table instance[3].
  3. Specify the data frame that contains the data for the pivot table.
  4. Add the distinct values from the TrainCategory column in the data frame as columns in the pivot table.
  5. Add the distinct values from the TOC column in the data frame as rows in the pivot table.
  6. Specify the calculation. The summarise expression must be valid inside the dplyr summarise() function, which pivottabler uses internally to evaluate it[4].
  7. Generate the pivot table.
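
To make the role of the summarise expression more concrete, the sketch below computes roughly the same cell counts directly with dplyr. It illustrates what an expression such as n() means inside summarise(); it is not a description of how pivottabler is implemented internally. It assumes the dplyr and pivottabler packages are installed, and the name cell_counts is introduced here purely for illustration.

```r
library(dplyr)
library(pivottabler)   # only needed here for the bhmtrains sample data

# One row per TOC/TrainCategory combination, counting the trains in each
cell_counts <- bhmtrains %>%
  group_by(TOC, TrainCategory) %>%
  summarise(TotalTrains = n()) %>%
  ungroup()

print(cell_counts)
```

Each row of cell_counts corresponds to one cell in the body of the pivot table. The row, column and grand totals that the pivot table shows are not included in this result.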

Constructing the Basic Pivot Table

The following examples show how each line in the above example constructs the pivot table. To improve readability, each code change is highlighted.

# produces no pivot table
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$renderPivot()

# specify the column headings
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")   #    << **** LINE ADDED **** <<
pt$renderPivot()

# specify the row headings
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")                #    << **** LINE ADDED **** <<
pt$renderPivot()

# specify a calculation
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")                #     **** LINE BELOW ADDED ****
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

Extending the Basic Pivot Table

Below is a progressive series of changes to the basic pivot table shown above. Each change is made by adding or changing one line of code. Again, to improve readability, each code change is highlighted.

First, adding an additional column data group to sub-divide each “TrainCategory” by “PowerType”:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType")    #    << **** CODE CHANGE **** <<
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

By default, the new data group does not expand the existing “TrainCategory” total. However, an additional argument allows the total column to also be expanded:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType", expandExistingTotals=TRUE) # << ** CODE CHANGE ** <<
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

Instead of adding “PowerType” as columns, it can also be added as rows:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$addRowDataGroups("PowerType")    #    << **** CODE CHANGE **** <<
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

It is possible to continue adding additional data groups. The pivottabler package enforces no maximum depth of data groups. For example, adding the maximum scheduled speed to the rows:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$addRowDataGroups("PowerType")
pt$addRowDataGroups("SchedSpeedMPH")    #    << **** CODE CHANGE **** <<
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

As more data groups are added, the pivot table can seem overwhelmed with totals. It is possible to selectively show/hide totals using the addTotal argument. Totals can be renamed using the totalCaption argument. Both of these options are demonstrated below.

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC", totalCaption="Grand Total")    #    << **** CODE CHANGE **** <<
pt$addRowDataGroups("PowerType")
pt$addRowDataGroups("SchedSpeedMPH", addTotal=FALSE)      #    << **** CODE CHANGE **** <<
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()
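
The generated HTML mentioned earlier can be retrieved from a pivot table object instead of being rendered. The sketch below assumes the evaluatePivot() and getHtml() methods of the PivotTable class, and that getHtml() returns an object that as.character() can convert to an HTML string. Check the package documentation for the installed version before relying on either.

```r
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$evaluatePivot()                    # calculate the cell values without rendering
html <- as.character(pt$getHtml())    # assumed to yield the table markup as text
cat(substr(html, 1, 300))             # show the start of the HTML
```

The resulting string can then be written to a file or embedded in another document outside of R.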

Further Reading

The pivottabler package has many more capabilities. More details can be found in the other vignettes. The full set of vignettes is:

  1. Introduction
  2. Data Groups
  3. Calculations
  4. Styling
  5. Shiny

  1. The terms “fixed variables” and “measured variables” are used here as in Wickham 2014

  2. This is the identifier assigned by the Recent Train Times website, the source of this sample data

  3. pivottabler is implemented in R6 Classes so pt here is an instance of the R6 PivotTable class.

  4. See the dplyr cheatsheet for other summary functions.