A Grammar of Tables

This package implements the concept of a grammar of tables. Given a simple formula expression and a data frame, it creates a rich summary table in a variety of formats. It is designed for extensibility at each step of the process, so that one is not limited by the author's choice of table statistics or output format. The grammar, however, is an integral part of the package and as such is not modifiable.

Here’s an example similar to summaryM from Hmisc to get us started:

summary_table("drug ~ bili + albumin + stage::Categorical + protime + sex + age + spiders", pbc)
===================================================================================================================== 
                                    N   D-penicillamine       placebo        not randomized       Test Statistic      
                                            (N=154)           (N=158)           (N=106)                               
--------------------------------------------------------------------------------------------------------------------- 
Serum Bilirubin  (mg/dl)           418  0.72 *1.30* 3.60  0.80 *1.40* 3.20  0.72 *1.40* 3.08  F_{2,415}=0.03, P=0.972 
Albumin  (gm/dl)                   418  3.34 *3.54* 3.78  3.21 *3.56* 3.83  3.12 *3.47* 3.72  F_{2,415}=2.13, P=0.120 
Histologic Stage, Ludwig Criteria  412                                                            X^2_6=5.33, P=0.502 
   1                                     0.026    4/154    0.076   12/158    0.047    5/106                           
   2                                     0.208   32/154    0.222   35/158    0.236   25/106                           
   3                                     0.416   64/154    0.354   56/158    0.330   35/106                           
   4                                     0.351   54/154    0.348   55/158    0.330   35/106                           
Prothrombin Time  (sec.)           416  10.0 *10.6* 11.4  10.0 *10.6* 11.0  10.1 *10.6* 11.0  F_{2,413}=0.23, P=0.795 
sex : female                       418   0.903  139/154    0.867  137/158    0.925   98/106       X^2_2=2.38, P=0.304 
Age                                418  41.4 *48.1* 55.8  43.0 *51.9* 58.9  46.0 *53.0* 61.0  F_{2,415}=6.10, P=0.002 
spiders : present                  312   0.292   45/154    0.285   45/158                         X^2_1=0.02, P=0.885 
===================================================================================================================== 

Or the same directly into an Rmarkdown pipe_table:

rmd(summary_table("drug ~ bili[2] + albumin + stage::Categorical + protime + sex + age + spiders", pbc))
                                      N  D-penicillamine   placebo           not randomized    Test Statistic
                                         (N=154)           (N=158)           (N=106)
Serum Bilirubin (mg/dl)             418  0.72  1.30  3.60  0.80  1.40  3.20  0.72  1.40  3.08  F_{2,415}=0.03, P=0.972
Albumin (gm/dl)                     418  3.34  3.54  3.78  3.21  3.56  3.83  3.12  3.47  3.72  F_{2,415}=2.13, P=0.120
Histologic Stage, Ludwig Criteria   412                                                        χ^2_6=5.33, P=0.502
   1                                     0.026      4/154  0.076     12/158  0.047      5/106
   2                                     0.208     32/154  0.222     35/158  0.236     25/106
   3                                     0.416     64/154  0.354     56/158  0.330     35/106
   4                                     0.351     54/154  0.348     55/158  0.330     35/106
Prothrombin Time (sec.)             416  10.0  10.6  11.4  10.0  10.6  11.0  10.1  10.6  11.0  F_{2,413}=0.23, P=0.795
sex : female                        418  0.903    139/154  0.867    137/158  0.925     98/106  χ^2_2=2.38, P=0.304
Age                                 418  41.4  48.1  55.8  43.0  51.9  58.9  46.0  53.0  61.0  F_{2,415}=6.10, P=0.002
spiders : present                   312  0.292     45/154  0.285     45/158                    χ^2_1=0.02, P=0.885

Notice that stage is not stored as a factor (i.e., a Categorical variable) in the data frame; adding the ::Categorical type specifier in the formula makes it be treated as Categorical. No preconversion is applied to the data frame, nor is a type guessed from the number of unique values. The formula specification provides full, direct control over typing.

It also supports HTML5 output, including styled fragments.

Hmisc Style Example

html5(summary_table("drug ~ bili[2] + albumin + stage::Categorical + protime + sex + age + spiders", pbc),
      fragment=TRUE, inline="hmisc.css", caption = "HTML5 Table Hmisc Style", id="tbl2")
HTML5 Table Hmisc Style
                                      N  D-penicillamine       placebo               not randomized        Test Statistic
                                         154                   158                   106
Serum Bilirubin mg/dl               418  0.72  1.30  3.60      0.80  1.40  3.20      0.72  1.40  3.08      F_{2,415}=0.03, P=0.972^1
Albumin gm/dl                       418  3.34  3.54  3.78      3.21  3.56  3.83      3.12  3.47  3.72      F_{2,415}=2.13, P=0.120^1
Histologic Stage, Ludwig Criteria   412                                                                    χ^2_6=5.33, P=0.502^2
   1                                     0.026   2.6    4/154  0.076   7.6   12/158  0.047   4.7    5/106
   2                                     0.208  20.8   32/154  0.222  22.2   35/158  0.236  23.6   25/106
   3                                     0.416  41.6   64/154  0.354  35.4   56/158  0.330  33.0   35/106
   4                                     0.351  35.1   54/154  0.348  34.8   55/158  0.330  33.0   35/106
Prothrombin Time sec.               416  10.0  10.6  11.4      10.0  10.6  11.0      10.1  10.6  11.0      F_{2,413}=0.23, P=0.795^1
sex : female                        418  0.903  90.3  139/154  0.867  86.7  137/158  0.925  92.5   98/106  χ^2_2=2.38, P=0.304^2
Age                                 418  41.4  48.1  55.8      43.0  51.9  58.9      46.0  53.0  61.0      F_{2,415}=6.10, P=0.002^1
spiders : present                   312  0.292  29.2   45/154  0.285  28.5   45/158                        χ^2_1=0.02, P=0.885^2
N is the number of non-missing value. ^1 Kruskal-Wallis test. ^2 Pearson test

NEJM Style Example

Each fragment can carry its own localized style sheet, scoped by the given id.

html5(summary_table("drug ~ bili[2] + albumin + stage::Categorical + protime + sex + age + spiders", pbc),
      fragment=TRUE, inline="nejm.css", caption = "HTML5 Table NEJM Style", id="tbl3")
HTML5 Table NEJM Style
                                      N  D-penicillamine       placebo               not randomized        Test Statistic
                                         154                   158                   106
Serum Bilirubin mg/dl               418  0.72  1.30  3.60      0.80  1.40  3.20      0.72  1.40  3.08      F_{2,415}=0.03, P=0.972^1
Albumin gm/dl                       418  3.34  3.54  3.78      3.21  3.56  3.83      3.12  3.47  3.72      F_{2,415}=2.13, P=0.120^1
Histologic Stage, Ludwig Criteria   412                                                                    χ^2_6=5.33, P=0.502^2
   1                                     0.026   2.6    4/154  0.076   7.6   12/158  0.047   4.7    5/106
   2                                     0.208  20.8   32/154  0.222  22.2   35/158  0.236  23.6   25/106
   3                                     0.416  41.6   64/154  0.354  35.4   56/158  0.330  33.0   35/106
   4                                     0.351  35.1   54/154  0.348  34.8   55/158  0.330  33.0   35/106
Prothrombin Time sec.               416  10.0  10.6  11.4      10.0  10.6  11.0      10.1  10.6  11.0      F_{2,413}=0.23, P=0.795^1
sex : female                        418  0.903  90.3  139/154  0.867  86.7  137/158  0.925  92.5   98/106  χ^2_2=2.38, P=0.304^2
Age                                 418  41.4  48.1  55.8      43.0  51.9  58.9      46.0  53.0  61.0      F_{2,415}=6.10, P=0.002^1
spiders : present                   312  0.292  29.2   45/154  0.285  28.5   45/158                        χ^2_1=0.02, P=0.885^2
N is the number of non-missing value. ^1 Kruskal-Wallis test. ^2 Pearson test

Lancet Style Example

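As before, the fragment carries its own style sheet, scoped by the given id. This style also reports p-values to four digits, which is handled by a cell transform applied after the table is built.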

# Lancet uses 4-digit p-values
p_digits_4 <- cell_transform(function(cell) {
  if("p" %in% names(cell)) cell$p <- form(cell$p, "%1.4f")
  cell
})

html5(summary_table("drug ~ bili[2] + albumin + stage::Categorical + protime + sex + age + spiders", pbc,
      after=p_digits_4),
      fragment=TRUE, inline="lancet.css", caption = "HTML5 Table Lancet Style", id="tbl4"
      )
HTML5 Table Lancet Style
                                      N  D-penicillamine       placebo               not randomized        Test Statistic
                                         154                   158                   106
Serum Bilirubin mg/dl               418  0.72  1.30  3.60      0.80  1.40  3.20      0.72  1.40  3.08      F_{2,415}=0.03, P=0.9725^1
Albumin gm/dl                       418  3.34  3.54  3.78      3.21  3.56  3.83      3.12  3.47  3.72      F_{2,415}=2.13, P=0.1200^1
Histologic Stage, Ludwig Criteria   412                                                                    χ^2_6=5.33, P=0.5024^2
   1                                     0.026   2.6    4/154  0.076   7.6   12/158  0.047   4.7    5/106
   2                                     0.208  20.8   32/154  0.222  22.2   35/158  0.236  23.6   25/106
   3                                     0.416  41.6   64/154  0.354  35.4   56/158  0.330  33.0   35/106
   4                                     0.351  35.1   54/154  0.348  34.8   55/158  0.330  33.0   35/106
Prothrombin Time sec.               416  10.0  10.6  11.4      10.0  10.6  11.0      10.1  10.6  11.0      F_{2,413}=0.23, P=0.7947^1
sex : female                        418  0.903  90.3  139/154  0.867  86.7  137/158  0.925  92.5   98/106  χ^2_2=2.38, P=0.3039^2
Age                                 418  41.4  48.1  55.8      43.0  51.9  58.9      46.0  53.0  61.0      F_{2,415}=6.10, P=0.0024^1
spiders : present                   312  0.292  29.2   45/154  0.285  28.5   45/158                        χ^2_1=0.02, P=0.8853^2
N is the number of non-missing value. ^1 Kruskal-Wallis test. ^2 Pearson test
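
To see what such a transform does to a single cell, here is a minimal base R sketch. It assumes a cell is a plain named list that may carry a `p` element, and it uses `sprintf` as a stand-in for the package's `form` helper. The name `format_p4` and the two example cells are hypothetical, made up for illustration; the numeric values are taken from the bili F-test entry of the index shown in the Indexing section.

```r
# Hypothetical stand-in for the body of the cell transform above:
# format the p-value of a cell (if it has one) to four decimal places.
format_p4 <- function(cell) {
  if ("p" %in% names(cell)) cell$p <- sprintf("%1.4f", cell$p)
  cell
}

# A test-statistic cell carries a p-value; a count cell does not.
f_cell <- list(F = 0.0279075093333664, p = 0.972480132693603)
n_cell <- list(N = 154)

format_p4(f_cell)$p                   # "0.9725"
identical(format_p4(n_cell), n_cell)  # TRUE: cells without a p-value pass through
```

Because the transform returns every cell, changed or not, it can be applied uniformly across a whole table without knowing in advance which cells hold test statistics.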

Indexing

The package can also produce an index of a table's contents for traceability.

index(summary_table("drug ~ bili + albumin + stage::Categorical + protime + sex + age + spiders", pbc))[1:20,]
      key    src                                            
 [1,] "MTI1" "Table:bili:drug[D-penicillamine]:N"           
 [2,] "ODI3" "Table:bili:drug[placebo]:N"                   
 [3,] "Zjg4" "Table:bili:drug[not randomized]:N"            
 [4,] "ZjZm" "Table:bili:drug:N"                            
 [5,] "ZDYw" "Table:bili:drug[D-penicillamine]:quantile"    
 [6,] "ZGI1" "Table:bili:drug[placebo]:quantile"            
 [7,] "OGM4" "Table:bili:drug[not randomized]:quantile"     
 [8,] "YTI1" "Table:bili:drug:F"                            
 [9,] "ZDhi" "Table:albumin:drug:N"                         
[10,] "YzEy" "Table:albumin:drug[D-penicillamine]:quantile" 
[11,] "ODBm" "Table:albumin:drug[placebo]:quantile"         
[12,] "MzQy" "Table:albumin:drug[not randomized]:quantile"  
[13,] "ZDlm" "Table:albumin:drug:F"                         
[14,] "ODZk" "Table:stage:drug:N"                           
[15,] "MjUx" "Table:stage:drug:htest"                       
[16,] "NTUx" "Table:stage[1]:drug[D-penicillamine]:fraction"
[17,] "Zjdl" "Table:stage[1]:drug[placebo]:fraction"        
[18,] "Yjk2" "Table:stage[1]:drug[not randomized]:fraction" 
[19,] "YWQy" "Table:stage[2]:drug[D-penicillamine]:fraction"
[20,] "NmEy" "Table:stage[2]:drug[placebo]:fraction"        
      value                                        
 [1,] "154"                                        
 [2,] "158"                                        
 [3,] "106"                                        
 [4,] "418"                                        
 [5,] "1.3 [0.725, 3.6]"                           
 [6,] "1.4 [0.8, 3.2]"                             
 [7,] "1.4 [0.725, 3.075]"                         
 [8,] "F=0.0279075093333664, p = 0.972480132693603"
 [9,] "418"                                        
[10,] "3.545 [3.3425, 3.7775]"                     
[11,] "3.565 [3.2125, 3.83]"                       
[12,] "3.47 [3.125, 3.72]"                         
[13,] "F=2.13150432275865, p = 0.119955914166202"  
[14,] "412"                                        
[15,] "chisq=5.32908004628618, p=0.50235045718865" 
[16,] "0.026  4/154"                               
[17,] "0.076  12/158"                              
[18,] "0.047  5/106"                               
[19,] "0.208  32/154"                              
[20,] "0.222  35/158"                              
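
Since the index is printed as a character matrix with `key`, `src` and `value` columns, ordinary matrix subsetting should be enough to trace a number back to its origin. The sketch below rebuilds a few rows of the output above by hand (so it runs without the package) and assumes the real index has the same three columns.

```r
# A few rows copied from the index output above.
idx <- matrix(
  c("MTI1", "Table:bili:drug[D-penicillamine]:N", "154",
    "ZjZm", "Table:bili:drug:N",                  "418",
    "YTI1", "Table:bili:drug:F",                  "F=0.0279075093333664, p = 0.972480132693603",
    "ZDhi", "Table:albumin:drug:N",               "418"),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("key", "src", "value"))
)

# Look up a single cell by its key.
f_value <- idx[idx[, "key"] == "YTI1", "value"]
f_value

# Find every indexed cell that came from the bili row of the table.
bili_rows <- idx[grepl("^Table:bili:", idx[, "src"]), , drop = FALSE]
bili_rows[, "src"]
```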

Intercept Model Example

x <- round(rnorm(375, 79, 10))
y <- round(rnorm(375, 80,  9))
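# rbinom() returns 0/1 values, which are used here as positions, so only y[1] becomes NA (N=374 below)
y[rbinom(375, 1, prob=0.05)] <- NA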
attr(x, "label") <- "Global score, 3m"
attr(y, "label") <- "Global score, 12m"
html5(summary_table(1 ~ x+y,
                    data.frame(x=x, y=y),
                    after=hmisc_intercept_cleanup),
      fragment=TRUE, inline="lancet.css", caption="", id="tbl5")
                      N  All
Global score, 3m    375  72  79  86
Global score, 12m   374  74  80  86
N is the number of non-missing value. ^1 Kruskal-Wallis test. ^2 Pearson test

Types

The Hmisc default style recognizes 3 types: Categorical, Binomial, and Numerical. For each pairing of these types, one for the row variable and one for the column variable, a function is provided to generate the corresponding rows and columns. As mentioned before, the user can declare any type in a formula and is not limited to the Hmisc defaults. This is completely customizable, which will be covered later.
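
One way to picture this pairing is a lookup keyed by row type and then column type. The following is only a sketch of that idea, not the package's internal structure: `compilers` and `choose_compiler` are hypothetical names, the function bodies are placeholders, and of the three compiler names only `summarize_chisq` appears in this document.

```r
# Placeholder compilers: each would normally build table rows and columns.
# summarize_chisq is named in the text; the other two names are hypothetical.
summarize_chisq    <- function(row, column) "counts, fractions and a chi-squared test"
summarize_by_group <- function(row, column) "quantiles per group and a test statistic"
summarize_ordinal  <- function(row, column) "a user-supplied summary"

# Lookup keyed first by the row type, then by the column type.
compilers <- list(
  Categorical = list(Categorical = summarize_chisq),
  Numerical   = list(Categorical = summarize_by_group)
)

choose_compiler <- function(row_type, col_type) {
  f <- compilers[[row_type]][[col_type]]
  if (is.null(f)) stop("no compiler for ", row_type, " x ", col_type)
  f
}

# A user-declared type only needs a new entry in the lookup.
compilers$Ordinal <- list(Categorical = summarize_ordinal)

choose_compiler("Categorical", "Categorical")(NULL, NULL)
choose_compiler("Ordinal", "Categorical")(NULL, NULL)
```

Keeping the choice in data rather than in branching code is what makes it cheap to add a type that the defaults do not know about.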

Let’s cover the phases of table generation.

  1. Syntax. The formula is parsed into an abstract syntax tree (AST), factors are right-distributed, and the data frame is split into appropriate pieces attached to each node of the AST. The syntax and parser are the only portions of this library that are fixed and not customizable. The grammar may expand with time, but cautiously, so as not to create an overly verbose set of possibilities to interpret. The goal is a clean grammar that describes the bold areas of a table to fill in.
  2. Semantics. The elements of the AST are examined and passed to compilation functions. The compilation function is chosen by determining the type of the row variable and the type of the column variable. For example, drug ~ stage::Categorical is a Categorical × Categorical pairing, which references summarize_chisq for compiling. One can easily specify different compilers and get very different results from the same formula. Note: multiplication (*) cannot be applied in the previous phase, because what multiplication means is a semantic question: in one context it might be an interaction, in another simple multiplication. Handling multiplicative terms can be tricky. Once compiling is finished, the result is a table object composed of cells (a list of lists), each of which is one of a variety of S3 types.
  3. Rendering. With a compiled table object in memory, the final stage is conversion to an output format, which could be plain text, HTML5, LaTeX, or anything else. Rendering is overridable via the S3 classes that represent the different types of cells present in a table. User-specified rendering is possible as well.
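
The syntax phase can be approximated with base R alone, because R's own parser already reads `stage::Categorical` as a call to `::` and `bili[2]` as a call to `[`. The sketch below is not the package's parser: `collect_terms` is a hypothetical helper that only handles terms joined by `+` and ignores right distribution and the data frame split.

```r
# Walk a parsed expression and collect each additive term together with
# its declared type (NA when the formula does not declare one).
collect_terms <- function(expr) {
  if (is.call(expr) && length(expr) == 3 && identical(expr[[1]], as.name("+"))) {
    # Binary plus: recurse into both sides and concatenate the results.
    return(c(collect_terms(expr[[2]]), collect_terms(expr[[3]])))
  }
  if (is.call(expr) && identical(expr[[1]], as.name("::"))) {
    # name::Type carries an explicit type declaration.
    return(list(list(name = deparse(expr[[2]]), type = deparse(expr[[3]]))))
  }
  list(list(name = deparse(expr), type = NA_character_))
}

# Formulas in this document are given as strings, so parse the text first.
ast <- parse(text = "drug ~ bili[2] + albumin + stage::Categorical + protime")[[1]]

column     <- deparse(ast[[2]])          # left-hand side of ~
terms      <- collect_terms(ast[[3]])    # right-hand side of ~
term_names <- vapply(terms, function(t) t$name, character(1))
term_types <- vapply(terms, function(t) t$type, character(1))

column       # "drug"
term_names   # "bili[2]" "albumin" "stage" "protime"
term_types   # NA NA "Categorical" NA
```

Terms without a declared type come back as `NA` here; deciding what type they get is left to the semantics phase.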

Summary columns

A simple example of using an intercept in a formula to add an overall summary column, with some post-processing to remove undesired columns.

d1 <- iris
d1$A <- d1$Sepal.Length > 5.1
attr(d1$A,"label") <- "Sepal Length > 5.1"
tbl1 <- summary_table(
 Species + 1 ~ A + Sepal.Width,
 data = d1,
 after = list(drop_statistics, function(tbl) del_col(tbl, 6))
 )

html5(tbl1,
     fragment=TRUE, inline="nejm.css", caption = "Example All Summary", id="tbl1")
Example All Summary
                              N  setosa                versicolor            virginica             All
                                 50                    50                    50                    150
Sepal Length > 5.1 : TRUE   150  0.280  28.0    14/50  0.920  92.0    46/50  0.980  98.0    49/50  0.727  72.7  109/150
Sepal.Width                 150  3.20  3.40  3.68      2.52  2.80  3.00      2.80  3.00  3.18      2.80  3.00  3.30
N is the number of non-missing value. ^1 Kruskal-Wallis test. ^2 Pearson test

Extensibility

The library is designed to be extensible, in the hope that more useful summary functions can be written whose results render into a wide variety of formats. Extension happens through the translator functions, which take a row and a column from a formula and process the data into a table.

This example shows how to create a function that, given a row and a column, constructs summary entries for a table.

### Make up some data, which has events nested within an id
n  <- 1000
df <- data.frame(id = sample(1:250, n*3, replace=TRUE), event = as.factor(rep(c("A", "B","C"), n)))
attr(df$id, "label") <- "ID"

### Now create custom function for counting events with a category
summarize_count <- function(table, row, column)
{
  ### Getting Data for row column ast nodes, assuming no factors
  datar <- row$data
  datac <- column$data

  ### Grabbing categories
  col_categories <- levels(datac)

  n_labels <- lapply(col_categories, FUN=function(cat_name){
    x <- datar[datac == cat_name]
    # Worst interface complexity example. Work in progress to simplify
    tg(tg_N(length(unique(x))), row, column, subcol=cat_name)
  })

  # Test a poisson model
  test <- aov(glm(x ~ treatment,
                  aggregate(datar, by=list(id=datar, treatment=datac), FUN=length),
                  family=poisson))
  # Build the table
  table                                              %>%
  # Create Headers
  row_header(derive_label(row))                      %>%
  col_header("N", col_categories, "Test Statistic")  %>%
  col_header("",  n_labels,       ""              )  %>%
  # Add the First column of summary data as an N value
  add_col(tg_N(length(unique(datar))))               %>%
  # Now add quantiles for the counts
  table_builder_apply(col_categories, FUN=
    function(tbl, cat_name) {
      # Compute each data set
      x  <- datar[datac == cat_name]
      xx <- aggregate(x, by=list(x), FUN=length)$x

      # Add a column that is a quantile
      add_col(tbl, tg_quantile(xx, row$format, na.rm=TRUE))
  })                                                 %>%
  # Now add a statistical test for the final column
  add_col(test)
}

summary_table(event ~ id["%1.0f"], df, summarize_count)
=========================================================== 
     N      A        B        C         Test Statistic      
         (N=247)  (N=240)  (N=242)                          
----------------------------------------------------------- 
ID  250  3 *4* 5  3 *4* 6  3 *4* 5  F_{2,726}=0.23, P=0.798 
===========================================================
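
The numbers that a function like summarize_count feeds into the table can be checked in isolation with base R. The sketch below regenerates data the same way as above and computes, for each event category, the number of distinct ids and the quartiles of events per id. The `set.seed` call and the helper name `events_per_id` are assumptions added for illustration, so the values will not match the table above, which was produced from an unseeded sample.

```r
set.seed(1)  # assumption: a seed, so the sketch is reproducible
n  <- 1000
df <- data.frame(id    = sample(1:250, n * 3, replace = TRUE),
                 event = as.factor(rep(c("A", "B", "C"), n)))

# For one event category: count the events each id had, then summarize.
events_per_id <- function(cat_name) {
  ids    <- df$id[df$event == cat_name]
  counts <- aggregate(ids, by = list(id = ids), FUN = length)$x
  c(n_ids = length(unique(ids)), quantile(counts, c(0.25, 0.5, 0.75)))
}

# One column per category, mirroring the A, B and C columns of the table.
res <- sapply(levels(df$event), events_per_id)
res
```

The `n_ids` row corresponds to the (N=...) header entries and the three quantile rows to the cell contents.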