The tabula package provides a set of S4 classes for archaeological data matrices that extend the basic matrix
data type. These new classes represent different special types of matrix.
CountMatrix represents count data.
FrequencyMatrix represents relative frequency data.
OccurrenceMatrix represents a co-occurrence matrix.
SimilarityMatrix represents a (dis)similarity matrix.
IncidenceMatrix represents presence/absence data.
The package assumes that you keep your data tidy: each variable (taxon/type) must be saved in its own column and each observation (assemblage/sample/case) must be saved in its own row. Note that missing values are not allowed.
Methods for a variety of functions applied to objects from these classes provide tools for analysis, seriation and dating of archaeological assemblages. See help(methods) to list all available methods for these classes.
The internal structure of S4 classes implemented in tabula is depicted in the UML class diagram in the following figure.
UML class diagram of the S4 classes structure.
We denote the \(m \times p\) count matrix by \(A = \left[ a_{ij} \right] ~\forall i \in \left[ 1,m \right], j \in \left[ 1,p \right]\) with row and column sums:
\[\begin{align} a_{i \cdot} = \sum_{j = 1}^{p} a_{ij} && a_{\cdot j} = \sum_{i = 1}^{m} a_{ij} && a_{\cdot \cdot} = \sum_{i = 1}^{m} \sum_{j = 1}^{p} a_{ij} && \forall a_{ij} \in \mathbb{N} \end{align}\]
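As a rough illustration of this notation, the marginal sums can be computed on a plain base R integer matrix. The sketch below does not use the tabula classes; the object name `A` and its values are made up for the example.

```r
## Hypothetical 3 x 4 count matrix (rows: assemblages, columns: types).
## matrix() fills column by column, so each line below is one column.
A <- matrix(
  data = c(2L, 0L, 5L,
           1L, 3L, 0L,
           4L, 4L, 1L,
           0L, 2L, 6L),
  nrow = 3, ncol = 4
)

row_totals  <- rowSums(A)  # a_{i.}: one total per assemblage
col_totals  <- colSums(A)  # a_{.j}: one total per type
grand_total <- sum(A)      # a_{..}

## Both margins add up to the grand total
stopifnot(sum(row_totals) == grand_total,
          sum(col_totals) == grand_total)
```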
A frequency matrix represents relative abundances.
We denote the \(m \times p\) frequency matrix by \(B = \left[ b_{ij} \right] ~\forall i \in \left[ 1,m \right], j \in \left[ 1,p \right]\) with row and column sums:
\[\begin{align} b_{i \cdot} = \sum_{j = 1}^{p} b_{ij} = 1 && b_{\cdot j} = \sum_{i = 1}^{m} b_{ij} && b_{\cdot \cdot} = \sum_{i = 1}^{m} \sum_{j = 1}^{p} b_{ij} && \forall b_{ij} \in \left[ 0,1 \right] \end{align}\]
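The constraint \(b_{i \cdot} = 1\) can be checked on a small example. This is a base R sketch on a made-up matrix `A`, using `prop.table()` rather than the tabula classes; it assumes that every row has a non-zero total, otherwise the division is undefined.

```r
## Hypothetical 3 x 4 count matrix (values are illustrative only)
A <- matrix(c(2L, 0L, 5L, 1L, 3L, 0L, 4L, 4L, 1L, 0L, 2L, 6L), nrow = 3)

## Divide each cell by its row total: b_{ij} = a_{ij} / a_{i.}
B <- prop.table(A, margin = 1)

## Every row sums to 1 and every value lies in [0, 1]
stopifnot(isTRUE(all.equal(rowSums(B), rep(1, nrow(B)))),
          all(B >= 0 & B <= 1))
```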
A co-occurrence matrix is a symmetric matrix with zeros on its main diagonal. It records how many times (expressed in percent) each pair of taxa occurs together in at least one sample.
The \(p \times p\) co-occurrence matrix \(D = \left[ d_{i,j} \right] ~\forall i,j \in \left[ 1,p \right]\) is defined over an \(m \times p\) abundance matrix \(A = \left[ a_{x,y} \right] ~\forall x \in \left[ 1,m \right], y \in \left[ 1,p \right]\) as:
\[ d_{i,j} = \sum_{x = 1}^{m} \bigcap_{y = i}^{j} a_{xy} \]
with row and column sums:
\[\begin{align} d_{i \cdot} = \sum_{j \geqslant i}^{p} d_{ij} && d_{\cdot j} = \sum_{i \leqslant j}^{p} d_{ij} && d_{\cdot \cdot} = \sum_{i = 1}^{p} \sum_{j \geqslant i}^{p} d_{ij} && \forall d_{ij} \in \mathbb{N} \end{align}\]
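One possible reading of this definition is that \(d_{ij}\) counts the samples in which types \(i\) and \(j\) are both present. The base R sketch below follows that reading on a made-up matrix `A`; it is an assumption for illustration, not the code behind the OccurrenceMatrix class, and it returns raw counts rather than percentages.

```r
## Hypothetical 3 x 4 count matrix (values are illustrative only)
A <- matrix(c(2L, 0L, 5L, 1L, 3L, 0L, 4L, 4L, 1L, 0L, 2L, 6L), nrow = 3)

## 1 if the type is recorded in the sample, 0 otherwise
present <- (A > 0) * 1L

## t(present) %*% present: entry (i, j) counts the samples
## where types i and j are both present
D <- crossprod(present)

## The text above states that the main diagonal holds zeros
diag(D) <- 0

stopifnot(isSymmetric(D))
```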
We denote the \(m \times p\) incidence matrix by \(C = \left[ c_{ij} \right] ~\forall i \in \left[ 1,m \right], j \in \left[ 1,p \right]\) with row and column sums:
\[\begin{align} c_{i \cdot} = \sum_{j = 1}^{p} c_{ij} && c_{\cdot j} = \sum_{i = 1}^{m} c_{ij} && c_{\cdot \cdot} = \sum_{i = 1}^{m} \sum_{j = 1}^{p} c_{ij} && \forall c_{ij} \in \lbrace 0,1 \rbrace \end{align}\]
These new classes are simple to use, in the same way as the base matrix:
library(tabula)
library(magrittr) # provides the %>% pipe used in the plotting examples
set.seed(12345)
## Create a count data matrix
CountMatrix(data = sample(0:10, 100, TRUE),
nrow = 10, ncol = 10)
#> 10 x 10 count data matrix:
#> (79c4a59f-2ac3-4f91-b755-e0b32b428d94)
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> 1 2 6 2 3 9 7 9 3 3 6
#> 2 9 9 8 7 9 10 6 8 9 0
#> 3 7 0 3 10 2 3 6 10 3 2
#> 4 9 7 9 5 2 1 4 0 8 1
#> 5 10 6 6 8 2 2 6 2 1 4
#> 6 7 5 1 4 0 5 9 9 7 9
#> 7 1 0 3 2 9 2 7 6 9 5
#> 8 5 3 10 0 7 6 2 9 0 6
#> 9 10 7 8 0 10 9 4 9 8 8
#> 10 5 9 8 4 8 6 10 6 5 9
## Create an incidence (presence/absence) matrix
## Numeric values are coerced to logical as by as.logical
IncidenceMatrix(data = sample(0:1, 100, TRUE),
nrow = 10, ncol = 10)
#> 10 x 10 presence/absence data matrix:
#> (a5a9e049-cf64-4aff-8ef7-9773c4ffe7ff)
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> 1 TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
#> 2 TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> 3 TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE
#> 4 TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
#> 5 FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
#> 6 TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
#> 7 TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
#> 8 FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
#> 9 FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
#> 10 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Note that a FrequencyMatrix can only be created by coercion (see below).
tabula uses coercion mechanisms (with validation methods) for data type conversions:
## Create a count matrix
## Numeric values are coerced to integer and hence truncated towards zero
A1 <- CountMatrix(data = sample(0:10, 100, TRUE),
nrow = 10, ncol = 10)
## Coerce counts to frequencies
B <- as_frequency(A1)
## Row sums are internally stored before coercing to a frequency matrix
## (use totals() to get these values)
## This makes it possible to restore the source data
A2 <- as_count(B)
all(A1 == A2)
#> [1] TRUE
## Coerce to presence/absence
C <- as_incidence(A1)
## Coerce to a co-occurrence matrix
D <- as_occurrence(A1)
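The round trip above works because the row totals are kept alongside the frequencies. The following base R sketch mimics that idea on plain matrices; the variable `totals` and the rounding step are assumptions made for the example and do not describe how tabula stores or restores its data internally.

```r
## Hypothetical 3 x 4 count matrix (values are illustrative only)
A1 <- matrix(c(2L, 0L, 5L, 1L, 3L, 0L, 4L, 4L, 1L, 0L, 2L, 6L), nrow = 3)

## Keep the row totals before converting to frequencies
totals <- rowSums(A1)
B <- A1 / totals            # each row divided by its own total

## Multiply back by the stored totals; round() guards against
## floating-point error before converting back to integers
A2 <- round(B * totals)
storage.mode(A2) <- "integer"
stopifnot(identical(A1, A2))

## Presence/absence only needs to know whether a count is positive
C <- A1 > 0
```

Without the stored totals, the frequencies alone would not determine the original counts, since any multiple of a row gives the same proportions.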
Several types of graphs are available in tabula, which uses ggplot2 for plotting. This makes it easy to customize diagrams (e.g. using themes and scales).
A spot matrix allows direct examination of the data (above/below some threshold):
## Plot frequencies with the column means as a threshold
mississippi %>%
as_count() %>%
plot_spot(threshold = mean) +
ggplot2::labs(size = "Frequency", colour = "Mean") +
khroma::scale_colour_vibrant()
Spot plot
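To make the thresholding idea concrete, the sketch below flags which cells of a frequency matrix lie above their column mean. It uses base R on a made-up matrix and only assumes that the threshold function is applied column by column, as the comment in the example above suggests; it is not taken from the plot_spot source.

```r
## Hypothetical 3 x 4 count matrix (values are illustrative only)
A <- matrix(c(2L, 0L, 5L, 1L, 3L, 0L, 4L, 4L, 1L, 0L, 2L, 6L), nrow = 3)

## Relative frequencies per assemblage (row)
B <- prop.table(A, margin = 1)

## One threshold per type (column): here, the column mean
thresholds <- colMeans(B)

## TRUE where a frequency exceeds the threshold of its column
above <- sweep(B, MARGIN = 2, STATS = thresholds, FUN = ">")
```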
## Plot co-occurrence of types
## (i.e. how many times (in percent) each pair of taxa occurs together
## in at least one sample)
mississippi %>%
as_occurrence() %>%
plot_spot() +
ggplot2::labs(size = "", colour = "Co-occurrence") +
ggplot2::theme(legend.box = "horizontal") +
khroma::scale_colour_YlOrBr()
Spot plot of co-occurrence
An abundance matrix can be displayed as a heatmap of relative abundances (frequencies), or as percentages of the independence value (in French, “pourcentages de valeur d’indépendance”, PVI).
Heatmap
PVI is calculated for each cell as the percentage of the column's theoretical independence value: a PVI greater than \(1\) represents a positive deviation from independence, whereas a PVI smaller than \(1\) represents a negative deviation (Desachy 2004). The PVI matrix makes it possible to explore deviations from independence (an intuitive graphical approach to \(\chi^2\)): a high-contrast matrix has quite significant deviations, with a low risk of being due to randomness (Desachy 2004).
## Reproduce B. Desachy's matrigraphe
boves %>%
as_count() %>%
plot_heatmap(PVI = TRUE) +
khroma::scale_fill_BuRd(midpoint = 1)
Matrigraphe
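The PVI computation can be sketched in base R under one assumption: that the independence value of a cell is the usual expected count, row total times column total divided by the grand total. The result is expressed as a ratio so that \(1\) marks independence, matching the midpoint used above. The matrix `A` is made up, every row and column total is assumed to be non-zero, and this is not the code used by plot_heatmap.

```r
## Hypothetical 3 x 4 count matrix (values are illustrative only)
A <- matrix(c(2L, 0L, 5L, 1L, 3L, 0L, 4L, 4L, 1L, 0L, 2L, 6L), nrow = 3)

## Expected counts under independence: a_{i.} * a_{.j} / a_{..}
expected <- outer(rowSums(A), colSums(A)) / sum(A)

## Observed over expected: > 1 is a positive deviation,
## < 1 a negative deviation
pvi <- A / expected

## The expected counts keep the observed margins
stopifnot(isTRUE(all.equal(rowSums(expected), rowSums(A))),
          isTRUE(all.equal(colSums(expected), colSums(A))))
```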
Bertin (1977) or Ford (1962) (battleship curve) diagrams can also be plotted, with a statistical threshold.
Bertin diagram
Ford diagram
Bertin, Jacques. 1977. La graphique et le traitement graphique de l’information. Nouvelle Bibliothèque Scientifique. Paris: Flammarion.
Desachy, Bruno. 2004. “Le sériographe EPPM : un outil informatisé de sériation graphique pour tableaux de comptages.” Revue archéologique de Picardie 3 (1): 39–56. https://doi.org/10.3406/pica.2004.2396.
Ford, J. A. 1962. A Quantitative Method for Deriving Cultural Chronology. Technical Manual 1. Washington, DC: Pan American Union.