This vignette explores the Anderson–Darling k-Sample test. CMH-17-1G [1] provides a formulation for this test that appears different from the formulation given by Scholz and Stephens in their 1987 paper [2].
Both references use different nomenclature, which is summarized as follows:
| Term | CMH-17-1G | Scholz and Stephens |
|---|---|---|
| A sample | \(i\) | \(i\) |
| The number of samples | \(k\) | \(k\) |
| An observation within a sample | \(j\) | \(j\) |
| The number of observations within the sample \(i\) | \(n_i\) | \(n_i\) |
| The total number of observations within all samples | \(n\) | \(N\) |
| Distinct values in combined data, ordered | \(z_{(1)}\)…\(z_{(L)}\) | \(Z_1^*\)…\(Z_L^*\) |
| The number of distinct values in the combined data | \(L\) | \(L\) |
Given the possibility of ties in the data, the discrete version of the test must be used. Scholz and Stephens (1987) give the test statistic as:
\[ A_{a k N}^2 = \frac{N - 1}{N}\sum_{i=1}^k \frac{1}{n_i}\sum_{j=1}^{L}\frac{l_j}{N}\frac{\left(N M_{a i j} - n_i B_{a j}\right)^2}{B_{a j}\left(N - B_{a j}\right) - N l_j / 4} \]
CMH-17-1G gives the test statistic as:
\[ ADK = \frac{n - 1}{n^2\left(k - 1\right)}\sum_{i=1}^k\frac{1}{n_i}\sum_{j=1}^L h_j \frac{\left(n F_{i j} - n_i H_j\right)^2}{H_j \left(n - H_j\right) - n h_j / 4} \]
By inspection, the CMH-17-1G version of this test statistic contains an extra factor of \(\frac{1}{\left(k - 1\right)}\).
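This relationship can be checked numerically. Below is a minimal R sketch of the CMH-17-1G formula, assuming the usual CMH-17-1G definitions of \(F_{ij}\), \(H_j\) and \(h_j\); the function name `adk_statistic` is made up for this illustration and is not part of cmstatr.

```r
# Minimal sketch of the CMH-17-1G ADK statistic.
# `samples` is a list of numeric vectors, one vector per group.
adk_statistic <- function(samples) {
  k <- length(samples)                      # number of samples
  n_i <- vapply(samples, length, numeric(1))
  n <- sum(n_i)                             # total number of observations
  combined <- unlist(samples)
  z <- sort(unique(combined))               # the L distinct values

  # h_j: number of values in the combined samples equal to z_j
  h <- vapply(z, function(zj) sum(combined == zj), numeric(1))
  # H_j: number of combined values less than z_j, plus half of those equal to z_j
  H <- vapply(z, function(zj) sum(combined < zj), numeric(1)) + h / 2

  outer_sum <- 0
  for (i in seq_len(k)) {
    xi <- samples[[i]]
    # F_ij: number of values in sample i less than z_j, plus half of those equal to z_j
    F_ij <- vapply(z, function(zj) sum(xi < zj) + sum(xi == zj) / 2, numeric(1))
    outer_sum <- outer_sum + (1 / n_i[i]) *
      sum(h * (n * F_ij - n_i[i] * H)^2 / (H * (n - H) - n * h / 4))
  }
  (n - 1) / (n^2 * (k - 1)) * outer_sum
}
```

Multiplying the value returned by this sketch by \(\left(k - 1\right)\) should reproduce the Scholz and Stephens \(A_{a k N}^2\) statistic for the same data.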
Scholz and Stephens indicate that one rejects \(H_0\) at a significance level of \(\alpha\) when:
\[ \frac{A_{a k N}^2 - \left(k - 1\right)}{\sigma_N} \ge t_{k - 1}\left(\alpha\right) \]
This can be rearranged to give a critical value:
\[ A_{c r i t}^2 = \left(k - 1\right) + \sigma_N t_{k - 1}\left(\alpha\right) \]
CMH-17-1G gives the critical value for \(ADK\) for \(\alpha=0.025\) as:
\[ ADC = 1 + \sigma_n \left(1.96 + \frac{1.149}{\sqrt{k - 1}} - \frac{0.391}{k - 1}\right) \]
The definition of \(\sigma_n\) from the two sources differs by a factor of \(\left(k - 1\right)\).
The value in parentheses in the CMH-17-1G critical value corresponds to the interpolation formula for \(t_m\left(\alpha\right)\) given in Scholz and Stephens' paper. Note that this is not Student's t-distribution, but rather a distribution referred to as the \(T_m\) distribution.
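As a small illustration (the function names below are made up for this example), both critical values can be computed from \(k\) and the corresponding \(\sigma\) using the \(\alpha = 0.025\) interpolation coefficients quoted above:

```r
# Interpolated t_m(alpha = 0.025) from Scholz and Stephens (1987), with m = k - 1
t_m_025 <- function(k) {
  1.960 + 1.149 / sqrt(k - 1) - 0.391 / (k - 1)
}

# Scholz and Stephens critical value: A^2_crit = (k - 1) + sigma_N * t_m(alpha)
a2_crit <- function(k, sigma_N) (k - 1) + sigma_N * t_m_025(k)

# CMH-17-1G critical value: ADC = 1 + sigma_n * t_m(alpha)
adc <- function(k, sigma_n) 1 + sigma_n * t_m_025(k)
```

Since \(\sigma_n\) and \(\sigma_N\) differ by a factor of \(\left(k - 1\right)\), the two critical values also differ by that same factor, consistent with the difference between the two test statistics.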
The cmstatr package uses the kSamples package to perform the k-sample Anderson–Darling tests. That package uses the original formulation from Scholz and Stephens, so the test statistic will differ from that given by software based on the CMH-17-1G formulation by a factor of \(\left(k-1\right)\).
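A minimal usage sketch follows, with made-up data; it assumes the data-frame interface of `ad_ksample` as documented in cmstatr.

```r
library(cmstatr)

set.seed(1234)
dat <- data.frame(
  strength = c(rnorm(10, 100, 5), rnorm(10, 102, 5), rnorm(10, 98, 5)),
  batch = rep(c("A", "B", "C"), each = 10)
)

# Runs the k-sample Anderson--Darling test via the kSamples package.
# The reported statistic is on the Scholz and Stephens scale, so it is
# (k - 1) times the CMH-17-1G ADK for the same data.
res <- ad_ksample(dat, strength, batch)
res
```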
For comparison, SciPy's implementation also uses the original Scholz and Stephens formulation. The statistic that it returns, however, is the normalized statistic, \(\left[A_{a k N}^2 - \left(k - 1\right)\right] / \sigma_N\), rather than the \(A_{a k N}^2\) value returned by kSamples. To be consistent, SciPy also returns the critical values \(t_{k-1}(\alpha)\) directly. (Currently, SciPy also floors the returned p-value at 0.1% and caps it at 25%.) The values of \(k\) and \(\sigma_N\) are available in the return value of cmstatr's ad_ksample function, if an exact comparison to SciPy is necessary.
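A rough sketch of that comparison is shown below; the field names `ad`, `k` and `sigma` on the returned object are assumptions here and should be checked against the package documentation.

```r
# `res` is the object returned by ad_ksample() in the example above.
# SciPy's anderson_ksamp reports the normalized statistic
#   (A^2_akN - (k - 1)) / sigma_N
(res$ad - (res$k - 1)) / res$sigma
```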
The conclusion drawn about the null hypothesis, however, will be the same whether one uses cmstatr in R, software implementing the CMH-17-1G formulation, or SciPy.