This vignette explores the Anderson–Darling k-Sample test. CMH-17-1G [1] provides a formulation for this test that appears different from the formulation given by Scholz and Stephens in their 1987 paper [2].
Both references use different nomenclature, which is summarized as follows:
| Term | CMH-17-1G | Scholz and Stephens |
|---|---|---|
| A sample | \(i\) | \(i\) |
| The number of samples | \(k\) | \(k\) |
| An observation within a sample | \(j\) | \(j\) |
| The number of observations within the sample \(i\) | \(n_i\) | \(n_i\) |
| The total number of observations within all samples | \(n\) | \(N\) |
| Distinct values in combined data, ordered | \(z_{(1)}\)…\(z_{(L)}\) | \(Z_1^*\)…\(Z_L^*\) |
| The number of distinct values in the combined data | \(L\) | \(L\) |
Given the possibility of ties in the data, the discrete version of the test must be used. Scholz and Stephens (1987) give the test statistic as:
\[ A_{a k N}^2 = \frac{N - 1}{N}\sum_{i=1}^k \frac{1}{n_i}\sum_{j=1}^{L}\frac{l_j}{N}\frac{\left(N M_{a i j} - n_i B_{a j}\right)^2}{B_{a j}\left(N - B_{a j}\right) - N l_j / 4} \]
CMH-17-1G gives the test statistic as:
\[ ADK = \frac{n - 1}{n^2\left(k - 1\right)}\sum_{i=1}^k\frac{1}{n_i}\sum_{j=1}^L h_j \frac{\left(n F_{i j} - n_i H_j\right)^2}{H_j \left(n - H_j\right) - n h_j / 4} \]
By inspection, the CMH-17-1G version of this test statistic contains an extra factor of \(\frac{1}{\left(k - 1\right)}\).
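This relationship can be checked numerically. Below is a minimal R sketch of the CMH-17-1G formula, assuming the usual CMH-17-1G definitions of \(F_{ij}\), \(H_j\) and \(h_j\); the function name `adk_statistic` is made up for this illustration and is not part of cmstatr.

```r
# Minimal sketch of the CMH-17-1G ADK statistic.
# `samples` is a list of numeric vectors, one vector per group.
adk_statistic <- function(samples) {
  k <- length(samples)                      # number of samples
  n_i <- vapply(samples, length, numeric(1))
  n <- sum(n_i)                             # total number of observations
  combined <- unlist(samples)
  z <- sort(unique(combined))               # the L distinct values

  # h_j: number of values in the combined samples equal to z_j
  h <- vapply(z, function(zj) sum(combined == zj), numeric(1))
  # H_j: number of combined values less than z_j, plus half of those equal to z_j
  H <- vapply(z, function(zj) sum(combined < zj), numeric(1)) + h / 2

  outer_sum <- 0
  for (i in seq_len(k)) {
    xi <- samples[[i]]
    # F_ij: number of values in sample i less than z_j, plus half of those equal to z_j
    F_ij <- vapply(z, function(zj) sum(xi < zj) + sum(xi == zj) / 2, numeric(1))
    outer_sum <- outer_sum + (1 / n_i[i]) *
      sum(h * (n * F_ij - n_i[i] * H)^2 / (H * (n - H) - n * h / 4))
  }
  (n - 1) / (n^2 * (k - 1)) * outer_sum
}
```

Multiplying the value returned by this sketch by \(\left(k - 1\right)\) should reproduce the Scholz and Stephens \(A_{a k N}^2\) statistic for the same data.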
Scholz and Stephens indicate that one rejects \(H_0\) at a significance level of \(\alpha\) when:
\[ \frac{A_{a k N}^2 - \left(k - 1\right)}{\sigma_N} \ge t_{k - 1}\left(\alpha\right) \]
This can be rearranged to give a critical value:
\[ A_{c r i t}^2 = \left(k - 1\right) + \sigma_N t_{k - 1}\left(\alpha\right) \]
CMH-17-1G gives the critical value for \(ADK\) for \(\alpha=0.025\) as:
\[ ADC = 1 + \sigma_n \left(1.96 + \frac{1.149}{\sqrt{k - 1}} - \frac{0.391}{k - 1}\right) \]
The definition of \(\sigma_n\) from the two sources differs by a factor of \(\left(k - 1\right)\).
The value in parentheses in the CMH-17-1G critical value corresponds to the interpolation formula for \(t_m\left(\alpha\right)\) given in Scholz and Stephens' paper. Note that this is not Student's t-distribution, but rather a distribution referred to as the \(T_m\) distribution.
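As a small illustration (the function names below are made up for this example), both critical values can be computed from \(k\) and the corresponding \(\sigma\) using the \(\alpha = 0.025\) interpolation coefficients quoted above:

```r
# Interpolated t_m(alpha = 0.025) from Scholz and Stephens (1987), with m = k - 1
t_m_025 <- function(k) {
  1.960 + 1.149 / sqrt(k - 1) - 0.391 / (k - 1)
}

# Scholz and Stephens critical value: A^2_crit = (k - 1) + sigma_N * t_m(alpha)
a2_crit <- function(k, sigma_N) (k - 1) + sigma_N * t_m_025(k)

# CMH-17-1G critical value: ADC = 1 + sigma_n * t_m(alpha)
adc <- function(k, sigma_n) 1 + sigma_n * t_m_025(k)
```

Since \(\sigma_n\) and \(\sigma_N\) differ by a factor of \(\left(k - 1\right)\), the two critical values also differ by that same factor, consistent with the difference between the two test statistics.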
The cmstatr package uses the kSamples package to perform the k-sample Anderson–Darling tests. That package uses the original formulation from Scholz and Stephens, so the test statistic will differ from that given by software based on the CMH-17-1G formulation by a factor of \(\left(k-1\right)\).
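A minimal usage sketch follows, with made-up data; it assumes the data-frame interface of `ad_ksample` as documented in cmstatr.

```r
library(cmstatr)

set.seed(1234)
dat <- data.frame(
  strength = c(rnorm(10, 100, 5), rnorm(10, 102, 5), rnorm(10, 98, 5)),
  batch = rep(c("A", "B", "C"), each = 10)
)

# Runs the k-sample Anderson--Darling test via the kSamples package.
# The reported statistic is on the Scholz and Stephens scale, so it is
# (k - 1) times the CMH-17-1G ADK for the same data.
res <- ad_ksample(dat, strength, batch)
res
```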
For comparison, SciPy's implementation also uses the original Scholz and Stephens formulation. The statistic that it returns, however, is the normalized statistic, \(\left[A_{a k N}^2 - \left(k - 1\right)\right] / \sigma_N\), rather than the \(A_{a k N}^2\) value returned by kSamples. To be consistent, SciPy also returns the critical values \(t_{k-1}(\alpha)\) directly. (Currently, SciPy also floors the returned p-value at 0.1% and caps it at 25%.) The values of \(k\) and \(\sigma_N\) are available in the return value of cmstatr's ad_ksample function, if an exact comparison to SciPy is necessary.
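A rough sketch of that comparison is shown below; the field names `ad`, `k` and `sigma` on the returned object are assumptions here and should be checked against the package documentation.

```r
# `res` is the object returned by ad_ksample() in the example above.
# SciPy's anderson_ksamp reports the normalized statistic
#   (A^2_akN - (k - 1)) / sigma_N
(res$ad - (res$k - 1)) / res$sigma
```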
The conclusion drawn about the null hypothesis, however, will be the same whether one uses cmstatr in R, software implementing the CMH-17-1G formulation, or SciPy.