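The analysis of proportions is of two primary types. The first focuses on the proportion of occurrences of one designated value of a categorical variable, for a single sample or compared across samples, the test of homogeneity. The second compares the proportions of all the values of one or more categorical variables from a contingency table, a test of goodness-of-fit or of independence.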
Built on standard base R functions, the lessR function Prop_test(), abbreviated prop(), provides either type of analysis. To use, enter either the original data from which to compute the sample proportions or the already computed sample proportions. For the analysis of multiple categorical variables, the test of homogeneity and the test of independence yield the identical statistical result.
When the focus is on a designated value of the variable, call an occurrence of that value a success. All other values of the variable are failures. Success or failure in this context does not necessarily mean good or bad, desired or undesired, but only that the designated value either occurred or did not.
When analyzing proportions from data for a single categorical variable, indicate the variable’s name with the parameter variable. Entering a value of the variable for the parameter success triggers the test of homogeneity. When entering proportions directly, indicate the number of successes and the total number of trials with the n_succ and n_tot parameters. Enter the value of each parameter either as a single value for one sample or as a vector of multiple values for multiple samples. Without a value for success or n_succ, the analysis is of goodness-of-fit or independence.
The example below is from the documentation for the base R function binom.test(), which provides the exact test of a null hypothesis regarding the probability of success. Prop_test(), which uses that base R function to compare a sample proportion to a hypothesized population value, yields the same result.
For a given categorical variable of interest, a type of plant, consider two values, either “giant” or “dwarf”. From a sample of 925 plants, the specified value of “giant” occurred 682 times and did not occur 243 times. The null hypothesis, set with the p0 parameter, is that the specified value occurs for 3/4 of the population.
Prop_test(n_succ=682, n_fail=243, p0=.75)
##
## >>> Exact binomial test of a proportion <<<
##
## ------ Description ------
##
## Number of successes: 682
## Number of failures: 243
## Number of trials: 925
## Sample proportion: 0.737
##
## ------ Inference ------
##
## Hypothesis test for null of 0.75, p-value: 0.382
## 95% Confidence interval: 0.708 to 0.765
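For comparison, the same test can be sketched with a direct call to the base R binom.test() function, in the form given in its documentation. The object name bt is only illustrative.

```r
# Exact binomial test: 682 successes and 243 failures,
# null hypothesis of a population proportion of 3/4
bt <- binom.test(c(682, 243), p = 3/4)

bt$estimate   # sample proportion of successes
bt$p.value    # two-sided p-value
bt$conf.int   # 95% confidence interval
```

The values should agree with the Prop_test() output above once rounded to three decimal digits.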
To illustrate with data, read the Jackets data file included with lessR into the data frame d. The file contains two categorical variables. The variable Bike represents two different types of motorcycle: BMW and Honda. The second variable is Jacket with three values of jacket thickness: Lite, Med, and Thick.
d <- Read("Jackets")
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Bike character 1025 0 2 BMW Honda Honda ... Honda Honda BMW
## 2 Jacket character 1025 0 3 Lite Lite Lite ... Lite Med Lite
## ------------------------------------------------------------------------------------------
For the variable Bike from the default d data frame, the parameter success applies to the “BMW” value of Bike in the following example. Analyze the proportion of successes, those reporting a Bike of “BMW”. The default null hypothesis is a population proportion of 0.5, specified explicitly here with p0.
For clarity, the following example includes the parameter names listed with their corresponding values. These names are unnecessary in this example because the values are listed in the same order as the parameters in the definition of the Prop_test() function.
Prop_test(variable=Bike, success="BMW", p0=0.5)
##
## >>> Exact binomial test of a proportion <<<
##
## Variable: Bike
## success: BMW
##
## ------ Description ------
##
## Number of missing values: 0
## Number of successes: 418
## Number of failures: 607
## Number of trials: 1025
## Sample proportion: 0.408
##
## ------ Inference ------
##
## Hypothesis test for null of 0.5, p-value: 0.000
## 95% Confidence interval: 0.378 to 0.439
Reject the null hypothesis, with a \(p\)-value of 0.000 to three decimal digits, less than \(\alpha = 0.05\). The sample proportion of \(p=0.408\) is far from the hypothesized value of \(0.5\) for the proportion of "BMW" values for Bike. Conclude that the data were sampled from a population with a proportion of BMW different from 0.5.
The following example is the same as in the base R prop.test() documentation. Prop_test() relies upon that base R function to compare proportions across different groups and yields the same result. To indicate multiple proportions across groups, provide multiple values for the n_succ and n_tot parameters.
The null hypothesis in this example is that the four populations of patients from which the samples were drawn have the same population proportion of smokers. The alternative is that at least one population proportion is different. Label the groups in the output by providing a named vector for the successes.
smokers <- c(83, 90, 129, 70)
names(smokers) <- c("Group1","Group2","Group3","Group4")
patients <- c(86, 93, 136, 82)
Prop_test(n_succ=smokers, n_tot=patients)
##
## >>> 4-sample test for equality of proportions without continuity correction <<<
##
##
## >>> Description
##
## Group1 Group2 Group3 Group4
## ----------- ------- ------- ------- -------
## n_ 83 90 129 70
## n_total 86 93 136 82
## proportion 0.965 0.968 0.949 0.854
##
## >>> Inference
##
## Chi-square statistic: 12.600
## Degrees of freedom: 3
## Hypothesis test of equal population proportions: p-value = 0.006
The result of the test is that the \(p\)-value \(=0.006 < \alpha=0.05\), so reject the null hypothesis of equal proportions across the corresponding four populations. At least one of the population proportions of smokers differs from the others.
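As a cross-check, a minimal sketch of the same comparison with a direct call to the base R prop.test() function, using the counts from this example. The result name pt is illustrative.

```r
# Number of smokers (successes) and number of patients (trials) per group
smokers  <- c(Group1 = 83, Group2 = 90, Group3 = 129, Group4 = 70)
patients <- c(86, 93, 136, 82)

# Test of equal proportions across the four groups
pt <- prop.test(smokers, patients)

pt$estimate    # sample proportion of smokers in each group
pt$statistic   # chi-square statistic
pt$parameter   # degrees of freedom
pt$p.value     # p-value for the null of equal population proportions
```

The chi-square statistic, degrees of freedom, and p-value should match the Inference section of the Prop_test() output.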
In the following example, duplicate the previous results from data. To illustrate, create the data frame d according to the counts of smokers and non-smokers in each group. Of course, in actual data analysis the data would already be available.
sm1 <- c(rep("smoke", 83), rep("nosmoke", 3))
sm2 <- c(rep("smoke", 90), rep("nosmoke", 3))
sm3 <- c(rep("smoke", 129), rep("nosmoke", 7))
sm4 <- c(rep("smoke", 70), rep("nosmoke", 12))
sm <- c(sm1, sm2, sm3, sm4)
grp <- c(rep("A",86), rep("B",93), rep("C",136), rep("D",82))
d <- data.frame(sm, grp)
Examine the first six rows and last six rows of the data frame d. The variable of interest is sm, with values “smoke” and “nosmoke”.
head(d)
## sm grp
## 1 smoke A
## 2 smoke A
## 3 smoke A
## 4 smoke A
## 5 smoke A
## 6 smoke A
tail(d)
## sm grp
## 392 nosmoke D
## 393 nosmoke D
## 394 nosmoke D
## 395 nosmoke D
## 396 nosmoke D
## 397 nosmoke D
To indicate a comparison across groups, retain the format for a single proportion based on a value of a categorical variable of interest. Define success by the value of this variable, here “smoke”. The comparison across the four groups is indicated by a grouping variable, which contains a label that identifies the group for each row of data. Specify the grouping variable with the by parameter. The grouping variable in this example is grp, whose values are the first four uppercase letters of the alphabet.
The relevant parameters variable, success, and by are listed in their given order in this example, so the parameter names are unnecessary. They are listed here for completeness.
Prop_test(variable=sm, success="smoke", by=grp)
##
## >>> 4-sample test for equality of proportions without continuity correction <<<
##
## Variable: sm
## success: smoke
## by: grp
##
## >>> Description
##
## A B C D
## ----------- ------ ------ ------ ------
## n_smoke 83 90 129 70
## n_total 86 93 136 82
## proportion 0.965 0.968 0.949 0.854
##
## >>> Inference
##
## Chi-square statistic: 12.600
## Degrees of freedom: 3
## Hypothesis test of equal population proportions: p-value = 0.006
The analysis of data that matches the previously input proportions provides the same results as providing the proportions directly.
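The step from raw data to counts can be sketched in base R. This is an illustration of the idea, not a description of the internals of Prop_test(). The names tab and pt_data are illustrative, and the data are rebuilt compactly so the block stands alone.

```r
# Rebuild the data: one value per patient, in group order
sm  <- rep(rep(c("smoke", "nosmoke"), times = 4),
           times = c(83, 3, 90, 3, 129, 7, 70, 12))
grp <- rep(c("A", "B", "C", "D"), times = c(86, 93, 136, 82))

# Cross-tabulate group by smoking status, with the success
# column ("smoke") first, as prop.test() expects for a matrix
tab <- table(grp, sm)[, c("smoke", "nosmoke")]
tab

# A two-column matrix of successes and failures gives the same test
pt_data <- prop.test(tab)
pt_data$statistic
pt_data$p.value
```

The counts in the smoke column are the same values entered directly for n_succ in the earlier example.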
For the previously discussed test of homogeneity of the values of a single categorical variable, the proportion of occurrences for a specific value across different samples is of interest. Here, the proportion of occurrence for each value is instead calculated against the total number of occurrences, as one sample from a single population.
For the goodness-of-fit test to a uniform distribution, provide the frequencies for each group for the parameter n_tot. The default null hypothesis is that the proportions of the different categories of a categorical variable are equal.
In this example, enter three frequencies as a vector for the value of the n_tot parameter. Make the vector a named vector to label the output accordingly.
x = c(372, 342, 311)
names(x) = c("Lite", "Med", "Thick")
Prop_test(n_tot=x)
##
## >>> Chi-squared test for given probabilities <<<
##
##
## >>> Description
##
## Lite Med Thick
## --------- -------- -------- --------
## observed 372 342 311
## expected 341.667 341.667 341.667
## residual 1.641 0.018 -1.659
## stdn res 2.010 0.022 -2.032
##
## >>> Inference
##
## Chi-square statistic: 5.446
## Degrees of freedom: 2
## Hypothesis test of equal population proportions: p-value = 0.066
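The rows of the Description table correspond to components of the object returned by the base R chisq.test() function. A minimal sketch with the same three frequencies follows. The object name gof is illustrative.

```r
# Observed frequencies for the three jacket thicknesses
x <- c(Lite = 372, Med = 342, Thick = 311)

# With a single vector, chisq.test() defaults to a null
# hypothesis of equal probabilities for all categories
gof <- chisq.test(x)

gof$expected    # expected frequency, n/3 for each category
gof$residuals   # Pearson residuals, (observed - expected) / sqrt(expected)
gof$stdres      # standardized residuals
gof$statistic   # chi-square statistic
gof$p.value     # p-value
```

With the chi-square statistic near 5.446 on 2 degrees of freedom, the p-value of about 0.066 exceeds 0.05, so the null hypothesis of equal proportions is not rejected at that level.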
The same analysis follows from the data.
d <- Read("Jackets", quiet=TRUE)
Prop_test(Jacket)
##
## >>> Chi-squared test for given probabilities <<<
##
## Variable: Jacket
##
## >>> Description
##
## Lite Med Thick
## --------- -------- -------- --------
## observed 372 342 311
## expected 341.667 341.667 341.667
## residual 1.641 0.018 -1.659
## stdn res 2.010 0.022 -2.032
##
## >>> Inference
##
## Chi-square statistic: 5.446
## Degrees of freedom: 2
## Hypothesis test of equal population proportions: p-value = 0.066
Tests of goodness of fit and independence evaluated here rely upon a contingency table of one or two dimensions. Due to the awkwardness of entering a table of frequencies, Prop_test() relies upon computing the contingency table from the data, using the base R chisq.test() function. However, as shown next, there is a way to enter frequencies if comparing just two levels of a categorical variable.
The smokers and patients vectors presented in a previous example together contain the information needed to construct the corresponding full contingency table. In that example, the smokers vector represents the frequencies of patients who smoked.
smokers
## Group1 Group2 Group3 Group4
## 83 90 129 70
The patients vector represents the total number of patients in each of the four groups.
patients
## [1] 86 93 136 82
The previous analysis of the separate vectors is equivalent to the analysis of the full 2 x 4 contingency table of smokers and non-smokers according to the test of independence illustrated in the next section. The two tests are conceptually distinct but computationally identical.
The construction of the following contingency table is not part of the input to Prop_test(). It is included here solely to show the equivalence of the information provided by the two vectors and by the full contingency table.
cont_tbl <- matrix(c(83, 86-83, 90, 93-90, 129, 136-129, 70, 82-70), nrow=2)
dimnames(cont_tbl) <- list(Smoke = c("Yes", "No"),
                           Group = c("G1","G2","G3","G4"))
addmargins(cont_tbl)
## Group
## Smoke G1 G2 G3 G4 Sum
## Yes 83 90 129 70 372
## No 3 3 7 12 25
## Sum 86 93 136 82 397
If comparing more than two levels of a categorical variable, and only the frequencies and not the data are available, then follow the form of the previous construction of the contingency table from the given frequencies. Then directly use the base R function chisq.test() to do the analysis.
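A minimal sketch of that approach, assuming only base R. The table here is the 2 x 4 table of smokers and non-smokers, rebuilt so the block stands alone. The same call applies to a table with more than two rows. The names ct and pt are illustrative.

```r
# Contingency table entered directly as frequencies, column by column
cont_tbl <- matrix(c(83, 3, 90, 3, 129, 7, 70, 12), nrow = 2,
                   dimnames = list(Smoke = c("Yes", "No"),
                                   Group = c("G1", "G2", "G3", "G4")))

# Pearson chi-square test on the table of frequencies
ct <- chisq.test(cont_tbl)
ct$statistic   # chi-square statistic
ct$parameter   # degrees of freedom, (rows - 1) * (columns - 1)
ct$p.value

# Same statistic as the test of equal proportions from the two vectors
pt <- prop.test(c(83, 90, 129, 70), c(86, 93, 136, 82))
all.equal(unname(ct$statistic), unname(pt$statistic))
```

The agreement of the two statistics illustrates that the test of homogeneity and the test of independence are computationally identical for these data.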
The \(\chi^2\) test of independence is evaluated here for two categorical variables. The first variable listed in this example is the value of the parameter variable, so it does not need the parameter name. The second variable listed must include the parameter name by.
The question for the analysis is whether the observed frequencies of Jacket thickness and Bike ownership are so different from the frequencies expected under the null hypothesis that we conclude the variables are related.
Prop_test(Jacket, by=Bike)
##
## >>> Pearson's Chi-squared test <<<
##
## Variable: Jacket
## by: Bike
##
## >>> Description
##
## Jacket
## Bike Lite Med Thick Sum
## BMW 89 135 194 418
## Honda 283 207 117 607
## Sum 372 342 311 1025
##
## Cramer's V: 0.319
##
## >>> Inference
##
## Chi-square statistic: 104.083
## Degrees of freedom: 2
## Hypothesis test of independence: p-value = 0.000
The result of this test is that the \(p\)-value = 0.000 \(< \alpha=0.05\), so reject the null hypothesis of independence. Conclude that the type of Bike a person rides and the thickness of their Jacket are related.
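To see where these numbers come from, the following sketch enters the cell frequencies from the table above and applies the base R chisq.test() function. Cramer's V is computed here with the standard formula, the square root of the chi-square statistic divided by n times the smaller of (rows - 1) and (columns - 1). That Prop_test() computes it in exactly this way is an assumption, and the names jb, ind, and cramers_v are illustrative.

```r
# Cell frequencies from the Bike by Jacket table, entered row by row
jb <- matrix(c( 89, 135, 194,
               283, 207, 117),
             nrow = 2, byrow = TRUE,
             dimnames = list(Bike = c("BMW", "Honda"),
                             Jacket = c("Lite", "Med", "Thick")))

# Pearson chi-square test of independence
ind <- chisq.test(jb)
ind$expected    # frequencies expected under independence
ind$statistic   # chi-square statistic
ind$parameter   # degrees of freedom

# Cramer's V (standard formula, assumed to match the lessR output)
n <- sum(jb)
cramers_v <- sqrt(unname(ind$statistic) / (n * min(dim(jb) - 1)))
cramers_v
```

Under independence, the expected frequency of each cell is the product of its row and column totals divided by the sample size, for example 418 x 372 / 1025 for BMW riders with a Lite jacket.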