Proportion Inference

David Gerbing

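The analysis of proportions is of two primary types. The first type tests the proportion of a single, specified value of a categorical variable, either for one sample or compared across multiple samples (homogeneity). The second type involves all the values of one or more categorical variables without a specified value (goodness-of-fit and independence).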

Built on standard base R functions, the lessR function Prop_test(), abbreviated prop(), provides either type of analysis. To use, enter either the original data from which to compute the sample proportions or the already tabulated frequencies that define the sample proportions. For the analysis of multiple categorical variables, the test of homogeneity and the test of independence yield the identical statistical result.

Test of a Specified Proportion

When the focus is on a designated value of the variable, call an occurrence of that value a success. All other values of the variable are failures. Success or failure in this context does not necessarily mean good or bad, desired or undesired, but instead that the designated value either occurred or did not.

When analyzing proportions from data for a single categorical variable, indicate the variable’s name with the parameter variable. Entering a value of the variable for the parameter success triggers the test of a specified proportion or, with multiple samples, the test of homogeneity. When entering frequencies directly, indicate the number of successes with the n_succ parameter and either the total number of trials with n_tot or the number of failures with n_fail. Enter the value of each parameter either as a single value for one sample or as a vector of multiple values for multiple samples. Without a value for success or n_succ, the analysis is of goodness-of-fit or independence.

Single Proportion

The example below is from the documentation for the base R function binom.test(), which provides the exact test of a null hypothesis regarding the probability of success. Prop_test(), which uses that base R function to compare a sample proportion to a hypothesized population value, yields the same result.

From Input Frequencies

For a given categorical variable of interest, a type of plant, consider two values, either “giant” or “dwarf”. From a sample of 925 plants, the specified value of “giant” occurred 682 times and did not occur 243 times. The null hypothesis tested is that the specified value occurs for 3/4 of the population, as specified with the p0 parameter.

Prop_test(n_succ=682, n_fail=243, p0=.75)
## 
## >>> Exact binomial test of a proportion <<< 
## 
## ------ Description ------
## 
## Number of successes: 682 
## Number of failures: 243 
## Number of trials: 925 
## Sample proportion: 0.737 
## 
## ------ Inference ------
## 
## Hypothesis test for null of 0.75, p-value: 0.382
## 95% Confidence interval: 0.708 to 0.765
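
As a cross-check, the same test can be run with the base R function binom.test() directly. The following is a minimal sketch that uses only base R. The object name res is chosen here for illustration.

```r
# Exact binomial test in base R: 682 successes out of 925 trials,
# with the null hypothesis of a population proportion of 0.75
res <- binom.test(x = 682, n = 682 + 243, p = 0.75)

res$estimate   # sample proportion of successes
res$p.value    # two-sided p-value
res$conf.int   # 95% confidence interval
```

The values should agree with the Prop_test() output above up to rounding.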

From Data

To illustrate with data, read the Jackets data file included with lessR into the data frame d. The file contains two categorical variables. The variable Bike represents two different types of motorcycle: BMW and Honda. The second variable is Jacket with three values of jacket thickness: Lite, Med, and Thick.

d <- Read("Jackets")
## 
## >>> Suggestions
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1      Bike character   1025       0       2   BMW  Honda  Honda ... Honda  Honda  BMW
##  2    Jacket character   1025       0       3   Lite  Lite  Lite ... Lite  Med  Lite
## ------------------------------------------------------------------------------------------

For the variable Bike from the default d data frame, the parameter success applies to the “BMW” value of Bike in the following example. Analyze the proportion of successes, those reporting a Bike of “BMW”. The default null hypothesis is a population value of 0.5, but here specify it explicitly.

For clarity, the following example lists the parameter names with their corresponding values. These names are unnecessary in this example because the values are listed in the same order as the parameters appear in the definition of the Prop_test() function.

Prop_test(variable=Bike, success="BMW", p0=0.5)
## 
## >>> Exact binomial test of a proportion <<< 
## 
## Variable: Bike 
## success: BMW 
## 
## ------ Description ------
## 
## Number of missing values: 0 
## Number of successes: 418 
## Number of failures: 607 
## Number of trials: 1025 
## Sample proportion: 0.408 
## 
## ------ Inference ------
## 
## Hypothesis test for null of 0.5, p-value: 0.000
## 95% Confidence interval: 0.378 to 0.439

Reject the null hypothesis, with a \(p\)-value of 0.000 to three decimal digits, less than \(\alpha = 0.05\). The sample proportion \(p=0.408\) is considered far from the hypothesized value of \(0.5\) for the proportion of "BMW" values for Bike. Conclude that the data were sampled from a population with a proportion of BMW different from 0.5.
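
The analysis from data amounts to counting the occurrences of the designated value and then running the exact binomial test on that count. The following base R sketch illustrates the two steps. Because the Jackets file ships with lessR, the vector bike is only a stand-in rebuilt from the frequencies reported above (418 BMW, 607 Honda). The names bike, n_bmw, and res are chosen here for illustration.

```r
# Stand-in for the Bike column, rebuilt from the reported frequencies
bike <- rep(c("BMW", "Honda"), times = c(418, 607))

# Step 1: count the successes, the occurrences of the designated value
n_bmw <- sum(bike == "BMW")

# Step 2: exact binomial test of the null hypothesis of 0.5
res <- binom.test(n_bmw, length(bike), p = 0.5)

res$estimate   # sample proportion of "BMW"
res$p.value    # two-sided p-value
res$conf.int   # 95% confidence interval
```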

Multiple Proportions

The following example is the same as in the base R prop.test() documentation. Prop_test() relies upon that base R function to compare proportions across different groups and yields the same result. To indicate multiple proportions across groups, provide multiple values for the n_succ and n_tot parameters.

From Input Frequencies

The null hypothesis in this example is that the four populations of patients from which the samples were drawn have the same population proportion of smokers. The alternative is that at least one population proportion is different. Label the groups in the output by providing a named vector for the successes.

smokers <- c(83, 90, 129, 70)
names(smokers) <- c("Group1","Group2","Group3","Group4")
patients <- c(86, 93, 136, 82)
Prop_test(n_succ=smokers, n_tot=patients)
## 
## >>> 4-sample test for equality of proportions without continuity correction  <<< 
## 
## 
## >>> Description
## 
##               Group1   Group2   Group3   Group4
## -----------  -------  -------  -------  -------
## n_                83       90      129       70
## n_total           86       93      136       82
## proportion     0.965    0.968    0.949    0.854
## 
## >>> Inference
## 
## Chi-square statistic: 12.600 
## Degrees of freedom: 3 
## Hypothesis test of equal population proportions: p-value = 0.006

The result of the test is that the \(p\)-value \(=0.006 < \alpha=0.05\), so reject the null hypothesis of equal probabilities across the corresponding four populations. At least one of the population proportions of smokers differs from the others.
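
For comparison, the underlying base R call can be sketched as follows, using only base R. The vectors repeat the values given above, and the object name res is chosen here for illustration.

```r
# Successes (smokers) and totals (patients) for the four groups
smokers  <- c(Group1 = 83, Group2 = 90, Group3 = 129, Group4 = 70)
patients <- c(86, 93, 136, 82)

# k-sample test for equality of proportions
res <- prop.test(x = smokers, n = patients)

res$estimate    # sample proportion of smokers in each group
res$statistic   # chi-square statistic
res$parameter   # degrees of freedom
res$p.value     # p-value for the null of equal proportions
```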

From Data

In the following example, duplicate the previous results from data. To illustrate, create the data frame d according to the frequencies of smokers and non-smokers in each group. Of course, in actual data analysis the data would already be available.

sm1 <- c(rep("smoke", 83), rep("nosmoke", 3))
sm2 <- c(rep("smoke", 90), rep("nosmoke", 3))
sm3 <- c(rep("smoke", 129), rep("nosmoke", 7))
sm4 <- c(rep("smoke", 70), rep("nosmoke", 12))
sm <- c(sm1, sm2, sm3, sm4)
grp <- c(rep("A",86), rep("B",93), rep("C",136), rep("D",82))
d <- data.frame(sm, grp)

Examine the first six rows and last six rows of the data frame d. The variable of interest is sm, with values “smoke” and “nosmoke”.

head(d)
##      sm grp
## 1 smoke   A
## 2 smoke   A
## 3 smoke   A
## 4 smoke   A
## 5 smoke   A
## 6 smoke   A
tail(d)
##          sm grp
## 392 nosmoke   D
## 393 nosmoke   D
## 394 nosmoke   D
## 395 nosmoke   D
## 396 nosmoke   D
## 397 nosmoke   D

To indicate a comparison across groups, retain the format for a single proportion based on a value of a categorical variable of interest. Define success by the value of this variable, here “smoke”. This analysis indicates the comparison across the four groups with a grouping variable that contains a label that identifies the corresponding group. Specify the grouping variable with the by parameter. The grouping variable in this example is grp, with values the first four uppercase letters of the alphabet.

The relevant parameters variable, success, and by are listed in their given order in this example, so the parameter names are unnecessary. They are listed here for completeness.

Prop_test(variable=sm, success="smoke", by=grp)
## 
## >>> 4-sample test for equality of proportions without continuity correction  <<< 
## 
## Variable: sm 
## success: smoke 
## by: grp 
## 
## >>> Description
## 
##                   A       B       C       D
## -----------  ------  ------  ------  ------
## n_smoke          83      90     129      70
## n_total          86      93     136      82
## proportion    0.965   0.968   0.949   0.854
## 
## >>> Inference
## 
## Chi-square statistic: 12.600 
## Degrees of freedom: 3 
## Hypothesis test of equal population proportions: p-value = 0.006

The analysis of data that matches the previously input proportions provides the same results as providing the proportions directly.
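
The step from raw data to the test can also be sketched in base R: cross-tabulate the two variables, then pass the resulting table of successes and failures to prop.test(). This is an illustrative sketch, not a description of the internals of Prop_test(). The names tbl and res are chosen here for illustration, and the data are rebuilt more compactly than above.

```r
# Rebuild the raw data: one value per patient
sm  <- rep(rep(c("smoke", "nosmoke"), times = 4),
           times = c(83, 3, 90, 3, 129, 7, 70, 12))
grp <- rep(c("A", "B", "C", "D"), times = c(86, 93, 136, 82))

# Cross-tabulate: rows are groups, columns are successes then failures
tbl <- table(grp, sm)[, c("smoke", "nosmoke")]
tbl

# prop.test() accepts a two-column table of successes and failures
res <- prop.test(tbl)

res$statistic   # chi-square statistic
res$p.value     # p-value for the null of equal proportions
```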

Tests without a Specified Proportion

Goodness-of-Fit

For the previously discussed test of homogeneity of the values of a single categorical variable, the proportion of occurrences for a specific value across different samples is of interest. Here, the proportion of occurrence for each value is instead calculated against the total number of occurrences, as one sample from a single population.

From Input Frequencies

For the goodness-of-fit test to a uniform distribution, provide the frequencies for each group for the parameter n_tot. The default null hypothesis is that the proportions of the different categories of a categorical variable are equal.

In this example, enter three frequencies as a vector for the value of the n_tot parameter. Make the vector a named vector to label the output accordingly.

x <- c(372, 342, 311)
names(x) <- c("Lite", "Med", "Thick")
Prop_test(n_tot=x)
## 
## >>> Chi-squared test for given probabilities  <<< 
## 
## 
## >>> Description
## 
##                Lite       Med     Thick
## ---------  --------  --------  --------
## observed        372       342       311
## expected    341.667   341.667   341.667
## residual      1.641     0.018    -1.659
## stdn res      2.010     0.022    -2.032
## 
## >>> Inference
## 
## Chi-square statistic: 5.446 
## Degrees of freedom: 2 
## Hypothesis test of equal population proportions: p-value = 0.066
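
The quantities in this output can be reproduced with the base R function chisq.test(). The sketch below uses only base R and also computes the chi-square statistic by hand as the sum of (observed - expected)^2 / expected. The names res and chi_sq are chosen here for illustration.

```r
# Observed frequencies for the three categories
x <- c(Lite = 372, Med = 342, Thick = 311)

# Goodness-of-fit test; by default the null hypothesis is equal proportions
res <- chisq.test(x)

res$expected    # expected frequencies under the null
res$residuals   # Pearson residuals: (observed - expected) / sqrt(expected)
res$stdres      # standardized residuals
res$p.value     # p-value

# The same statistic computed directly from its definition
chi_sq <- sum((x - res$expected)^2 / res$expected)
chi_sq
```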

From Data

The same analysis follows from the data.

d <- Read("Jackets", quiet=TRUE)
Prop_test(Jacket)
## 
## >>> Chi-squared test for given probabilities  <<< 
## 
## Variable: Jacket 
## 
## >>> Description
## 
##                Lite       Med     Thick
## ---------  --------  --------  --------
## observed        372       342       311
## expected    341.667   341.667   341.667
## residual      1.641     0.018    -1.659
## stdn res      2.010     0.022    -2.032
## 
## >>> Inference
## 
## Chi-square statistic: 5.446 
## Degrees of freedom: 2 
## Hypothesis test of equal population proportions: p-value = 0.066

Independence

Tests of goodness of fit and independence evaluated here rely upon a contingency table of one or two dimensions. Due to the awkwardness of entering a table of frequencies, Prop_test() relies upon computing the contingency table from the data, using the base R chisq.test() function. However, as shown next, there is a way to enter frequencies if comparing just two levels of a categorical variable.

From Input Frequencies

The smokers and patients vectors presented in a previous example together contain the information needed to construct the corresponding full contingency table. In that example, the smokers vector represents the frequencies of patients who smoked.

smokers
## Group1 Group2 Group3 Group4 
##     83     90    129     70

The patients vector represents the total number of patients in each of the four groups.

patients
## [1]  86  93 136  82

The previous analysis of the separate vectors is equivalent to the analysis of the full 2 x 4 contingency table of smokers and non-smokers according to the test of independence illustrated in the next section. Conceptually the tests are distinct, but computationally identical.

The construction of the following contingency table is not part of the input to Prop_test(). It is included here solely to show the equivalence of the information provided by the two vectors and the full contingency table.

cont_tbl <- matrix(c(83, 86-83, 90, 93-90, 129, 136-129, 70, 82-70), nrow=2)
dimnames(cont_tbl) <- list(Smoke = c("Yes", "No"),
                           Group = c("G1","G2","G3","G4"))
addmargins(cont_tbl)
##      Group
## Smoke G1 G2  G3 G4 Sum
##   Yes 83 90 129 70 372
##   No   3  3   7 12  25
##   Sum 86 93 136 82 397

If comparing more than two levels of a categorical variable, and only the frequencies and not the data are available, then follow the form of the previous construction of the contingency table from the given frequencies. Then directly use the base R function chisq.test() to do the analysis.
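
As a sketch of that direct use, the following base R code passes the 2 x 4 table of smokers and non-smokers to chisq.test(). The table is rebuilt here so that the code stands on its own, and the object name res is chosen for illustration. The same call form applies to a table with more rows.

```r
# 2 x 4 contingency table of smokers and non-smokers by group
cont_tbl <- matrix(c(83, 3, 90, 3, 129, 7, 70, 12), nrow = 2,
                   dimnames = list(Smoke = c("Yes", "No"),
                                   Group = c("G1", "G2", "G3", "G4")))

# Pearson chi-square test of independence on the full table
res <- chisq.test(cont_tbl)

res$statistic   # chi-square statistic
res$parameter   # degrees of freedom
res$p.value     # p-value
```

The statistic, degrees of freedom, and p-value should match the earlier test of equal proportions across the four groups.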

From Data

The \(\chi^2\) test of independence is evaluated here for two categorical variables. The first variable listed in this example is the value of the parameter variable, so it does not need the parameter name. The second variable listed must include the parameter name by.

The question for the analysis is whether the observed frequencies of Jacket thickness and Bike ownership are so different from the frequencies expected by the null hypothesis that we conclude the variables are related.

Prop_test(Jacket, by=Bike)
## 
## >>> Pearson's Chi-squared test  <<< 
## 
## Variable: Jacket 
## by: Bike 
## 
## >>> Description
## 
##        Jacket
## Bike    Lite  Med Thick  Sum
##   BMW     89  135   194  418
##   Honda  283  207   117  607
##   Sum    372  342   311 1025
## 
## Cramer's V: 0.319 
## 
## >>> Inference
## 
## Chi-square statistic: 104.083 
## Degrees of freedom: 2 
## Hypothesis test of independence: p-value = 0.000

The result of this test is that the \(p\)-value = 0.000 \(< \alpha=0.05\), so reject the null hypothesis of independence. Conclude that the type of Bike a person rides and the thickness of their Jacket are related.
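
The strength of the relationship reported as Cramer's V can be recovered from the chi-square statistic. The sketch below uses only base R and the frequencies from the table above. It assumes the standard definition of Cramer's V, the square root of chi-square divided by n times the smaller of (rows - 1) and (columns - 1), which reproduces the reported value. The names tbl, res, n, k, and cramers_v are chosen here for illustration.

```r
# Observed frequencies of Jacket thickness by Bike, from the table above
tbl <- matrix(c(89, 283, 135, 207, 194, 117), nrow = 2,
              dimnames = list(Bike = c("BMW", "Honda"),
                              Jacket = c("Lite", "Med", "Thick")))

# Pearson chi-square test of independence
res <- chisq.test(tbl)
res$statistic
res$p.value

# Cramer's V under the standard definition (an assumption of this sketch)
n <- sum(tbl)             # total sample size
k <- min(dim(tbl)) - 1    # smaller of (rows - 1) and (columns - 1)
cramers_v <- sqrt(unname(res$statistic) / (n * k))
cramers_v
```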