This vignette introduces the following functions from the PHEindicatormethods package and provides basic sample code to demonstrate their use. The code is based on the ‘examples’ section of each function's documentation. The vignette does not explain the methods in detail, but the method applied can optionally be output alongside the statistics. For a more detailed explanation, please see the references section of the function documentation.
library(PHEindicatormethods)
library(dplyr)
This vignette covers the following functions available within the first release of the package (v1.0.8) but has been updated to apply to these functions in their latest release versions. If further functions are added to the package in future releases these will be explained elsewhere.
Function | Type | Description |
---|---|---|
phe_proportion | Non-aggregate | Performs a calculation on each row of data (unless data is grouped) |
phe_rate | Non-aggregate | Performs a calculation on each row of data (unless data is grouped) |
phe_mean | Aggregate | Performs a calculation on each grouping set |
phe_dsr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_smr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_isr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
The following code chunk creates a data frame containing observed number of events and populations for 4 geographical areas over 2 time periods that is used later to demonstrate the PHEindicatormethods package functions:
df <- data.frame(
area = rep(c("Area1","Area2","Area3","Area4"), 2),
year = rep(2015:2016, each = 4),
obs = sample(100, 2 * 4, replace = TRUE),
pop = sample(100:200, 2 * 4, replace = TRUE))
df
#> area year obs pop
#> 1 Area1 2015 90 193
#> 2 Area2 2015 94 124
#> 3 Area3 2015 90 137
#> 4 Area4 2015 34 174
#> 5 Area1 2016 82 122
#> 6 Area2 2016 67 198
#> 7 Area3 2016 96 180
#> 8 Area4 2016 32 101
INPUT: The phe_proportion and phe_rate functions take a single data frame as input with columns representing the numerators and denominators for the statistic. Any other columns present will be retained in the output.
OUTPUT: The functions output the original data frame with additional columns appended. By default the additional columns are the proportion or rate, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: The functions also accept additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
Here are some example code chunks to demonstrate these two functions and the arguments that can optionally be specified:
# default proportion
phe_proportion(df, obs, pop)
#> area year obs pop value lowercl uppercl confidence
#> 1 Area1 2015 90 193 0.4663212 0.3972851 0.5366719 95%
#> 2 Area2 2015 94 124 0.7580645 0.6756700 0.8249500 95%
#> 3 Area3 2015 90 137 0.6569343 0.5741342 0.7311736 95%
#> 4 Area4 2015 34 174 0.1954023 0.1433360 0.2606275 95%
#> 5 Area1 2016 82 122 0.6721311 0.5846897 0.7490636 95%
#> 6 Area2 2016 67 198 0.3383838 0.2761117 0.4068077 95%
#> 7 Area3 2016 96 180 0.5333333 0.4605179 0.6047558 95%
#> 8 Area4 2016 32 101 0.3168317 0.2342353 0.4128509 95%
#> statistic method
#> 1 proportion of 1 Wilson
#> 2 proportion of 1 Wilson
#> 3 proportion of 1 Wilson
#> 4 proportion of 1 Wilson
#> 5 proportion of 1 Wilson
#> 6 proportion of 1 Wilson
#> 7 proportion of 1 Wilson
#> 8 proportion of 1 Wilson
# specify confidence level for proportion
phe_proportion(df, obs, pop, confidence=99.8)
#> area year obs pop value lowercl uppercl confidence
#> 1 Area1 2015 90 193 0.4663212 0.3595776 0.5762406 99.8%
#> 2 Area2 2015 94 124 0.7580645 0.6236165 0.8556064 99.8%
#> 3 Area3 2015 90 137 0.6569343 0.5250925 0.7683237 99.8%
#> 4 Area4 2015 34 174 0.1954023 0.1194300 0.3030692 99.8%
#> 5 Area1 2016 82 122 0.6721311 0.5325394 0.7867319 99.8%
#> 6 Area2 2016 67 198 0.3383838 0.2440544 0.4475855 99.8%
#> 7 Area3 2016 96 180 0.5333333 0.4196635 0.6436445 99.8%
#> 8 Area4 2016 32 101 0.3168317 0.1950033 0.4703051 99.8%
#> statistic method
#> 1 proportion of 1 Wilson
#> 2 proportion of 1 Wilson
#> 3 proportion of 1 Wilson
#> 4 proportion of 1 Wilson
#> 5 proportion of 1 Wilson
#> 6 proportion of 1 Wilson
#> 7 proportion of 1 Wilson
#> 8 proportion of 1 Wilson
# specify to output proportions as percentages
phe_proportion(df, obs, pop, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 90 193 46.63212 39.72851 53.66719 95% percentage
#> 2 Area2 2015 94 124 75.80645 67.56700 82.49500 95% percentage
#> 3 Area3 2015 90 137 65.69343 57.41342 73.11736 95% percentage
#> 4 Area4 2015 34 174 19.54023 14.33360 26.06275 95% percentage
#> 5 Area1 2016 82 122 67.21311 58.46897 74.90636 95% percentage
#> 6 Area2 2016 67 198 33.83838 27.61117 40.68077 95% percentage
#> 7 Area3 2016 96 180 53.33333 46.05179 60.47558 95% percentage
#> 8 Area4 2016 32 101 31.68317 23.42353 41.28509 95% percentage
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
# specify confidence level and output proportions as percentages
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 90 193 46.63212 35.95776 57.62406 99.8% percentage
#> 2 Area2 2015 94 124 75.80645 62.36165 85.56064 99.8% percentage
#> 3 Area3 2015 90 137 65.69343 52.50925 76.83237 99.8% percentage
#> 4 Area4 2015 34 174 19.54023 11.94300 30.30692 99.8% percentage
#> 5 Area1 2016 82 122 67.21311 53.25394 78.67319 99.8% percentage
#> 6 Area2 2016 67 198 33.83838 24.40544 44.75855 99.8% percentage
#> 7 Area3 2016 96 180 53.33333 41.96635 64.36445 99.8% percentage
#> 8 Area4 2016 32 101 31.68317 19.50033 47.03051 99.8% percentage
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
# specify confidence level and multiplier for proportion and remove metadata columns
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100, type="standard")
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 90 193 46.63212 35.95776 57.62406
#> 2 Area2 2015 94 124 75.80645 62.36165 85.56064
#> 3 Area3 2015 90 137 65.69343 52.50925 76.83237
#> 4 Area4 2015 34 174 19.54023 11.94300 30.30692
#> 5 Area1 2016 82 122 67.21311 53.25394 78.67319
#> 6 Area2 2016 67 198 33.83838 24.40544 44.75855
#> 7 Area3 2016 96 180 53.33333 41.96635 64.36445
#> 8 Area4 2016 32 101 31.68317 19.50033 47.03051
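The `method` column in the outputs above reports Wilson for proportions. As a rough illustration of what that involves, the following base R sketch computes a Wilson score interval without continuity correction for the first row of the sample data (90 events in a population of 193). The helper `wilson_ci` is hypothetical and written only for this example; it is not a PHEindicatormethods function, and the package's own implementation may differ in detail.

```r
# Hypothetical helper (not part of PHEindicatormethods):
# Wilson score interval, no continuity correction, for x events out of n.
wilson_ci <- function(x, n, confidence = 0.95) {
  z <- qnorm(1 - (1 - confidence) / 2)   # two-sided normal quantile
  p <- x / n                             # observed proportion
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(value = p, lowercl = centre - half, uppercl = centre + half)
}

# Area1, 2015 from the sample data frame: obs = 90, pop = 193
area1 <- wilson_ci(90, 193)
print(round(area1, 7))
```

The result can be compared with the first row of the default `phe_proportion` output shown earlier.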
# default rate
phe_rate(df, obs, pop)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 90 193 46632.12 37496.69 57319.41 95% rate per 100000
#> 2 Area2 2015 94 124 75806.45 61257.75 92768.84 95% rate per 100000
#> 3 Area3 2015 90 137 65693.43 52823.81 80749.24 95% rate per 100000
#> 4 Area4 2015 34 174 19540.23 13530.09 27306.37 95% rate per 100000
#> 5 Area1 2016 82 122 67213.11 53454.85 83430.20 95% rate per 100000
#> 6 Area2 2016 67 198 33838.38 26223.07 42974.20 95% rate per 100000
#> 7 Area3 2016 96 180 53333.33 43199.10 65129.77 95% rate per 100000
#> 8 Area4 2016 32 101 31683.17 21667.52 44728.66 95% rate per 100000
#> method
#> 1 Byars
#> 2 Byars
#> 3 Byars
#> 4 Byars
#> 5 Byars
#> 6 Byars
#> 7 Byars
#> 8 Byars
# specify rate parameters
phe_rate(df, obs, pop, confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 90 193 46.63212 32.89479 63.92121 99.8% rate per 100
#> 2 Area2 2015 94 124 75.80645 53.90614 103.23233 99.8% rate per 100
#> 3 Area3 2015 90 137 65.69343 46.34083 90.04959 99.8% rate per 100
#> 4 Area4 2015 34 174 19.54023 10.77682 32.29409 99.8% rate per 100
#> 5 Area1 2016 82 122 67.21311 46.57199 93.47871 99.8% rate per 100
#> 6 Area2 2016 67 198 33.83838 22.47514 48.67537 99.8% rate per 100
#> 7 Area3 2016 96 180 53.33333 38.07060 72.40174 99.8% rate per 100
#> 8 Area4 2016 32 101 31.68317 17.11577 53.13188 99.8% rate per 100
#> method
#> 1 Byars
#> 2 Byars
#> 3 Byars
#> 4 Byars
#> 5 Byars
#> 6 Byars
#> 7 Byars
#> 8 Byars
# specify rate parameters and remove metadata columns from output
phe_rate(df, obs, pop, type="standard", confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 90 193 46.63212 32.89479 63.92121
#> 2 Area2 2015 94 124 75.80645 53.90614 103.23233
#> 3 Area3 2015 90 137 65.69343 46.34083 90.04959
#> 4 Area4 2015 34 174 19.54023 10.77682 32.29409
#> 5 Area1 2016 82 122 67.21311 46.57199 93.47871
#> 6 Area2 2016 67 198 33.83838 22.47514 48.67537
#> 7 Area3 2016 96 180 53.33333 38.07060 72.40174
#> 8 Area4 2016 32 101 31.68317 17.11577 53.13188
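The rate outputs above report Byars as the method. The sketch below shows Byar's approximation to the Poisson confidence limits in base R, applied to the first row of the sample data with the default multiplier of 100000. The helper `byars_rate` is hypothetical and written only for this example. It assumes Byar's formula is applied directly to every count; the package may treat small counts differently.

```r
# Hypothetical helper (not part of PHEindicatormethods):
# rate with Byar's approximate confidence limits for x events in population n.
byars_rate <- function(x, n, confidence = 0.95, multiplier = 100000) {
  z <- qnorm(1 - (1 - confidence) / 2)
  lower <- x * (1 - 1 / (9 * x) - z / (3 * sqrt(x)))^3
  upper <- (x + 1) * (1 - 1 / (9 * (x + 1)) + z / (3 * sqrt(x + 1)))^3
  # Scale the count and its limits to a rate per `multiplier` population
  c(value = x, lowercl = lower, uppercl = upper) / n * multiplier
}

# Area1, 2015 from the sample data frame: obs = 90, pop = 193
area1_rate <- byars_rate(90, 193)
print(round(area1_rate, 2))
```

The result can be compared with the first row of the default `phe_rate` output shown earlier.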
These functions can also return aggregate data if the input data frames are grouped:
# default proportion - grouped
df %>%
group_by(year) %>%
phe_proportion(obs, pop)
#> # A tibble: 2 x 9
#> year obs pop value lowercl uppercl confidence statistic method
#> <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2015 308 628 0.490 0.452 0.529 95% proportion of 1 Wilson
#> 2 2016 277 601 0.461 0.421 0.501 95% proportion of 1 Wilson
# default rate - grouped
df %>%
group_by(year) %>%
phe_rate(obs, pop)
#> # A tibble: 2 x 9
#> year obs pop value lowercl uppercl confidence statistic method
#> <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2015 308 628 49045. 43720. 54839. 95% rate per 1000~ Byars
#> 2 2016 277 601 46090. 40821. 51850. 95% rate per 1000~ Byars
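In the grouped outputs above, the numerators and denominators are summed within each group before the statistic is calculated. A minimal base R sketch of that aggregation step follows, using the sample values printed earlier (hard-coded here because the original data frame is generated with `sample()`). It reproduces only the totals and point estimates, not the confidence limits.

```r
# Sample data as printed earlier in this vignette
df <- data.frame(
  area = rep(c("Area1", "Area2", "Area3", "Area4"), 2),
  year = rep(2015:2016, each = 4),
  obs  = c(90, 94, 90, 34, 82, 67, 96, 32),
  pop  = c(193, 124, 137, 174, 122, 198, 180, 101))

# Sum numerators and denominators within each year, then divide
totals <- aggregate(cbind(obs, pop) ~ year, data = df, FUN = sum)
totals$value <- totals$obs / totals$pop
print(totals)
```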
The remaining functions aggregate the rows in the input data frame to produce a single statistic. It is also possible to calculate multiple statistics in a single execution of these functions if the input data frame is grouped - for example by indicator ID, geographic area or time period (or all three). The output contains only the grouping variables and the values calculated by the function - any additional unused columns provided in the input data frame will not be retained in the output.
The df test data generated earlier can be used to demonstrate phe_mean:
INPUT: The phe_mean function takes a single data frame as input with a column representing the numbers to be averaged.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values (if applicable), the mean, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: The function also accepts additional arguments to specify the level of confidence and a reduced level of detail to be output.
Here are some example code chunks to demonstrate the phe_mean function and the arguments that can optionally be specified:
# default mean
phe_mean(df,obs)
#> value_sum value_count stdev value lowercl uppercl confidence
#> 1 585 8 26.36793 73.125 51.08086 95.16914 95%
#> statistic method
#> 1 mean Student's t-distribution
# multiple means in a single execution with 99.8% confidence
df %>%
group_by(year) %>%
phe_mean(obs, confidence=0.998)
#> # A tibble: 2 x 10
#> year value_sum value_count stdev value lowercl uppercl confidence
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2015 308 4 28.7 77 -69.7 224. 99.8%
#> 2 2016 277 4 27.5 69.2 -71.3 210. 99.8%
#> # ... with 2 more variables: statistic <chr>, method <chr>
# multiple means in a single execution with 99.8% confidence and data-only output
df %>%
group_by(year) %>%
phe_mean(obs, type = "standard", confidence=0.998)
#> # A tibble: 2 x 7
#> year value_sum value_count stdev value lowercl uppercl
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2015 308 4 28.7 77 -69.7 224.
#> 2 2016 277 4 27.5 69.2 -71.3 210.
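The ungrouped `phe_mean` output above reports Student's t-distribution as the method. As an illustration, the following base R sketch computes a mean with a t-based 95% confidence interval for the eight `obs` values printed earlier. The variable names are chosen for this example only.

```r
# obs values from the sample data frame printed earlier
obs <- c(90, 94, 90, 34, 82, 67, 96, 32)

n      <- length(obs)
t_crit <- qt(0.975, df = n - 1)          # two-sided 95% critical value
half   <- t_crit * sd(obs) / sqrt(n)     # half-width of the interval

mean_ci <- c(value   = mean(obs),
             lowercl = mean(obs) - half,
             uppercl = mean(obs) + half)
print(round(mean_ci, 5))
```

The result can be compared with the default `phe_mean(df, obs)` output shown earlier.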
The following code chunk creates a data frame containing observed number of events and populations by age band for 4 areas, 5 time periods and 2 sexes:
df_std <- data.frame(
area = rep(c("Area1", "Area2", "Area3", "Area4"), each = 19 * 2 * 5),
year = rep(2006:2010, each = 19 * 2),
sex = rep(rep(c("Male", "Female"), each = 19), 5),
ageband = rep(c(0, 5,10,15,20,25,30,35,40,45,
50,55,60,65,70,75,80,85,90), times = 10),
obs = sample(200, 19 * 2 * 5 * 4, replace = TRUE),
pop = sample(10000:20000, 19 * 2 * 5 * 4, replace = TRUE))
head(df_std)
#> area year sex ageband obs pop
#> 1 Area1 2006 Male 0 75 14989
#> 2 Area1 2006 Male 5 78 17379
#> 3 Area1 2006 Male 10 179 10718
#> 4 Area1 2006 Male 15 183 10417
#> 5 Area1 2006 Male 20 111 17387
#> 6 Area1 2006 Male 25 21 19406
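INPUT: The minimum input requirement for the phe_dsr function is a single data frame with columns representing the numerators and denominators for each standardisation category. This is sufficient if the data is already broken down by, and sorted in the same order as, the categories of the default standard population described below.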
The 2013 European Standard Population is provided within the package in vector form (esp2013) and is used by default by this function. Alternative standard populations can be used but must be provided by the user. When the function joins a standard population vector to the input data frame it does this by position so it is important that the data is sorted accordingly. This is a user responsibility.
The function can also accept standard populations provided as a column within the input data frame.
- standard populations provided as a vector - the vector and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order
- standard populations provided as a column within the input data frame - the standard populations can be appended to the input data frame by the user prior to execution of the function - if the data is grouped to generate multiple dsrs then the standard populations will need to be repeated and appended to the data rows for every grouping set.
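Because a standard population vector is matched to the data by position, the sort order of the input changes the result. The base R sketch below illustrates this with a toy example. The three age bands, the weights in `stdpop` and the helper `dsr_by_position` are all hypothetical and chosen only to make the arithmetic easy to follow; they are not the esp2013 values or the package's implementation, and no confidence limits are calculated.

```r
# Hypothetical helper: directly standardised rate, pairing stdpop to the
# data by position (first weight with first row, and so on).
dsr_by_position <- function(obs, pop, stdpop, multiplier = 100000) {
  sum(stdpop * obs / pop) / sum(stdpop) * multiplier
}

# Toy standard population for three age bands, youngest first (assumed values)
stdpop <- c(5000, 3000, 2000)

# Toy data sorted youngest to oldest, matching the order of stdpop
dat <- data.frame(ageband = c(0, 40, 80),
                  obs     = c(10, 20, 40),
                  pop     = c(1000, 1000, 1000))
dsr_sorted <- dsr_by_position(dat$obs, dat$pop, stdpop)

# Same data sorted oldest first: weights now pair with the wrong age bands
shuffled     <- dat[order(-dat$ageband), ]
dsr_shuffled <- dsr_by_position(shuffled$obs, shuffled$pop, stdpop)

print(c(sorted = dsr_sorted, shuffled = dsr_shuffled))
```

The two results differ even though the underlying data are identical, which is why the data must be sorted consistently with the standard population before calling the function.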
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values, the total count, the total population, the dsr, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: If standard populations are being provided as a column within the input data frame then the user must specify this using the stdpoptype argument as the function expects a vector by default. The function also accepts additional arguments to specify the standard populations, the level of confidence, the multiplier and a reduced level of detail to be output.
Here are some example code chunks to demonstrate the phe_dsr function and the arguments that can optionally be specified:
# calculate separate dsrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 2034 274567 745. 711. 781. 95%
#> 2 Area1 2006 Male 1793 294669 649. 617. 682. 95%
#> 3 Area1 2007 Fema~ 2043 287373 742. 708. 777. 95%
#> 4 Area1 2007 Male 2021 290581 678. 647. 710. 95%
#> 5 Area1 2008 Fema~ 1812 302461 588. 560. 618. 95%
#> 6 Area1 2008 Male 1964 292433 668. 638. 700. 95%
#> 7 Area1 2009 Fema~ 2226 281936 806. 770. 843. 95%
#> 8 Area1 2009 Male 1763 284014 606. 575. 637. 95%
#> 9 Area1 2010 Fema~ 1786 287422 617. 587. 649. 95%
#> 10 Area1 2010 Male 2141 292748 766. 733. 801. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate separate dsrs for each area, year and sex and drop metadata fields from output
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, type="standard")
#> # A tibble: 40 x 8
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 2034 274567 745. 711. 781.
#> 2 Area1 2006 Male 1793 294669 649. 617. 682.
#> 3 Area1 2007 Female 2043 287373 742. 708. 777.
#> 4 Area1 2007 Male 2021 290581 678. 647. 710.
#> 5 Area1 2008 Female 1812 302461 588. 560. 618.
#> 6 Area1 2008 Male 1964 292433 668. 638. 700.
#> 7 Area1 2009 Female 2226 281936 806. 770. 843.
#> 8 Area1 2009 Male 1763 284014 606. 575. 637.
#> 9 Area1 2010 Female 1786 287422 617. 587. 649.
#> 10 Area1 2010 Male 2141 292748 766. 733. 801.
#> # ... with 30 more rows
# calculate same specifying standard population in vector form
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 2034 274567 745. 711. 781. 95%
#> 2 Area1 2006 Male 1793 294669 649. 617. 682. 95%
#> 3 Area1 2007 Fema~ 2043 287373 742. 708. 777. 95%
#> 4 Area1 2007 Male 2021 290581 678. 647. 710. 95%
#> 5 Area1 2008 Fema~ 1812 302461 588. 560. 618. 95%
#> 6 Area1 2008 Male 1964 292433 668. 638. 700. 95%
#> 7 Area1 2009 Fema~ 2226 281936 806. 770. 843. 95%
#> 8 Area1 2009 Male 1763 284014 606. 575. 637. 95%
#> 9 Area1 2010 Fema~ 1786 287422 617. 587. 649. 95%
#> 10 Area1 2010 Male 2141 292748 766. 733. 801. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate the same dsrs by appending the standard populations to the data frame
df_std %>%
mutate(refpop = rep(esp2013,40)) %>%
group_by(area, year, sex) %>%
phe_dsr(obs,pop, stdpop=refpop, stdpoptype="field")
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 2034 274567 745. 711. 781. 95%
#> 2 Area1 2006 Male 1793 294669 649. 617. 682. 95%
#> 3 Area1 2007 Fema~ 2043 287373 742. 708. 777. 95%
#> 4 Area1 2007 Male 2021 290581 678. 647. 710. 95%
#> 5 Area1 2008 Fema~ 1812 302461 588. 560. 618. 95%
#> 6 Area1 2008 Male 1964 292433 668. 638. 700. 95%
#> 7 Area1 2009 Fema~ 2226 281936 806. 770. 843. 95%
#> 8 Area1 2009 Male 1763 284014 606. 575. 637. 95%
#> 9 Area1 2010 Fema~ 1786 287422 617. 587. 649. 95%
#> 10 Area1 2010 Male 2141 292748 766. 733. 801. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate for under 75s by filtering out records for 75+ from input data frame and standard population
df_std %>%
filter(ageband <= 70) %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013[1:15])
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1537 212682 726. 689. 764. 95%
#> 2 Area1 2006 Male 1483 234729 671. 637. 707. 95%
#> 3 Area1 2007 Fema~ 1587 223775 737. 701. 776. 95%
#> 4 Area1 2007 Male 1660 237251 702. 668. 737. 95%
#> 5 Area1 2008 Fema~ 1441 248228 582. 552. 614. 95%
#> 6 Area1 2008 Male 1688 238220 687. 654. 721. 95%
#> 7 Area1 2009 Fema~ 1800 223206 823. 784. 863. 95%
#> 8 Area1 2009 Male 1310 219212 601. 568. 635. 95%
#> 9 Area1 2010 Fema~ 1278 231722 569. 538. 602. 95%
#> 10 Area1 2010 Male 1785 229198 788. 751. 825. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate separate dsrs for persons for each area and year
df_std %>%
group_by(area, year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(area, year) %>%
phe_dsr(obs,pop)
#> # A tibble: 20 x 10
#> # Groups: area [4]
#> area year total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 3827 569236 669. 647. 692. 95%
#> 2 Area1 2007 4064 577954 706. 683. 730. 95%
#> 3 Area1 2008 3776 594894 633. 612. 655. 95%
#> 4 Area1 2009 3989 565950 699. 675. 722. 95%
#> 5 Area1 2010 3927 580170 681. 658. 704. 95%
#> 6 Area2 2006 3721 553521 648. 626. 671. 95%
#> 7 Area2 2007 4343 559070 791. 767. 817. 95%
#> 8 Area2 2008 4086 530565 793. 767. 818. 95%
#> 9 Area2 2009 3997 547794 721. 697. 745. 95%
#> 10 Area2 2010 3802 549293 740. 715. 765. 95%
#> 11 Area3 2006 4464 577663 769. 746. 793. 95%
#> 12 Area3 2007 3778 540250 742. 718. 767. 95%
#> 13 Area3 2008 3853 543948 752. 727. 777. 95%
#> 14 Area3 2009 3428 567772 600. 579. 622. 95%
#> 15 Area3 2010 4159 547879 726. 703. 750. 95%
#> 16 Area4 2006 3413 584511 627. 605. 649. 95%
#> 17 Area4 2007 3656 548666 662. 640. 685. 95%
#> 18 Area4 2008 3938 552582 693. 669. 717. 95%
#> 19 Area4 2009 3755 549062 670. 647. 693. 95%
#> 20 Area4 2010 3021 575179 543. 522. 564. 95%
#> # ... with 2 more variables: statistic <chr>, method <chr>
INPUT: Unlike the phe_dsr function, there is no default standard or reference data for the phe_smr and phe_isr functions. These functions take a single data frame as input, with columns representing the numerators and denominators for each standardisation category, plus reference numerators and denominators for each standardisation category.
The reference data can either be provided in a separate data frame/vectors or as columns within the input data frame:
- reference data provided as a data frame or as vectors - the data frame/vectors and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order.
- reference data provided as columns within the input data frame - the reference numerators and denominators can be appended to the input data frame prior to execution of the function - if the data is grouped to generate multiple smrs/isrs then the reference data will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the functions output one row per grouping set containing the grouping variable values, the observed and expected counts, the reference rate (isr only), the smr or isr, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: If reference data are provided as columns within the input data frame, the user must specify this using the refpoptype argument, as the functions expect vectors by default. The functions also accept additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
The following code chunk creates a data frame containing the reference data - this example uses the all area data for persons in the baseline year:
df_ref <- df_std %>%
filter(year == 2006) %>%
group_by(ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop))
head(df_ref)
#> # A tibble: 6 x 3
#> ageband obs pop
#> <dbl> <int> <int>
#> 1 0 912 116087
#> 2 5 746 117691
#> 3 10 929 122302
#> 4 15 928 113389
#> 5 20 699 121968
#> 6 25 623 129101
Here are some example code chunks to demonstrate the phe_smr function and the arguments that can optionally be specified:
# calculate separate smrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex observed expected value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 2034 1851. 1.10 1.05 1.15 95%
#> 2 Area1 2006 Male 1793 1973. 0.909 0.867 0.952 95%
#> 3 Area1 2007 Fema~ 2043 1940. 1.05 1.01 1.10 95%
#> 4 Area1 2007 Male 2021 1959. 1.03 0.987 1.08 95%
#> 5 Area1 2008 Fema~ 1812 2045. 0.886 0.846 0.928 95%
#> 6 Area1 2008 Male 1964 1963. 1.00 0.957 1.05 95%
#> 7 Area1 2009 Fema~ 2226 1930. 1.15 1.11 1.20 95%
#> 8 Area1 2009 Male 1763 1934. 0.912 0.870 0.955 95%
#> 9 Area1 2010 Fema~ 1786 1963. 0.910 0.868 0.953 95%
#> 10 Area1 2010 Male 2141 1988. 1.08 1.03 1.12 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate the same smrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex observed expected value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 2034 1851. 1.10 1.05 1.15 95%
#> 2 Area1 2006 Male 1793 1973. 0.909 0.867 0.952 95%
#> 3 Area1 2007 Fema~ 2043 1940. 1.05 1.01 1.10 95%
#> 4 Area1 2007 Male 2021 1959. 1.03 0.987 1.08 95%
#> 5 Area1 2008 Fema~ 1812 2045. 0.886 0.846 0.928 95%
#> 6 Area1 2008 Male 1964 1963. 1.00 0.957 1.05 95%
#> 7 Area1 2009 Fema~ 2226 1930. 1.15 1.11 1.20 95%
#> 8 Area1 2009 Male 1763 1934. 0.912 0.870 0.955 95%
#> 9 Area1 2010 Fema~ 1786 1963. 0.910 0.868 0.953 95%
#> 10 Area1 2010 Male 2141 1988. 1.08 1.03 1.12 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate separate smrs for each year and drop metadata columns from output
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 6
#> year observed expected value lowercl uppercl
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 15425 15425 1 0.984 1.02
#> 2 2007 15841 15077. 1.05 1.03 1.07
#> 3 2008 15653 14971. 1.05 1.03 1.06
#> 4 2009 15169 15094. 1.00 0.989 1.02
#> 5 2010 14909 15237. 0.978 0.963 0.994
The phe_isr function works in exactly the same way, but instead of expressing the result as the ratio of observed to expected events, it expresses the result as a rate (the ratio multiplied by the reference rate) and also outputs the reference rate. Here are some examples:
# calculate separate isrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex observed expected ref_rate value lowercl uppercl
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Fema~ 2034 1851. 675. 742. 710. 775.
#> 2 Area1 2006 Male 1793 1973. 675. 613. 585. 643.
#> 3 Area1 2007 Fema~ 2043 1940. 675. 711. 680. 742.
#> 4 Area1 2007 Male 2021 1959. 675. 696. 666. 727.
#> 5 Area1 2008 Fema~ 1812 2045. 675. 598. 571. 626.
#> 6 Area1 2008 Male 1964 1963. 675. 675. 646. 706.
#> 7 Area1 2009 Fema~ 2226 1930. 675. 779. 747. 812.
#> 8 Area1 2009 Male 1763 1934. 675. 616. 587. 645.
#> 9 Area1 2010 Fema~ 1786 1963. 675. 614. 586. 643.
#> 10 Area1 2010 Male 2141 1988. 675. 727. 697. 758.
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> # statistic <chr>, method <chr>
# calculate the same isrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex observed expected ref_rate value lowercl uppercl
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Fema~ 2034 1851. 675. 742. 710. 775.
#> 2 Area1 2006 Male 1793 1973. 675. 613. 585. 643.
#> 3 Area1 2007 Fema~ 2043 1940. 675. 711. 680. 742.
#> 4 Area1 2007 Male 2021 1959. 675. 696. 666. 727.
#> 5 Area1 2008 Fema~ 1812 2045. 675. 598. 571. 626.
#> 6 Area1 2008 Male 1964 1963. 675. 675. 646. 706.
#> 7 Area1 2009 Fema~ 2226 1930. 675. 779. 747. 812.
#> 8 Area1 2009 Male 1763 1934. 675. 616. 587. 645.
#> 9 Area1 2010 Fema~ 1786 1963. 675. 614. 586. 643.
#> 10 Area1 2010 Male 2141 1988. 675. 727. 697. 758.
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> # statistic <chr>, method <chr>
# calculate separate isrs for each year and drop metadata columns from output
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 7
#> year observed expected ref_rate value lowercl uppercl
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 15425 15425 675. 675. 664. 686.
#> 2 2007 15841 15077. 675. 709. 698. 720.
#> 3 2008 15653 14971. 675. 706. 695. 717.
#> 4 2009 15169 15094. 675. 678. 668. 689.
#> 5 2010 14909 15237. 675. 661. 650. 671.
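The relationship between the `observed`, `expected`, `ref_rate` and `value` columns in the smr and isr outputs above can be sketched in base R. The toy data and variable names below are hypothetical and chosen to keep the arithmetic simple. Only the point estimates are calculated, and the multiplier of 100000 is an assumption made to match the scale of the isr outputs shown.

```r
# Toy data for three standardisation categories (assumed values)
obs    <- c(10, 20, 40)          # observed events in the area of interest
pop    <- c(1000, 1000, 1000)    # population in the area of interest
refobs <- c(20, 30, 50)          # events in the reference population
refpop <- c(2000, 2000, 2000)    # reference population

# Expected events: apply the reference rates to the area's population
observed <- sum(obs)
expected <- sum(pop * refobs / refpop)

# smr: ratio of observed to expected events
smr <- observed / expected

# isr: the same ratio rescaled by the overall reference rate
multiplier <- 100000
ref_rate   <- sum(refobs) / sum(refpop) * multiplier
isr        <- smr * ref_rate

print(c(observed = observed, expected = expected,
        smr = smr, ref_rate = ref_rate, isr = isr))
```

In the 2006 rows of the yearly outputs above, observed equals expected because the reference data are the 2006 totals, so the smr is 1 and the isr equals the reference rate.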