This vignette introduces the following functions from the PHEindicatormethods package and provides basic sample code to demonstrate their execution. The code included is based on the code provided within the ‘examples’ section of the function documentation. This vignette does not explain the methods applied in detail but these can (optionally) be output alongside the statistics or for a more detailed explanation, please see the references section of the function documentation.
library(PHEindicatormethods)
library(dplyr)
This vignette covers the following functions available within the first release of the package (v1.0.8) but has been updated to apply to these functions in their latest release versions. If further functions are added to the package in future releases these will be explained elsewhere.
Function | Type | Description |
---|---|---|
phe_proportion | Non-aggregate | Performs a calculation on each row of data (unless data is grouped) |
phe_rate | Non-aggregate | Performs a calculation on each row of data (unless data is grouped) |
phe_mean | Aggregate | Performs a calculation on each grouping set |
phe_dsr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_smr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_isr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
The following code chunk creates a data frame containing observed number of events and populations for 4 geographical areas over 2 time periods that is used later to demonstrate the PHEindicatormethods package functions:
df <- data.frame(
area = rep(c("Area1","Area2","Area3","Area4"), 2),
year = rep(2015:2016, each = 4),
obs = sample(100, 2 * 4, replace = TRUE),
pop = sample(100:200, 2 * 4, replace = TRUE))
df
#> area year obs pop
#> 1 Area1 2015 20 178
#> 2 Area2 2015 34 139
#> 3 Area3 2015 99 110
#> 4 Area4 2015 89 129
#> 5 Area1 2016 71 104
#> 6 Area2 2016 24 108
#> 7 Area3 2016 100 199
#> 8 Area4 2016 77 196
INPUT: The phe_proportion and phe_rate functions take a single data frame as input with columns representing the numerators and denominators for the statistic. Any other columns present will be retained in the output.
OUTPUT: The functions output the original data frame with additional columns appended. By default the additional columns are the proportion or rate, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: The functions also accept additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
Here are some example code chunks to demonstrate these two functions and the arguments that can optionally be specified
# default proportion
phe_proportion(df, obs, pop)
#> area year obs pop value lowercl uppercl confidence
#> 1 Area1 2015 20 178 0.1123596 0.0739225 0.1671747 95%
#> 2 Area2 2015 34 139 0.2446043 0.1806468 0.3222986 95%
#> 3 Area3 2015 99 110 0.9000000 0.8297650 0.9432399 95%
#> 4 Area4 2015 89 129 0.6899225 0.6055856 0.7632751 95%
#> 5 Area1 2016 71 104 0.6826923 0.5881007 0.7642685 95%
#> 6 Area2 2016 24 108 0.2222222 0.1541255 0.3094008 95%
#> 7 Area3 2016 100 199 0.5025126 0.4336577 0.5712723 95%
#> 8 Area4 2016 77 196 0.3928571 0.3271730 0.4626604 95%
#> statistic method
#> 1 proportion of 1 Wilson
#> 2 proportion of 1 Wilson
#> 3 proportion of 1 Wilson
#> 4 proportion of 1 Wilson
#> 5 proportion of 1 Wilson
#> 6 proportion of 1 Wilson
#> 7 proportion of 1 Wilson
#> 8 proportion of 1 Wilson
# specify confidence level for proportion
phe_proportion(df, obs, pop, confidence=99.8)
#> area year obs pop value lowercl uppercl confidence
#> 1 Area1 2015 20 178 0.1123596 0.05815251 0.2060419 99.8%
#> 2 Area2 2015 34 139 0.2446043 0.15080575 0.3712392 99.8%
#> 3 Area3 2015 99 110 0.9000000 0.77743894 0.9586576 99.8%
#> 4 Area4 2015 89 129 0.6899225 0.55469910 0.7989650 99.8%
#> 5 Area1 2016 71 104 0.6826923 0.53148183 0.8031739 99.8%
#> 6 Area2 2016 24 108 0.2222222 0.12416324 0.3654136 99.8%
#> 7 Area3 2016 100 199 0.5025126 0.39540552 0.6093895 99.8%
#> 8 Area4 2016 77 196 0.3928571 0.29244940 0.5032203 99.8%
#> statistic method
#> 1 proportion of 1 Wilson
#> 2 proportion of 1 Wilson
#> 3 proportion of 1 Wilson
#> 4 proportion of 1 Wilson
#> 5 proportion of 1 Wilson
#> 6 proportion of 1 Wilson
#> 7 proportion of 1 Wilson
#> 8 proportion of 1 Wilson
# specify to output proportions as percentages
phe_proportion(df, obs, pop, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 20 178 11.23596 7.39225 16.71747 95% percentage
#> 2 Area2 2015 34 139 24.46043 18.06468 32.22986 95% percentage
#> 3 Area3 2015 99 110 90.00000 82.97650 94.32399 95% percentage
#> 4 Area4 2015 89 129 68.99225 60.55856 76.32751 95% percentage
#> 5 Area1 2016 71 104 68.26923 58.81007 76.42685 95% percentage
#> 6 Area2 2016 24 108 22.22222 15.41255 30.94008 95% percentage
#> 7 Area3 2016 100 199 50.25126 43.36577 57.12723 95% percentage
#> 8 Area4 2016 77 196 39.28571 32.71730 46.26604 95% percentage
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
# specify level of detail to output for proportion
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 20 178 11.23596 5.815251 20.60419 99.8% percentage
#> 2 Area2 2015 34 139 24.46043 15.080575 37.12392 99.8% percentage
#> 3 Area3 2015 99 110 90.00000 77.743894 95.86576 99.8% percentage
#> 4 Area4 2015 89 129 68.99225 55.469910 79.89650 99.8% percentage
#> 5 Area1 2016 71 104 68.26923 53.148183 80.31739 99.8% percentage
#> 6 Area2 2016 24 108 22.22222 12.416324 36.54136 99.8% percentage
#> 7 Area3 2016 100 199 50.25126 39.540552 60.93895 99.8% percentage
#> 8 Area4 2016 77 196 39.28571 29.244940 50.32203 99.8% percentage
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
# specify level of detail to output for proportion and remove metadata columns
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100, type="standard")
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 20 178 11.23596 5.815251 20.60419
#> 2 Area2 2015 34 139 24.46043 15.080575 37.12392
#> 3 Area3 2015 99 110 90.00000 77.743894 95.86576
#> 4 Area4 2015 89 129 68.99225 55.469910 79.89650
#> 5 Area1 2016 71 104 68.26923 53.148183 80.31739
#> 6 Area2 2016 24 108 22.22222 12.416324 36.54136
#> 7 Area3 2016 100 199 50.25126 39.540552 60.93895
#> 8 Area4 2016 77 196 39.28571 29.244940 50.32203
# default rate
phe_rate(df, obs, pop)
#> area year obs pop value lowercl uppercl confidence
#> 1 Area1 2015 20 178 11235.96 6860.353 17353.87 95%
#> 2 Area2 2015 34 139 24460.43 16936.943 34182.07 95%
#> 3 Area3 2015 99 110 90000.00 73145.834 109572.81 95%
#> 4 Area4 2015 89 129 68992.25 55404.874 84901.82 95%
#> 5 Area1 2016 71 104 68269.23 53316.611 86113.57 95%
#> 6 Area2 2016 24 108 22222.22 14234.001 33066.31 95%
#> 7 Area3 2016 100 199 50251.26 40885.480 61119.57 95%
#> 8 Area4 2016 77 196 39285.71 31002.547 49101.04 95%
#> statistic method
#> 1 rate per 100000 Byars
#> 2 rate per 100000 Byars
#> 3 rate per 100000 Byars
#> 4 rate per 100000 Byars
#> 5 rate per 100000 Byars
#> 6 rate per 100000 Byars
#> 7 rate per 100000 Byars
#> 8 rate per 100000 Byars
# specify rate parameters
phe_rate(df, obs, pop, confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 20 178 11.23596 5.01281 21.39609 99.8% rate per 100
#> 2 Area2 2015 34 139 24.46043 13.49041 40.42569 99.8% rate per 100
#> 3 Area3 2015 99 110 90.00000 64.59850 121.62697 99.8% rate per 100
#> 4 Area4 2015 89 129 68.99225 48.56588 94.73311 99.8% rate per 100
#> 5 Area1 2016 71 104 68.26923 45.92149 97.22584 99.8% rate per 100
#> 6 Area2 2016 24 108 22.22222 10.75409 40.15731 99.8% rate per 100
#> 7 Area3 2016 100 199 50.25126 36.13251 67.81087 99.8% rate per 100
#> 8 Area4 2016 77 196 39.28571 26.87893 55.19594 99.8% rate per 100
#> method
#> 1 Byars
#> 2 Byars
#> 3 Byars
#> 4 Byars
#> 5 Byars
#> 6 Byars
#> 7 Byars
#> 8 Byars
# specify rate parameters and reduce columns output and remove metadata columns
phe_rate(df, obs, pop, type="standard", confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 20 178 11.23596 5.01281 21.39609
#> 2 Area2 2015 34 139 24.46043 13.49041 40.42569
#> 3 Area3 2015 99 110 90.00000 64.59850 121.62697
#> 4 Area4 2015 89 129 68.99225 48.56588 94.73311
#> 5 Area1 2016 71 104 68.26923 45.92149 97.22584
#> 6 Area2 2016 24 108 22.22222 10.75409 40.15731
#> 7 Area3 2016 100 199 50.25126 36.13251 67.81087
#> 8 Area4 2016 77 196 39.28571 26.87893 55.19594
These functions can also return aggregate data if the input dataframes are grouped:
# default proportion - grouped
df %>%
group_by(year) %>%
phe_proportion(obs, pop)
#> # A tibble: 2 x 9
#> year obs pop value lowercl uppercl confidence statistic method
#> <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2015 242 556 0.435 0.395 0.477 95% proportion of 1 Wilson
#> 2 2016 272 607 0.448 0.409 0.488 95% proportion of 1 Wilson
# default rate - grouped
df %>%
group_by(year) %>%
phe_rate(obs, pop)
#> # A tibble: 2 x 9
#> year obs pop value lowercl uppercl confidence statistic method
#> <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2015 242 556 43525. 38213. 49369. 95% rate per 1000~ Byars
#> 2 2016 272 607 44811. 39643. 50465. 95% rate per 1000~ Byars
The remaining functions aggregate the rows in the input data frame to produce a single statistic. It is also possible to calculate multiple statistics in a single execution of these functions if the input data frame is grouped - for example by indicator ID, geographic area or time period (or all three). The output contains only the grouping variables and the values calculated by the function - any additional unused columns provided in the input data frame will not be retained in the output.
The df test data generated earlier can be used to demonstrate phe_mean:
INPUT: The phe_mean function take a single data frame as input with a column representing the numbers to be averaged.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values (if applicable), the mean, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: The function also accepts additional arguments to specify the level of confidence and a reduced level of detail to be output.
Here are some example code chunks to demonstrate the phe_mean function and the arguments that can optionally be specified
# default mean
phe_mean(df,obs)
#> value_sum value_count stdev value lowercl uppercl confidence
#> 1 514 8 33.37985 64.25 36.34375 92.15625 95%
#> statistic method
#> 1 mean Student's t-distribution
# multiple means in a single execution with 99.8% confidence
df %>%
group_by(year) %>%
phe_mean(obs, confidence=0.998)
#> # A tibble: 2 x 10
#> year value_sum value_count stdev value lowercl uppercl confidence
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2015 242 4 39.3 60.5 -140. 261. 99.8%
#> 2 2016 272 4 31.9 68 -94.8 231. 99.8%
#> # ... with 2 more variables: statistic <chr>, method <chr>
# multiple means in a single execution with 99.8% confidence and data-only output
df %>%
group_by(year) %>%
phe_mean(obs, type = "standard", confidence=0.998)
#> # A tibble: 2 x 7
#> year value_sum value_count stdev value lowercl uppercl
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2015 242 4 39.3 60.5 -140. 261.
#> 2 2016 272 4 31.9 68 -94.8 231.
The following code chunk creates a data frame containing observed number of events and populations by age band for 4 areas, 5 time periods and 2 sexes:
df_std <- data.frame(
area = rep(c("Area1", "Area2", "Area3", "Area4"), each = 19 * 2 * 5),
year = rep(2006:2010, each = 19 * 2),
sex = rep(rep(c("Male", "Female"), each = 19), 5),
ageband = rep(c(0, 5,10,15,20,25,30,35,40,45,
50,55,60,65,70,75,80,85,90), times = 10),
obs = sample(200, 19 * 2 * 5 * 4, replace = TRUE),
pop = sample(10000:20000, 19 * 2 * 5 * 4, replace = TRUE))
head(df_std)
#> area year sex ageband obs pop
#> 1 Area1 2006 Male 0 9 13434
#> 2 Area1 2006 Male 5 87 11372
#> 3 Area1 2006 Male 10 173 17767
#> 4 Area1 2006 Male 15 126 14290
#> 5 Area1 2006 Male 20 126 10457
#> 6 Area1 2006 Male 25 48 15674
INPUT: The minimum input requirement for the phe_dsr function is a single data frame with columns representing the numerators and denominators for each standardisation category. This is sufficient if the data is:
The 2013 European Standard Population is provided within the package in vector form (esp2013) and is used by default by this function. Alternative standard populations can be used but must be provided by the user. When the function joins a standard population vector to the input data frame it does this by position so it is important that the data is sorted accordingly. This is a user responsibility.
The function can also accept standard populations provided as a column within the input data frame.
standard populations provided as a vector - the vector and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order
standard populations provided as a column within the input data frame - the standard populations can be appended to the input data frame by the user prior to execution of the function - if the data is grouped to generate multiple dsrs then the standard populations will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values, the total count, the total population, the dsr, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: If standard populations are being provided as a column within the input data frame then the user must specify this using the stdpoptype argument as the function expects a vector by default. The function also accepts additional arguments to specify the standard populations, the level of confidence, the multiplier and a reduced level of detail to be output.
Here are some example code chunks to demonstrate the phe_dsr function and the arguments that can optionally be specified
# calculate separate dsrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 2370 285858 858. 821. 895. 95%
#> 2 Area1 2006 Male 2120 283503 706. 673. 739. 95%
#> 3 Area1 2007 Fema~ 1738 277490 636. 605. 669. 95%
#> 4 Area1 2007 Male 1799 268074 638. 605. 671. 95%
#> 5 Area1 2008 Fema~ 1529 298730 531. 504. 560. 95%
#> 6 Area1 2008 Male 1870 294577 682. 650. 716. 95%
#> 7 Area1 2009 Fema~ 2156 278258 800. 764. 838. 95%
#> 8 Area1 2009 Male 1978 272909 749. 714. 786. 95%
#> 9 Area1 2010 Fema~ 1961 306239 688. 656. 722. 95%
#> 10 Area1 2010 Male 1808 287200 597. 567. 628. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate separate dsrs for each area, year and sex and drop metadata fields from output
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, type="standard")
#> # A tibble: 40 x 8
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 2370 285858 858. 821. 895.
#> 2 Area1 2006 Male 2120 283503 706. 673. 739.
#> 3 Area1 2007 Female 1738 277490 636. 605. 669.
#> 4 Area1 2007 Male 1799 268074 638. 605. 671.
#> 5 Area1 2008 Female 1529 298730 531. 504. 560.
#> 6 Area1 2008 Male 1870 294577 682. 650. 716.
#> 7 Area1 2009 Female 2156 278258 800. 764. 838.
#> 8 Area1 2009 Male 1978 272909 749. 714. 786.
#> 9 Area1 2010 Female 1961 306239 688. 656. 722.
#> 10 Area1 2010 Male 1808 287200 597. 567. 628.
#> # ... with 30 more rows
# calculate same specifying standard population in vector form
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 2370 285858 858. 821. 895. 95%
#> 2 Area1 2006 Male 2120 283503 706. 673. 739. 95%
#> 3 Area1 2007 Fema~ 1738 277490 636. 605. 669. 95%
#> 4 Area1 2007 Male 1799 268074 638. 605. 671. 95%
#> 5 Area1 2008 Fema~ 1529 298730 531. 504. 560. 95%
#> 6 Area1 2008 Male 1870 294577 682. 650. 716. 95%
#> 7 Area1 2009 Fema~ 2156 278258 800. 764. 838. 95%
#> 8 Area1 2009 Male 1978 272909 749. 714. 786. 95%
#> 9 Area1 2010 Fema~ 1961 306239 688. 656. 722. 95%
#> 10 Area1 2010 Male 1808 287200 597. 567. 628. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate the same dsrs by appending the standard populations to the data frame
df_std %>%
mutate(refpop = rep(esp2013,40)) %>%
group_by(area, year, sex) %>%
phe_dsr(obs,pop, stdpop=refpop, stdpoptype="field")
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 2370 285858 858. 821. 895. 95%
#> 2 Area1 2006 Male 2120 283503 706. 673. 739. 95%
#> 3 Area1 2007 Fema~ 1738 277490 636. 605. 669. 95%
#> 4 Area1 2007 Male 1799 268074 638. 605. 671. 95%
#> 5 Area1 2008 Fema~ 1529 298730 531. 504. 560. 95%
#> 6 Area1 2008 Male 1870 294577 682. 650. 716. 95%
#> 7 Area1 2009 Fema~ 2156 278258 800. 764. 838. 95%
#> 8 Area1 2009 Male 1978 272909 749. 714. 786. 95%
#> 9 Area1 2010 Fema~ 1961 306239 688. 656. 722. 95%
#> 10 Area1 2010 Male 1808 287200 597. 567. 628. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate for under 75s by filtering out records for 75+ from input data frame and standard population
df_std %>%
filter(ageband <= 70) %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013[1:15])
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 2000 229818 896. 856. 937. 95%
#> 2 Area1 2006 Male 1541 220982 688. 653. 724. 95%
#> 3 Area1 2007 Fema~ 1465 221012 660. 625. 695. 95%
#> 4 Area1 2007 Male 1185 205276 614. 579. 650. 95%
#> 5 Area1 2008 Fema~ 1234 237581 519. 490. 549. 95%
#> 6 Area1 2008 Male 1462 228084 666. 631. 702. 95%
#> 7 Area1 2009 Fema~ 1693 224768 793. 754. 833. 95%
#> 8 Area1 2009 Male 1606 214973 777. 738. 816. 95%
#> 9 Area1 2010 Fema~ 1599 236802 711. 676. 748. 95%
#> 10 Area1 2010 Male 1191 232198 558. 526. 591. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate separate dsrs for persons for each area and year)
df_std %>%
group_by(area, year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(area, year) %>%
phe_dsr(obs,pop)
#> # A tibble: 20 x 10
#> # Groups: area [4]
#> area year total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 4490 569361 781. 757. 806. 95%
#> 2 Area1 2007 3537 545564 631. 609. 654. 95%
#> 3 Area1 2008 3399 593307 602. 581. 623. 95%
#> 4 Area1 2009 4134 551167 760. 736. 786. 95%
#> 5 Area1 2010 3769 593439 623. 602. 645. 95%
#> 6 Area2 2006 4032 576854 710. 687. 734. 95%
#> 7 Area2 2007 3900 559042 666. 643. 689. 95%
#> 8 Area2 2008 3920 575995 707. 684. 730. 95%
#> 9 Area2 2009 3758 603377 580. 561. 600. 95%
#> 10 Area2 2010 3618 556090 677. 655. 701. 95%
#> 11 Area3 2006 4118 553110 766. 742. 791. 95%
#> 12 Area3 2007 4035 578825 722. 699. 746. 95%
#> 13 Area3 2008 3771 567526 700. 677. 724. 95%
#> 14 Area3 2009 3681 572989 615. 595. 637. 95%
#> 15 Area3 2010 3916 610865 649. 628. 672. 95%
#> 16 Area4 2006 4041 529902 817. 790. 844. 95%
#> 17 Area4 2007 3283 594363 518. 500. 538. 95%
#> 18 Area4 2008 3924 548693 740. 715. 765. 95%
#> 19 Area4 2009 3338 609596 539. 519. 559. 95%
#> 20 Area4 2010 3355 599294 534. 515. 554. 95%
#> # ... with 2 more variables: statistic <chr>, method <chr>
INPUT: Unlike the phe_dsr function, there is no default standard or reference data for the phe_smr and phe_isr functions. These functions take a single data frame as input, with columns representing the numerators and denominators for each standardisation category, plus reference numerators and denominators for each standardisation category.
The reference data can either be provided in a separate data frame/vectors or as columns within the input data frame:
reference data provided as a data frame or as vectors - the data frame/vectors and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order.
reference data provided as columns within the input data frame - the reference numerators and denominators can be appended to the input data frame prior to execution of the function - if the data is grouped to generate multiple smrs/isrs then the reference data will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the functions output one row per grouping set containing the grouping variable values, the observed and expected counts, the reference rate (isr only), the smr or isr, the lower 95% confidence limit, and the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: If reference data are being provided as columns within the input data frame then the user must specify this as the function expects vectors by default. The function also accepts additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
The following code chunk creates a data frame containing the reference data - this example uses the all area data for persons in the baseline year:
df_ref <- df_std %>%
filter(year == 2006) %>%
group_by(ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop))
head(df_ref)
#> # A tibble: 6 x 3
#> ageband obs pop
#> <dbl> <int> <int>
#> 1 0 614 107211
#> 2 5 824 123370
#> 3 10 943 118207
#> 4 15 736 110219
#> 5 20 1060 106109
#> 6 25 857 110917
Here are some example code chunks to demonstrate the phe_smr function and the arguments that can optionally be specified
# calculate separate smrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex observed expected value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 2370 2147. 1.10 1.06 1.15 95%
#> 2 Area1 2006 Male 2120 2132. 0.994 0.952 1.04 95%
#> 3 Area1 2007 Fema~ 1738 2077. 0.837 0.798 0.877 95%
#> 4 Area1 2007 Male 1799 2023. 0.889 0.849 0.931 95%
#> 5 Area1 2008 Fema~ 1529 2241. 0.682 0.649 0.717 95%
#> 6 Area1 2008 Male 1870 2215. 0.844 0.807 0.884 95%
#> 7 Area1 2009 Fema~ 2156 2071. 1.04 0.998 1.09 95%
#> 8 Area1 2009 Male 1978 2050. 0.965 0.923 1.01 95%
#> 9 Area1 2010 Fema~ 1961 2296. 0.854 0.817 0.893 95%
#> 10 Area1 2010 Male 1808 2145. 0.843 0.804 0.883 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate the same smrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex observed expected value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 2370 2147. 1.10 1.06 1.15 95%
#> 2 Area1 2006 Male 2120 2132. 0.994 0.952 1.04 95%
#> 3 Area1 2007 Fema~ 1738 2077. 0.837 0.798 0.877 95%
#> 4 Area1 2007 Male 1799 2023. 0.889 0.849 0.931 95%
#> 5 Area1 2008 Fema~ 1529 2241. 0.682 0.649 0.717 95%
#> 6 Area1 2008 Male 1870 2215. 0.844 0.807 0.884 95%
#> 7 Area1 2009 Fema~ 2156 2071. 1.04 0.998 1.09 95%
#> 8 Area1 2009 Male 1978 2050. 0.965 0.923 1.01 95%
#> 9 Area1 2010 Fema~ 1961 2296. 0.854 0.817 0.893 95%
#> 10 Area1 2010 Male 1808 2145. 0.843 0.804 0.883 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate separate smrs for each year and drop metadata columns from output
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 6
#> year observed expected value lowercl uppercl
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 16681 16681 1 0.985 1.02
#> 2 2007 14755 17162. 0.860 0.846 0.874
#> 3 2008 15014 17156. 0.875 0.861 0.889
#> 4 2009 14911 17520. 0.851 0.837 0.865
#> 5 2010 14658 17714. 0.827 0.814 0.841
The phe_isr function works exactly the same way but instead of expressing the result as a ratio of the observed and expected rates the result is expressed as a rate and the reference rate is also provided. Here are some examples:
# calculate separate isrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex observed expected ref_rate value lowercl uppercl
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Fema~ 2370 2147. 748. 826. 793. 860.
#> 2 Area1 2006 Male 2120 2132. 748. 744. 713. 776.
#> 3 Area1 2007 Fema~ 1738 2077. 748. 626. 597. 656.
#> 4 Area1 2007 Male 1799 2023. 748. 665. 635. 697.
#> 5 Area1 2008 Fema~ 1529 2241. 748. 511. 485. 537.
#> 6 Area1 2008 Male 1870 2215. 748. 632. 604. 661.
#> 7 Area1 2009 Fema~ 2156 2071. 748. 779. 747. 813.
#> 8 Area1 2009 Male 1978 2050. 748. 722. 690. 754.
#> 9 Area1 2010 Fema~ 1961 2296. 748. 639. 611. 668.
#> 10 Area1 2010 Male 1808 2145. 748. 631. 602. 660.
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> # statistic <chr>, method <chr>
# calculate the same isrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex observed expected ref_rate value lowercl uppercl
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Fema~ 2370 2147. 748. 826. 793. 860.
#> 2 Area1 2006 Male 2120 2132. 748. 744. 713. 776.
#> 3 Area1 2007 Fema~ 1738 2077. 748. 626. 597. 656.
#> 4 Area1 2007 Male 1799 2023. 748. 665. 635. 697.
#> 5 Area1 2008 Fema~ 1529 2241. 748. 511. 485. 537.
#> 6 Area1 2008 Male 1870 2215. 748. 632. 604. 661.
#> 7 Area1 2009 Fema~ 2156 2071. 748. 779. 747. 813.
#> 8 Area1 2009 Male 1978 2050. 748. 722. 690. 754.
#> 9 Area1 2010 Fema~ 1961 2296. 748. 639. 611. 668.
#> 10 Area1 2010 Male 1808 2145. 748. 631. 602. 660.
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> # statistic <chr>, method <chr>
# calculate separate isrs for each year and drop metadata columns from output
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 7
#> year observed expected ref_rate value lowercl uppercl
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 16681 16681 748. 748. 737. 760.
#> 2 2007 14755 17162. 748. 643. 633. 654.
#> 3 2008 15014 17156. 748. 655. 644. 665.
#> 4 2009 14911 17520. 748. 637. 627. 647.
#> 5 2010 14658 17714. 748. 619. 609. 629.