This vignette introduces the following functions from the PHEindicatormethods package and provides basic sample code to demonstrate their execution. The code included is based on the code provided within the ‘examples’ section of the function documentation. This vignette does not explain the methods applied in detail but these can (optionally) be output alongside the statistics or for a more detailed explanation, please see the references section of the function documentation.
This vignette covers the following functions available within the first release of the package (v1.0.8) but has been updated to apply to these functions in their latest release versions. If further functions are added to the package in future releases these will be explained elsewhere.
Function | Type | Description |
---|---|---|
phe_proportion | Non-aggregate | Performs a calculation on each row of data (unless data is grouped) |
phe_rate | Non-aggregate | Performs a calculation on each row of data (unless data is grouped) |
phe_mean | Aggregate | Performs a calculation on each grouping set |
phe_dsr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_smr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_isr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
The following code chunk creates a data frame containing observed number of events and populations for 4 geographical areas over 2 time periods that is used later to demonstrate the PHEindicatormethods package functions:
df <- data.frame(
area = rep(c("Area1","Area2","Area3","Area4"), 2),
year = rep(2015:2016, each = 4),
obs = sample(100, 2 * 4, replace = TRUE),
pop = sample(100:200, 2 * 4, replace = TRUE))
df
#> area year obs pop
#> 1 Area1 2015 47 128
#> 2 Area2 2015 16 112
#> 3 Area3 2015 78 182
#> 4 Area4 2015 92 163
#> 5 Area1 2016 74 145
#> 6 Area2 2016 41 110
#> 7 Area3 2016 26 117
#> 8 Area4 2016 33 153
INPUT: The phe_proportion and phe_rate functions take a single data frame as input with columns representing the numerators and denominators for the statistic. Any other columns present will be retained in the output.
OUTPUT: The functions output the original data frame with additional columns appended. By default the additional columns are the proportion or rate, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: The functions also accept additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
Here are some example code chunks to demonstrate these two functions and the arguments that can optionally be specified
# default proportion
phe_proportion(df, obs, pop)
#> area year obs pop value lowercl uppercl confidence
#> 1 Area1 2015 47 128 0.3671875 0.28868455 0.4534300 95%
#> 2 Area2 2015 16 112 0.1428571 0.08988661 0.2195144 95%
#> 3 Area3 2015 78 182 0.4285714 0.35888349 0.5012123 95%
#> 4 Area4 2015 92 163 0.5644172 0.48768236 0.6381856 95%
#> 5 Area1 2016 74 145 0.5103448 0.42976859 0.5903871 95%
#> 6 Area2 2016 41 110 0.3727273 0.28809596 0.4659479 95%
#> 7 Area3 2016 26 117 0.2222222 0.15640397 0.3057012 95%
#> 8 Area4 2016 33 153 0.2156863 0.15790572 0.2873940 95%
#> statistic method
#> 1 proportion of 1 Wilson
#> 2 proportion of 1 Wilson
#> 3 proportion of 1 Wilson
#> 4 proportion of 1 Wilson
#> 5 proportion of 1 Wilson
#> 6 proportion of 1 Wilson
#> 7 proportion of 1 Wilson
#> 8 proportion of 1 Wilson
# specify confidence level for proportion
phe_proportion(df, obs, pop, confidence=99.8)
#> area year obs pop value lowercl uppercl confidence
#> 1 Area1 2015 47 128 0.3671875 0.24906241 0.5037539 99.8%
#> 2 Area2 2015 16 112 0.1428571 0.06889889 0.2729332 99.8%
#> 3 Area3 2015 78 182 0.4285714 0.32157994 0.5426849 99.8%
#> 4 Area4 2015 92 163 0.5644172 0.44415166 0.6775525 99.8%
#> 5 Area1 2016 74 145 0.5103448 0.38544314 0.6339681 99.8%
#> 6 Area2 2016 41 110 0.3727273 0.24585624 0.5199312 99.8%
#> 7 Area3 2016 26 117 0.2222222 0.12707158 0.3592956 99.8%
#> 8 Area4 2016 33 153 0.2156863 0.13130903 0.3334695 99.8%
#> statistic method
#> 1 proportion of 1 Wilson
#> 2 proportion of 1 Wilson
#> 3 proportion of 1 Wilson
#> 4 proportion of 1 Wilson
#> 5 proportion of 1 Wilson
#> 6 proportion of 1 Wilson
#> 7 proportion of 1 Wilson
#> 8 proportion of 1 Wilson
# specify to output proportions as percentages
phe_proportion(df, obs, pop, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 47 128 36.71875 28.868455 45.34300 95% percentage
#> 2 Area2 2015 16 112 14.28571 8.988661 21.95144 95% percentage
#> 3 Area3 2015 78 182 42.85714 35.888349 50.12123 95% percentage
#> 4 Area4 2015 92 163 56.44172 48.768236 63.81856 95% percentage
#> 5 Area1 2016 74 145 51.03448 42.976859 59.03871 95% percentage
#> 6 Area2 2016 41 110 37.27273 28.809596 46.59479 95% percentage
#> 7 Area3 2016 26 117 22.22222 15.640397 30.57012 95% percentage
#> 8 Area4 2016 33 153 21.56863 15.790572 28.73940 95% percentage
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
# specify level of detail to output for proportion
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 47 128 36.71875 24.906241 50.37539 99.8% percentage
#> 2 Area2 2015 16 112 14.28571 6.889889 27.29332 99.8% percentage
#> 3 Area3 2015 78 182 42.85714 32.157994 54.26849 99.8% percentage
#> 4 Area4 2015 92 163 56.44172 44.415166 67.75525 99.8% percentage
#> 5 Area1 2016 74 145 51.03448 38.544314 63.39681 99.8% percentage
#> 6 Area2 2016 41 110 37.27273 24.585624 51.99312 99.8% percentage
#> 7 Area3 2016 26 117 22.22222 12.707158 35.92956 99.8% percentage
#> 8 Area4 2016 33 153 21.56863 13.130903 33.34695 99.8% percentage
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
# specify level of detail to output for proportion and remove metadata columns
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100, type="standard")
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 47 128 36.71875 24.906241 50.37539
#> 2 Area2 2015 16 112 14.28571 6.889889 27.29332
#> 3 Area3 2015 78 182 42.85714 32.157994 54.26849
#> 4 Area4 2015 92 163 56.44172 44.415166 67.75525
#> 5 Area1 2016 74 145 51.03448 38.544314 63.39681
#> 6 Area2 2016 41 110 37.27273 24.585624 51.99312
#> 7 Area3 2016 26 117 22.22222 12.707158 35.92956
#> 8 Area4 2016 33 153 21.56863 13.130903 33.34695
# default rate
phe_rate(df, obs, pop)
#> area year obs pop value lowercl uppercl confidence
#> 1 Area1 2015 47 128 36718.75 26977.233 48829.25 95%
#> 2 Area2 2015 16 112 14285.71 8160.276 23200.39 95%
#> 3 Area3 2015 78 182 42857.14 33875.592 53488.33 95%
#> 4 Area4 2015 92 163 56441.72 45498.775 69221.48 95%
#> 5 Area1 2016 74 145 51034.48 40071.490 64069.99 95%
#> 6 Area2 2016 41 110 37272.73 26744.610 50565.98 95%
#> 7 Area3 2016 26 117 22222.22 14512.612 32562.00 95%
#> 8 Area4 2016 33 153 21568.63 14844.428 30291.35 95%
#> statistic method
#> 1 rate per 100000 Byars
#> 2 rate per 100000 Byars
#> 3 rate per 100000 Byars
#> 4 rate per 100000 Byars
#> 5 rate per 100000 Byars
#> 6 rate per 100000 Byars
#> 7 rate per 100000 Byars
#> 8 rate per 100000 Byars
# specify rate parameters
phe_rate(df, obs, pop, confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 47 128 36.71875 22.342317 56.49356 99.8% rate per 100
#> 2 Area2 2015 16 112 14.28571 5.684786 29.17111 99.8% rate per 100
#> 3 Area3 2015 78 182 42.85714 29.399746 60.08691 99.8% rate per 100
#> 4 Area4 2015 92 163 56.44172 39.977631 77.11026 99.8% rate per 100
#> 5 Area1 2016 74 145 51.03448 34.631101 72.17572 99.8% rate per 100
#> 6 Area2 2016 41 110 37.27273 21.810353 59.02921 99.8% rate per 100
#> 7 Area3 2016 26 117 22.22222 11.111580 39.29392 99.8% rate per 100
#> 8 Area4 2016 33 153 21.56863 11.776009 35.90156 99.8% rate per 100
#> method
#> 1 Byars
#> 2 Byars
#> 3 Byars
#> 4 Byars
#> 5 Byars
#> 6 Byars
#> 7 Byars
#> 8 Byars
# specify rate parameters and reduce columns output and remove metadata columns
phe_rate(df, obs, pop, type="standard", confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 47 128 36.71875 22.342317 56.49356
#> 2 Area2 2015 16 112 14.28571 5.684786 29.17111
#> 3 Area3 2015 78 182 42.85714 29.399746 60.08691
#> 4 Area4 2015 92 163 56.44172 39.977631 77.11026
#> 5 Area1 2016 74 145 51.03448 34.631101 72.17572
#> 6 Area2 2016 41 110 37.27273 21.810353 59.02921
#> 7 Area3 2016 26 117 22.22222 11.111580 39.29392
#> 8 Area4 2016 33 153 21.56863 11.776009 35.90156
These functions can also return aggregate data if the input dataframes are grouped:
# default proportion - grouped
df %>%
group_by(year) %>%
phe_proportion(obs, pop)
#> # A tibble: 2 x 9
#> year obs pop value lowercl uppercl confidence statistic method
#> <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2015 233 585 0.398 0.359 0.439 95% proportion of 1 Wilson
#> 2 2016 174 525 0.331 0.293 0.373 95% proportion of 1 Wilson
# default rate - grouped
df %>%
group_by(year) %>%
phe_rate(obs, pop)
#> # A tibble: 2 x 9
#> year obs pop value lowercl uppercl confidence statistic method
#> <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2015 233 585 39829. 34879. 45285. 95% rate per 1000~ Byars
#> 2 2016 174 525 33143. 28401. 38450. 95% rate per 1000~ Byars
The remaining functions aggregate the rows in the input data frame to produce a single statistic. It is also possible to calculate multiple statistics in a single execution of these functions if the input data frame is grouped - for example by indicator ID, geographic area or time period (or all three). The output contains only the grouping variables and the values calculated by the function - any additional unused columns provided in the input data frame will not be retained in the output.
The df test data generated earlier can be used to demonstrate phe_mean:
INPUT: The phe_mean function take a single data frame as input with a column representing the numbers to be averaged.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values (if applicable), the mean, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: The function also accepts additional arguments to specify the level of confidence and a reduced level of detail to be output.
Here are some example code chunks to demonstrate the phe_mean function and the arguments that can optionally be specified
# default mean
phe_mean(df,obs)
#> value_sum value_count stdev value lowercl uppercl confidence
#> 1 407 8 27.33097 50.875 28.02573 73.72427 95%
#> statistic method
#> 1 mean Student's t-distribution
# multiple means in a single execution with 99.8% confidence
df %>%
group_by(year) %>%
phe_mean(obs, confidence=0.998)
#> # A tibble: 2 x 10
#> year value_sum value_count stdev value lowercl uppercl confidence
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2015 233 4 33.9 58.2 -115. 231. 99.8%
#> 2 2016 174 4 21.2 43.5 -65.0 152. 99.8%
#> # ... with 2 more variables: statistic <chr>, method <chr>
# multiple means in a single execution with 99.8% confidence and data-only output
df %>%
group_by(year) %>%
phe_mean(obs, type = "standard", confidence=0.998)
#> # A tibble: 2 x 7
#> year value_sum value_count stdev value lowercl uppercl
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2015 233 4 33.9 58.2 -115. 231.
#> 2 2016 174 4 21.2 43.5 -65.0 152.
The following code chunk creates a data frame containing observed number of events and populations by age band for 4 areas, 5 time periods and 2 sexes:
df_std <- data.frame(
area = rep(c("Area1", "Area2", "Area3", "Area4"), each = 19 * 2 * 5),
year = rep(2006:2010, each = 19 * 2),
sex = rep(rep(c("Male", "Female"), each = 19), 5),
ageband = rep(c(0, 5,10,15,20,25,30,35,40,45,
50,55,60,65,70,75,80,85,90), times = 10),
obs = sample(200, 19 * 2 * 5 * 4, replace = TRUE),
pop = sample(10000:20000, 19 * 2 * 5 * 4, replace = TRUE))
head(df_std)
#> area year sex ageband obs pop
#> 1 Area1 2006 Male 0 49 18515
#> 2 Area1 2006 Male 5 120 12429
#> 3 Area1 2006 Male 10 27 18295
#> 4 Area1 2006 Male 15 172 13585
#> 5 Area1 2006 Male 20 142 18873
#> 6 Area1 2006 Male 25 128 19736
INPUT: The minimum input requirement for the phe_dsr function is a single data frame with columns representing the numerators and denominators for each standardisation category. This is sufficient if the data is:
The 2013 European Standard Population is provided within the package in vector form (esp2013) and is used by default by this function. Alternative standard populations can be used but must be provided by the user. When the function joins a standard population vector to the input data frame it does this by position so it is important that the data is sorted accordingly. This is a user responsibility.
The function can also accept standard populations provided as a column within the input data frame.
standard populations provided as a vector - the vector and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order
standard populations provided as a column within the input data frame - the standard populations can be appended to the input data frame by the user prior to execution of the function - if the data is grouped to generate multiple dsrs then the standard populations will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values, the total count, the total population, the dsr, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: If standard populations are being provided as a column within the input data frame then the user must specify this using the stdpoptype argument as the function expects a vector by default. The function also accepts additional arguments to specify the standard populations, the level of confidence, the multiplier and a reduced level of detail to be output.
Here are some example code chunks to demonstrate the phe_dsr function and the arguments that can optionally be specified
# calculate separate dsrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1588 269585 542. 513. 573. 95%
#> 2 Area1 2006 Male 2051 312726 682. 651. 713. 95%
#> 3 Area1 2007 Fema~ 1976 271347 748. 712. 785. 95%
#> 4 Area1 2007 Male 2071 277302 741. 707. 776. 95%
#> 5 Area1 2008 Fema~ 1654 293895 597. 566. 629. 95%
#> 6 Area1 2008 Male 1692 285377 604. 574. 636. 95%
#> 7 Area1 2009 Fema~ 1829 305132 659. 628. 692. 95%
#> 8 Area1 2009 Male 1755 288565 598. 568. 629. 95%
#> 9 Area1 2010 Fema~ 2254 292523 722. 690. 755. 95%
#> 10 Area1 2010 Male 1951 284716 671. 640. 704. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate separate dsrs for each area, year and sex and drop metadata fields from output
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, type="standard")
#> # A tibble: 40 x 8
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 1588 269585 542. 513. 573.
#> 2 Area1 2006 Male 2051 312726 682. 651. 713.
#> 3 Area1 2007 Female 1976 271347 748. 712. 785.
#> 4 Area1 2007 Male 2071 277302 741. 707. 776.
#> 5 Area1 2008 Female 1654 293895 597. 566. 629.
#> 6 Area1 2008 Male 1692 285377 604. 574. 636.
#> 7 Area1 2009 Female 1829 305132 659. 628. 692.
#> 8 Area1 2009 Male 1755 288565 598. 568. 629.
#> 9 Area1 2010 Female 2254 292523 722. 690. 755.
#> 10 Area1 2010 Male 1951 284716 671. 640. 704.
#> # ... with 30 more rows
# calculate same specifying standard population in vector form
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1588 269585 542. 513. 573. 95%
#> 2 Area1 2006 Male 2051 312726 682. 651. 713. 95%
#> 3 Area1 2007 Fema~ 1976 271347 748. 712. 785. 95%
#> 4 Area1 2007 Male 2071 277302 741. 707. 776. 95%
#> 5 Area1 2008 Fema~ 1654 293895 597. 566. 629. 95%
#> 6 Area1 2008 Male 1692 285377 604. 574. 636. 95%
#> 7 Area1 2009 Fema~ 1829 305132 659. 628. 692. 95%
#> 8 Area1 2009 Male 1755 288565 598. 568. 629. 95%
#> 9 Area1 2010 Fema~ 2254 292523 722. 690. 755. 95%
#> 10 Area1 2010 Male 1951 284716 671. 640. 704. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate the same dsrs by appending the standard populations to the data frame
df_std %>%
mutate(refpop = rep(esp2013,40)) %>%
group_by(area, year, sex) %>%
phe_dsr(obs,pop, stdpop=refpop, stdpoptype="field")
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1588 269585 542. 513. 573. 95%
#> 2 Area1 2006 Male 2051 312726 682. 651. 713. 95%
#> 3 Area1 2007 Fema~ 1976 271347 748. 712. 785. 95%
#> 4 Area1 2007 Male 2071 277302 741. 707. 776. 95%
#> 5 Area1 2008 Fema~ 1654 293895 597. 566. 629. 95%
#> 6 Area1 2008 Male 1692 285377 604. 574. 636. 95%
#> 7 Area1 2009 Fema~ 1829 305132 659. 628. 692. 95%
#> 8 Area1 2009 Male 1755 288565 598. 568. 629. 95%
#> 9 Area1 2010 Fema~ 2254 292523 722. 690. 755. 95%
#> 10 Area1 2010 Male 1951 284716 671. 640. 704. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate for under 75s by filtering out records for 75+ from input data frame and standard population
df_std %>%
filter(ageband <= 70) %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013[1:15])
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1034 216699 501. 470. 533. 95%
#> 2 Area1 2006 Male 1728 252257 708. 675. 743. 95%
#> 3 Area1 2007 Fema~ 1526 210474 740. 702. 780. 95%
#> 4 Area1 2007 Male 1570 215934 725. 689. 763. 95%
#> 5 Area1 2008 Fema~ 1271 235630 593. 559. 628. 95%
#> 6 Area1 2008 Male 1410 226722 620. 588. 654. 95%
#> 7 Area1 2009 Fema~ 1610 241629 691. 657. 727. 95%
#> 8 Area1 2009 Male 1344 218049 602. 569. 635. 95%
#> 9 Area1 2010 Fema~ 1584 235585 667. 634. 702. 95%
#> 10 Area1 2010 Male 1599 230790 680. 646. 714. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate separate dsrs for persons for each area and year)
df_std %>%
group_by(area, year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(area, year) %>%
phe_dsr(obs,pop)
#> # A tibble: 20 x 10
#> # Groups: area [4]
#> area year total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 3639 582311 601. 580. 623. 95%
#> 2 Area1 2007 4047 548649 739. 715. 764. 95%
#> 3 Area1 2008 3346 579272 587. 566. 608. 95%
#> 4 Area1 2009 3584 593697 624. 603. 646. 95%
#> 5 Area1 2010 4205 577239 698. 675. 721. 95%
#> 6 Area2 2006 3732 595337 620. 599. 642. 95%
#> 7 Area2 2007 3656 576631 671. 648. 694. 95%
#> 8 Area2 2008 3367 604795 604. 583. 626. 95%
#> 9 Area2 2009 3611 563165 652. 629. 675. 95%
#> 10 Area2 2010 3159 602672 513. 494. 532. 95%
#> 11 Area3 2006 4166 574100 730. 707. 755. 95%
#> 12 Area3 2007 3791 585599 658. 636. 681. 95%
#> 13 Area3 2008 3907 560264 728. 704. 753. 95%
#> 14 Area3 2009 3744 602071 613. 592. 634. 95%
#> 15 Area3 2010 4227 589856 747. 723. 771. 95%
#> 16 Area4 2006 3545 553076 662. 639. 685. 95%
#> 17 Area4 2007 3911 571843 696. 673. 720. 95%
#> 18 Area4 2008 3860 582601 693. 670. 716. 95%
#> 19 Area4 2009 3349 588160 602. 581. 624. 95%
#> 20 Area4 2010 4227 571729 724. 701. 748. 95%
#> # ... with 2 more variables: statistic <chr>, method <chr>
INPUT: Unlike the phe_dsr function, there is no default standard or reference data for the phe_smr and phe_isr functions. These functions take a single data frame as input, with columns representing the numerators and denominators for each standardisation category, plus reference numerators and denominators for each standardisation category.
The reference data can either be provided in a separate data frame/vectors or as columns within the input data frame:
reference data provided as a data frame or as vectors - the data frame/vectors and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order.
reference data provided as columns within the input data frame - the reference numerators and denominators can be appended to the input data frame prior to execution of the function - if the data is grouped to generate multiple smrs/isrs then the reference data will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the functions output one row per grouping set containing the grouping variable values, the observed and expected counts, the reference rate (isr only), the smr or isr, the lower 95% confidence limit, and the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: If reference data are being provided as columns within the input data frame then the user must specify this as the function expects vectors by default. The function also accepts additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
The following code chunk creates a data frame containing the reference data - this example uses the all area data for persons in the baseline year:
df_ref <- df_std %>%
filter(year == 2006) %>%
group_by(ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop))
head(df_ref)
#> # A tibble: 6 x 3
#> ageband obs pop
#> <dbl> <int> <int>
#> 1 0 599 129182
#> 2 5 651 115841
#> 3 10 617 120517
#> 4 15 577 134855
#> 5 20 839 123381
#> 6 25 915 134561
Here are some example code chunks to demonstrate the phe_smr function and the arguments that can optionally be specified
# calculate separate smrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex observed expected value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1588 1760. 0.902 0.858 0.948 95%
#> 2 Area1 2006 Male 2051 2047. 1.00 0.959 1.05 95%
#> 3 Area1 2007 Fema~ 1976 1827. 1.08 1.03 1.13 95%
#> 4 Area1 2007 Male 2071 1850. 1.12 1.07 1.17 95%
#> 5 Area1 2008 Fema~ 1654 1968. 0.841 0.801 0.882 95%
#> 6 Area1 2008 Male 1692 1886. 0.897 0.855 0.941 95%
#> 7 Area1 2009 Fema~ 1829 2024. 0.904 0.863 0.946 95%
#> 8 Area1 2009 Male 1755 1913. 0.917 0.875 0.961 95%
#> 9 Area1 2010 Fema~ 2254 1932. 1.17 1.12 1.22 95%
#> 10 Area1 2010 Male 1951 1879. 1.04 0.993 1.09 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate the same smrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex observed expected value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1588 1760. 0.902 0.858 0.948 95%
#> 2 Area1 2006 Male 2051 2047. 1.00 0.959 1.05 95%
#> 3 Area1 2007 Fema~ 1976 1827. 1.08 1.03 1.13 95%
#> 4 Area1 2007 Male 2071 1850. 1.12 1.07 1.17 95%
#> 5 Area1 2008 Fema~ 1654 1968. 0.841 0.801 0.882 95%
#> 6 Area1 2008 Male 1692 1886. 0.897 0.855 0.941 95%
#> 7 Area1 2009 Fema~ 1829 2024. 0.904 0.863 0.946 95%
#> 8 Area1 2009 Male 1755 1913. 0.917 0.875 0.961 95%
#> 9 Area1 2010 Fema~ 2254 1932. 1.17 1.12 1.22 95%
#> 10 Area1 2010 Male 1951 1879. 1.04 0.993 1.09 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate separate smrs for each year and drop metadata columns from output
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 6
#> year observed expected value lowercl uppercl
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 15082 15082 1 0.984 1.02
#> 2 2007 15405 15086. 1.02 1.01 1.04
#> 3 2008 14480 15383. 0.941 0.926 0.957
#> 4 2009 14288 15485. 0.923 0.908 0.938
#> 5 2010 15818 15432. 1.03 1.01 1.04
The phe_isr function works exactly the same way but instead of expressing the result as a ratio of the observed and expected rates the result is expressed as a rate and the reference rate is also provided. Here are some examples:
# calculate separate isrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex observed expected ref_rate value lowercl uppercl
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Fema~ 1588 1760. 654. 590. 562. 620.
#> 2 Area1 2006 Male 2051 2047. 654. 656. 628. 685.
#> 3 Area1 2007 Fema~ 1976 1827. 654. 708. 677. 740.
#> 4 Area1 2007 Male 2071 1850. 654. 732. 701. 765.
#> 5 Area1 2008 Fema~ 1654 1968. 654. 550. 524. 577.
#> 6 Area1 2008 Male 1692 1886. 654. 587. 559. 616.
#> 7 Area1 2009 Fema~ 1829 2024. 654. 591. 565. 619.
#> 8 Area1 2009 Male 1755 1913. 654. 600. 573. 629.
#> 9 Area1 2010 Fema~ 2254 1932. 654. 763. 732. 796.
#> 10 Area1 2010 Male 1951 1879. 654. 680. 650. 710.
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> # statistic <chr>, method <chr>
# calculate the same isrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex observed expected ref_rate value lowercl uppercl
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Fema~ 1588 1760. 654. 590. 562. 620.
#> 2 Area1 2006 Male 2051 2047. 654. 656. 628. 685.
#> 3 Area1 2007 Fema~ 1976 1827. 654. 708. 677. 740.
#> 4 Area1 2007 Male 2071 1850. 654. 732. 701. 765.
#> 5 Area1 2008 Fema~ 1654 1968. 654. 550. 524. 577.
#> 6 Area1 2008 Male 1692 1886. 654. 587. 559. 616.
#> 7 Area1 2009 Fema~ 1829 2024. 654. 591. 565. 619.
#> 8 Area1 2009 Male 1755 1913. 654. 600. 573. 629.
#> 9 Area1 2010 Fema~ 2254 1932. 654. 763. 732. 796.
#> 10 Area1 2010 Male 1951 1879. 654. 680. 650. 710.
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> # statistic <chr>, method <chr>
# calculate separate isrs for each year and drop metadata columns from output
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 7
#> year observed expected ref_rate value lowercl uppercl
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 15082 15082 654. 654. 644. 665.
#> 2 2007 15405 15086. 654. 668. 658. 679.
#> 3 2008 14480 15383. 654. 616. 606. 626.
#> 4 2009 14288 15485. 654. 604. 594. 614.
#> 5 2010 15818 15432. 654. 671. 660. 681.