This vignette introduces the following functions from the PHEindicatormethods package and provides basic sample code to demonstrate their execution. The code included is based on the code provided within the ‘examples’ section of the function documentation. This vignette does not explain the methods applied in detail but these can (optionally) be output alongside the statistics or for a more detailed explanation, please see the references section of the function documentation.
This vignette covers the following functions available within the first release of the package (v1.0.8) but has been updated to apply to these functions in their latest release versions. If further functions are added to the package in future releases these will be explained elsewhere.
Function | Type | Description |
---|---|---|
phe_proportion | Non-aggregate | Performs a calculation on each row of data (unless data is grouped) |
phe_rate | Non-aggregate | Performs a calculation on each row of data (unless data is grouped) |
phe_mean | Aggregate | Performs a calculation on each grouping set |
phe_dsr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_smr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_isr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
The following code chunk creates a data frame containing observed number of events and populations for 4 geographical areas over 2 time periods that is used later to demonstrate the PHEindicatormethods package functions:
df <- data.frame(
area = rep(c("Area1","Area2","Area3","Area4"), 2),
year = rep(2015:2016, each = 4),
obs = sample(100, 2 * 4, replace = TRUE),
pop = sample(100:200, 2 * 4, replace = TRUE))
df
#> area year obs pop
#> 1 Area1 2015 59 143
#> 2 Area2 2015 38 177
#> 3 Area3 2015 67 192
#> 4 Area4 2015 35 114
#> 5 Area1 2016 31 113
#> 6 Area2 2016 19 174
#> 7 Area3 2016 28 124
#> 8 Area4 2016 55 126
INPUT: The phe_proportion and phe_rate functions take a single data frame as input with columns representing the numerators and denominators for the statistic. Any other columns present will be retained in the output.
OUTPUT: The functions output the original data frame with additional columns appended. By default the additional columns are the proportion or rate, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: The functions also accept additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
Here are some example code chunks to demonstrate these two functions and the arguments that can optionally be specified
# default proportion
phe_proportion(df, obs, pop)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 59 143 0.4125874 0.33521563 0.4945327 95% proportion of 1
#> 2 Area2 2015 38 177 0.2146893 0.16059916 0.2809006 95% proportion of 1
#> 3 Area3 2015 67 192 0.3489583 0.28509984 0.4187422 95% proportion of 1
#> 4 Area4 2015 35 114 0.3070175 0.22979102 0.3968260 95% proportion of 1
#> 5 Area1 2016 31 113 0.2743363 0.20051412 0.3629970 95% proportion of 1
#> 6 Area2 2016 19 174 0.1091954 0.07102821 0.1642457 95% proportion of 1
#> 7 Area3 2016 28 124 0.2258065 0.16110096 0.3069902 95% proportion of 1
#> 8 Area4 2016 55 126 0.4365079 0.35305937 0.5237134 95% proportion of 1
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
# specify confidence level for proportion
phe_proportion(df, obs, pop, confidence=99.8)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 59 143 0.4125874 0.29476498 0.5413538 99.8% proportion of 1
#> 2 Area2 2015 38 177 0.2146893 0.13525246 0.3233364 99.8% proportion of 1
#> 3 Area3 2015 67 192 0.3489583 0.25211746 0.4601121 99.8% proportion of 1
#> 4 Area4 2015 35 114 0.3070175 0.19283219 0.4510353 99.8% proportion of 1
#> 5 Area1 2016 31 113 0.2743363 0.16613541 0.4177063 99.8% proportion of 1
#> 6 Area2 2016 19 174 0.1091954 0.05554019 0.2035154 99.8% proportion of 1
#> 7 Area3 2016 28 124 0.2258065 0.13190131 0.3589243 99.8% proportion of 1
#> 8 Area4 2016 55 126 0.4365079 0.30926677 0.5726952 99.8% proportion of 1
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
# specify to output proportions as percentages
phe_proportion(df, obs, pop, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic method
#> 1 Area1 2015 59 143 41.25874 33.521563 49.45327 95% percentage Wilson
#> 2 Area2 2015 38 177 21.46893 16.059916 28.09006 95% percentage Wilson
#> 3 Area3 2015 67 192 34.89583 28.509984 41.87422 95% percentage Wilson
#> 4 Area4 2015 35 114 30.70175 22.979102 39.68260 95% percentage Wilson
#> 5 Area1 2016 31 113 27.43363 20.051412 36.29970 95% percentage Wilson
#> 6 Area2 2016 19 174 10.91954 7.102821 16.42457 95% percentage Wilson
#> 7 Area3 2016 28 124 22.58065 16.110096 30.69902 95% percentage Wilson
#> 8 Area4 2016 55 126 43.65079 35.305937 52.37134 95% percentage Wilson
# specify level of detail to output for proportion
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic method
#> 1 Area1 2015 59 143 41.25874 29.476498 54.13538 99.8% percentage Wilson
#> 2 Area2 2015 38 177 21.46893 13.525246 32.33364 99.8% percentage Wilson
#> 3 Area3 2015 67 192 34.89583 25.211746 46.01121 99.8% percentage Wilson
#> 4 Area4 2015 35 114 30.70175 19.283219 45.10353 99.8% percentage Wilson
#> 5 Area1 2016 31 113 27.43363 16.613541 41.77063 99.8% percentage Wilson
#> 6 Area2 2016 19 174 10.91954 5.554019 20.35154 99.8% percentage Wilson
#> 7 Area3 2016 28 124 22.58065 13.190131 35.89243 99.8% percentage Wilson
#> 8 Area4 2016 55 126 43.65079 30.926677 57.26952 99.8% percentage Wilson
# specify level of detail to output for proportion and remove metadata columns
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100, type="standard")
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 59 143 41.25874 29.476498 54.13538
#> 2 Area2 2015 38 177 21.46893 13.525246 32.33364
#> 3 Area3 2015 67 192 34.89583 25.211746 46.01121
#> 4 Area4 2015 35 114 30.70175 19.283219 45.10353
#> 5 Area1 2016 31 113 27.43363 16.613541 41.77063
#> 6 Area2 2016 19 174 10.91954 5.554019 20.35154
#> 7 Area3 2016 28 124 22.58065 13.190131 35.89243
#> 8 Area4 2016 55 126 43.65079 30.926677 57.26952
# default rate
phe_rate(df, obs, pop)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 59 143 41258.74 31406.256 53221.70 95% rate per 100000
#> 2 Area2 2015 38 177 21468.93 15190.773 29468.61 95% rate per 100000
#> 3 Area3 2015 67 192 34895.83 27042.544 44317.15 95% rate per 100000
#> 4 Area4 2015 35 114 30701.75 21381.764 42700.01 95% rate per 100000
#> 5 Area1 2016 31 113 27433.63 18636.408 38941.18 95% rate per 100000
#> 6 Area2 2016 19 174 10919.54 6571.253 17053.07 95% rate per 100000
#> 7 Area3 2016 28 124 22580.65 15001.360 32636.54 95% rate per 100000
#> 8 Area4 2016 55 126 43650.79 32881.608 56818.59 95% rate per 100000
#> method
#> 1 Byars
#> 2 Byars
#> 3 Byars
#> 4 Byars
#> 5 Byars
#> 6 Byars
#> 7 Byars
#> 8 Byars
# specify rate parameters
phe_rate(df, obs, pop, confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic method
#> 1 Area1 2015 59 143 41.25874 26.611873 60.72303 99.8% rate per 100 Byars
#> 2 Area2 2015 38 177 21.46893 12.274562 34.57934 99.8% rate per 100 Byars
#> 3 Area3 2015 67 192 34.89583 23.177487 50.19647 99.8% rate per 100 Byars
#> 4 Area4 2015 35 114 30.70175 17.096388 50.39502 99.8% rate per 100 Byars
#> 5 Area1 2016 31 113 27.43363 14.655538 46.36546 99.8% rate per 100 Byars
#> 6 Area2 2016 19 174 10.91954 4.752588 21.11794 99.8% rate per 100 Byars
#> 7 Area3 2016 28 124 22.58065 11.620588 39.15808 99.8% rate per 100 Byars
#> 8 Area4 2016 55 126 43.65079 27.675745 65.09821 99.8% rate per 100 Byars
# specify rate parameters and reduce columns output and remove metadata columns
phe_rate(df, obs, pop, type="standard", confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 59 143 41.25874 26.611873 60.72303
#> 2 Area2 2015 38 177 21.46893 12.274562 34.57934
#> 3 Area3 2015 67 192 34.89583 23.177487 50.19647
#> 4 Area4 2015 35 114 30.70175 17.096388 50.39502
#> 5 Area1 2016 31 113 27.43363 14.655538 46.36546
#> 6 Area2 2016 19 174 10.91954 4.752588 21.11794
#> 7 Area3 2016 28 124 22.58065 11.620588 39.15808
#> 8 Area4 2016 55 126 43.65079 27.675745 65.09821
These functions can also return aggregate data if the input dataframes are grouped:
# default proportion - grouped
df %>%
group_by(year) %>%
phe_proportion(obs, pop)
#> # A tibble: 2 x 9
#> year obs pop value lowercl uppercl confidence statistic method
#> * <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2015 199 626 0.318 0.283 0.355 95% proportion of 1 Wilson
#> 2 2016 133 537 0.248 0.213 0.286 95% proportion of 1 Wilson
# default rate - grouped
df %>%
group_by(year) %>%
phe_rate(obs, pop)
#> # A tibble: 2 x 9
#> year obs pop value lowercl uppercl confidence statistic method
#> * <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2015 199 626 31789. 27525. 36526. 95% rate per 100000 Byars
#> 2 2016 133 537 24767. 20737. 29352. 95% rate per 100000 Byars
The remaining functions aggregate the rows in the input data frame to produce a single statistic. It is also possible to calculate multiple statistics in a single execution of these functions if the input data frame is grouped - for example by indicator ID, geographic area or time period (or all three). The output contains only the grouping variables and the values calculated by the function - any additional unused columns provided in the input data frame will not be retained in the output.
The df test data generated earlier can be used to demonstrate phe_mean:
INPUT: The phe_mean function take a single data frame as input with a column representing the numbers to be averaged.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values (if applicable), the mean, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: The function also accepts additional arguments to specify the level of confidence and a reduced level of detail to be output.
Here are some example code chunks to demonstrate the phe_mean function and the arguments that can optionally be specified
# default mean
phe_mean(df,obs)
#> value_sum value_count stdev value lowercl uppercl confidence statistic
#> 1 332 8 16.86925 41.5 27.39696 55.60304 95% mean
#> method
#> 1 Student's t-distribution
# multiple means in a single execution with 99.8% confidence
df %>%
group_by(year) %>%
phe_mean(obs, confidence=0.998)
#> # A tibble: 2 x 10
#> year value_sum value_count stdev value lowercl uppercl confidence statistic
#> * <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 2015 199 4 15.7 49.8 -30.4 130. 99.8% mean
#> 2 2016 133 4 15.4 33.2 -45.3 112. 99.8% mean
#> # ... with 1 more variable: method <chr>
# multiple means in a single execution with 99.8% confidence and data-only output
df %>%
group_by(year) %>%
phe_mean(obs, type = "standard", confidence=0.998)
#> # A tibble: 2 x 7
#> year value_sum value_count stdev value lowercl uppercl
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2015 199 4 15.7 49.8 -30.4 130.
#> 2 2016 133 4 15.4 33.2 -45.3 112.
The following code chunk creates a data frame containing observed number of events and populations by age band for 4 areas, 5 time periods and 2 sexes:
df_std <- data.frame(
area = rep(c("Area1", "Area2", "Area3", "Area4"), each = 19 * 2 * 5),
year = rep(2006:2010, each = 19 * 2),
sex = rep(rep(c("Male", "Female"), each = 19), 5),
ageband = rep(c(0, 5,10,15,20,25,30,35,40,45,
50,55,60,65,70,75,80,85,90), times = 10),
obs = sample(200, 19 * 2 * 5 * 4, replace = TRUE),
pop = sample(10000:20000, 19 * 2 * 5 * 4, replace = TRUE))
head(df_std)
#> area year sex ageband obs pop
#> 1 Area1 2006 Male 0 27 13797
#> 2 Area1 2006 Male 5 51 13539
#> 3 Area1 2006 Male 10 17 14187
#> 4 Area1 2006 Male 15 157 19348
#> 5 Area1 2006 Male 20 100 15825
#> 6 Area1 2006 Male 25 82 16161
INPUT: The minimum input requirement for the phe_dsr function is a single data frame with columns representing the numerators and denominators for each standardisation category. This is sufficient if the data is:
The 2013 European Standard Population is provided within the package in vector form (esp2013) and is used by default by this function. Alternative standard populations can be used but must be provided by the user. When the function joins a standard population vector to the input data frame it does this by position so it is important that the data is sorted accordingly. This is a user responsibility.
The function can also accept standard populations provided as a column within the input data frame.
standard populations provided as a vector - the vector and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order
standard populations provided as a column within the input data frame - the standard populations can be appended to the input data frame by the user prior to execution of the function - if the data is grouped to generate multiple dsrs then the standard populations will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values, the total count, the total population, the dsr, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: If standard populations are being provided as a column within the input data frame then the user must specify this using the stdpoptype argument as the function expects a vector by default. The function also accepts additional arguments to specify the standard populations, the level of confidence, the multiplier and a reduced level of detail to be output.
Here are some example code chunks to demonstrate the phe_dsr function and the arguments that can optionally be specified
# calculate separate dsrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1671 278507 541. 513. 569. 95%
#> 2 Area1 2006 Male 1696 297231 514. 487. 542. 95%
#> 3 Area1 2007 Fema~ 1880 307114 607. 577. 638. 95%
#> 4 Area1 2007 Male 1726 290219 585. 556. 615. 95%
#> 5 Area1 2008 Fema~ 1780 305196 601. 571. 632. 95%
#> 6 Area1 2008 Male 2091 309328 709. 678. 742. 95%
#> 7 Area1 2009 Fema~ 2036 286227 816. 780. 854. 95%
#> 8 Area1 2009 Male 1981 293111 680. 649. 712. 95%
#> 9 Area1 2010 Fema~ 2287 289740 878. 839. 917. 95%
#> 10 Area1 2010 Male 1448 292873 523. 495. 551. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>, method <chr>
# calculate separate dsrs for each area, year and sex and drop metadata fields from output
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, type="standard")
#> # A tibble: 40 x 8
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 1671 278507 541. 513. 569.
#> 2 Area1 2006 Male 1696 297231 514. 487. 542.
#> 3 Area1 2007 Female 1880 307114 607. 577. 638.
#> 4 Area1 2007 Male 1726 290219 585. 556. 615.
#> 5 Area1 2008 Female 1780 305196 601. 571. 632.
#> 6 Area1 2008 Male 2091 309328 709. 678. 742.
#> 7 Area1 2009 Female 2036 286227 816. 780. 854.
#> 8 Area1 2009 Male 1981 293111 680. 649. 712.
#> 9 Area1 2010 Female 2287 289740 878. 839. 917.
#> 10 Area1 2010 Male 1448 292873 523. 495. 551.
#> # ... with 30 more rows
# calculate same specifying standard population in vector form
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1671 278507 541. 513. 569. 95%
#> 2 Area1 2006 Male 1696 297231 514. 487. 542. 95%
#> 3 Area1 2007 Fema~ 1880 307114 607. 577. 638. 95%
#> 4 Area1 2007 Male 1726 290219 585. 556. 615. 95%
#> 5 Area1 2008 Fema~ 1780 305196 601. 571. 632. 95%
#> 6 Area1 2008 Male 2091 309328 709. 678. 742. 95%
#> 7 Area1 2009 Fema~ 2036 286227 816. 780. 854. 95%
#> 8 Area1 2009 Male 1981 293111 680. 649. 712. 95%
#> 9 Area1 2010 Fema~ 2287 289740 878. 839. 917. 95%
#> 10 Area1 2010 Male 1448 292873 523. 495. 551. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>, method <chr>
# calculate the same dsrs by appending the standard populations to the data frame
df_std %>%
mutate(refpop = rep(esp2013,40)) %>%
group_by(area, year, sex) %>%
phe_dsr(obs,pop, stdpop=refpop, stdpoptype="field")
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1671 278507 541. 513. 569. 95%
#> 2 Area1 2006 Male 1696 297231 514. 487. 542. 95%
#> 3 Area1 2007 Fema~ 1880 307114 607. 577. 638. 95%
#> 4 Area1 2007 Male 1726 290219 585. 556. 615. 95%
#> 5 Area1 2008 Fema~ 1780 305196 601. 571. 632. 95%
#> 6 Area1 2008 Male 2091 309328 709. 678. 742. 95%
#> 7 Area1 2009 Fema~ 2036 286227 816. 780. 854. 95%
#> 8 Area1 2009 Male 1981 293111 680. 649. 712. 95%
#> 9 Area1 2010 Fema~ 2287 289740 878. 839. 917. 95%
#> 10 Area1 2010 Male 1448 292873 523. 495. 551. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>, method <chr>
# calculate for under 75s by filtering out records for 75+ from input data frame and standard population
df_std %>%
filter(ageband <= 70) %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013[1:15])
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1206 217417 530. 500. 561. 95%
#> 2 Area1 2006 Male 1162 224515 511. 482. 542. 95%
#> 3 Area1 2007 Fema~ 1313 233772 591. 559. 624. 95%
#> 4 Area1 2007 Male 1217 229816 544. 513. 576. 95%
#> 5 Area1 2008 Fema~ 1433 248103 613. 581. 647. 95%
#> 6 Area1 2008 Male 1798 248580 728. 694. 763. 95%
#> 7 Area1 2009 Fema~ 1809 233792 831. 791. 871. 95%
#> 8 Area1 2009 Male 1472 238309 639. 606. 673. 95%
#> 9 Area1 2010 Fema~ 1869 227840 895. 853. 938. 95%
#> 10 Area1 2010 Male 1220 239605 525. 496. 556. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>, method <chr>
# calculate separate dsrs for persons for each area and year)
df_std %>%
group_by(area, year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(area, year) %>%
phe_dsr(obs,pop)
#> # A tibble: 20 x 10
#> # Groups: area [4]
#> area year total_count total_pop value lowercl uppercl confidence statistic
#> <fct> <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 Area1 2006 3367 575738 544. 524. 565. 95% dsr per ~
#> 2 Area1 2007 3606 597333 581. 560. 602. 95% dsr per ~
#> 3 Area1 2008 3871 614524 638. 617. 659. 95% dsr per ~
#> 4 Area1 2009 4017 579338 726. 703. 750. 95% dsr per ~
#> 5 Area1 2010 3735 582613 682. 659. 705. 95% dsr per ~
#> 6 Area2 2006 3879 566640 692. 669. 716. 95% dsr per ~
#> 7 Area2 2007 3793 565429 651. 628. 673. 95% dsr per ~
#> 8 Area2 2008 3327 564452 596. 575. 618. 95% dsr per ~
#> 9 Area2 2009 3972 548964 726. 702. 750. 95% dsr per ~
#> 10 Area2 2010 3642 586793 606. 585. 627. 95% dsr per ~
#> 11 Area3 2006 3370 550080 623. 601. 645. 95% dsr per ~
#> 12 Area3 2007 3710 561789 648. 626. 671. 95% dsr per ~
#> 13 Area3 2008 3816 567972 649. 627. 672. 95% dsr per ~
#> 14 Area3 2009 4265 571561 753. 729. 777. 95% dsr per ~
#> 15 Area3 2010 3997 582175 677. 655. 700. 95% dsr per ~
#> 16 Area4 2006 3743 593992 641. 619. 663. 95% dsr per ~
#> 17 Area4 2007 4148 573292 739. 715. 763. 95% dsr per ~
#> 18 Area4 2008 4133 558561 755. 731. 779. 95% dsr per ~
#> 19 Area4 2009 3681 575834 646. 624. 669. 95% dsr per ~
#> 20 Area4 2010 3865 562536 665. 643. 688. 95% dsr per ~
#> # ... with 1 more variable: method <chr>
INPUT: Unlike the phe_dsr function, there is no default standard or reference data for the phe_smr and phe_isr functions. These functions take a single data frame as input, with columns representing the numerators and denominators for each standardisation category, plus reference numerators and denominators for each standardisation category.
The reference data can either be provided in a separate data frame/vectors or as columns within the input data frame:
reference data provided as a data frame or as vectors - the data frame/vectors and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order.
reference data provided as columns within the input data frame - the reference numerators and denominators can be appended to the input data frame prior to execution of the function - if the data is grouped to generate multiple smrs/isrs then the reference data will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the functions output one row per grouping set containing the grouping variable values, the observed and expected counts, the reference rate (isr only), the smr or isr, the lower 95% confidence limit, and the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: If reference data are being provided as columns within the input data frame then the user must specify this as the function expects vectors by default. The function also accepts additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
The following code chunk creates a data frame containing the reference data - this example uses the all area data for persons in the baseline year:
df_ref <- df_std %>%
filter(year == 2006) %>%
group_by(ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop))
head(df_ref)
#> # A tibble: 6 x 3
#> ageband obs pop
#> <dbl> <int> <int>
#> 1 0 631 105398
#> 2 5 473 115954
#> 3 10 666 124087
#> 4 15 1185 132560
#> 5 20 609 132595
#> 6 25 451 111008
Here are some example code chunks to demonstrate the phe_smr function and the arguments that can optionally be specified
# calculate separate smrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex observed expected value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1671 1748. 0.956 0.911 1.00 95%
#> 2 Area1 2006 Male 1696 1870. 0.907 0.864 0.951 95%
#> 3 Area1 2007 Fema~ 1880 1960. 0.959 0.916 1.00 95%
#> 4 Area1 2007 Male 1726 1815. 0.951 0.907 0.997 95%
#> 5 Area1 2008 Fema~ 1780 1901. 0.936 0.893 0.981 95%
#> 6 Area1 2008 Male 2091 1907. 1.10 1.05 1.14 95%
#> 7 Area1 2009 Fema~ 2036 1790. 1.14 1.09 1.19 95%
#> 8 Area1 2009 Male 1981 1836. 1.08 1.03 1.13 95%
#> 9 Area1 2010 Fema~ 2287 1803. 1.27 1.22 1.32 95%
#> 10 Area1 2010 Male 1448 1810. 0.800 0.759 0.842 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>, method <chr>
# calculate the same smrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex observed expected value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1671 1748. 0.956 0.911 1.00 95%
#> 2 Area1 2006 Male 1696 1870. 0.907 0.864 0.951 95%
#> 3 Area1 2007 Fema~ 1880 1960. 0.959 0.916 1.00 95%
#> 4 Area1 2007 Male 1726 1815. 0.951 0.907 0.997 95%
#> 5 Area1 2008 Fema~ 1780 1901. 0.936 0.893 0.981 95%
#> 6 Area1 2008 Male 2091 1907. 1.10 1.05 1.14 95%
#> 7 Area1 2009 Fema~ 2036 1790. 1.14 1.09 1.19 95%
#> 8 Area1 2009 Male 1981 1836. 1.08 1.03 1.13 95%
#> 9 Area1 2010 Fema~ 2287 1803. 1.27 1.22 1.32 95%
#> 10 Area1 2010 Male 1448 1810. 0.800 0.759 0.842 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>, method <chr>
# calculate separate smrs for each year and drop metadata columns from output
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 6
#> year observed expected value lowercl uppercl
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 14359 14359 1 0.984 1.02
#> 2 2007 15257 14495. 1.05 1.04 1.07
#> 3 2008 15147 14457. 1.05 1.03 1.06
#> 4 2009 15935 14250. 1.12 1.10 1.14
#> 5 2010 15239 14422. 1.06 1.04 1.07
The phe_isr function works exactly the same way but instead of expressing the result as a ratio of the observed and expected rates the result is expressed as a rate and the reference rate is also provided. Here are some examples:
# calculate separate isrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex observed expected ref_rate value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1671 1748. 628. 600. 572. 630. 95%
#> 2 Area1 2006 Male 1696 1870. 628. 570. 543. 597. 95%
#> 3 Area1 2007 Fema~ 1880 1960. 628. 602. 575. 630. 95%
#> 4 Area1 2007 Male 1726 1815. 628. 597. 569. 626. 95%
#> 5 Area1 2008 Fema~ 1780 1901. 628. 588. 561. 616. 95%
#> 6 Area1 2008 Male 2091 1907. 628. 689. 659. 719. 95%
#> 7 Area1 2009 Fema~ 2036 1790. 628. 714. 684. 746. 95%
#> 8 Area1 2009 Male 1981 1836. 628. 677. 648. 708. 95%
#> 9 Area1 2010 Fema~ 2287 1803. 628. 797. 764. 830. 95%
#> 10 Area1 2010 Male 1448 1810. 628. 502. 477. 529. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>, method <chr>
# calculate the same isrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex observed expected ref_rate value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1671 1748. 628. 600. 572. 630. 95%
#> 2 Area1 2006 Male 1696 1870. 628. 570. 543. 597. 95%
#> 3 Area1 2007 Fema~ 1880 1960. 628. 602. 575. 630. 95%
#> 4 Area1 2007 Male 1726 1815. 628. 597. 569. 626. 95%
#> 5 Area1 2008 Fema~ 1780 1901. 628. 588. 561. 616. 95%
#> 6 Area1 2008 Male 2091 1907. 628. 689. 659. 719. 95%
#> 7 Area1 2009 Fema~ 2036 1790. 628. 714. 684. 746. 95%
#> 8 Area1 2009 Male 1981 1836. 628. 677. 648. 708. 95%
#> 9 Area1 2010 Fema~ 2287 1803. 628. 797. 764. 830. 95%
#> 10 Area1 2010 Male 1448 1810. 628. 502. 477. 529. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>, method <chr>
# calculate separate isrs for each year and drop metadata columns from output
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 7
#> year observed expected ref_rate value lowercl uppercl
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 14359 14359 628. 628. 618. 638.
#> 2 2007 15257 14495. 628. 661. 651. 672.
#> 3 2008 15147 14457. 628. 658. 648. 669.
#> 4 2009 15935 14250. 628. 702. 691. 713.
#> 5 2010 15239 14422. 628. 664. 653. 674.