This vignette introduces each of the functions in the PHEindicatormethods package and provides basic sample code to demonstrate their use. The code is based on the ‘examples’ section of each function's documentation. The vignette does not explain the methods in detail. The method applied can optionally be output alongside the statistics, and a fuller explanation is given in the references section of the function documentation.
This vignette covers the following functions available within the first release of the package (v1.0.0). If further functions are added to the package in future releases these will be explained elsewhere.
Function | Type | Description |
---|---|---|
phe_proportion | Non-aggregate | Performs a calculation on each row of data |
phe_rate | Non-aggregate | Performs a calculation on each row of data |
phe_mean | Aggregate | Performs a calculation on each grouping set |
phe_dsr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_smr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_isr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
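The following code chunk creates a data frame containing the observed number of events and the population for 4 geographical areas over 2 time periods. It is used later to demonstrate the PHEindicatormethods package functions. The values are drawn with sample() and no seed is set, so re-running the chunk will give different numbers from those shown: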
df <- data.frame(
area = rep(c("Area1","Area2","Area3","Area4"), 2),
year = rep(2015:2016, each = 4),
obs = sample(100, 2 * 4, replace = TRUE),
pop = sample(100:200, 2 * 4, replace = TRUE))
df
#> area year obs pop
#> 1 Area1 2015 92 162
#> 2 Area2 2015 66 123
#> 3 Area3 2015 12 124
#> 4 Area4 2015 51 161
#> 5 Area1 2016 27 155
#> 6 Area2 2016 68 147
#> 7 Area3 2016 98 131
#> 8 Area4 2016 66 139
INPUT: The phe_proportion and phe_rate functions take a single data frame as input with columns representing the numerators and denominators for the statistic. Any other columns present will be retained in the output.
OUTPUT: The functions output the original data frame with additional columns appended. By default the additional columns are the proportion or rate, the lower 95% confidence limit, and the upper 95% confidence limit.
OPTIONS: The functions also accept additional arguments to specify the level of confidence, the multiplier and the level of detail to be output.
Here are some example code chunks to demonstrate these two functions and the arguments that can optionally be specified:
# attach the package (assumed to be installed)
library(PHEindicatormethods)
# default proportion
phe_proportion(df, obs, pop)
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 92 162 0.56790123 0.49091928 0.6417375
#> 2 Area2 2015 66 123 0.53658537 0.44868986 0.6222649
#> 3 Area3 2015 12 124 0.09677419 0.05622821 0.1615529
#> 4 Area4 2015 51 161 0.31677019 0.24989359 0.3921867
#> 5 Area1 2016 27 155 0.17419355 0.12256671 0.2415791
#> 6 Area2 2016 68 147 0.46258503 0.38396418 0.5431116
#> 7 Area3 2016 98 131 0.74809160 0.66741220 0.8146354
#> 8 Area4 2016 66 139 0.47482014 0.39360287 0.5573917
# specify confidence level for proportion
phe_proportion(df, obs, pop, confidence=99.8)
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 92 162 0.56790123 0.44718460 0.6810582
#> 2 Area2 2015 66 123 0.53658537 0.40007734 0.6678218
#> 3 Area3 2015 12 124 0.09677419 0.04145507 0.2097591
#> 4 Area4 2015 51 161 0.31677019 0.21646938 0.4375901
#> 5 Area1 2016 27 155 0.17419355 0.09979681 0.2864063
#> 6 Area2 2016 68 147 0.46258503 0.34170145 0.5880332
#> 7 Area3 2016 98 131 0.74809160 0.61683121 0.8456392
#> 8 Area4 2016 66 139 0.47482014 0.34981672 0.6030610
# specify to output proportions as percentages
phe_proportion(df, obs, pop, percentage=TRUE)
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 92 162 56.790123 49.091928 64.17375
#> 2 Area2 2015 66 123 53.658537 44.868986 62.22649
#> 3 Area3 2015 12 124 9.677419 5.622821 16.15529
#> 4 Area4 2015 51 161 31.677019 24.989359 39.21867
#> 5 Area1 2016 27 155 17.419355 12.256671 24.15791
#> 6 Area2 2016 68 147 46.258503 38.396418 54.31116
#> 7 Area3 2016 98 131 74.809160 66.741220 81.46354
#> 8 Area4 2016 66 139 47.482014 39.360287 55.73917
# specify level of detail to output for proportion
phe_proportion(df, obs, pop, confidence=99.8, percentage=TRUE, type="full")
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 92 162 56.790123 44.718460 68.10582 99.8% percentage
#> 2 Area2 2015 66 123 53.658537 40.007734 66.78218 99.8% percentage
#> 3 Area3 2015 12 124 9.677419 4.145507 20.97591 99.8% percentage
#> 4 Area4 2015 51 161 31.677019 21.646938 43.75901 99.8% percentage
#> 5 Area1 2016 27 155 17.419355 9.979681 28.64063 99.8% percentage
#> 6 Area2 2016 68 147 46.258503 34.170145 58.80332 99.8% percentage
#> 7 Area3 2016 98 131 74.809160 61.683121 84.56392 99.8% percentage
#> 8 Area4 2016 66 139 47.482014 34.981672 60.30610 99.8% percentage
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
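The method column in the full output above reports Wilson. As a rough illustration of what a Wilson score interval involves, the base R sketch below recomputes the default 95% limits for the first row of df (92 events in a population of 162). The helper name wilson_ci is hypothetical and this is not the package's own code, only the standard Wilson score formula.

```r
# Hypothetical helper: Wilson score interval for a proportion x / n.
# Not taken from PHEindicatormethods; shown only to illustrate the method.
wilson_ci <- function(x, n, confidence = 0.95) {
  z <- qnorm(1 - (1 - confidence) / 2)   # two-sided critical value
  p <- x / n                             # observed proportion
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(value = p, lowercl = centre - half, uppercl = centre + half)
}

# First row of df as printed above: obs = 92, pop = 162
res_wilson <- wilson_ci(92, 162)
round(res_wilson, 4)
```

The rounded limits (0.4909 and 0.6417) agree with the first row of the default phe_proportion output shown earlier.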
# default rate
phe_rate(df, obs, pop)
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 92 162 56790.123 45779.632 69648.77
#> 2 Area2 2015 66 123 53658.537 41497.543 68267.87
#> 3 Area3 2015 12 124 9677.419 4994.766 16905.53
#> 4 Area4 2015 51 161 31677.019 23583.864 41650.29
#> 5 Area1 2016 27 155 17419.355 11476.751 25345.22
#> 6 Area2 2016 68 147 46258.503 35919.923 58644.55
#> 7 Area3 2016 98 131 74809.160 60732.236 91169.28
#> 8 Area4 2016 66 139 47482.014 36720.847 60409.70
# specify rate parameters
phe_rate(df, obs, pop, type="full", confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 92 162 56.790123 40.22441 77.58625 99.8% rate per 100
#> 2 Area2 2015 66 123 53.658537 35.52017 77.38981 99.8% rate per 100
#> 3 Area3 2015 12 124 9.677419 3.22611 21.83948 99.8% rate per 100
#> 4 Area4 2015 51 161 31.677019 19.70060 47.94048 99.8% rate per 100
#> 5 Area1 2016 27 155 17.419355 8.84022 30.49519 99.8% rate per 100
#> 6 Area2 2016 68 147 46.258503 30.82527 66.36981 99.8% rate per 100
#> 7 Area3 2016 98 131 74.809160 53.59832 101.24804 99.8% rate per 100
#> 8 Area4 2016 66 139 47.482014 31.43151 68.48163 99.8% rate per 100
#> method
#> 1 Byars
#> 2 Byars
#> 3 Byars
#> 4 Byars
#> 5 Byars
#> 6 Byars
#> 7 Byars
#> 8 Byars
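The rate output above names Byars as the method. The sketch below applies Byar's approximation to Poisson confidence limits in base R for the first row of df, using the default multiplier of 100,000 implied by the output. The helper name byars_rate_ci is hypothetical, the code is not the package's implementation, and it does not cover any special handling the package may apply to very small counts.

```r
# Hypothetical helper: Byar's approximation to Poisson confidence limits
# for a rate x / n, scaled by a multiplier. Illustration only; assumes x > 0.
byars_rate_ci <- function(x, n, confidence = 0.95, multiplier = 100000) {
  z <- qnorm(1 - (1 - confidence) / 2)
  # approximate limits for the observed count
  lower_count <- x * (1 - 1 / (9 * x) - z / (3 * sqrt(x)))^3
  upper_count <- (x + 1) * (1 - 1 / (9 * (x + 1)) + z / (3 * sqrt(x + 1)))^3
  # convert count limits to rates per `multiplier`
  c(value   = x / n,
    lowercl = lower_count / n,
    uppercl = upper_count / n) * multiplier
}

# First row of df as printed above: obs = 92, pop = 162
res_byars <- byars_rate_ci(92, 162)
round(res_byars, 1)
```

For this row the sketch gives limits of roughly 45,780 and 69,649 per 100,000, in line with the default phe_rate output.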
The remaining functions aggregate the rows in the input data frame to produce a single statistic. It is also possible to calculate multiple statistics in a single execution of these functions if the input data frame is grouped - for example by indicator ID, geographic area or time period (or all three). The output contains only the grouping variables and the values calculated by the function - any additional unused columns provided in the input data frame will not be retained in the output.
The df test data generated earlier can be used to demonstrate phe_mean:
INPUT: The phe_mean function takes a single data frame as input with a column representing the numbers to be averaged.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values (if applicable), the mean, the lower 95% confidence limit, and the upper 95% confidence limit.
OPTIONS: The function also accepts additional arguments to specify the level of confidence and the level of detail to be output.
Here are some example code chunks to demonstrate the phe_mean function and the arguments that can optionally be specified:
# default mean
phe_mean(df,obs)
#> value lowercl uppercl
#> 1 60 35.36523 84.63477
# multiple means in a single execution with 99.8% confidence and full output
# (group_by() and %>% come from dplyr, which must be attached first)
library(dplyr)
df %>%
group_by(year) %>%
phe_mean(obs, type="full", confidence=0.998)
#> # A tibble: 2 x 10
#> year value_sum value_count stdev value lowercl uppercl confidence
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2015 221 4 33.4 55.2 -116. 226. 99.8%
#> 2 2016 259 4 29.1 64.8 -83.9 213. 99.8%
#> # ... with 2 more variables: statistic <chr>, method <chr>
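The limits in the ungrouped example are consistent with a Student's t interval around the mean. The base R sketch below recomputes them from the obs column of df as printed earlier. The helper name t_mean_ci is hypothetical, and using the t-distribution is an assumption made for illustration rather than a statement about the package's code.

```r
# obs column of df, copied from the printed data frame above
obs <- c(92, 66, 12, 51, 27, 68, 98, 66)

# Hypothetical helper: mean with a Student's t confidence interval
t_mean_ci <- function(x, confidence = 0.95) {
  n      <- length(x)
  se     <- sd(x) / sqrt(n)                          # standard error of the mean
  t_crit <- qt(1 - (1 - confidence) / 2, df = n - 1) # two-sided critical value
  c(value   = mean(x),
    lowercl = mean(x) - t_crit * se,
    uppercl = mean(x) + t_crit * se)
}

res_mean <- t_mean_ci(obs)
round(res_mean, 5)
```

With only four values per year, the t critical value at 99.8% confidence on 3 degrees of freedom is large (about 10.2), which is why the grouped example returns such wide limits, including negative lower limits.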
The following code chunk creates a data frame containing observed number of events and populations by age band for 4 areas, 5 time periods and 2 sexes:
df_std <- data.frame(
area = rep(c("Area1", "Area2", "Area3", "Area4"), each = 19 * 2 * 5),
year = rep(2006:2010, each = 19 * 2),
sex = rep(rep(c("Male", "Female"), each = 19), 5),
ageband = rep(c(0, 5,10,15,20,25,30,35,40,45,
50,55,60,65,70,75,80,85,90), times = 10),
obs = sample(200, 19 * 2 * 5 * 4, replace = TRUE),
pop = sample(10000:20000, 19 * 2 * 5 * 4, replace = TRUE))
head(df_std)
#> area year sex ageband obs pop
#> 1 Area1 2006 Male 0 88 15554
#> 2 Area1 2006 Male 5 101 11182
#> 3 Area1 2006 Male 10 52 19038
#> 4 Area1 2006 Male 15 8 10034
#> 5 Area1 2006 Male 20 171 17450
#> 6 Area1 2006 Male 25 180 13091
INPUT: The minimum input requirement for the phe_dsr function is a single data frame with columns representing the numerators and denominators for each standardisation category. This is sufficient if the data can be standardised against the default standard population and is sorted to match it, as described below.
The 2013 European Standard Population is provided within the package in vector form (esp2013) and is used by default by this function. Alternative standard populations can be used but must be provided by the user. When the function joins a standard population vector to the input data frame it does this by position so it is important that the data is sorted accordingly. This is a user responsibility.
The function can also accept standard populations provided as a column within the input data frame. The requirements for the two forms are:
Standard populations provided as a vector: the vector and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order.
Standard populations provided as a column within the input data frame: the standard populations can be appended to the input data frame by the user prior to execution of the function. If the data is grouped to generate multiple dsrs then the standard populations will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values, the dsr, the lower 95% confidence limit, and the upper 95% confidence limit.
OPTIONS: If standard populations are being provided as a column within the input data frame then the user must specify this using the stdpoptype argument as the function expects a vector by default. The function also accepts additional arguments to specify the standard populations, the level of confidence, the multiplier and the level of detail to be output.
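Because a standard population vector is joined by position, it can help to sort the input and confirm the ordering before calling the function. The base R sketch below shows one way to do this on a small made-up data frame. All names and values in it (toy, toy_sorted, aligned) are hypothetical and not part of the package.

```r
# Toy data deliberately out of order: two areas, three age bands each.
# All names and values here are made up for illustration.
toy <- data.frame(
  area    = c("A", "A", "A", "B", "B", "B"),
  ageband = c(10, 0, 5, 5, 10, 0),
  obs     = c(3, 1, 2, 5, 6, 4),
  pop     = c(300, 100, 200, 500, 600, 400)
)

# Sort by the grouping variable first, then by standardisation category,
# so that row i within every group refers to the same age band
toy_sorted <- toy[order(toy$area, toy$ageband), ]

# Check that every group lists the age bands in the same ascending order
expected_bands <- sort(unique(toy$ageband))
bands_by_area  <- split(toy_sorted$ageband, toy_sorted$area)
aligned <- all(vapply(bands_by_area,
                      function(b) identical(b, expected_bands),
                      logical(1)))
aligned
```

A check of this kind only confirms ordering within the data. Whether the categories match those of the standard population vector still has to be verified by the user.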
Here are some example code chunks to demonstrate the phe_dsr function and the arguments that can optionally be specified:
# calculate separate dsrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop)
#> # A tibble: 40 x 6
#> # Groups: area, year [20]
#> area year sex value lowercl uppercl
#> <fct> <int> <fct> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 771. 736. 806.
#> 2 Area1 2006 Male 658. 626. 690.
#> 3 Area1 2007 Female 685. 652. 720.
#> 4 Area1 2007 Male 830. 792. 870.
#> 5 Area1 2008 Female 704. 668. 741.
#> 6 Area1 2008 Male 703. 670. 738.
#> 7 Area1 2009 Female 585. 556. 616.
#> 8 Area1 2009 Male 608. 577. 639.
#> 9 Area1 2010 Female 912. 874. 952.
#> 10 Area1 2010 Male 731. 697. 766.
#> # ... with 30 more rows
# calculate same specifying standard population in vector form
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013)
#> # A tibble: 40 x 6
#> # Groups: area, year [20]
#> area year sex value lowercl uppercl
#> <fct> <int> <fct> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 771. 736. 806.
#> 2 Area1 2006 Male 658. 626. 690.
#> 3 Area1 2007 Female 685. 652. 720.
#> 4 Area1 2007 Male 830. 792. 870.
#> 5 Area1 2008 Female 704. 668. 741.
#> 6 Area1 2008 Male 703. 670. 738.
#> 7 Area1 2009 Female 585. 556. 616.
#> 8 Area1 2009 Male 608. 577. 639.
#> 9 Area1 2010 Female 912. 874. 952.
#> 10 Area1 2010 Male 731. 697. 766.
#> # ... with 30 more rows
# calculate the same dsrs by appending the standard populations to the data frame
df_std %>%
mutate(refpop = rep(esp2013,40)) %>%
group_by(area, year, sex) %>%
phe_dsr(obs,pop, stdpop=refpop, stdpoptype="field")
#> # A tibble: 40 x 6
#> # Groups: area, year [20]
#> area year sex value lowercl uppercl
#> <fct> <int> <fct> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 771. 736. 806.
#> 2 Area1 2006 Male 658. 626. 690.
#> 3 Area1 2007 Female 685. 652. 720.
#> 4 Area1 2007 Male 830. 792. 870.
#> 5 Area1 2008 Female 704. 668. 741.
#> 6 Area1 2008 Male 703. 670. 738.
#> 7 Area1 2009 Female 585. 556. 616.
#> 8 Area1 2009 Male 608. 577. 639.
#> 9 Area1 2010 Female 912. 874. 952.
#> 10 Area1 2010 Male 731. 697. 766.
#> # ... with 30 more rows
# calculate for under 75s by filtering out records for 75+ from input data frame and standard population
check <- df_std %>%
filter(ageband <= 70) %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013[1:15])
# calculate separate dsrs for persons for each area and year
df_std %>%
group_by(area, year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(area, year) %>%
phe_dsr(obs,pop, type="full")
#> # A tibble: 20 x 10
#> # Groups: area [4]
#> area year total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 3937 578616 697. 674. 720. 95%
#> 2 Area1 2007 3876 551569 736. 711. 761. 95%
#> 3 Area1 2008 3603 572850 692. 668. 716. 95%
#> 4 Area1 2009 3611 596369 589. 568. 610. 95%
#> 5 Area1 2010 4454 572323 781. 757. 806. 95%
#> 6 Area2 2006 3614 569492 648. 626. 671. 95%
#> 7 Area2 2007 4079 580956 748. 724. 772. 95%
#> 8 Area2 2008 3796 548106 714. 690. 739. 95%
#> 9 Area2 2009 3794 569715 696. 673. 720. 95%
#> 10 Area2 2010 3145 581697 540. 521. 561. 95%
#> 11 Area3 2006 3417 574017 599. 578. 621. 95%
#> 12 Area3 2007 3450 568075 643. 620. 665. 95%
#> 13 Area3 2008 4184 567538 737. 714. 761. 95%
#> 14 Area3 2009 4435 550263 891. 863. 919. 95%
#> 15 Area3 2010 4216 582870 756. 732. 781. 95%
#> 16 Area4 2006 4132 577898 737. 714. 761. 95%
#> 17 Area4 2007 4092 583720 706. 684. 730. 95%
#> 18 Area4 2008 3894 562343 688. 665. 711. 95%
#> 19 Area4 2009 2858 583706 477. 458. 497. 95%
#> 20 Area4 2010 4654 570502 826. 801. 851. 95%
#> # ... with 2 more variables: statistic <chr>, method <chr>
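The point estimate of a directly standardised rate is a weighted average of the age-specific rates, with the standard population as the weights. The base R sketch below illustrates this for a single grouping set. The helper name dsr_point is hypothetical, the weights are written out on the assumption that they match the 19 five-year bands of the esp2013 vector, and no attempt is made to reproduce the confidence limits.

```r
# 19 five-year age-band weights (0-4 up to 90+).
# Assumed to match the package's esp2013 vector; they sum to 100,000.
esp_weights <- c(5000, 5500, 5500, 5500, 6000, 6000, 6500, 7000, 7000, 7000,
                 7000, 6500, 6000, 5500, 5000, 4000, 2500, 1500, 1000)

# Hypothetical helper: directly standardised rate (point estimate only).
# obs, pop and stdpop must be sorted by the same standardisation categories.
dsr_point <- function(obs, pop, stdpop, multiplier = 100000) {
  stopifnot(length(obs) == length(stdpop), length(pop) == length(stdpop))
  sum(stdpop * obs / pop) / sum(stdpop) * multiplier
}

# Made-up example: the same age-specific rate (10 per 10,000) in every band,
# so the standardised rate should equal that rate, i.e. 100 per 100,000
dsr_same_rate <- dsr_point(obs = rep(10, 19), pop = rep(10000, 19),
                           stdpop = esp_weights)
dsr_same_rate
```

When the age-specific rates differ between bands, the result depends on the weights, which is why the same standard population must be used for any rates that are to be compared.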
INPUT: Unlike the phe_dsr function, there is no default standard or reference data for the phe_smr and phe_isr functions. These functions take a single data frame as input, with columns representing the numerators and denominators for each standardisation category, PLUS reference numerators and denominators for each standardisation category.
The reference data can either be provided in a separate data frame/vectors or as columns within the input data frame:
Reference data provided as a data frame or as vectors: the data frame/vectors and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order.
Reference data provided as columns within the input data frame: the reference numerators and denominators can be appended to the input data frame prior to execution of the function. If the data is grouped to generate multiple smrs/isrs then the reference data will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the functions output one row per grouping set containing the grouping variable values, the smr or isr, the lower 95% confidence limit, and the upper 95% confidence limit.
OPTIONS: If reference data are being provided as columns within the input data frame then the user must specify this using the refpoptype argument, as the functions expect vectors by default. The functions also accept additional arguments to specify the level of confidence, the multiplier and the level of detail to be output.
The following code chunk creates a data frame containing the reference data - this example uses the all area data for persons in the baseline year:
df_ref <- df_std %>%
filter(year == 2006) %>%
group_by(ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop))
head(df_ref)
#> # A tibble: 6 x 3
#> ageband obs pop
#> <dbl> <int> <int>
#> 1 0 872 114224
#> 2 5 1066 128102
#> 3 10 784 120156
#> 4 15 566 123104
#> 5 20 778 125081
#> 6 25 760 128416
Here are some example code chunks to demonstrate the phe_smr function and the arguments that can optionally be specified:
# calculate separate smrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 6
#> # Groups: area, year [20]
#> area year sex value lowercl uppercl
#> <fct> <int> <fct> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 1.11 1.07 1.16
#> 2 Area1 2006 Male 0.960 0.916 1.01
#> 3 Area1 2007 Female 1.06 1.01 1.11
#> 4 Area1 2007 Male 1.12 1.07 1.17
#> 5 Area1 2008 Female 0.946 0.902 0.992
#> 6 Area1 2008 Male 0.941 0.898 0.984
#> 7 Area1 2009 Female 0.916 0.874 0.959
#> 8 Area1 2009 Male 0.958 0.914 1.00
#> 9 Area1 2010 Female 1.33 1.28 1.39
#> 10 Area1 2010 Male 1.06 1.01 1.10
#> # ... with 30 more rows
# calculate the same smrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 6
#> # Groups: area, year [20]
#> area year sex value lowercl uppercl
#> <fct> <int> <fct> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 1.11 1.07 1.16
#> 2 Area1 2006 Male 0.960 0.916 1.01
#> 3 Area1 2007 Female 1.06 1.01 1.11
#> 4 Area1 2007 Male 1.12 1.07 1.17
#> 5 Area1 2008 Female 0.946 0.902 0.992
#> 6 Area1 2008 Male 0.941 0.898 0.984
#> 7 Area1 2009 Female 0.916 0.874 0.959
#> 8 Area1 2009 Male 0.958 0.914 1.00
#> 9 Area1 2010 Female 1.33 1.28 1.39
#> 10 Area1 2010 Male 1.06 1.01 1.10
#> # ... with 30 more rows
# calculate separate smrs for each year
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop, type="full")
#> # A tibble: 5 x 9
#> year observed expected value lowercl uppercl confidence statistic
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 2006 15100 15100 1 0.984 1.02 95% smr x 1
#> 2 2007 15497 14935. 1.04 1.02 1.05 95% smr x 1
#> 3 2008 15477 14817. 1.04 1.03 1.06 95% smr x 1
#> 4 2009 14698 14970. 0.982 0.966 0.998 95% smr x 1
#> 5 2010 16469 15115. 1.09 1.07 1.11 95% smr x 1
#> # ... with 1 more variable: method <chr>
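The observed and expected columns in the full output show how the ratio is built: expected events come from applying the reference rates to the local population, and the smr is observed divided by expected. The base R sketch below illustrates this for one grouping set. The helper name smr_sketch and the input values are hypothetical, and Byar's approximation is assumed for the limits. The method actually used by the package is reported in the method column of the full output.

```r
# Hypothetical helper: SMR with Byar's approximate confidence limits.
# Inputs are vectors with one element per standardisation category,
# all sorted in the same category order. Assumes sum(obs) > 0.
smr_sketch <- function(obs, pop, refobs, refpop, confidence = 0.95) {
  observed <- sum(obs)
  # expected events: reference rates applied to the local population
  expected <- sum(pop * refobs / refpop)
  z <- qnorm(1 - (1 - confidence) / 2)
  lower_obs <- observed * (1 - 1 / (9 * observed) - z / (3 * sqrt(observed)))^3
  upper_obs <- (observed + 1) *
    (1 - 1 / (9 * (observed + 1)) + z / (3 * sqrt(observed + 1)))^3
  c(observed = observed,
    expected = expected,
    value    = observed / expected,
    lowercl  = lower_obs / expected,
    uppercl  = upper_obs / expected)
}

# Made-up inputs for three age bands (illustration only)
res_smr <- smr_sketch(obs    = c(12, 30, 58),
                      pop    = c(4000, 5000, 3000),
                      refobs = c(200, 600, 1500),
                      refpop = c(100000, 90000, 60000))
round(res_smr, 3)
```

In this made-up case there are 100 observed events against roughly 116.3 expected, giving an smr of about 0.86.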
The phe_isr function works in exactly the same way, but instead of expressing the result as a ratio of observed to expected events, it expresses the result as a rate. The reference rate is also provided when the full output is requested. Here are some examples:
# calculate separate isrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 6
#> # Groups: area, year [20]
#> area year sex value lowercl uppercl
#> <fct> <int> <fct> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 732. 701. 763.
#> 2 Area1 2006 Male 631. 602. 661.
#> 3 Area1 2007 Female 697. 666. 728.
#> 4 Area1 2007 Male 734. 701. 767.
#> 5 Area1 2008 Female 621. 592. 651.
#> 6 Area1 2008 Male 618. 590. 646.
#> 7 Area1 2009 Female 601. 574. 629.
#> 8 Area1 2009 Male 629. 600. 658.
#> 9 Area1 2010 Female 874. 840. 910.
#> 10 Area1 2010 Male 695. 665. 725.
#> # ... with 30 more rows
# calculate the same isrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 6
#> # Groups: area, year [20]
#> area year sex value lowercl uppercl
#> <fct> <int> <fct> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 732. 701. 763.
#> 2 Area1 2006 Male 631. 602. 661.
#> 3 Area1 2007 Female 697. 666. 728.
#> 4 Area1 2007 Male 734. 701. 767.
#> 5 Area1 2008 Female 621. 592. 651.
#> 6 Area1 2008 Male 618. 590. 646.
#> 7 Area1 2009 Female 601. 574. 629.
#> 8 Area1 2009 Male 629. 600. 658.
#> 9 Area1 2010 Female 874. 840. 910.
#> 10 Area1 2010 Male 695. 665. 725.
#> # ... with 30 more rows
# calculate separate isrs for each year
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop, type="full")
#> # A tibble: 5 x 10
#> year observed expected ref_rate value lowercl uppercl confidence
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2006 15100 15100 657. 657. 646. 667. 95%
#> 2 2007 15497 14935. 657. 681. 671. 692. 95%
#> 3 2008 15477 14817. 657. 686. 675. 697. 95%
#> 4 2009 14698 14970. 657. 645. 634. 655. 95%
#> 5 2010 16469 15115. 657. 715. 704. 726. 95%
#> # ... with 2 more variables: statistic <chr>, method <chr>
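The full output above suggests how the indirectly standardised rate relates to the ratio of observed to expected events: the ratio is multiplied by the overall reference rate. The base R sketch below illustrates the point estimate only. The helper name isr_sketch and the input values are hypothetical, and the default multiplier of 100,000 is assumed from the scale of the output.

```r
# Hypothetical helper: indirectly standardised rate (point estimate only).
# Inputs are vectors with one element per standardisation category,
# all sorted in the same category order.
isr_sketch <- function(obs, pop, refobs, refpop, multiplier = 100000) {
  observed <- sum(obs)
  # expected events: reference rates applied to the local population
  expected <- sum(pop * refobs / refpop)
  # overall (crude) rate in the reference population
  ref_rate <- sum(refobs) / sum(refpop) * multiplier
  c(observed = observed,
    expected = expected,
    ref_rate = ref_rate,
    value    = observed / expected * ref_rate)
}

# Made-up inputs for three age bands (illustration only)
res_isr <- isr_sketch(obs    = c(12, 30, 58),
                      pop    = c(4000, 5000, 3000),
                      refobs = c(200, 600, 1500),
                      refpop = c(100000, 90000, 60000))
round(res_isr, 1)
```

When the observed and expected counts are equal, as in the 2006 row above, the value equals the reference rate.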