This vignette introduces the following functions from the PHEindicatormethods package and provides basic sample code to demonstrate their execution. The code included is based on the code provided within the ‘examples’ section of the function documentation. This vignette does not explain the methods applied in detail but these can (optionally) be output alongside the statistics or for a more detailed explanation, please see the references section of the function documentation.
This vignette covers the following functions available within the first release of the package (v1.0.8) but has been updated to apply to these functions in their latest release versions. If further functions are added to the package in future releases these will be explained elsewhere.
Function | Type | Description |
---|---|---|
phe_proportion | Non-aggregate | Performs a calculation on each row of data (unless data is grouped) |
phe_rate | Non-aggregate | Performs a calculation on each row of data (unless data is grouped) |
phe_mean | Aggregate | Performs a calculation on each grouping set |
phe_dsr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_smr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_isr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
The following code chunk creates a data frame containing observed number of events and populations for 4 geographical areas over 2 time periods that is used later to demonstrate the PHEindicatormethods package functions:
df <- data.frame(
area = rep(c("Area1","Area2","Area3","Area4"), 2),
year = rep(2015:2016, each = 4),
obs = sample(100, 2 * 4, replace = TRUE),
pop = sample(100:200, 2 * 4, replace = TRUE))
df
#> area year obs pop
#> 1 Area1 2015 19 133
#> 2 Area2 2015 22 130
#> 3 Area3 2015 54 126
#> 4 Area4 2015 95 177
#> 5 Area1 2016 77 152
#> 6 Area2 2016 69 162
#> 7 Area3 2016 81 179
#> 8 Area4 2016 98 136
INPUT: The phe_proportion and phe_rate functions take a single data frame as input with columns representing the numerators and denominators for the statistic. Any other columns present will be retained in the output.
OUTPUT: The functions output the original data frame with additional columns appended. By default the additional columns are the proportion or rate, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: The functions also accept additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
Here are some example code chunks to demonstrate these two functions and the arguments that can optionally be specified
# default proportion
phe_proportion(df, obs, pop)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 19 133 0.1428571 0.09340234 0.2123636 95% proportion of 1
#> 2 Area2 2015 22 130 0.1692308 0.11449563 0.2429531 95% proportion of 1
#> 3 Area3 2015 54 126 0.4285714 0.34553800 0.5158314 95% proportion of 1
#> 4 Area4 2015 95 177 0.5367232 0.46326231 0.6086239 95% proportion of 1
#> 5 Area1 2016 77 152 0.5065789 0.42792215 0.5849114 95% proportion of 1
#> 6 Area2 2016 69 162 0.4259259 0.35236420 0.5029193 95% proportion of 1
#> 7 Area3 2016 81 179 0.4525140 0.38135857 0.5256647 95% proportion of 1
#> 8 Area4 2016 98 136 0.7205882 0.63991250 0.7891448 95% proportion of 1
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
# specify confidence level for proportion
phe_proportion(df, obs, pop, confidence=99.8)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 19 133 0.1428571 0.07310538 0.2604596 99.8% proportion of 1
#> 2 Area2 2015 22 130 0.1692308 0.09120193 0.2925295 99.8% proportion of 1
#> 3 Area3 2015 54 126 0.4285714 0.30215585 0.5650514 99.8% proportion of 1
#> 4 Area4 2015 95 177 0.5367232 0.42200675 0.6476798 99.8% proportion of 1
#> 5 Area1 2016 77 152 0.5065789 0.38463503 0.6277451 99.8% proportion of 1
#> 6 Area2 2016 69 162 0.4259259 0.31330969 0.5467890 99.8% proportion of 1
#> 7 Area3 2016 81 179 0.4525140 0.34287697 0.5669610 99.8% proportion of 1
#> 8 Area4 2016 98 136 0.7205882 0.59027293 0.8219579 99.8% proportion of 1
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
# specify to output proportions as percentages
phe_proportion(df, obs, pop, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic method
#> 1 Area1 2015 19 133 14.28571 9.340234 21.23636 95% percentage Wilson
#> 2 Area2 2015 22 130 16.92308 11.449563 24.29531 95% percentage Wilson
#> 3 Area3 2015 54 126 42.85714 34.553800 51.58314 95% percentage Wilson
#> 4 Area4 2015 95 177 53.67232 46.326231 60.86239 95% percentage Wilson
#> 5 Area1 2016 77 152 50.65789 42.792215 58.49114 95% percentage Wilson
#> 6 Area2 2016 69 162 42.59259 35.236420 50.29193 95% percentage Wilson
#> 7 Area3 2016 81 179 45.25140 38.135857 52.56647 95% percentage Wilson
#> 8 Area4 2016 98 136 72.05882 63.991250 78.91448 95% percentage Wilson
# specify level of detail to output for proportion
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic method
#> 1 Area1 2015 19 133 14.28571 7.310538 26.04596 99.8% percentage Wilson
#> 2 Area2 2015 22 130 16.92308 9.120193 29.25295 99.8% percentage Wilson
#> 3 Area3 2015 54 126 42.85714 30.215585 56.50514 99.8% percentage Wilson
#> 4 Area4 2015 95 177 53.67232 42.200675 64.76798 99.8% percentage Wilson
#> 5 Area1 2016 77 152 50.65789 38.463503 62.77451 99.8% percentage Wilson
#> 6 Area2 2016 69 162 42.59259 31.330969 54.67890 99.8% percentage Wilson
#> 7 Area3 2016 81 179 45.25140 34.287697 56.69610 99.8% percentage Wilson
#> 8 Area4 2016 98 136 72.05882 59.027293 82.19579 99.8% percentage Wilson
# specify level of detail to output for proportion and remove metadata columns
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100, type="standard")
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 19 133 14.28571 7.310538 26.04596
#> 2 Area2 2015 22 130 16.92308 9.120193 29.25295
#> 3 Area3 2015 54 126 42.85714 30.215585 56.50514
#> 4 Area4 2015 95 177 53.67232 42.200675 64.76798
#> 5 Area1 2016 77 152 50.65789 38.463503 62.77451
#> 6 Area2 2016 69 162 42.59259 31.330969 54.67890
#> 7 Area3 2016 81 179 45.25140 34.287697 56.69610
#> 8 Area4 2016 98 136 72.05882 59.027293 82.19579
# default rate
phe_rate(df, obs, pop)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 19 133 14285.71 8596.978 22310.03 95% rate per 100000
#> 2 Area2 2015 22 130 16923.08 10601.917 25622.91 95% rate per 100000
#> 3 Area3 2015 54 126 42857.14 32193.437 55920.36 95% rate per 100000
#> 4 Area4 2015 95 177 53672.32 43423.011 65612.28 95% rate per 100000
#> 5 Area1 2016 77 152 50657.89 39976.969 63314.49 95% rate per 100000
#> 6 Area2 2016 69 162 42592.59 33138.127 53904.44 95% rate per 100000
#> 7 Area3 2016 81 179 45251.40 35934.947 56244.04 95% rate per 100000
#> 8 Area4 2016 98 136 72058.82 58499.433 87817.47 95% rate per 100000
#> method
#> 1 Byars
#> 2 Byars
#> 3 Byars
#> 4 Byars
#> 5 Byars
#> 6 Byars
#> 7 Byars
#> 8 Byars
# specify rate parameters
phe_rate(df, obs, pop, confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic method
#> 1 Area1 2015 19 133 14.28571 6.217672 27.62798 99.8% rate per 100 Byars
#> 2 Area2 2015 22 130 16.92308 7.887686 31.33951 99.8% rate per 100 Byars
#> 3 Area3 2015 54 126 42.85714 27.047717 64.14027 99.8% rate per 100 Byars
#> 4 Area4 2015 95 177 53.67232 38.240074 72.97515 99.8% rate per 100 Byars
#> 5 Area1 2016 77 152 50.65789 34.659672 71.17371 99.8% rate per 100 Byars
#> 6 Area2 2016 69 162 42.59259 28.473385 60.95598 99.8% rate per 100 Byars
#> 7 Area3 2016 81 179 45.25140 31.278577 63.05818 99.8% rate per 100 Byars
#> 8 Area4 2016 98 136 72.05882 51.627792 97.52569 99.8% rate per 100 Byars
# specify rate parameters and reduce columns output and remove metadata columns
phe_rate(df, obs, pop, type="standard", confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 19 133 14.28571 6.217672 27.62798
#> 2 Area2 2015 22 130 16.92308 7.887686 31.33951
#> 3 Area3 2015 54 126 42.85714 27.047717 64.14027
#> 4 Area4 2015 95 177 53.67232 38.240074 72.97515
#> 5 Area1 2016 77 152 50.65789 34.659672 71.17371
#> 6 Area2 2016 69 162 42.59259 28.473385 60.95598
#> 7 Area3 2016 81 179 45.25140 31.278577 63.05818
#> 8 Area4 2016 98 136 72.05882 51.627792 97.52569
These functions can also return aggregate data if the input dataframes are grouped:
# default proportion - grouped
df %>%
group_by(year) %>%
phe_proportion(obs, pop)
#> # A tibble: 2 x 9
#> year obs pop value lowercl uppercl confidence statistic method
#> <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2015 190 566 0.336 0.298 0.376 95% proportion of 1 Wilson
#> 2 2016 325 629 0.517 0.478 0.556 95% proportion of 1 Wilson
# default rate - grouped
df %>%
group_by(year) %>%
phe_rate(obs, pop)
#> # A tibble: 2 x 9
#> year obs pop value lowercl uppercl confidence statistic method
#> <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2015 190 566 33569. 28965. 38697. 95% rate per 100000 Byars
#> 2 2016 325 629 51669. 46204. 57604. 95% rate per 100000 Byars
The remaining functions aggregate the rows in the input data frame to produce a single statistic. It is also possible to calculate multiple statistics in a single execution of these functions if the input data frame is grouped - for example by indicator ID, geographic area or time period (or all three). The output contains only the grouping variables and the values calculated by the function - any additional unused columns provided in the input data frame will not be retained in the output.
The df test data generated earlier can be used to demonstrate phe_mean:
INPUT: The phe_mean function take a single data frame as input with a column representing the numbers to be averaged.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values (if applicable), the mean, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: The function also accepts additional arguments to specify the level of confidence and a reduced level of detail to be output.
Here are some example code chunks to demonstrate the phe_mean function and the arguments that can optionally be specified
# default mean
phe_mean(df,obs)
#> value_sum value_count stdev value lowercl uppercl confidence statistic
#> 1 515 8 30.44404 64.375 38.92315 89.82685 95% mean
#> method
#> 1 Student's t-distribution
# multiple means in a single execution with 99.8% confidence
df %>%
group_by(year) %>%
phe_mean(obs, confidence=0.998)
#> # A tibble: 2 x 10
#> year value_sum value_count stdev value lowercl uppercl confidence statistic
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 2015 190 4 35.4 47.5 -133. 228. 99.8% mean
#> 2 2016 325 4 12.2 81.2 18.8 144. 99.8% mean
#> # ... with 1 more variable: method <chr>
# multiple means in a single execution with 99.8% confidence and data-only output
df %>%
group_by(year) %>%
phe_mean(obs, type = "standard", confidence=0.998)
#> # A tibble: 2 x 7
#> year value_sum value_count stdev value lowercl uppercl
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2015 190 4 35.4 47.5 -133. 228.
#> 2 2016 325 4 12.2 81.2 18.8 144.
The following code chunk creates a data frame containing observed number of events and populations by age band for 4 areas, 5 time periods and 2 sexes:
df_std <- data.frame(
area = rep(c("Area1", "Area2", "Area3", "Area4"), each = 19 * 2 * 5),
year = rep(2006:2010, each = 19 * 2),
sex = rep(rep(c("Male", "Female"), each = 19), 5),
ageband = rep(c(0, 5,10,15,20,25,30,35,40,45,
50,55,60,65,70,75,80,85,90), times = 10),
obs = sample(200, 19 * 2 * 5 * 4, replace = TRUE),
pop = sample(10000:20000, 19 * 2 * 5 * 4, replace = TRUE))
head(df_std)
#> area year sex ageband obs pop
#> 1 Area1 2006 Male 0 57 16548
#> 2 Area1 2006 Male 5 74 19238
#> 3 Area1 2006 Male 10 25 14721
#> 4 Area1 2006 Male 15 150 17799
#> 5 Area1 2006 Male 20 56 16806
#> 6 Area1 2006 Male 25 155 11338
INPUT: The minimum input requirement for the phe_dsr function is a single data frame with columns representing the numerators and denominators for each standardisation category. This is sufficient if the data is:
The 2013 European Standard Population is provided within the package in vector form (esp2013) and is used by default by this function. Alternative standard populations can be used but must be provided by the user. When the function joins a standard population vector to the input data frame it does this by position so it is important that the data is sorted accordingly. This is a user responsibility.
The function can also accept standard populations provided as a column within the input data frame.
standard populations provided as a vector - the vector and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order
standard populations provided as a column within the input data frame - the standard populations can be appended to the input data frame by the user prior to execution of the function - if the data is grouped to generate multiple dsrs then the standard populations will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values, the total count, the total population, the dsr, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: If standard populations are being provided as a column within the input data frame then the user must specify this using the stdpoptype argument as the function expects a vector by default. The function also accepts additional arguments to specify the standard populations, the level of confidence, the multiplier and a reduced level of detail to be output.
Here are some example code chunks to demonstrate the phe_dsr function and the arguments that can optionally be specified
# calculate separate dsrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop)
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex total_count total_pop standardised_pop value lowercl
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Fema~ 1902 280649 760. 760. 724.
#> 2 Area1 2006 Male 2238 284526 776. 776. 742.
#> 3 Area1 2007 Fema~ 1939 283938 714. 714. 681.
#> 4 Area1 2007 Male 1427 286021 533. 533. 504.
#> 5 Area1 2008 Fema~ 1901 270756 782. 782. 744.
#> 6 Area1 2008 Male 1404 277536 547. 547. 517.
#> 7 Area1 2009 Fema~ 1813 289587 731. 731. 696.
#> 8 Area1 2009 Male 2510 266885 973. 973. 932.
#> 9 Area1 2010 Fema~ 1724 275556 683. 683. 650.
#> 10 Area1 2010 Male 2273 280717 768. 768. 734.
#> # ... with 30 more rows, and 4 more variables: uppercl <dbl>, confidence <chr>,
#> # statistic <chr>, method <chr>
# calculate separate dsrs for each area, year and sex and drop metadata fields from output
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, type="standard")
#> # A tibble: 40 x 8
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 1902 280649 760. 724. 797.
#> 2 Area1 2006 Male 2238 284526 776. 742. 812.
#> 3 Area1 2007 Female 1939 283938 714. 681. 748.
#> 4 Area1 2007 Male 1427 286021 533. 504. 564.
#> 5 Area1 2008 Female 1901 270756 782. 744. 821.
#> 6 Area1 2008 Male 1404 277536 547. 517. 579.
#> 7 Area1 2009 Female 1813 289587 731. 696. 766.
#> 8 Area1 2009 Male 2510 266885 973. 932. 1014.
#> 9 Area1 2010 Female 1724 275556 683. 650. 717.
#> 10 Area1 2010 Male 2273 280717 768. 734. 803.
#> # ... with 30 more rows
# calculate same specifying standard population in vector form
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013)
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex total_count total_pop standardised_pop value lowercl
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Fema~ 1902 280649 760. 760. 724.
#> 2 Area1 2006 Male 2238 284526 776. 776. 742.
#> 3 Area1 2007 Fema~ 1939 283938 714. 714. 681.
#> 4 Area1 2007 Male 1427 286021 533. 533. 504.
#> 5 Area1 2008 Fema~ 1901 270756 782. 782. 744.
#> 6 Area1 2008 Male 1404 277536 547. 547. 517.
#> 7 Area1 2009 Fema~ 1813 289587 731. 731. 696.
#> 8 Area1 2009 Male 2510 266885 973. 973. 932.
#> 9 Area1 2010 Fema~ 1724 275556 683. 683. 650.
#> 10 Area1 2010 Male 2273 280717 768. 768. 734.
#> # ... with 30 more rows, and 4 more variables: uppercl <dbl>, confidence <chr>,
#> # statistic <chr>, method <chr>
# calculate the same dsrs by appending the standard populations to the data frame
df_std %>%
mutate(refpop = rep(esp2013,40)) %>%
group_by(area, year, sex) %>%
phe_dsr(obs,pop, stdpop=refpop, stdpoptype="field")
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex total_count total_pop standardised_pop value lowercl
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Fema~ 1902 280649 760. 760. 724.
#> 2 Area1 2006 Male 2238 284526 776. 776. 742.
#> 3 Area1 2007 Fema~ 1939 283938 714. 714. 681.
#> 4 Area1 2007 Male 1427 286021 533. 533. 504.
#> 5 Area1 2008 Fema~ 1901 270756 782. 782. 744.
#> 6 Area1 2008 Male 1404 277536 547. 547. 517.
#> 7 Area1 2009 Fema~ 1813 289587 731. 731. 696.
#> 8 Area1 2009 Male 2510 266885 973. 973. 932.
#> 9 Area1 2010 Fema~ 1724 275556 683. 683. 650.
#> 10 Area1 2010 Male 2273 280717 768. 768. 734.
#> # ... with 30 more rows, and 4 more variables: uppercl <dbl>, confidence <chr>,
#> # statistic <chr>, method <chr>
# calculate for under 75s by filtering out records for 75+ from input data frame and standard population
df_std %>%
filter(ageband <= 70) %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013[1:15])
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex total_count total_pop standardised_pop value lowercl
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Fema~ 1570 222016 713. 783. 744.
#> 2 Area1 2006 Male 1624 228497 672. 739. 703.
#> 3 Area1 2007 Fema~ 1510 232395 626. 687. 652.
#> 4 Area1 2007 Male 1066 220432 470. 517. 486.
#> 5 Area1 2008 Fema~ 1500 215180 719. 790. 749.
#> 6 Area1 2008 Male 1158 221122 503. 552. 520.
#> 7 Area1 2009 Fema~ 1636 226142 698. 767. 730.
#> 8 Area1 2009 Male 1907 213639 872. 959. 915.
#> 9 Area1 2010 Fema~ 1429 217487 617. 678. 643.
#> 10 Area1 2010 Male 1729 221139 703. 773. 736.
#> # ... with 30 more rows, and 4 more variables: uppercl <dbl>, confidence <chr>,
#> # statistic <chr>, method <chr>
# calculate separate dsrs for persons for each area and year)
df_std %>%
group_by(area, year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(area, year) %>%
phe_dsr(obs,pop)
#> # A tibble: 20 x 11
#> # Groups: area [4]
#> area year total_count total_pop standardised_pop value lowercl uppercl
#> <fct> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Area1 2006 4140 565175 730. 730. 707. 754.
#> 2 Area1 2007 3366 569959 614. 614. 593. 636.
#> 3 Area1 2008 3305 548292 645. 645. 622. 669.
#> 4 Area1 2009 4323 556472 822. 822. 797. 849.
#> 5 Area1 2010 3997 556273 712. 712. 689. 735.
#> 6 Area2 2006 4309 583335 752. 752. 728. 777.
#> 7 Area2 2007 3770 545306 682. 682. 659. 705.
#> 8 Area2 2008 4131 558671 742. 742. 718. 767.
#> 9 Area2 2009 4093 576157 707. 707. 684. 731.
#> 10 Area2 2010 4049 583252 698. 698. 675. 721.
#> 11 Area3 2006 3519 513099 695. 695. 671. 720.
#> 12 Area3 2007 4183 585328 742. 742. 718. 765.
#> 13 Area3 2008 3523 561450 637. 637. 615. 660.
#> 14 Area3 2009 3979 572040 694. 694. 671. 717.
#> 15 Area3 2010 4219 573184 750. 750. 726. 774.
#> 16 Area4 2006 3205 557733 613. 613. 591. 635.
#> 17 Area4 2007 3631 605148 602. 602. 581. 623.
#> 18 Area4 2008 3908 543962 780. 780. 754. 806.
#> 19 Area4 2009 3750 581381 670. 670. 647. 692.
#> 20 Area4 2010 3506 581500 617. 617. 595. 639.
#> # ... with 3 more variables: confidence <chr>, statistic <chr>, method <chr>
INPUT: Unlike the phe_dsr function, there is no default standard or reference data for the phe_smr and phe_isr functions. These functions take a single data frame as input, with columns representing the numerators and denominators for each standardisation category, plus reference numerators and denominators for each standardisation category.
The reference data can either be provided in a separate data frame/vectors or as columns within the input data frame:
reference data provided as a data frame or as vectors - the data frame/vectors and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order.
reference data provided as columns within the input data frame - the reference numerators and denominators can be appended to the input data frame prior to execution of the function - if the data is grouped to generate multiple smrs/isrs then the reference data will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the functions output one row per grouping set containing the grouping variable values, the observed and expected counts, the reference rate (isr only), the smr or isr, the lower 95% confidence limit, and the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: If reference data are being provided as columns within the input data frame then the user must specify this as the function expects vectors by default. The function also accepts additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
The following code chunk creates a data frame containing the reference data - this example uses the all area data for persons in the baseline year:
df_ref <- df_std %>%
filter(year == 2006) %>%
group_by(ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop))
head(df_ref)
#> # A tibble: 6 x 3
#> ageband obs pop
#> <dbl> <int> <int>
#> 1 0 616 128749
#> 2 5 809 113432
#> 3 10 795 120597
#> 4 15 921 105706
#> 5 20 751 141603
#> 6 25 896 120961
Here are some example code chunks to demonstrate the phe_smr function and the arguments that can optionally be specified
# calculate separate smrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex observed expected value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1902 1904. 0.999 0.955 1.05 95%
#> 2 Area1 2006 Male 2238 1952. 1.15 1.10 1.19 95%
#> 3 Area1 2007 Fema~ 1939 1965. 0.987 0.943 1.03 95%
#> 4 Area1 2007 Male 1427 1969. 0.725 0.687 0.763 95%
#> 5 Area1 2008 Fema~ 1901 1849. 1.03 0.982 1.08 95%
#> 6 Area1 2008 Male 1404 1882. 0.746 0.708 0.786 95%
#> 7 Area1 2009 Fema~ 1813 2010. 0.902 0.861 0.945 95%
#> 8 Area1 2009 Male 2510 1810. 1.39 1.33 1.44 95%
#> 9 Area1 2010 Fema~ 1724 1886. 0.914 0.872 0.958 95%
#> 10 Area1 2010 Male 2273 1935. 1.17 1.13 1.22 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>, method <chr>
# calculate the same smrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex observed expected value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1902 1904. 0.999 0.955 1.05 95%
#> 2 Area1 2006 Male 2238 1952. 1.15 1.10 1.19 95%
#> 3 Area1 2007 Fema~ 1939 1965. 0.987 0.943 1.03 95%
#> 4 Area1 2007 Male 1427 1969. 0.725 0.687 0.763 95%
#> 5 Area1 2008 Fema~ 1901 1849. 1.03 0.982 1.08 95%
#> 6 Area1 2008 Male 1404 1882. 0.746 0.708 0.786 95%
#> 7 Area1 2009 Fema~ 1813 2010. 0.902 0.861 0.945 95%
#> 8 Area1 2009 Male 2510 1810. 1.39 1.33 1.44 95%
#> 9 Area1 2010 Fema~ 1724 1886. 0.914 0.872 0.958 95%
#> 10 Area1 2010 Male 2273 1935. 1.17 1.13 1.22 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>, method <chr>
# calculate separate smrs for each year and drop metadata columns from output
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 6
#> year observed expected value lowercl uppercl
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 15173 15173 1 0.984 1.02
#> 2 2007 14950 15864. 0.942 0.927 0.958
#> 3 2008 14867 15145. 0.982 0.966 0.998
#> 4 2009 16145 15687. 1.03 1.01 1.05
#> 5 2010 15771 15758. 1.00 0.985 1.02
The phe_isr function works exactly the same way but instead of expressing the result as a ratio of the observed and expected rates the result is expressed as a rate and the reference rate is also provided. Here are some examples:
# calculate separate isrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex observed expected ref_rate value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1902 1904. 684. 683. 653. 715. 95%
#> 2 Area1 2006 Male 2238 1952. 684. 784. 752. 817. 95%
#> 3 Area1 2007 Fema~ 1939 1965. 684. 675. 645. 705. 95%
#> 4 Area1 2007 Male 1427 1969. 684. 495. 470. 522. 95%
#> 5 Area1 2008 Fema~ 1901 1849. 684. 703. 672. 735. 95%
#> 6 Area1 2008 Male 1404 1882. 684. 510. 484. 538. 95%
#> 7 Area1 2009 Fema~ 1813 2010. 684. 617. 589. 646. 95%
#> 8 Area1 2009 Male 2510 1810. 684. 948. 911. 986. 95%
#> 9 Area1 2010 Fema~ 1724 1886. 684. 625. 596. 655. 95%
#> 10 Area1 2010 Male 2273 1935. 684. 803. 771. 837. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>, method <chr>
# calculate the same isrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex observed expected ref_rate value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1902 1904. 684. 683. 653. 715. 95%
#> 2 Area1 2006 Male 2238 1952. 684. 784. 752. 817. 95%
#> 3 Area1 2007 Fema~ 1939 1965. 684. 675. 645. 705. 95%
#> 4 Area1 2007 Male 1427 1969. 684. 495. 470. 522. 95%
#> 5 Area1 2008 Fema~ 1901 1849. 684. 703. 672. 735. 95%
#> 6 Area1 2008 Male 1404 1882. 684. 510. 484. 538. 95%
#> 7 Area1 2009 Fema~ 1813 2010. 684. 617. 589. 646. 95%
#> 8 Area1 2009 Male 2510 1810. 684. 948. 911. 986. 95%
#> 9 Area1 2010 Fema~ 1724 1886. 684. 625. 596. 655. 95%
#> 10 Area1 2010 Male 2273 1935. 684. 803. 771. 837. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>, method <chr>
# calculate separate isrs for each year and drop metadata columns from output
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 7
#> year observed expected ref_rate value lowercl uppercl
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 15173 15173 684. 684. 673. 695.
#> 2 2007 14950 15864. 684. 644. 634. 655.
#> 3 2008 14867 15145. 684. 671. 660. 682.
#> 4 2009 16145 15687. 684. 704. 693. 715.
#> 5 2010 15771 15758. 684. 684. 674. 695.