Introduction to PHEindicatormethods

Georgina Anderson

2018-07-24

Introduction

This vignette introduces each function in the PHEindicatormethods package and provides basic sample code to demonstrate its use. The sample code is based on the ‘examples’ section of each function's documentation. The vignette does not explain the statistical methods in detail. The name of the method applied can optionally be output alongside the statistics, and the references section of each function's documentation gives a fuller explanation.

The following packages must be installed and loaded before running the examples: PHEindicatormethods and dplyr.
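A minimal sketch of the loading step. It assumes the examples need only PHEindicatormethods itself and dplyr (for %>%, group_by, mutate, filter and summarise, all of which are called later). The check-then-load pattern is one option among several, and the names required_pkgs and is_installed are illustrative.

```r
# Packages the examples appear to rely on (an assumption based on the functions called)
required_pkgs <- c("PHEindicatormethods", "dplyr")

# TRUE for each package that is already installed
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

if (all(is_installed)) {
  library(PHEindicatormethods)
  library(dplyr)
} else {
  message("Missing packages: ",
          paste(required_pkgs[!is_installed], collapse = ", "),
          ". Install them with install.packages() and re-run.")
}
```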

Package functions

This vignette covers the following functions available within the first release of the package (v1.0.0). If further functions are added to the package in future releases these will be explained elsewhere.

Function        Type                      Description
phe_proportion  Non-aggregate             Performs a calculation on each row of data
phe_rate        Non-aggregate             Performs a calculation on each row of data
phe_mean        Aggregate                 Performs a calculation on each grouping set
phe_dsr         Aggregate, standardised   Performs a calculation on each grouping set and requires additional reference inputs
phe_smr         Aggregate, standardised   Performs a calculation on each grouping set and requires additional reference inputs
phe_isr         Aggregate, standardised   Performs a calculation on each grouping set and requires additional reference inputs

Non-aggregate functions

Create some test data for the non-aggregate functions

The following code chunk creates a data frame containing observed number of events and populations for 4 geographical areas over 2 time periods that is used later to demonstrate the PHEindicatormethods package functions:
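A sketch of such a chunk, using base R only. The obs and pop values are hard-coded to those printed in the outputs further down so that results can be compared line by line. That is an assumption made for reproducibility, and the values could equally be drawn at random.

```r
# Test data for the non-aggregate functions: 4 areas x 2 years.
# obs and pop are copied from the printed outputs (assumption: any
# values with obs <= pop would serve equally well).
df <- data.frame(
  area = rep(c("Area1", "Area2", "Area3", "Area4"), times = 2),
  year = rep(2015:2016, each = 4),
  obs  = c(92, 66, 12, 51, 27, 68, 98, 66),
  pop  = c(162, 123, 124, 161, 155, 147, 131, 139)
)
```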

Execute phe_proportion and phe_rate

INPUT: The phe_proportion and phe_rate functions take a single data frame as input with columns representing the numerators and denominators for the statistic. Any other columns present will be retained in the output.

OUTPUT: The functions output the original data frame with additional columns appended. By default the additional columns are the proportion or rate, the lower 95% confidence limit, and the upper 95% confidence limit.

OPTIONS: The functions also accept additional arguments to specify the level of confidence, the multiplier and the level of detail to be output.

Here are some example code chunks to demonstrate these two functions and the arguments that can optionally be specified:


# default proportion
phe_proportion(df, obs, pop)
#>    area year obs pop      value    lowercl   uppercl
#> 1 Area1 2015  92 162 0.56790123 0.49091928 0.6417375
#> 2 Area2 2015  66 123 0.53658537 0.44868986 0.6222649
#> 3 Area3 2015  12 124 0.09677419 0.05622821 0.1615529
#> 4 Area4 2015  51 161 0.31677019 0.24989359 0.3921867
#> 5 Area1 2016  27 155 0.17419355 0.12256671 0.2415791
#> 6 Area2 2016  68 147 0.46258503 0.38396418 0.5431116
#> 7 Area3 2016  98 131 0.74809160 0.66741220 0.8146354
#> 8 Area4 2016  66 139 0.47482014 0.39360287 0.5573917

# specify confidence level for proportion
phe_proportion(df, obs, pop, confidence=99.8)
#>    area year obs pop      value    lowercl   uppercl
#> 1 Area1 2015  92 162 0.56790123 0.44718460 0.6810582
#> 2 Area2 2015  66 123 0.53658537 0.40007734 0.6678218
#> 3 Area3 2015  12 124 0.09677419 0.04145507 0.2097591
#> 4 Area4 2015  51 161 0.31677019 0.21646938 0.4375901
#> 5 Area1 2016  27 155 0.17419355 0.09979681 0.2864063
#> 6 Area2 2016  68 147 0.46258503 0.34170145 0.5880332
#> 7 Area3 2016  98 131 0.74809160 0.61683121 0.8456392
#> 8 Area4 2016  66 139 0.47482014 0.34981672 0.6030610

# specify to output proportions as percentages
phe_proportion(df, obs, pop, percentage=TRUE)
#>    area year obs pop     value   lowercl  uppercl
#> 1 Area1 2015  92 162 56.790123 49.091928 64.17375
#> 2 Area2 2015  66 123 53.658537 44.868986 62.22649
#> 3 Area3 2015  12 124  9.677419  5.622821 16.15529
#> 4 Area4 2015  51 161 31.677019 24.989359 39.21867
#> 5 Area1 2016  27 155 17.419355 12.256671 24.15791
#> 6 Area2 2016  68 147 46.258503 38.396418 54.31116
#> 7 Area3 2016  98 131 74.809160 66.741220 81.46354
#> 8 Area4 2016  66 139 47.482014 39.360287 55.73917

# specify level of detail to output for proportion
phe_proportion(df, obs, pop, confidence=99.8, percentage=TRUE, type="full")
#>    area year obs pop     value   lowercl  uppercl confidence  statistic
#> 1 Area1 2015  92 162 56.790123 44.718460 68.10582      99.8% percentage
#> 2 Area2 2015  66 123 53.658537 40.007734 66.78218      99.8% percentage
#> 3 Area3 2015  12 124  9.677419  4.145507 20.97591      99.8% percentage
#> 4 Area4 2015  51 161 31.677019 21.646938 43.75901      99.8% percentage
#> 5 Area1 2016  27 155 17.419355  9.979681 28.64063      99.8% percentage
#> 6 Area2 2016  68 147 46.258503 34.170145 58.80332      99.8% percentage
#> 7 Area3 2016  98 131 74.809160 61.683121 84.56392      99.8% percentage
#> 8 Area4 2016  66 139 47.482014 34.981672 60.30610      99.8% percentage
#>   method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson

# default rate
phe_rate(df, obs, pop)
#>    area year obs pop     value   lowercl  uppercl
#> 1 Area1 2015  92 162 56790.123 45779.632 69648.77
#> 2 Area2 2015  66 123 53658.537 41497.543 68267.87
#> 3 Area3 2015  12 124  9677.419  4994.766 16905.53
#> 4 Area4 2015  51 161 31677.019 23583.864 41650.29
#> 5 Area1 2016  27 155 17419.355 11476.751 25345.22
#> 6 Area2 2016  68 147 46258.503 35919.923 58644.55
#> 7 Area3 2016  98 131 74809.160 60732.236 91169.28
#> 8 Area4 2016  66 139 47482.014 36720.847 60409.70

# specify rate parameters
phe_rate(df, obs, pop, type="full", confidence=99.8, multiplier=100)
#>    area year obs pop     value  lowercl   uppercl confidence    statistic
#> 1 Area1 2015  92 162 56.790123 40.22441  77.58625      99.8% rate per 100
#> 2 Area2 2015  66 123 53.658537 35.52017  77.38981      99.8% rate per 100
#> 3 Area3 2015  12 124  9.677419  3.22611  21.83948      99.8% rate per 100
#> 4 Area4 2015  51 161 31.677019 19.70060  47.94048      99.8% rate per 100
#> 5 Area1 2016  27 155 17.419355  8.84022  30.49519      99.8% rate per 100
#> 6 Area2 2016  68 147 46.258503 30.82527  66.36981      99.8% rate per 100
#> 7 Area3 2016  98 131 74.809160 53.59832 101.24804      99.8% rate per 100
#> 8 Area4 2016  66 139 47.482014 31.43151  68.48163      99.8% rate per 100
#>   method
#> 1  Byars
#> 2  Byars
#> 3  Byars
#> 4  Byars
#> 5  Byars
#> 6  Byars
#> 7  Byars
#> 8  Byars
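
The method column in the full outputs above names Wilson for proportions and Byars for rates. As a rough cross-check, and not the package's own implementation, the base-R sketch below computes the textbook Wilson score interval and Byar's approximation to the Poisson interval. The function names z_value, wilson_ci and byars_ci are hypothetical, and confidence is passed as a fraction here.

```r
# Two-sided standard normal quantile for a confidence level given as a fraction
z_value <- function(confidence = 0.95) qnorm(1 - (1 - confidence) / 2)

# Wilson score interval for a proportion x / n
wilson_ci <- function(x, n, confidence = 0.95) {
  z <- z_value(confidence)
  p <- x / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = centre - half, upper = centre + half)
}

# Byar's approximation for a count x, scaled to a rate per multiplier
byars_ci <- function(x, n, confidence = 0.95, multiplier = 100000) {
  z <- z_value(confidence)
  lower <- x * (1 - 1 / (9 * x) - z / (3 * sqrt(x)))^3
  upper <- (x + 1) * (1 - 1 / (9 * (x + 1)) + z / (3 * sqrt(x + 1)))^3
  c(lower = lower, upper = upper) / n * multiplier
}

# Area1, 2015: obs = 92, pop = 162
wilson_ci(92, 162)
byars_ci(92, 162)
```

The two results should agree closely with the lowercl and uppercl values in the first row of the default phe_proportion and phe_rate outputs above.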



Aggregate functions

The remaining functions aggregate the rows in the input data frame to produce a single statistic. It is also possible to calculate multiple statistics in a single execution of these functions if the input data frame is grouped - for example by indicator ID, geographic area or time period (or all three). The output contains only the grouping variables and the values calculated by the function - any additional unused columns provided in the input data frame will not be retained in the output.

The df test data generated earlier can be used to demonstrate phe_mean:

Execute phe_mean

INPUT: The phe_mean function takes a single data frame as input with a column representing the numbers to be averaged.

OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values (if applicable), the mean, the lower 95% confidence limit, and the upper 95% confidence limit.

OPTIONS: The function also accepts additional arguments to specify the level of confidence and the level of detail to be output.

Here are some example code chunks to demonstrate the phe_mean function and the arguments that can optionally be specified
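
A sketch of what such chunks might look like. It assumes phe_mean takes the data frame followed by the unquoted column to average, in the same way phe_proportion and phe_rate take obs and pop above, and that the confidence and type arguments behave as they do for those functions. The df data frame is re-created from the values printed earlier so the chunk runs on its own.

```r
library(PHEindicatormethods)
library(dplyr)

# df re-created here so the chunk is self-contained
# (obs and pop values as printed in the outputs above)
df <- data.frame(
  area = rep(c("Area1", "Area2", "Area3", "Area4"), times = 2),
  year = rep(2015:2016, each = 4),
  obs  = c(92, 66, 12, 51, 27, 68, 98, 66),
  pop  = c(162, 123, 124, 161, 155, 147, 131, 139)
)

# default mean of obs across all rows
phe_mean(df, obs)

# one mean per year, at 99.8% confidence, with full detail
# (argument usage assumed to mirror phe_proportion)
df %>%
    group_by(year) %>%
    phe_mean(obs, confidence = 99.8, type = "full")
```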

Standardised aggregate functions

Create some test data for the standardised aggregate functions

The following code chunk creates a data frame containing observed number of events and populations by age band for 4 areas, 5 time periods and 2 sexes:
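
A sketch of such a chunk in base R. Several details are assumptions inferred from the examples further down: 19 five-year age bands labelled by their lower bound (0, 5, ..., 90), chosen to line up with the 19 entries of esp2013, plus arbitrary ranges for the random counts and an arbitrary seed. Results computed from this data will not match the printed outputs.

```r
set.seed(2018)  # arbitrary seed so the sketch is repeatable

agebands <- seq(0, 90, by = 5)  # 19 lower bounds: 0, 5, ..., 90 (assumed labelling)

# One row per area x year x sex x age band. expand.grid varies its first
# argument fastest, so age bands run youngest to oldest within each
# area/year/sex group.
df_std <- expand.grid(
  ageband = agebands,
  sex     = c("Female", "Male"),
  year    = 2006:2010,
  area    = c("Area1", "Area2", "Area3", "Area4")
)[, c("area", "year", "sex", "ageband")]

# Random event counts and populations (ranges are illustrative assumptions)
df_std$obs <- sample(200, nrow(df_std), replace = TRUE)
df_std$pop <- sample(10000:20000, nrow(df_std), replace = TRUE)
```

Keeping the age bands in ascending order within each group matters for the standardised functions, which match standard or reference populations to rows by position.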

Execute phe_dsr

INPUT: The minimum input requirement for the phe_dsr function is a single data frame with columns representing the numerators and denominators for each standardisation category. This is sufficient if the data uses the same standardisation categories as the default standard population and is sorted in the same order.

The 2013 European Standard Population is provided within the package in vector form (esp2013) and is used by default by this function. Alternative standard populations can be used but must be provided by the user. Because the function joins a standard population vector to the input data frame by position, the user is responsible for sorting the data so that its rows line up with the vector.

The function can also accept standard populations provided as a column within the input data frame.
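
To illustrate why the sort order matters when a standard population is matched by position, here is a base-R sketch of a directly standardised rate for a single group. It is not the package's implementation. The helper dsr and the one_group data are hypothetical, and the esp vector is typed out on the assumption that it matches the 19 values in esp2013.

```r
# 2013 European Standard Population for 19 five-year age bands (0-4, ..., 90+),
# typed out so the sketch runs without the package (assumed equal to esp2013)
esp <- c(5000, 5500, 5500, 5500, 6000, 6000, 6500, 7000, 7000, 7000,
         7000, 6500, 6000, 5500, 5000, 4000, 2500, 1500, 1000)

# Directly standardised rate: weighted average of age-specific rates,
# with weights matched to rows purely by position
dsr <- function(obs, pop, stdpop, multiplier = 100000) {
  stopifnot(length(obs) == length(stdpop), length(pop) == length(stdpop))
  sum(stdpop * obs / pop) / sum(stdpop) * multiplier
}

# Hypothetical single group: counts rise with age, rows sorted youngest to oldest
one_group <- data.frame(
  ageband = seq(0, 90, by = 5),
  obs     = seq(10, 190, by = 10),
  pop     = rep(15000, 19)
)

dsr_sorted <- dsr(one_group$obs, one_group$pop, esp)

# Same rows, oldest first: the weights now attach to the wrong age bands
shuffled     <- one_group[order(-one_group$ageband), ]
dsr_shuffled <- dsr(shuffled$obs, shuffled$pop, esp)

c(sorted = dsr_sorted, shuffled = dsr_shuffled)
```

The two values differ even though the rows are identical, which is why the input must be sorted to match the standard population before it is passed in.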

OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values, the DSR, the lower 95% confidence limit, and the upper 95% confidence limit.

OPTIONS: If standard populations are being provided as a column within the input data frame then the user must specify this using the stdpoptype argument as the function expects a vector by default. The function also accepts additional arguments to specify the standard populations, the level of confidence, the multiplier and the level of detail to be output.

Here are some example code chunks to demonstrate the phe_dsr function and the arguments that can optionally be specified:


# calculate separate dsrs for each area, year and sex
df_std %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs, pop)
#> # A tibble: 40 x 6
#> # Groups:   area, year [20]
#>    area   year sex    value lowercl uppercl
#>    <fct> <int> <fct>  <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Female  771.    736.    806.
#>  2 Area1  2006 Male    658.    626.    690.
#>  3 Area1  2007 Female  685.    652.    720.
#>  4 Area1  2007 Male    830.    792.    870.
#>  5 Area1  2008 Female  704.    668.    741.
#>  6 Area1  2008 Male    703.    670.    738.
#>  7 Area1  2009 Female  585.    556.    616.
#>  8 Area1  2009 Male    608.    577.    639.
#>  9 Area1  2010 Female  912.    874.    952.
#> 10 Area1  2010 Male    731.    697.    766.
#> # ... with 30 more rows

# calculate same specifying standard population in vector form
df_std %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs, pop, stdpop = esp2013)
#> # A tibble: 40 x 6
#> # Groups:   area, year [20]
#>    area   year sex    value lowercl uppercl
#>    <fct> <int> <fct>  <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Female  771.    736.    806.
#>  2 Area1  2006 Male    658.    626.    690.
#>  3 Area1  2007 Female  685.    652.    720.
#>  4 Area1  2007 Male    830.    792.    870.
#>  5 Area1  2008 Female  704.    668.    741.
#>  6 Area1  2008 Male    703.    670.    738.
#>  7 Area1  2009 Female  585.    556.    616.
#>  8 Area1  2009 Male    608.    577.    639.
#>  9 Area1  2010 Female  912.    874.    952.
#> 10 Area1  2010 Male    731.    697.    766.
#> # ... with 30 more rows

# calculate the same dsrs by appending the standard populations to the data frame
df_std %>%
    mutate(refpop = rep(esp2013,40)) %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs,pop, stdpop=refpop, stdpoptype="field")
#> # A tibble: 40 x 6
#> # Groups:   area, year [20]
#>    area   year sex    value lowercl uppercl
#>    <fct> <int> <fct>  <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Female  771.    736.    806.
#>  2 Area1  2006 Male    658.    626.    690.
#>  3 Area1  2007 Female  685.    652.    720.
#>  4 Area1  2007 Male    830.    792.    870.
#>  5 Area1  2008 Female  704.    668.    741.
#>  6 Area1  2008 Male    703.    670.    738.
#>  7 Area1  2009 Female  585.    556.    616.
#>  8 Area1  2009 Male    608.    577.    639.
#>  9 Area1  2010 Female  912.    874.    952.
#> 10 Area1  2010 Male    731.    697.    766.
#> # ... with 30 more rows

# calculate for under 75s by filtering out records for 75+ from input data frame and standard population
check <- df_std %>%
    filter(ageband <= 70) %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs, pop, stdpop = esp2013[1:15])
    
# calculate separate dsrs for persons for each area and year
df_std %>%
    group_by(area, year, ageband) %>%
    summarise(obs = sum(obs),
              pop = sum(pop)) %>%
    group_by(area, year) %>%
    phe_dsr(obs,pop, type="full")
#> # A tibble: 20 x 10
#> # Groups:   area [4]
#>    area   year total_count total_pop value lowercl uppercl confidence
#>    <fct> <int>       <int>     <int> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006        3937    578616  697.    674.    720. 95%       
#>  2 Area1  2007        3876    551569  736.    711.    761. 95%       
#>  3 Area1  2008        3603    572850  692.    668.    716. 95%       
#>  4 Area1  2009        3611    596369  589.    568.    610. 95%       
#>  5 Area1  2010        4454    572323  781.    757.    806. 95%       
#>  6 Area2  2006        3614    569492  648.    626.    671. 95%       
#>  7 Area2  2007        4079    580956  748.    724.    772. 95%       
#>  8 Area2  2008        3796    548106  714.    690.    739. 95%       
#>  9 Area2  2009        3794    569715  696.    673.    720. 95%       
#> 10 Area2  2010        3145    581697  540.    521.    561. 95%       
#> 11 Area3  2006        3417    574017  599.    578.    621. 95%       
#> 12 Area3  2007        3450    568075  643.    620.    665. 95%       
#> 13 Area3  2008        4184    567538  737.    714.    761. 95%       
#> 14 Area3  2009        4435    550263  891.    863.    919. 95%       
#> 15 Area3  2010        4216    582870  756.    732.    781. 95%       
#> 16 Area4  2006        4132    577898  737.    714.    761. 95%       
#> 17 Area4  2007        4092    583720  706.    684.    730. 95%       
#> 18 Area4  2008        3894    562343  688.    665.    711. 95%       
#> 19 Area4  2009        2858    583706  477.    458.    497. 95%       
#> 20 Area4  2010        4654    570502  826.    801.    851. 95%       
#> # ... with 2 more variables: statistic <chr>, method <chr>

Execute phe_smr and phe_isr

INPUT: Unlike the phe_dsr function, there is no default standard or reference data for the phe_smr and phe_isr functions. These functions take a single data frame as input, with columns representing the numerators and denominators for each standardisation category, PLUS reference numerators and denominators for each standardisation category.

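The reference data can be provided either as vectors (for example, columns of a separate data frame) or as columns within the input data frame. The examples below show both forms. When the reference data are appended as columns, they are repeated for every grouping set.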

OUTPUT: By default, the functions output one row per grouping set containing the grouping variable values, the SMR or ISR, the lower 95% confidence limit, and the upper 95% confidence limit.

OPTIONS: If reference data are provided as columns within the input data frame then the user must specify this using the refpoptype argument, as the functions expect vectors by default. The functions also accept additional arguments to specify the level of confidence, the multiplier and the level of detail to be output.

The following code chunk creates a data frame containing the reference data. This example uses the data for all areas combined, for persons, in the baseline year:
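
A sketch of this step, assuming the baseline year is 2006 (the year for which the yearly SMR output below equals 1). It uses base R aggregate rather than dplyr so that it runs without extra packages, and the helper name make_reference is hypothetical.

```r
# Hypothetical helper: collapse a df_std-style data frame to one reference row
# per age band, summing over all areas and both sexes for the baseline year
make_reference <- function(data, baseline_year) {
  baseline <- data[data$year == baseline_year, ]
  # aggregate() returns rows sorted by ageband, which suits positional matching
  aggregate(cbind(obs, pop) ~ ageband, data = baseline, FUN = sum)
}

# With the df_std test data in the workspace, build the reference data frame
if (exists("df_std")) {
  df_ref <- make_reference(df_std, 2006)
}
```

The result has one row per age band with columns ageband, obs and pop, so df_ref$obs and df_ref$pop can be passed as the reference vectors.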

Here are some example code chunks to demonstrate the phe_smr function and the arguments that can optionally be specified:


# calculate separate smrs for each area, year and sex
df_std %>%
    group_by(area, year, sex) %>%
    phe_smr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 6
#> # Groups:   area, year [20]
#>    area   year sex    value lowercl uppercl
#>    <fct> <int> <fct>  <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Female 1.11    1.07    1.16 
#>  2 Area1  2006 Male   0.960   0.916   1.01 
#>  3 Area1  2007 Female 1.06    1.01    1.11 
#>  4 Area1  2007 Male   1.12    1.07    1.17 
#>  5 Area1  2008 Female 0.946   0.902   0.992
#>  6 Area1  2008 Male   0.941   0.898   0.984
#>  7 Area1  2009 Female 0.916   0.874   0.959
#>  8 Area1  2009 Male   0.958   0.914   1.00 
#>  9 Area1  2010 Female 1.33    1.28    1.39 
#> 10 Area1  2010 Male   1.06    1.01    1.10 
#> # ... with 30 more rows

# calculate the same smrs by appending the reference data to the data frame
df_std %>%
    mutate(refobs = rep(df_ref$obs,40),
           refpop = rep(df_ref$pop,40)) %>%
    group_by(area, year, sex) %>%
    phe_smr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 6
#> # Groups:   area, year [20]
#>    area   year sex    value lowercl uppercl
#>    <fct> <int> <fct>  <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Female 1.11    1.07    1.16 
#>  2 Area1  2006 Male   0.960   0.916   1.01 
#>  3 Area1  2007 Female 1.06    1.01    1.11 
#>  4 Area1  2007 Male   1.12    1.07    1.17 
#>  5 Area1  2008 Female 0.946   0.902   0.992
#>  6 Area1  2008 Male   0.941   0.898   0.984
#>  7 Area1  2009 Female 0.916   0.874   0.959
#>  8 Area1  2009 Male   0.958   0.914   1.00 
#>  9 Area1  2010 Female 1.33    1.28    1.39 
#> 10 Area1  2010 Male   1.06    1.01    1.10 
#> # ... with 30 more rows

# calculate separate smrs for each year
df_std %>%
    group_by(year, ageband) %>%
    summarise(obs = sum(obs),
              pop = sum(pop)) %>%
    group_by(year) %>%
    phe_smr(obs, pop, df_ref$obs, df_ref$pop, type="full")
#> # A tibble: 5 x 9
#>    year observed expected value lowercl uppercl confidence statistic
#>   <int>    <int>    <dbl> <dbl>   <dbl>   <dbl> <chr>      <chr>    
#> 1  2006    15100   15100  1       0.984   1.02  95%        smr x 1  
#> 2  2007    15497   14935. 1.04    1.02    1.05  95%        smr x 1  
#> 3  2008    15477   14817. 1.04    1.03    1.06  95%        smr x 1  
#> 4  2009    14698   14970. 0.982   0.966   0.998 95%        smr x 1  
#> 5  2010    16469   15115. 1.09    1.07    1.11  95%        smr x 1  
#> # ... with 1 more variable: method <chr>

The phe_isr function works in the same way, but instead of expressing the result as a ratio of observed to expected events it expresses the result as a rate. When type="full" is specified, the reference rate is also output. Here are some examples:


# calculate separate isrs for each area, year and sex
df_std %>%
    group_by(area, year, sex) %>%
    phe_isr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 6
#> # Groups:   area, year [20]
#>    area   year sex    value lowercl uppercl
#>    <fct> <int> <fct>  <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Female  732.    701.    763.
#>  2 Area1  2006 Male    631.    602.    661.
#>  3 Area1  2007 Female  697.    666.    728.
#>  4 Area1  2007 Male    734.    701.    767.
#>  5 Area1  2008 Female  621.    592.    651.
#>  6 Area1  2008 Male    618.    590.    646.
#>  7 Area1  2009 Female  601.    574.    629.
#>  8 Area1  2009 Male    629.    600.    658.
#>  9 Area1  2010 Female  874.    840.    910.
#> 10 Area1  2010 Male    695.    665.    725.
#> # ... with 30 more rows

# calculate the same isrs by appending the reference data to the data frame
df_std %>%
    mutate(refobs = rep(df_ref$obs,40),
           refpop = rep(df_ref$pop,40)) %>%
    group_by(area, year, sex) %>%
    phe_isr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 6
#> # Groups:   area, year [20]
#>    area   year sex    value lowercl uppercl
#>    <fct> <int> <fct>  <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Female  732.    701.    763.
#>  2 Area1  2006 Male    631.    602.    661.
#>  3 Area1  2007 Female  697.    666.    728.
#>  4 Area1  2007 Male    734.    701.    767.
#>  5 Area1  2008 Female  621.    592.    651.
#>  6 Area1  2008 Male    618.    590.    646.
#>  7 Area1  2009 Female  601.    574.    629.
#>  8 Area1  2009 Male    629.    600.    658.
#>  9 Area1  2010 Female  874.    840.    910.
#> 10 Area1  2010 Male    695.    665.    725.
#> # ... with 30 more rows

# calculate separate isrs for each year
df_std %>%
    group_by(year, ageband) %>%
    summarise(obs = sum(obs),
              pop = sum(pop)) %>%
    group_by(year) %>%
    phe_isr(obs, pop, df_ref$obs, df_ref$pop, type="full")
#> # A tibble: 5 x 10
#>    year observed expected ref_rate value lowercl uppercl confidence
#>   <int>    <int>    <dbl>    <dbl> <dbl>   <dbl>   <dbl> <chr>     
#> 1  2006    15100   15100      657.  657.    646.    667. 95%       
#> 2  2007    15497   14935.     657.  681.    671.    692. 95%       
#> 3  2008    15477   14817.     657.  686.    675.    697. 95%       
#> 4  2009    14698   14970.     657.  645.    634.    655. 95%       
#> 5  2010    16469   15115.     657.  715.    704.    726. 95%       
#> # ... with 2 more variables: statistic <chr>, method <chr>
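
The relationship between the two outputs can be sketched in base R. This is a simplified illustration of indirect standardisation rather than the package's code. The helper indirect_std and its argument names are hypothetical, and the cross-check uses the rounded figures printed in the outputs above.

```r
# Indirect standardisation for one grouping set, reference data matched by position
indirect_std <- function(obs, pop, ref_obs, ref_pop, multiplier = 100000) {
  observed <- sum(obs)
  expected <- sum(pop * ref_obs / ref_pop)  # events expected at reference age-specific rates
  ref_rate <- sum(ref_obs) / sum(ref_pop) * multiplier
  smr <- observed / expected
  c(observed = observed, expected = expected, ref_rate = ref_rate,
    smr = smr, isr = smr * ref_rate)
}

# Cross-check with the printed yearly outputs: isr should equal smr x ref_rate
observed <- c(15100, 15497, 15477, 14698, 16469)
expected <- c(15100, 14935, 14817, 14970, 15115)  # as printed, so already rounded
# 2006 reference rate: total observed over the summed 2006 total_pop of the four areas
ref_rate <- 15100 / (578616 + 569492 + 574017 + 577898) * 100000
isr_check <- round(observed / expected * ref_rate)
isr_check
```

Up to rounding, isr_check should reproduce the value column of the yearly phe_isr output.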