Implementing case definitions for epidemiological analysis in R

10 November 2019

Introduction

Deduplicating events is a major part of establishing a case definition in epidemiological analyses. In such analyses, you select a reference event ("Case (C)") which is taken as the start of an episode. Subsequent events within a specified period after this reference event are then considered duplicates "(D)".

fixed_episodes(), rolling_episodes() and episode_group() aim to provide a simple but flexible way of grouping records from multiple data sources into episodes. This then allows for the deduplication of the dataset, or a sub-analysis within each episode.

Uses

The flexible application of record_group() and fixed_episodes(), rolling_episodes() or episode_group() allows you to apply different case definitions to a dataset. Some examples are listed below:

  1. Create episodes of events that occur at a point in time e.g. medical diagnoses or traffic incidents
  2. Create episodes of overlapping periods of events e.g. hospital stay. See interval grouping
  3. Create episodes that reoccur within a defined period (recurrence_length) from the last event
  4. Specify which event (or period of event) is taken as the "Case (C)". See case assignment
  5. Create episodes with a different case_length and/or recurrence_length for different subsets of a dataset. See stratified grouping

Implementation

Overview

An episode as produced by fixed_episodes(), rolling_episodes() or episode_group() is a set of events (or periods of events) within a specific period of time. Each begins with a reference event - the "Case (C)", and may contain a "Recurrent (R)" event. The remaining events are considered duplicates "(D)".

Two types of episodes can be produced: "fixed" and "rolling". They are created with fixed_episodes() and rolling_episodes() respectively, while episode_group() can create both.

Episode windows and recurrence periods

An episode window (epid_interval) is the period between the "Case" and the last event in the episode (or the end point of the last period). A "fixed" episode is never longer than its "case_length".

A recurrence period is a specified period (recurrence_length) after the last event in an episode. The first new record within this recurrence_length is taken as a "Recurrent (R)" record. Other records within that period are taken as duplicates "(D)" of that "Recurrent (R)" record. This means that a rolling episode will continue to expand as long as there is a record within the recurrence period of the last event in the episode. The last event could be a "Case (C)", "Duplicate (D)" or "Recurrent (R)" record. As a result, a rolling episode can be longer than its case_length.

The examples below demonstrate the difference.

NOTE: to_s4 and to_s4() change the output of these functions from a data.frame (current default) to epid objects. epid objects will be the default output in the next release.

library(dplyr); library(lubridate); library(diyar)

data("infections_2")
dates <- infections_2$date

# Fixed episodes
f <- fixed_episodes(date = dates, case_length=14, group_stats = TRUE, to_s4 =TRUE)
#> Episode or recurrence window 1.
#> 3 of 6 record(s) grouped into episodes. 3 records not yet grouped.
#> Episode or recurrence window 2.
#> 2 of 3 record(s) grouped into episodes. 1 records not yet grouped.
#> Episode or recurrence window 3.
#> 1 of 1 record(s) grouped into episodes. 0 records not yet grouped.
#> 
#> Episode grouping complete - 1 record(s) assinged a unique ID.

# Rolling episodes
r <- rolling_episodes(date = dates, case_length=14, group_stats = TRUE, display = FALSE, to_s4 =TRUE)
#> Episode grouping complete - 1 record(s) assinged a unique ID.

dates # dates
#> [1] "2019-04-01" "2019-04-06" "2019-04-11" "2019-04-16" "2019-04-21"
#> [6] "2019-06-04"

f # fixed episode identifiers
#> [1] "E-1 2019-04-01 -> 2019-04-11 (C)" "E-1 2019-04-01 -> 2019-04-11 (D)"
#> [3] "E-1 2019-04-01 -> 2019-04-11 (D)" "E-4 2019-04-16 -> 2019-04-21 (C)"
#> [5] "E-4 2019-04-16 -> 2019-04-21 (D)" "E-6 2019-06-04 == 2019-06-04 (C)"

r # rolling episode identifiers
#> [1] "E-1 2019-04-01 -> 2019-04-21 (C)" "E-1 2019-04-01 -> 2019-04-21 (D)"
#> [3] "E-1 2019-04-01 -> 2019-04-21 (D)" "E-1 2019-04-01 -> 2019-04-21 (R)"
#> [5] "E-1 2019-04-01 -> 2019-04-21 (D)" "E-6 2019-06-04 == 2019-06-04 (C)"

In the "fixed" episodes example, records 1 to 3 are assigned a unique episode ID ("E-1"). In this instance, record 1 is the "Case (C)" and records 2 and 3 its duplicates "(D)". This is because records 2 and 3 are within 15 days (difference of 14 days) of record 1. Record 4 occurred after this 15-day period and so begins a new "Case (C)" which is assigned a new episode ID ("E-4"). Record 5 is within 15 days of record 4, so is considered a "Duplicate (D)" of record 4 and assigned the same episode ID as record 4 ("E-4"). This process continues chronologically until all records have been assigned an episode ID.
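
The fixed-episode logic described above can be sketched in base R. This is a minimal illustration under simplifying assumptions, not diyar's implementation: fixed_sketch() is a hypothetical helper, and it returns only the row number of the record taken as each record's "Case (C)" rather than an epid object.

```r
# Hypothetical helper (not part of diyar): greedy fixed-window grouping.
# Returns, for each record, the position of the record taken as its case.
fixed_sketch <- function(dates, case_length) {
  epid <- integer(length(dates))
  case_idx <- NA_integer_
  for (i in order(dates)) {
    # Start a new episode when there is no case yet, or when this record
    # falls more than case_length days after the current case
    if (is.na(case_idx) ||
        as.numeric(dates[i] - dates[case_idx]) > case_length) {
      case_idx <- i
    }
    epid[i] <- case_idx
  }
  epid
}

dates <- as.Date(c("2019-04-01", "2019-04-06", "2019-04-11",
                   "2019-04-16", "2019-04-21", "2019-06-04"))
fixed_ids <- fixed_sketch(dates, case_length = 14)
fixed_ids  # episode IDs: 1 1 1 4 4 6
```

The three distinct values correspond to episodes "E-1", "E-4" and "E-6" in the output above.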

In the "rolling" episodes example, records 1 to 3 are also grouped together as episode "E-1". However, unlike the "fixed" episodes example, record 4 is not considered a new "Case (C)". Instead, it’s assigned to episode "E-1" as a "Recurrent (R)" record. This occurred because it was within 8 days (difference of 7 days) of the last "Duplicate (D)" (record 3). Record 5 is also assigned to episode "E-1" because it’s within 8 days of record 4. Record 6 is not within 15 days of the initial "Case (C)" (record 1) and not within 8 days of the last record at this stage (record 5), so it’s considered a new "Case (C)" and assigned a new episode ID ("E-6").
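
The rolling behaviour can be sketched the same way. The helper below is hypothetical and deliberately simplified: it tracks episode membership only (no "C", "D" or "R" labels), and the recurrence_length of 7 days is an assumption chosen to match the 7-day difference quoted above.

```r
# Hypothetical helper (not part of diyar): an episode keeps expanding while
# the next record is within case_length of the case, or within
# recurrence_length of the latest record already in the episode.
rolling_sketch <- function(dates, case_length, recurrence_length) {
  epid <- integer(length(dates))
  case_idx <- NA_integer_
  last_idx <- NA_integer_
  for (i in order(dates)) {
    starts_new <- is.na(case_idx) ||
      (as.numeric(dates[i] - dates[case_idx]) > case_length &&
       as.numeric(dates[i] - dates[last_idx]) > recurrence_length)
    if (starts_new) case_idx <- i
    epid[i] <- case_idx
    last_idx <- i
  }
  epid
}

dates <- as.Date(c("2019-04-01", "2019-04-06", "2019-04-11",
                   "2019-04-16", "2019-04-21", "2019-06-04"))
rolling_ids <- rolling_sketch(dates, case_length = 14, recurrence_length = 7)
rolling_ids  # episode IDs: 1 1 1 1 1 6
```

Records 4 and 5 stay in the first episode because each is only 5 days after the record before it, while record 6 is too far from both the case and the last record.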

Figure 1: "Fixed" and "rolling" episodes with their respective case_length and recurrence_length

If your case definition does not explicitly require a rolling episode, use fixed_episodes(). It takes less time to complete.

Case assignment

This section covers the different ways you can determine which record is taken as the "Case (C)".

Number of recurrence periods and episodes

You can limit the number of episodes per strata by using episodes_max. When episodes_max is reached, any record not yet grouped is assigned a unique episode ID, making each of them a unique "Case (C)".

data("infections_3");
dbs <- infections_3[c("pid","date")]; dbs
#> # A tibble: 11 x 2
#>      pid date      
#>    <dbl> <date>    
#>  1     1 2019-04-01
#>  2     1 2019-04-02
#>  3     1 2019-04-03
#>  4     1 2019-04-04
#>  5     1 2019-04-05
#>  6     1 2019-04-06
#>  7     1 2019-04-07
#>  8     1 2019-04-08
#>  9     1 2019-04-09
#> 10     1 2019-04-10
#> 11     1 2019-04-11

# Maximum of one episode with one recurrence period
dbs$eps_1 <- rolling_episodes(strata = dbs$pid, date =dbs$date, case_length = 1,display = FALSE, 
                              rolls_max = 1, episodes_max = 1, to_s4 = TRUE)
#> Episode grouping complete - 8 record(s) assinged a unique ID.

# Maximum of two episodes with one recurrence period
dbs$eps_2 <- rolling_episodes(strata = dbs$pid, date =dbs$date, case_length = 1,display = FALSE, 
                              rolls_max = 1, episodes_max = 2, to_s4 = TRUE)
#> Episode grouping complete - 5 record(s) assinged a unique ID.

dbs
#> # A tibble: 11 x 4
#>      pid date       eps_1    eps_2   
#>    <dbl> <date>     <epid>   <epid>  
#>  1     1 2019-04-01 E-01 (C) E-01 (C)
#>  2     1 2019-04-02 E-01 (D) E-01 (D)
#>  3     1 2019-04-03 E-01 (R) E-01 (R)
#>  4     1 2019-04-04 E-04 (C) E-04 (C)
#>  5     1 2019-04-05 E-05 (C) E-04 (D)
#>  6     1 2019-04-06 E-06 (C) E-04 (R)
#>  7     1 2019-04-07 E-07 (C) E-07 (C)
#>  8     1 2019-04-08 E-08 (C) E-08 (C)
#>  9     1 2019-04-09 E-09 (C) E-09 (C)
#> 10     1 2019-04-10 E-10 (C) E-10 (C)
#> 11     1 2019-04-11 E-11 (C) E-11 (C)

By default, rolling_episodes() will continue checking for "Recurrent (R)" records indefinitely, but you can limit the number of recurrence periods an episode can have by using rolls_max.

# Infinite recurrence periods per episode (Default)
dbs$eps_3 <- rolling_episodes(strata = dbs$pid, date =dbs$date, case_length = 2,display = FALSE, to_s4 = TRUE)
#> Episode grouping complete - 0 record(s) assinged a unique ID.

# Maximum of one recurrence period per episode
dbs$eps_4 <- rolling_episodes(strata = dbs$pid, date =dbs$date, case_length = 2,display = FALSE, 
                              rolls_max = 1, to_s4 = TRUE)
#> Episode grouping complete - 1 record(s) assinged a unique ID.

# Maximum of two recurrence periods per episode
dbs$eps_5 <- rolling_episodes(strata = dbs$pid, date =dbs$date, case_length = 2,display = FALSE, 
                              rolls_max = 2, to_s4 = TRUE)
#> Episode grouping complete - 0 record(s) assinged a unique ID.

dbs
#> # A tibble: 11 x 7
#>      pid date       eps_1    eps_2    eps_3   eps_4    eps_5  
#>    <dbl> <date>     <epid>   <epid>   <epid>  <epid>   <epid> 
#>  1     1 2019-04-01 E-01 (C) E-01 (C) E-1 (C) E-01 (C) E-1 (C)
#>  2     1 2019-04-02 E-01 (D) E-01 (D) E-1 (D) E-01 (D) E-1 (D)
#>  3     1 2019-04-03 E-01 (R) E-01 (R) E-1 (D) E-01 (D) E-1 (D)
#>  4     1 2019-04-04 E-04 (C) E-04 (C) E-1 (R) E-01 (R) E-1 (R)
#>  5     1 2019-04-05 E-05 (C) E-04 (D) E-1 (D) E-01 (D) E-1 (D)
#>  6     1 2019-04-06 E-06 (C) E-04 (R) E-1 (R) E-06 (C) E-1 (R)
#>  7     1 2019-04-07 E-07 (C) E-07 (C) E-1 (D) E-06 (D) E-1 (D)
#>  8     1 2019-04-08 E-08 (C) E-08 (C) E-1 (R) E-06 (D) E-8 (C)
#>  9     1 2019-04-09 E-09 (C) E-09 (C) E-1 (D) E-06 (R) E-8 (D)
#> 10     1 2019-04-10 E-10 (C) E-10 (C) E-1 (R) E-06 (D) E-8 (D)
#> 11     1 2019-04-11 E-11 (C) E-11 (C) E-1 (D) E-11 (C) E-8 (R)
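
How a cap on recurrence periods changes the grouping can be sketched in base R. rolls_sketch() below is a hypothetical helper, not diyar's code. It assumes the dates are already sorted, that each recurrence period is measured from the last record of the previous window, and that the recurrence length equals the case_length of 2 days used above.

```r
# Hypothetical helper (not part of diyar). Assumes dates are sorted ascending.
rolls_sketch <- function(dates, case_length, recurrence_length, rolls_max = Inf) {
  epid <- integer(length(dates))       # 0 means "not yet grouped"
  while (any(epid == 0)) {
    case <- match(0, epid)             # earliest ungrouped record is the case
    hit <- which(epid == 0 & dates <= dates[case] + case_length)
    epid[hit] <- case
    last <- max(hit)
    rolls <- 0
    # Open recurrence periods until none is left or rolls_max is reached
    while (rolls < rolls_max) {
      hit <- which(epid == 0 & dates <= dates[last] + recurrence_length)
      if (length(hit) == 0) break
      epid[hit] <- case
      last <- max(hit)
      rolls <- rolls + 1
    }
  }
  epid
}

dates <- as.Date("2019-04-01") + 0:10
unlimited <- rolls_sketch(dates, 2, 2)                 # 1 1 1 1 1 1 1 1 1 1 1
one_roll  <- rolls_sketch(dates, 2, 2, rolls_max = 1)  # 1 1 1 1 1 6 6 6 6 6 11
two_rolls <- rolls_sketch(dates, 2, 2, rolls_max = 2)  # 1 1 1 1 1 1 1 8 8 8 8
```

Under these assumptions the three results have the same episode membership as eps_3, eps_4 and eps_5 in the table above.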

Chronological order of events

By default, episode grouping begins at the earliest event (or period of event) and proceeds to the most recent one, making the earliest event the "Case (C)". The opposite of this is to begin episode grouping at the most recent record and proceed backwards in time. This results in taking the most recent record as the "Case (C)". To do this, change from_last to TRUE.

dbs <- infections_2[c("date")]; dbs
#> # A tibble: 6 x 1
#>   date      
#>   <date>    
#> 1 2019-04-01
#> 2 2019-04-06
#> 3 2019-04-11
#> 4 2019-04-16
#> 5 2019-04-21
#> 6 2019-06-04

# Episode grouping in chronological order
dbs$forward <- fixed_episodes(date=dbs$date, case_length= 14,
                              group_stats = TRUE, display = FALSE, to_s4=TRUE)
#> Episode grouping complete - 1 record(s) assinged a unique ID.

# Episode grouping in reverse chronological order
dbs$backward <- fixed_episodes(date=dbs$date, case_length= 14, group_stats = TRUE, display = FALSE,
                               from_last=TRUE, to_s4=TRUE)
#> Episode grouping complete - 1 record(s) assinged a unique ID.

dbs[c("forward","backward")]
#> # A tibble: 6 x 2
#>   forward                          backward                        
#>   <epid>                           <epid>                          
#> 1 E-1 2019-04-01 -> 2019-04-11 (C) E-2 2019-04-06 <- 2019-04-01 (D)
#> 2 E-1 2019-04-01 -> 2019-04-11 (D) E-2 2019-04-06 <- 2019-04-01 (C)
#> 3 E-1 2019-04-01 -> 2019-04-11 (D) E-5 2019-04-21 <- 2019-04-11 (D)
#> 4 E-4 2019-04-16 -> 2019-04-21 (C) E-5 2019-04-21 <- 2019-04-11 (D)
#> 5 E-4 2019-04-16 -> 2019-04-21 (D) E-5 2019-04-21 <- 2019-04-11 (C)
#> 6 E-6 2019-06-04 == 2019-06-04 (C) E-6 2019-06-04 == 2019-06-04 (C)
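
The effect of reversing the direction can be sketched in base R. fixed_sketch() below is a hypothetical helper, not diyar's implementation: it walks the records in ascending or descending date order and returns the row number of each record's case.

```r
# Hypothetical helper (not part of diyar): fixed-window grouping that can
# start from the earliest record (default) or from the most recent one.
fixed_sketch <- function(dates, case_length, from_last = FALSE) {
  epid <- integer(length(dates))
  case_idx <- NA_integer_
  for (i in order(dates, decreasing = from_last)) {
    # Absolute gap, so the same check works in both directions
    gap <- abs(as.numeric(dates[i] - dates[case_idx]))
    if (is.na(case_idx) || gap > case_length) case_idx <- i
    epid[i] <- case_idx
  }
  epid
}

dates <- as.Date(c("2019-04-01", "2019-04-06", "2019-04-11",
                   "2019-04-16", "2019-04-21", "2019-06-04"))
forward  <- fixed_sketch(dates, 14)                    # 1 1 1 4 4 6
backward <- fixed_sketch(dates, 14, from_last = TRUE)  # 2 2 5 5 5 6
```

Going backwards, record 5 becomes the case for records 3 to 5 and record 2 the case for records 1 and 2, as in the backward column above.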

User defined case assignment

You can specify a preference for case assignment by using custom_sort. In fixed_episodes() and rolling_episodes(), this is a vector whose values, when sorted in ascending order, specify the required preference. For example, records with a custom_sort value of 1 will be preferentially taken as the start of an episode over those with a value of 2. This is prioritised over the chronological order of records. See the example below.

dates <- c("01/04/2019", "05/04/2019", "07/04/2019")
dates <- as.Date(dates,"%d/%m/%Y")
user_sort <- c(2,1,2)

# preference determined by from_last 
fixed_episodes(date=dates, case_length=6, to_s4=TRUE, display=FALSE, group_stats = TRUE)
#> Episode grouping complete - 0 record(s) assinged a unique ID.
#> [1] "E-1 2019-04-01 -> 2019-04-07 (C)" "E-1 2019-04-01 -> 2019-04-07 (D)"
#> [3] "E-1 2019-04-01 -> 2019-04-07 (D)"

# user defined preference via custom sort is prioritised before from_last 
fixed_episodes(date=dates, case_length=6, to_s4=TRUE, custom_sort = user_sort, display=FALSE, group_stats = TRUE)
#> Episode grouping complete - 1 record(s) assinged a unique ID.
#> [1] "E-1 2019-04-01 == 2019-04-01 (C)" "E-2 2019-04-05 -> 2019-04-07 (C)"
#> [3] "E-2 2019-04-05 -> 2019-04-07 (D)"

# user defined preference via custom sort is prioritised before from_last. Duplicates flagged from both directions
fixed_episodes(date=dates, case_length=6, to_s4=TRUE, custom_sort = user_sort, display=FALSE, 
               bi_direction = TRUE, group_stats = TRUE)
#> Episode grouping complete - 0 record(s) assinged a unique ID.
#> [1] "E-2 2019-04-01 -> 2019-04-07 (D)" "E-2 2019-04-01 -> 2019-04-07 (C)"
#> [3] "E-2 2019-04-01 -> 2019-04-07 (D)"

In the second example above, even though the second record occurred after the first, episode grouping began at the second one and continued chronologically in the direction specified by from_last.

A consequence of using custom_sort was that record 1 was not grouped together with record 2 even though it’s within the case_length of record 2. This is because duplicates are tracked in one direction, which is determined by from_last. To track duplicates from both directions, use bi_direction as shown in the third example.

Note that examples 1 and 3 now have the same number of duplicates (and episodes) but different cases: records 1 and 2 respectively.
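
The interaction between custom_sort and bi_direction can be sketched in base R. custom_fixed() is a hypothetical helper, not diyar's implementation. It assumes that reference records are visited in order of custom_sort and then date, and that duplicates are searched forwards in time only unless bi_direction is TRUE.

```r
# Hypothetical helper (not part of diyar): fixed episodes with a sort
# preference. Returns the row number of each record's case.
custom_fixed <- function(dates, case_length, custom_sort, bi_direction = FALSE) {
  epid <- integer(length(dates))            # 0 means "not yet grouped"
  for (i in order(custom_sort, dates)) {    # preferred records are visited first
    if (epid[i] != 0) next
    d <- as.numeric(dates - dates[i])       # signed gap in days from the reference
    lower <- if (bi_direction) -case_length else 0
    hit <- epid == 0 & d >= lower & d <= case_length
    epid[hit] <- i
  }
  epid
}

dates <- as.Date(c("01/04/2019", "05/04/2019", "07/04/2019"), "%d/%m/%Y")
user_sort <- c(2, 1, 2)

no_pref  <- custom_fixed(dates, 6, custom_sort = c(1, 1, 1))                # 1 1 1
one_way  <- custom_fixed(dates, 6, custom_sort = user_sort)                 # 1 2 2
both_way <- custom_fixed(dates, 6, custom_sort = user_sort,
                         bi_direction = TRUE)                               # 2 2 2
```

The three results have the same episode membership as the three fixed_episodes() calls above.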

custom_sort follows R’s standard sort behaviour. For example, a factor will sort on its levels, not its descriptive labels.
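
The difference is easy to see with base R's order(). This small illustration uses made-up values and does not call diyar.

```r
infx <- c("BSI", "UTI", "BSI")

# A character vector sorts alphabetically, so "BSI" comes before "UTI"
chr_order <- order(infx)        # 1 3 2

# A factor sorts on the order of its levels, so "UTI" now comes first
infx_f <- factor(infx, levels = c("UTI", "BSI"))
fct_order <- order(infx_f)      # 2 1 3
```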

For a practical example, we’ll use this feature for a case definition where E. coli urinary tract infections (UTI) are considered precursors to E. coli bloodstream infections (BSI). This means that episodes need to be created in such a way that, if there are UTI and BSI records within the same case_length, the UTI record will be taken as the "Case (C)".

dbs <- infections_2[c("date","infx")]; dbs
#> # A tibble: 6 x 2
#>   date       infx       
#>   <date>     <chr>      
#> 1 2019-04-01 E. coli BSI
#> 2 2019-04-06 E. coli BSI
#> 3 2019-04-11 E. coli BSI
#> 4 2019-04-16 E. coli BSI
#> 5 2019-04-21 E. coli BSI
#> 6 2019-06-04 E. coli BSI
dbs$infx <- gsub("E. coli ","",dbs$infx)
dbs$infx[c(2,5)] <- "UTI"

dbs$epids_1 <- fixed_episodes(date=dbs$date, case_length=14, 
                custom_sort = dbs$infx, display = FALSE, to_s4 = TRUE)
#> Episode grouping complete - 1 record(s) assinged a unique ID.

dbs$infx_f <- factor(dbs$infx, levels = c("UTI","BSI"))

dbs$epids_2 <- fixed_episodes(date=dbs$date, case_length=14, 
                custom_sort = dbs$infx_f, display = FALSE, to_s4 = TRUE)
#> Episode grouping complete - 3 record(s) assinged a unique ID.

dbs$epids_3 <- fixed_episodes(date=dbs$date, case_length=14, 
                custom_sort = dbs$infx_f, display = FALSE, to_s4 = TRUE, bi_direction = TRUE)
#> Episode grouping complete - 2 record(s) assinged a unique ID.

dbs
#> # A tibble: 6 x 6
#>   date       infx  epids_1 infx_f epids_2 epids_3
#>   <date>     <chr> <epid>  <fct>  <epid>  <epid> 
#> 1 2019-04-01 BSI   E-1 (C) BSI    E-1 (C) E-2 (D)
#> 2 2019-04-06 UTI   E-1 (D) UTI    E-2 (C) E-2 (C)
#> 3 2019-04-11 BSI   E-1 (D) BSI    E-2 (D) E-2 (D)
#> 4 2019-04-16 BSI   E-4 (C) BSI    E-2 (D) E-2 (D)
#> 5 2019-04-21 UTI   E-4 (D) UTI    E-5 (C) E-5 (C)
#> 6 2019-06-04 BSI   E-6 (C) BSI    E-6 (C) E-6 (C)

In epids_2, after changing the sort preference using factor levels, record 1 (E. coli BSI) and record 2 (E. coli UTI) are no longer part of the same episode. This is because record 2 is now the reference record where episode grouping began, and since record 1 occurred before record 2, they are not grouped together. epids_3 shows the episode grouping when bi_direction is used.

In episode_group(), you can implement custom_sort in levels. You do this by creating a column for each level. The column names, listed in the preferred order (level), are then supplied to custom_sort.

Interval grouping

In this section we discuss the process of grouping periods of events into episodes. Each period/interval is essentially a record with a start and end point in time. For the purpose of episode grouping, these periods are created as number_line objects and supplied to the date argument.

Below are simple examples.

dbs <- tibble(date=c("01/04/2019", "05/04/2019"))

# %m is the month; %M would be read as minutes and the month would not be parsed
dbs$date <- as.Date(dbs$date, "%d/%m/%Y")
dbs$period <- number_line(dbs$date, dbs$date + 10)

dbs
#> # A tibble: 2 x 2
#>   date       period                  
#>   <date>     <numbr_ln>              
#> 1 2019-04-01 2019-04-01 -> 2019-04-11
#> 2 2019-04-05 2019-04-05 -> 2019-04-15

# Grouping events
fixed_episodes(date=dbs$date, case_length=30, to_s4=TRUE, display=FALSE, group_stats=TRUE)
#> Episode grouping complete - 0 record(s) assinged a unique ID.
#> [1] "E-1 2019-04-01 -> 2019-04-05 (C)" "E-1 2019-04-01 -> 2019-04-05 (D)"

# Grouping periods
fixed_episodes(date=dbs$period, case_length=30, to_s4=TRUE, display=FALSE, group_stats=TRUE)
#> Episode grouping complete - 0 record(s) assinged a unique ID.
#> [1] "E-1 2019-04-01 -> 2019-04-15 (C)" "E-1 2019-04-01 -> 2019-04-15 (D)"

As a practical example, the code below groups periods of hospital stay into episodes. This is different from grouping the actual admission or discharge events.

data("hospital_admissions"); hospital_admissions
#> # A tibble: 9 x 4
#>   rd_id admin_dt   discharge_dt epi_len
#>   <int> <date>     <date>         <dbl>
#> 1     1 2019-01-01 2019-01-01         0
#> 2     2 2019-01-01 2019-01-10         0
#> 3     3 2019-01-10 2019-01-13         0
#> 4     4 2019-01-05 2019-01-06         0
#> 5     5 2019-01-05 2019-01-15         0
#> 6     6 2019-01-07 2019-01-15         0
#> 7     7 2019-01-04 2019-01-13         0
#> 8     8 2019-01-20 2019-01-30         0
#> 9     9 2019-01-26 2019-01-31         0

hospital_admissions$admin_period <- number_line(hospital_admissions$admin_dt, hospital_admissions$discharge_dt)

# Grouping the actual admissions into episodes
fixed_episodes(date=hospital_admissions$admin_dt, sn=hospital_admissions$rd_id, case_length = 0, 
                display = FALSE, to_s4 = TRUE, group_stats = TRUE)
#> Episode grouping complete - 5 record(s) assinged a unique ID.
#> [1] "E-1 2019-01-01 == 2019-01-01 (C)" "E-1 2019-01-01 == 2019-01-01 (D)"
#> [3] "E-3 2019-01-10 == 2019-01-10 (C)" "E-4 2019-01-05 == 2019-01-05 (C)"
#> [5] "E-4 2019-01-05 == 2019-01-05 (D)" "E-6 2019-01-07 == 2019-01-07 (C)"
#> [7] "E-7 2019-01-04 == 2019-01-04 (C)" "E-8 2019-01-20 == 2019-01-20 (C)"
#> [9] "E-9 2019-01-26 == 2019-01-26 (C)"

# Grouping the periods of stay (admission -> discharge)
fixed_episodes(date=hospital_admissions$admin_period, sn=hospital_admissions$rd_id, case_length = 0, 
                display = FALSE, to_s4 = TRUE, group_stats = TRUE)
#> Episode grouping complete - 0 record(s) assinged a unique ID.
#> [1] "E-2 2019-01-01 -> 2019-01-15 (D)" "E-2 2019-01-01 -> 2019-01-15 (C)"
#> [3] "E-2 2019-01-01 -> 2019-01-15 (D)" "E-2 2019-01-01 -> 2019-01-15 (D)"
#> [5] "E-2 2019-01-01 -> 2019-01-15 (D)" "E-2 2019-01-01 -> 2019-01-15 (D)"
#> [7] "E-2 2019-01-01 -> 2019-01-15 (D)" "E-8 2019-01-20 -> 2019-01-31 (C)"
#> [9] "E-8 2019-01-20 -> 2019-01-31 (D)"

Periods are grouped into the same episode if they overlap. Since periods can overlap in different ways, you can choose which kinds of overlap count by using overlap_method. The available options are "across", "inbetween", "chain", "aligns_start" and "aligns_end". The default is to use all of them. Below are examples demonstrating each option.

# Overlapping intervals
across <- fixed_episodes(date=hospital_admissions$admin_period, sn=hospital_admissions$rd_id, case_length = 0, 
               overlap_method = "across", display = FALSE, to_s4 = TRUE, group_stats = TRUE)
#> Episode grouping complete - 3 record(s) assinged a unique ID.

across
#> [1] "E-1 2019-01-01 == 2019-01-01 (C)" "E-2 2019-01-01 -> 2019-01-15 (C)"
#> [3] "E-3 2019-01-10 -> 2019-01-13 (C)" "E-4 2019-01-05 -> 2019-01-06 (C)"
#> [5] "E-2 2019-01-01 -> 2019-01-15 (D)" "E-2 2019-01-01 -> 2019-01-15 (D)"
#> [7] "E-2 2019-01-01 -> 2019-01-15 (D)" "E-8 2019-01-20 -> 2019-01-31 (C)"
#> [9] "E-8 2019-01-20 -> 2019-01-31 (D)"

# Chained intervals
chain <- fixed_episodes(date=hospital_admissions$admin_period, sn=hospital_admissions$rd_id, case_length = 0, 
               overlap_method = "chain", display = FALSE, to_s4 = TRUE, group_stats = TRUE)
#> Episode grouping complete - 6 record(s) assinged a unique ID.

chain
#> [1] "E-2 2019-01-01 -> 2019-01-13 (D)" "E-2 2019-01-01 -> 2019-01-13 (C)"
#> [3] "E-2 2019-01-01 -> 2019-01-13 (D)" "E-4 2019-01-05 -> 2019-01-06 (C)"
#> [5] "E-5 2019-01-05 -> 2019-01-15 (C)" "E-6 2019-01-07 -> 2019-01-15 (C)"
#> [7] "E-7 2019-01-04 -> 2019-01-13 (C)" "E-8 2019-01-20 -> 2019-01-30 (C)"
#> [9] "E-9 2019-01-26 -> 2019-01-31 (C)"

# Intervals with aligned end points
aligns_end <- fixed_episodes(date=hospital_admissions$admin_period, sn=hospital_admissions$rd_id, case_length = 0, 
               overlap_method = "aligns_end", display = FALSE, to_s4 = TRUE, group_stats = TRUE)
#> Episode grouping complete - 5 record(s) assinged a unique ID.

aligns_end
#> [1] "E-1 2019-01-01 == 2019-01-01 (C)" "E-2 2019-01-01 -> 2019-01-10 (C)"
#> [3] "E-7 2019-01-04 -> 2019-01-13 (D)" "E-4 2019-01-05 -> 2019-01-06 (C)"
#> [5] "E-5 2019-01-05 -> 2019-01-15 (C)" "E-5 2019-01-05 -> 2019-01-15 (D)"
#> [7] "E-7 2019-01-04 -> 2019-01-13 (C)" "E-8 2019-01-20 -> 2019-01-30 (C)"
#> [9] "E-9 2019-01-26 -> 2019-01-31 (C)"

# Intervals with aligned start points
aligns_start <- fixed_episodes(date=hospital_admissions$admin_period, sn=hospital_admissions$rd_id, case_length = 0, 
               overlap_method = "aligns_start", display = FALSE, to_s4 = TRUE, group_stats = TRUE)
#> Episode grouping complete - 5 record(s) assinged a unique ID.

aligns_start
#> [1] "E-2 2019-01-01 -> 2019-01-10 (D)" "E-2 2019-01-01 -> 2019-01-10 (C)"
#> [3] "E-3 2019-01-10 -> 2019-01-13 (C)" "E-5 2019-01-05 -> 2019-01-15 (D)"
#> [5] "E-5 2019-01-05 -> 2019-01-15 (C)" "E-6 2019-01-07 -> 2019-01-15 (C)"
#> [7] "E-7 2019-01-04 -> 2019-01-13 (C)" "E-8 2019-01-20 -> 2019-01-30 (C)"
#> [9] "E-9 2019-01-26 -> 2019-01-31 (C)"

# Intervals occurring completely within others
inbetween <- fixed_episodes(date=hospital_admissions$admin_period, sn=hospital_admissions$rd_id, case_length = 0, 
               overlap_method = "inbetween", display = FALSE, to_s4 = TRUE, group_stats = TRUE)
#> Episode grouping complete - 5 record(s) assinged a unique ID.

inbetween
#> [1] "E-1 2019-01-01 == 2019-01-01 (C)" "E-2 2019-01-01 -> 2019-01-10 (C)"
#> [3] "E-5 2019-01-05 -> 2019-01-15 (D)" "E-2 2019-01-01 -> 2019-01-10 (D)"
#> [5] "E-5 2019-01-05 -> 2019-01-15 (C)" "E-6 2019-01-07 -> 2019-01-15 (C)"
#> [7] "E-7 2019-01-04 -> 2019-01-13 (C)" "E-8 2019-01-20 -> 2019-01-30 (C)"
#> [9] "E-9 2019-01-26 -> 2019-01-31 (C)"

# Chained intervals and those occurring completely within others
chain_inbetween <- fixed_episodes(date=hospital_admissions$admin_period, sn=hospital_admissions$rd_id, case_length = 0, 
               overlap_method = c("chain","inbetween"), display = FALSE, to_s4 = TRUE, group_stats = TRUE)
#> Episode grouping complete - 5 record(s) assinged a unique ID.

chain_inbetween
#> [1] "E-2 2019-01-01 -> 2019-01-13 (D)" "E-2 2019-01-01 -> 2019-01-13 (C)"
#> [3] "E-2 2019-01-01 -> 2019-01-13 (D)" "E-2 2019-01-01 -> 2019-01-13 (D)"
#> [5] "E-5 2019-01-05 -> 2019-01-15 (C)" "E-6 2019-01-07 -> 2019-01-15 (C)"
#> [7] "E-7 2019-01-04 -> 2019-01-13 (C)" "E-8 2019-01-20 -> 2019-01-30 (C)"
#> [9] "E-9 2019-01-26 -> 2019-01-31 (C)"

Figure 2: Different options for overlap_method using a case_length of "0" days.

Figure 3: Different options for overlap_method using a case_length of "30" days
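
The overlap options can be sketched as simple comparisons of start and end points. The definitions below are assumptions inferred from the example outputs above, not taken from diyar's source, and overlap_types() is a hypothetical helper that works on plain numbers rather than number_line objects.

```r
# Hypothetical helper (not part of diyar): flags which relationships hold
# between two periods, given numeric start (s) and end (e) points.
overlap_types <- function(s1, e1, s2, e2) {
  c(
    # One period starts inside the other and ends after it
    across       = (s1 < s2 && s2 < e1 && e1 < e2) ||
                   (s2 < s1 && s1 < e2 && e2 < e1),
    # One period lies strictly inside the other
    inbetween    = (s1 > s2 && e1 < e2) || (s2 > s1 && e2 < e1),
    # One period ends exactly where the other starts
    chain        = e1 == s2 || e2 == s1,
    aligns_start = s1 == s2,
    aligns_end   = e1 == e2
  )
}

# Day of January 2019 for records 2, 3, 4 and 5 of hospital_admissions
rec2 <- c(1, 10); rec3 <- c(10, 13); rec4 <- c(5, 6); rec5 <- c(5, 15)

r2_r5 <- names(which(overlap_types(rec2[1], rec2[2], rec5[1], rec5[2])))  # "across"
r2_r3 <- names(which(overlap_types(rec2[1], rec2[2], rec3[1], rec3[2])))  # "chain"
r2_r4 <- names(which(overlap_types(rec2[1], rec2[2], rec4[1], rec4[2])))  # "inbetween"
r4_r5 <- names(which(overlap_types(rec4[1], rec4[2], rec5[1], rec5[2])))  # "aligns_start"
```

More than one relationship can hold at once. For example, a single-day period that starts on the same day as a longer one both aligns with its start and chains to it under these definitions.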

Stratified episode grouping

Episode grouping can be done separately for different subsets (strata) of the dataset. Examples of strata include patient IDs, type of pathogen, source of infection or any combination of these. Episodes are limited to each strata; however, episodes from different strata can have different case_length and/or recurrence_length.

record_group() is useful for creating group identifiers which can be used as a strata. See record group for further details.

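As an example, using the infections dataset, a case definition may specify a case_length of 7 days and a recurrence_length of 30 days for "UTI" records, and a case_length of 14 days with no recurrence period for all other infections.

In the example below, these lengths are supplied through the epi and recur columns respectively. Note that infection is not passed to the strata argument in this call, so "UTI" and "BSI" records can still be grouped into the same episode, as the output shows. Adding infection to the strata argument would keep them in separate episodes.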

data(infections)
dbs <- infections[c("date","infection")]
dbs$epi <- ifelse(dbs$infection=="UTI", 7, 14)
dbs$recur <- ifelse(dbs$infection=="UTI", 30, 0)

dbs$epids <- rolling_episodes(date=dbs$date, case_length =dbs$epi, to_s4 =TRUE,
                            recurrence_length = dbs$recur, display = FALSE, group_stats = TRUE)
#> Episode grouping complete - 0 record(s) assinged a unique ID.

dbs
#> # A tibble: 11 x 5
#>    date       infection   epi recur epids                           
#>    <date>     <chr>     <dbl> <dbl> <epid>                          
#>  1 2018-04-01 BSI          14     0 E-1 2018-04-01 -> 2018-05-13 (C)
#>  2 2018-04-07 UTI           7    30 E-1 2018-04-01 -> 2018-05-13 (D)
#>  3 2018-04-13 UTI           7    30 E-1 2018-04-01 -> 2018-05-13 (D)
#>  4 2018-04-19 UTI           7    30 E-1 2018-04-01 -> 2018-05-13 (R)
#>  5 2018-04-25 BSI          14     0 E-1 2018-04-01 -> 2018-05-13 (D)
#>  6 2018-05-01 UTI           7    30 E-1 2018-04-01 -> 2018-05-13 (D)
#>  7 2018-05-07 BSI          14     0 E-1 2018-04-01 -> 2018-05-13 (D)
#>  8 2018-05-13 BSI          14     0 E-1 2018-04-01 -> 2018-05-13 (D)
#>  9 2018-05-19 RTI          14     0 E-9 2018-05-19 -> 2018-05-31 (C)
#> 10 2018-05-25 RTI          14     0 E-9 2018-05-19 -> 2018-05-31 (D)
#> 11 2018-05-31 BSI          14     0 E-9 2018-05-19 -> 2018-05-31 (D)

Sub-strata

A sub-strata is created when records within a strata have a different case_length or recurrence_length. The case definition below demonstrates how this can be used.

  • UTI and BSI records are different episodes regardless of when they occur.
  • UTI have a case_length of 7 days and recurrence period of 30 days
  • BSI have a case_length of 14 days if not treated, OR case_length of 4 days if treated, and no recurrence period in both situations
  • Respiratory tract infections (RTI) have a case_length of 28 days and recurrence period of 5 days

In this example, whether or not the infection is treated should be taken as the sub-strata, and the source of infection taken as the strata.

data("infections_4"); 
dbs <- infections_4

dbs$epids <- episode_group(infections_4, sn=rid, strata=c(pid, organism, source), date=date, 
                case_length =epi, episode_type = "rolling", recurrence_length = recur,
                display = FALSE, to_s4 = TRUE)
#> Episode grouping complete - 3 record(s) assinged a unique ID.

dbs
#> # A tibble: 11 x 9
#>      rid date         pid organism source treated   epi recur epids   
#>    <int> <date>     <dbl> <chr>    <chr>  <chr>   <dbl> <dbl> <epid>  
#>  1     1 2019-04-01     1 E. coli  UTI    -           7    30 E-01 (C)
#>  2     2 2019-04-06     1 E. coli  UTI    -           7    30 E-01 (D)
#>  3     3 2019-04-11     1 E. coli  BSI    Y           4     0 E-03 (C)
#>  4     4 2019-04-16     1 E. coli  BSI    N          14     0 E-04 (C)
#>  5     5 2019-04-21     1 E. coli  BSI    Y           4     0 E-04 (D)
#>  6     6 2019-04-26     1 E. coli  RTI    Y          28     5 E-06 (C)
#>  7     7 2019-05-01     1 E. coli  RTI    N          28     5 E-06 (D)
#>  8     8 2019-05-06     1 E. coli  BSI    Y           4     0 E-08 (C)
#>  9     9 2019-05-11     1 E. coli  BSI    N          14     0 E-09 (C)
#> 10    10 2019-05-16     1 E. coli  UTI    N           7    30 E-10 (C)
#> 11    11 2019-05-21     1 E. coli  UTI    N           7    30 E-10 (D)

There are a few things to note with stratified grouping:

  • Unless required, case_length and recurrence_length should be consistent within each strata; otherwise, you’ll inadvertently create a sub-strata
  • Episode grouping with and without a sub-strata is different and could lead to different results
  • Using a sub-strata is not the same as adding that sub-strata to the strata argument. In the example above, adding treated to strata will group treated infections separately from untreated infections. While this could be the desired outcome depending on your case definition, the case definition above did not require treated and untreated infections to be grouped separately, only that untreated infections last longer.

Stratified grouping is the same as a separate analysis for each subset (strata) of the dataset.
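
This equivalence can be sketched in base R by running the same grouping once per subset. Both helpers are hypothetical and not part of diyar, and the four records are made-up values for illustration.

```r
# Hypothetical helper: greedy fixed-window grouping on a single set of dates.
# Returns the position (within the supplied vector) of each record's case.
fixed_sketch <- function(dates, case_length) {
  epid <- integer(length(dates))
  case_idx <- NA_integer_
  for (i in order(dates)) {
    if (is.na(case_idx) ||
        as.numeric(dates[i] - dates[case_idx]) > case_length) {
      case_idx <- i
    }
    epid[i] <- case_idx
  }
  epid
}

# Hypothetical helper: apply fixed_sketch() separately within each strata
strata_sketch <- function(dates, strata, case_length) {
  epid <- integer(length(dates))
  for (idx in split(seq_along(dates), strata)) {
    # Map positions within the subset back to row numbers in the full data
    epid[idx] <- idx[fixed_sketch(dates[idx], case_length)]
  }
  epid
}

dates <- as.Date(c("2019-04-01", "2019-04-05", "2019-04-08", "2019-04-10"))
infection <- c("BSI", "UTI", "BSI", "UTI")

pooled     <- fixed_sketch(dates, 14)              # 1 1 1 1
stratified <- strata_sketch(dates, infection, 14)  # 1 2 1 2
```

Without a strata all four records fall into one episode; with infection as the strata, "BSI" and "UTI" records form separate episodes even though their dates interleave.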

Useful ways of using these functions

Episode grouping across other units of time

In the examples above, episode grouping was done by days (episode_unit). However, it can be done in other units of time e.g. hours or weeks. Acceptable options are those supported by lubridate's duration() function. Below is an example of episode grouping by the hour.

data("hourly_data"); hourly_data
#>    rid            datetime category epi recur
#> 1    1 2019-04-01 00:00:00      GP1   5     9
#> 2    2 2019-04-01 02:00:00      GP2   5     9
#> 3    3 2019-04-01 04:00:00      GP1   5     9
#> 4    4 2019-04-01 06:00:00      GP2   5     9
#> 5    5 2019-04-01 08:00:00      GP1   5     9
#> 6    6 2019-04-01 10:00:00      GP2   5     9
#> 7    7 2019-04-01 12:00:00      GP1   5     9
#> 8    8 2019-04-01 14:00:00      GP2   5     9
#> 9    9 2019-04-01 16:00:00      GP3   5     9
#> 10  10 2019-04-01 18:00:00      GP3   5     9
#> 11  11 2019-04-01 20:00:00      GP3   5     9
#> 12  12 2019-04-01 22:00:00      GP3   5     9
#> 13  13 2019-04-02 00:00:00      GP3   5     9
dbs <- hourly_data

dbs$datetime
#>  [1] "2019-04-01 00:00:00 UTC" "2019-04-01 02:00:00 UTC"
#>  [3] "2019-04-01 04:00:00 UTC" "2019-04-01 06:00:00 UTC"
#>  [5] "2019-04-01 08:00:00 UTC" "2019-04-01 10:00:00 UTC"
#>  [7] "2019-04-01 12:00:00 UTC" "2019-04-01 14:00:00 UTC"
#>  [9] "2019-04-01 16:00:00 UTC" "2019-04-01 18:00:00 UTC"
#> [11] "2019-04-01 20:00:00 UTC" "2019-04-01 22:00:00 UTC"
#> [13] "2019-04-02 00:00:00 UTC"

rolling_episodes(strata = dbs$category, date = dbs$datetime, case_length = 5,
                 episode_unit = "hours", recurrence_length = 9, group_stats = TRUE, to_s4 = TRUE, display = FALSE)
#> Episode grouping complete - 0 record(s) assinged a unique ID.
#>  [1] "E-1 2019-04-01 00:00:00 -> 2019-04-01 12:00:00 (C)"
#>  [2] "E-2 2019-04-01 02:00:00 -> 2019-04-01 14:00:00 (C)"
#>  [3] "E-1 2019-04-01 00:00:00 -> 2019-04-01 12:00:00 (D)"
#>  [4] "E-2 2019-04-01 02:00:00 -> 2019-04-01 14:00:00 (D)"
#>  [5] "E-1 2019-04-01 00:00:00 -> 2019-04-01 12:00:00 (R)"
#>  [6] "E-2 2019-04-01 02:00:00 -> 2019-04-01 14:00:00 (R)"
#>  [7] "E-1 2019-04-01 00:00:00 -> 2019-04-01 12:00:00 (D)"
#>  [8] "E-2 2019-04-01 02:00:00 -> 2019-04-01 14:00:00 (D)"
#>  [9] "E-9 2019-04-01 16:00:00 -> 2019-04-02 00:00:00 (C)"
#> [10] "E-9 2019-04-01 16:00:00 -> 2019-04-02 00:00:00 (D)"
#> [11] "E-9 2019-04-01 16:00:00 -> 2019-04-02 00:00:00 (D)"
#> [12] "E-9 2019-04-01 16:00:00 -> 2019-04-02 00:00:00 (R)"
#> [13] "E-9 2019-04-01 16:00:00 -> 2019-04-02 00:00:00 (D)"
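
What a change of unit means for the comparison can be sketched in base R. This is an illustration only, not how diyar converts units internally: the unit_seconds lookup is a hypothetical stand-in for the unit handling, and the timestamps are the first five "GP1"-style two-hourly steps written out by hand.

```r
# Hypothetical lookup (not part of diyar): seconds in each supported unit
unit_seconds <- c(seconds = 1, minutes = 60, hours = 3600,
                  days = 86400, weeks = 604800)

datetimes <- as.POSIXct("2019-04-01 00:00:00", tz = "UTC") + 3600 * c(0, 2, 4, 6, 8)
case_length <- 5
episode_unit <- "hours"

# Gap from the first record, in seconds, compared with the case_length
# expressed in the same unit
gap_secs <- as.numeric(difftime(datetimes, datetimes[1], units = "secs"))
within_case <- gap_secs <= case_length * unit_seconds[[episode_unit]]
within_case  # TRUE TRUE TRUE FALSE FALSE
```

With a case_length of 5 hours, only records up to 04:00 fall inside the window of a case at 00:00; with episode_unit set to "days" all five would.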

Limit episode grouping to a subset of the dataframe

For example, with the hourly_data dataset, you can decide to exclude "GP1" and "GP2" records from episode grouping as shown below. These records are skipped during episode grouping and each is assigned a unique episode ID.

dbs <- head(hourly_data[c("datetime","category")], 10)
dbs$subset <- ifelse(dbs$category!="GP3", NA, "group")

dbs$epids <- rolling_episodes(strata= dbs$subset, date = dbs$datetime, case_length = 5, episode_unit = "hours", 
                        recurrence_length = 9, display = TRUE, group_stats = TRUE, to_s4 = TRUE)
#> Episode or recurrence window 1.
#> 2 of 2 record(s) grouped into episodes. 0 records not yet grouped.
#> 
#> Episode grouping complete - 8 record(s) assinged a unique ID.

dbs
#>               datetime category subset
#> 1  2019-04-01 00:00:00      GP1   <NA>
#> 2  2019-04-01 02:00:00      GP2   <NA>
#> 3  2019-04-01 04:00:00      GP1   <NA>
#> 4  2019-04-01 06:00:00      GP2   <NA>
#> 5  2019-04-01 08:00:00      GP1   <NA>
#> 6  2019-04-01 10:00:00      GP2   <NA>
#> 7  2019-04-01 12:00:00      GP1   <NA>
#> 8  2019-04-01 14:00:00      GP2   <NA>
#> 9  2019-04-01 16:00:00      GP3  group
#> 10 2019-04-01 18:00:00      GP3  group
#>                                                 epids
#> 1  E-1 2019-04-01 00:00:00 == 2019-04-01 00:00:00 (C)
#> 2  E-2 2019-04-01 02:00:00 == 2019-04-01 02:00:00 (C)
#> 3  E-3 2019-04-01 04:00:00 == 2019-04-01 04:00:00 (C)
#> 4  E-4 2019-04-01 06:00:00 == 2019-04-01 06:00:00 (C)
#> 5  E-5 2019-04-01 08:00:00 == 2019-04-01 08:00:00 (C)
#> 6  E-6 2019-04-01 10:00:00 == 2019-04-01 10:00:00 (C)
#> 7  E-7 2019-04-01 12:00:00 == 2019-04-01 12:00:00 (C)
#> 8  E-8 2019-04-01 14:00:00 == 2019-04-01 14:00:00 (C)
#> 9  E-9 2019-04-01 16:00:00 -> 2019-04-01 18:00:00 (C)
#> 10 E-9 2019-04-01 16:00:00 -> 2019-04-01 18:00:00 (D)
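
The effect of an NA strata can be checked before any grouping is done. This small base R sketch types the first 10 categories by hand (an assumption made only to keep the example self-contained) and counts how many records would be excluded. The name strata_subset is illustrative.

```r
# First 10 categories of hourly_data, typed by hand for a self-contained example
category <- c(rep(c("GP1", "GP2"), 4), "GP3", "GP3")

# NA marks records to be left out of episode grouping
strata_subset <- ifelse(category != "GP3", NA, "group")

sum(is.na(strata_subset))   # excluded records, each gets its own episode ID
#> [1] 8
sum(!is.na(strata_subset))  # records eligible for grouping
#> [1] 2
```

If the retained records span more than one category, ifelse(category %in% c("GP1", "GP2"), NA, category) would keep their original strata values instead of collapsing them into a single group.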

Use strata from record_group()

data(infections) 

dbs <- infections[c("date","infection")]; dbs
#> # A tibble: 11 x 2
#>    date       infection
#>    <date>     <chr>    
#>  1 2018-04-01 BSI      
#>  2 2018-04-07 UTI      
#>  3 2018-04-13 UTI      
#>  4 2018-04-19 UTI      
#>  5 2018-04-25 BSI      
#>  6 2018-05-01 UTI      
#>  7 2018-05-07 BSI      
#>  8 2018-05-13 BSI      
#>  9 2018-05-19 RTI      
#> 10 2018-05-25 RTI      
#> 11 2018-05-31 BSI

# unique record ids
rd_id <- c(640,17,58,21,130,79,45,300,40,13,31)

# strata based on matching sources of infection
dbs$pids <- record_group(dbs, criteria = infection, to_s4 = TRUE, display = FALSE)
#> Record grouping complete - 0 record(s) assigned a group unique ID.

# stratified grouping 
dbs$epids <- fixed_episodes(sn = rd_id, date = dbs$date, strata = dbs$pids, 
                             to_s4 = TRUE, display = FALSE, group_stats = TRUE, case_length = 10)
#> Episode grouping complete - 5 record(s) assinged a unique ID.

dbs
#> # A tibble: 11 x 4
#>    date       infection pids         epids                             
#>    <date>     <chr>     <pid>        <epid>                            
#>  1 2018-04-01 BSI       P-1 (CRI 01) E-640 2018-04-01 == 2018-04-01 (C)
#>  2 2018-04-07 UTI       P-2 (CRI 01) E-017 2018-04-07 -> 2018-04-13 (C)
#>  3 2018-04-13 UTI       P-2 (CRI 01) E-017 2018-04-07 -> 2018-04-13 (D)
#>  4 2018-04-19 UTI       P-2 (CRI 01) E-021 2018-04-19 == 2018-04-19 (C)
#>  5 2018-04-25 BSI       P-1 (CRI 01) E-130 2018-04-25 == 2018-04-25 (C)
#>  6 2018-05-01 UTI       P-2 (CRI 01) E-079 2018-05-01 == 2018-05-01 (C)
#>  7 2018-05-07 BSI       P-1 (CRI 01) E-045 2018-05-07 -> 2018-05-13 (C)
#>  8 2018-05-13 BSI       P-1 (CRI 01) E-045 2018-05-07 -> 2018-05-13 (D)
#>  9 2018-05-19 RTI       P-9 (CRI 01) E-040 2018-05-19 -> 2018-05-25 (C)
#> 10 2018-05-25 RTI       P-9 (CRI 01) E-040 2018-05-19 -> 2018-05-25 (D)
#> 11 2018-05-31 BSI       P-1 (CRI 01) E-031 2018-05-31 == 2018-05-31 (C)
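
To see why each record above is a "Case" or a duplicate, the gaps can be worked through per stratum. The following is a minimal base R sketch, not the package's implementation. label_fixed() is a hypothetical helper, the dates and infection types are typed by hand from the table above, and counting an event exactly case_length days after the "Case" as a duplicate is an assumption.

```r
# Hypothetical helper: label ascending dates within one stratum
label_fixed <- function(dates, case_length) {
  label <- character(length(dates))
  case_date <- as.Date(NA)
  for (i in seq_along(dates)) {
    # Start a new episode when there is no "Case" yet, or when the event
    # falls more than case_length days after the current "Case"
    if (is.na(case_date) || as.numeric(dates[i] - case_date) > case_length) {
      case_date <- dates[i]
      label[i] <- "C"
    } else {
      label[i] <- "D"
    }
  }
  label
}

date <- seq(as.Date("2018-04-01"), by = 6, length.out = 11)
infection <- c("BSI", "UTI", "UTI", "UTI", "BSI", "UTI",
               "BSI", "BSI", "RTI", "RTI", "BSI")

# Apply the helper separately within each infection type (the strata)
labels <- character(length(date))
for (s in unique(infection)) {
  idx <- which(infection == s)
  labels[idx] <- label_fixed(date[idx], case_length = 10)
}
labels
#> [1] "C" "C" "D" "C" "C" "C" "C" "D" "C" "D" "C"
```

For instance, the third UTI record (2018-04-19) is 12 days after the UTI "Case" on 2018-04-07, so it starts a new episode instead of joining the first one.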

Conclusion

There are a variety of ways to use these functions. It’s worth reviewing your case definition and its implications for the dataset before using them. In general, the following steps will guide you:

  1. Work out which columns should be the strata
  2. Choose whether you need "fixed" or "rolling" episodes
  3. Choose whether you are grouping individual events or periods of events, and supply a date, datetime or lubridate interval object accordingly. See interval grouping
  4. Create a column for case_length and/or recurrence_length. Each stratum should have a single value unless you require sub-strata
  5. Change from_last to TRUE if you want to start episode grouping at the most recent record thereby making it the "Case". Note that this is not the same as starting episode grouping at the earliest record (from_last is FALSE) and then picking the most recent record in that episode as the "Case". See case assignment
  6. If you require the "Case" to be the earliest or most recent record of a particular type of record, use custom_sort in combination with from_last. If not, ignore this argument. See user defined case assignment
  7. If you require episodes to include records on either side of the "Case" use bi_direction. If not, ignore this argument
  8. Choose if episodes are occurring by the minute, hour or day etc., and set episode_unit accordingly
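
The distinction in step 5 can be illustrated with a small base R sketch. It is a simplified model of "fixed" episodes, not the package's implementation, and group_fixed() is a hypothetical helper. For each record it returns the position of that record's "Case", grouping either forward from the earliest record or backward from the most recent one.

```r
# Hypothetical helper: for each date, the index of the "Case" of its episode
group_fixed <- function(dates, case_length, from_last = FALSE) {
  ord <- order(dates, decreasing = from_last)
  case_of <- integer(length(dates))
  case_idx <- NA_integer_
  for (i in ord) {
    # Distance in days from the current "Case" (Inf when there is none yet)
    gap <- if (is.na(case_idx)) Inf else abs(as.numeric(dates[i] - dates[case_idx]))
    if (gap > case_length) case_idx <- i
    case_of[i] <- case_idx
  }
  case_of
}

dates <- as.Date("2019-01-01") + c(0, 7, 14)

group_fixed(dates, case_length = 10)                    # forward from the earliest
#> [1] 1 1 3
group_fixed(dates, case_length = 10, from_last = TRUE)  # backward from the latest
#> [1] 1 3 3
```

Grouping forward pairs the first two records, while grouping backward pairs the last two. Changing the direction therefore changes which records share an episode, not just which record in an episode is labelled the "Case".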