The goal of this vignette is to illustrate how event data can be used for descriptive analysis in R. The data from the first municipality of the BPI Challenge 2015 is used throughout. It is made available by the package under the name BPIC15_1 and has already been preprocessed into an object of class eventlog. For more information on the preprocessing of event data, see the corresponding vignette.

library(edeaR)
data("BPIC15_1")

## Event log summary

The most high-level way to describe an eventlog is to use the generic R function summary.

summary(BPIC15_1)
## Number of events:  52217
## Number of cases:  1199
## Number of traces:  1099
## Number of activities:  398
## Average trace length:  43.55046
## 
## Start eventlog:  2010-10-04 22:00:00
## End eventlog:  2015-07-31 22:00:00
##  case_concept.name  event_question     event_dateFinished
##  Length:52217       Length:52217       Length:52217      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  event_dueDate      event_action_code  event_activityNameEN
##  Length:52217       Length:52217       Length:52217        
##  Class :character   Class :character   Class :character    
##  Mode  :character   Mode  :character   Mode  :character    
##                                                            
##                                                            
##                                                            
##  event_planned      event_time.timestamp          event_monitoringResource
##  Length:52217       Min.   :2010-10-04 22:00:00   Length:52217            
##  Class :character   1st Qu.:2011-11-07 09:32:10   Class :character        
##  Mode  :character   Median :2012-11-19 08:25:49   Mode  :character        
##                     Mean   :2012-12-12 19:44:44                           
##                     3rd Qu.:2014-01-15 23:00:00                           
##                     Max.   :2015-07-31 22:00:00                           
##  event_org.resource event_activityNameNL event_concept.name
##  Length:52217       Length:52217         Length:52217      
##  Class :character   Class :character     Class :character  
##  Mode  :character   Mode  :character     Mode  :character  
##                                                            
##                                                            
##                                                            
##  event_lifecycle.transition event_dateStop     activity_instance
##  Length:52217               Length:52217       Min.   :    1    
##  Class :character           Class :character   1st Qu.:13055    
##  Mode  :character           Mode  :character   Median :26109    
##                                                Mean   :26109    
##                                                3rd Qu.:39163    
##                                                Max.   :52217

As can be observed above, the summary contains the number of events, activities, traces and cases, as well as the time span covered by the event log.
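
To make these counts concrete, the following base R sketch recomputes them on a small toy log. The data frame `toy_log` and its column names are hypothetical and not part of the package. The sketch assumes that a trace is the ordered sequence of activities of a case, and that the average trace length is the number of events divided by the number of cases, which is consistent with the output above (52217 / 1199 ≈ 43.55).

```r
# Toy event log (hypothetical data, not BPIC15_1): one row per event,
# already sorted by time within each case.
toy_log <- data.frame(
  case     = c("c1", "c1", "c1", "c2", "c2", "c2", "c3", "c3"),
  activity = c("A",  "B",  "C",  "A",  "B",  "C",  "A",  "C"),
  stringsAsFactors = FALSE
)

# One trace per case: the activity sequence collapsed into a single string.
traces <- vapply(split(toy_log$activity, toy_log$case),
                 paste, character(1), collapse = ",")

n_events     <- nrow(toy_log)                     # rows in the log
n_cases      <- length(unique(toy_log$case))      # distinct case identifiers
n_traces     <- length(unique(traces))            # distinct activity sequences
n_activities <- length(unique(toy_log$activity))  # distinct activity labels
avg_trace_length <- n_events / n_cases            # assumed definition, see above
```

Here cases c1 and c2 follow the same sequence, so three cases yield only two traces. The same effect explains why BPIC15_1 has 1199 cases but 1099 traces.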

## Cases

The cases function returns a data.frame which contains general descriptives about each individual case.

case_information <- cases(BPIC15_1)
case_information
## Source: local data frame [1,199 x 10]
## 
##    case_concept.name trace_length number_of_activities     start_timestamp
##                (chr)        (int)                (int)              (time)
## 1           10009138           45                   45 2014-04-10 22:00:00
## 2           10051383           57                   56 2014-04-16 22:00:00
## 3           10053042           57                   56 2014-04-13 22:00:00
## 4           10083315           58                   57 2014-04-16 22:00:00
## 5           10093171           46                   46 2014-04-21 22:00:00
## 6           10128431           56                   55 2014-04-24 22:00:00
## 7           10153084           58                   57 2014-04-28 22:00:00
## 8           10154600           47                   47 2014-04-29 22:00:00
## 9           10186016           71                   70 2014-05-01 22:00:00
## 10          10186644           55                   54 2014-04-30 22:00:00
## ..               ...          ...                  ...                 ...
## Variables not shown: complete_timestamp (time), trace (chr), trace_id
##   (dbl), duration_in_days (dbl), first_activity (fctr), last_activity
##   (fctr)

For each case, the following values are reported:

  1. Trace length
  2. Number of activities
  3. Start timestamp
  4. Complete timestamp
  5. Trace
  6. Duration (days)
  7. First activity
  8. Last activity
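
As an illustration of what these values mean, the sketch below derives them by hand for a toy log in base R. It is not the implementation of `cases`. The data frame `toy_log` and its column names are hypothetical, and the sketch assumes that the number of activities counts distinct activities, which would explain why some cases above have a trace length of 57 but only 56 activities.

```r
# Toy event log (hypothetical data): one row per event.
toy_log <- data.frame(
  case      = c("c1", "c1", "c1", "c2", "c2"),
  activity  = c("A", "B", "A", "A", "C"),
  timestamp = as.POSIXct(c("2014-04-10 08:00:00", "2014-04-11 08:00:00",
                           "2014-04-13 08:00:00", "2014-04-16 09:00:00",
                           "2014-04-16 21:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Compute the descriptives for one case at a time, then stack the results.
per_case <- lapply(split(toy_log, toy_log$case), function(ev) {
  ev <- ev[order(ev$timestamp), ]  # make sure events are in time order
  data.frame(
    case                 = ev$case[1],
    trace_length         = nrow(ev),                     # number of events
    number_of_activities = length(unique(ev$activity)),  # distinct activities
    start_timestamp      = min(ev$timestamp),
    complete_timestamp   = max(ev$timestamp),
    trace                = paste(ev$activity, collapse = ","),
    duration_in_days     = as.numeric(difftime(max(ev$timestamp),
                                               min(ev$timestamp),
                                               units = "days")),
    first_activity       = ev$activity[1],
    last_activity        = ev$activity[nrow(ev)],
    stringsAsFactors     = FALSE
  )
})
toy_cases <- do.call(rbind, per_case)
```

Case c1 repeats activity A, so its trace length (3) exceeds its number of distinct activities (2).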

The resulting data.frame as such has little value, as there might be hundreds of cases. However, it can be further summarized and visualized. Below, the most common start and end activities of a case are shown. While almost all cases start with 01_HOOFD_010, there is much more variation in the last activity.

library(dplyr)
summary(select(case_information, first_activity, last_activity))
##         first_activity         last_activity
##  01_HOOFD_010  :1182   01_HOOFD_530   :302  
##  11_AH_II_040b :   7   01_HOOFD_510_2a:106  
##  01_HOOFD_030_2:   2   01_HOOFD_820   : 95  
##  01_HOOFD_065_2:   2   01_HOOFD_510_2 : 92  
##  01_HOOFD_011  :   1   01_HOOFD_516   : 82  
##  01_HOOFD_080  :   1   01_HOOFD_510_4 : 48  
##  (Other)       :   4   (Other)        :474

Using the package ggplot2, we can also visualize this information. The following code plots the distribution of throughput time, i.e. the duration of a case.

library(ggplot2)
ggplot(case_information) + 
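    # geom_histogram bins the continuous duration (30-day bins); in current
    # ggplot2 versions geom_bar no longer accepts binwidth
    geom_histogram(aes(duration_in_days), binwidth = 30, fill = "#0072B2") + 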
    scale_x_continuous(limits = c(0,500)) +
    xlab("Duration (in days)") + 
    ylab("Number of cases") 


## Activities

The activities function shows the frequency of each activity.

activity_information <- activities(BPIC15_1)
activity_information
## Source: local data frame [398 x 3]
## 
##    event_concept.name absolute_frequency relative_frequency
##                 (chr)              (int)              (dbl)
## 1           01_BB_550                  1       1.915085e-05
## 2           01_BB_560                  1       1.915085e-05
## 3         01_BB_670_1                  1       1.915085e-05
## 4           01_BB_680                  1       1.915085e-05
## 5        01_HOOFD_197                  1       1.915085e-05
## 6        01_HOOFD_331                  1       1.915085e-05
## 7      01_HOOFD_446_1                  1       1.915085e-05
## 8      01_HOOFD_446_2                  1       1.915085e-05
## 9        01_HOOFD_456                  1       1.915085e-05
## 10     01_HOOFD_496_1                  1       1.915085e-05
## ..                ...                ...                ...
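
The two frequency columns are related in a simple way: an activity that occurs once in a log of 52217 events has a relative frequency of 1 / 52217 ≈ 1.915085e-05, as in the rows above. The following base R sketch reproduces this relationship on a toy vector of activity labels. The names `toy_activities` and `toy_activity_info` are hypothetical, and the sketch is not the implementation of `activities`.

```r
# Toy activity column of an event log (hypothetical data): one label per event.
toy_activities <- c("A", "B", "A", "C", "A", "B")

absolute_frequency <- table(toy_activities)              # events per activity
relative_frequency <- prop.table(absolute_frequency)     # share of all events

toy_activity_info <- data.frame(
  activity           = names(absolute_frequency),
  absolute_frequency = as.integer(absolute_frequency),
  relative_frequency = as.numeric(relative_frequency),
  stringsAsFactors   = FALSE
)

# Sort from least to most frequent, like the output shown above.
toy_activity_info <- toy_activity_info[order(toy_activity_info$absolute_frequency), ]
```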

The following graph shows the cumulative distribution function of the absolute activity frequencies. It shows that about 75% of the activities occur fewer than 100 times.

ggplot(activity_information) +
    stat_ecdf(aes(absolute_frequency), lwd = 1, col = "#0072B2") + 
    scale_x_continuous(breaks = seq(0, 1000, by = 100)) + 
    xlab("Absolute activity frequencies") +
    ylab("Cumulative percentage")
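
A value read off the cumulative distribution can also be computed directly. The sketch below does this in base R for a toy vector of frequencies. The vector `toy_freq` is hypothetical. With the real data, one would presumably pass `activity_information$absolute_frequency` instead.

```r
# Toy absolute frequencies of eight activities (hypothetical data).
toy_freq <- c(1, 1, 2, 5, 40, 80, 150, 900)

# Share of activities that occur fewer than 100 times.
share_below_100 <- mean(toy_freq < 100)

# The same value via the empirical cumulative distribution function:
# ecdf() returns a function giving the share of values <= its argument.
freq_ecdf <- ecdf(toy_freq)
share_at_most_99 <- freq_ecdf(99)
```

Because frequencies are integers, "fewer than 100" and "at most 99" describe the same set of activities, so both values agree.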


## Predefined descriptive metrics

Next to the more general descriptives seen so far, a series of specific descriptive metrics has been defined. Three different levels of analysis are distinguished: log, trace and activity. The metrics look at aspects of time as well as the structuredness of the eventlog. Some of the metrics are illustrated below.

### Selfloops

The next piece of code computes the number of selfloops at the level of activities.

activity_selfloops <- number_of_selfloops(BPIC15_1, level_of_analysis = "activity")
activity_selfloops
##    event_concept.name absolute    relative
## 1        01_HOOFD_205       86 0.565789474
## 2        01_HOOFD_100       31 0.086834734
## 3      01_HOOFD_190_2        9 0.068181818
## 4        08_AWB45_005        5 0.006684492
## 5      01_HOOFD_065_2        2 0.003067485
## 6        01_HOOFD_110        1 0.001858736
## 7        01_HOOFD_120        1 0.001972387
## 8        01_HOOFD_180        1 0.000896861
## 9        01_HOOFD_200        1 0.001027749
## 10     01_HOOFD_510_2        1 0.001108647
## 11       01_HOOFD_790        1 0.015873016
## 12       02_DRZ_030_2        1 0.200000000
## 13         10_UOV_065        1 0.076923077

The output shows that 13 activities sometimes occur in a selfloop. The activity 01_HOOFD_205 has the most selfloops, namely 86.
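
To illustrate the idea behind this count, the following base R sketch tallies selfloops on a toy log. It assumes that a selfloop is counted each time an activity directly follows itself within the same case. The exact counting rule used by `number_of_selfloops` may differ, and `toy_log` is hypothetical data.

```r
# Toy event log (hypothetical data), sorted by time within each case.
toy_log <- data.frame(
  case     = c("c1", "c1", "c1", "c1", "c2", "c2", "c2"),
  activity = c("A",  "A",  "B",  "A",  "B",  "B",  "B"),
  stringsAsFactors = FALSE
)

# Per case, keep every event whose activity equals that of the previous event.
selfloop_activities <- unlist(
  lapply(split(toy_log$activity, toy_log$case), function(a) {
    a[-1][a[-1] == a[-length(a)]]
  }),
  use.names = FALSE
)

# Number of selfloops per activity under the assumption stated above.
selfloops <- table(selfloop_activities)
```

In case c1, A directly follows itself once. Its later recurrence after B is not a selfloop. In case c2, B directly follows itself twice.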

Visualized:

ggplot(activity_selfloops) + 
    geom_bar(aes(reorder(event_concept.name, -absolute), absolute), stat = "identity", fill = "#0072B2") + 
    theme(axis.text.x = element_text(angle = 90)) + 
    xlab("Activity") + 
    ylab("Number of selfloops")


### Repetitions

Complementary to selfloops are repetitions: activities which are repeated in a case, but not directly following each other.

activity_repetitions <- repetitions(BPIC15_1, level_of_analysis = "activity")
activity_repetitions
##    event_concept.name relative_frequency absolute     relative
## 1        01_HOOFD_180       0.0213723500       78 0.0650542118
## 2        01_HOOFD_200       0.0186529291       37 0.0308590492
## 3      01_HOOFD_510_2       0.0172932187        3 0.0025020851
## 4        08_AWB45_005       0.0144205910      143 0.1192660550
## 5      01_HOOFD_065_2       0.0125246567        1 0.0008340284
## 6        01_HOOFD_110       0.0103223088       71 0.0592160133
## 7        01_HOOFD_120       0.0097286324       67 0.0558798999
## 8        01_HOOFD_100       0.0074305303      156 0.1301084237
## 9        01_HOOFD_205       0.0045579026        3 0.0025020851
## 10     01_HOOFD_190_2       0.0027002700       10 0.0083402836
## 11       01_HOOFD_790       0.0012256545       12 0.0100083403
## 12         10_UOV_065       0.0002681119        0 0.0000000000
## 13       02_DRZ_030_2       0.0001149051        0 0.0000000000
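
The distinction from selfloops can be sketched in base R as follows. The sketch assumes that direct repeats are first collapsed into a single occurrence, and that every later recurrence of an activity within the same case then counts as one repetition. This is an illustration on hypothetical data, not the implementation of `repetitions`.

```r
# Toy event log (hypothetical data), sorted by time within each case.
toy_log <- data.frame(
  case     = c("c1", "c1", "c1", "c1", "c1", "c2", "c2", "c2"),
  activity = c("A",  "A",  "B",  "A",  "B",  "B",  "B",  "C"),
  stringsAsFactors = FALSE
)

repeated_activities <- unlist(
  lapply(split(toy_log$activity, toy_log$case), function(a) {
    runs <- rle(a)$values   # collapse direct repeats (selfloops) into one run
    runs[duplicated(runs)]  # keep each later recurrence of an activity
  }),
  use.names = FALSE
)

# Number of repetitions per activity under the assumption stated above.
repetitions_toy <- table(repeated_activities)
```

Case c2 contains a selfloop of B but no repetition, which mirrors activities such as 10_UOV_065 above that have a selfloop but zero repetitions.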

Visualized:

ggplot(activity_repetitions) + 
    geom_bar(aes(reorder(event_concept.name, -absolute), absolute), stat = "identity", fill = "#0072B2") + 
    theme(axis.text.x = element_text(angle = 90)) + 
    xlab("Activity") + 
    ylab("Number of repetitions")


### Combining descriptives

Using some data manipulation in R, we can plot both descriptives together, to easily see whether repetitions and selfloops occur often for the same activities.

data <- bind_rows(mutate(activity_selfloops, type = "selfloops"),
              mutate(select(activity_repetitions, event_concept.name, absolute), type = "repetitions"))

ggplot(data) + 
    geom_bar(aes(reorder(event_concept.name, -absolute), absolute), stat = "identity", fill = "#0072B2") + 
    facet_grid(type ~ .) +
    theme(axis.text.x = element_text(angle = 90)) + 
    xlab("Activity") + 
    ylab("Number of selfloops and repetitions")


## Other descriptives

Other available descriptives and the supported analysis levels are listed below:

Time

Structuredness

Variance

  • Activity presence in cases (activity)
  • Activity type frequency (trace, activity)
  • Start activities (log, activity)
  • End activities (log, activity)
  • Trace length (log, trace)
  • Trace coverage (log)
  • Trace frequency (trace)
  • Number of traces (log)

Repetitions

  • Number of repetitions (log, trace, activity)

Selfloops

  • Size of selfloops (log, trace, activity)
  • Number of selfloops per traces (log, trace)
  • Number of traces with selfloop (log)