Preparing data for finalfit


Ewen Harrison

This vignette shows you how to read in and prepare any dataset for use with finalfit. The demonstration uses the boot::melanoma dataset; see ?boot::melanoma for the help page with a description of the data. I will use library(tidyverse) methods. First I’ll write_csv() the data to disk, just to demonstrate reading it back in.

Read data

Note the various options in read_csv(), including supplying column names (col_names), column types (col_types), and the strings that identify missing data (na).

library(readr)

# Save example
write_csv(boot::melanoma, "boot.csv")

# Read data
melanoma = read_csv("boot.csv")
#> Rows: 205 Columns: 7
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (7): time, status, sex, age, year, thickness, ulcer
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Column types

The output above shows how each column has been parsed; here all seven columns were read as doubles. For full details of the available column types see ?readr::cols().

Continuous data

Integer (whole numbers) - col_integer()
Double or numeric (real numbers; the name comes from “double-precision floating point”) - col_double()

Categorical data

Factor (a fixed set of names/strings or numbers) - col_factor()
Character (sequences of letters, numbers, and symbols) - col_character()
Logical (containing only TRUE or FALSE) - col_logical()

Dates and times

Date - col_date()
Time - col_time()
Date-time - col_datetime()
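
If the guessed types are not what you want, the column types listed above can be passed to read_csv() through its col_types argument. The following is a minimal sketch rather than part of the original workflow: the temporary file path csv_path and the object name melanoma_typed are illustrative choices, and treating the coded columns as integers is an assumption about how you might want them stored.

```r
library(readr)

# Write the example data to a temporary file (illustrative path, not "boot.csv")
csv_path <- tempfile(fileext = ".csv")
write_csv(boot::melanoma, csv_path)

# Read it back, overriding the guessed types for the coded columns.
# Any column not named here falls back to the default, col_double().
melanoma_typed <- read_csv(
  csv_path,
  col_types = cols(
    .default = col_double(),
    status   = col_integer(),
    sex      = col_integer(),
    ulcer    = col_integer()
  ),
  na = c("", "NA")  # strings to treat as missing
)

# Retrieve the full column specification that was applied
spec(melanoma_typed)
```

Supplying col_types also silences the column specification message, since nothing is being guessed.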

Check data

ff_glimpse() provides a convenient overview of all data in a tibble or data frame. It is particularly important that factors are correctly specified, so ff_glimpse() separates variables into continuous and categorical. As expected, no factors are yet specified in the melanoma dataset.

library(finalfit)
ff_glimpse(melanoma)
#> $Continuous
#>               label var_type   n missing_n missing_percent   mean     sd    min
#> time           time    <dbl> 205         0             0.0 2152.8 1122.1   10.0
#> status       status    <dbl> 205         0             0.0    1.8    0.6    1.0
#> sex             sex    <dbl> 205         0             0.0    0.4    0.5    0.0
#> age             age    <dbl> 205         0             0.0   52.5   16.7    4.0
#> year           year    <dbl> 205         0             0.0 1969.9    2.6 1962.0
#> thickness thickness    <dbl> 205         0             0.0    2.9    3.0    0.1
#> ulcer         ulcer    <dbl> 205         0             0.0    0.4    0.5    0.0
#>           quartile_25 median quartile_75    max
#> time           1525.0 2005.0      3042.0 5565.0
#> status            1.0    2.0         2.0    3.0
#> sex               0.0    0.0         1.0    1.0
#> age              42.0   54.0        65.0   95.0
#> year           1968.0 1970.0      1972.0 1977.0
#> thickness         1.0    1.9         3.6   17.4
#> ulcer             0.0    0.0         1.0    1.0
#> 
#> $Categorical
#> data frame with 0 columns and 205 rows

If you wish to see the variables in the order in which they appear in the data frame or tibble, missing_glimpse() or tibble::glimpse() are useful.

missing_glimpse(melanoma)
#>               label var_type   n missing_n missing_percent
#> time           time    <dbl> 205         0             0.0
#> status       status    <dbl> 205         0             0.0
#> sex             sex    <dbl> 205         0             0.0
#> age             age    <dbl> 205         0             0.0
#> year           year    <dbl> 205         0             0.0
#> thickness thickness    <dbl> 205         0             0.0
#> ulcer         ulcer    <dbl> 205         0             0.0
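
For comparison, tibble::glimpse() also lists the variables in data-frame order, printing each column's type and its first few values. A minimal sketch, calling it on boot::melanoma directly so that it runs on its own:

```r
library(tibble)

# One line per column, in the order the columns appear in the data frame
glimpse(boot::melanoma)
```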

Specify factors

Use an original description of the data (often called a data dictionary) to correctly assign and label any factor variables. This can be done in a single pipe.

library(dplyr)

# Codes below follow the data description in ?boot::melanoma
melanoma = melanoma %>% 
  mutate(
    # 1 = died from melanoma, 2 = alive, 3 = died from other causes
    status.factor = factor(status, levels = c(1, 2, 3), 
      labels = c("Died from melanoma", "Alive", "Died from other causes")) %>% 
    ff_label("Status"),
    # 1 = male, 0 = female
    sex.factor = factor(sex, levels = c(1, 0),
      labels = c("Male", "Female")) %>% 
    ff_label("Sex"),
    # 1 = present, 0 = absent
    ulcer.factor = factor(ulcer, levels = c(1, 0),
      labels = c("Present", "Absent")) %>% 
    ff_label("Ulcer")
  )

ff_glimpse(melanoma)
#> $Continuous
#>               label var_type   n missing_n missing_percent   mean     sd    min
#> time           time    <dbl> 205         0             0.0 2152.8 1122.1   10.0
#> status       status    <dbl> 205         0             0.0    1.8    0.6    1.0
#> sex             sex    <dbl> 205         0             0.0    0.4    0.5    0.0
#> age             age    <dbl> 205         0             0.0   52.5   16.7    4.0
#> year           year    <dbl> 205         0             0.0 1969.9    2.6 1962.0
#> thickness thickness    <dbl> 205         0             0.0    2.9    3.0    0.1
#> ulcer         ulcer    <dbl> 205         0             0.0    0.4    0.5    0.0
#>           quartile_25 median quartile_75    max
#> time           1525.0 2005.0      3042.0 5565.0
#> status            1.0    2.0         2.0    3.0
#> sex               0.0    0.0         1.0    1.0
#> age              42.0   54.0        65.0   95.0
#> year           1968.0 1970.0      1972.0 1977.0
#> thickness         1.0    1.9         3.6   17.4
#> ulcer             0.0    0.0         1.0    1.0
#> 
#> $Categorical
#>                label var_type   n missing_n missing_percent levels_n
#> status.factor Status    <fct> 205         0             0.0        3
#> sex.factor       Sex    <fct> 205         0             0.0        2
#> ulcer.factor   Ulcer    <fct> 205         0             0.0        2
#>                                                                             levels
#> status.factor "Died from melanoma", "Alive", "Died from other causes", "(Missing)"
#> sex.factor                                           "Male", "Female", "(Missing)"
#> ulcer.factor                                      "Present", "Absent", "(Missing)"
#>               levels_count   levels_percent
#> status.factor  57, 134, 14 27.8, 65.4,  6.8
#> sex.factor         79, 126           39, 61
#> ulcer.factor       90, 115           44, 56
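
Because factor() silently turns any value not listed in levels into NA, it can be worth cross-tabulating a new factor against its source column before moving on. The following is a sketch of one way to do that, not a finalfit feature: the object names melanoma_check and sex_check are illustrative, and the recoding is repeated on boot::melanoma so the block runs on its own.

```r
library(dplyr)

# Repeat the recoding of sex on a fresh copy of the data
melanoma_check <- boot::melanoma %>%
  mutate(
    sex.factor = factor(sex, levels = c(1, 0), labels = c("Male", "Female"))
  )

# Each numeric code should pair with exactly one label
sex_check <- melanoma_check %>% count(sex, sex.factor)
sex_check

# Number of values lost to NA by the recoding; 0 means every code was matched
sum(is.na(melanoma_check$sex.factor))
```

A code that pairs with more than one label, or a row where the factor is NA, points to a mismatch between the levels argument and the data dictionary.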

Everything looks good and you are ready to start analysis.
