Deriving Disease Phenotypes from UKB Data

Overview

The derive_* functions convert raw UKB columns into analysis-ready variables. This vignette covers the disease phenotype derivation pipeline:

| Step | Function(s) | Purpose |
|------|-------------|---------|
| 1 | derive_missing() | Handle “Do not know” / “Prefer not to answer” |
| 2 | derive_covariate() | Convert types; summarise covariates |
| 3 | derive_cut() | Bin continuous variables into groups |
| 4 | derive_selfreport() | Self-reported disease status + date |
| 5 | derive_hes() | HES inpatient ICD-10 status + date |
| 6 | derive_first_occurrence() | First Occurrence field status + date |
| 7 | derive_cancer_registry() | Cancer registry status + date |
| 8 | derive_death_registry() | Death registry ICD-10 status + date |
| 9 | derive_icd10() | Combine any subset of sources (wrapper) |
| 10 | derive_case() | Merge self-report + ICD-10 into final case definition |

All functions accept a data.frame or data.table and return a data.table. For data.table input, new columns are added by reference (no copy); data.frame input is converted to data.table internally before modification.

In production, replace ops_toy() with extract_batch() followed by decode_values() and decode_names(). See vignette("decode"). Column names below use the RAP raw format (p{field}_{instance}_{array}) as returned by ops_toy() and extract_batch() before decoding.


Setup

library(ukbflow)

df <- ops_toy(n = 500)

Step 1: Handle Informative Missing Labels

UKB uses special labels such as "Do not know" and "Prefer not to answer" to distinguish refusal from true missing data. derive_missing() converts these to NA (default) or retains them as "Unknown" for modelling.

df <- derive_missing(df)

Performance: derive_missing() uses data.table::set() for in-place replacement — no column copies are made regardless of dataset size.

To keep non-response as a model category:

df <- derive_missing(df, action = "unknown")

To add custom labels beyond the built-in list:

df <- derive_missing(df, extra_labels = "Not applicable")
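
The recoding idea can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: recode_missing() is a hypothetical helper, the label vector is a stand-in for the built-in list, and the "na" action name is assumed (only "unknown" appears above).

```r
# Hypothetical toy column; the label list is illustrative, not the package's built-in list.
smoking <- c("Never", "Prefer not to answer", "Current", "Do not know")
labels  <- c("Do not know", "Prefer not to answer")

# Hypothetical helper: replace informative-missing labels with NA or "Unknown".
recode_missing <- function(x, labels, action = c("na", "unknown")) {
  action <- match.arg(action)
  hit <- x %in% labels
  x[hit] <- if (action == "na") NA_character_ else "Unknown"
  x
}

as_na      <- recode_missing(smoking, labels)                     # labels become NA
as_unknown <- recode_missing(smoking, labels, action = "unknown") # labels become "Unknown"
```

Keeping "Unknown" as its own level lets a model estimate an effect for non-response, whereas NA typically drops those rows from a complete-case analysis.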

Step 2: Prepare Covariates

derive_covariate() converts categorical columns to factor and prints a distribution summary for each.

df <- derive_covariate(
  df,
  as_factor = c(
    "p31",        # sex
    "p20116_i0",  # smoking_status_i0
    "p1558_i0"    # alcohol_intake_frequency_i0
  ),
  factor_levels = list(
    p20116_i0 = c("Never", "Previous", "Current")
  )
)

Step 3: Bin Continuous Variables

derive_cut() creates a new factor column by binning a continuous variable into quantile-based or custom groups.

df <- derive_cut(
  df,
  col    = "p21001_i0",                              # body_mass_index_bmi_i0
  n      = 4,
  breaks = c(18.5, 25, 30),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  name   = "bmi_cat"
)

df <- derive_cut(
  df,
  col    = "p22189",                                 # townsend_deprivation_index_at_recruitment
  n      = 4,
  labels = c("Q1 (least deprived)", "Q2", "Q3", "Q4 (most deprived)"),
  name   = "tdi_cat"
)
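
For readers who want to see what the two binning modes amount to, here is a minimal base R sketch using cut() and quantile(). The toy values are made up, and the interval closure (right = FALSE, include.lowest = TRUE) is an assumption; derive_cut() may treat boundary values differently.

```r
# Custom cut points: three interior breaks give four groups.
bmi <- c(17.0, 22.4, 27.9, 33.1)
bmi_cat <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right  = FALSE   # assumed closure: intervals are [lower, upper)
)

# Quantile-based groups: break points are the sample quartiles.
tdi <- c(-5.2, -3.1, -1.0, 0.4, 2.2, 3.9, 5.5, 7.8)
tdi_cat <- cut(
  tdi,
  breaks = quantile(tdi, probs = seq(0, 1, length.out = 5), na.rm = TRUE),
  labels = c("Q1", "Q2", "Q3", "Q4"),
  include.lowest = TRUE   # keep the minimum value in Q1
)
```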

Step 4: Self-Reported Disease

derive_selfreport() searches UKB self-reported non-cancer illness (field 20002) or cancer (field 20001) columns for a disease label matching a regex, then returns binary status and the earliest report date. Column detection is automatic from field IDs.

# Non-cancer: type 2 diabetes (field 20002)
df <- derive_selfreport(df,
  name  = "dm",
  regex = "type 2 diabetes"
)
# Cancer: lung cancer (field 20001)
df <- derive_selfreport(df,
  name  = "lung_cancer",
  regex = "lung cancer",
  field = "cancer"
)

This adds two columns per call:

| Column | Type | Description |
|--------|------|-------------|
| dm_selfreport | logical | TRUE if any instance matched |
| dm_selfreport_date | IDate | Earliest report date |
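
The status flag can be thought of as "any instance column matches the regex". The base R sketch below illustrates only that matching step, on hypothetical column names and made-up labels; case-insensitive matching is an assumption here, not documented behaviour of derive_selfreport().

```r
# Hypothetical decoded self-report columns, one per assessment instance.
sr <- data.frame(
  illness_i0 = c("type 2 diabetes", "asthma", NA),
  illness_i1 = c(NA, "Type 2 diabetes", "hypertension"),
  stringsAsFactors = FALSE
)

# One logical column per instance; grepl() returns FALSE for NA entries.
hits <- sapply(sr, function(col) grepl("type 2 diabetes", col, ignore.case = TRUE))

# A participant is positive if any instance matched.
dm_selfreport <- rowSums(hits) > 0
```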

Step 5: HES Inpatient Records

derive_hes() scans UKB Hospital Episode Statistics ICD-10 codes (field 41270, stored as a JSON array per participant) and matches the earliest corresponding date from field 41280.

# Prefix match: codes starting with "I10" (hypertension)
df <- derive_hes(df, name = "htn", icd10 = "I10")

# Exact match
df <- derive_hes(df, name = "dm_hes", icd10 = "E11", match = "exact")

# Regex: E10 and E11 simultaneously
df <- derive_hes(df, name = "dm_broad", icd10 = "^E1[01]", match = "regex")
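
To make the "array of codes plus matching date" idea concrete, here is a rough base R sketch. It assumes that each code's position in the array lines up with a date at the same position; the toy strings, the date lists, and first_match_date() are all hypothetical and do not reflect how derive_hes() is written.

```r
# Toy JSON-style code arrays, one string per participant ("[]" = no records).
hes_codes <- c('["E119","I10"]', '["I10"]', '[]')

# Hypothetical dates aligned by position with the codes above.
hes_dates <- list(
  as.Date(c("2014-05-02", "2011-09-30")),
  as.Date("2016-01-12"),
  as.Date(character(0))
)

# Pull out the codes from each array string (letter followed by digits).
codes <- regmatches(hes_codes, gregexpr("[A-Z][0-9]+", hes_codes))

# Earliest date among codes that start with the given prefix, or NA if none.
first_match_date <- function(code_vec, date_vec, prefix) {
  hit <- startsWith(code_vec, prefix)
  if (!any(hit)) return(as.Date(NA))
  min(date_vec[hit])
}

htn_date <- do.call(c, Map(first_match_date, codes, hes_dates, "I10"))
htn      <- !is.na(htn_date)
```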

The match argument controls how codes are compared:

| match | Behaviour | Example |
|-------|-----------|---------|
| "prefix" (default) | Code starts with pattern | "E11" matches "E110", "E119" |
| "exact" | Full 3- or 4-character match | "E11" matches only "E11" |
| "regex" | Full regular expression | "^E1[01]" |

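The three matching modes correspond to familiar base R operations. The sketch below is an illustration rather than the package's code: match_icd10() is a hypothetical helper, and its argument is named how only to avoid shadowing base::match().

```r
# Hypothetical helper mirroring the three modes described above.
match_icd10 <- function(codes, pattern, how = c("prefix", "exact", "regex")) {
  how <- match.arg(how)
  switch(how,
    prefix = startsWith(codes, pattern),  # code begins with the pattern
    exact  = codes == pattern,            # code equals the pattern
    regex  = grepl(pattern, codes)        # pattern is a regular expression
  )
}

codes <- c("E11", "E119", "E109", "I10")

prefix_hits <- match_icd10(codes, "E11")
exact_hits  <- match_icd10(codes, "E11", how = "exact")
regex_hits  <- match_icd10(codes, "^E1[01]", how = "regex")
```

Prefix matching is usually the convenient default for ICD-10 because a three-character category such as E11 covers all of its four-character subcodes.
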
Step 6: First Occurrence Fields

UKB First Occurrence fields (p131xxx) record the earliest date a condition was observed across all linked sources — self-report, HES inpatient, GP records, and death registry — pre-integrated by UKB. Look up your disease in the UKB Field Finder.

# ops_toy includes p131742 as a representative First Occurrence column
df <- derive_first_occurrence(df, name = "htn", field = 131742L, col = "p131742")

Step 7: Cancer Registry

derive_cancer_registry() searches the cancer registry ICD-10 field (40006) and optionally filters by histology (field 40011) and behaviour (field 40012).

# ICD-10 only
df <- derive_cancer_registry(df,
  name  = "skin_cancer",
  icd10 = "^C44"
)

# With histology and behaviour filters
df <- derive_cancer_registry(df,
  name      = "scc",
  icd10     = "^C44",
  histology = c(8070L, 8071L, 8072L),
  behaviour = 3L                        # 3 = malignant
)
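
One plausible reading of the filters is that a registry record counts only if it satisfies the ICD-10 pattern and every supplied filter. The base R sketch below illustrates that reading on made-up records; that the conditions combine with AND is an assumption, not something stated above.

```r
# Made-up registry records (one row per registered tumour).
reg <- data.frame(
  icd10     = c("C44", "C449", "C443", "C34"),
  histology = c(8070L, 8090L, 8071L, 8070L),
  behaviour = c(3L, 3L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Assumed logic: a record must pass all three conditions.
keep <- grepl("^C44", reg$icd10) &
  reg$histology %in% c(8070L, 8071L, 8072L) &
  reg$behaviour %in% 3L
```

Here only the first record is kept: the second fails the histology filter, the third fails the behaviour filter, and the fourth fails the ICD-10 pattern.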

Step 8: Death Registry

derive_death_registry() searches primary (field 40001) and secondary (field 40002) causes of death for ICD-10 codes.

df <- derive_death_registry(df, name = "mi",   icd10 = "I21")
df <- derive_death_registry(df, name = "dm",   icd10 = "E11")
df <- derive_death_registry(df, name = "lung", icd10 = "C34")

Step 9: Combine Sources with derive_icd10()

derive_icd10() is a high-level wrapper that calls any combination of the source-specific functions above and merges their outputs into a single status column and earliest date. This is the recommended approach for multi-source ascertainment.

# Non-cancer disease: HES + death + First Occurrence
df <- derive_icd10(df,
  name   = "dm",
  icd10  = "E11",
  source = c("hes", "death", "first_occurrence"),
  fo_col = "p131742"
)

# Cancer outcome: cancer registry
df <- derive_icd10(df,
  name      = "lung",
  icd10     = "^C3[34]",
  match     = "regex",
  source    = "cancer_registry",
  behaviour = 3L
)

Intermediate source columns are retained alongside the combined result:

| Column | Type | Description |
|--------|------|-------------|
| dm_icd10 | logical | TRUE if positive in any specified source |
| dm_icd10_date | IDate | Earliest date across all sources |
| dm_hes | logical | HES status |
| dm_hes_date | IDate | HES date |
| dm_fo | logical | First Occurrence status |
| dm_fo_date | IDate | First Occurrence date |
| dm_death | logical | Death registry status |
| dm_death_date | IDate | Death registry date |
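
How the combined columns relate to the per-source columns can be sketched in base R: the status is a logical OR across sources, and the date is a row-wise minimum that ignores missing values. The toy data are made up, and base Date is used here in place of data.table's IDate; this is not a copy of the wrapper's internals.

```r
# Made-up per-source results for three participants.
src <- data.frame(
  dm_hes        = c(TRUE, FALSE, FALSE),
  dm_hes_date   = as.Date(c("2012-03-01", NA, NA)),
  dm_death      = c(FALSE, FALSE, FALSE),
  dm_death_date = as.Date(rep(NA_character_, 3)),
  dm_fo         = c(TRUE, TRUE, FALSE),
  dm_fo_date    = as.Date(c("2010-06-15", "2015-01-20", NA))
)

# Positive if any source is positive.
src$dm_icd10 <- src$dm_hes | src$dm_death | src$dm_fo

# Row-wise earliest date; rows with no dates at all stay NA.
src$dm_icd10_date <- pmin(
  src$dm_hes_date, src$dm_death_date, src$dm_fo_date,
  na.rm = TRUE
)
```

Without na.rm = TRUE, a missing date in any one source would make the combined date missing, even for participants with a recorded event elsewhere.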

Step 10: Final Case Definition

derive_case() merges the self-report and ICD-10 flags into a unified case status, with the earliest date across both sources taken via pmin().

df <- derive_case(df, name = "dm")

Output columns:

| Column | Type | Description |
|--------|------|-------------|
| dm_status | logical | TRUE if positive in self-report OR ICD-10 |
| dm_date | IDate | Earliest date across all sources (pmin) |

Why the earliest date matters: dm_date is the direct input to derive_timing(), derive_age(), and derive_followup() — it is the chronological anchor of every downstream survival analysis. See vignette("derive-survival").


Getting Help
