Deriving Disease Phenotypes from UKB Data

Overview

The derive_* functions convert raw UKB columns into analysis-ready variables. This vignette covers the disease phenotype derivation pipeline:

Step	Function(s)	Purpose
1	`derive_missing()`	Handle “Do not know” / “Prefer not to answer”
2	`derive_covariate()`	Convert types; summarise covariates
3	`derive_cut()`	Bin continuous variables into groups
4	`derive_selfreport()`	Self-reported disease status + date
5	`derive_hes()`	HES inpatient ICD-10 status + date
6	`derive_first_occurrence()`	First Occurrence field status + date
7	`derive_cancer_registry()`	Cancer registry status + date
8	`derive_death_registry()`	Death registry ICD-10 status + date
9	`derive_icd10()`	Combine any subset of sources (wrapper)
10	`derive_case()`	Merge self-report + ICD-10 into final case definition

All functions accept a data.frame or data.table and return a data.table. For data.table input, new columns are added by reference (no copy); data.frame input is converted to data.table internally before modification.

In production, replace ops_toy() with extract_batch() followed by decode_values() and decode_names(). See vignette("decode"). Column names below use the RAP raw format (p{field}_{instance}_{array}) as returned by ops_toy() and extract_batch() before decoding.

Setup

library(ukbflow)

df <- ops_toy(n = 500)

Step 1: Handle Informative Missing Labels

UKB uses special labels such as "Do not know" and "Prefer not to answer" to distinguish non-response from truly missing data. derive_missing() converts these to NA (default) or recodes them as "Unknown" so that non-response can be kept as a category for modelling.

df <- derive_missing(df)

Performance: derive_missing() uses data.table::set() for in-place replacement — no column copies are made regardless of dataset size.

To keep non-response as a model category:

df <- derive_missing(df, action = "unknown")

To add custom labels beyond the built-in list:

df <- derive_missing(df, extra_labels = "Not applicable")
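
As a rough illustration of the two modes, the sketch below applies the same idea to a toy data frame in base R. The helper recode_nonresponse(), its action values, and its label list are hypothetical and are not part of ukbflow; the package's built-in labels and internals may differ.

```r
# Hypothetical helper (not part of ukbflow): recode non-response labels
# in every character column of a data frame.
recode_nonresponse <- function(x,
                               labels = c("Do not know", "Prefer not to answer"),
                               action = c("na", "unknown")) {
  action <- match.arg(action)
  replacement <- if (action == "na") NA_character_ else "Unknown"
  for (col in names(x)) {
    if (is.character(x[[col]])) {
      hit <- x[[col]] %in% labels      # existing NA values are never hits
      x[[col]][hit] <- replacement
    }
  }
  x
}

toy <- data.frame(
  smoking = c("Never", "Prefer not to answer", "Current"),
  alcohol = c("Do not know", "Daily", "Never"),
  stringsAsFactors = FALSE
)

toy_na      <- recode_nonresponse(toy)                      # labels become NA
toy_unknown <- recode_nonresponse(toy, action = "unknown")  # labels become "Unknown"
```

Unlike derive_missing() on a data.table, this base R version works on a copy of the data frame rather than modifying it in place.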

Step 2: Prepare Covariates

derive_covariate() converts categorical columns to factor and prints a distribution summary for each.

df <- derive_covariate(
  df,
  as_factor = c(
    "p31",        # sex
    "p20116_i0",  # smoking_status_i0
    "p1558_i0"    # alcohol_intake_frequency_i0
  ),
  factor_levels = list(
    p20116_i0 = c("Never", "Previous", "Current")
  )
)
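
The reason for passing explicit levels can be seen with base R alone. The sketch below uses made-up smoking values and is not ukbflow code: without explicit levels, factor() sorts them alphabetically, which would make "Current" rather than "Never" the reference category in a regression model.

```r
smoking <- c("Current", "Never", "Previous", "Never", NA)

# Default: levels are sorted alphabetically.
f_default <- factor(smoking)
levels(f_default)

# Explicit levels: "Never" comes first and so becomes the reference category.
f_ordered <- factor(smoking, levels = c("Never", "Previous", "Current"))
levels(f_ordered)

# Distribution summary, counting missing values as well.
counts <- table(f_ordered, useNA = "ifany")
counts
```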

Step 3: Bin Continuous Variables

derive_cut() creates a new factor column by binning a continuous variable into quantile-based or custom groups.

df <- derive_cut(
  df,
  col    = "p21001_i0",                              # body_mass_index_bmi_i0
  n      = 4,
  breaks = c(18.5, 25, 30),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  name   = "bmi_cat"
)

df <- derive_cut(
  df,
  col    = "p22189",                                 # townsend_deprivation_index_at_recruitment
  n      = 4,
  labels = c("Q1 (least deprived)", "Q2", "Q3", "Q4 (most deprived)"),
  name   = "tdi_cat"
)
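
For readers who want to see the two binning modes in isolation, here is a minimal base R sketch using cut() on made-up values. It does not reproduce derive_cut() itself; in particular, the boundary convention (left-closed intervals for BMI) is an assumption of this example.

```r
bmi <- c(17.2, 22.4, 27.9, 31.5, 25.0)

# Custom breaks: three interior cut-points give four groups.
# right = FALSE makes intervals left-closed, so 25.0 falls in "Overweight".
bmi_cat <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right  = FALSE
)

# Quantile-based breaks: four groups of roughly equal size.
tdi <- c(-5.1, -3.2, -1.0, 0.4, 2.7, 4.9, -2.2, 1.5)
tdi_cat <- cut(
  tdi,
  breaks = quantile(tdi, probs = seq(0, 1, by = 0.25)),
  labels = c("Q1", "Q2", "Q3", "Q4"),
  include.lowest = TRUE   # keep the minimum value in Q1
)

table(bmi_cat)
table(tdi_cat)
```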

Step 4: Self-Reported Disease

derive_selfreport() searches UKB self-reported non-cancer illness (field 20002) or cancer (field 20001) columns for a disease label matching a regex, then returns binary status and the earliest report date. Column detection is automatic from field IDs.

# Non-cancer: type 2 diabetes (field 20002)
df <- derive_selfreport(df,
  name  = "dm",
  regex = "type 2 diabetes"
)

# Cancer: lung cancer (field 20001)
df <- derive_selfreport(df,
  name  = "lung_cancer",
  regex = "lung cancer",
  field = "cancer"
)

This adds two columns per call:

Column	Type	Description
`dm_selfreport`	logical	`TRUE` if any instance matched
`dm_selfreport_date`	IDate	Earliest report date
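
The matching logic can be sketched in base R on a simplified, hypothetical layout with one illness label and one report date per instance. Real UKB self-report data has several array columns per instance, and the column names below are invented for the example. The sketch uses Date rather than IDate to avoid depending on data.table.

```r
toy <- data.frame(
  illness_i0 = c("hypertension", "type 2 diabetes", NA),
  illness_i1 = c("type 2 diabetes", "type 2 diabetes", "asthma"),
  date_i0    = as.Date(c("2008-03-01", "2009-06-15", NA)),
  date_i1    = as.Date(c("2013-05-20", "2014-01-10", "2012-11-02")),
  stringsAsFactors = FALSE
)

pattern <- "type 2 diabetes"

# TRUE where a label matches; missing labels count as non-matches.
hit_i0 <- grepl(pattern, toy$illness_i0, ignore.case = TRUE)
hit_i1 <- grepl(pattern, toy$illness_i1, ignore.case = TRUE)

# Keep a date only where the label in the same instance matched.
d_i0 <- toy$date_i0
d_i0[!hit_i0] <- NA
d_i1 <- toy$date_i1
d_i1[!hit_i1] <- NA

toy$dm_selfreport      <- hit_i0 | hit_i1
toy$dm_selfreport_date <- pmin(d_i0, d_i1, na.rm = TRUE)  # earliest matching date
```

Participants with no matching label get FALSE and a missing date, so the two output columns stay consistent with each other.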

Step 5: HES Inpatient Records

derive_hes() scans UKB Hospital Episode Statistics ICD-10 codes (field 41270, stored as a JSON array per participant) and matches the earliest corresponding date from field 41280.

# Prefix match: codes starting with "I10" (hypertension)
df <- derive_hes(df, name = "htn", icd10 = "I10")

# Exact match
df <- derive_hes(df, name = "dm_hes", icd10 = "E11", match = "exact")

# Regex: E10 and E11 simultaneously
df <- derive_hes(df, name = "dm_broad", icd10 = "^E1[01]", match = "regex")

The match argument controls how codes are compared:

`match`	Behaviour	Example
`"prefix"` (default)	Code starts with pattern	`"E11"` matches `"E110"`, `"E119"`
`"exact"`	Full 3- or 4-character match	`"E11"` matches only `"E11"`
`"regex"`	Full regular expression	`"^E1[01]"`
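
The three modes in the table can be mimicked with base R string functions. The helper match_icd10() below is hypothetical and only illustrates the comparison rules; it is not how derive_hes() is implemented.

```r
# Hypothetical helper (not part of ukbflow): compare ICD-10 codes to a pattern.
match_icd10 <- function(codes, pattern, match = c("prefix", "exact", "regex")) {
  match <- match.arg(match)
  switch(match,
    prefix = startsWith(codes, pattern),  # code begins with the pattern
    exact  = codes == pattern,            # code equals the pattern
    regex  = grepl(pattern, codes)        # pattern is a regular expression
  )
}

codes <- c("E10", "E11", "E110", "E119", "I10")

res_prefix <- match_icd10(codes, "E11")
# FALSE TRUE TRUE TRUE FALSE
res_exact  <- match_icd10(codes, "E11", match = "exact")
# FALSE TRUE FALSE FALSE FALSE
res_regex  <- match_icd10(codes, "^E1[01]", match = "regex")
# TRUE TRUE TRUE TRUE FALSE
```

Prefix matching is usually what is wanted for a 3-character category such as E11, because it also captures the 4-character subcodes.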

Step 6: First Occurrence Fields

UKB First Occurrence fields (p131xxx) record the earliest date a condition was observed across all linked sources — self-report, HES inpatient, GP records, and death registry — pre-integrated by UKB. Look up your disease in the UKB Field Finder.

# ops_toy includes p131742 as a representative First Occurrence column
df <- derive_first_occurrence(df, name = "htn", field = 131742L, col = "p131742")

Step 7: Cancer Registry

derive_cancer_registry() searches the cancer registry ICD-10 field (40006) and optionally filters by histology (field 40011) and behaviour (field 40012).

# ICD-10 only
df <- derive_cancer_registry(df,
  name  = "skin_cancer",
  icd10 = "^C44"
)

# With histology and behaviour filters
df <- derive_cancer_registry(df,
  name      = "scc",
  icd10     = "^C44",
  histology = c(8070L, 8071L, 8072L),
  behaviour = 3L                        # 3 = malignant
)
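
The effect of stacking the histology and behaviour filters on top of the ICD-10 match can be sketched as a row filter. The long-format registry table below, with one row per registered tumour, is a hypothetical layout with made-up records, not the shape of the UKB columns.

```r
# Toy registry: one row per registered tumour (hypothetical layout).
registry <- data.frame(
  eid       = c(1, 1, 2, 3),
  icd10     = c("C443", "C509", "C449", "C441"),
  histology = c(8070L, 8500L, 8090L, 8071L),
  behaviour = c(3L, 3L, 3L, 2L),
  stringsAsFactors = FALSE
)

# A tumour qualifies only if all three conditions hold.
keep <- grepl("^C44", registry$icd10) &
  registry$histology %in% c(8070L, 8071L, 8072L) &
  registry$behaviour == 3L

# Participants with at least one qualifying tumour.
scc_eids <- unique(registry$eid[keep])
```

Here participant 2 is dropped by the histology filter and participant 3 by the behaviour filter, although both have a C44 code.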

Step 8: Death Registry

derive_death_registry() searches primary (field 40001) and secondary (field 40002) causes of death for ICD-10 codes.

df <- derive_death_registry(df, name = "mi",   icd10 = "I21")
df <- derive_death_registry(df, name = "dm",   icd10 = "E11")
df <- derive_death_registry(df, name = "lung", icd10 = "C34")

Step 9: Combine Sources with `derive_icd10()`

derive_icd10() is a high-level wrapper that calls any combination of the source-specific functions above and merges their outputs into a single status column and earliest date. This is the recommended approach for multi-source ascertainment.

# Non-cancer disease: HES + death + First Occurrence
df <- derive_icd10(df,
  name   = "dm",
  icd10  = "E11",
  source = c("hes", "death", "first_occurrence"),
  fo_col = "p131742"   # toy First Occurrence column from ops_toy(); use the field for your disease in real data
)

# Cancer outcome: cancer registry
df <- derive_icd10(df,
  name      = "lung",
  icd10     = "^C3[34]",
  match     = "regex",
  source    = "cancer_registry",
  behaviour = 3L
)

Intermediate source columns are retained alongside the combined result:

Column	Type	Description
`dm_icd10`	logical	`TRUE` if positive in any specified source
`dm_icd10_date`	IDate	Earliest date across all sources
`dm_hes`	logical	HES status
`dm_hes_date`	IDate	HES date
`dm_fo`	logical	First Occurrence status
`dm_fo_date`	IDate	First Occurrence date
`dm_death`	logical	Death registry status
`dm_death_date`	IDate	Death registry date
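
A minimal sketch of the merge step, assuming the per-source columns in the table already exist: status is the logical OR across sources and the date is the row-wise minimum, ignoring missing values. The values are made up, Date stands in for IDate, and the code is not taken from derive_icd10().

```r
toy <- data.frame(
  dm_hes        = c(TRUE, FALSE, FALSE, TRUE),
  dm_hes_date   = as.Date(c("2012-04-01", NA, NA, "2015-09-30")),
  dm_fo         = c(TRUE, TRUE, FALSE, FALSE),
  dm_fo_date    = as.Date(c("2010-01-15", "2011-07-07", NA, NA)),
  dm_death      = c(FALSE, FALSE, FALSE, TRUE),
  dm_death_date = as.Date(c(NA, NA, NA, "2018-02-11"))
)

# Positive if any source is positive.
toy$dm_icd10 <- toy$dm_hes | toy$dm_fo | toy$dm_death

# Earliest non-missing date across sources; NA when no source has a date.
toy$dm_icd10_date <- pmin(toy$dm_hes_date, toy$dm_fo_date, toy$dm_death_date,
                          na.rm = TRUE)
```

Without na.rm = TRUE, pmin() would return NA for every participant who is missing a date in at least one source, which is nearly everyone.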

Step 10: Final Case Definition

derive_case() merges the self-report and ICD-10 flags into a unified case status, with the earliest date across both sources taken via pmin().

df <- derive_case(df, name = "dm")

Output columns:

Column	Type	Description
`dm_status`	logical	`TRUE` if positive in self-report OR ICD-10
`dm_date`	IDate	Earliest date across all sources (`pmin`)

Why the earliest date matters: dm_date is the direct input to derive_timing(), derive_age(), and derive_followup() — it is the chronological anchor of every downstream survival analysis. See vignette("derive-survival").

Getting Help

?derive_missing, ?derive_covariate, ?derive_cut
?derive_selfreport, ?derive_hes, ?derive_first_occurrence
?derive_cancer_registry, ?derive_death_registry
?derive_icd10, ?derive_case
vignette("derive-survival") — timing, age at event, follow-up
vignette("decode") — decoding column names and values
GitHub Issues
