Extracting Phenotype Data

Overview

UKB phenotype data is stored in a proprietary .dataset format on the RAP and cannot be read directly. The extract_* functions provide R interfaces for discovering approved fields and extracting phenotype data via the DNAnexus dx extract_dataset and table-exporter tools.

Two workflows are available:

| Function | Mode | Scale | Output |
| --- | --- | --- | --- |
| extract_batch() | Async job | Large / production (typically 50+ fields) | job ID → CSV on RAP cloud |
| extract_pheno() | Synchronous | Small (quick checks) | data.table in memory |

extract_batch() is the recommended approach for any serious analysis. extract_pheno() is provided for quick interactive inspection inside the RAP environment only.


Prerequisites

Ensure you are authenticated and have selected your project:

library(ukbflow)

auth_login()
auth_select_project("project-XXXXXXXXXXXX")

Step 1: Browse Available Fields

Before extracting, use extract_ls() to explore what fields are approved for your project:

# List all approved fields (cached after first call)
extract_ls()

# Search by keyword
extract_ls(pattern = "cancer")
extract_ls(pattern = "p31|p53|p21022")

# Force refresh after switching projects or datasets
extract_ls(refresh = TRUE)

The result is a data.frame with two columns:

| Column | Example |
| --- | --- |
| field_name | participant.p53_i0 |
| title | Date of attending assessment centre \| Instance 0 |

Fields reflect your project’s approved data only — not all UKB fields are present.
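
Because the result is an ordinary data.frame, it can also be filtered locally with base R. The sketch below uses a small hand-made stand-in for the extract_ls() output; only the two column names come from the text above, while the rows and the field_pattern() helper are illustrative assumptions. It shows why an unanchored pattern such as "p31" can also pick up longer field IDs, and one way to match whole IDs only.

```r
# Stand-in for the data.frame returned by extract_ls().
# The rows are made up for illustration; only the column names are from the text.
fields <- data.frame(
  field_name = c("participant.p31", "participant.p53_i0",
                 "participant.p3143_i0", "participant.p21022"),
  title = c("Sex", "Date of attending assessment centre | Instance 0",
            "(illustrative other field) | Instance 0", "Age at recruitment"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build a regex that matches whole field IDs only,
# so that 31 does not also match 3143. as.integer() avoids scientific
# notation when large IDs are converted to text.
field_pattern <- function(ids) {
  paste0("\\.p(", paste(as.integer(ids), collapse = "|"), ")(_|$)")
}

# Unanchored pattern: also matches participant.p3143_i0
loose <- fields[grepl("p31|p53|p21022", fields$field_name), ]

# Anchored pattern: matches the three requested fields only
strict <- fields[grepl(field_pattern(c(31, 53, 21022)), fields$field_name), ]

nrow(loose)   # 4
nrow(strict)  # 3
```

Whether extract_ls(pattern = ...) itself accepts an anchored expression like this is not stated here, so the filtering above is applied to the returned table rather than passed to the function.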


Step 2: Extract Data

Quick inspection: extract_pheno()

For small-scale interactive checks inside the RAP RStudio environment:

df <- extract_pheno(c(31, 53, 21022))

extract_pheno() is restricted to the RAP environment and returns data in memory only. For any analysis intended to be saved or reproduced, use extract_batch().

Note: extract_pheno() returns raw coded values (e.g. 1/0 for Sex, numeric codes for diseases). Use the decode_* functions to convert these codes to human-readable labels.
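
To make the idea of raw coded values concrete, here is a minimal base R sketch that labels a single field by hand. It does not use the package's decoding helpers, whose signatures are not shown here. It assumes the UK Biobank coding for field 31 (0 = Female, 1 = Male) and uses a made-up data.frame in place of the data.table that extract_pheno() returns.

```r
# Mock stand-in for extract_pheno() output; the values are made up.
df <- data.frame(
  participant.eid = c(1000001, 1000002, 1000003),
  participant.p31 = c(0, 1, 1)
)

# Assumed coding for field 31 (Sex): 0 = Female, 1 = Male.
df$sex <- factor(df$participant.p31,
                 levels = c(0, 1),
                 labels = c("Female", "Male"))

table(df$sex)
```

Hand-written mappings like this do not scale to fields with large coding tables, which is where a dedicated decoding step is the better fit.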


A Note on Column Names

Column naming differs between the two extraction methods:

extract_batch() — no prefix:

| Column | Meaning |
| --- | --- |
| eid | Participant ID |
| p31 | Field 31 (Sex) |
| p53_i0 | Field 53, Instance 0 |
| p20002_i0_a0 | Field 20002, Instance 0, Array 0 |

extract_pheno() adds a participant. prefix:

| Column | Meaning |
| --- | --- |
| participant.eid | Participant ID |
| participant.p31 | Field 31 (Sex) |
| participant.p53_i0 | Field 53, Instance 0 |
| participant.p20002_i0_a0 | Field 20002, Instance 0, Array 0 |
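
If code needs to work with output from either method, one option is to strip the prefix so that both use the unprefixed names. The following is a minimal sketch on a made-up table; the values are illustrative, and a plain data.frame stands in for the data.table that extract_pheno() returns.

```r
# Mock stand-in for extract_pheno() output; the values are made up.
pheno <- data.frame(
  participant.eid    = c(1000001, 1000002),
  participant.p31    = c(0, 1),
  participant.p53_i0 = c("2008-03-14", "2009-11-02"),
  stringsAsFactors = FALSE
)

# Drop the leading "participant." so the names match the extract_batch() style.
names(pheno) <- sub("^participant\\.", "", names(pheno))

names(pheno)  # "eid" "p31" "p53_i0"
```

The pattern is anchored with ^ so that only a leading prefix is removed and the rest of the column name is left untouched.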

Getting Help
