
Genetic Risk Score (GRS) Analysis on UKB RAP

Overview

The grs_* functions provide an end-to-end pipeline for computing and validating polygenic / genetic risk scores within the UK Biobank Research Analysis Platform (RAP). Because individual-level genotype data cannot be downloaded locally, all computationally intensive steps are executed as cloud jobs via the DNAnexus Swiss Army Knife app.

Function Where it runs Purpose
grs_check() Local or RAP Validate and export a SNP weights file
grs_bgen2pgen() Submits RAP jobs Convert UKB imputed BGEN files to PGEN
grs_score() Submits RAP jobs Score genotypes across chromosomes with plink2
grs_standardize() Local or RAP Z-score standardise GRS columns
grs_zscore() Local or RAP Alias for grs_standardize()
grs_validate() Local or RAP Assess predictive performance (OR/HR, AUC, C-index)

Typical pipeline:

grs_check()  ->  grs_bgen2pgen()  ->  grs_score()  ->  grs_standardize()  ->  grs_validate()

Prerequisite: you must be authenticated to UKB RAP and have a project selected. See vignette("auth").


Step 1: Validate the Weights File – grs_check()

Before uploading to RAP, validate your SNP weights file. The function reads the file, checks required columns, flags problems, and writes a plink2-compatible space-delimited output.

Required columns:

Column Description
snp SNP identifier (expected rs + digits format)
effect_allele Effect allele: one of A / T / C / G
beta Effect size (log-OR or beta); must be numeric

library(ukbflow)

# Local: weights file on your machine
w <- grs_check("weights.csv", dest = "weights_clean.txt")
#> Read 'weights.csv': 312 rows, 5 columns.
#> ✔ No NA values.
#> ✔ No duplicate SNPs.
#> ✔ All SNP IDs match rs[0-9]+ format.
#> ✔ All effect alleles are A/T/C/G.
#> Beta summary:
#>   Range     : -0.412 to 0.589
#>   Mean |beta|: 0.183
#>   Positive  : 187 (59.9%)
#>   Negative  : 125 (40.1%)
#>   Zero      : 0
#> ✔ Weights file passed checks: 312 SNPs ready for UKB RAP.
#> ✔ Saved: 'weights_clean.txt'
# On RAP (RStudio) -- use /mnt/project/ paths directly
w <- grs_check(
  file = "/mnt/project/weights/weights.csv",
  dest = "/mnt/project/weights/weights_clean.txt"
)

The validated weights are also returned invisibly for inspection.
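The checks themselves are straightforward to reason about. A minimal base-R sketch of the same validations (an illustrative approximation, not the package's actual implementation; column names as in the table above):

```r
# Sketch of the validations grs_check() performs (assumed, not the package's
# internal code). Expects columns: snp, effect_allele, beta.
validate_weights <- function(w) {
  stopifnot(all(c("snp", "effect_allele", "beta") %in% names(w)))
  problems <- character(0)
  if (anyNA(w))                         problems <- c(problems, "NA values present")
  if (anyDuplicated(w$snp))             problems <- c(problems, "duplicate SNP IDs")
  if (!all(grepl("^rs[0-9]+$", w$snp))) problems <- c(problems, "non-rs SNP IDs")
  if (!all(w$effect_allele %in% c("A", "T", "C", "G")))
                                        problems <- c(problems, "invalid alleles")
  if (!is.numeric(w$beta))              problems <- c(problems, "non-numeric beta")
  problems
}

w <- data.frame(snp = c("rs123", "rs456"),
                effect_allele = c("A", "G"),
                beta = c(0.12, -0.05))
validate_weights(w)   # character(0) -- all checks pass
```

A weights file that fails any of these checks should be fixed before upload; scoring jobs that fail on RAP waste both time and compute credit.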


Step 2: Convert BGEN to PGEN – grs_bgen2pgen()

UKB imputed genotype data is stored in BGEN format on RAP. grs_bgen2pgen() submits one Swiss Army Knife job per chromosome to convert BGEN files to the faster PGEN format with a MAF filter applied via plink2.

This function is called from your local machine or RAP RStudio – the heavy lifting runs entirely in the cloud.
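For orientation, each submitted job runs a plink2 conversion. The exact flags are internal to grs_bgen2pgen() and may differ; the sketch below builds a representative command string, with hypothetical UKB file names:

```r
# Representative plink2 BGEN -> PGEN conversion for one chromosome.
# Assumption: flags and file names are illustrative, not the package's
# actual invocation on RAP.
chr <- 22
maf <- 0.01
cmd <- sprintf(
  paste("plink2 --bgen ukb_c%d_b0_v3.bgen ref-first",
        "--sample ukb_c%d_b0_v3.sample",
        "--maf %s --make-pgen --out chr%d"),
  chr, chr, maf, chr
)
cat(cmd)
```

The --maf filter is applied here, at conversion time, which is why grs_score() later needs the same maf value to locate the matching PGEN files.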

# Test on the smallest chromosome first
ids <- grs_bgen2pgen(chr = 22, dest = "/pgen/", priority = "high")
#> Uploading 'grs_bgen2pgen_std.R' to RAP ...
#> ✔ Uploaded: '/grs_bgen2pgen_std.R'
#> Submitting 1 job(s) -- mem2_ssd1_v2_x4 / priority: high
#> ✔ chr22 -> 'job-Gxxxxxxxxxxxxxxxxxxxxxxxx'
#> ✔ 1/1 job(s) submitted. Monitor with job_ls().
# Full run: small chromosomes on standard, large on upgraded instance
ids_small <- grs_bgen2pgen(chr = 17:22, dest = "/pgen/")
ids_large <- grs_bgen2pgen(chr = 1:16,  dest = "/pgen/", instance = "large")

# Monitor progress (job_wait() takes one job ID at a time)
job_ls()
for (id in c(ids_small, ids_large)) job_wait(id)

Instance types:

Preset DNAnexus instance Cores RAM SSD Recommended for
"standard" mem2_ssd1_v2_x4 4 12 GB 200 GB chr 15–22
"large" mem2_ssd2_v2_x8 8 28 GB 640 GB chr 1–16

Key arguments:

Argument Default Description
chr 1:22 Chromosomes to process
dest RAP output path for PGEN files (required)
maf 0.01 MAF filter passed to plink2 --maf
instance "standard" Instance type preset
priority "low" Job priority ("low" or "high")

Storage warning: chromosomes 1–16 may exceed the 200 GB SSD on "standard" instances. Use instance = "large" for these.


Step 3: Calculate GRS Scores – grs_score()

Once PGEN files are ready, grs_score() uploads your weights file(s) and submits one plink2 scoring job per GRS. Each job scores all 22 chromosomes and saves a single CSV to RAP.

ids <- grs_score(
  file     = c(
    grs_a = "weights/grs_a_weights.txt",
    grs_b = "weights/grs_b_weights.txt"
  ),
  pgen_dir = "/mnt/project/pgen",
  dest     = "/grs/",
  priority = "high"
)
#> ── Uploading 2 weight file(s) to RAP ────────────────────────────────────────
#> Uploading 'grs_a_weights.txt' ...
#> ✔ Uploaded: '/grs_a_weights.txt'
#> Uploading 'grs_b_weights.txt' ...
#> ✔ Uploaded: '/grs_b_weights.txt'
#> Submitting 2 job(s) -- mem2_ssd1_v2_x4 / priority: high
#> ✔ grs_a -> 'job-Gxxxxxxxxxxxxxxxxxxxxxxxx'
#> ✔ grs_b -> 'job-Gyyyyyyyyyyyyyyyyyyyyyyyy'
#> ✔ 2/2 job(s) submitted. Monitor with job_ls().

for (id in ids) job_wait(id)   # job_wait() takes one job ID at a time

When running from RAP RStudio with weights already at the project root, the upload step is skipped automatically:

# On RAP: weights already at /mnt/project/grs_a_weights.txt
ids <- grs_score(
  file     = c(grs_a = "/mnt/project/grs_a_weights.txt"),
  pgen_dir = "/mnt/project/pgen",
  dest     = "/grs/"
)
#> ℹ grs_a_weights.txt already at RAP root, skipping upload.

Important: the maf argument must match the value used in grs_bgen2pgen() so that the correct PGEN files are located.

Output per job: /grs/<GRS_name>_scores.csv with two columns: IID and the score, named GRS_<name>.
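Conceptually, each job sums per-chromosome plink2 score outputs into one total per individual. A small data.table sketch of that aggregation, using simulated .sscore-style tables (illustrative only; the jobs' actual code may differ):

```r
library(data.table)

# Simulated per-chromosome score-sum tables, mimicking plink2 .sscore output
# (IID + SCORE1_SUM). In the real pipeline there is one table per chromosome.
chr21 <- data.table(IID = c(1001, 1002, 1003), SCORE1_SUM = c(0.4, -0.1, 0.2))
chr22 <- data.table(IID = c(1001, 1002, 1003), SCORE1_SUM = c(0.1,  0.3, -0.2))

# Stack and sum across chromosomes: one total score per individual
all_chr <- rbindlist(list(chr21, chr22))
grs <- all_chr[, .(GRS_a = sum(SCORE1_SUM)), by = IID]
grs   # totals: 0.5, 0.2, 0.0
```

Because the per-SNP contributions are additive, summing per-chromosome score sums is equivalent to scoring all chromosomes at once.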


Step 4: Standardise GRS Columns – grs_standardize()

After downloading the score CSVs from RAP (via fetch_file()) and merging them into your analysis dataset, Z-score standardise each GRS column so that effect estimates are interpretable as per-SD associations.

# Auto-detect all columns containing "grs" (case-insensitive)
cohort <- grs_standardize(cohort)
#> Auto-detected 2 GRS column(s): 'GRS_a', 'GRS_b'
#> ✔ GRS_a -> GRS_a_z  [mean=0.0000, sd=1.0000]
#> ✔ GRS_b -> GRS_b_z  [mean=0.0000, sd=1.0000]
# Or specify columns explicitly
cohort <- grs_standardize(cohort, grs_cols = c("GRS_a", "GRS_b"))

grs_zscore() is an alias and produces an identical result.

The original columns are preserved; _z columns are inserted immediately after their source column.
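For reference, the standardisation is an ordinary Z-score. A base-R sketch of what GRS_a -> GRS_a_z computes (assumed equivalent; grs_standardize()'s NA handling may differ):

```r
# Z-score standardisation: (x - mean(x)) / sd(x). grs_standardize() is assumed
# to compute the equivalent of this for each GRS column.
cohort <- data.frame(IID = 1:5, GRS_a = c(0.10, 0.25, 0.05, 0.40, 0.20))
cohort$GRS_a_z <- (cohort$GRS_a - mean(cohort$GRS_a)) / sd(cohort$GRS_a)

round(mean(cohort$GRS_a_z), 10)  # 0
round(sd(cohort$GRS_a_z), 10)    # 1
```

After standardisation, a one-unit change in GRS_a_z corresponds to one SD of the raw score, which is what makes per-SD effect estimates comparable across scores.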


Step 5: Validate Predictive Performance – grs_validate()

grs_validate() runs four complementary validation analyses for each GRS:

Analysis What it measures
Per SD OR (logistic) or HR (Cox) per 1-SD increase in GRS
High vs Low OR / HR comparing top 20% vs bottom 20%
Trend test P-trend across Q1–Q4 quartiles
Discrimination AUC (logistic) or C-index (Cox) with 95% CI
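The underlying models are standard. A self-contained sketch of the per-SD and trend analyses on simulated data (illustrative: grs_validate()'s internals, e.g. covariate handling and CI construction, may differ):

```r
set.seed(42)
n <- 2000
grs_z   <- rnorm(n)                                # standardised GRS
outcome <- rbinom(n, 1, plogis(-2 + 0.3 * grs_z))  # simulated log-OR per SD = 0.3

# Per-SD association (logistic): OR per 1-SD increase in GRS
fit <- glm(outcome ~ grs_z, family = binomial)
or_per_sd <- exp(coef(fit)["grs_z"])

# Trend test across quartiles: quartile rank entered as a numeric score
q <- cut(grs_z, quantile(grs_z, probs = 0:4 / 4),
         include.lowest = TRUE, labels = FALSE)
fit_trend <- glm(outcome ~ q, family = binomial)
p_trend <- summary(fit_trend)$coefficients["q", "Pr(>|z|)"]

c(OR_per_SD = unname(or_per_sd), p_trend = p_trend)
```

The high-vs-low comparison works the same way, with a binary indicator for top 20% vs bottom 20% in place of grs_z.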

Logistic (cross-sectional)

res <- grs_validate(
  data        = cohort,
  grs_cols    = c("GRS_a_z", "GRS_b_z"),
  outcome_col = "outcome"
)
#> ── Creating GRS groups ───────────────────────────────────────────────────────
#> ── Effect per SD (OR) ───────────────────────────────────────────────────────
#> ── High vs Low ──────────────────────────────────────────────────────────────
#> ── Trend test ───────────────────────────────────────────────────────────────
#> ── AUC ──────────────────────────────────────────────────────────────────────
#> ✔ Validation complete.

res$per_sd
res$discrimination

Cox (survival)

res <- grs_validate(
  data        = cohort,
  grs_cols    = c("GRS_a_z", "GRS_b_z"),
  outcome_col = "outcome",
  time_col    = "followup_years",
  covariates  = c("age", "sex", paste0("pc", 1:10))
)

res$per_sd          # HR per SD x model
res$high_vs_low     # HR: top 20% vs bottom 20%
res$trend           # p-trend across Q1–Q4
res$discrimination  # C-index with 95% CI

Return value: a named list with four data.table elements.

Element Columns (logistic) Columns (Cox)
per_sd exposure, term, model, OR, CI_lower, CI_upper, p_value same with HR
high_vs_low same as per_sd same with HR
trend exposure, model, p_trend same
discrimination GRS, AUC, CI_lower, CI_upper GRS, C_index, CI_lower, CI_upper

AUC calculation requires the pROC package. Install with install.packages("pROC") if needed.


Complete Pipeline Example

library(ukbflow)

# 1. Validate weights (local)
grs_check("weights.csv", dest = "weights_clean.txt")

# 2. Convert BGEN -> PGEN on RAP (submit jobs)
ids_std  <- grs_bgen2pgen(chr = 17:22, dest = "/pgen/", maf = 0.01)
ids_lrg  <- grs_bgen2pgen(chr = 1:16,  dest = "/pgen/", maf = 0.01, instance = "large")
for (id in c(ids_std, ids_lrg)) job_wait(id)

# 3. Score GRS on RAP (submit jobs)
score_ids <- grs_score(
  file     = c(grs_a = "weights_clean.txt"),
  pgen_dir = "/mnt/project/pgen",
  maf      = 0.01,   # must match grs_bgen2pgen()
  dest     = "/grs/"
)
for (id in score_ids) job_wait(id)

# 4. Download score CSV from RAP
fetch_file("/grs/GRS_a_scores.csv", dest = "GRS_a_scores.csv")

# 5. Merge into analysis dataset and standardise
# cohort: your analysis data.table with IID and phenotype columns
scores <- data.table::fread("GRS_a_scores.csv")   # columns: IID, GRS_a
cohort <- scores[cohort, on = "IID"]               # right-join: keep all cohort rows
cohort <- grs_standardize(cohort, grs_cols = "GRS_a")

# 6. Validate
res <- grs_validate(
  data        = cohort,
  grs_cols    = "GRS_a_z",
  outcome_col = "outcome",
  time_col    = "followup_years",
  covariates  = c("age", "sex", paste0("pc", 1:10))
)

res$per_sd
res$discrimination

Getting Help

For details on any function's arguments and defaults, see its help page, e.g. ?grs_check or ?grs_validate.