The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

User Guide

library(ApplyPolygenicScore)

1 Introduction

ApplyPolygenicScore provides utilities for applying a polygenic score to genetic data.

This guide contains step-by-step instructions for how best to take advantage of the functions in this package to facilitate this process and transition smoothly into downstream analyses.

2 Input data

The application of a polygenic score to genetic data requires two basic inputs:

  1. A VCF file containing genotype data for a set of individuals.
  2. A polygenic score weight file containing the SNP coordinates and weights for the polygenic score.

Optionally, a third data type, phenotype data, can be provided to several ApplyPolygenicScore functions.

VCF file requirements: - VCF files must be in standard VCF format. - All multiallelic sites must be merged into one line per variant.

PGS weight file requirements: - PGS weight files provided by the PGS Catalog should be imported with the native package function import.pgs.weight.file to ensure proper formatting - PGS weight files from other sources must be formatted according to PGS Catalog standards for compatibility with the import function. - PGS weight files from other sources that are imported independently must have the required columns: CHROM, POS, effect_allele, beta and optionally ID, other_allele, and allelefrequency_effect. - Weights for multiple alleles at the same site can be provided as separate rows in the file with differing alternative alleles specified.

2.1 Input importation

ApplyPolygenicScore provides functions for the importation of input data into R, and the verification of these data for compatibility with other functions in the package.

2.1.1 Importing VCF data

Use the import.vcf function to import VCF data into R. This function is a wrapper around the vcfR package that ensures the VCF data is in the format expected by other package functions.

The example below imports a VCF file from GIAB, included in the package as an example file.

vcf.file <- system.file(
    'extdata',
    'HG001_GIAB.vcf.gz',
    package = 'ApplyPolygenicScore',
    mustWork = TRUE
    )

vcf.data <- import.vcf(
    vcf.path = vcf.file,
    info.fields = NULL,
    format.fields = NULL,
    verbose = TRUE
    )
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 234
##   header_line: 235
##   variant count: 63
##   column count: 11
## Meta line 234 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 63
##   Character matrix gt cols: 11
##   skip: 0
##   nrows: 63
##   row_num: 0
## Processed variant: 63
## All variants processed
## Extracting gt element DP
## Extracting gt element GQ
## Extracting gt element ADALL
## Extracting gt element AD
## Extracting gt element GT
## Extracting gt element PS
str(vcf.data)
## List of 2
##  $ dat : tibble [126 × 31] (S3: tbl_df/tbl/data.frame)
##   ..$ CHROM                          : chr [1:126] "chr1" "chr1" "chr1" "chr1" ...
##   ..$ POS                            : int [1:126] 87734095 87734095 111721652 111721652 154895579 154895579 184222582 184222582 219521561 219521561 ...
##   ..$ ID                             : chr [1:126] NA NA NA NA ...
##   ..$ REF                            : chr [1:126] "T" "T" "T" "T" ...
##   ..$ ALT                            : chr [1:126] "A" "A" "C" "C" ...
##   ..$ QUAL                           : num [1:126] 50 50 50 50 50 50 50 50 50 50 ...
##   ..$ FILTER                         : chr [1:126] "PASS" "PASS" "PASS" "PASS" ...
##   ..$ DPSum                          : int [1:126] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ platforms                      : int [1:126] 5 5 5 5 5 5 5 5 5 5 ...
##   ..$ platformnames                  : chr [1:126] "Illumina,PacBio,CG,10X,Solid" "Illumina,PacBio,CG,10X,Solid" "Illumina,PacBio,CG,10X,Solid" "Illumina,PacBio,CG,10X,Solid" ...
##   ..$ platformbias                   : chr [1:126] NA NA NA NA ...
##   ..$ datasets                       : int [1:126] 5 5 5 5 5 5 5 5 5 5 ...
##   ..$ datasetnames                   : chr [1:126] "HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR,SolidSE75bp" "HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR,SolidSE75bp" "HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR,SolidSE75bp" "HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR,SolidSE75bp" ...
##   ..$ datasetsmissingcall            : chr [1:126] "IonExome" "IonExome" "IonExome" "IonExome" ...
##   ..$ callsets                       : int [1:126] 7 7 7 7 7 7 7 7 7 7 ...
##   ..$ callsetnames                   : chr [1:126] "HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,10XLRGATK,SolidSE75GATKHC" "HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,10XLRGATK,SolidSE75GATKHC" "HiSeqPE300xGATK,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,CCS15kb_20kbDV,10XLRGATK,SolidSE75GATKHC" "HiSeqPE300xGATK,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,CCS15kb_20kbDV,10XLRGATK,SolidSE75GATKHC" ...
##   ..$ varType                        : chr [1:126] NA NA NA NA ...
##   ..$ filt                           : chr [1:126] NA NA NA NA ...
##   ..$ callable                       : chr [1:126] "CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_C"| __truncated__ "CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_C"| __truncated__ "CS_HiSeqPE300xGATK_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_CGnormal_callable,CS_HiSeqPE"| __truncated__ "CS_HiSeqPE300xGATK_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_CGnormal_callable,CS_HiSeqPE"| __truncated__ ...
##   ..$ difficultregion                : chr [1:126] NA NA NA NA ...
##   ..$ arbitrated                     : chr [1:126] NA NA NA NA ...
##   ..$ callsetwiththisuniqgenopassing : chr [1:126] NA NA NA NA ...
##   ..$ callsetwithotheruniqgenopassing: chr [1:126] NA NA NA NA ...
##   ..$ Indiv                          : chr [1:126] "HG001" "2:HG001" "HG001" "2:HG001" ...
##   ..$ gt_DP                          : int [1:126] 762 762 658 658 966 966 711 711 924 924 ...
##   ..$ gt_GQ                          : int [1:126] 371 371 1463 1463 1278 1278 395 395 425 425 ...
##   ..$ gt_ADALL                       : chr [1:126] "0,276" "0,276" "70,109" "70,109" ...
##   ..$ gt_AD                          : chr [1:126] "54,380" "54,380" "154,209" "154,209" ...
##   ..$ gt_GT                          : chr [1:126] "1/1" "1/1" "0/1" "0/1" ...
##   ..$ gt_PS                          : int [1:126] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ gt_GT_alleles                  : chr [1:126] "A/A" "A/A" "T/C" "T/C" ...
##  $ meta: tibble [22 × 5] (S3: tbl_df/tbl/data.frame)
##   ..$ Tag        : chr [1:22] "INFO" "INFO" "INFO" "INFO" ...
##   ..$ ID         : chr [1:22] "DPSum" "platforms" "platformnames" "platformbias" ...
##   ..$ Number     : chr [1:22] "1" "1" "." "." ...
##   ..$ Type       : chr [1:22] "Integer" "Integer" "String" "String" ...
##   ..$ Description: chr [1:22] "Total read depth summed across all datasets, excluding MQ0 reads" "Number of different platforms for which at least one callset called this genotype, whether filtered or not" "Names of platforms for which at least one callset called this genotype, whether filtered or not" "Names of platforms that have reads containing a variant at this location, but the high-confidence call is homoz"| __truncated__ ...

2.1.2 Importing polygenic score data

Use the import.pgs.weight.file function to import PGS weight files formatted according to PGS Catalog specifications. This function parses the header that is typically provided by PGS catalog files and differentiates between harmonized and not harmonized data columns.

The example below imports a PGS weight file from the PGS Catalog, included in the package as an example file.

pgs.file <- system.file(
    'extdata',
    'PGS003378_hmPOS_GRCh38.txt.gz',
    package = 'ApplyPolygenicScore',
    mustWork = TRUE
    )

pgs.data <- import.pgs.weight.file(
    pgs.weight.path = pgs.file,
    use.harmonized.data = TRUE
    )

str(pgs.data)
## List of 2
##  $ file.metadata  :'data.frame': 15 obs. of  2 variables:
##   ..$ key  : chr [1:15] "format_version" "pgs_id" "pgs_name" "trait_reported" ...
##   ..$ value: chr [1:15] "2.0" "PGS003378" "PSA_PGS_128" "Prostate-specific antigen (PSA) levels" ...
##  $ pgs.weight.data:'data.frame': 128 obs. of  16 variables:
##   ..$ rsID               : chr [1:128] "rs12131120" "rs2076591" "rs4951018" "rs12569177" ...
##   ..$ chr_name           : chr [1:128] "1" "1" "1" "1" ...
##   ..$ chr_position       : int [1:128] 88199778 112264274 205636334 51237409 219694903 205632217 154868055 184191716 219954267 60759747 ...
##   ..$ effect_allele      : chr [1:128] "T" "C" "C" "C" ...
##   ..$ other_allele       : chr [1:128] "A" "T" "A" "G" ...
##   ..$ effect_weight      : num [1:128] 0.0333 0.0317 0.0317 0.0264 0.0254 ...
##   ..$ locus_name         : chr [1:128] "1:88199778:T:A" "1:112264274:T:C" "1:205636334:A:C" "1:51237409:G:C" ...
##   ..$ hm_source          : chr [1:128] "ENSEMBL" "ENSEMBL" "ENSEMBL" "ENSEMBL" ...
##   ..$ hm_rsID            : chr [1:128] "rs12131120" "rs2076591" "rs4951018" "rs12569177" ...
##   ..$ hm_chr             : chr [1:128] "1" "1" "1" "1" ...
##   ..$ hm_pos             : int [1:128] 87734095 111721652 205667206 50771737 219521561 205663089 154895579 184222582 219780925 60532612 ...
##   ..$ hm_inferOtherAllele: logi [1:128] NA NA NA NA NA NA ...
##   ..$ ID                 : chr [1:128] "rs12131120" "rs2076591" "rs4951018" "rs12569177" ...
##   ..$ CHROM              : chr [1:128] "1" "1" "1" "1" ...
##   ..$ POS                : int [1:128] 87734095 111721652 205667206 50771737 219521561 205663089 154895579 184222582 219780925 60532612 ...
##   ..$ beta               : num [1:128] 0.0333 0.0317 0.0317 0.0264 0.0254 ...

The output of the import.pgs.weight.file function is a list with one data frame containing the SNP coordinates and weights for the polygenic score, with standardized column names expected by other package functions, and a second data frame containing the header information from the PGS catalog file.

2.1.3 Importing phenotype data

Providing phenotype data is an optional feature of several ApplyPolygenicScore functions. Phenotype data must be imported as a data frame and must contain a column named Indiv corresponding to the individual IDs in the VCF file.

# Isolating the individual IDs from the VCF data
vcf.individuals <- unique(vcf.data$dat$Indiv)

# Simulating phenotype data
set.seed(123)
phenotype.data <- data.frame(
    Indiv = vcf.individuals,
    continuous.phenotype = rnorm(length(vcf.individuals)),
    binary.phenotype = rbinom(length(vcf.individuals), 1, 0.5)
    )

head(phenotype.data)
##     Indiv continuous.phenotype binary.phenotype
## 1   HG001           -0.5604756                1
## 2 2:HG001           -0.2301775                0

2.2 Creating a BED coordinate file

VCF files can be very large. Sometimes they are too large to be imported into R. In these cases, it is useful to first filter the VCF file to just the variants that are included in the PGS you wish to calculate and reduce file size. This is best done using command line tools designed for VCF file manipulation. For filtering, they typically require a BED file containing the coordinates of the variants you wish to keep. To simplify this process, ApplyPolygenicScore provides functions for converting PGS weight files to BED coordinate files.

2.2.1 Conversion of PGS weight files to BED coordinate format

BED format requires the following first three columns: chromosome name, start position, and end position. PGS weight files only contain the chromosome name and end position of each variant, so must be reformatted with an additional column for the start position, and with the correct column order.

Note: When writing BED formatted data frames to files, make sure to use tab-separated values and not include row names. Additionally, most tools do not accept BED files with column names. If you wish to maintain a header, you may need to add a comment character to the first line of the file: # chr start end

Use the convert.pgs.to.bed function to convert a PGS weight file to a BED coordinate data frame.

pgs.coordinate.info <- pgs.data$pgs.weight.data

pgs.bed.format <- convert.pgs.to.bed(
    pgs.weight.data = pgs.coordinate.info,
    chr.prefix = TRUE,
    numeric.sex.chr = FALSE,
    slop = 10
    )

head(pgs.bed.format)
##    chr     start       end       rsID chr_name chr_position effect_allele
## 1 chr1  87734084  87734105 rs12131120        1     88199778             T
## 2 chr1 111721641 111721662  rs2076591        1    112264274             C
## 3 chr1 205667195 205667216  rs4951018        1    205636334             C
## 4 chr1  50771726  50771747 rs12569177        1     51237409             C
## 5 chr1 219521550 219521571 rs12034581        1    219694903             C
## 6 chr1 205663078 205663099          .        1    205632217             G
##                            other_allele effect_weight
## 1                                     A        0.0333
## 2                                     T        0.0317
## 3                                     A        0.0317
## 4                                     G        0.0264
## 5                                     A        0.0254
## 6 GGCCGACAGCCCTTCTGCTGGCTCGGTGGGGCCCAGC        0.1628
##                   locus_name hm_source    hm_rsID hm_chr    hm_pos
## 1             1:88199778:T:A   ENSEMBL rs12131120      1  87734095
## 2            1:112264274:T:C   ENSEMBL  rs2076591      1 111721652
## 3            1:205636334:A:C   ENSEMBL  rs4951018      1 205667206
## 4             1:51237409:G:C   ENSEMBL rs12569177      1  50771737
## 5            1:219694903:C:A   ENSEMBL rs12034581      1 219521561
## 6 SLC45A3_p.Leu223_Ala234del  liftover                 1 205663089
##   hm_inferOtherAllele         ID   beta
## 1                  NA rs12131120 0.0333
## 2                  NA  rs2076591 0.0317
## 3                  NA  rs4951018 0.0317
## 4                  NA rs12569177 0.0264
## 5                  NA rs12034581 0.0254
## 6                  NA            0.1628

Coordinate files must match the coordinate style used in the VCF file you wish to filter. The convert.pgs.to.bed function provides options for formatting chromosome names, as these tend to vary between human genome reference GRCh38 and GRCh37 aligned files. Use chr.prefix = TRUE to add ‘chr’ to the chromosome name (GRCh38 style) and chr.prefix = FALSE to remove ‘chr’ from the chromosome name (GRCh37 style). Use numeric.sex.chr = FALSE to format the X and Y chromosomes as ‘X’ and ‘Y’ respectively, and numeric.sex.chr = TRUE to format the X and Y chromosomes as ‘23’ and ‘24’ respectively.

The slop option imitates bedtools nomenclature for adding base pairs to the start and end of a set of coordinates. slop = 10 adds 10 base pairs to the start and end of each variant coordinate.

Here is an example of BED coordinates for a variant on chromosome 1 at the 20th base pair.

No slop:

chr start end
chr1 19 20

With slop of 10 base pairs:

chr start end
chr1 9 30

2.2.2 Merging coordinates from multiple polygenic scores

What if you want to apply multiple polygenic scores to the same VCF file? Instead of filtering the VCF file multiple times, you can use the combine.pgs.bed function to merge multiple BED data frames into a single set of coordinates, and filter your VCF just once for the union of all variants in multiple PGSs.

# convert your PGS weight files with no added slop
pgs.bed1 <- convert.pgs.to.bed(pgs.coordinate.info, slop = 0)

# simulating a second PGS with all coordinates shifted by 20 base pairs.
pgs.bed2 <- pgs.bed1
pgs.bed2$start <- pgs.bed2$start + 20
pgs.bed2$end <- pgs.bed2$end + 20

# Input must be a named list
pgs.bed.list <- list(PGS1 = pgs.bed1, PGS2 = pgs.bed2)

merged.pgs.bed <- combine.pgs.bed(
    pgs.bed.list = pgs.bed.list,
    add.annotation.data = TRUE,
    annotation.column.index = which(colnames(pgs.bed1) == 'rsID') # keep information from the rsID column during merge
    )

str(merged.pgs.bed)
## 'data.frame':    256 obs. of  4 variables:
##  $ chr       : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ start     : num  5.08e+07 5.08e+07 8.77e+07 8.77e+07 1.12e+08 ...
##  $ end       : num  5.08e+07 5.08e+07 8.77e+07 8.77e+07 1.12e+08 ...
##  $ annotation: chr  "rs12569177|PGS1" "rs12569177|PGS2" "rs12131120|PGS1" "rs12131120|PGS2" ...

combine.pgs.bed automatically annotates each interval (in a new column) with the name of the origin PGS. It provides the option of adding information from one additional column from the inputs to the annotation in the merged output. In the cases of overlapping intervals (e.g. the same variant is included in multiple PGSs), overlapping annotations are concatenated with a comma.

2.3 Input data validation

Both the import.vcf and import.pgs.weight.file functions perform some basic validation of the input data during import. For example, PGS files are checked for duplicate variants, and VCF files are checked for unmerged multiallelic sites. If you have not used the native import functions, you may wish to check that your data is compatible with the polygenic score application functions in this package.

The apply.polygenic.score function performs extensive input validation prior to starting the PGS application process. It can be configured to just run the validation step to check your inputs first.

apply.polygenic.score(
    vcf.data = vcf.data$dat,
    pgs.weight.data = pgs.data$pgs.weight.data,
    phenotype.data = phenotype.data,
    validate.inputs.only = TRUE
    )
## Input data passed validation
## [1] TRUE

3 Polygenic Score Application

3.1 Basic usage

The main function of ApplyPolygenicScore is apply.polygenic.score. It comes with some bells and whistles, but the basic usage is simple. Provide the VCF data and the PGS weight data, and specify how you wish the algorithm to handle missing genotypes.

pgs.results <- apply.polygenic.score(
    vcf.data = vcf.data$dat,
    pgs.weight.data = pgs.data$pgs.weight.data,
    correct.strand.flips = FALSE, # no strand flip check to avoid warnings
    missing.genotype.method = 'none'
    )
## Warning in combine.vcf.with.pgs(vcf.data = vcf.data, pgs.weight.data =
## pgs.weight.data): PGS is missing 66 SNPs from VCF
str(pgs.results)
## List of 2
##  $ pgs.output       :'data.frame':   2 obs. of  7 variables:
##   ..$ Indiv                    : chr [1:2] "2:HG001" "HG001"
##   ..$ PGS                      : num [1:2] 1.98 1.98
##   ..$ percentile               : num [1:2] 1 1
##   ..$ decile                   : int [1:2] 10 10
##   ..$ quartile                 : int [1:2] 4 4
##   ..$ n.missing.genotypes      : num [1:2] 66 66
##   ..$ percent.missing.genotypes: num [1:2] 0.52 0.52
##  $ regression.output: NULL

We see several things. First, a warning was printed by combine.vcf.with.pgs. This is a function called by apply.polygenic.score to merge vcf and pgs data by genomic coordinates. The warning indicates that some variants from the PGS weight data were not found in the VCF data. These variants were not called in any individual. apply.polygenic.score does not apply any weights to these variants, which effectively assumes their dosage as homozygous reference for all individuals. Checkout our discussion on missing genotype data for more details on how missing variants come to be.

Next, we see the output of the function. The output is a list with two elements: pgs.output, and regression.output. regression.output is empty because we did not provide the optional phenotype.data, more on that later. pgs.output is a data frame with the individual IDs from the VCF and the calculated polygenic score in the PGS column for each individual. As we will see later, each missing genotype method creates a uniquely named column of PGS values. Next, the percentile, decile, and quartile columns report the respective percentile information for each individual’s PGS among the distribution of the entire cohort in the VCF data. These values look a bit strange in this example because our example VCF contains only two individuals which have identical genotypes. In a real-world scenario, these values would be more informative. The next two columns report the number and percentage of missing PGS genotypes for each individual. This number should be equal to or greater than the number reported in the warning message from combine.vcf.with.pgs.

3.1.1 combine.vcf.with.pgs

A quick aside about this function. While it is internal to apply.polygenic.score, it is also available for use on its own. The output includes a list of the variants that were not found in the VCF data, which may be useful for troubleshooting missing genotype data.

merged.vcf.pgs.data <- combine.vcf.with.pgs(
    vcf.data = vcf.data$dat,
    pgs.weight.data = pgs.data$pgs.weight.data
    )
## Warning in combine.vcf.with.pgs(vcf.data = vcf.data$dat, pgs.weight.data =
## pgs.data$pgs.weight.data): PGS is missing 66 SNPs from VCF
names(merged.vcf.pgs.data)
## [1] "merged.vcf.with.pgs.data" "missing.snp.data"
head(merged.vcf.pgs.data$missing.snp.data)[, 1:6]
##   ID        rsID chr_name chr_position effect_allele
## 1              .        1    205632217             G
## 2     rs34009884       17      7793367             C
## 3    rs113920094       19     51349841             C
## 4    rs374546878       19     51360398            GA
## 5    rs112103380       19     51378275             C
## 6     rs73583119        6    129168057             A
##                            other_allele
## 1 GGCCGACAGCCCTTCTGCTGGCTCGGTGGGGCCCAGC
## 2                                   CTG
## 3                                     T
## 4                                     G
## 5                                     G
## 6                                     T

3.2 Allele matching

After PGS and VCF data are merged by coordinates, you may also wish to check that the alleles in the VCF data variant record match the corresponding alleles in the PGS weight data. By convention, the other_allele (non-effect allele) in the PGS weight data should match the reference (REF) allele in the VCF data, and the effect_allele should match the alternate (ALT) allele in the VCF data. Checkout our discussion on allele matching for more details on the various ways in which this matching may fail.

apply.polygenic.score can perform an allele match check (internally using assess.pgs.vcf.allele.match). This functionality can be customized with the following arguments: - correct.strand.flips to automatically correct mismatches caused by strand flips by flipping the PGS weight data alleles. - remove.ambiguous.allele.matches to remove variants with mismatched alleles that cannot be corrected by flipping. The variant will be treated as a cohort-wide missing variant and handled according to missing variant rules described below. - remove.mismatched.indels to remove variants with mismatched INDEL alleles. These variants will be treated as cohort-wide missing variants and handled according to missing variant rules described below.

# Most strict allele match handling
strict.allele.match.result <- apply.polygenic.score(
    vcf.data = vcf.data$dat,
    pgs.weight.data = pgs.data$pgs.weight.data,
    missing.genotype.method = 'none',
    correct.strand.flips = TRUE,
    remove.ambiguous.allele.matches = TRUE,
    remove.mismatched.indels = TRUE
    );

# Less strict allele match handling
less.strict.allele.match.result <- apply.polygenic.score(
    vcf.data = vcf.data$dat,
    pgs.weight.data = pgs.data$pgs.weight.data,
    missing.genotype.method = 'none',
    correct.strand.flips = TRUE,
    remove.ambiguous.allele.matches = FALSE,
    remove.mismatched.indels = FALSE
    );

3.3 Missing genotype methods

Let’s explore all the different options for handling missing genotype data. The missing.genotype.method argument can be set to one of the following options: none, normalize, and mean.dosage. Any combination of these methods can be applied in the same call to apply.polygenic.score, producing the corresponding number of polygenic score output columns. For details on how each of these methods work, head back to our missing genotype data discussion post.

Briefly:

all.missing.methods.pgs.results <- apply.polygenic.score(
    vcf.data = vcf.data$dat,
    pgs.weight.data = pgs.data$pgs.weight.data,
    correct.strand.flips = FALSE, # no strand flip check to avoid warnings
     missing.genotype.method = c('none', 'normalize', 'mean.dosage')
    )
## Warning in combine.vcf.with.pgs(vcf.data = vcf.data, pgs.weight.data =
## pgs.weight.data): PGS is missing 66 SNPs from VCF
head(all.missing.methods.pgs.results$pgs.output)
##     Indiv    PGS PGS.with.normalized.missing PGS.with.replaced.missing
## 1 2:HG001 1.9761                  0.01593629                    1.9761
## 2   HG001 1.9761                  0.01593629                    1.9761
##   percentile decile quartile n.missing.genotypes percent.missing.genotypes
## 1          1     10        4                  66                      0.52
## 2          1     10        4                  66                      0.52

We can see that the output now contains three columns of PGS values, one for each missing genotype method.

missing genotype method column name
none PGS
normalize PGS.with.normalized.missing
mean.dosage PGS.with.replaced.missing

The PGS.with.normalized.missing contains the PGS values divided by 124: the number of non-missing alleles in the cohort. As we can see from the n.missing.genotypes column, the number of missing variants is 66. The number of variants in the example PGS weight file is 128, so the number of non-missing genotypes is 128 - 66 = 62; multiply by two for a diploid genome (2 alleles per genotype): 62 * 2 = 124.

The PGS.with.replaced.missing column is identical to the PGS column in this case, since all missing genotypes were missing from all individuals in the cohort, all such dosages were assumed as homozygous reference, and no change from the none method was made.

What if you have a list of population allele frequencies for the variants in your PGS weight file? Surely, when implementing the mean.dosage method, values derived from a massive public dataset would be more accurate than inferring the population dosage from your cohort! No problem, simply add these frequencies to your pgs weight file, in PGS Catalog format, and use the use.external.effect.allele.frequency argument to tell apply.polygenic.score to use them instead.

3.4 Custom percentiles

Curious to see how your PGS values distribute among quintiles? No problem. Specify a custom number of percentiles with n.percentiles.

custom.percentiles.pgs.results <- apply.polygenic.score(
    vcf.data = vcf.data$dat,
    pgs.weight.data = pgs.data$pgs.weight.data,
    correct.strand.flips = FALSE, # no strand flip check to avoid warnings
    n.percentiles = 5
    )
## Warning in combine.vcf.with.pgs(vcf.data = vcf.data, pgs.weight.data =
## pgs.weight.data): PGS is missing 66 SNPs from VCF
head(custom.percentiles.pgs.results$pgs.output)
##     Indiv PGS.with.replaced.missing percentile decile quartile percentile.5
## 1 2:HG001                    1.9761          1     10        4            5
## 2   HG001                    1.9761          1     10        4            5
##   n.missing.genotypes percent.missing.genotypes
## 1                  66                      0.52
## 2                  66                      0.52

3.5 Phenotype analysis

Let’s explore the extra features provided for handling phenotype data.

Since these features are much more useful in larger cohorts, let’s start by loading a larger VCF file. The following example file contains fake genotype data for 10 fake individuals, along with some simulated phenotype data and a fake set of PGS weights.

phenotype.test.data.path <- system.file(
    'extdata',
    'phenotype.test.data.Rda',
    package = 'ApplyPolygenicScore',
    mustWork = TRUE
    )

load(phenotype.test.data.path)
str(phenotype.test.data)
## List of 3
##  $ vcf.data       :'data.frame': 100 obs. of  7 variables:
##   ..$ CHROM        : chr [1:100] "chr1" "chr1" "chr1" "chr1" ...
##   ..$ POS          : int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ ID           : chr [1:100] "rs1" "rs1" "rs1" "rs1" ...
##   ..$ REF          : chr [1:100] "T" "T" "T" "T" ...
##   ..$ ALT          : chr [1:100] "A" "A" "A" "A" ...
##   ..$ Indiv        : chr [1:100] "sample1" "sample2" "sample3" "sample4" ...
##   ..$ gt_GT_alleles: chr [1:100] "T/T" "T/T" "T/T" "T/T" ...
##  $ pgs.weight.data:'data.frame': 10 obs. of  6 variables:
##   ..$ CHROM        : chr [1:10] "chr1" "chr2" "chr3" "chr4" ...
##   ..$ POS          : int [1:10] 1 2 3 4 5 6 7 8 9 10
##   ..$ ID           : chr [1:10] "rs1" "rs2" "rs3" "rs4" ...
##   ..$ effect_allele: chr [1:10] "A" "A" "A" "A" ...
##   ..$ other_allele : chr [1:10] "T" "T" "T" "T" ...
##   ..$ beta         : num [1:10] -0.432 1.69 1.228 0.276 -1.049 ...
##  $ phenotype.data :'data.frame': 10 obs. of  5 variables:
##   ..$ Indiv                  : chr [1:10] "sample1" "sample2" "sample3" "sample4" ...
##   ..$ continuous.phenotype   : num [1:10] -0.468 -0.773 2.15 -1.334 0.496 ...
##   ..$ binary.phenotype       : int [1:10] 1 1 0 0 0 0 1 0 0 0
##   ..$ binary.factor.phenotype: Factor w/ 2 levels "control","case": 2 2 1 1 1 1 2 1 1 1
##   ..$ categorical.phenotype  : Factor w/ 5 levels "a","b","c","d",..: 3 1 2 1 2 4 5 5 3 1

When provided with phenotype data, apply.polygenic.score merges the phenotype data with the PGS output.

pgs.results.with.phenotype <- apply.polygenic.score(
    vcf.data = phenotype.test.data$vcf.data,
    pgs.weight.data = phenotype.test.data$pgs.weight.data,
    phenotype.data = phenotype.test.data$phenotype.data
    )

head(pgs.results.with.phenotype$pgs.output)
##      Indiv PGS.with.replaced.missing percentile decile quartile
## 1  sample1                -1.5698448        0.2      2        1
## 2 sample10                 0.6369117        0.7      7        3
## 3  sample2                 1.8992260        0.9      9        4
## 4  sample3                -0.7625591        0.5      5        2
## 5  sample4                -1.0417387        0.4      4        2
## 6  sample5                -2.1401365        0.1      1        1
##   n.missing.genotypes percent.missing.genotypes continuous.phenotype
## 1                   0                         0           -0.4682005
## 2                   0                         0           -0.1524106
## 3                   0                         0           -0.7729782
## 4                   0                         0            2.1499193
## 5                   0                         0           -1.3343536
## 6                   0                         0            0.4958705
##   binary.phenotype binary.factor.phenotype categorical.phenotype
## 1                1                    case                     c
## 2                0                 control                     a
## 3                1                    case                     a
## 4                0                 control                     b
## 5                0                 control                     a
## 6                0                 control                     b

Here we see that the pgs.output data frame now contains the phenotype data columns in addition to the PGS columns.

When provided with a vector of phenotype.analysis.columns, apply.polygenic.score automatically detects the data type of each column (continuous or binary), and performs a regression (linear or logistic) between the PGS and each phenotype column. This is particularly useful to quickly see how well your PGS predicts the trait it is intended to predict, or whether it is associated with any other variables of interest.

pgs.results.with.phenotype.analysis <- apply.polygenic.score(
    vcf.data = phenotype.test.data$vcf.data,
    pgs.weight.data = phenotype.test.data$pgs.weight.data,
    phenotype.data = phenotype.test.data$phenotype.data,
    phenotype.analysis.columns = c('continuous.phenotype', 'binary.phenotype')
    )

head(pgs.results.with.phenotype.analysis$regression.output)
##                                 phenotype               model        beta
## continuous.phenotype continuous.phenotype   linear.regression -0.01401100
## binary.phenotype         binary.phenotype logistic.regression -0.02360666
##                             se   p.value    r.squared       AUC
## continuous.phenotype 0.2058113 0.9473952 0.0005789728        NA
## binary.phenotype     0.4185203 0.9550191           NA 0.5238095

We finally see a data frame in the regression.output element of the output list. This data frame contains the results of the regression analysis for each compatible phenotype. An accuracy metric in the form of an r-squared value is computed for linear regression, and an Area Under the (Receiver Operator) Curve (AUC) value is computed for logistic regression.

If multiple missing genotype methods are being used at once, producing multiple PGS columns, the regression analysis is only performed on one of these and defaults to the first method specified in missing.genotype.method. If you wish to use one of the other PGSs for analysis, the source PGS column for regression (and percentile) analysis can be specified using the analysis.source.pgs argument.

4 Data Visualization

ApplyPolygenicScore provides three plotting functions designed to operate on the output of apply.polygenic.score:

  1. create.pgs.density.plot
  2. create.pgs.with.continuous.phenotype.plot
  3. create.pgs.rank.plot

These visualizations are intended to provide quick quality assessments of PGS in a cohort. They really shine when combined with phenotype data to produce at-a-glance summaries of the relationship between PGS and phenotypes.

4.1 Common plotting arguments

Plotting functions are built on the BoutrosLab.plotting.general package and produce lattice multipanel plot objects. Each function can be provided with an output.directory, a filename.prefix, and a file.extension to write the plots to a file. A filename will be generated using the current date and the provided prefix, and the plot will be written in the format specified by the extension.

Note: in the examples below, all plots are written to a temporary directory from which they are read and displayed inline in the vignette.

Plotting functions automatically detect and plot data from the standardized PGS output columns produced by apply.polygenic.score: PGS, PGS.with.normalized.missing, and PGS.with.replaced.missing. Plots are automatically titled with the name of the PGS column being plotted, and the titles can be “tidied up” to remove period characters and replace them with spaces using the tidy.titles argument.

Multipanel plot dimensions can be controlled with width and height arguments, which are interpreted as inches. If your plots are looking squished, try increasing these values. Axis and title label font sizes can be controlled with xaxis.cex, yaxis.cex, and titles.cex arguments. Border padding can be controlled with border.padding and is applied to all four sides of the plot at once.

4.2 PGS Density

create.pgs.density.plot plots the PGS distribution in the cohort.

4.2.1 Basic plot

create.pgs.density.plot(
    pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
    output.dir = temp.dir,
    filename.prefix = 'vignette-example-basic',
    file.extension = 'png'
    )
## NULL

4.2.2 Add phenotypes

If provided with phenotype column names, this function will plot individual PGS density curves for each category in a categorical phenotype. - Non-categorical phenotypes are ignored. - Colors are automatically assigned up to 12 categories. - If more than 12 categories are present, all categories are plotted in black. - When multiple PGS columns are present and/or multiple phenotype columns are specified, each combination is plotted in a separate panel. - Panels are arranged in a grid with unique PGS inputs on the x-axis and unique phenotype inputs on the y-axis.

create.pgs.density.plot(
    pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
    output.dir = temp.dir,
    filename.prefix = 'vignette-example-phenotype',
    file.extension = 'png',
    tidy.titles = TRUE,
    phenotype.columns = c('binary.factor.phenotype', 'categorical.phenotype', 'continuous.phenotype')
    )
## NULL

4.3 PGS Correlation

create.pgs.with.continuous.phenotype.plot plots a scatterplot between the PGS and a continuous phenotype, and reports a correlation. Non-continuous phenotypes are ignored. This function can only be used if phenotype data is provided.

4.3.1 Basic plot

create.pgs.with.continuous.phenotype.plot(
    pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
    output.dir = temp.dir,
    filename.prefix = 'vignette-example-correlation',
    file.extension = 'png',
    tidy.titles = TRUE,
    phenotype.columns = c('continuous.phenotype', 'categorical.phenotype')
    )
## NULL

Unsurprisingly, our randomly generated phenotype data is not correlated with our fake PGS.

  • When multiple PGS columns are present and/or multiple phenotype columns are specified, each combination is plotted in a separate panel.
  • Panels are arranged in a grid with unique PGS inputs on the x-axis and unique phenotype inputs on the y-axis.

4.4 PGS Percentile Rank

create.pgs.rank.plot plots several panels of per-individual data, each in the order of increasing PGS percentile rank.

At the top, a barplot of missing PGS variant genotype counts or percents for each individual (only if at least one individual is missing a variant). Next, a barplot of the percentile rank of each individual. Next, a covariate bar of the PGS quartile and decile of each individual. Next, a covariate bar of categories for each provided categorical phenotype in each individual. Finally, a heatmap of continuous phenotype values for each provided continuous phenotype in each individual.

4.4.1 Basic plot

# Introducing some missing genotypes to demonstrate the missing genotype barplot
pgs.results.with.phenotype.analysis$pgs.output$n.missing.genotypes <- rep(c(0, 1, 0, 2, 1), 2)

create.pgs.rank.plot(
    pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
    output.dir = temp.dir,
    filename.prefix = 'vignette-example-basic',
    file.extension = 'png'
    )
## NULL

4.4.2 Add phenotypes

create.pgs.rank.plot(
    pgs.data = pgs.results.with.phenotype.analysis$pgs.output,
    output.dir = temp.dir,
    filename.prefix = 'vignette-example-phenotype',
    file.extension = 'png',
    phenotype.columns = c('binary.factor.phenotype', 'binary.phenotype', 'categorical.phenotype', 'continuous.phenotype')
    )
## [1] "springgreen3"
## [1] "darkturquoise"
## [1] "deepskyblue"
## NULL

  • Legend titles are inherited from phenotype column names.
  • For cohorts exceeding 50 individuals, sample labels are automatically removed.
  • If missing phenotypes exist, they are plotted in gray and accompanied by a missingness legend.

4.4.3 Optional arguments

This function comes with a handful of additional arguments to control the appearance of the plot.

  • Use missing.genotype.style to toggle between displaying counts and percentages.
  • Use categorical.palette to specify a custom color palette for categorical phenotype bars.
  • Use binary.pallette to specify a custom color palette for binary and continuous phenotype bars.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.