This vignette demonstrates how to perform multi-trait colocalization analysis using summary statistics data, specifically focusing on the Sumstat_5traits dataset included in the package.
Sumstat_5traits Dataset

The Sumstat_5traits dataset contains summary statistics for 5 simulated traits, derived directly from the Ind_5traits dataset by marginal association. The dataset is specifically designed to evaluate and demonstrate the capabilities of ColocBoost in multi-trait colocalization analysis with summary association data.

- sumstat: A list of data.frames of summary statistics for different traits.
- true_effect_variants: True effect variant indices for each trait.

LD could be calculated from the X data in the Ind_5traits dataset, but it is not included in the Sumstat_5traits dataset.

The dataset features two causal variants, with indices 194 and 589. This structure creates a realistic scenario in which multiple traits are influenced by different but overlapping sets of genetic variants.
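To make "derived by marginal association" concrete, the sketch below computes per-variant z-scores by regressing a trait on one variant at a time. It uses base R on toy data; the names marginal_z, X_toy, and y_toy are illustrative assumptions, and this is not necessarily how the packaged dataset was generated.

```r
# Toy data: 200 individuals, 5 variants; only variant 2 has a true effect.
set.seed(1)
n_toy <- 200
p_toy <- 5
X_toy <- matrix(rnorm(n_toy * p_toy), nrow = n_toy, ncol = p_toy)
colnames(X_toy) <- paste0("SNP", seq_len(p_toy))
y_toy <- 1.0 * X_toy[, 2] + rnorm(n_toy)

# Marginal association: fit y ~ x separately for each variant and
# return the z-score (estimate divided by its standard error).
marginal_z <- function(X, y) {
  apply(X, 2, function(x) {
    coefs <- summary(lm(y ~ x))$coefficients
    coefs["x", "Estimate"] / coefs["x", "Std. Error"]
  })
}

z_toy <- marginal_z(X_toy, y_toy)

# Assemble a data.frame with the z, n, and variant columns used by sumstat entries.
sumstat_toy <- data.frame(z = z_toy, n = n_toy, variant = colnames(X_toy),
                          stringsAsFactors = FALSE)
```

Each variant is tested on its own, so variants correlated with a causal variant also pick up signal. That is why an LD matrix is needed alongside the summary statistics.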
# Loading the Dataset
data("Sumstat_5traits")
names(Sumstat_5traits)
#> [1] "sumstat" "true_effect_variants"
Sumstat_5traits$true_effect_variants
#> $Outcome_1
#> [1] 194
#>
#> $Outcome_2
#> [1] 194 589
#>
#> $Outcome_3
#> [1] 194 589
#>
#> $Outcome_4
#> [1] 194
#>
#> $Outcome_5
#> [1] 589
Due to the file size limits of a CRAN release, this is a subset of the simulated data. See the full dataset in the colocboost paper repo.
sumstat must include the following columns:

- z or (beta, sebeta): either the z-score or the effect size and its standard error.
- n: sample size for the summary statistics. Providing the sample size, or even a rough estimate of n, is highly recommended. Without n, the implicit assumption is that n is large (Inf) and the effect sizes are small (close to zero).
- variant: required if sumstat for different outcomes do not have the same number of variables (multiple sumstat and multiple LD).

When studying multiple traits with their own trait-specific LD matrices, you can provide a list of LD matrices matched with a list of summary statistics.
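sumstat and LD are organized as lists, matched by trait index:

- (sumstat[1], LD[1]) contains information for trait 1,
- (sumstat[2], LD[2]) contains information for trait 2, and so on.

# LD is not shipped with Sumstat_5traits; compute it from the Ind_5traits genotypes
data("Ind_5traits")
LD <- get_cormat(Ind_5traits$X[[1]])
# Duplicate LD with matched summary statistics
LD_multiple <- lapply(seq_along(Sumstat_5traits$sumstat), function(i) LD)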
# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD_multiple)
#> Validating input data.
#> Starting gradient boosting algorithm.
#> Gradient boosting for outcome 4 converged after 40 iterations!
#> Gradient boosting for outcome 5 converged after 59 iterations!
#> Gradient boosting for outcome 1 converged after 61 iterations!
#> Gradient boosting for outcome 3 converged after 91 iterations!
#> Gradient boosting for outcome 2 converged after 94 iterations!
#> Performing inference on colocalization events.
# Identified CoS
res$cos_details$cos$cos_index
#> $`cos1:y1_y2_y3_y4`
#> [1] 186 194 168 205
#>
#> $`cos2:y2_y3_y5`
#> [1] 589 593
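Before a run like the one above, a quick check of the inputs can catch format problems early. The following is a minimal sketch in base R, assuming the column names described earlier (z or beta/sebeta, n, variant). check_sumstat and check_matched_lists are hypothetical helpers, not part of colocboost, and the row-count comparison assumes each sumstat is already aligned to its LD matrix.

```r
# Hypothetical helper: checks one sumstat data.frame against the
# column requirements described in this vignette.
check_sumstat <- function(df) {
  has_z <- "z" %in% names(df)
  has_beta_se <- all(c("beta", "sebeta") %in% names(df))
  if (!has_z && !has_beta_se) {
    stop("sumstat needs either 'z' or both 'beta' and 'sebeta'")
  }
  if (!("n" %in% names(df))) {
    warning("no 'n' column: n is implicitly assumed to be large")
  }
  invisible(TRUE)
}

# Hypothetical helper: checks that sumstat and LD lists pair up by index.
check_matched_lists <- function(sumstat_list, LD_list) {
  stopifnot(length(sumstat_list) == length(LD_list))
  for (i in seq_along(sumstat_list)) {
    check_sumstat(sumstat_list[[i]])
    stopifnot(nrow(sumstat_list[[i]]) == nrow(LD_list[[i]]),
              nrow(LD_list[[i]]) == ncol(LD_list[[i]]))
  }
  invisible(TRUE)
}

# Toy usage with two traits, each paired with a 3-variant identity LD matrix.
toy_sumstat <- list(
  data.frame(z = c(0.5, 3.2, -1.1), n = 1000,
             variant = c("v1", "v2", "v3"), stringsAsFactors = FALSE),
  data.frame(beta = c(0.01, 0.20, -0.05), sebeta = c(0.05, 0.05, 0.05),
             n = 800, variant = c("v1", "v2", "v3"), stringsAsFactors = FALSE)
)
toy_LD <- list(diag(3), diag(3))
ok <- check_matched_lists(toy_sumstat, toy_LD)
```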
When the LD matrix includes a superset of variants across different summary statistics, the input format is:

- sumstat is a list of data.frames for all traits.
- LD is a matrix of linkage disequilibrium (LD) information for all variants across all traits.

# Create sumstat with different numbers of variants - remove 20 random variants from each sumstat
LD_superset <- LD
sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(1:nrow(x), 20), , drop = FALSE])
# Run colocboost
res <- colocboost(sumstat = sumstat, LD = LD_superset)
#> Validating input data.
#> Starting gradient boosting algorithm.
#> Gradient boosting for outcome 4 converged after 41 iterations!
#> Gradient boosting for outcome 5 converged after 60 iterations!
#> Gradient boosting for outcome 1 converged after 62 iterations!
#> Gradient boosting for outcome 3 converged after 93 iterations!
#> Gradient boosting for outcome 2 converged after 95 iterations!
#> Performing inference on colocalization events.
# Identified CoS
res$cos_details$cos$cos_index
#> $`cos1:y1_y2_y3_y4`
#> [1] 186 194 168 205
#>
#> $`cos2:y2_y3_y5`
#> [1] 589 593
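The superset idea can be illustrated by hand: if the LD matrix carries variant names, the block relevant to one trait is the sub-matrix indexed by that trait's variant column. This is a sketch on toy data with assumed names (LD_toy, sumstat_sub). It only shows the matching concept and makes no claim about how colocboost aligns variants internally.

```r
# Toy LD matrix covering 4 variants, with variant names as dimnames.
variants_all <- c("v1", "v2", "v3", "v4")
LD_toy <- matrix(0.2, nrow = 4, ncol = 4,
                 dimnames = list(variants_all, variants_all))
diag(LD_toy) <- 1

# A sumstat that covers only a subset of those variants.
sumstat_sub <- data.frame(z = c(2.1, -0.4, 1.3), n = 500,
                          variant = c("v1", "v3", "v4"),
                          stringsAsFactors = FALSE)

# Confirm the LD matrix is a superset, then pull the matching block.
stopifnot(all(sumstat_sub$variant %in% rownames(LD_toy)))
LD_sub <- LD_toy[sumstat_sub$variant, sumstat_sub$variant, drop = FALSE]
```

Indexing by name rather than by position keeps the rows of LD_sub in the same order as the rows of sumstat_sub.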
When different summary statistics need to be paired with LD matrices in an arbitrary way, ColocBoost also provides an interface for arbitrary LD matrices with multiple sumstat. This is particularly useful for meta-analysis across heterogeneous datasets where LD comes from different populations for different subsets of summary statistics.

- sumstat = list(sumstat1, sumstat2, sumstat3, sumstat4, sumstat5) is a list of data.frames for all traits.
- LD = list(LD1, LD2) is a list of LD matrices.
- dict_sumstatLD is a dictionary matrix that maps the index of each sumstat to the index of its LD matrix.

# Create a simple dictionary for demonstration purposes
LD_arbitrary <- list(LD, LD) # traits 1 and 2 matched to the first LD matrix; traits 3, 4, 5 matched to the second LD matrix.
dict_sumstatLD <- cbind(c(1:5), c(1, 1, 2, 2, 2))
# Display the dictionary
dict_sumstatLD
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 2 1
#> [3,] 3 2
#> [4,] 4 2
#> [5,] 5 2
# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD_arbitrary, dict_sumstatLD = dict_sumstatLD)
#> Validating input data.
#> Starting gradient boosting algorithm.
#> Gradient boosting for outcome 4 converged after 40 iterations!
#> Gradient boosting for outcome 5 converged after 59 iterations!
#> Gradient boosting for outcome 1 converged after 61 iterations!
#> Gradient boosting for outcome 3 converged after 91 iterations!
#> Gradient boosting for outcome 2 converged after 94 iterations!
#> Performing inference on colocalization events.
# Identified CoS
res$cos_details$cos$cos_index
#> $`cos1:y1_y2_y3_y4`
#> [1] 186 194 168 205
#>
#> $`cos2:y2_y3_y5`
#> [1] 589 593
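To read a dictionary like the one above, treat column 1 as the sumstat index and column 2 as the LD index. The sketch below shows two small lookups in base R. ld_index_for is a hypothetical helper written for illustration and is not a colocboost function.

```r
# Same layout as dict_sumstatLD: column 1 = sumstat index, column 2 = LD index.
dict_toy <- cbind(1:5, c(1, 1, 2, 2, 2))

# Hypothetical helper: return the LD index assigned to sumstat i.
ld_index_for <- function(dict, i) {
  dict[dict[, 1] == i, 2]
}

# Which LD matrix does trait 3 use?
ld_for_trait3 <- ld_index_for(dict_toy, 3)

# Group sumstat indices by the LD matrix they share.
groups <- split(dict_toy[, 1], dict_toy[, 2])
```

Here groups lists traits 1 and 2 under LD matrix 1 and traits 3, 4, and 5 under LD matrix 2, which matches the comment on LD_arbitrary.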
ColocBoost can also take summary statistics in a HyPrColoc-compatible format, with or without an LD matrix.
# Loading the Dataset
data(Ind_5traits)
X <- Ind_5traits$X
Y <- Ind_5traits$Y
# Converting to HyPrColoc compatible format
effect_est <- effect_se <- effect_n <- c()
for (i in seq_along(X)) {
  x <- X[[i]]
  y <- Y[[i]]
  effect_n[i] <- length(y)
  output <- susieR::univariate_regression(X = x, y = y)
  effect_est <- cbind(effect_est, output$beta)
  effect_se <- cbind(effect_se, output$sebeta)
}
colnames(effect_est) <- colnames(effect_se) <- c("Y1", "Y2", "Y3", "Y4", "Y5")
rownames(effect_est) <- rownames(effect_se) <- colnames(X[[1]])
# Run colocboost
LD <- get_cormat(Ind_5traits$X[[1]])
res <- colocboost(effect_est = effect_est, effect_se = effect_se, effect_n = effect_n, LD = LD)
#> Validating input data.
#> Starting gradient boosting algorithm.
#> Gradient boosting for outcome 4 converged after 40 iterations!
#> Gradient boosting for outcome 5 converged after 59 iterations!
#> Gradient boosting for outcome 1 converged after 61 iterations!
#> Gradient boosting for outcome 3 converged after 91 iterations!
#> Gradient boosting for outcome 2 converged after 94 iterations!
#> Performing inference on colocalization events.
# Identified CoS
res$cos_details$cos$cos_index
#> $`cos1:y1_y2_y3_y4`
#> [1] 186 194 168 205
#>
#> $`cos2:y2_y3_y5`
#> [1] 589 593
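The matrix layout used above (variants in rows, traits in columns) and the list-of-data.frames layout used earlier carry the same information. The sketch below converts one into the other on toy numbers. All object names ending in _toy, and sumstat_from_mat, are illustrative assumptions.

```r
# Toy effect estimates and standard errors: 3 variants (rows) x 2 traits (columns).
effect_est_toy <- matrix(c(0.10, 0.30, -0.02,
                           0.05, 0.25,  0.01),
                         nrow = 3,
                         dimnames = list(c("v1", "v2", "v3"), c("Y1", "Y2")))
effect_se_toy <- matrix(0.05, nrow = 3, ncol = 2,
                        dimnames = dimnames(effect_est_toy))
effect_n_toy <- c(1000, 800)

# z-score = effect estimate / standard error, element-wise.
z_mat <- effect_est_toy / effect_se_toy

# Split the matrices into one data.frame per trait.
sumstat_from_mat <- lapply(seq_len(ncol(effect_est_toy)), function(j) {
  data.frame(beta = effect_est_toy[, j],
             sebeta = effect_se_toy[, j],
             n = effect_n_toy[j],
             variant = rownames(effect_est_toy),
             stringsAsFactors = FALSE)
})
names(sumstat_from_mat) <- colnames(effect_est_toy)
```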
See more details about the data format for LD-free ColocBoost and LD-mismatch diagnosis in LD mismatch and LD-free Colocalization.