This script shows the importance of first clustering high-confidence variants and then attributing the meaningful ones to the identified clusters.
All functions are stored in reproduce_2.R to avoid displaying long blocks of code. Below are the values that will be used throughout the testing:
number_iterations <- 10
number_mutations <- 200
ndrivers <- 20
We first create a test set with QuantumCat: 6 clones, 200 variants, diploid, with an average depth of 100X, and two samples with respective purities of 70% and 60% (i.e. contamination of 0.3 and 0.4). We make sure these variants pass the stringent filters (i.e. depth \(\geq 50\)X).
toy.data<-QuantumCat_stringent(number_of_clones = 6,number_of_mutations = number_mutations,
ploidy = "AB",depth = 100,
contamination = c(0.3,0.4),min_depth = 50)
We check that all these variants pass the stringent filters (i.e. depth \(\geq 50\)X in both samples) by counting the variants that fail them, and display the first six rows of the first sample:
sum(toy.data[[1]]$Depth<50 | toy.data[[2]]$Depth<50)
## [1] 0
kable(toy.data[[1]][1:6,])
Chr | Start | Genotype | Cellularit | number_of_copies | Frequency | Depth | Alt |
---|---|---|---|---|---|---|---|
1 | 1 | AB | 100 | 1 | 35.00 | 110 | 45 |
6 | 2 | AB | 49 | 1 | 17.15 | 223 | 40 |
2 | 3 | AB | 28 | 1 | 9.80 | 106 | 7 |
1 | 4 | AB | 100 | 1 | 35.00 | 135 | 54 |
2 | 5 | AB | 28 | 1 | 9.80 | 76 | 10 |
6 | 6 | AB | 49 | 1 | 17.15 | 137 | 27 |
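The Frequency column above is consistent with a simple relation for a diploid (AB) locus: expected frequency = cellularity × purity × number_of_copies / 2, with purity = 1 - contamination (for example, 49 × 0.7 / 2 = 17.15). A minimal sketch of that relation follows. The helper name `expected_frequency`, its `tumor_copy_number` argument, and the binomial draw of the Alt counts are assumptions made for illustration, not necessarily how QuantumCat generates its data.

```r
# Hypothetical helper: expected variant allele frequency (in %) of a mutation.
expected_frequency <- function(cellularity, contamination,
                               number_of_copies = 1, tumor_copy_number = 2) {
  purity <- 1 - contamination
  # Average number of copies of the locus per cell in the sample
  # (normal cells are assumed to be diploid).
  total_copies <- purity * tumor_copy_number + (1 - purity) * 2
  cellularity * purity * number_of_copies / total_copies
}

# Cellularities (in %) of the first three rows of the table, first sample.
freq <- expected_frequency(c(100, 49, 28), contamination = 0.3)
freq

# Assumption: alternative read counts are binomial given the depth.
set.seed(1)
depth <- c(110, 223, 106)
alt <- rbinom(length(depth), size = depth, prob = freq / 100)
alt
```

The simulated counts will differ from the Alt column of the table, since the random draws are not the same.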
Then we create 200 mutations that pass only the permissive filters: 50 mutations with a depth between 30 and 50, 100 with a depth \(\geq 30\) in triploid (AAB) loci, and 50 with a depth \(\geq 30\) in a tetraploid (AABB) locus.
permissive<-QuantumCat_permissive(fromQuantumCat = toy.data ,number_of_mutations = number_mutations,
ploidy = "AB",depth = 100,
contamination = c(0.3,0.4),max_depth = 50, min_depth = 30)
kable(permissive[[1]][1:6,])
Chr | Start | Cellularit | Genotype | number_of_copies | Depth | Frequency | Alt |
---|---|---|---|---|---|---|---|
3 | 201 | 25 | AB | 1 | 31 | 8.75 | 3 |
1 | 202 | 100 | AB | 1 | 33 | 35.00 | 11 |
5 | 203 | 6 | AB | 1 | 34 | 2.10 | 0 |
6 | 204 | 49 | AB | 1 | 38 | 17.15 | 5 |
5 | 205 | 6 | AB | 1 | 44 | 2.10 | 1 |
2 | 206 | 28 | AB | 1 | 39 | 9.80 | 4 |
We now select 20 drivers among the 400 mutations, each draw having a probability of about \(3/4\) of falling in the permissive set.
drivers_id<-sample(1:(2*number_mutations),size = ndrivers,prob = rep(c(1/(4*number_mutations),
3/(4*number_mutations)),
each = number_mutations)
)
drivers_id<-drivers_id[order(drivers_id)]
drivers_id
## [1] 47 109 124 141 166 217 223 226 249 258 270 301 319 353 360 369 370
## [18] 384 388 397
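The sampling weights can be checked directly. The sketch below assumes that ids 1 to 200 index the stringent set and ids 201 to 400 the permissive set, which matches the Start column of the two tables above. It verifies that the weights sum to 1, that the permissive set carries 3/4 of the total weight, and counts how many of the printed drivers fall in each set.

```r
number_mutations <- 200

# Each stringent id gets weight 1/800, each permissive id gets 3/800.
weights <- rep(c(1 / (4 * number_mutations), 3 / (4 * number_mutations)),
               each = number_mutations)
total_weight <- sum(weights)
permissive_weight <- sum(weights[(number_mutations + 1):(2 * number_mutations)])

# Driver ids printed above.
drivers_id <- c(47, 109, 124, 141, 166, 217, 223, 226, 249, 258, 270, 301,
                319, 353, 360, 369, 370, 384, 388, 397)

# Assumption: ids above number_mutations belong to the permissive set.
in_permissive <- drivers_id > number_mutations
table(in_permissive)
mean(in_permissive)
```

In this particular draw, 15 of the 20 drivers come from the permissive set. Because `sample` draws without replacement, the share will vary around 3/4 from one run to the next.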
We now want to cluster mutations using only the filtered mutations (Paper pipeline), the filtered mutations plus the drivers (extended), or all mutations together (All), and compare the clustering quality of these three approaches.
ext<-extended(filtered = toy.data,
permissive = permissive,
drivers_id = drivers_id)
all<-All(filtered = toy.data,
permissive = permissive,
drivers_id = drivers_id
)
pap<-paper_pipeline(filtered = toy.data,
permissive = permissive,
drivers_id = drivers_id)
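To make the difference between the three inputs concrete, the sketch below lists which mutation ids each approach would cluster, following the description above. It only manipulates ids. How `paper_pipeline`, `extended` and `All` build their inputs internally is an assumption here, as is the convention that ids 201 to 400 are the permissive mutations.

```r
number_mutations <- 200
drivers_id <- c(47, 109, 124, 141, 166, 217, 223, 226, 249, 258, 270, 301,
                319, 353, 360, 369, 370, 384, 388, 397)

stringent_ids  <- 1:number_mutations
permissive_ids <- (number_mutations + 1):(2 * number_mutations)

# Paper pipeline: cluster the stringently filtered mutations only.
paper_ids <- stringent_ids
# Extended: stringent mutations plus every driver, whatever its filter.
extended_ids <- union(stringent_ids, drivers_id)
# All: every simulated mutation.
all_ids <- union(stringent_ids, permissive_ids)

n_clustered <- sapply(list(paper = paper_ids,
                           extended = extended_ids,
                           all = all_ids), length)
n_clustered
```

With the drivers drawn above, the three approaches would cluster 200, 215 and 400 mutations respectively.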
We are now going to compare the quality of clustering using the Normalized Mutual Information, the number of clusters found (the truth being 6), the maximal and average error in the distance of a driver to its real position. N.B:
Quality<-compare_qual(paper = pap,
extended = ext,
all = all,
drivers_id = drivers_id)
kable(Quality)
Pipeline | NMI | Max.Distance.to.clone | nclusters | mean.mut.error | mean.driv.error |
---|---|---|---|---|---|
paper | 0.5992793 | 0.0554961 | 5 | 0.0665081 | 0.2653133 |
extended | 0.6445918 | 0.0415484 | 4 | 0.0587591 | 0.2568724 |
all | 0.6478438 | 0.4110115 | 5 | 0.4036281 | 0.2695045 |
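For readers unfamiliar with the NMI column, a minimal base-R sketch of the Normalized Mutual Information between a true and a predicted clustering is given here. The function name `nmi` is hypothetical. The normalization by the geometric mean of the two entropies is one common convention and may not be the one used by `compare_qual`.

```r
# Hypothetical helper: NMI between two cluster assignments of the same items.
nmi <- function(truth, predicted) {
  stopifnot(length(truth) == length(predicted))
  # Joint and marginal distributions of the cluster labels.
  joint <- table(truth, predicted) / length(truth)
  p_truth <- rowSums(joint)
  p_pred  <- colSums(joint)

  entropy <- function(p) {
    p <- p[p > 0]
    -sum(p * log(p))
  }

  # Mutual information, summed over the non-empty cells only.
  nz <- joint > 0
  mi <- sum(joint[nz] * log(joint[nz] / outer(p_truth, p_pred)[nz]))

  h_truth <- entropy(p_truth)
  h_pred  <- entropy(p_pred)
  # Undefined when one partition puts everything in a single cluster.
  if (h_truth == 0 || h_pred == 0) return(NA_real_)
  mi / sqrt(h_truth * h_pred)
}

identical_partition   <- nmi(c(1, 1, 2, 2), c("a", "a", "b", "b"))
independent_partition <- nmi(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

The NMI equals 1 when the two partitions agree up to a relabeling of the clusters and 0 when they are independent, so higher values in the table indicate a clustering closer to the simulated truth.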
We now repeat this test 9 more times (number_iterations - 1), for a total of 10 iterations.
Quality<-rbind(Quality,
reproduce(number_iterations-1,
number_mutations,
ndrivers)
)
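Once the iterations are stacked, a per-pipeline summary is a natural first look at the results. The sketch below uses `aggregate` on a small data frame that contains only the first iteration, copied from the table above. It assumes that `reproduce` returns rows with the same column names as `compare_qual`.

```r
# Toy stand-in for Quality: the three rows of the first iteration only.
Quality <- data.frame(
  Pipeline = c("paper", "extended", "all"),
  NMI = c(0.5992793, 0.6445918, 0.6478438),
  nclusters = c(5, 4, 5),
  mean.driv.error = c(0.2653133, 0.2568724, 0.2695045)
)

# Mean of each quality measure per pipeline, across iterations.
summary_by_pipeline <- aggregate(
  cbind(NMI, nclusters, mean.driv.error) ~ Pipeline,
  data = Quality, FUN = mean
)
summary_by_pipeline
```

With the full `Quality` table, each mean would be taken over the 10 iterations of a pipeline rather than over a single row.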
We can plot these results: