The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This pipeline is designed for filtering mutations found in formalin-fixed and paraffin-embedded (FFPE) samples. The MicroSEC filter utilizes a statistical analysis, and the results for mutations with less than 10 supporting reads are not reliable. Four files are necessary for the analysis: mutation information file, BAM file, and mutation supporting read ID information file.
This tsv or tsv.gz file should contain at least these columns: Sample
Mut_type Chr Pos Ref Alt SimpleRepeat_TRF Neighborhood_sequence
sample_name 1-snv chr1 108130741 C T N
CTACCTGGAGAATGGGCCCATGTGTCCAGGTAGCAGTAAGC
SimpleRepeat_TRF: Whether the mutation locates at a simple repeat
sequence or not (“Y” or “N”).
Neighborhood_sequence: [5’-20 bases] + [Alt sequence] + [3’-20
bases]
Sample, Mut_type, Chr, Pos, Ref, and Alt should be set exactly.
If you do not know the SimpleRepeat_TRF, Mut_type, or
Neighborhood_sequence, enter “-”. Automatically detected.
(mandatory, if multiple samples are processed in a batch)
Seven to ten columns are necessary (without column names).
Optional columns can be deleted if they are not applicable.
[sample name] [mutation information tsv file] [BAM file] [read length]
[adapter sequence read 1] [optional: adapter sequence read 2] [sample
type: Human or Mouse] [panel name] [optional: reference genome fastq
file] [optional: simple repeat region bed file]
PC9 ./source/CCLE.tsv ./source/Cell_line/PC9.bam 127
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Human TOP
A375 ./source/CCLE.tsv.gz ./source/Cell_line/A375.bam 127
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Hg38 TOP ./reference/hg38.fa
./reference/simpleRepeat.bed.gz
(optional, but mandatory with cram files)
(optional, but mandatory to detect simple repeat derived artifacts)
This pipeline contains 8 filtering processes.
Filter 1 : Shorter-supporting lengths distribute too unevenly to occur
(1-1 and 1-2).
Filter 1-1: P-values are less than the threshold_p(default:
10^(-6)).
Filter 1-2: The shorter-supporting lengths distributed over less than
75% of the read length.
Filter 2 : Hairpin-structure induced error detection (2-1 and
2-2).
Filter 2-1: Palindromic sequences exist within 150 bases. Filter 2-2:
>=50% mutation-supporting reads contains a reverse complementary
sequence of the opposite strand consisting >= 15 bases.
Filter 3 : 3’-/5’-supporting lengths are too unevenly distributed to
occur (3-1 and 3-3).
Filter 3-1: P-values are less than the threshold_p(default:
10^(-6)).
Filter 3-2: The distributions of 3’-/5’-supporting lengths are within
75% of the read length.
Filter 4 : >=15% mutations were called by chimeric reads comprising
two distant regions.
Filter 5 : >=50% mutations were called by soft-clipped reads.
Filter 6 : Mutations locating at simple repeat sequences.
Filter 7 : Mutations locating at a >=15 homopolymer.
Filter 8 : >=10% low quality bases (Quality score <18) in the
mutation supporting reads.
Supporting lengths are adjusted considering small repeat sequences around the mutations.
Rscript MicroSEC.R [working/output directory] [sample information tsv file] [progress bar Y/N]
$ Rscript MicroSEC.R ~ \
~/source/Sample_list_test.txt N
$ Rscript MicroSEC.R ~ \
~/source/sample_info_test.tsv.gz Y
If you want to know the progress visually, [progress bar Y/N] should
be Y.
Results are saved in a tsv file.
github url: https://github.com/MANO-B/MicroSEC
# initialize
msec <- NULL
homology_search <- NULL
mut_depth <- NULL
# test data
sample_name <- "sample"
read_length <- 150
adapter_1 <- "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
adapter_2 <- "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
organism <- "hg38"
# load mutation information
df_mutation <- fun_load_mutation(
system.file("extdata", "mutation_list.tsv", package = "MicroSEC"),
"sample",
BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
24)
df_bam <- fun_load_bam(
system.file("extdata", "sample.bam", package = "MicroSEC"))
# another example data
# data(exampleMutation)
# data(exampleBam)
# df_mutation <- exampleMutation
# df_bam <- exampleBam
# load genomic sequence
ref_genome <- fun_load_genome(organism)
chr_no <- fun_load_chr_no(organism)
# analysis
result <- fun_read_check(df_mutation = df_mutation,
df_bam = df_bam,
ref_genome = ref_genome,
sample_name = sample_name,
read_length = read_length,
adapter_1 = adapter_1,
adapter_2 = adapter_2,
short_homology_search_length = 4,
min_homology_search = 40,
progress_bar = progress_bar)
msec_read_checked <- result[[1]]
homology_searched <- result[[2]]
mut_depth_checked <- result[[3]]
# search homologous sequences
msec_homology_searched = fun_homology(msec = msec_read_checked,
df_distant = homology_searched,
min_homology_search = 40,
ref_genome = ref_genome,
chr_no = chr_no,
progress_bar = progress_bar)
# statistical analysis
msec_summarized <- fun_summary(msec_homology_searched)
msec_analyzed <- fun_analysis(msec = msec_summarized,
mut_depth = mut_depth_checked,
short_homology_search_length = 4,
min_homology_search = 40,
threshold_p = 10 ^ (-6),
threshold_hairpin_ratio = 0.50,
threshold_short_length = 0.75,
threshold_distant_homology = 0.15,
threshold_soft_clip_ratio = 0.50,
threshold_low_quality_rate = 0.1,
homopolymer_length = 15)
# save the results as a tsv.gz file.
#fun_save(msec_analyzed, "~/MicroSEC_test.tsv.gz")
msec_analyzed
#> Sample Mut_type Chr Pos Ref Alt SimpleRepeat_TRF
#> 1 sample 2-snv chr11 17719997 CC GG N
#> 2 sample 2-snv chr11 69295601 CT AA N
#> 3 sample 2-snv chr15 98707826 TC CG N
#> 4 sample 1-ins chr2 47783391 C CA N
#> 5 sample 2-del chr2 60920434 CAG C N
#> 6 sample 1-snv chr3 72446880 G C N
#> 7 sample 2-snv chr3 179198865 CAC AAG N
#> 8 sample 2-del chr3 181742887 TTA T Y
#> 9 sample 1-snv chr5 180618885 G A N
#> 10 sample 2-snv chr6 29725282 CC GG N
#> 11 sample 2-snv chr6 88143884 GA TC N
#> 12 sample 2-snv chr7 64068966 CG AA N
#> 13 sample 2-snv chr8 23681315 GC CA N
#> Neighborhood_sequence read_length total_read
#> 1 CCCCGCGGCGGTGCACCCGGGGCCGGGCGCACGTGAGGACGA 150 9
#> 2 CTTGTCTGTAGTCTGTCCCCAATGGGGACAGGGACTCGTTGC 150 28
#> 3 CAACTACGCCCTGGTCATCTCGGAGATGACCAATCTCAAGGA 150 17
#> 4 GGATGCGGCCTGGAGCGAGGCATGGGCCTGGGCCCAGGCCCT 150 13
#> 5 TGGTCTTGAACTCCTGACATCGTGATCCACCCACCTTGGCC 150 34
#> 6 CTCGCACACCAGGCGGTGGCCGCGGCGCCCCGGGAAGGGCC 150 16
#> 7 CAGGTGAACTGTGGGGCATCAAGTTGATGCCCCCAAGAATCCT 150 23
#> 8 GGGGGAAAGGACAGAGCAGTTTATATATATATATATATATA 150 6
#> 9 AGCCTGGCGAGCTCCACCATAGCGCGGAAGCGTCCGCGCTG 150 19
#> 10 ATGTGCAGCACGAGGGGCTGGGCCAGCCCCTCATCCTGAGAT 150 9
#> 11 CCTCGGCAGACGTGTCTGTGTCCACAGACATGGTTACCTTGG 150 23
#> 12 GCCTGGACTCTGCTCAGCAGAATTTGTATAGGGATGTGATGT 150 11
#> 13 CCAGGGAGGCCCGGGAGAAGCACTCCTCTTTCAGGGCCGGCA 150 34
#> soft_clipped_read flag_hairpin pre_support_length post_support_length
#> 1 3 8 143 10
#> 2 13 28 139 140
#> 3 7 17 139 140
#> 4 11 12 131 20
#> 5 16 0 65 144
#> 6 16 0 138 5
#> 7 7 23 139 141
#> 8 1 0 100 111
#> 9 12 0 145 19
#> 10 4 9 139 140
#> 11 8 21 144 10
#> 12 7 0 47 131
#> 13 34 32 135 9
#> short_support_length pre_farthest post_farthest
#> 1 10 205 10
#> 2 9 242 282
#> 3 10 338 187
#> 4 20 211 20
#> 5 65 65 258
#> 6 5 190 5
#> 7 10 268 383
#> 8 72 114 111
#> 9 19 158 19
#> 10 10 205 155
#> 11 10 393 10
#> 12 47 47 243
#> 13 9 301 9
#> low_quality_base_rate_under_q18 low_quality_pre low_quality_post
#> 1 0.015555556 0.011111111 0.000000000
#> 2 0.006428571 0.010714286 0.003571429
#> 3 0.009803922 0.005882353 0.005882353
#> 4 0.010769231 0.015384615 0.000000000
#> 5 0.003725490 0.000000000 0.000000000
#> 6 0.009583333 0.018750000 0.012500000
#> 7 0.005217391 0.000000000 0.004347826
#> 8 0.091111111 0.000000000 0.166666667
#> 9 0.009824561 0.021052632 0.000000000
#> 10 0.004444444 0.000000000 0.000000000
#> 11 0.021449275 0.004347826 0.000000000
#> 12 0.020000000 0.009090909 0.027272727
#> 13 0.006862745 0.000000000 0.011764706
#> distant_homology_rate soft_clipped_rate prob_filter_1 prob_filter_3_pre
#> 1 0.00000000 0.3333333 5.080526e-05 1.173347e-03
#> 2 0.00000000 0.4642857 6.775881e-22 5.890075e-01
#> 3 0.00000000 0.4117647 1.362376e-19 4.838936e-01
#> 4 0.07692308 0.8461538 3.198724e-11 1.943905e-06
#> 5 1.00000000 0.4705882 9.553664e-03 1.214764e-04
#> 6 0.00000000 1.0000000 3.370855e-15 7.600607e-06
#> 7 0.00000000 0.3043478 2.439900e-13 1.000000e+00
#> 8 0.33333333 0.1666667 1.438974e-04 5.013675e-01
#> 9 0.00000000 0.6315789 2.128166e-08 2.319433e-12
#> 10 0.00000000 0.4444444 5.263540e-10 6.122053e-01
#> 11 0.00000000 0.3478261 1.667818e-18 7.358585e-10
#> 12 0.27272727 0.6363636 9.045576e-03 2.661244e-03
#> 13 0.23529412 1.0000000 2.853725e-24 2.396907e-12
#> prob_filter_3_post filter_1_mutation_intra_hairpin_loop
#> 1 2.136914e-05 FALSE
#> 2 8.042059e-01 TRUE
#> 3 6.816295e-01 TRUE
#> 4 8.192000e-10 TRUE
#> 5 1.042664e-04 FALSE
#> 6 3.552714e-15 TRUE
#> 7 1.000000e+00 TRUE
#> 8 1.562500e-02 FALSE
#> 9 3.876247e-12 TRUE
#> 10 7.240429e-01 TRUE
#> 11 3.030645e-20 TRUE
#> 12 3.880853e-03 FALSE
#> 13 2.295862e-28 TRUE
#> filter_2_hairpin_structure filter_3_microhomology_induced_mutation
#> 1 TRUE FALSE
#> 2 TRUE FALSE
#> 3 TRUE FALSE
#> 4 TRUE TRUE
#> 5 FALSE FALSE
#> 6 FALSE TRUE
#> 7 TRUE FALSE
#> 8 FALSE FALSE
#> 9 FALSE TRUE
#> 10 TRUE FALSE
#> 11 TRUE TRUE
#> 12 FALSE FALSE
#> 13 TRUE TRUE
#> filter_4_highly_homologous_region filter_5_soft_clipped_reads
#> 1 FALSE FALSE
#> 2 FALSE FALSE
#> 3 FALSE FALSE
#> 4 FALSE TRUE
#> 5 TRUE FALSE
#> 6 FALSE TRUE
#> 7 FALSE FALSE
#> 8 FALSE FALSE
#> 9 FALSE TRUE
#> 10 FALSE FALSE
#> 11 FALSE FALSE
#> 12 TRUE TRUE
#> 13 TRUE TRUE
#> filter_6_simple_repeat filter_7_mutation_at_homopolymer filter_8_low_quality
#> 1 FALSE FALSE FALSE
#> 2 FALSE FALSE FALSE
#> 3 FALSE FALSE FALSE
#> 4 FALSE FALSE FALSE
#> 5 FALSE FALSE FALSE
#> 6 FALSE FALSE FALSE
#> 7 FALSE FALSE FALSE
#> 8 TRUE FALSE TRUE
#> 9 FALSE FALSE FALSE
#> 10 FALSE FALSE FALSE
#> 11 FALSE FALSE FALSE
#> 12 FALSE FALSE FALSE
#> 13 FALSE FALSE FALSE
#> msec_filter_123 msec_filter_1234 msec_filter_all
#> 1 Artifact suspicious Artifact suspicious Artifact suspicious
#> 2 Artifact suspicious Artifact suspicious Artifact suspicious
#> 3 Artifact suspicious Artifact suspicious Artifact suspicious
#> 4 Artifact suspicious Artifact suspicious Artifact suspicious
#> 5 Artifact suspicious Artifact suspicious
#> 6 Artifact suspicious Artifact suspicious Artifact suspicious
#> 7 Artifact suspicious Artifact suspicious Artifact suspicious
#> 8 Artifact suspicious
#> 9 Artifact suspicious Artifact suspicious Artifact suspicious
#> 10 Artifact suspicious Artifact suspicious Artifact suspicious
#> 11 Artifact suspicious Artifact suspicious Artifact suspicious
#> 12 Artifact suspicious Artifact suspicious
#> 13 Artifact suspicious Artifact suspicious Artifact suspicious
#> comment
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
#> 8 too repetitive to analyze homology, filter 4: sequence is too repetitive,
#> 9
#> 10
#> 11
#> 12
#> 13
sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.5
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Asia/Tokyo
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] MicroSEC_2.1.6
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 utf8_1.2.4
#> [3] generics_0.1.3 SparseArray_1.2.4
#> [5] bitops_1.0-7 lattice_0.22-6
#> [7] stringi_1.8.3 digest_0.6.35
#> [9] magrittr_2.0.3 grid_4.3.2
#> [11] evaluate_0.23 fastmap_1.1.1
#> [13] Matrix_1.6-5 jsonlite_1.8.8
#> [15] restfulr_0.0.15 GenomeInfoDb_1.38.8
#> [17] fansi_1.0.6 BSgenome_1.70.2
#> [19] XML_3.99-0.16.1 Biostrings_2.70.3
#> [21] codetools_0.2-20 jquerylib_0.1.4
#> [23] abind_1.4-5 cli_3.6.2
#> [25] rlang_1.1.3 crayon_1.5.2
#> [27] XVector_0.42.0 Biobase_2.62.0
#> [29] withr_3.0.0 DelayedArray_0.28.0
#> [31] cachem_1.0.8 yaml_2.3.8
#> [33] S4Arrays_1.2.1 tools_4.3.2
#> [35] parallel_4.3.2 BiocParallel_1.36.0
#> [37] dplyr_1.1.4 GenomeInfoDbData_1.2.11
#> [39] Rsamtools_2.18.0 SummarizedExperiment_1.32.0
#> [41] BiocGenerics_0.48.1 vctrs_0.6.5
#> [43] R6_2.5.1 matrixStats_1.3.0
#> [45] BiocIO_1.12.0 stats4_4.3.2
#> [47] lifecycle_1.0.4 BSgenome.Hsapiens.UCSC.hg38_1.4.5
#> [49] rtracklayer_1.62.0 zlibbioc_1.48.2
#> [51] stringr_1.5.1 S4Vectors_0.40.2
#> [53] IRanges_2.36.0 pkgconfig_2.0.3
#> [55] pillar_1.9.0 bslib_0.8.0
#> [57] glue_1.7.0 GenomicAlignments_1.38.2
#> [59] xfun_0.43 tibble_3.2.1
#> [61] GenomicRanges_1.54.1 tidyselect_1.2.1
#> [63] MatrixGenerics_1.14.0 rstudioapi_0.16.0
#> [65] knitr_1.46 rjson_0.2.21
#> [67] htmltools_0.5.8.1 rmarkdown_2.26
#> [69] compiler_4.3.2 RCurl_1.98-1.14
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.