This pipeline is designed for filtering mutations found in formalin-fixed and paraffin-embedded (FFPE) samples. The MicroSEC filter utilizes a statistical analysis, and the results for mutations with less than 10 supporting reads are not reliable. Four files are nessesary for the analysis: mutation information file, BAM file, and mutation supporting read ID information file.
File 1: mutation information file
This excel file should contain at least these contents:
Sample Gene HGVS.c HGVS.p Mut_type Total_QV>=20 %Alt Chr
SL_1010-N6-B SLC25A24 _ _ 1-snv 366 1.0929 chr1
Pos Ref Alt SimpleRepeat_TRF Neighborhood_sequence
108130741 C T N CTACCTGGAGAATGGGCCCATGTGTCCAGGTAGCAGTAAGC
Transition
C>T_t
Total_QV>=20: The read number with total Q-value >=20.
SimpleRepeat_TRF: Whether the mutation locates at a simple repeat sequence or not.
Neighborhood_sequence: [5’-20 bases] + [Alt sequence] + [3’-20 bases].
Transition: 1-snv mutation pattern with a 3’-base. C>T_t represents CT to TT mutation. C>T_g_FFPE represents the possible FFPE artifact.
Sample, Mut_type, Chr, Pos, Ref, and Alt should be set exactly.
Gene, HGVS.c, HGVS.p, Total_QV>=20, %Alt, SimpleRepeat_TRF, and Transition can be set to any values.
File 2: BAM file
File 3: mutation supporting read ID information file
This file should contain at least these contents:
Chr Pos Ref Alt
chr1 2561609 T A
Mut_ID Mut
_;ID001-1:579185f,ID004-1:1873933f;ID006-1:1131647f,ID001-1:570086f .;A;N#
File 4: sample information tsv file
Seven or eight columns are necessary.
[sample name] [mutation information excel file] [BAM file] [read ID information directory] [read length] [adapter sequence read 1] [optional: adapter sequence read 2] [sample type: Human or Mouse]
PC9 ./source/CCLE.xlsx ./source/Cell_line/PC9_Cell_line_Ag_TDv4.realigned.bam ./source/PC9_Cell_line 127 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT hg38
This pipeline contains 8 filtering processes.
Filter 1 : Shorter-supporting lengths distribute too short to occur (1-1 and 1-2).
Filter 1-1: P-values are less than the threshold_p(default: 10 ^ (-6)).
Filter 1-2: The longest shorter-supporting lengths is shorter than 40% of the read length.
Filter 2 : Hairpin-structure induced error detection (2-1 or 2-2).
Filter 2-1: Palindromic sequences exist within 150 bases. Filter 2-2: >=50% mutation-supporting reads contains a reverse complementary sequence of the opposite strand consisting >= 15 bases.
Filter 3 : 3’-/5’-supporting lengths are too densely distributed to occur (3-1, 3-2, and 3-3).
Filter 3-1: P-values are less than the threshold_p(default: 10^(-6)).
Filter 3-2: The distributions of 3’-/5’-supporting lengths are shorter than 80% of the read length.
Filter 3-3: <10% of bases are low quality (Quality score <18).
Filter 4 : >=90% mutation-supporting reads are soft-clipped (after cutting adapter sequence).
Filter 5 : >=20% mutations were called by chimeric reads comprising two distant regions.
Filter 6 : Mutations locating at simple repeat sequences.
Filter 7 : C>T_g positive artifacts in FFPE samples.
Filter 8 : Indel mutations locating at a >=15 homopolymer.
Supporting lengths are adjusted considering small repeat sequences around the mutations.
Results are saved in a excel file.
The explatation of the results is written in detail in the second sheet of the excel file.
github url: https://github.com/MANO-B/MicroSEC
# initialize
msec = NULL
homology_search = NULL
mut_depth = NULL
# test data
sample_name = "H15-11943-1-T_TDv3"
read_length = 151
adapter_1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
adapter_2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
organism = "hg38"
# load mutation information
data(exampleMutation)
data(exampleBAM)
data(exampleMutCall)
# load genomic sequence
ref_genome = fun_load_genome(organism)
chr_no = fun_load_chr_no(organism)
# analysis
result = fun_read_check(df_mutation = exampleMutation,
df_bam = exampleBAM,
df_mut_call = exampleMutCall,
ref_genome = ref_genome,
sample_name = sample_name,
read_length = read_length,
adapter_1 = adapter_1,
adapter_2 = adapter_2,
short_homology_search_length = 4,
progress_bar = progress_bar)
msec = rbind(msec, result[[1]])
homology_search = rbind(homology_search, result[[2]])
mut_depth = rbind(mut_depth, result[[3]])
# search homologous sequences
msec = fun_homology(msec,
homology_search,
min_homology_search = 40,
ref_genome,
chr_no,
progress_bar = progress_bar)
# statistical analysis
msec = fun_summary(msec)
msec = fun_analysis(msec,
mut_depth,
short_homology_search_length = 4,
min_homology_search = 40,
threshold_p = 10 ^ (-6),
threshold_hairpin_ratio = 0.50,
threshold_soft_clip_ratio = 0.90,
threshold_short_length = 0.8,
threshold_distant_homology = 0.2,
threshold_low_quality_rate = 0.1,
homopolymer_length = 15)
# save the results in the working/output directory.
# fun_save(msec, sample_name, wd)
msec
#> Sample Gene HGVS.c HGVS.p Mut_type
#> 1 H15-11943-1-T_TDv3 KMT2D c.2887G>A p.Ala963Thr 1-snv
#> 2 H15-11943-1-T_TDv3 KMT2D c.2886T>C p.Gly962Gly 1-snv
#> 3 H15-11943-1-T_TDv3 TP53 c.783-114G>A _ 1-snv
#> 4 H15-11943-1-T_TDv3 TP53 c.783-117T>C _ 1-snv
#> 5 H15-11943-1-T_TDv3 PIK3CA c.-76-3489delA _ 1-del
#> 6 H15-11943-1-T_TDv3 PIK3CA c.3073A>G p.Thr1025Ala 1-snv
#> 7 H15-11943-1-T_TDv3 EGFR c.2573T>G p.Leu858Arg 1-snv
#> 8 H15-11943-1-T_TDv3 EGFR c.3114+15_3114+16delTC _ 2-del
#> Total_QV>=20 %Alt Chr Pos Ref Alt SimpleRepeat_TRF
#> 1 163 6.74847 chr12 49050701 C T N
#> 2 162 7.40741 chr12 49050702 A G N
#> 3 189 8.46561 chr17 7673951 C T N
#> 4 183 9.83607 chr17 7673954 A G N
#> 5 134 17.16420 chr3 179195260 GA G N
#> 6 185 27.56760 chr3 179234230 A G N
#> 7 315 44.44440 chr7 55191822 T G N
#> 8 231 6.92641 chr7 55201369 GTC G N
#> Neighborhood_sequence Transition read_length total_read
#> 1 AGGGTCACTGTCCCCTTTGGTACCAAAGGGGTACTCTAACT NA 151 11
#> 2 GGGTCACTGTCCCCTTTGGCGCCAAAGGGGTACTCTAACTC NA 151 12
#> 3 ACCCTTGTCCTTTCTGGAGCTTAAGCTCCAGCTCCAGGTAG NA 151 16
#> 4 CTTGTCCTTTCTGGAGCCTAGGCTCCAGCTCCAGGTAGGTG NA 151 18
#> 5 TATTTACTTGCCTTCTCGGTGAAAAAAAAAACAACAACAAA NA 151 23
#> 6 ACATTGCATACATTCGAAAGGCCCTAGCCTTAGATAAAACT NA 151 51
#> 7 CAAGATCACAGATTTTGGGCGGGCCAAACTGCTGGGTGCGG NA 151 140
#> 8 CTCTCTGGTATGAAATCTCTGTCTCTCTCTCTCTCAAGCTG NA 151 16
#> soft_clipped_read flag_hairpin pre_support_length post_support_length
#> 1 10 0 132 137
#> 2 11 0 133 136
#> 3 14 0 140 134
#> 4 14 0 143 131
#> 5 9 0 130 130
#> 6 7 0 144 131
#> 7 29 0 147 135
#> 8 3 0 97 144
#> short_support_length low_quality_base_rate_under_q18 distant_homology_rate
#> 1 11 0.059602649 0
#> 2 10 0.071743929 0
#> 3 10 0.007036424 0
#> 4 18 0.006990434 0
#> 5 70 0.039447164 0
#> 6 72 0.014673419 0
#> 7 68 0.006196783 0
#> 8 75 0.011175497 0
#> prob_filter_1 prob_filter_3_pre prob_filter_3_post
#> 1 2.794410e-11 0.7106022939 0.3735577856
#> 2 5.248589e-17 0.4755555434 0.2746144956
#> 3 2.254597e-19 0.1180670870 0.0519644737
#> 4 7.602715e-14 0.1600788940 0.0635856288
#> 5 1.384192e-03 0.0304539889 0.0161038299
#> 6 1.134132e-01 0.0115221008 0.0157100141
#> 7 8.203053e-06 0.0537795550 0.0001750834
#> 8 1.000000e+00 0.0004783243 0.1146106607
#> filter_1_mutation_intra_hairpin_loop filter_2_hairpin_structure
#> 1 TRUE FALSE
#> 2 TRUE FALSE
#> 3 TRUE FALSE
#> 4 TRUE FALSE
#> 5 FALSE FALSE
#> 6 FALSE FALSE
#> 7 FALSE FALSE
#> 8 FALSE FALSE
#> filter_3_microhomology_induced_mutation filter_4_soft_clipping
#> 1 FALSE TRUE
#> 2 FALSE TRUE
#> 3 FALSE FALSE
#> 4 FALSE FALSE
#> 5 FALSE FALSE
#> 6 FALSE FALSE
#> 7 FALSE FALSE
#> 8 FALSE FALSE
#> filter_5_highly_homologous_region filter_6_simple_repeat
#> 1 FALSE FALSE
#> 2 FALSE FALSE
#> 3 FALSE FALSE
#> 4 FALSE FALSE
#> 5 FALSE FALSE
#> 6 FALSE FALSE
#> 7 FALSE FALSE
#> 8 FALSE FALSE
#> filter_7_c_to_t_artifact filter_8_mutation_at_homopolymer msec_filter_1234
#> 1 FALSE FALSE Artifact suspicious
#> 2 FALSE FALSE Artifact suspicious
#> 3 FALSE FALSE Artifact suspicious
#> 4 FALSE FALSE Artifact suspicious
#> 5 FALSE FALSE
#> 6 FALSE FALSE
#> 7 FALSE FALSE
#> 8 FALSE FALSE
#> msec_filter_12345 msec_filter_all comment
#> 1 Artifact suspicious Artifact suspicious
#> 2 Artifact suspicious Artifact suspicious
#> 3 Artifact suspicious Artifact suspicious
#> 4 Artifact suspicious Artifact suspicious
#> 5
#> 6
#> 7
#> 8
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] MicroSEC_1.1.3
#>
#> loaded via a namespace (and not attached):
#> [1] zip_2.1.1 Rcpp_1.0.5
#> [3] compiler_3.6.3 pillar_1.4.6
#> [5] GenomeInfoDb_1.22.1 XVector_0.26.0
#> [7] bitops_1.0-6 tools_3.6.3
#> [9] zlibbioc_1.32.0 digest_0.6.27
#> [11] lattice_0.20-41 BSgenome_1.54.0
#> [13] evaluate_0.14 lifecycle_0.2.0
#> [15] tibble_3.0.4 pkgconfig_2.0.3
#> [17] rlang_0.4.8 Matrix_1.2-18
#> [19] BSgenome.Hsapiens.UCSC.hg38_1.4.1 openxlsx_4.2.3
#> [21] DelayedArray_0.12.3 yaml_2.2.1
#> [23] parallel_3.6.3 xfun_0.18
#> [25] GenomeInfoDbData_1.2.2 rtracklayer_1.46.0
#> [27] stringr_1.4.0 dplyr_1.0.2
#> [29] knitr_1.30 Biostrings_2.54.0
#> [31] generics_0.0.2 S4Vectors_0.24.4
#> [33] vctrs_0.3.4 gtools_3.8.2
#> [35] IRanges_2.20.2 grid_3.6.3
#> [37] stats4_3.6.3 tidyselect_1.1.0
#> [39] Biobase_2.46.0 glue_1.4.2
#> [41] R6_2.4.1 XML_4.0-0
#> [43] BiocParallel_1.20.1 rmarkdown_2.5
#> [45] purrr_0.3.4 magrittr_1.5
#> [47] matrixStats_0.57.0 GenomicAlignments_1.22.1
#> [49] Rsamtools_2.2.3 htmltools_0.5.0
#> [51] ellipsis_0.3.1 BiocGenerics_0.32.0
#> [53] GenomicRanges_1.38.0 SummarizedExperiment_1.16.1
#> [55] stringi_1.5.3 RCurl_1.98-1.2
#> [57] crayon_1.3.4