The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

MicroSEC: Sequence artifact filtering pipeline for FFPE samples

Masachika Ikegami

2024-08-25

Introduction

This pipeline is designed for filtering mutations found in formalin-fixed and paraffin-embedded (FFPE) samples. The MicroSEC filter utilizes a statistical analysis, and the results for mutations with less than 10 supporting reads are not reliable. Four files are necessary for the analysis: mutation information file, BAM file, and mutation supporting read ID information file.

File 1: mutation information file (mandatory)

This tsv or tsv.gz file should contain at least these columns: Sample Mut_type Chr Pos Ref Alt SimpleRepeat_TRF Neighborhood_sequence
sample_name 1-snv chr1 108130741 C T N CTACCTGGAGAATGGGCCCATGTGTCCAGGTAGCAGTAAGC
SimpleRepeat_TRF: Whether the mutation locates at a simple repeat sequence or not (“Y” or “N”).
Neighborhood_sequence: [5’-20 bases] + [Alt sequence] + [3’-20 bases]
Sample, Mut_type, Chr, Pos, Ref, and Alt should be set exactly.
If you do not know the SimpleRepeat_TRF, Mut_type, or Neighborhood_sequence, enter “-”. Automatically detected.

File 2: BAM file (mandatory)

File 3: sample information tsv file

(mandatory, if multiple samples are processed in a batch)
Seven to ten columns are necessary (without column names).
Optional columns can be deleted if they are not applicable.
[sample name] [mutation information tsv file] [BAM file] [read length] [adapter sequence read 1] [optional: adapter sequence read 2] [sample type: Human or Mouse] [panel name] [optional: reference genome fastq file] [optional: simple repeat region bed file]
PC9 ./source/CCLE.tsv ./source/Cell_line/PC9.bam 127 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Human TOP
A375 ./source/CCLE.tsv.gz ./source/Cell_line/A375.bam 127 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Hg38 TOP ./reference/hg38.fa ./reference/simpleRepeat.bed.gz

File 4: Reference genome: Human (hg38 or hg19) or Mouse (mm10)

(optional, but mandatory with cram files)

File 5: simple repeat region bed file

(optional, but mandatory to detect simple repeat derived artifacts)

This pipeline contains 8 filtering processes.
Filter 1 : Shorter-supporting lengths distribute too unevenly to occur (1-1 and 1-2).
Filter 1-1: P-values are less than the threshold_p(default: 10^(-6)).
Filter 1-2: The shorter-supporting lengths distributed over less than 75% of the read length.
Filter 2 : Hairpin-structure induced error detection (2-1 and 2-2).
Filter 2-1: Palindromic sequences exist within 150 bases. Filter 2-2: >=50% mutation-supporting reads contains a reverse complementary sequence of the opposite strand consisting >= 15 bases.
Filter 3 : 3’-/5’-supporting lengths are too unevenly distributed to occur (3-1 and 3-3).
Filter 3-1: P-values are less than the threshold_p(default: 10^(-6)).
Filter 3-2: The distributions of 3’-/5’-supporting lengths are within 75% of the read length.
Filter 4 : >=15% mutations were called by chimeric reads comprising two distant regions.
Filter 5 : >=50% mutations were called by soft-clipped reads.
Filter 6 : Mutations locating at simple repeat sequences.
Filter 7 : Mutations locating at a >=15 homopolymer.
Filter 8 : >=10% low quality bases (Quality score <18) in the mutation supporting reads.

Supporting lengths are adjusted considering small repeat sequences around the mutations.

How to use

Rscript MicroSEC.R [working/output directory] [sample information tsv file] [progress bar Y/N]

Example

$ Rscript MicroSEC.R ~ \
  ~/source/Sample_list_test.txt N
$ Rscript MicroSEC.R ~ \
   ~/source/sample_info_test.tsv.gz Y

If you want to know the progress visually, [progress bar Y/N] should be Y.
Results are saved in a tsv file.

github url: https://github.com/MANO-B/MicroSEC

Setting

wd <- "~"
knitr::opts_chunk$set(collapse = TRUE,
                      fig.width = 12,
                      fig.height = 8,
                      echo = TRUE,
                      warning = FALSE,
                      message = TRUE,
                      comment = "#>")
options(rmarkdown.html_vignette.check_title = FALSE,
        show.error.messages = FALSE,
        warn = -1)

progress_bar <- "N"

Necessary packages

library(MicroSEC)

Analysis

# initialize
msec <- NULL
homology_search <- NULL
mut_depth <- NULL

# test data
sample_name <- "sample"
read_length <- 150
adapter_1 <- "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
adapter_2 <- "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
organism <- "hg38"

# load mutation information
df_mutation <- fun_load_mutation(
   system.file("extdata", "mutation_list.tsv", package = "MicroSEC"),
   "sample",
   BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
   24)
df_bam <- fun_load_bam(
   system.file("extdata", "sample.bam", package = "MicroSEC"))

# another example data
# data(exampleMutation)
# data(exampleBam)
# df_mutation <- exampleMutation
# df_bam <- exampleBam

# load genomic sequence
ref_genome <- fun_load_genome(organism)
chr_no <- fun_load_chr_no(organism)

# analysis
result <- fun_read_check(df_mutation = df_mutation,
                         df_bam = df_bam,
                         ref_genome = ref_genome,
                         sample_name = sample_name,
                         read_length = read_length,
                         adapter_1 = adapter_1,
                         adapter_2 = adapter_2,
                         short_homology_search_length = 4,
                         min_homology_search = 40,
                         progress_bar = progress_bar)
msec_read_checked <- result[[1]]
homology_searched <- result[[2]]
mut_depth_checked <- result[[3]]

# search homologous sequences
msec_homology_searched = fun_homology(msec = msec_read_checked,
                    df_distant = homology_searched,
                    min_homology_search = 40,
                    ref_genome = ref_genome,
                    chr_no = chr_no,
                    progress_bar = progress_bar)

# statistical analysis
msec_summarized <- fun_summary(msec_homology_searched)
msec_analyzed <- fun_analysis(msec = msec_summarized,
                    mut_depth = mut_depth_checked,
                    short_homology_search_length = 4,
                    min_homology_search = 40,
                    threshold_p = 10 ^ (-6),
                    threshold_hairpin_ratio = 0.50,
                    threshold_short_length = 0.75,
                    threshold_distant_homology = 0.15,
                    threshold_soft_clip_ratio = 0.50,
                    threshold_low_quality_rate = 0.1,
                    homopolymer_length = 15)

# save the results as a tsv.gz file.
#fun_save(msec_analyzed, "~/MicroSEC_test.tsv.gz")

Results

msec_analyzed
#>    Sample Mut_type   Chr       Pos Ref Alt SimpleRepeat_TRF
#> 1  sample    2-snv chr11  17719997  CC  GG                N
#> 2  sample    2-snv chr11  69295601  CT  AA                N
#> 3  sample    2-snv chr15  98707826  TC  CG                N
#> 4  sample    1-ins  chr2  47783391   C  CA                N
#> 5  sample    2-del  chr2  60920434 CAG   C                N
#> 6  sample    1-snv  chr3  72446880   G   C                N
#> 7  sample    2-snv  chr3 179198865 CAC AAG                N
#> 8  sample    2-del  chr3 181742887 TTA   T                Y
#> 9  sample    1-snv  chr5 180618885   G   A                N
#> 10 sample    2-snv  chr6  29725282  CC  GG                N
#> 11 sample    2-snv  chr6  88143884  GA  TC                N
#> 12 sample    2-snv  chr7  64068966  CG  AA                N
#> 13 sample    2-snv  chr8  23681315  GC  CA                N
#>                          Neighborhood_sequence read_length total_read
#> 1   CCCCGCGGCGGTGCACCCGGGGCCGGGCGCACGTGAGGACGA         150          9
#> 2   CTTGTCTGTAGTCTGTCCCCAATGGGGACAGGGACTCGTTGC         150         28
#> 3   CAACTACGCCCTGGTCATCTCGGAGATGACCAATCTCAAGGA         150         17
#> 4   GGATGCGGCCTGGAGCGAGGCATGGGCCTGGGCCCAGGCCCT         150         13
#> 5    TGGTCTTGAACTCCTGACATCGTGATCCACCCACCTTGGCC         150         34
#> 6    CTCGCACACCAGGCGGTGGCCGCGGCGCCCCGGGAAGGGCC         150         16
#> 7  CAGGTGAACTGTGGGGCATCAAGTTGATGCCCCCAAGAATCCT         150         23
#> 8    GGGGGAAAGGACAGAGCAGTTTATATATATATATATATATA         150          6
#> 9    AGCCTGGCGAGCTCCACCATAGCGCGGAAGCGTCCGCGCTG         150         19
#> 10  ATGTGCAGCACGAGGGGCTGGGCCAGCCCCTCATCCTGAGAT         150          9
#> 11  CCTCGGCAGACGTGTCTGTGTCCACAGACATGGTTACCTTGG         150         23
#> 12  GCCTGGACTCTGCTCAGCAGAATTTGTATAGGGATGTGATGT         150         11
#> 13  CCAGGGAGGCCCGGGAGAAGCACTCCTCTTTCAGGGCCGGCA         150         34
#>    soft_clipped_read flag_hairpin pre_support_length post_support_length
#> 1                  3            8                143                  10
#> 2                 13           28                139                 140
#> 3                  7           17                139                 140
#> 4                 11           12                131                  20
#> 5                 16            0                 65                 144
#> 6                 16            0                138                   5
#> 7                  7           23                139                 141
#> 8                  1            0                100                 111
#> 9                 12            0                145                  19
#> 10                 4            9                139                 140
#> 11                 8           21                144                  10
#> 12                 7            0                 47                 131
#> 13                34           32                135                   9
#>    short_support_length pre_farthest post_farthest
#> 1                    10          205            10
#> 2                     9          242           282
#> 3                    10          338           187
#> 4                    20          211            20
#> 5                    65           65           258
#> 6                     5          190             5
#> 7                    10          268           383
#> 8                    72          114           111
#> 9                    19          158            19
#> 10                   10          205           155
#> 11                   10          393            10
#> 12                   47           47           243
#> 13                    9          301             9
#>    low_quality_base_rate_under_q18 low_quality_pre low_quality_post
#> 1                      0.015555556     0.011111111      0.000000000
#> 2                      0.006428571     0.010714286      0.003571429
#> 3                      0.009803922     0.005882353      0.005882353
#> 4                      0.010769231     0.015384615      0.000000000
#> 5                      0.003725490     0.000000000      0.000000000
#> 6                      0.009583333     0.018750000      0.012500000
#> 7                      0.005217391     0.000000000      0.004347826
#> 8                      0.091111111     0.000000000      0.166666667
#> 9                      0.009824561     0.021052632      0.000000000
#> 10                     0.004444444     0.000000000      0.000000000
#> 11                     0.021449275     0.004347826      0.000000000
#> 12                     0.020000000     0.009090909      0.027272727
#> 13                     0.006862745     0.000000000      0.011764706
#>    distant_homology_rate soft_clipped_rate prob_filter_1 prob_filter_3_pre
#> 1             0.00000000         0.3333333  5.080526e-05      1.173347e-03
#> 2             0.00000000         0.4642857  6.775881e-22      5.890075e-01
#> 3             0.00000000         0.4117647  1.362376e-19      4.838936e-01
#> 4             0.07692308         0.8461538  3.198724e-11      1.943905e-06
#> 5             1.00000000         0.4705882  9.553664e-03      1.214764e-04
#> 6             0.00000000         1.0000000  3.370855e-15      7.600607e-06
#> 7             0.00000000         0.3043478  2.439900e-13      1.000000e+00
#> 8             0.33333333         0.1666667  1.438974e-04      5.013675e-01
#> 9             0.00000000         0.6315789  2.128166e-08      2.319433e-12
#> 10            0.00000000         0.4444444  5.263540e-10      6.122053e-01
#> 11            0.00000000         0.3478261  1.667818e-18      7.358585e-10
#> 12            0.27272727         0.6363636  9.045576e-03      2.661244e-03
#> 13            0.23529412         1.0000000  2.853725e-24      2.396907e-12
#>    prob_filter_3_post filter_1_mutation_intra_hairpin_loop
#> 1        2.136914e-05                                FALSE
#> 2        8.042059e-01                                 TRUE
#> 3        6.816295e-01                                 TRUE
#> 4        8.192000e-10                                 TRUE
#> 5        1.042664e-04                                FALSE
#> 6        3.552714e-15                                 TRUE
#> 7        1.000000e+00                                 TRUE
#> 8        1.562500e-02                                FALSE
#> 9        3.876247e-12                                 TRUE
#> 10       7.240429e-01                                 TRUE
#> 11       3.030645e-20                                 TRUE
#> 12       3.880853e-03                                FALSE
#> 13       2.295862e-28                                 TRUE
#>    filter_2_hairpin_structure filter_3_microhomology_induced_mutation
#> 1                        TRUE                                   FALSE
#> 2                        TRUE                                   FALSE
#> 3                        TRUE                                   FALSE
#> 4                        TRUE                                    TRUE
#> 5                       FALSE                                   FALSE
#> 6                       FALSE                                    TRUE
#> 7                        TRUE                                   FALSE
#> 8                       FALSE                                   FALSE
#> 9                       FALSE                                    TRUE
#> 10                       TRUE                                   FALSE
#> 11                       TRUE                                    TRUE
#> 12                      FALSE                                   FALSE
#> 13                       TRUE                                    TRUE
#>    filter_4_highly_homologous_region filter_5_soft_clipped_reads
#> 1                              FALSE                       FALSE
#> 2                              FALSE                       FALSE
#> 3                              FALSE                       FALSE
#> 4                              FALSE                        TRUE
#> 5                               TRUE                       FALSE
#> 6                              FALSE                        TRUE
#> 7                              FALSE                       FALSE
#> 8                              FALSE                       FALSE
#> 9                              FALSE                        TRUE
#> 10                             FALSE                       FALSE
#> 11                             FALSE                       FALSE
#> 12                              TRUE                        TRUE
#> 13                              TRUE                        TRUE
#>    filter_6_simple_repeat filter_7_mutation_at_homopolymer filter_8_low_quality
#> 1                   FALSE                            FALSE                FALSE
#> 2                   FALSE                            FALSE                FALSE
#> 3                   FALSE                            FALSE                FALSE
#> 4                   FALSE                            FALSE                FALSE
#> 5                   FALSE                            FALSE                FALSE
#> 6                   FALSE                            FALSE                FALSE
#> 7                   FALSE                            FALSE                FALSE
#> 8                    TRUE                            FALSE                 TRUE
#> 9                   FALSE                            FALSE                FALSE
#> 10                  FALSE                            FALSE                FALSE
#> 11                  FALSE                            FALSE                FALSE
#> 12                  FALSE                            FALSE                FALSE
#> 13                  FALSE                            FALSE                FALSE
#>        msec_filter_123    msec_filter_1234     msec_filter_all
#> 1  Artifact suspicious Artifact suspicious Artifact suspicious
#> 2  Artifact suspicious Artifact suspicious Artifact suspicious
#> 3  Artifact suspicious Artifact suspicious Artifact suspicious
#> 4  Artifact suspicious Artifact suspicious Artifact suspicious
#> 5                      Artifact suspicious Artifact suspicious
#> 6  Artifact suspicious Artifact suspicious Artifact suspicious
#> 7  Artifact suspicious Artifact suspicious Artifact suspicious
#> 8                                          Artifact suspicious
#> 9  Artifact suspicious Artifact suspicious Artifact suspicious
#> 10 Artifact suspicious Artifact suspicious Artifact suspicious
#> 11 Artifact suspicious Artifact suspicious Artifact suspicious
#> 12                     Artifact suspicious Artifact suspicious
#> 13 Artifact suspicious Artifact suspicious Artifact suspicious
#>                                                                       comment
#> 1                                                                            
#> 2                                                                            
#> 3                                                                            
#> 4                                                                            
#> 5                                                                            
#> 6                                                                            
#> 7                                                                            
#> 8   too repetitive to analyze homology, filter 4: sequence is too repetitive,
#> 9                                                                            
#> 10                                                                           
#> 11                                                                           
#> 12                                                                           
#> 13

Information about the current R session

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.5
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Asia/Tokyo
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] MicroSEC_2.1.6
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9                        utf8_1.2.4                       
#>  [3] generics_0.1.3                    SparseArray_1.2.4                
#>  [5] bitops_1.0-7                      lattice_0.22-6                   
#>  [7] stringi_1.8.3                     digest_0.6.35                    
#>  [9] magrittr_2.0.3                    grid_4.3.2                       
#> [11] evaluate_0.23                     fastmap_1.1.1                    
#> [13] Matrix_1.6-5                      jsonlite_1.8.8                   
#> [15] restfulr_0.0.15                   GenomeInfoDb_1.38.8              
#> [17] fansi_1.0.6                       BSgenome_1.70.2                  
#> [19] XML_3.99-0.16.1                   Biostrings_2.70.3                
#> [21] codetools_0.2-20                  jquerylib_0.1.4                  
#> [23] abind_1.4-5                       cli_3.6.2                        
#> [25] rlang_1.1.3                       crayon_1.5.2                     
#> [27] XVector_0.42.0                    Biobase_2.62.0                   
#> [29] withr_3.0.0                       DelayedArray_0.28.0              
#> [31] cachem_1.0.8                      yaml_2.3.8                       
#> [33] S4Arrays_1.2.1                    tools_4.3.2                      
#> [35] parallel_4.3.2                    BiocParallel_1.36.0              
#> [37] dplyr_1.1.4                       GenomeInfoDbData_1.2.11          
#> [39] Rsamtools_2.18.0                  SummarizedExperiment_1.32.0      
#> [41] BiocGenerics_0.48.1               vctrs_0.6.5                      
#> [43] R6_2.5.1                          matrixStats_1.3.0                
#> [45] BiocIO_1.12.0                     stats4_4.3.2                     
#> [47] lifecycle_1.0.4                   BSgenome.Hsapiens.UCSC.hg38_1.4.5
#> [49] rtracklayer_1.62.0                zlibbioc_1.48.2                  
#> [51] stringr_1.5.1                     S4Vectors_0.40.2                 
#> [53] IRanges_2.36.0                    pkgconfig_2.0.3                  
#> [55] pillar_1.9.0                      bslib_0.8.0                      
#> [57] glue_1.7.0                        GenomicAlignments_1.38.2         
#> [59] xfun_0.43                         tibble_3.2.1                     
#> [61] GenomicRanges_1.54.1              tidyselect_1.2.1                 
#> [63] MatrixGenerics_1.14.0             rstudioapi_0.16.0                
#> [65] knitr_1.46                        rjson_0.2.21                     
#> [67] htmltools_0.5.8.1                 rmarkdown_2.26                   
#> [69] compiler_4.3.2                    RCurl_1.98-1.14

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.