Short introduction to an STRMPS workflow

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Søren B. Vilsen

Last updated: 2025-02-22

The STRMPS package is designed to extract and collect the short tandem repeat (STR) information from the fastq files produced by massively parallel sequencing (MPS).

Example sequences

The STRMPS-package depends on R (>= 4.4), methods, utils, tidyr, tibble, dplyr, stringr, purrr, parallel, as well as the bioconductor packages Biostrings, pwalign, ShortRead, and IRanges.

Using the readFastq function, we can load a data file into R.

library("Biostrings")
library("pwalign")
library("ShortRead")

readPath <- system.file('extdata', 'sampleSequences.fastq', package = 'STRMPS')
sequences <- readFastq(readPath)

sequences@sread
#> DNAStringSet object of length 20237:
#>         width seq
#>     [1]    70 CTGTTCTAAGTACAGTAAAGATAGATAGATAG...GGGTGAGATTTCTATCTATGAAGGCAGTTAC
#>     [2]    70 CTGTTCTAAGTACAGTAAAGATAGATAGATAG...GGGTGAGATTTCTATCTATGAAGGCAGTTAC
#>     [3]    70 CTGTTCTAAGTACAGTAAAGATAGATAGATAG...GGGTGAGATTTCTATCTATGAAGGCAGTTAC
#>     [4]    70 CTGTTCTAAGTACAGTAAAGATAGATAGATAG...GGGTGAGATTTCTATCTATGAAGGCAGTTAC
#>     [5]    70 CTGTTCTAAGTACAGTAAAGATAGATAGATAG...GGGTGAGATTTCTATCTATGAAGGCAGTTAC
#>     ...   ... ...
#> [20233]   144 GGACAGATGATACCGCATGGTCCGCTAAAGCT...TACAATTCGGACCAAGTCACATACTGATTAT
#> [20234]   144 GGACAGATGATACCGCATGGTCCGCTAAAGCT...TACAATTCGGACCAAGTCACATACTGATTAT
#> [20235]   144 GGACAGATGATACCGCATGGTCCGCTAAAGCT...TACAATTCGGACCAAGTCACATACTGATTAT
#> [20236]   144 GGACAGATGATACCGCATGGTCCGCTAAAGCT...TACAATTCGGACCAAGTCACATACTGATTAT
#> [20237]   144 GGACAGATGATACCGCATGGTCCGCTAAAGCT...TACAATTCGGACCAAGTCACATACTGATTAT

Flanking regions

We are interested in extracting the STR regions of these sequences. They are extracted by searching for marker specific sequences in the regions surrounding the STR region – called the flanking regions. The following is an example of a tibble containing the necessary flanking region information. NOTE: this is just an example, the following loaded flanking region tibble is not usable for actual applications.

library("STRMPS")
data("flankingRegions")

head(flankingRegions, 5)
#> # A tibble: 5 × 10
#>   Marker     Chromosome Type  ForwardFlank ReverseFlank Motif MotifLength Offset
#>   <chr>      <chr>      <chr> <chr>        <chr>        <chr>       <int>  <int>
#> 1 AMELOGENIN X/Y        AUTO… CCCTGGGCTCT… CAGCTTCCCAGT TCAA…           6     63
#> 2 CSF1PO     5          AUTO… CTGTTCTAAGTA CTATCTATGAA… ATCT…           4     20
#> 3 D10S1248   10         AUTO… GTCACAAACAT… GTTCCTTTAATA GGAA…           4     41
#> 4 D12S391    12         AUTO… CTGTATTAGTA… GCTGGAGACC   GTCT…           4    112
#> 5 D13S317    13         AUTO… GACTCTCTGGAC CTGCCTATGGC… TATC…           4     78
#> # ℹ 2 more variables: ForwardShift <dbl>, ReverseShift <dbl>

The columns of the tibble contains the following information:

Marker and Type are the name and chromosomal type (autosomal, X, or Y) of each marker.
ForwardFlank and ReverseFlank contain the marker specific sequences used to identify the STR regions. The forward and reverse flanks should occur before and after the STR regions, respectively.
Motif and MotifLength are the structure and length of the STR regions motif, respectively.
Offset is used to adjust the numeric allele designation so it can be compared to the corresponding CE allele designation.
ForwardShift and ReverseShift (not shown) are used to trim the extracted sequences to just include the STR regions. If these are set to zero, then the extracted regions contain some useful flanking region information.

Identification

Using the flankingRegions file, we can identify the STR regions of the sequences by calling the identifySTRRegions function. The function takes four arguments:

reads: The sequences to search through.
flankingRegions: A tibble or data.frame in the same style as shown above.
numberOfMutation: The number of allowed mismatches, when searching the sequences.
control: A control object setting additional parameters (to see more type ?identifySTRRegions.control).

identifiedSTRs <- identifySTRRegions(
    reads = sequences,
    flankingRegions = flankingRegions,
    numberOfMutation = 1,
    control = identifySTRRegions.control(
        numberOfThreads = 1,
        includeReverseComplement = FALSE
    )
)

The function returns a list with the following:

n_reads: The total number of reads in the supplied reads object.
reverseComplement: TRUE/FALSE value – did we search for the reverse complementary flanking regions?
identifiedMarkers: A list containing indecies, and the start and end positions of STR regions for each of the markers in the flankingRegiions object.
identifiedMarkersSequencesUniquelyAssigned: Almost identical to identifiedMarkers, but if multiple markers have been found in a sequences it is disregarded.
remainingSequences: A vector of indecies of sequences where no marker was identified.

An example of what is stored in the identifiedMarkersSequencesUniquelyAssigned list

names(identifiedSTRs$identifiedMarkersSequencesUniquelyAssigned$CSF1PO)
#>  [1] "name"                          "matchedSeq"                   
#>  [3] "startForward"                  "endForward"                   
#>  [5] "startReverse"                  "endReverse"                   
#>  [7] "trimmedIncludingFlanks"        "trimmedQualityIncludingFlanks"
#>  [9] "trimmed"                       "trimmedQuality"

That is, we find the name of the marker, the indecies of the sequences identified as belonging to the marker, the start of the forward flank, the end of the reverse flank, and then trimmed versions of the sequences, where ‘junk’ has been removed (any information outside the searched for flanking sequences).

Aggregation

Given the identified STR regions, we can aggregate the identified strings, to get their coverage. When doing so, we also supply the length of the motifs of every marker, and their type, i.e. whether the marker is located on an autosomal, the X, or the Y chromosome. In order to supply the motif length and type information, we need to know which markers were actually identified, and in which order they were identified.

This could be done as follows:

sortedIncludedMarkers <- sapply(
    names(identifiedSTRs$identifiedMarkersSequencesUniquelyAssigned),
    function(m) which(m == flankingRegions$Marker)
)

The aggregation is then performed by calling the stringCoverage function with the arguments extractedReadsListObject, motifLength, Type, and control. In the control argument, we pass an argument called simpleReturn. If this argument is set to FALSE, the aggregation is performed based on both the STR region and the forward and reverse flanks. If it is set to TRUE, the aggregation is just performed based on the STR regions. That is, if it is FALSE, we will find more distinct strings, with lower coverage. Note that when setting simpleReturn = TRUE, the quality is also summarised by the geometric average for each base.

stringCoverageList <- stringCoverage(
    extractedReadsListObject = identifiedSTRs,
    flankingRegions = flankingRegions,
    control = stringCoverage.control(
        numberOfThreads = 1,
        trace = FALSE,
        simpleReturn = TRUE
    )
)

The stringCoverage returns a list of tibbles one for each of the identified markers. The result can be seen here:

stringCoverageList$CSF1PO
#> # A tibble: 3 × 9
#>   Marker BasePairs Allele Type      MotifLength Region Coverage AggregateQuality
#>   <chr>      <int>  <dbl> <chr>           <int> <chr>     <int> <chr>           
#> 1 CSF1PO        38    9.5 AUTOSOMAL           4 CAGTA…      955 [[[[[[[[[[[[[[[…
#> 2 CSF1PO        50   12.5 AUTOSOMAL           4 CAGTA…       37 [[[[[[[[[[[[[[[…
#> 3 CSF1PO        54   13.5 AUTOSOMAL           4 CAGTA…      994 [[[[[[[[[[[[[[[…
#> # ℹ 1 more variable: Quality <list>

We see on marker CSF1PO, we have found three distinct markers two with a coverage above 950, and one with a coverage of 37. If the sample only contained a single contributor, we would expect that the two strings with a coverage of above 950 to be the alleles of the individual. While the string with low coverage is either a stutter (as it is in the stutter position), or an error.

Genotype identification

If the aggregated stringCoverageList-object, contains DNA from a single contributor and was made with a large amount of input DNA, we can determine the genotype of the contributor. A way of doing so is to identify the string with the largest coverage, and determine whether or not another allele is within some pre-defined heterozygous threshold (by default this is set to \(0.35\)).

We determine the genotype of each marker, by using the getGenotype function.

genotypeList <- getGenotype(stringCoverageList)

The created genotypeList-object contains the genotypes determined by the thresholdHeterozygosity. Continuing the example from before, the genotypes of the marker CSF1PO are:

genotypeList$CSF1PO
#> # A tibble: 2 × 10
#>   Marker BasePairs Allele Type      MotifLength Region Coverage AggregateQuality
#>   <chr>      <int>  <dbl> <chr>           <int> <chr>     <int> <chr>           
#> 1 CSF1PO        54   13.5 AUTOSOMAL           4 CAGTA…      994 [[[[[[[[[[[[[[[…
#> 2 CSF1PO        38    9.5 AUTOSOMAL           4 CAGTA…      955 [[[[[[[[[[[[[[[…
#> # ℹ 2 more variables: Quality <list>, Indices <int>

We see that the getGenotype function has identified the two strings with coverage above 950, as the potential alleles of the marker for the contributor of the sample. Furthermore, note that the function adds a column called Indices containing the indices of the corresponding strings in the stringCoverageList-object. That is, in this example, Indices contains the values 1 and 3.

Noise identification

In a similar way to the genotype identification, we implemented simple rules for noise identification. A string is classified as noise if it is less than the coverage of the most prevalent string times thresholdSignal. By default this factor is set to \(1\%\).

noiseList <- identifyNoise(stringCoverageList, thresholdSignal = 0.03)

The identifyNoise function returns a list with an element for every marker in the stringCoverageList-object. If an observation is classified as noise, it is then removed from the data. Returning to our example from before, marker CSF1PO has no strings with less than \(1\%\) of the coverage of the most prevalent string. That is, the following still contains three distinct sequences:

noiseList$CSF1PO
#> # A tibble: 2 × 10
#>   Marker BasePairs Allele Type      MotifLength Region Coverage AggregateQuality
#>   <chr>      <int>  <dbl> <chr>           <int> <chr>     <int> <chr>           
#> 1 CSF1PO        54   13.5 AUTOSOMAL           4 CAGTA…      994 [[[[[[[[[[[[[[[…
#> 2 CSF1PO        38    9.5 AUTOSOMAL           4 CAGTA…      955 [[[[[[[[[[[[[[[…
#> # ℹ 2 more variables: Quality <list>, Indices <int>

Note that as with the getGenotype function, the identifyNoise function adds a column called Indices containing the indices of the corresponding strings in the stringCoverageList-object. That is, in this example, Indices contains the values 1, 2, and 3.

Stuttering

Stuttering refers to the loss (or gain) of a motif in the allele sequence during PCR amplification. A predictor of the rate of stuttering (quantified by the stutter ratio or stutter proportion), is the Block Length of the Missing Motif (BLMM). If we compare the allele to a stutter sequence, then we can identify the approximate location (down to the nearest block) of the motif which stuttered. The BLMM is the length of the block (sub-sequence), which has lost a motif.

The findStutter function for every called allele, found with the getGenotype function, finds all potential stutters of the alleles and calculates the stutter ratio and stutter proportions between the allele and potential stutter. It also identifies the missing motif and finds the length of the corresponding block, as well as the longest uninterrupted stretch (LUS), i.e. the length of the longest block. It takes a merge of the stringCoverageList and the genotypeList created using the mergeGenotypeStringCoverage function.

stringCoverageGenotypeList <- mergeGenotypeStringCoverage(stringCoverageList, genotypeList)

stutterList <- findStutter(stringCoverageGenotypeList)
stutterTibble <- subset(do.call("rbind", stutterList), !is.na(Genotype))
head(stutterTibble, 5)
#> # A tibble: 5 × 20
#>   Marker  Genotype    ParentAllele ParentString        ParentLUS ParentLUSLength
#>   <chr>   <chr>              <dbl> <chr>               <chr>               <int>
#> 1 CSF1PO  9.5,13.5            13.5 CAGTAAAGATAGATAGAT… [AGAT]9                 9
#> 2 D13S317 29.5                29.5 AATGGCGGGTCGATGAGG… [TATC]10               10
#> 3 D18S51  31.75,37.75         37.8 GCTGAAGAAAGAAAGAAA… [AAGA]22               22
#> 4 D3S1358 31.5,32.5           32.5 ATTACTCCATTTGGCCAA… [TCTA]12               12
#> 5 D5S818  18                  18   CAGAGTAGAGTTGAGATA… [AGAT]13               13
#> # ℹ 14 more variables: ParentCoverage <int>, NeighbourAllele <dbl>,
#> #   NeighbourString <chr>, Block <chr>, MissingMotif <chr>,
#> #   BlockLengthMissingMotif <int>, NeighbourCoverage <int>,
#> #   NeighbourRatio <dbl>, NeighbourProportion <dbl>,
#> #   FLAGStutterIdentifiedMoreThanOnce <lgl>, FLAGMoreThanTwoAlleles <lgl>,
#> #   FLAGAlleleDifferenceOne <lgl>, FLAGMoreThanOneBlock <lgl>,
#> #   FLAGBlocksWithDifferentLengths <lgl>

Note that the output also contains a series of flags which could be useful:

FLAGStutterIdentifiedMoreThanOnce: Is the stutter string been identified as a stutter for more than one allele?
FLAGMoreThanTwoAlleles: Were more than two alleles provided by the merged stringCoverageGenotypeList object?
FLAGAlleleDifferenceOne: Was the difference between the two alleles one motif? If TRUE the shorter will appear as a stutter of the longer and, thereby, skew the results of any analysis. These cases should be removed.
FLAGMoreThanOneBlock: Was more than one block identified when comparing the potenetial stutter and allele sequences?
FLAGBlocksWithDifferentLengths: If FLAGMoreThanOneBlock was TRUE, did the identified blocks have different lengths? Note: This should not be possible in theory, but if it happens it is at most a difference of one and the longest is always chosen (as it should be more likely).

Workflow function

Instead of calling all the above functions, we can call the workflow function STRMPSWorkflow. The function takes a path to a file, and either returns a series of .RData files in the provided output directory, or it returns the stringCoverageList object created by the stringCoverage function. The function can also continue from previously created files using the continueCheckpoint argument (if the files are placed in the output directory).

STRMPSWorkflow(
    read_path,
    control = workflow.control(
        restrictType = "Autosomal",
        numberOfThreads = 1
    )
)

There also exists a batch version of the function called STRMPSWorkflowBatch which takes an input directory containing the .fastq files to be analysed.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.