The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
The STRMPS
package is designed to extract and collect
the short tandem repeat (STR) information from the fastq
files produced by massively parallel sequencing (MPS).
The STRMPS
-package depends on R
(>=
4.4), methods
, utils
, tidyr
,
tibble
, dplyr
, stringr
,
purrr
, parallel
, as well as the bioconductor
packages Biostrings
, pwalign
,
ShortRead
, and IRanges
.
Using the readFastq
function, we can load a data file
into R
.
library("Biostrings")
library("pwalign")
library("ShortRead")
readPath <- system.file('extdata', 'sampleSequences.fastq', package = 'STRMPS')
sequences <- readFastq(readPath)
sequences@sread
#> DNAStringSet object of length 20237:
#> width seq
#> [1] 70 CTGTTCTAAGTACAGTAAAGATAGATAGATAG...GGGTGAGATTTCTATCTATGAAGGCAGTTAC
#> [2] 70 CTGTTCTAAGTACAGTAAAGATAGATAGATAG...GGGTGAGATTTCTATCTATGAAGGCAGTTAC
#> [3] 70 CTGTTCTAAGTACAGTAAAGATAGATAGATAG...GGGTGAGATTTCTATCTATGAAGGCAGTTAC
#> [4] 70 CTGTTCTAAGTACAGTAAAGATAGATAGATAG...GGGTGAGATTTCTATCTATGAAGGCAGTTAC
#> [5] 70 CTGTTCTAAGTACAGTAAAGATAGATAGATAG...GGGTGAGATTTCTATCTATGAAGGCAGTTAC
#> ... ... ...
#> [20233] 144 GGACAGATGATACCGCATGGTCCGCTAAAGCT...TACAATTCGGACCAAGTCACATACTGATTAT
#> [20234] 144 GGACAGATGATACCGCATGGTCCGCTAAAGCT...TACAATTCGGACCAAGTCACATACTGATTAT
#> [20235] 144 GGACAGATGATACCGCATGGTCCGCTAAAGCT...TACAATTCGGACCAAGTCACATACTGATTAT
#> [20236] 144 GGACAGATGATACCGCATGGTCCGCTAAAGCT...TACAATTCGGACCAAGTCACATACTGATTAT
#> [20237] 144 GGACAGATGATACCGCATGGTCCGCTAAAGCT...TACAATTCGGACCAAGTCACATACTGATTAT
We are interested in extracting the STR regions of these sequences.
They are extracted by searching for marker specific sequences in the
regions surrounding the STR region – called the flanking regions. The
following is an example of a tibble
containing the
necessary flanking region information. NOTE: this is just an example,
the following loaded flanking region tibble
is not usable
for actual applications.
head(flankingRegions, 5)
#> # A tibble: 5 × 10
#> Marker Chromosome Type ForwardFlank ReverseFlank Motif MotifLength Offset
#> <chr> <chr> <chr> <chr> <chr> <chr> <int> <int>
#> 1 AMELOGENIN X/Y AUTO… CCCTGGGCTCT… CAGCTTCCCAGT TCAA… 6 63
#> 2 CSF1PO 5 AUTO… CTGTTCTAAGTA CTATCTATGAA… ATCT… 4 20
#> 3 D10S1248 10 AUTO… GTCACAAACAT… GTTCCTTTAATA GGAA… 4 41
#> 4 D12S391 12 AUTO… CTGTATTAGTA… GCTGGAGACC GTCT… 4 112
#> 5 D13S317 13 AUTO… GACTCTCTGGAC CTGCCTATGGC… TATC… 4 78
#> # ℹ 2 more variables: ForwardShift <dbl>, ReverseShift <dbl>
The columns of the tibble
contains the following
information:
Marker
and Type
are the name and
chromosomal type (autosomal, X, or Y) of each marker.ForwardFlank
and ReverseFlank
contain the
marker specific sequences used to identify the STR regions. The forward
and reverse flanks should occur before and after the STR regions,
respectively.Motif
and MotifLength
are the structure
and length of the STR regions motif, respectively.Offset
is used to adjust the numeric allele designation
so it can be compared to the corresponding CE allele designation.ForwardShift
and ReverseShift
(not shown)
are used to trim the extracted sequences to just include the STR
regions. If these are set to zero, then the extracted regions contain
some useful flanking region information.Using the flankingRegions
file, we can identify the STR
regions of the sequences by calling the identifySTRRegions
function. The function takes four arguments:
reads
: The sequences to search through.flankingRegions
: A tibble
or
data.frame
in the same style as shown above.numberOfMutation
: The number of allowed mismatches,
when searching the sequences.control
: A control object setting additional parameters
(to see more type ?identifySTRRegions.control
).identifiedSTRs <- identifySTRRegions(
reads = sequences,
flankingRegions = flankingRegions,
numberOfMutation = 1,
control = identifySTRRegions.control(
numberOfThreads = 1,
includeReverseComplement = FALSE
)
)
The function returns a list with the following:
n_reads
: The total number of reads in the supplied
reads
object.reverseComplement
: TRUE/FALSE value – did we search for
the reverse complementary flanking regions?identifiedMarkers
: A list containing indecies, and the
start and end positions of STR regions for each of the markers in the
flankingRegiions
object.identifiedMarkersSequencesUniquelyAssigned
: Almost
identical to identifiedMarkers
, but if multiple markers
have been found in a sequences it is disregarded.remainingSequences
: A vector of indecies of sequences
where no marker was identified.An example of what is stored in the
identifiedMarkersSequencesUniquelyAssigned
list
names(identifiedSTRs$identifiedMarkersSequencesUniquelyAssigned$CSF1PO)
#> [1] "name" "matchedSeq"
#> [3] "startForward" "endForward"
#> [5] "startReverse" "endReverse"
#> [7] "trimmedIncludingFlanks" "trimmedQualityIncludingFlanks"
#> [9] "trimmed" "trimmedQuality"
That is, we find the name of the marker, the indecies of the sequences identified as belonging to the marker, the start of the forward flank, the end of the reverse flank, and then trimmed versions of the sequences, where ‘junk’ has been removed (any information outside the searched for flanking sequences).
Given the identified STR regions, we can aggregate the identified strings, to get their coverage. When doing so, we also supply the length of the motifs of every marker, and their type, i.e. whether the marker is located on an autosomal, the X, or the Y chromosome. In order to supply the motif length and type information, we need to know which markers were actually identified, and in which order they were identified.
This could be done as follows:
sortedIncludedMarkers <- sapply(
names(identifiedSTRs$identifiedMarkersSequencesUniquelyAssigned),
function(m) which(m == flankingRegions$Marker)
)
The aggregation is then performed by calling the
stringCoverage
function with the arguments
extractedReadsListObject
, motifLength
,
Type
, and control
. In the control
argument, we pass an argument called simpleReturn
. If this
argument is set to FALSE
, the aggregation is performed
based on both the STR region and the forward and
reverse flanks. If it is set to TRUE
, the aggregation is
just performed based on the STR regions. That is, if it is
FALSE
, we will find more distinct strings, with lower
coverage. Note that when setting simpleReturn = TRUE
, the
quality is also summarised by the geometric average for each base.
stringCoverageList <- stringCoverage(
extractedReadsListObject = identifiedSTRs,
flankingRegions = flankingRegions,
control = stringCoverage.control(
numberOfThreads = 1,
trace = FALSE,
simpleReturn = TRUE
)
)
The stringCoverage
returns a list of
tibbles
one for each of the identified markers. The result
can be seen here:
stringCoverageList$CSF1PO
#> # A tibble: 3 × 9
#> Marker BasePairs Allele Type MotifLength Region Coverage AggregateQuality
#> <chr> <int> <dbl> <chr> <int> <chr> <int> <chr>
#> 1 CSF1PO 38 9.5 AUTOSOMAL 4 CAGTA… 955 [[[[[[[[[[[[[[[…
#> 2 CSF1PO 50 12.5 AUTOSOMAL 4 CAGTA… 37 [[[[[[[[[[[[[[[…
#> 3 CSF1PO 54 13.5 AUTOSOMAL 4 CAGTA… 994 [[[[[[[[[[[[[[[…
#> # ℹ 1 more variable: Quality <list>
We see on marker CSF1PO
, we have found three distinct
markers two with a coverage above 950, and one with a coverage of 37. If
the sample only contained a single contributor, we would expect that the
two strings with a coverage of above 950 to be the alleles of the
individual. While the string with low coverage is either a stutter (as
it is in the stutter position), or an error.
If the aggregated stringCoverageList
-object, contains
DNA from a single contributor and was made with a large
amount of input DNA, we can determine the genotype of the contributor. A
way of doing so is to identify the string with the largest coverage, and
determine whether or not another allele is within some pre-defined
heterozygous threshold (by default this is set to \(0.35\)).
We determine the genotype of each marker, by using the
getGenotype
function.
The created genotypeList
-object contains the genotypes
determined by the thresholdHeterozygosity
. Continuing the
example from before, the genotypes of the marker CSF1PO
are:
genotypeList$CSF1PO
#> # A tibble: 2 × 10
#> Marker BasePairs Allele Type MotifLength Region Coverage AggregateQuality
#> <chr> <int> <dbl> <chr> <int> <chr> <int> <chr>
#> 1 CSF1PO 54 13.5 AUTOSOMAL 4 CAGTA… 994 [[[[[[[[[[[[[[[…
#> 2 CSF1PO 38 9.5 AUTOSOMAL 4 CAGTA… 955 [[[[[[[[[[[[[[[…
#> # ℹ 2 more variables: Quality <list>, Indices <int>
We see that the getGenotype
function has identified the
two strings with coverage above 950, as the potential alleles of the
marker for the contributor of the sample. Furthermore, note that the
function adds a column called Indices
containing the
indices of the corresponding strings in the
stringCoverageList
-object. That is, in this example,
Indices
contains the values 1 and 3.
In a similar way to the genotype identification, we implemented
simple rules for noise identification. A string is classified as noise
if it is less than the coverage of the most prevalent string times
thresholdSignal
. By default this factor is set to \(1\%\).
The identifyNoise
function returns a list with an
element for every marker in the stringCoverageList
-object.
If an observation is classified as noise, it is then removed from the
data. Returning to our example from before, marker CSF1PO
has no strings with less than \(1\%\)
of the coverage of the most prevalent string. That is, the following
still contains three distinct sequences:
noiseList$CSF1PO
#> # A tibble: 2 × 10
#> Marker BasePairs Allele Type MotifLength Region Coverage AggregateQuality
#> <chr> <int> <dbl> <chr> <int> <chr> <int> <chr>
#> 1 CSF1PO 54 13.5 AUTOSOMAL 4 CAGTA… 994 [[[[[[[[[[[[[[[…
#> 2 CSF1PO 38 9.5 AUTOSOMAL 4 CAGTA… 955 [[[[[[[[[[[[[[[…
#> # ℹ 2 more variables: Quality <list>, Indices <int>
Note that as with the getGenotype
function, the
identifyNoise
function adds a column called
Indices
containing the indices of the corresponding strings
in the stringCoverageList
-object. That is, in this example,
Indices
contains the values 1, 2, and 3.
Stuttering refers to the loss (or gain) of a motif in the allele sequence during PCR amplification. A predictor of the rate of stuttering (quantified by the stutter ratio or stutter proportion), is the Block Length of the Missing Motif (BLMM). If we compare the allele to a stutter sequence, then we can identify the approximate location (down to the nearest block) of the motif which stuttered. The BLMM is the length of the block (sub-sequence), which has lost a motif.
The findStutter
function for every called allele, found
with the getGenotype
function, finds all potential stutters
of the alleles and calculates the stutter ratio and stutter proportions
between the allele and potential stutter. It also identifies the missing
motif and finds the length of the corresponding block, as well as the
longest uninterrupted stretch (LUS), i.e. the length of the longest
block. It takes a merge of the stringCoverageList
and the
genotypeList
created using the
mergeGenotypeStringCoverage
function.
stringCoverageGenotypeList <- mergeGenotypeStringCoverage(stringCoverageList, genotypeList)
stutterList <- findStutter(stringCoverageGenotypeList)
stutterTibble <- subset(do.call("rbind", stutterList), !is.na(Genotype))
head(stutterTibble, 5)
#> # A tibble: 5 × 20
#> Marker Genotype ParentAllele ParentString ParentLUS ParentLUSLength
#> <chr> <chr> <dbl> <chr> <chr> <int>
#> 1 CSF1PO 9.5,13.5 13.5 CAGTAAAGATAGATAGAT… [AGAT]9 9
#> 2 D13S317 29.5 29.5 AATGGCGGGTCGATGAGG… [TATC]10 10
#> 3 D18S51 31.75,37.75 37.8 GCTGAAGAAAGAAAGAAA… [AAGA]22 22
#> 4 D3S1358 31.5,32.5 32.5 ATTACTCCATTTGGCCAA… [TCTA]12 12
#> 5 D5S818 18 18 CAGAGTAGAGTTGAGATA… [AGAT]13 13
#> # ℹ 14 more variables: ParentCoverage <int>, NeighbourAllele <dbl>,
#> # NeighbourString <chr>, Block <chr>, MissingMotif <chr>,
#> # BlockLengthMissingMotif <int>, NeighbourCoverage <int>,
#> # NeighbourRatio <dbl>, NeighbourProportion <dbl>,
#> # FLAGStutterIdentifiedMoreThanOnce <lgl>, FLAGMoreThanTwoAlleles <lgl>,
#> # FLAGAlleleDifferenceOne <lgl>, FLAGMoreThanOneBlock <lgl>,
#> # FLAGBlocksWithDifferentLengths <lgl>
Note that the output also contains a series of flags which could be useful:
FLAGStutterIdentifiedMoreThanOnce
: Is the stutter
string been identified as a stutter for more than one allele?FLAGMoreThanTwoAlleles
: Were more than two alleles
provided by the merged stringCoverageGenotypeList
object?FLAGAlleleDifferenceOne
: Was the difference between the
two alleles one motif? If TRUE
the shorter will appear as a
stutter of the longer and, thereby, skew the results of any analysis.
These cases should be removed.FLAGMoreThanOneBlock
: Was more than one block
identified when comparing the potenetial stutter and allele
sequences?FLAGBlocksWithDifferentLengths
: If
FLAGMoreThanOneBlock
was TRUE
, did the
identified blocks have different lengths? Note: This should not be
possible in theory, but if it happens it is at most a difference of one
and the longest is always chosen (as it should be more likely).Instead of calling all the above functions, we can call the workflow
function STRMPSWorkflow
. The function takes a path to a
file, and either returns a series of .RData
files in the
provided output
directory, or it returns the
stringCoverageList
object created by the
stringCoverage
function. The function can also continue
from previously created files using the continueCheckpoint
argument (if the files are placed in the output
directory).
STRMPSWorkflow(
read_path,
control = workflow.control(
restrictType = "Autosomal",
numberOfThreads = 1
)
)
There also exists a batch version of the function called
STRMPSWorkflowBatch
which takes an input
directory containing the .fastq
files to be analysed.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.