The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This package provides a pipeline for short tandem repeat instability analysis from fragment analysis data. The inputs are fsa files or peak tables (e.g. Genemapper 5 software peak table output), and a user supplied metadata data-frame. The functions identify ladders, calls peaks, and calculate repeat instability metrics (i.e. expansion index or average repeat gain).
This code is not for clinical use. There are features for accurate repeat sizing if you use validated control samples, but integer repeat units are not returned.
To report bugs or feature requests, please visit the Github issue tracker here. For assistance or any other inquires, contact Zach McLean.
If you use this package, please cite this paper for now.
For an easy way to get started with your own data or to run an
example, use trace::generate_trace_template()
to generate a
document with the pipeline pre-populated.
In this package, each sample is represented by an R6 ‘fragments’
object, which are organized in lists. Functions in the package iterate
over these lists, so you usually don’t need to interact with the objects
directly. If you do, the attributes of the objects can be accessed with
$, and note that most functions modify the objects in place, so
re-assignment isn’t necessary. The only exception to that
find_fragments()
, which transitions to a new object since
the class structure changes.
There are several important factors to a successful repeat instability experiment and things to consider when using this package:
(required) Each sample has a unique id, usually the file name
(optional) Baseline control for your experiment. For example,
specifying a sample where the modal allele is the inherited repeat
length (eg a mouse tail sample) or a sample at the start of a
time-course experiment. This is indicated with a TRUE
in
the metrics_baseline_control
column of the metadata.
Samples are then grouped together with the metrics_group_id
column of the metadata. Multiple samples can be
metrics_baseline_control
, which can be helpful for the
average repeat gain metric to have a more accurate representation of the
average repeat at the start of the experiment.
(optional) Batch or repeat length correction for systematic batch effects that occur with repeat-containing amplicons in capillary electrophoresis.
Repeat containing amplicons do not run linearly with internal ladder sizes in capillary electrophoresis resulting is an underestimation of repeat length if you just convert from base-pair size. These differences are not always consistent across runs which can result in batch effects in the repeat size. So, if the repeat length is to be directly compared for samples from different runs, this batch effect needs to be corrected. This is only relevant when the absolute size of a amplicons are compared for grouping metrics as described above (otherwise instability metrics are all relative and it doesn’t matter that there’s systematic batch effects across runs), when plotting traces from different runs, or if an accurate repeat length is desired.
There are two main correction approaches that are somewhat
related: either ‘batch’ or ‘repeat’ in call_repeats()
.
Batch correction is relatively simple and just requires you to link
samples across batches by indicating them from metadata. But even though
the repeat size that is return will be precise, it will not be accurate
and underestimates the real repeat length. By contrast, repeat
correction can be used to accurately call repeat lengths (which also
corrects the batch effects). However, the repeat correction will only be
as good as your sample(s) used to call the repeat length, so this can a
challenging and advanced feature. You need to use a sample that reliably
returns the same peak as the modal peak, or you need to be willing to
understand the shape of the distribution and manually validate the
repeat length of each control sample for each run.
If starting from fsa files, the GeneScan™ 1200 LIZ™ dye Size Standard ladder assignment may not work very well. The ladder identification algorithm is optimized for GeneScan™ 500 LIZ™ or GeneScan™ 600 LIZ™ or other ladders with relatively few peaks. The 1200 LIZ™ ladder has a challenging pattern of ladder peaks to automatically assign. However, these ladders can be fixed by playing with the various parameters or manually with the built-in fix_ladders_interactive() app.
You can install from GitHub with:
# install.packages("pak")
::pak("zachariahmclean/trace") pak
Then load the package:
library(trace)
First, we read in the raw data. In this case we will used example
data within this package, but usually this would be fsa files that are
read in using read_fsa()
. The example data is also cloned
since the next step modifies the object in place.
<- lapply(cell_line_fsa_list, function(x) x$clone()) fsa_list
The known ladder sizes are assigned to peaks in the ladder channel and bp are predicted for each scan.
find_ladders(
fsa_list,show_progress_bar = FALSE
)
visually inspect each ladder to make sure that the ladders were correctly assigned
plot_ladders(fsa_list[1])
If the ladders are are not assigned correctly, you can adjust parameters or manually using the built-in fix_ladders_interactive() app.
The fragment peaks are identified in the raw continuous trace data. These objects are assigned because find_fragments transitions to a new object since the class structure changes. This reflects the data moving from a continuous trace to a peak table.
<- find_fragments(
fragments_list
fsa_list,min_bp_size = 300
)
Visually inspect the traces and called peaks to make sure they were correctly assigned.
plot_traces(fragments_list[1],
xlim = c(400, 550),
ylim = c(0, 1200)
)
Alternatively, this is where you would use data exported from
Genemapper if you would rather use the Genemapper bp sizing and peak
identification algorithms. However, this is not recommended as some of
the functionality of this package would not be accessible (mainly in
call_repeats()
, with batch_correction
and
repeat calling algorithms)
<- peak_table_to_fragments(example_data,
fragments_list_genemapper data_format = "genemapper5",
dye_channel = "B",
min_size_bp = 300
)
Metadata can be incorporated to allow additional functionality in
call_repeats()
(batch or repeat correction) and
assign_index_peaks()
(assigning index peak from another
sample). Prepare a file (eg spreadsheet saved as .csv) with the
following columns. If you use the specified column names, it will be
automatically parsed by add_metadata()
, otherwise you will
need to match up which column name belongs to which metadata category
(as done below in add_metadata()
):
Metadata table column | Functionality metadata is associated with | Description |
---|---|---|
unique_id | Required for adding metadata using add_metadata() |
The unique identifier for the fsa file. Usually the sample file name. This must be unique, including across runs. |
metrics_group_id | assign_index_peaks() , allows setting
grouped |
This groups the samples for instability metric calculations. Provide a group id value for each sample. For example, in a mouse experiment and using the expansion index, you need to group the samples since they have the same metrics baseline control (eg inherited repeat length), so provide the mouse id. |
metrics_baseline_control | assign_index_peaks() , allows setting
grouped |
This is related to metrics_group_id. Indicate with ‘TRUE’ to specify which sample is the baseline control (eg mouse tail for inherited repeat length, or day-zero sample in cell line experiments) |
batch_run_id | call_repeats() , allows setting correction
= “batch” or “repeat” |
This groups the samples by batch. Provide a value for each fragment analysis run (eg date). |
batch_sample_id | call_repeats() , allows setting correction
= “batch” or “repeat” |
This groups the samples across batches. Give a unique sample id to each different sample. |
batch_sample_modal_repeat | call_repeats() , allows setting correction
= “repeat” |
The validated modal repeat length for the particular
batch_sample_id sample used to accurately call repeat
length. |
add_metadata(
fragments_list = fragments_list,
metadata_data.frame = metadata,
unique_id = "unique_id",
metrics_group_id = "metrics_group_id",
metrics_baseline_control = "metrics_baseline_control",
batch_run_id = "batch_run_id",
batch_sample_id = "batch_sample_id",
batch_sample_modal_repeat = "batch_sample_modal_repeat"
)
Next we identify the modal peaks with find_alleles()
and
convert the base pair fragments to repeats with
call_repeats()
.
find_alleles(fragments_list)
call_repeats(fragments_list)
We can view the distribution of repeat sizes and the identified modal peak with a plotting function.
plot_traces(fragments_list[1], xlim = c(110, 150))
A key part of several instability metrics is the index peak. This is
the repeat length used as the reference for relative instability metrics
calculations, like expansion index or average repeat gain. In the
metadata, samples are grouped by a metrics_group_id
and a
subset of the samples are set as metrics_baseline_control
,
meaning they are the samples taken at day 0 in this experiment. This
allows us to set grouped = TRUE
and set the index peak for
the expansion index and other metrics. For mice, if just a few samples
have the inherited repeat signal shorter than the expanded population,
you could not worry about this and instead use the
index_override_dataframe
in
assign_index_peaks()
.
assign_index_peaks(
fragments_list,grouped = TRUE
)
We can validate that the index peaks were assigned correctly with a dotted vertical line added to the trace. This is perhaps more useful in the context of mice where you can visually see when the inherited repeat length should be in the bimodal distribution.
plot_traces(fragments_list[1], xlim = c(110, 150))
All of the information we need to calculate the repeat instability
metrics has now been identified. We can finally use
calculate_instability_metrics
to generate a dataframe of
per-sample metrics.
<- calculate_instability_metrics(
metrics_grouped_df fragments_list = fragments_list,
peak_threshold = 0.05
)
These metrics can then be used to quantify repeat instability. For example, this reproduces Figure 7e of our manuscript.
First, prepare the data for plotting by removing poor quality samples and finding the average repeat gain relative to the DMSO group for each cell line
library(dplyr)
<- metrics_grouped_df |>
plot_data ::left_join(metadata, by = dplyr::join_by(unique_id)) |>
dplyr::filter(
dplyr> 0,
day > 500
modal_peak_signal |>
) ::group_by(metrics_group_id) |>
dplyr::mutate(
dplyrrel_gain = average_repeat_change / median(average_repeat_change[which(treatment == 0)]),
genotype = forcats::fct_rev(genotype)
)
Then we can plot the instability metrics
library(ggplot2)
ggplot(
plot_data,aes(genotype, rel_gain, colour = genotype)
+
) geom_boxplot(outlier.shape = NA) +
geom_jitter() +
facet_wrap(vars(as.factor(treatment)), nrow = 1) +
labs(
y = "Average repeat gain\n(relative to DMSO)",
x = "PMS1 pseudoexon status"
+
) theme(legend.position = "none")
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.