Quality control checks are an important part of any high-throughput
dataset analysis, and analyzing NR-seq data is no different. Therefore,
EZbakR provides a function, EZQC(), to help you identify
potential problems in your NR-seq data. In this section, we will discuss
how to run EZQC() and what it looks for. I will also
provide some suggestions for how best to design and analyze NR-seq
experiments. Let’s start by loading EZbakR, which we will use throughout
this vignette:
library(EZbakR)
NOTE: EZQC() is EZbakR’s instantiation
of bakR’s QC_checks(), though it differs in a number of key
ways. Thus, don’t expect its output or behavior to exactly mimic that of
QC_checks(). That being said, much of the discussion of
QCing NR-seq data in bakR’s
vignette is still relevant for interpreting EZbakR’s output.
EZQC() can take two different inputs:
1. An EZbakRData object. This can be created with
   EZbakRData().
2. An EZbakRFractions object. This is an
   EZbakRData object on which you have run
   EstimateFractions().
An example of each is shown below:
simdata <- EZSimulate(250)
ezbdo <- EZbakRData(simdata$cB,
                    simdata$metadf)
### Input: EZbakRData object
qc <- EZQC(ezbdo)
#> CHECKING RAW MUTATION RATES...
#> CHECKING INFERRED MUTATION RATES...
#> CHECKING READ COUNT CORRELATIONS...
#> log10(read counts) correlation for each pair of replicates are:
#> # A tibble: 12 × 3
#> sample_1 sample_2 correlation
#> <chr> <chr> <dbl>
#> 1 sample1 sample2 0.983
#> 2 sample1 sample3 0.981
#> 3 sample1 sample7 0.983
#> 4 sample2 sample3 0.981
#> 5 sample2 sample7 0.981
#> 6 sample3 sample7 0.979
#> 7 sample4 sample5 0.982
#> 8 sample4 sample6 0.981
#> 9 sample4 sample8 0.982
#> 10 sample5 sample6 0.983
#> 11 sample5 sample8 0.983
#> 12 sample6 sample8 0.983
#>
#> log10(read counts) correlations are high, suggesting good reproducibility!
#>
### Input: EZbakRFractions object
ezbdo <- EstimateFractions(ezbdo)
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> Processing output
qc_fn <- EZQC(ezbdo)
#> CHECKING RAW MUTATION RATES...
#> CHECKING INFERRED MUTATION RATES...
#> CHECKING READ COUNT CORRELATIONS...
#> log10(read counts) correlation for each pair of replicates are:
#> # A tibble: 12 × 3
#> sample_1 sample_2 correlation
#> <chr> <chr> <dbl>
#> 1 sample1 sample2 0.983
#> 2 sample1 sample3 0.981
#> 3 sample1 sample7 0.983
#> 4 sample2 sample3 0.981
#> 5 sample2 sample7 0.981
#> 6 sample3 sample7 0.979
#> 7 sample4 sample5 0.982
#> 8 sample4 sample6 0.981
#> 9 sample4 sample8 0.982
#> 10 sample5 sample6 0.983
#> 11 sample5 sample8 0.983
#> 12 sample6 sample8 0.983
#>
#> log10(read counts) correlations are high, suggesting good reproducibility!
#>
#> CHECKING FRACTION LABELED DISTRIBUTIONS...
#> Average fractions for each sample are:
#> # A tibble: 6 × 3
#> sample avg_fraction fraction_type
#> <chr> <dbl> <chr>
#> 1 sample1 0.287 fraction_highTC
#> 2 sample2 0.283 fraction_highTC
#> 3 sample3 0.286 fraction_highTC
#> 4 sample4 0.288 fraction_highTC
#> 5 sample5 0.286 fraction_highTC
#> 6 sample6 0.283 fraction_highTC
#>
#> Labeling rates (e.g., fraction labeled for single label experiments) look good!
#>
#> CHECKING FRACTION LABELED CORRELATIONS...
#> logit(fraction_highTC) correlation for each pair of replicates are:
#> # A tibble: 6 × 3
#> sample_1 sample_2 correlation
#> <chr> <chr> <dbl>
#> 1 sample1 sample2 0.959
#> 2 sample1 sample3 0.966
#> 3 sample2 sample3 0.964
#> 4 sample4 sample5 0.958
#> 5 sample4 sample6 0.953
#> 6 sample5 sample6 0.950
#>
#> logit(fraction_highTC) correlations are high, suggesting good reproducibility!
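To give a feel for what the raw mutation rate and read count correlation checks reported above are measuring, here is a minimal base R sketch on a toy cB-style table. It is not EZQC()'s actual implementation: the column names (sample, XF, TC, nT, n) are assumptions modeled on the cB convention, and all values are made up for illustration.

```r
# Toy cB-style table (hypothetical values). Each row is a unique combination
# of sample, feature (XF), T-to-C mutation count (TC), and T count (nT);
# n is the number of reads sharing that combination.
cB <- data.frame(
  sample = rep(c("s1", "s2"), each = 6),
  XF     = rep(c("geneA", "geneA", "geneB", "geneB", "geneC", "geneC"), 2),
  TC     = c(0, 2, 0, 1, 0, 1,
             0, 3, 0, 1, 0, 2),
  nT     = c(20, 25, 30, 20, 25, 25,
             20, 25, 30, 20, 25, 25),
  n      = c(90, 10, 40, 10, 180, 20,
             90, 20, 40, 5, 200, 10)
)

# Raw mutation rate per sample: total mutated Ts divided by total Ts
# covered by reads, weighting each row by its read count n.
mut_total   <- tapply(cB$TC * cB$n, cB$sample, sum)
t_total     <- tapply(cB$nT * cB$n, cB$sample, sum)
raw_mutrate <- mut_total / t_total

# Read counts per feature and sample, followed by the correlation of
# log10(read counts) between the two (assumed replicate) samples.
counts        <- xtabs(n ~ XF + sample, data = cB)
readcount_cor <- cor(log10(counts[, "s1"]), log10(counts[, "s2"]))

print(raw_mutrate)
print(readcount_cor)
```

With real data, the same quantities would be computed across all features and samples; the toy table only has three features so that the arithmetic is easy to follow by hand.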
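In the first case (EZbakRData input), the following are
checked:
1. Raw mutation rates
2. Inferred mutation rates
3. Read count correlations
In the second case (EZbakRFractions input), all of the
same things are checked, with the addition of:
1. Fraction labeled distributions
2. Fraction labeled correlations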
Replicate correlation is an intuitive QC metric that ensures replicates agree with each other well. The other three metrics are NR-seq specific metrics that assess the extent to which metabolic label was readily incorporated into nascent RNA, and the appropriateness of the metabolic label feed time used (i.e., how long cells were fed with the label, sometimes referred to as the label time). In general, you want to see:
Inside the output objects (qc and qc_fn
in the above code), you will find a number of plots. The output is a
named list, with one or more plots in each element of this list.
The named elements and their contents are as follows:
- Raw_mutrates: If you have a single mutation type in
  your cB, this will be a barplot of raw mutation rates of that type in
  all samples. If you have multiple mutation types, then this will be a
  named list, with one element per mutation type, each element being
  this same barplot.
- Inferred_mutrates: If you have a single mutation type
  in your cB, this will be a barplot of the inferred labeled and unlabeled
  read mutation rates in all samples. If you have multiple mutation types,
  then it will be a named list of similar barplots for each mutation type,
  as in Raw_mutrates.
- Readcount_corr: This is a list whose elements are
  collections of read count correlation plots for a given group of
  replicates. Groups of replicates are determined by the metadf provided
  in your EZbakRData object. +label and -label replicates of
  a given treatment condition are grouped together. Different label times
  for the same treatment condition are also grouped together here.
- Fraction_labeled_dist: This is a list of density plots
  for each label-fed sample. In the case of a single-label experiment,
  these will be the distributions of fraction labeled estimates for each
  sample. If you have multiple labels/mutation types, then this will be a
  list with density plots for each estimated fraction.
  Only present if you provided an EZbakRFractions
  object as input!
- Fraction_labeled_corr: Same as
  Readcount_corr, but this time correlating fraction
  estimates provided by EstimateFractions(). Unlike in
  Readcount_corr, this excludes -label samples and considers
  distinct label times to be distinct replicates.
  Only present if you provided an EZbakRFractions
  object as input!

Potential problem #1: Labeled read mutation rates are low
Possible solutions:
- Set pold_from_no_label=TRUE in your
  call to EstimateFractions(). In fact, I even like
  defaulting to this setting when -label samples are available.

Potential problem #2: Labeled read mutation rates are acceptable, but raw mutation rates are low. In this case, fraction estimate replicate correlation may also be low.
Possible solutions:
- Setting pold_from_no_label=TRUE in your
  EstimateFractions() call, as described above, is likely a
  good idea to ensure accurate labeled read mutation rate inference.

Potential problem #3: Poor correlation of read counts across replicates, especially between -label and +label samples, or between +label samples with different label times.
This may be a sign of dropout, which is when reads from highly labeled RNA are underrepresented in +label data. You can try:
- Running CorrectDropout() to try to
  bioinformatically correct for the biases induced by dropout. This
  requires you to have -label data from all distinct biological
  conditions.

Suggestions 1 and 2 can help no matter the cause of dropout (e.g., adverse effects of the label, RNA handling problems, read alignment problems, etc.). The other three suggestions are better suited to cases where dropout does not seem to have a biological origin. A general, proven strategy to distinguish the cause of dropout does not exist, but you may want to try assessing trends in sequencing tracks (e.g., look for a dropoff in coverage near the 3’ end of transcripts upon increased labeling) or performing differential expression and GO analysis of +label and -label samples to assess potential upregulation of transcriptional repression stress pathways.
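To make the notion of dropout concrete, the following base R sketch simulates the signature described above: if reads from labeled RNA are preferentially lost, the ratio of +label to -label read counts should fall as a feature's fraction labeled rises. Everything here is hypothetical (the variable names, the counts, and the assumed per-read loss probability pdo), and it ignores normalization between libraries; it illustrates the idea and is not how CorrectDropout() works internally.

```r
# Simulated per-feature values (all hypothetical, for illustration only)
fraction_labeled <- seq(0.05, 0.95, length.out = 10)
nolabel_counts   <- c(500, 800, 350, 1200, 650, 900, 400, 1000, 750, 600)

# Assumption: each read from labeled RNA is lost with probability pdo,
# so the expected +label count shrinks by a factor (1 - pdo * fraction labeled).
pdo <- 0.4
pluslabel_counts <- round(nolabel_counts * (1 - pdo * fraction_labeled))

# Dropout signature: the log ratio of +label to -label read counts
# decreases as fraction labeled increases.
log_ratio     <- log(pluslabel_counts / nolabel_counts)
dropout_trend <- cor(fraction_labeled, log_ratio)

print(data.frame(fraction_labeled, nolabel_counts, pluslabel_counts, log_ratio))
print(dropout_trend)
```

A strongly negative correlation in a comparison like this is consistent with dropout, though in real data you would first need to account for differences in sequencing depth between the +label and -label libraries.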
Below I will list and discuss what, in my opinion, makes an ideal NR-seq dataset: