GCalignR Step by Step

Meinolf Ottensmann, Martin A. Stoffel, Joseph Hoffman

2017-02-05

Introduction

GCalignR provides a simple means of aligning peaks from gas chromatography data based on retention times. The package also contains functions to visually evaluate the quality of the alignment and allows users to adjust the algorithm parameters to optimize it. The aligned data can easily be used as input for statistical packages such as vegan. We specifically developed and tested GCalignR as a preprocessing tool prior to the statistical analysis of chemical samples from animal skin and preen glands (see Stoffel et al. 2015 for an application of the underlying algorithm). The implemented algorithm is based purely on retention time data, which is why the quality of the alignment depends strongly on the quality of the raw data and on the parameters used for the initial peak detection. In other words, the more clearly the peaks were extracted from the chromatograms, the better the alignment will be. GCalignR has been created for situations where the main research interest lies in exploring broader patterns rather than the specific function of a certain chemical, which is unlikely to be determined correctly in all cases when only retention times are used. We also highly recommend double-checking at least part of the resulting alignment against mass-spectrometry data (if available) or by visually comparing peaks across chromatograms.

GCalignR workflow in a larger context

In the flow diagram below, we visualise the role of GCalignR within a complete workflow for analysing chemical data. After (1) analysis of the chemical samples with GC or GC-MS, software that is often proprietary is used to extract a list of peaks (retention times, peak areas, and often peak heights and other variables). Steps (3)-(7) are the alignment steps within GCalignR, detailed below. After alignment and normalisation, the output can be used as input for multivariate statistics in other packages such as vegan (8).

Installation

The development version can be downloaded from GitHub with the following code:

install.packages("devtools") 
devtools::install_github("mastoffel/GCalignR", build_vignettes = TRUE) 
library("GCalignR") 

The package documentation can be accessed with:

?GCalignR # documentation

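The following functions, each of which is demonstrated in the workflow sections below, form the core of GCalignR: check_input, peak_interspace, align_chromatograms, gc_heatmap, plot, print and norm_peaks.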

The alignment algorithm

The alignment algorithm implemented in the align_chromatograms function consists of the following steps. Here, a peak list refers to all peaks extracted from a given sample chromatogram.

  1. One sample provides a reference peak list, which is used to align all other peak lists by means of shifting them simultaneously to maximise the number of shared peaks with the reference. This step corrects systematic shifts in the retention times of chromatograms. The maximum shift can be specified with the max_linear_shift argument.

  2. After the complete chromatograms are aligned, there will still be variation in the retention time of a substance across samples. The second step corrects this unreliability of individual peak retention times and essentially tries to minimise variation within a retention time row. The maximum shift per peak can be specified with the max_diff_peak2mean argument.

  3. As an inherent consequence of the algorithm, the same substance might have been split into two retention time rows across samples. Therefore, in a third step, retention time rows are merged if they have similar retention time means and none of the samples shows peaks in both rows (on the assumption that these two rows represent a single substance). The maximum mean difference between two retention time rows can be specified with the min_diff_peak2peak argument.
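The first step can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the helper names count_shared and best_linear_shift, the candidate step size of 0.01 minutes, the matching tolerance and the retention times are all assumptions made for this example.

```r
# Hypothetical helper: count peaks of `sample_rt` that lie within `tol`
# minutes of any peak in the reference peak list `ref_rt`.
count_shared <- function(sample_rt, ref_rt, tol = 0.005) {
  sum(vapply(sample_rt,
             function(rt) any(abs(ref_rt - rt) <= tol),
             logical(1)))
}

# Hypothetical helper: try every shift within +/- max_linear_shift and keep
# the one that maximises the number of peaks shared with the reference.
best_linear_shift <- function(sample_rt, ref_rt,
                              max_linear_shift = 0.05, step = 0.01) {
  shifts <- seq(-max_linear_shift, max_linear_shift, by = step)
  shared <- vapply(shifts,
                   function(s) count_shared(sample_rt + s, ref_rt),
                   integer(1))
  # among ties, prefer the smallest absolute shift
  best <- which(shared == max(shared))
  shifts[best[which.min(abs(shifts[best]))]]
}

ref_rt    <- c(10.00, 12.50, 15.20, 18.75)  # made-up reference peak list
sample_rt <- ref_rt + 0.03                   # same peaks, drifted by 0.03 min
shift     <- best_linear_shift(sample_rt, ref_rt)
aligned   <- sample_rt + shift               # whole peak list moved at once
```

Because the whole peak list is moved by one value, this step can only correct systematic drift; peak-specific variation is left to the second and third steps.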

Optional steps:

  1. Delete peaks that occur in just one sample by setting the delete_single_peak argument to TRUE

  2. Delete all peaks that occur in negative control samples by specifying their names as argument to blanks
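The effect of the two optional filters can be sketched on a small presence/absence example. The matrix layout (substances in rows, samples in columns, 0 for an absent peak), the sample names and the order in which the filters are applied are assumptions for illustration, not a description of the package internals.

```r
# Hypothetical aligned area matrix: rows are substances, columns are samples,
# 0 marks an absent peak. "C1" plays the role of a negative control (blank).
area <- matrix(c(5, 0, 3, 0,    # s1: also present in the blank C1
                 0, 4, 6, 2,    # s2: present in three samples
                 0, 0, 7, 0,    # s3: present in one sample only
                 0, 1, 0, 9),   # s4: present in two samples
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("s", 1:4), c("C1", "A", "B", "C")))

blanks <- "C1"

# Remove substances found in any blank, then drop the blank columns themselves
in_blank <- rowSums(area[, blanks, drop = FALSE] > 0) > 0
area <- area[!in_blank, setdiff(colnames(area), blanks), drop = FALSE]

# Remove substances that occur in just one of the remaining samples
single <- rowSums(area > 0) == 1
area <- area[!single, , drop = FALSE]
```

In this toy example only s2 and s4 survive both filters.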

Input data

Extraction of chromatogram peaks outside of GCalignR

The statistical analysis of GC or GC-MS data is usually based on the detection of substance peaks within chromatograms instead of the full chromatogram. Peak detection can be done with proprietary software or free programs such as AMDIS (Stein 1999). The extraction of peaks from chromatograms relies on a detection threshold, which might severely influence the quality of the alignment with GCalignR. If, for instance, the peak data include very small peaks of very low abundance, the retention times of these peaks will have a higher potential error than those of highly abundant, sharp peaks.

The peak data of a chromatogram usually contain the retention time of a given peak plus additional information, such as the area under the peak or its height, which is used in the subsequent analysis. GCalignR uses the retention times (and not the mass-spectra, which may not be available, e.g. when using gas-chromatography coupled to a flame ionization detector (FID)) to align the peaks across individuals for subsequent chemometric analysis and pattern detection. The simple assumption is that peaks with similar retention times represent the same substances. However, it is highly recommended to verify this assumption by also comparing the mass-spectra (if available) of the substances of interest.

Input file format

The input file for GCalignR is a plain text file in which all elements are separated by tabs (sep = "\t") or by any other separator, which has to be specified with the sep argument (see ?read.table for a list of separators). The decimal separator has to be the point.

Naturally, not all chromatograms contain the same number of peaks.
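The roles of the separator and the decimal mark can be illustrated with base R's read.table. The two-column table written here is made up and only demonstrates the sep and dec settings; it is not meant to reproduce the file layout that GCalignR expects.

```r
# Write a small, made-up tab-delimited table to a temporary file
path <- tempfile(fileext = ".txt")
writeLines(c("time\tarea",
             "4.53\t3331224",
             "4.55\t1462381"), path)

# sep = "\t" declares the tab as field separator, dec = "." the decimal point
peaks <- read.table(path, header = TRUE, sep = "\t", dec = ".")

# Retention times are parsed as numbers only if the decimal mark matches
is.numeric(peaks$time)
```

If the file used a comma as decimal mark, the time column would be read as text with these settings, which is why the point is required.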

Alternative input from R

As an alternative to reading a text file, GCalignR also takes input directly from R. Here, data has to be a list of data frames. Each list element (data frame) is named after the sample it belongs to and contains the GC peak data for this sample. Again, at least two columns are required: one for the retention time and one for another variable, such as the area under the peak or its height, which is needed for norm_peaks. All column names have to be the same across data frames. The attached dataset peak_data contains data from skin swabs of Antarctic fur seals, Arctocephalus gazella (Stoffel et al. 2015).

data("peak_data")
length(peak_data) # number of individuals, i.e. number of list elements
> [1] 84
names(peak_data) # names of individuals, i.e. names of list elements 
>  [1] "C3"  "C2"  "M2"  "M3"  "M4"  "M5"  "M6"  "M7"  "M8"  "M9"  "M10"
> [12] "M12" "M14" "M15" "M16" "M17" "M18" "M19" "M20" "M21" "M23" "M24"
> [23] "M25" "M26" "M27" "M28" "M29" "M30" "M31" "M33" "M35" "M36" "M37"
> [34] "M38" "M39" "M40" "M41" "M43" "M44" "M45" "M46" "M47" "M48" "P2" 
> [45] "P3"  "P4"  "P5"  "P6"  "P7"  "P8"  "P9"  "P10" "P12" "P14" "P15"
> [56] "P16" "P17" "P18" "P19" "P20" "P21" "P23" "P24" "P25" "P26" "P27"
> [67] "P28" "P29" "P30" "P31" "P33" "P35" "P36" "P37" "P38" "P39" "P40"
> [78] "P41" "P43" "P44" "P45" "P46" "P47" "P48"
head(peak_data[[1]]) # column names and data, i.e. one data.frame of list element 
>   time     area
> 1 4.53  3331224
> 2 4.55  1462381
> 3 4.62  4834211
> 4 4.68  7754401
> 5 4.71  1267617
> 6 4.79 10356487
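A list of the same shape can also be assembled by hand. The sketch below uses made-up sample names and only checks the structural requirements described above; such a list could then be supplied as the data argument.

```r
# Two made-up samples with identical column names but different peak counts
my_peaks <- list(
  sample_A = data.frame(time = c(4.53, 4.62, 4.79),
                        area = c(3331224, 4834211, 10356487)),
  sample_B = data.frame(time = c(4.55, 4.68),
                        area = c(1462381, 7754401))
)

# Every element is a named data frame with the same column names
same_cols <- all(vapply(my_peaks,
                        function(df) identical(names(df), c("time", "area")),
                        logical(1)))

# The number of peaks is allowed to differ between samples
n_peaks <- vapply(my_peaks, nrow, integer(1))
```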

GCalignR workflow

Check the input

To check the data formatting for the most common errors, use the check_input function. It tests for conformity with the main requirements of the alignment algorithm and gives a warning message if these are not met. When a text file is used as input, the decimal separator has to be a point (not a comma). Because there are many other potential sources of error, it is advisable to verify yourself that the data are in the correct format.

# if plot = TRUE, a histogram of peaks is plotted
check_input(data = peak_data, plot = FALSE)
> All checks passed!

One of the steps in the algorithm is to adjust peaks that potentially represent the same substances but show slight differences in their retention times. For the parameter min_diff_peak2peak in the main function align_chromatograms it is therefore good to know about the difference between two peaks or substances within a chromatogram. The peak_interspace function plots the distances between adjacent peaks in the data. This will give an insight into how to specify the minimum difference in retention times of two different substances in a dataset with the min_diff_peak2peak parameter. Here we are plotting just a certain quantile_range of the data, as we are primarily interested in the minimum distance between peaks. Note, the differences between peaks presented here are derived within samples.

peak_interspace(data = peak_data, rt_col_name = "time",
                quantile_range = c(0, 0.8), quantiles = 0.05)

> $Summary
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
>  0.0000  0.0600  0.1000  0.2542  0.2500 11.3700 
> 
> $Quantiles
>   5% 
> 0.03

The histogram shows the distribution of retention-time ‘spaces’ between peaks. Most peaks are around 0.05 minutes apart from each other. From this distribution we want to infer the potential error margin around a peak that GCalignR will correct. The histogram shows that very few peaks are closer together than 0.03 minutes. Looking at the original chromatograms makes clear that these peaks are usually substances of low concentration that show a ‘double peak’, i.e. two peaks appear for what we believe is a single substance. We therefore choose a value of 0.03 for the min_diff_peak2peak parameter below. Note that this does not set a strict threshold: substances with a smaller difference in mean retention times can still be formed during the alignment, because such closely spaced peaks are known to exist within samples. We therefore suggest carefully checking the aligned data and revising the initial peak calling if required.
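The quantity behind this histogram, the distance between adjacent peaks within each sample, can be reproduced in base R. The sketch below uses two made-up peak lists and is an illustration of the idea rather than the code of peak_interspace.

```r
# Two made-up peak lists in the list-of-data-frames format
samples <- list(
  A = data.frame(time = c(10.00, 10.02, 10.30, 11.00)),
  B = data.frame(time = c(10.01, 10.25, 10.31, 12.00))
)

# Distances between adjacent peaks, computed separately within each sample
gaps <- unlist(lapply(samples, function(df) diff(sort(df$time))),
               use.names = FALSE)

summary(gaps)                        # overall distribution of the gaps
q05 <- quantile(gaps, probs = 0.05)  # lower tail hints at min_diff_peak2peak
```

Computing the differences per sample before pooling them matters: differences taken across samples would mix genuine peak spacing with retention time drift.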

Align chromatograms

The core function in GCalignR is align_chromatograms, which aligns the peak lists with the algorithm described above. See ?align_chromatograms for a detailed description of the arguments. The alignment takes from minutes to several hours on a standard computer, depending mainly on three factors: (1) the number of samples, (2) the number of peaks per sample and (3) the number of substances that are classified during the alignment procedure. Because the last of these is not known in advance, the computational effort cannot be predicted.

peak_data <- peak_data[1:4] # subset for speed reasons
peak_data_aligned <- align_chromatograms(data = peak_data, # input data
    rt_col_name = "time", # retention time variable name 
    rt_cutoff_low = 15, # remove peaks below 15 Minutes
    rt_cutoff_high = 45, # remove peaks exceeding 45 Minutes
    reference = NULL, # choose automatically 
    max_linear_shift = 0.05, # max. shift for linear corrections
    max_diff_peak2mean = 0.03, # max. distance of a peak to the mean across samples
    min_diff_peak2peak = 0.03, # min. expected distance between peaks 
    blanks = "C2", # negative control
    delete_single_peak = TRUE, # delete peaks that are present in just one sample 
    write_output = NULL) # add variable names to write aligned data to text files

The aligned data matrices are now stored in data frames, which can be accessed as follows:

peak_data_aligned$aligned$time # to access the aligned retention times
peak_data_aligned$aligned$area # to access the aligned area data

The package includes the already aligned data set for all samples:

data("aligned_peak_data") 

Visual diagnostics for the aligned data

The gc_heatmap function can be used to visualise aligned datasets. With the default option of a binary heatmap, a white fill indicates the absence of a peak in a sample. The basic rationale of the alignment is to sort peaks with very similar retention times together, as they most likely represent one substance. The heatmap shows how the individual peaks deviate from the mean retention time of a substance. As a rule of thumb, the larger the deviation, the less likely it is that the peak belongs to the same substance. This is especially true if one or a few samples deviate markedly; note, however, that some substances are more variable than others. Going back to the original chromatograms and inspecting the quality of the peak might help in such cases. The heatmap can be invoked with custom threshold values. As a first orientation, the default value of 0.05 can be used to inspect outliers among substances. The variation observed at the substance level also depends on the alignment parameters, i.e. larger deviations are to be expected with increasing distance between peaks specified by min_diff_peak2peak in align_chromatograms. See ?gc_heatmap for further options.

gc_heatmap(aligned_peak_data,threshold = 0.03) 
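The deviation that the heatmap is built on can be sketched in base R. The matrix layout (samples in rows, substances in columns, NA for an absent peak) and all values are assumptions for illustration and do not mirror the internal structure of the aligned object.

```r
# Hypothetical aligned retention times: rows are samples, columns substances;
# NA marks a sample in which the substance was not detected.
rt <- matrix(c(10.00, 12.51,
               10.01, 12.50,
               10.06,    NA),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("A", "B", "C"), c("sub1", "sub2")))

# Deviation of every peak from the mean retention time of its substance
deviation <- sweep(rt, 2, colMeans(rt, na.rm = TRUE))

# Flag peaks deviating by more than the chosen threshold
threshold <- 0.03
outlier <- !is.na(deviation) & abs(deviation) > threshold
```

Here only the peak of sample C in sub1 exceeds the threshold, which would make it a candidate for a closer look at the original chromatogram.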

The plot function shows a four-panel figure for the aligned data. The first histogram shows the number of peaks per sample before and after alignment. The number of peaks is much smaller after the alignment because peaks that were present in the control samples, as well as peaks that were found in just a single individual, have been deleted. The histogram on the bottom left shows the full chromatogram shifts (the first step in the algorithm). The bottom middle panel shows how much peaks vary around their means across samples. The histogram on the bottom right shows how many peaks are shared across samples. In this case, just a single substance is present in all samples, while a substance is most often shared by 10-12 samples (the mode of the distribution).

plot(aligned_peak_data, which_plot = "all") # plots can also be invoked separately

Also, print provides a verbal summary of the alignment procedure.

print(aligned_peak_data) 
> Summary of Peak Alignment running align_chromatograms
> Input: peak_data
> Start:  2017-02-01 18:04:11   Finished:  2017-02-01 18:41:11 
> 
> Call:
>   GCalignR::align_chromatograms(data=peak_data, rt_col_name=time,
>   max_linear_shift=0.05, blanks=(C2, C3), sep=\t, rt_cutoff_low=NULL,
>   rt_cutoff_high=NULL, reference=NULL, max_diff_peak2mean=0.02,
>   min_diff_peak2peak=0.08, delete_single_peak=FALSE)
> 
> Summary of scored substances:
>    total   blanks retained 
>      490      171      319 
> 
> In total 490 substances were identified among all samples. 171 substances were
>   present in blanks. The corresponding peaks as well as the blanks were removed
>   from the data set. 319 substances are retained after all filtering steps.
> 
> Sample overview:
>   The following 84 samples were aligned to the reference 'P31':
>   M2, M3, M4, M5, M6, M7, M8, M9, M10, M12, M14, M15, M16, M17, M18, M19, M20,
>   M21, M23, M24, M25, M26, M27, M28, M29, M30, M31, M33, M35, M36, M37, M38, M39,
>   M40, M41, M43, M44, M45, M46, M47, M48, P2, P3, P4, P5, P6, P7, P8, P9, P10,
>   P12, P14, P15, P16, P17, P18, P19, P20, P21, P23, P24, P25, P26, P27, P28, P29,
>   P30, P31, P33, P35, P36, P37, P38, P39, P40, P41, P43, P44, P45, P46, P47, P48
> 
> For further details type...
>   'gc_heatmap(aligned_peak_data)' to retrieve heatmaps
>   'plot(aligned_peak_data)' to retrieve further diagnostic plots

Normalise peaks and log+1 transformation

norm_peaks standardises the concentration of peaks across samples to obtain their relative abundance. This is an essential step prior to the analysis if the absolute concentration of chemicals varies across samples. Note that this step is required whenever retention time cut-offs, single peak deletion or blank peak removal were applied, even if the data already contained a measure of relative abundance. The output is a list of data frames containing the relative abundance of peaks for every individual; with out = "data.frame", as in the example below, a single data frame is returned instead.

## normalise area and return a data frame
scent <- norm_peaks(aligned_peak_data, conc_col_name = "area",rt_col_name = "time",out = "data.frame") 
## common transformation for abundance data to reduce the extent of mean-variance trends
scent <- log(scent + 1) 
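What normalisation and the log+1 transformation do to the numbers can be shown on a toy matrix. The layout (samples in rows, substances in columns), the values and the choice to express relative abundance in percent are assumptions for this sketch, which does not call norm_peaks itself.

```r
# Hypothetical aligned peak areas: rows are samples, columns are substances
area <- matrix(c(200, 600, 200,
                  50,   0, 150),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("A", "B"), c("sub1", "sub2", "sub3")))

# Relative abundance in percent: each peak divided by the sample's total area
rel <- area / rowSums(area) * 100

# log(x + 1) keeps absent peaks (0) at 0 while compressing large values
rel_log <- log(rel + 1)
```

Although sample A has four times the total area of sample B, both rows of rel sum to 100, so differences in absolute concentration no longer dominate the comparison.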

Visualise patterns by ordination plots using the vegan package

vegan offers a variety of useful functions for the analysis of multivariate abundance data such as the scent profiles handled here. The vegan package documentation provides a first overview.

Non-metric multidimensional scaling

## GCalignR contains factors for the chemical dataset
data("peak_factors") 
## keep order of rows consistent
scent <- scent[match(row.names(peak_factors),row.names(scent)),] 
## NMDS using Bray-Curtis dissimilarities
scent_nmds <- vegan::metaMDS(comm = scent, distance = "bray")
## get x and y coordinates
scent_nmds <- as.data.frame(scent_nmds[["points"]])  
## add the colony as a factor to each sample
scent_nmds <- cbind(scent_nmds,colony = peak_factors[["colony"]])
## ordiplot with ggplot2
library(ggplot2)
ggplot(data = scent_nmds,aes(MDS1,MDS2,color = colony)) +
    geom_point(size = 4) + 
    stat_ellipse(size = 2) + 
    labs(title = "", x = "MDS1", y = "MDS2") +  
    theme_void() + 
    theme(panel.background = element_rect(colour = "black", size = 2,fill = NA),aspect.ratio = 1)
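The Bray-Curtis dissimilarity requested with distance = "bray" can be written out for a single pair of samples. The helper name bray_curtis and the abundance values are made up for illustration; in practice vegan computes the full dissimilarity matrix.

```r
# Hypothetical helper: Bray-Curtis dissimilarity between two abundance vectors
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

a <- c(20, 60, 20)   # made-up relative abundances of sample A
b <- c(25,  0, 75)   # made-up relative abundances of sample B

d_ab <- bray_curtis(a, b)  # 0 = identical profiles, 1 = no shared substances
d_aa <- bray_curtis(a, a)  # a sample compared with itself
```

The NMDS then arranges the samples in two dimensions so that the rank order of these pairwise dissimilarities is preserved as well as possible.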

Multivariate analysis using adonis

Using adonis and betadisper we can immediately do multivariate statistics. The adonis test shows that the two colonies differ significantly, while betadisper indicates no difference in dispersion between them. Together, these results point to a location effect.

## colony effect
vegan::adonis(scent ~ peak_factors$colony,permutations = 999) 
> 
> Call:
> vegan::adonis(formula = scent ~ peak_factors$colony, permutations = 999) 
> 
> Permutation: free
> Number of permutations: 999
> 
> Terms added sequentially (first to last)
> 
>                     Df SumsOfSqs MeanSqs F.Model     R2 Pr(>F)    
> peak_factors$colony  1    2.5351 2.53514  11.492 0.1256  0.001 ***
> Residuals           80   17.6486 0.22061         0.8744           
> Total               81   20.1837                 1.0000           
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## no dispersion effect
anova(vegan::betadisper(vegan::vegdist(scent),peak_factors$colony))
> Analysis of Variance Table
> 
> Response: Distances
>           Df   Sum Sq   Mean Sq F value Pr(>F)
> Groups     1 0.000347 0.0003474   0.095 0.7587
> Residuals 80 0.292452 0.0036557

Literature

Stein, Stephen E. 1999. “An Integrated Method for Spectrum Extraction and Compound Identification from Gas Chromatography/Mass Spectrometry Data.” Journal of the American Society for Mass Spectrometry 10 (8). Elsevier: 770–81.

Stoffel, Martin A, Barbara A Caspers, Jaume Forcada, Athina Giannakara, Markus Baier, Luke Eberhart-Phillips, Caroline Müller, and Joseph I Hoffman. 2015. “Chemical Fingerprints Encode Mother–offspring Similarity, Colony Membership, Relatedness, and Genetic Quality in Fur Seals.” Proceedings of the National Academy of Sciences 112 (36). National Acad Sciences: E5005–E5012.