CorMID

library(CorMID)

Purpose of the package

The package's main function, CorMID, takes a numeric vector of measured ion intensities originating from a compound analyzed by GC-APCI-MS (see below) and estimates the underlying base MID together with the ratio of fragments that best explains the measurement.

Definitions

Gas chromatography (GC)

Atmospheric pressure chemical ionization (APCI)

Mass spectrometry (MS)

Mass Isotopomer Distributions (MID)


Helper functions

In addition to the main function, currently termed CorMID, I make use of two other functions: CountChemicalElements and CalcTheoreticalMDV. The first one simply reads the number following each element symbol in a chemical sum formula. Here, we use it to determine the number of carbon, silicon and sulfur atoms (neglecting nitrogen, as the 15N isotope is of low abundance). As the anticipated user will probably work on TMS-derivatized compounds, I added two letters to the chemical alphabet: T for a TMS and M for a MEOX substitution. Consequently, for the compound Glucose (5TMS 1MEOX) we would count:

fml <- "C6H12O6T5M1"
CountChemicalElements(x = fml)
#>  C  H  O  T  M 
#>  6 12  6  5  1
CountChemicalElements(x = fml, ele=c("C","Si","T","Cl"))
#>  C Si  T Cl 
#>  6  0  5  0

and receive as output a named vector for all present elements or only a selection of elements as specified by parameter ele.
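The counting step can be illustrated in base R. The following is only a minimal sketch, not the package's implementation: the function name count_elements_sketch and its regular expression are assumptions made for illustration. For the formula used above it should give the same counts as the two calls shown.

```r
# Hypothetical helper, not part of CorMID: counts element symbols in a sum formula.
count_elements_sketch <- function(x, ele = NULL) {
  # A token is one capital letter, an optional lowercase letter and optional digits.
  tokens <- regmatches(x, gregexpr("[A-Z][a-z]?[0-9]*", x))[[1]]
  symbols <- sub("[0-9]+$", "", tokens)
  counts <- sub("^[A-Za-z]+", "", tokens)
  counts[counts == ""] <- "1"  # a symbol without a number counts once
  counts <- as.integer(counts)
  # Sum the counts per symbol, keeping the order of first appearance.
  out <- vapply(unique(symbols), function(s) sum(counts[symbols == s]), integer(1))
  if (is.null(ele)) return(out)
  # Report only the requested symbols; absent ones are returned as 0.
  res <- setNames(integer(length(ele)), ele)
  present <- intersect(ele, names(out))
  res[present] <- out[present]
  res
}

count_elements_sketch("C6H12O6T5M1")
count_elements_sketch("C6H12O6T5M1", ele = c("C", "Si", "T", "Cl"))
```

Treating T and M as ordinary one-letter symbols is what lets the derivatization groups be counted without any special parsing.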

The elements with a significant amount of naturally occurring isotopes are relevant for calculating the theoretical mass distribution vector (or rather matrix) of the compound. In the above example these are effectively carbon and silicon. As we have a 5TMS Glucose molecule, we need to consider in total 21 C and 5 Si in our calculations:

fml <- "C21Si5"
td <- CalcTheoreticalMDV(fml=fml)
round(td,4)
#>       M+0    M+1    M+2    M+3    M+4    M+5    M+6
#> M0 0.5291 0.2575 0.1475 0.0471 0.0147 0.0034 0.0007
#> M1 0.0000 0.5354 0.2546 0.1464 0.0460 0.0143 0.0032
#> M2 0.0000 0.0000 0.5430 0.2522 0.1457 0.0450 0.0141
#> M3 0.0000 0.0000 0.0000 0.5566 0.2523 0.1466 0.0445
#> M4 0.0000 0.0000 0.0000 0.0000 0.5880 0.2600 0.1519
#> M5 0.0000 0.0000 0.0000 0.0000 0.0000 0.6987 0.3013
#> M6 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000

The first row of the matrix (M0) gives the relative amounts of all potential isotopologues of C21Si5 assuming natural abundance conditions. The second row (M1) shows the relative amounts when one biological carbon atom is 13C. The final row (M6) shows the relative amounts when all biological carbon atoms are 13C. The number of biological carbon atoms is estimated within the function based on the number of Si atoms. This estimate can be overridden by setting attributes on the formula: nbio defines the number of C of biological origin and nmz the number of measured ion signals above the detection limit:

attr(fml, "nmz") <- 21
attr(fml, "nbio") <- 21
round(CalcTheoreticalMDV(fml=fml)[-(5:19),-(5:19)],4)
#>       M+0    M+1    M+2    M+3
#> M0 0.5291 0.2575 0.1475 0.0471
#> M1 0.0000 0.5354 0.2546 0.1464
#> M2 0.0000 0.0000 0.5430 0.2522
#> M3 0.0000 0.0000 0.0000 0.5566
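How such a matrix can arise may be easier to see in code. The following base R sketch is an assumption about the general approach, not the code of CalcTheoreticalMDV: it convolves per-atom isotope distributions, shifts each row by the number of labeled biological carbons, and normalizes each row over the retained masses (inferred from the printed rows summing to one). The isotope abundances and all function names are illustrative assumptions, so the values only approximate the matrices printed above.

```r
# Assumed approximate natural abundances (lightest isotope first); illustrative only.
abund <- list(C = c(0.989, 0.011), Si = c(0.9223, 0.0467, 0.0310))

# Discrete convolution of two isotope distributions.
convolve_dist <- function(a, b) {
  out <- numeric(length(a) + length(b) - 1)
  for (i in seq_along(a)) {
    idx <- i:(i + length(b) - 1)
    out[idx] <- out[idx] + a[i] * b
  }
  out
}

# Distribution of n atoms of one element (n = 0 gives the neutral element 1).
element_dist <- function(p, n) {
  d <- 1
  for (k in seq_len(n)) d <- convolve_dist(d, p)
  d
}

# Hypothetical stand-in for a theoretical MDV matrix with nbio + 1 rows.
theoretical_mdv_sketch <- function(nC, nSi, nbio) {
  nmz <- nbio + 1
  m <- matrix(0, nmz, nmz,
              dimnames = list(paste0("M", 0:nbio), paste0("M+", 0:nbio)))
  for (k in 0:nbio) {
    # k biological carbons are 13C; all other atoms follow natural abundance.
    d <- convolve_dist(element_dist(abund$C, nC - k),
                       element_dist(abund$Si, nSi))
    d <- c(d, numeric(nmz))[seq_len(nmz - k)]  # pad, then keep observable masses
    m[k + 1, (k + 1):nmz] <- d / sum(d)        # shift by k and normalize the row
  }
  m
}

round(theoretical_mdv_sketch(nC = 21, nSi = 5, nbio = 6), 4)
```

In this sketch the last row contains a single 1 because only M+6 remains after truncation, which matches the structure of the matrix shown earlier.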

Main function

Idea

The problem in GC-APCI-MS that we try to overcome is the formation of fragments that give rise to superimposed MIDs. The ones we identified so far are [M+H], [M+], [M+H]-H2 and [M+H]+H2O-CH4. If we assume [M+H] to be generally the most abundant and hence use it as our fixed reference point (base MID, shift = 0), then we observe superimposed MIDs starting at -1, -2 and +2 relative to [M+H] for [M+], [M+H]-H2 and [M+H]+H2O-CH4 respectively.

The basic idea of the correction is that we measure a superimposed (composite) MID of one to several fragments, all derived from the same base MID. This base MID is exactly what we are looking for. Correcting for it is complicated because we do not know the distribution of fragments, i.e. the amounts of the individual fragments or their ratios to each other. Hence, we have to estimate a base MID and a ratio vector r giving the distribution of the fragments present, which together represent our measurement data optimally.
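One way to picture such a composite MID is as a weighted sum of shifted copies of a single distribution. The base R sketch below illustrates this under simplifying assumptions: the function name superimpose_sketch, the toy distribution and the chosen ratios are hypothetical and not taken from the package.

```r
# Hypothetical helper: overlays shifted, weighted copies of one distribution.
superimpose_sketch <- function(md, ratio, shift) {
  stopifnot(length(ratio) == length(shift))
  lo <- min(shift, 0)  # lowest mass position relative to the reference
  hi <- max(shift, 0)  # highest additional position above the reference
  n <- length(md)
  out <- numeric(n + hi - lo)
  for (j in seq_along(ratio)) {
    idx <- seq_len(n) + shift[j] - lo
    out[idx] <- out[idx] + ratio[j] * md
  }
  pos <- seq(lo, n - 1 + hi)
  names(out) <- ifelse(pos < 0, paste0("M", pos), paste0("M+", pos))
  out
}

# Toy distribution: 80% stays at shift 0, 20% is shifted down by one mass unit.
md <- c(0.6, 0.3, 0.1)
superimpose_sketch(md, ratio = c(0.8, 0.2), shift = c(0, -1))
```

The correction has to invert this mixing without knowing ratio in advance, which is why both the base MID and the ratio vector have to be estimated.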

Example

Let's start with an artificial Glucose spectrum where 10% is M6 labeled:

fml <- "C21Si5"
td1 <- CalcTheoreticalMDV(fml = fml)
bMID <- c(0.9,rep(0,5),0.1)
md1 <- apply(td1*bMID,2,sum)
round(md1,4)
#>    M+0    M+1    M+2    M+3    M+4    M+5    M+6 
#> 0.4762 0.2318 0.1328 0.0424 0.0132 0.0030 0.1006

This gives the distribution md1, i.e. our measured intensity values expressed as relative amounts. Note that the M+6 value corresponds to the 10% we specified. Now we may use CorMID to decompose this back:

CorMID(int=md1, fml=fml, r="M+H")
#>   M0   M1   M2   M3   M4   M5   M6 
#> 87.5  0.0  0.0  0.0  0.0  0.0 12.5 
#> attr(,"err")
#>        err 
#> 0.04781867 
#> attr(,"ratio")
#> M+H 
#>   1 
#> attr(,"ratio_status")
#> [1] "estimated"
#> attr(,"mid_status")
#> [1] "estimated"

Notice that we allowed only [M+H] to be present in option r. The result is a labeled vector representing the corrected MID (base MID), together with several attributes: err gives the fitting error, while ratio, ratio_status and mid_status report the options used during the function call, with mid being estimated and ratio being fixed during the function call.

We could achieve something similar by testing for other potential fragments or rearrangements via the r option:

CorMID(int=md1, fml=fml)
#>   M0   M1   M2   M3   M4   M5   M6 
#> 87.5  0.0  0.0  0.0  0.0  0.0 12.5 
#> attr(,"err")
#>        err 
#> 0.04781867 
#> attr(,"ratio")
#>       M+H        M+       M-H M+H2O-CH4 
#>         1         0         0         0 
#> attr(,"ratio_status")
#> [1] "estimated"
#> attr(,"mid_status")
#> [1] "estimated"

We essentially get the same result as before (except for the ratio-related attributes) because there is no superimposition in our test data. Now let's generate more difficult composite data to be fit by including a 20% proton loss…

md2 <- unlist(list("M-1"=0,0.8*md1)) + c(0.2*md1,0)
round(md2,4)
#>    M-1    M+0    M+1    M+2    M+3    M+4    M+5    M+6 
#> 0.0952 0.4273 0.2120 0.1147 0.0365 0.0112 0.0225 0.0805

and let CorMID decompose this back…

CorMID(int=md2, fml=fml)
#>       M0       M1       M2       M3       M4       M5       M6 
#> 88.28125  0.00000  0.00000  0.00000  0.00000  0.00000 11.71875 
#> attr(,"err")
#>        err 
#> 0.03807508 
#> attr(,"ratio")
#>       M+H        M+       M-H M+H2O-CH4 
#>      0.81      0.19      0.00      0.00 
#> attr(,"ratio_status")
#> [1] "estimated"
#> attr(,"mid_status")
#> [1] "estimated"

which is pretty close to the truth, albeit not perfect. :)

Function Details

Let's look into the details of the function. Apart from some sanity checks and data preparation steps done by the wrapper function CorMID, the main idea is to model a theoretical distribution based on a provided sum formula and to fit a base MID and fragment ratios to the measurement data. This is done by function FitMID, which is discussed in the following. The approach is brute force, using two nested estimators for ratio and MID separately. It builds on the idea of testing a crude grid of parameters first, identifying the best solution and iteratively approaching the true value by narrowing the grid.

The grid is set by an internal function poss_local. Basically, if we have a two-carbon molecule we expect a base MID of length 3 (M0, M1 and M2). Let’s assume that the true base MID is {0.9, 0, 0.1}. Using a wide grid we would then test the following possibilities:

CorMID:::poss_local(vec=c(0.5,0.5,0.5), d=0.5)
#>   Var1 Var2 Var3
#> 2    1    0    0
#> 3    0    1    0
#> 5    0    0    1

and identify {1, 0, 0} as best match after subjecting to a testing function. We decrease the step size of the grid by 50% and test in the next iteration:

CorMID:::poss_local(vec=c(1,0,0), d=0.25)
#>      [,1] [,2] [,3]
#> [1,]    1    0    0

and still identify {1, 0, 0} as best match. However, in the next iteration:

CorMID:::poss_local(vec=c(1,0,0), d=0.125)
#>      [,1] [,2] [,3]
#> [1,]    1    0    0

we will get closer to the truth and find {0.875, 0, 0.125} to give the lowest error.

In summary, using this approach we can approximate a vector of a given length whose elements sum to one, reaching a precision <0.1% in 13 steps. We can nest MID fitting inside ratio fitting and thereby do both in parallel. However, I have not yet tested this extensively for robustness (only 3 metabolites over 30 samples each).
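The coarse-to-fine search described in this section can be sketched in base R. This is not poss_local or FitMID: the functions grid_candidates_sketch and fit_mid_sketch, the choice of offsets (-d, 0, +d) and the direct comparison against a known target vector are all assumptions made to keep the example self-contained, so the candidate sets differ from those printed above.

```r
# Hypothetical candidate generator: move each element by -d, 0 or +d and keep
# only vectors that stay within [0, 1] and still sum to one.
grid_candidates_sketch <- function(vec, d) {
  steps <- lapply(vec, function(v) v + c(-d, 0, d))
  cand <- as.matrix(expand.grid(steps))
  keep <- apply(cand, 1, function(r) {
    all(r >= 0 & r <= 1) && abs(sum(r) - 1) < 1e-9
  })
  unname(cand[keep, , drop = FALSE])
}

# Hypothetical fit loop: pick the best candidate, then halve the step size.
fit_mid_sketch <- function(target, start, d = 0.5, n_iter = 13) {
  err <- function(v) sqrt(sum((v - target)^2))
  best <- start
  for (i in seq_len(n_iter)) {
    cand <- rbind(best, grid_candidates_sketch(best, d))
    best <- cand[which.min(apply(cand, 1, err)), ]
    d <- d / 2
  }
  unname(best)
}

# After three iterations (d = 0.5, 0.25, 0.125) the estimate has left {1, 0, 0}.
fit_mid_sketch(target = c(0.9, 0, 0.1), start = c(1, 0, 0), n_iter = 3)
# More iterations refine the estimate further.
fit_mid_sketch(target = c(0.9, 0, 0.1), start = c(1, 0, 0), n_iter = 13)
```

In a real fit the target is unknown, so the error would instead compare the measured intensities with the distribution predicted from each candidate.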

The error function currently employed is simply the square root of the summed squared errors in comparing the provided measurement data and the expected value based on a baseMID and a specific ratio of fragments.
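Written out in base R, an error measure of this kind could look as follows. The function name fit_error_sketch and the example vectors are hypothetical, and any scaling or normalization the package applies before comparing the two vectors is not covered here.

```r
# Hypothetical error measure: square root of the summed squared differences.
fit_error_sketch <- function(measured, expected) {
  stopifnot(length(measured) == length(expected))
  sqrt(sum((measured - expected)^2))
}

measured <- c(0.48, 0.23, 0.13, 0.16)  # made-up relative intensities
expected <- c(0.50, 0.25, 0.15, 0.10)  # made-up prediction from a candidate fit
fit_error_sketch(measured, expected)
```

A smaller value means the candidate base MID and fragment ratio reproduce the measurement more closely.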

I hope this was all understandable and the code readable. I am more than happy to receive improvements, error reports, suggestions and criticism. :)