Using MAGNAMWAR: a case study with Drosophila melanogaster

Corinne Sexton

2017-04-26

Introduction

This vignette describes a recommended workflow to take full advantage of MAGNAMWAR. The data are from a study on bacterial determinants of Drosophila melanogaster triglyceride content, and are representative of any number of datasets that associate one phenotype with one bacterial species. The data included in the package and used in documentation examples are highly subsetted (2 orders of magnitude smaller) from the original dataset to increase speed of the example analyses. Most analyses can be run on a standard laptop computer, although we recommend a desktop computer with at least 16GB RAM for large (>500 phenotype measures) datasets. The functions are presented in their recommended order of operations using the case study fruit fly triglyceride data included in the package.

1. A Brief Outline

The following table outlines the specific input and outputs per function in the order each function should be used. Only the essential functions are listed; optional functions are discussed in their relevant sections.

Function Purpose Input Output Case Study Input Case Study Output
format_MCLfastas prep amino acid fasta files for OrthoMCL analysis 4-letter abbreviated amino acid fasta files of every host-mono-associated organism concatenated fasta file of all inputs, removes any duplicate ids fastas in extdata/fasta_dir/ MCLformatted_all.fasta
format_afterOrtho format output of OrthoMCL clusters to be used with analyses groups file from OrthoMCL Parsed data contained in a list of 2 objects (presence absence matrix, protein ids) extdata/groups_ example_r.txt after_ortho_format_grps
analyzeOrthoMCL main analysis of data format_afterOrtho output; phenotypic data a matrix with 7 variables (cluster group, p-value, corrected p-value…) after_ortho_format_grps; pheno_data mcl_mtrx_grps
join_repseq appends representative sequences with analyzeOrthoMCL matrix format_afterOrtho output; fasta files; output matrix of analyzeOrthoMCL a data frame of the joined matrix after_ortho_format_grps; extdata/fasta_dir/; mcl_mtrx_grps joined_mtrx_grps

2. Format For OrthoMCL Gene Clustering

The first step is to assign each gene to a cluster of orthologous groups. This pipeline uses OrthoMCL for clustering, either through a local install (requires extensive RAM; see supplementary data) or web-based executable (http://orthomcl.org/orthomcl/; max 100,000 sequences). OrthoMCL software requires specific fasta sequence header formats and that there are no duplicate protein ids. The fasta header format contains 2 pipe-separated pieces of information: a unique 3-4 alphanumeric taxon designation and a unique protein ID classifier, e.g. >apoc|WP_000129691. Two functions aid in file formatting: format_MCLfastas, which formats genbank files, and RASTtogbk, which formats files from RAST.

format_MCLfastas

format_MCLfastas converts amino-acid fasta files to an OrthoMCL compliant format and combines them into a concatenated file called “MCLformatted_all.fasta”. All amino acid fasta files must first be placed in a user-specified directory and given a 3-4 letter alphanumeric name, beginning with a non-numeric character (e.g. aac1.fasta). Example fasta files are included in the MAGNAMWAR package. The output “MCLformatted_all.fasta” file is the input for the OrthoMCL clustering software.

RASTtogbk

For fasta files that are not in the NCBI format, they should be converted in a separate folder to an NCBI-compatible format before running format_MCLfastas. For example, files in the RAST format have the initial header >fig|unique_identifer; not >ref|xxxx|xxxx|unique_identifier|annotation. To convert these or any other files to NCBI-compatible format, use the RASTtogbk function to merge the annotation using the unique identifier as a lookup, specifying the name of the fasta file (with a 3-4 letter alphanumeric name beginning with a non-numeric character), path to the reference file containing the annotation, and the folder bearing the .If the reference file bearing the annotation is from RAST the lookup file can be downloaded as the ‘spreadsheet (tab separated text format)’. If the reference file is from a different source, format it so that the unique identifier is in the 2nd column and the reference annotation is in the 8th column.

lfrc_fasta <- system.file('extdata', 'RASTtogbk//lfrc.fasta', package='MAGNAMWAR')
lfrc_reference <- system.file('extdata', 'RASTtogbk//lfrc_lookup.csv', package='MAGNAMWAR')
lfrc_path <- system.file('extdata', 'RASTtogbk//lfrc_out.fasta', package='MAGNAMWAR')

RASTtogbk(lfrc_fasta,lfrc_reference,lfrc_path)
## Finished writing out /private/var/folders/d3/bdwl6h8d3n379df4lxv5s7b00000gn/T/RtmpZ3DIwz/Rinstbed16f8fe867/MAGNAMWAR/extdata/RASTtogbk//lfrc_out.fasta

2.5 Call Gene Clusters with OrthoMCL Software

After formatting fasta files, run OrthoMCL clustering software either locally or online.

More information about OrthoMCL clustering software can be found at: http://www.orthomcl.org/.

3. Format Clusters for Statistical Analysis

format_afterOrtho

The format_afterOrtho function reformats the direct output from OrthoMCL for the MAGNAMWAR pipeline. To use the command, call format_afterOrtho, specifying the location of the OrthoMCL clusters file (online: OrthologGroups file; local: output of orthomclMclToGroups command) and whether the software was run online (“ortho”; default) or locally (“groups”). The following sample data were derived using local clustering.

# file_groups is the file path to your output file from OrthoMCL

parsed_data <- format_afterOrtho(file_groups, "groups")

The new stored variable is a list of matrices. The first matrix is a presence/absence matrix of taxa that bear each cluster of orthologous groups (OGs). The second matrix lists the specific protein ids within each cluster group. The data in each can be accessed by calling parsed_data[[1]] or parsed_data[[2]], respectively, although this is not necessary for subsequent pipeline steps. Because it contains the specific protein ids for each OG, this list is a key input for most subsequent functions.

a subset of resulting variable parsed_data:

  1. OG presence/absence matrix:

parsed_data[[1]][,1:13]

##           aace apap apas atro bsub dmos ecol efao efav ehor geur gfra ghan
## ex_r00000 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00001 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00002 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00003 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "0"  "1" 
## ex_r00004 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00005 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00006 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00007 "1"  "1"  "1"  "1"  "1"  "0"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00008 "0"  "1"  "1"  "0"  "1"  "0"  "1"  "1"  "1"  "1"  "1"  "0"  "0" 
## ex_r00009 "1"  "1"  "0"  "0"  "1"  "0"  "1"  "0"  "0"  "1"  "0"  "0"  "1" 
## ex_r00010 "0"  "1"  "0"  "1"  "1"  "0"  "1"  "1"  "1"  "1"  "1"  "1"  "1"
  1. Protein IDs in each OG:

parsed_data[[2]][,1:2]

##     ex_r00000                ex_r00001               
## V2  "aace|ZP_08699022.1"     "bsub|NP_389730.2"      
## V3  "apap|940282.4.peg.2315" "bsub|NP_390261.1"      
## V4  "apas|ZP_16282535.1"     "aace|ZP_08696506.1"    
## V5  "atro|ZP_08644184.1"     "apap|940282.4.peg.1501"
## V6  "bsub|NP_391359.1"       "apas|ZP_16281966.1"    
## V7  "dmos|ZP_08468863.1"     "atro|ZP_08644394.1"    
## V8  "ecol|NP_415408.1"       "dmos|ZP_08469662.1"    
## V9  "efao|YP_005708185.1"    "ecol|NP_414920.1"      
## V10 "efav|NP_815059.1"       "efao|YP_005708912.1"   
## V11 "ehor|ZP_08496675.1"     "efav|NP_816072.1"      
## V12 "geur|ZP_08902778.1"     "ehor|ZP_08497259.1"    
## V13 "gfra|ZP_11376413.1"     "geur|ZP_08902509.1"    
## V14 "ghan|ZP_06835021.1"     "gfra|ZP_11376782.1"    
## V15 "gobo|ZP_08896945.1"     "ghan|ZP_06833288.1"    
## V16 "gxyl|YP_004867578.1"    "gobo|ZP_08898428.1"    
## V17 "lani|ZP_08549469.1"     "gxyl|YP_004868985.1"   
## V18 "lbre|ZP_03940694.1"     "lani|ZP_08548797.1"    
## V19 "lbuc|YP_004398648.1"    "lbuc|YP_004398815.1"   
## V20 "lcas|YP_006751102.1"    "lcas|YP_006752056.1"   
## V21 "lfal|ZP_08312985.1"     "lfal|ZP_08312677.1"    
## V22 "lfer|ZP_03944435.1"     "lfer|ZP_03944015.1"    
## V23 "lfru|ZP_08652182.1"     "lfru|ZP_08652638.1"    
## V24 "llac|1358.26.peg.2055"  "llac|1358.26.peg.706"  
## V25 "lmai|ZP_09447100.1"     "lmai|ZP_09447014.1"    
## V26 "lmal|ZP_09441330.1"     "lmal|ZP_09441246.1"    
## V27 "lpla|YP_004888740.1"    "lpla|YP_004888561.1"   
## V28 "lrha|YP_003170666.1"    "lrha|YP_003171609.1"   
## V29 "lver|ZP_09442552.1"     "lver|ZP_09443795.1"    
## V30 "pbur|ZP_16361594.1"     "pbur|ZP_16363149.1"    
## V31 "pput|YP_001266156.1"    "pput|YP_001270272.1"   
## V32 "smar|AMF-0004114"       "smar|AMF-0001686"      
## V33 "spar|YP_006309044.1"    "spar|YP_006310046.1"   
## V34 "swit|YP_001264509.1"    "swit|YP_001264948.1"   
## V35 "efao|YP_005709159.1"    "efao|YP_005707743.1"   
## V36 "efav|NP_816368.1"       "efav|NP_814698.1"      
## V37 "lfer|ZP_03944497.1"     "llac|1358.26.peg.28"   
## V38 "lver|ZP_09443619.1"     ""

4. Perform Metagenome-wide Association

analyzeOrthoMCL

The analyzeOrthoMCL function performs the statistical tests to compare the phenotypes of taxa bearing or lacking each OG. It requires the R object output of the format_afterOrtho function and a phenotype matrix containing the variables for the statistical tests. Seven different tests are supported, each deriving the significance of OG presence/absence on a response variable, and specified by the following codes:

To run analyzeOrthoMCL, the following parameters are required:

  1. the output of format_afterOrtho
  2. a data frame (NOT a path) of phenotypic data*
  3. model name (as described above)
  4. name of column containing 4-letter taxa designations
  5. column names of certain variables depending on which model is specified:
    • lm: resp
    • wx: resp
    • lmeR1: resp, rndm1
    • lmeR2ind: resp, rndm1, rndm2
    • lmeR2nest: resp, rndm1, rndm2
    • lmeF2: resp, rndm1, rndm2, fix2
    • survmulti: time, event, rndm1, rndm2, multi
    • survmulticensor: time, time2, event, rndm1, rndm2, fix2, multi, output_dir

In order to create a data frame for your phenotypic data use the following command where file_path is the path to your phenotypic data file: (the file_path used in this case study is not available because pheno_data exists as an .rda file)*

pheno_data <- read.table(file_path, header = TRUE, sep = ',')

A subset of the phenotypic file for TAG content pheno_data is shown below:

##   Treatment  RespVar Vial Experiment
## 1      apoc 2.821429    A          A
## 2      apoc 5.500000    B          A
## 3      apoc 8.464286    C          A
## 4      jacg 5.392857    A          A
## 5      jacg 2.821429    B          A

To run analyze_OrthoMCL use the headers within pheno_data to specify variables:

It is not necessary to specify the OG presence/absence variable, which is automatically populated from the format_afterOrtho output. All other variables must be specified using column names in the phenotype matrix.

We will thus populate analyzeOrthoMCL using the headers within pheno_data to specify variables:

For example, to test the effect of gene presence/absence on fat content using a mixed model with nested random effects, the following command should be used:

mcl_matrix <- analyze_OrthoMCL(parsed_data, 
                               pheno_data, 
                               model = 'lmeR2nest', 
                               species_name = 'Treatment',  
                               resp = 'RespVar', 
                               rndm1 = 'Experiment', 
                               rndm2 = 'Vial')

The resulting output matrix contains 7 columns:

  1. OG - Clustered Orthologous Group from OrthoMCL
  2. pval1 - p-value of test
  3. corrected_pval1 - Bonferroni corrected p-value
  4. mean_OGContain - mean of phenotypic data value of all taxa in OG
  5. mean_OGLack - mean of phenotypic data value of all taxa not in OG
  6. taxa_contain - taxa in OG
  7. taxa_miss - taxa not in OG

mcl_mtrx[,1:3] Showing the first three columns of mcl_matrix.

##       OG          pval1                  corrected_pval1      
##  [1,] "ex_r00000" "0.0680152959317386"   "0.748168255249125"  
##  [2,] "ex_r00001" "0.0680152959317386"   "0.748168255249125"  
##  [3,] "ex_r00002" "0.000169433279973097" "0.00186376607970407"
##  [4,] "ex_r00003" "0.217474911482261"    "1"                  
##  [5,] "ex_r00004" "0.4482634415059"      "1"                  
##  [6,] "ex_r00005" "0.0899794361093273"   "0.989773797202601"  
##  [7,] "ex_r00007" "0.999790593888518"    "1"                  
##  [8,] "ex_r00008" "0.00669725842125168"  "0.0736698426337685" 
##  [9,] "ex_r00009" "0.186956618632854"    "1"                  
## [10,] "ex_r00010" "0.00824936629366202"  "0.0907430292302822"

5. Post-statistical analysis

join_repseq

join_repseq randomly selects a representative protein annotation and amino acid sequence from each OG and appends it to the analyzeOrthoMCL output matrix. The purpose is to identify OG function. The result is a data frame bearing 4 additional variables:

  1. rep_taxon - taxon that contains the representative sequence
  2. rep_id - representative protein id
  3. rep_annot - representative protein annotation
  4. rep_seq - representative protein sequence

Four inputs are specified for join\_repseq:

  1. The list produced by format_afterOrtho
  2. The fasta directory of the 4-letter abbreviated fasta files
  3. The matrix produced by analyzeOrthoMCL
  4. The NCBI sequence format (“old”(default) or “new”)
dir <- system.file('extdata', 'fasta_dir', package='MAGNAMWAR')
dir <- paste(dir,'/',sep='')
joined_matrix <- join_repseq(after_ortho_format_grps, dir, mcl_mtrx_grps, fastaformat = 'old')
## 
## picking and writing representative sequence for PDG:
## 9.09%__18.18%__27.27%__36.36%__45.45%__54.55%__63.64%__72.73%__81.82%__90.91%__

joined_matrix[1,8:10] An example of three of the appended columns are shown below:

##   rep_taxon        rep_id               rep_annot
## 1      lfru ZP_08652182.1  thioredoxin reductase

6. Exporting Data

MAGNAMWAR statistical analysis can be exported into either tab-separated matrices or into graphical elements.

6a. Matrices

printOGseqs

Allows the user to output all protein sequences in a specified OG in fasta format.

surv_append_matrix

When survival models are used in the MGWA a single .csv file is printed for each test. surv_append_matrix joins each individual file into one complete matrix.

write_mcl

Writes the matrix result from analyzeOrthoMCL into a tab-separated file.

6b. Graphics

MAGNAMWAR also offers several options for graphical outputs. The several ways to visualize data are explained in detail below.

pdgplot

pdgplot visualizes the phenotypic effect of bearing a OG. The phenotype matrix is presented as a bar chart, and gene presence/absence is represented by different bar shading.

Calling pdgplot requires the same pheno_data variable as analyzeOrthoMCL and similarly takes advantage of the user specified column names from the pheno_data data frame. It also requires the mcl_matrix object and specification of which OG to highlight.

For example, to visualize the means and standard deviations of the taxa which are present in OG “ex_r00002”, the OG with the lowest corrected p-value (“0.00186”). Green and gray bars represent taxa that do or do not contain a gene in the OG, respectively.

pdgplot(pheno_data, mcl_matrix, OG = 'ex_r00002', species_colname = 'Treatment', data_colname = 'RespVar', ylab = "TAG Content")

The different taxa can be ordered alphabetically, as above, or by specifying the order with either the tree parameter (for phylogenetic sorting) or the order parameter, which takes a vector to determine order. For example, the taxa can be ordered by phylogeny by calling a phylogenetic tree file with the taxon names as leaves (note any taxa in pheno_data that are not in the tree file will be omitted).

tree <- system.file('extdata', 'muscle_tree2.dnd', package='MAGNAMWAR')

pdgplot(pheno_data, mcl_matrix, 'ex_r00002', 'Treatment', 'RespVar', ylab = "TAG Content", tree = tree)

phydataerror

phydataerror is similar to pdgplot and adds visualization of phylogenetic tree with the phenotypic means and standard deviation.

Calling phydataerror requires the following parameters:

  • a phylogenetic tree
  • the same pheno_data variable as analyzeOrthoMCL (and its column names)
  • the mcl_matrix object
  • a specification of which OG to highlight
phydataerror(tree, pheno_data, mcl_matrix, species_colname = 'Treatment', data_colname = 'RespVar', OG='ex_r00002', xlabel='TAG Content')

pdg_v_OG

pdg_v_OG produces a histogram of the number of OGs in each phylogenetic distribution group (PDG). A PDG is the set of OGs present and absent in the exact same set of bacterial taxa. The main purpose of the graph is to determine the fraction of OGs that are present in unique or shared sets of taxa, since the phenotypic effect of any OGs in the same PDG cannot be discriminated from each other.

To run pdg_v_OG, provide the output of format_afterOrtho. The num parameter (default 40) is used to specify the amount of OGs per PDG that should be included on the x axis.

For example, in the graph below there are 11 PDGs only contain one OG, meaning that 11 PDGs have one group that has a unique distribution of taxa present and absent.

pdg_v_OG(parsed_data,0)

Because this data is an extreme subset, it isn’t very informative. A full data set is shown below. This shows us that in this particular OrthoMCL clustering data, 4822 PDGs exist with only 1 unique OG in the PDG, and around 500 PDGs exist with 2 OGs that share the same presence and absence of taxa and so on.

qqplotter

A simple quartile-quartile plot function that is generated using the matrix of analyzeOrthoMCL. To run qqplotter, provide the output matrix of analyzeOrthoMcl.

qqplotter(mcl_matrix)

manhat_grp

manhat_grp produces a manhattan plot for visual analysis of the output of analyzeOrthoMCL. In a traditional genome wide association study, a Manhattan plot is a visualization tool to identify significant p-values and potential linkage disequilibrium blocks. A traditional Manhattan plot sorts p-value by the SNPs along the 23 chromosomes. In our Manhattan plot, we sort by taxa instead of by chromosome number as shown below. The lines that emerge in our plot show the clustered proteins across taxa (which because they are in the same OG would have the same p-value).

To run manhat_grp, provide the output of format_afterOrtho and the output of analyzeOrthoMCL as shown below.

manhat_grp(parsed_data, mcl_matrix)
## ordering by protein id within taxa...
## creating manhattan plot...

## finished.

Again because this data is an extreme subset, it isn’t very informative. Another full data set example is shown below.