The ParseMSF package provides several functions for inspecting Thermo MSF files. The most useful of these is make_area_table, which constructs a data frame containing all peptides and their corresponding peak areas. This data frame also includes the protein description (protein_desc) for each peptide.
NOTE: Only Thermo MSF files generated by Proteome Discoverer 1.4.x are supported. Using ParseMSF functions with a file produced by any other version of Proteome Discoverer may produce unexpected results.
library(parsemsf)
# Replace `parsemsf_example("test_db.msf")` with the path to a Thermo MSF file
area_table <- make_area_table(parsemsf_example("test_db.msf"))
knitr::kable(head(area_table))
peptide_id | spectrum_id | protein_desc | sequence | area | mass | m_z | charge | intensity | first_scan |
---|---|---|---|---|---|---|---|---|---|
27146 | 15646 | NP_041997.1 | AALTDQVALGK | 55120084 | 1086.616 | 544.0622 | 2 | 634147.8 | 17577 |
27177 | 15663 | NP_041997.1 | AALTDQVALGK | 55120084 | 1086.615 | 544.0622 | 2 | 721063.0 | 17595 |
35484 | 20122 | NP_041997.1 | ANFQADQIIAK | 37046635 | 1218.648 | 610.5803 | 2 | 152654.2 | 22420 |
35511 | 20136 | NP_041997.1 | ANFQADQIIAK | 37046635 | 1218.648 | 610.5803 | 2 | 169355.0 | 22436 |
37869 | 21360 | NP_041997.1 | TQAAYLAPGENLDDK | NA | 1605.775 | NA | 2 | 382864.4 | 23744 |
37913 | 21384 | NP_041997.1 | TQAAYLAPGENLDDK | NA | 1605.775 | NA | 2 | 282891.8 | 23769 |
See the documentation for make_area_table for a description of each column.
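In the rows shown above, each sequence appears twice (with different spectrum_id values) and shares one peak area, and one peptide has no area (NA). A minimal sketch of reducing such a table to one quantified row per peptide follows. It rebuilds four columns of the example output by hand so that it runs without an MSF file; the names quantified and per_peptide are illustrative and not part of ParseMSF.

```r
# Four columns of the example output above, re-entered by hand
area_table <- data.frame(
  peptide_id   = c(27146, 27177, 35484, 35511, 37869, 37913),
  protein_desc = "NP_041997.1",
  sequence     = c("AALTDQVALGK", "AALTDQVALGK",
                   "ANFQADQIIAK", "ANFQADQIIAK",
                   "TQAAYLAPGENLDDK", "TQAAYLAPGENLDDK"),
  area         = c(55120084, 55120084, 37046635, 37046635, NA, NA)
)

# Drop rows without a peak area
quantified <- area_table[!is.na(area_table$area), ]

# Collapse repeated rows to one row per protein/peptide/area combination
per_peptide <- unique(quantified[c("protein_desc", "sequence", "area")])
per_peptide
```

With a real file, area_table would instead come from make_area_table as shown earlier.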
The peak area information stored in one or more Thermo MSF files can be used to estimate protein abundances. The quantitate function estimates these abundances across one or more technical replicates. Technical replicates are separate mass spectrometry injections of the same biological sample. The quantitate function will produce more accurate protein abundance estimates if it is provided with multiple technical replicates.
# Replace the `parsemsf_example()` calls with paths to your own Thermo MSF files
abundances <- quantitate(c(parsemsf_example("test_db.msf"),
                           parsemsf_example("test_db2.msf")))
## Now processing: /Users/ben/Box Sync/projects/2017/parsemsf/parsemsf-package/inst/extdata/test_db.msf
## Now processing: /Users/ben/Box Sync/projects/2017/parsemsf/parsemsf-package/inst/extdata/test_db2.msf
## Quantitating...
knitr::kable(head(abundances))
protein_desc | area_mean | area_sd | peps_per_rep |
---|---|---|---|
NP_041997.1 | 0.0917469 | 0.0207773 | 3 |
Abundances are estimated by averaging the areas of the three most abundant peptides (area_mean) [reference]. If provided with multiple technical replicates, quantitate will, by default, estimate protein abundances by matching peptides across technical replicates. That is, it will only average areas from peptides that are present in both technical replicates. The number of unique peptides used to estimate the protein abundance is given by peps_per_rep.
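The matching and top-three averaging described above can be sketched in base R. This is one possible reading of the description, not the code inside quantitate: the peptide names and areas are invented, the helper top3_mean is hypothetical, and any normalization that quantitate applies (the area_mean values above are clearly not raw areas) is left out.

```r
# Hypothetical peak areas for one protein in two technical replicates.
# All values are invented for illustration; they are not from test_db.msf.
rep1 <- data.frame(
  sequence = c("PEPA", "PEPB", "PEPC", "PEPD", "PEPE"),
  area     = c(50, 40, 30, 20, 10)
)
rep2 <- data.frame(
  sequence = c("PEPA", "PEPB", "PEPD", "PEPE", "PEPF"),
  area     = c(60, 30, 25, 5, 90)
)

# Keep only peptides observed in both replicates
shared <- intersect(rep1$sequence, rep2$sequence)
rep1_shared <- rep1[rep1$sequence %in% shared, ]
rep2_shared <- rep2[rep2$sequence %in% shared, ]

# Hypothetical helper: mean of the three largest areas in one replicate
top3_mean <- function(df) {
  mean(head(sort(df$area, decreasing = TRUE), 3))
}

# One estimate per replicate, then mean and spread across replicates
per_rep   <- c(top3_mean(rep1_shared), top3_mean(rep2_shared))
area_mean <- mean(per_rep)
area_sd   <- sd(per_rep)
```

Note that PEPF has the largest area in the second replicate but is ignored because it is absent from the first, which is the effect of matching peptides across replicates.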
Protein abundances can also be estimated from a single Thermo MSF file.
# Replace `parsemsf_example("test_db.msf")` with the path to a Thermo MSF file
abundances <- quantitate(parsemsf_example("test_db.msf"))
## Now processing: /Users/ben/Box Sync/projects/2017/parsemsf/parsemsf-package/inst/extdata/test_db.msf
## Quantitating...
knitr::kable(head(abundances))
protein_desc | area_mean | area_sd | peps_per_rep |
---|---|---|---|
NP_041997.1 | 0.0963672 | 0.0250473 | 3 |
The ParseMSF package includes a function for inspecting the distribution of peptides within a single protein. The map_peptides function produces a data frame of peptides with their respective locations within the protein sequence.
peptide_locs <- map_peptides(parsemsf_example("test_db.msf"))
# Select columns with start and end locations
peptide_locs <- peptide_locs[c("peptide_id", "protein_desc",
                               "peptide_sequence", "start", "end")]
knitr::kable(head(peptide_locs))
peptide_id | protein_desc | peptide_sequence | start | end |
---|---|---|---|---|
27146 | NP_041997.1 | AALTDQVALGK | 172 | 182 |
27177 | NP_041997.1 | AALTDQVALGK | 172 | 182 |
35484 | NP_041997.1 | ANFQADQIIAK | 314 | 324 |
35511 | NP_041997.1 | ANFQADQIIAK | 314 | 324 |
37869 | NP_041997.1 | TQAAYLAPGENLDDK | 69 | 83 |
37913 | NP_041997.1 | TQAAYLAPGENLDDK | 69 | 83 |
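The start and end values in the table are consistent with 1-based, inclusive positions: AALTDQVALGK has 11 residues and spans 172 to 182. A minimal sketch of how such coordinates can be derived with base R string matching follows. The protein sequence here is invented for illustration, the variable names are hypothetical, and this is not necessarily how map_peptides locates peptides internally.

```r
# Hypothetical protein sequence, invented for illustration only
protein <- "MKTAALTDQVALGKRR"
peptide <- "AALTDQVALGK"

# regexpr() returns the 1-based position of the first match, or -1 if none
start_pos <- as.integer(regexpr(peptide, protein, fixed = TRUE))

# Inclusive end position of the matched peptide
end_pos <- start_pos + nchar(peptide) - 1L

# Extracting the same range recovers the peptide
substring(protein, start_pos, end_pos)
```

Only the first occurrence is reported by regexpr; a peptide that occurs more than once in a protein would need gregexpr instead.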
We can plot these peptide locations with the ggplot2 and dplyr packages.
library(ggplot2)
library(dplyr)
peptide_summary <- peptide_locs %>%
  group_by(start, end) %>%
  summarize(spectral_count = n()) # Count peptides
pep_plot <- ggplot(peptide_summary,
                   aes(x = start, xend = end, y = spectral_count, yend = spectral_count)) +
  geom_segment(size = 1) +
  ylim(0, 5) +
  xlab("peptide position within protein") +
  ylab("peptide count")
pep_plot