RHybridFinder

Frederic Saab, Peter Kubiniok

2021-06-18

RHybridFinder is a package for analyzing mass spectrometry (MS) data to discover putative hybrid peptides. The proposed workflow consists of two major steps, Step 1 (HybridFinder) and Step 2 (checknetMHCpan or step2_wo_netMHCpan), separated by an external second database search.

Loading the package

After installation, the package must be loaded before it can be used:

library(RHybridFinder)

Example data

For demonstration purposes, the data showcased in this vignette (de novo sequencing and database search results, in .csv format) is also available in the package. It comes from the HLA Ligand Atlas (Human liver, Autonomous Donor 17) (Marcu et al., 2021). To download the human proteome database .fasta file, please visit the UniProt website.

In order to access the example denovo sequencing results and database search results through the package:

# retrieve the de novo sequencing results for the example data
data(denovo_Human_Liver_AUTD17, package = "RHybridFinder")

# retrieve the database search results for the example data
data(db_Human_Liver_AUTD17, package = "RHybridFinder")

The raw mass spectrometry (MS) files for the dataset are provided by the authors on the Proteomics Identifications Database (PRIDE) under accession PXD019643.

Step 1

Step 1 consists of running the HybridFinder function.

HybridFinder

Description

The HybridFinder function is based on the workflow proposed by Faridi et al. (2018), with some modifications. Using the de novo sequencing results together with the database search results and the proteome database, HybridFinder extracts high-confidence de novo peptides and then searches for them in the proteome in three steps. In the first step, peptide sequences that match fully within a protein are considered “Linear”, and the linear peptides within a given spectrum are filtered to keep those with the highest ALC (Average Local Confidence, a significance score for the sequence). In the second step, the remaining spectra are processed by creating lists of fragment pairs from each peptide sequence and searching for them in the proteome database; if both fragments of a pair match within one protein, the peptide is considered potentially cis-spliced, and only the highest-ALC peptides from each spectrum group are kept. In the last step, the remaining spectra are searched for fragment pair matches across two proteins; peptides that match are considered potentially trans-spliced, and again only the highest-ALC peptides within each spectrum group are kept. Finally, the hybrid candidates are concatenated into several ‘fake’ proteins to create a hybrid proteome that mimics the actual proteome, and this hybrid proteome is merged with the reference proteome.
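The three searches described above can be illustrated with a small base R sketch. This is not the package's implementation: the toy proteome, the `classify_peptide` helper, and the `min_len` minimum fragment length are all hypothetical, and the ALC filter is applied once per scan at the end rather than after each search.

```r
# Toy proteome: hypothetical sequences, not real proteins
proteome <- c(protA = "MKTAYIAKQRQISFVKSHFSRQ",
              protB = "GSHMLEDPVDAFKLWYRT")

# For each protein, does it contain `fragment` as an exact substring?
hits <- function(fragment, proteins) grepl(fragment, proteins, fixed = TRUE)

# min_len is a hypothetical minimum fragment length for this sketch
classify_peptide <- function(peptide, proteins, min_len = 2) {
  # Search 1: the whole sequence lies within a single protein
  if (any(hits(peptide, proteins))) return("Linear")
  n <- nchar(peptide)
  if (n < 2 * min_len) return(NA_character_)
  found_trans <- FALSE
  for (i in seq(min_len, n - min_len)) {
    left  <- hits(substr(peptide, 1, i), proteins)
    right <- hits(substr(peptide, i + 1, n), proteins)
    # Search 2: both fragments occur in the same protein
    if (any(left & right)) return("cis")
    # Search 3: the fragments occur in two different proteins
    if (any(left) && any(right)) found_trans <- TRUE
  }
  if (found_trans) "trans" else NA_character_
}

# Toy candidates: two peptides compete for scan F1:1
candidates <- data.frame(
  Scan    = c("F1:1", "F1:1", "F1:2"),
  Peptide = c("AYIAKQRQI", "MKTASFVKS", "MKTALEDPV"),
  ALC     = c(92, 89, 85),
  stringsAsFactors = FALSE
)
candidates$Potential_spliceType <- vapply(
  candidates$Peptide, classify_peptide, character(1),
  proteins = proteome, USE.NAMES = FALSE
)

# Keep only the highest-ALC candidate(s) per scan
best <- candidates[candidates$ALC == ave(candidates$ALC, candidates$Scan, FUN = max), ]
print(best)
```

In this toy example the cis check takes priority over the trans check for every split position, which mirrors the order of the searches in the description.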

Loading data

In order to run HybridFinder, three inputs must be provided:

  1. all de novo candidates export file - all de novo sequencing candidates, loaded into R as a dataframe.
  2. DB search psm export file - database search results, loaded into R as a dataframe.
  3. file path to the proteome database (.fasta) used.

Please note that it is recommended to have a folder structure that looks as follows, as it helps keep all results organized:

  • (parent folder)- Exp 1 (could be any name)
    • (child folder): first_run
      • denovo
      • db
    • (2nd child folder): second_run

# folder holding the experiment (adjust to your own layout)
folder_Human_Liver_AUTD17 <- file.path("./data/Human_Liver_AUTD17")
# de novo sequencing candidates from the first run
denovo_Human_Liver_AUTD17 <- read.csv(file.path(folder_Human_Liver_AUTD17, "first_run", "all de novo candidates.csv"), sep = ",", header = TRUE, stringsAsFactors = FALSE)
# database search results from the first run
db_Human_Liver_AUTD17 <- read.csv(file.path(folder_Human_Liver_AUTD17, "first_run", "DB search psm.csv"), sep = ",", header = TRUE, stringsAsFactors = FALSE)
# path to the proteome database (.fasta) file
proteome_Human_Liver_AUTD17 <- file.path(folder_Human_Liver_AUTD17, "uniprot-proteome-human_UP000005640-reviewed_validated.fasta")
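As a minimal sketch, the recommended folder layout could be set up and checked with base R before loading anything. The experiment name "Exp1" and the use of a temporary directory are assumptions made so that the example is safe to run; in practice the parent folder would be your own project directory.

```r
# Hypothetical experiment folder inside a temporary directory
exp_dir <- file.path(tempdir(), "Exp1")

# One sub-folder per database search run
for (run in c("first_run", "second_run")) {
  dir.create(file.path(exp_dir, run), recursive = TRUE, showWarnings = FALSE)
}

# Files expected in the first run, using the file names from this vignette
expected_inputs <- file.path(
  exp_dir, "first_run",
  c("all de novo candidates.csv", "DB search psm.csv")
)

# Report any input that has not been placed in the folder yet
missing_inputs <- expected_inputs[!file.exists(expected_inputs)]
if (length(missing_inputs) > 0) {
  message("Missing input files:\n", paste(missing_inputs, collapse = "\n"))
}
```

Checking for the files first gives a clearer message than the error read.csv would raise on a missing path.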

Run HybridFinder

Once the inputs are loaded, HybridFinder can be run with a single call. Please note that the HybridFinder function can use parallel computing (with_parallel = TRUE) to obtain results faster, so it is worth checking beforehand whether the computer used supports it.
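One way to decide whether parallel computing is worthwhile is to ask R how many cores it can see. The sketch below uses the parallel package that ships with R; the threshold of more than two cores is an arbitrary assumption, not a requirement stated by RHybridFinder.

```r
# Number of cores detected on this machine (may be NA on some platforms)
n_cores <- parallel::detectCores()

# Hypothetical rule of thumb: only go parallel with more than two cores
use_parallel <- !is.na(n_cores) && n_cores > 2

message("Detected cores: ", n_cores, "; use parallel: ", use_parallel)
```

The resulting `use_parallel` value could then be passed to the with_parallel argument of HybridFinder.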


results_HybridFinder_Human_Liver_AUTD17 <- HybridFinder(
  denovo_candidates = denovo_Human_Liver_AUTD17,
  db_search = db_Human_Liver_AUTD17,
  proteome_db = proteome_Human_Liver_AUTD17,
  with_parallel = FALSE,
  export_files = TRUE,
  export_dir = folder_Human_Liver_AUTD17
)

Output

The function returns a list composed of 3 elements:

  • a dataframe representing the HybridFinder output, containing the high confidence denovo peptides that made it through the different searches (listed above).
  • a vector containing all the hybrid candidate sequences
  • a list consisting of the merged reference+hybrid proteome.
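To illustrate how the second and third elements relate, the sketch below concatenates a few hybrid candidates into 'fake' proteins and appends them to a toy reference proteome. The peptides are taken from the example output of this vignette; the number of peptides per fake protein, the toy reference entry, and the plain list used for the proteome are assumptions, and the package may group candidates differently.

```r
# Hybrid candidate sequences (as in the second element of the output)
hybrid_candidates <- c("KAVNLLLSY", "AKVNLLLSY", "KLADFRLLY",
                       "KLADLFRLY", "NYGELFEKF", "DYGELFEKF")

# Hypothetical group size: how many peptides go into one fake protein
peptides_per_protein <- 3
group <- ceiling(seq_along(hybrid_candidates) / peptides_per_protein)

# Concatenate the peptides of each group into one sequence
fake_proteins <- vapply(split(hybrid_candidates, group), paste0,
                        character(1), collapse = "")
names(fake_proteins) <- paste0("sp|denovo_HF_fake_protein",
                               seq_along(fake_proteins))

# Toy reference proteome with a single hypothetical entry
reference <- c("sp|TOY1|TOY_HUMAN" = "MKTAYIAKQRQISFVKSHFSRQ")

# Merged proteome: reference entries followed by the fake proteins
merged <- c(as.list(reference), as.list(fake_proteins))

# Fake proteins can be recognised by their name
is_fake <- grepl("denovo_HF_fake_protein", names(merged), fixed = TRUE)
print(names(merged)[is_fake])
```

The same name-based filter can be applied to the third element of the real output to count or inspect the fake proteins that were appended.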
Output 1: HybridFinder step 1 output (HF_step1_output)
#display HybridFinder(HF) step1 output
print(head(results_HybridFinder_Human_Liver_AUTD17[[1]]))
#>     Fraction     Scan      m/z    RT   Peptide Length Potential_spliceType ALC
#> 398        1 F1:12157 574.2689 61.13 NYGELFEKF      9                  cis  92
#> 10         1 F1:10576 569.8348 55.81 KLADRFLLY      9               Linear  89
#> 405        1 F1:12242 574.2689 61.13 DYGELFEKF      9                  cis  89
#> 97         1 F1:16670 575.7930 72.95 SYLEHLFEL      9               Linear  93
#> 355        3 F3:16765 575.7935 72.81 SYLEHLFEL      9               Linear  91
#> 58         1 F1:14442 509.8185 68.48 LLDPHVVLL      9               Linear  82
#>                                          proteome_database_used
#> 398 uniprot-proteome-human_UP000005640-reviewed_validated.fasta
#> 10  uniprot-proteome-human_UP000005640-reviewed_validated.fasta
#> 405 uniprot-proteome-human_UP000005640-reviewed_validated.fasta
#> 97  uniprot-proteome-human_UP000005640-reviewed_validated.fasta
#> 355 uniprot-proteome-human_UP000005640-reviewed_validated.fasta
#> 58  uniprot-proteome-human_UP000005640-reviewed_validated.fasta
Output 2: List of step 1 candidate hybrid peptides
#display list of candidate hybrid peptides
print(head(results_HybridFinder_Human_Liver_AUTD17[[2]]))
#> [1] "KAVNLLLSY" "AKVNLLLSY" "KLADFRLLY" "KLADLFRLY" "NYGELFEKF" "DYGELFEKF"
Output 3: merged proteome
#display the merged proteome
print(tail(results_HybridFinder_Human_Liver_AUTD17[[3]]))
#> $`sp|Q8TDM6-2|DLG5_HUMAN`
#> [1] "MEPQRRELLAQCQQSLAQAMTEVEAVLGLLEAAGALSPGERRQLDEEAGGAKAELLLKLLLAKERDHFQDLRAALEKTQPHLLPLLYLNGVVGPPQPAEGAGSTYSVLSTMPSDSESSSSLSSVGTTGKAPSPPPLLTDQQVNEKVENLSLQLRLMTRERNELRKRLAFATHGTAFDKRPYHRLNPDYERLKLQCVRAMSDLQSLQNQHTNALKRCEEVAKETDFYHTLHSRLLSDQTRLKDDVDMLRRENGQLLRERNLLQQSWEDMKRLHEEDQKELGDLRAQQQQVLKHNGSSELLNKLYDTAMDKLEVVKKDYDALRKRYSEKVALHNADLSRLEQLGEENQRLLKQTEMLTQQRDTALQLQHQCALSLRRFEALHHELNKATAQNKDLQWEMELLQSELTELRTTQVKTAKESEKYREERDAVYSEYKLLMSERDQVLSELDKLQTEVELAESKLKSSTSEKKAANEEMEALRQLKDTVTMDAGRANKEVELLRKQCKALCQELKEALQEADVAKCRRDWAFQERDKLVAERDSLRTLCDNLRRERDRAVSELAEALRSLDDTRKQKNDVSRELKELKEQMESQLEKEARFRQLMAHSSHDSALDTDSMEWETEVVEFERETEDLDLKALGFDMAEGVNEPCFPGDCGLFVTKVDKGSLADGRLRVNDWLLRLNDVDLLNKDKKQALKALLNGEGALNMVVRRRKSLGGKVVTPLHLNLSGQKDSGLSLENGVYAAAVLPGSPAAKEGSLAVGDRLVALNGLALDNKSLNECESLLRSCQDSLTLSLLKEQKCVPASGELSPELQEWAPYSPGHSSRHSNPPLYPSRPSVGTVPRSLTPSTTVSSLLRNPLYTVRSHRVGPCSSPPAARDAGPQGLHPSVQHQGRLSLDLSHRTCSDYSEMRATHGSNSLPSSARLGSSSNLQFKAERLKLPSTPRYPRSVVGSERGSVSHSECSTPPQSPLNLDTLSSCSQSQTSASTLPRLAVNPASLGERRKDRPYVEEPRHVKVQKGSEPLGLSLVSGEKGGLYVSKVTVGSLAHQAGLEYGDQLLEFNGLNLRSATEQQARLLLGQQCDTLTLLAQYNPHVHQLSSHSRSSSHLDPAGTHSTLQGSGTTTPEHPSVLDPLMEQDEGPSTPPAKQSSSRLAGDANKKTLEPRVVFLKKSQLELGVHLCGGNLHGVFVAEVEDDSPAKGPDGLVPGDLLLEYGSLDVRNKTVEEVYVEMLKPRDGVRLKVQYRPEEFTKAKGLPGDSFYLRALYDRLADVEQELSFKKDDLLYVDDTLPQGTFGSWMAWQLDENAQKLQRGQLPSKYVMDQEFSRRLSMSEVKDDNSATKTLSAAARRSFFRRKHKHKRSGSKDGKDLLALDAFSSDSLPLFEDSVSLAYQRVQKVDCTALRPVLLLGPLLDVVKEMLVNEAPGKFCRCPLEVMKASQQALERGVKDCLFVDYKRRSGHFDVTTVASLKELTEKNRHCLLDLAPHALERLHHMHLYPLVLFLHYKSAKHLKEQRDPLYLRDKVTQRHSKEQFEAAQKLEQEYSRYFTGVLQGGALSSLCTQLLAMVNQEQNKVLWLPACPL"
#> attr(,"name")
#> [1] "sp|Q8TDM6-2|DLG5_HUMAN"
#> attr(,"Annot")
#> [1] ">sp|Q8TDM6-2|DLG5_HUMAN Isoform 2 of Disks large homolog 5 OS=Homo sapiens OX=9606 GN=DLG5"
#> attr(,"class")
#> [1] "SeqFastaAA"
#> 
#> $`sp|Q8TDM6-3|DLG5_HUMAN`
#> [1] "MEPQRRELLAQCQQSLAQAMTEVEAVLGLLEAAGALSPGERRQLDEEAGGAKAELLLKLLLAKERDHFQDLRAALEKTQPHLLPLLYLNGVVGPPQPAEGAGSTYSVLSTMPSDSESSSSLSSVGTTGKELKEQMESQLEKEARFRQLMAHSSHDSALDTDSMEWETEVVEFERETEDLDLKALGFDMAEGVNEPCFPGDCGLFVTKVDKGSLADGRLRVNDWLLRLNDVDLLNKDKKQALKALLNGEGALNMVVRRRKSLGGKVVTPLHLNLSGQKDSGLSLENGVYAAAVLPGSPAAKEGSLAVGDRLVALNGLALDNKSLNECESLLRSCQDSLTLSLLKVFPQSSSWSGQNLFENLKDSDKMLSFRAHGPEVQAHNKRNLLQHNNSTQTDLFYTDRLEDRKEPGPPGGSSSFLHKPFPGGPLQVCPQACPSASERSLSSFRSDASGDRGFGLVDVRGRRPLLPFETEVGPCGVGEASLDKADSEGSNSGGTWPKAMLSSTAVPEKLSVYKKPKQRKSLFDPNTFKRPQTPPKLDYLLPGPGPAHSPQPSKRAGPLTPPKPPRRSDSLKFQHRLETSSESEATLVGSSPSTSPPSALPPDVDPGEPMHASPPRKARVRLASSYYPEGDGDSSHLPAKKSCDEDLTSQKVDELGQKRRRPKSAPSFRPKLAPVVLPAQFLEV"
#> attr(,"name")
#> [1] "sp|Q8TDM6-3|DLG5_HUMAN"
#> attr(,"Annot")
#> [1] ">sp|Q8TDM6-3|DLG5_HUMAN Isoform 3 of Disks large homolog 5 OS=Homo sapiens OX=9606 GN=DLG5"
#> attr(,"class")
#> [1] "SeqFastaAA"
#> 
#> $`sp|Q8TDM6-4|DLG5_HUMAN`
#> [1] "MPSDSESSSSLSSVGTTGKAPSPPPLLTDQQVNEKVENLSLQLRLMTRERNELRKRLAFATHGTAFDKRPYHRLNPDYERLKLQCVRAMSDLQSLQNQHTNALKRCEEVAKETDFYHTLHSRLLSDQTRLKDDVDMLRRENGQLLRERNLLQQSWEDMKRLHEEDQKELGDLRAQQQQVLKHNGSSELLNKLYDTAMDKLEVVKKDYDALRKRYSEKVALHNADLSRLEQLGEENQRLLKQTEMLTQQRDTALQLQHQCALSLRRFEALHHELNKATAQNKDLQWEMELLQSELTELRTTQVKTAKESEKYREERDAVYSEYKLLMSERDQVLSELDKLQTEVELAESKLKSSTSEKKAANEEMEALRQLKDTVTMDAGRANKEVELLRKQCKALCQELKEALQEADVAKCRRDWAFQERDKLVAERDSLRTLCDNLRRERDRAVSELAEALRSLDDTRKQKNDVSRELKELKEQMESQLEKEARFRQLMAHSSHDSALDTDSMEWETEVVEFERETEDLDLKALGFDMAEGVNEPCFPGDCGLFVTKVDKGSLADGRLRVNDWLLRLNDVDLLNKDKKQALKALLNGEGALNMVVRRRKSLGGKVVTPLHLNLSGQKDSGLSLENGVYAAAVLPGSPAAKEGSLAVGDRLVALNGLALDNKSLNECESLLRSCQDSLTLSLLKVFPQSSSWSGQNLFENLKDSDKMLSFRAHGPEVQAHNKRNLLQHNNSTQTDLFYTDRLEDRKEPGPPGGSSSFLHKPFPGGPLQVCPQACPSASERSLSSFRSDASGDRGFGLVDVRGRRPLLPFETEVGPCGVGEASLDKADSEGSNSGGTWPKAMLSSTAVPEKLSVYKKPKQRKSLFDPNTFKRPQTPPKLDYLLPGPGPAHSPQPSKRAGPLTPPKPPRRSDSLKFQHRLETSSESEATLVGSSPSTSPPSALPPDVDPGEPMHASPPRKARVRLASSYYPEGDGDSSHLPAKKSCDEDLTSQKVDELGQKRRRPKSAPSFRPKLAPVVLPAQFLEEQKCVPASGELSPELQEWAPYSPGHSSRHSNPPLYPSRPSVGTVPRSLTPSTTVSSLLRNPLYTVRSHRVGPCSSPPAARDAGPQGLHPSVQHQGRLSLDLSHRTCSDYSEMRATHGSNSLPSSARLGSSSNLQFKAERLKLPSTPRYPRSVVGSERGSVSHSECSTPPQSPLNLDTLSSCSQSQTSASTLPRLAVNPASLGERRKDRPYVEEPRHVKVQKGSEPLGLSLVSGEKGGLYVSKVTVGSLAHQAGLEYGDQLLEFNGLNLRSATEQQARLLLGQQCDTLTLLAQYNPHVHQLSSHSRSSSHLDPAGTHSTLQGSGTTTPEHPSVLDPLMEQDEGPSTPPAKQSSSRLAGDANKKTLEPRVVFLKKSQLELGVHLCGGNLHGVFVAEVEDDSPAKGPDGLVPGDLLLEYGSLDVRNKTVEEVYVEMLKPRDGVRLKVQYRPEEFTKAKGLPGDSFYLRALYDRLADVEQELSFKKDDLLYVDDTLPQGTFGSWMAWQLDENAQKLQRGQLPSKYVMDQEFSRRLSMSEVKDDNSATKTLSAAARRSFFRRKHKHKRSGSKDGKDLLALDAFSSDSLPLFEDSVSLAYQRVQKVDCTALRPVLLLGPLLDVVKEMLVNEAPGKFCRCPLEVMKASQQALERGVKDCLFVDYKRRSGHFDVTTVASLKELTEKNRHCLLDLAPHALERLHHMHLYPLVLFLHYKSAKHLKEQRDPLYLRDKVTQRHSKEQFEAAQKLEQEYSRYFTGVLQGGALSSLCTQLLAMVNQEQNKVLWLPACPL"
#> attr(,"name")
#> [1] "sp|Q8TDM6-4|DLG5_HUMAN"
#> attr(,"Annot")
#> [1] ">sp|Q8TDM6-4|DLG5_HUMAN Isoform 4 of Disks large homolog 5 OS=Homo sapiens OX=9606 GN=DLG5"
#> attr(,"class")
#> [1] "SeqFastaAA"
#> 
#> $`sp|Q8TDM6-5|DLG5_HUMAN`
#> [1] "MRATHGSNSLPSSARLGSSSNLQFKAERLKLPSTPRYPRSVVGSERGSVSHSECSTPPQSPLNLDTLSSCSQSQTSASTLPRLAVNPASLGERRKDRPYVEEPRHVKVQKGSEPLGLSLVSGEKGGLYVSKVTVGSLAHQAGLEYGDQLLEFNGLNLRSATEQQARLLLGQQCDTLTLLAQYNPHVHQLSSHSRSSSHLDPAGTHSTLQGSGTTTPEHPSVLDPLMEQDEGPSTPPAKQSSSRLAGDANKKTLEPRVVFLKKSQLELGVHLCGGNLHGVFVAEVEDDSPAKGPDGLVPGDLLLEYGSLDVRNKTVEEVYVEMLKPRDGVRLKVQYRPEEFTKAKGLPGDSFYLRALYDRLADVEQELSFKKDDLLYVDDTLPQGTFGSWMAWQLDENAQKLQRGQLPSKYVMDQEFSRRLSMSEVKDDNSATKTLSAAARRSFFRRKHKHKRSGSKDGKDLLALDAFSSDSLPLFEDSVSLAYQRVQKVDCTALRPVLLLGPLLDVVKEMLVNEAPGKFCRCPLEVMKASQQALERGVKDCLFVDYKRRSGHFDVTTVASLKELTEKNRHCLLDLAPHALERLHHMHLYPLVLFLHYKSAKHLKEQRDPLYLRDKVTQRHSKEQFEAAQKLEQEYSRYFTGVLQGGALSSLCTQLLAMVNQEQNKVLWLPACPL"
#> attr(,"name")
#> [1] "sp|Q8TDM6-5|DLG5_HUMAN"
#> attr(,"Annot")
#> [1] ">sp|Q8TDM6-5|DLG5_HUMAN Isoform 5 of Disks large homolog 5 OS=Homo sapiens OX=9606 GN=DLG5"
#> attr(,"class")
#> [1] "SeqFastaAA"
#> 
#> $`sp|denovo_HF_fake_protein1`
#> [1] "KAVNLLLSYAKVNLLLSYKLADFRLLYKLADLFRLYNYGELFEKFDYGELFEKFDYGELFQKFKLADFLRLYYKPSPFFVFYNLPWLENLLEPFLLPTLPELFLLPTLLYEQFVPLLLEYQFVPLLPLEFLLPTLLTTSWMSLKEQLPLRLSAQELPLRLSASEAPPTNGAALPYFSPCLELLPENLLHALMEDLLKLLAMENLLKLLAYPDLNFRNLLPVDLQRYLLPVNLQRYLLYEVLLKNFFPYYAPELLPFYYAPELLFSVHMVTHFLLYYASNYRRFLVGSLPKESAPPTNQAFENGEWRELQLADLFRLYEFTQHLFEL"
#> 
#> $`sp|denovo_HF_fake_protein2`
#> [1] "LTMNLVQELLTMDLVQELLTMDLVQQLLWDLSLTRLYLNPSFTVLYKPSFPFVFKYPSMLFVFYKPSLMFVFKYPSLMFVFDQDLRSMATALLYYASNRYLLYYASRNYSLVMTQTPKFLLTTMSLGSFLLTERYGSFFYGKQAVQFYKVYTSVSWMMALLTHGLLMMAKVFVFSYLPLAHFMAYVSELFPAFSFVSELFPAFSYLHELFELNYLPWLELNYVLAQAVLSLTVVMTQTPKFRYFSTSVSWYRFSTSVSWLTFVPGAMVLPVDLTAFQLLPVNLATFQLLPVDLATFQLLPVNLKSLTMSYLEHLFYTLPEMWFPLL"

Export

If export_files is set to TRUE and a valid directory is provided in export_dir, then the results are exported in .csv, .csv and .fasta format, respectively.

Even if the export parameters were not set in the original call, the returned results can still be exported later with the export_HybridFinder_results function, as long as the list returned by HybridFinder has been stored; that list is passed to the results_list parameter of export_HybridFinder_results.
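Because a later export depends on the results still being available, one option is to save the returned list to disk right after the run. The sketch below uses base R's saveRDS and readRDS on a toy stand-in for the results list; the toy contents and the file name are assumptions.

```r
# Toy stand-in for the list returned by HybridFinder (3 elements)
results <- list(
  data.frame(Peptide = "NYGELFEKF", Potential_spliceType = "cis",
             stringsAsFactors = FALSE),
  c("NYGELFEKF"),
  list(toy_protein = "MKTAYIAKQRQISFVKSHFSRQ")
)

# Save the list so that it survives the end of the R session
rds_path <- file.path(tempdir(), "results_HybridFinder.rds")
saveRDS(results, rds_path)

# Later (possibly in a new session): restore it unchanged
restored <- readRDS(rds_path)
print(identical(results, restored))
```

A list restored this way could then be supplied as results_list when exporting.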

Interim external step: Second database search using the merged proteome

After Step 1, a second database search has to be run on the raw MS data, this time using the merged proteome (.fasta) exported from the HybridFinder results.

Step 2

The second step in RHybridFinder consists of running either checknetMHCpan or step2_wo_netMHCpan on the results from step 1 and the second database search, in order to obtain the final list of peptides, including the hybrid candidates, together with their potential splice types.
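The core idea of carrying the splice types from step 1 over to the second database search can be sketched with base R's match. The column names follow the outputs shown in this vignette, but the toy rows, the scores, and the unmatched peptide "GGGGGGGGG" are hypothetical, and how the package treats unmatched peptides is not shown here.

```r
# Toy step 1 output: peptides with their potential splice type
step1 <- data.frame(
  Peptide = c("NYGELFEKF", "KLADRFLLY"),
  Potential_spliceType = c("cis", "Linear"),
  stringsAsFactors = FALSE
)

# Toy second-run database search results (scores are made up)
rerun <- data.frame(
  Peptide_no_mods = c("KLADRFLLY", "NYGELFEKF", "GGGGGGGGG"),
  X.10lgP = c(23.98, 22.70, 20.88),
  stringsAsFactors = FALSE
)

# Look up each rerun peptide in the step 1 table
idx <- match(rerun$Peptide_no_mods, step1$Peptide)
rerun$Potential_spliceType <- step1$Potential_spliceType[idx]

print(rerun)
```

In this sketch, peptides absent from step 1 simply receive NA as their splice type.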

checknetMHCpan

Description

The checknetMHCpan function represents step 2 of the workflow of Faridi et al. (2018) and additionally uses netMHCpan (Jurtz et al., 2017; Reynisson et al., 2020) to obtain the predicted peptide-MHC-I binding affinities. Please note that netMHCpan needs to be installed in order to run this function. The package also contains a function that runs step 2 without netMHCpan (please refer to the step2_wo_netMHCpan section).

Loading data

In order to run checknetMHCpan, four inputs must be provided:

  1. netmhcpan_directory: the directory in which netMHCpan is located (e.g. ‘/usr/bin/’ or ‘/usr/bin/local/’)
  2. netmhcpan_alleles: the alleles to be tested against, as a vector if there are several (e.g. alleles <- c(‘HLA-A*03:01’, ‘HLA-A*24:02’))
  3. peptide_rerun: the database search results from the 2nd run, loaded into R as a dataframe
  4. HF_step1_output: the dataframe that is the first element of the HybridFinder output.

# directory in which netMHCpan is installed
netmhcpan_dir <- '/usr/bin/'

# HLA alleles of the sample
alleles_Human_liver_AUTD17 <- c("HLA-A*03:01", "HLA-A*24:02", "HLA-B*35:03", "HLA-B*45:01", "HLA-C*04:01", "HLA-C*16:01")

# database search results from the second run (merged proteome)
db_rerun_Human_liver_AUTD17 <- read.csv(file.path(folder_Human_Liver_AUTD17, "second_run", "DB search psm.csv"), sep = ",", header = TRUE, stringsAsFactors = FALSE)

# first element of the HybridFinder output
HF_output_Human_liver_AUTD17 <- results_HybridFinder_Human_Liver_AUTD17[[1]]
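Before calling checknetMHCpan, it can help to sanity-check the inputs. The sketch below tests whether allele names follow the pattern used in this vignette (for example "HLA-A*03:01") and whether a netMHCpan executable is present in the given directory. The regular expression, its restriction to HLA-A, -B and -C, and the executable name "netMHCpan" are assumptions rather than requirements documented by the package.

```r
# Two well-formed alleles and one that lacks the asterisk
alleles <- c("HLA-A*03:01", "HLA-A*24:02", "HLA-A03:01")

# Hypothetical pattern: HLA-, locus A/B/C, asterisk, two colon-separated fields
allele_pattern <- "^HLA-[ABC]\\*[0-9]{2,3}:[0-9]{2,3}$"
well_formed <- grepl(allele_pattern, alleles)

if (!all(well_formed)) {
  message("Check these alleles: ",
          paste(alleles[!well_formed], collapse = ", "))
}

# Is there a file called netMHCpan in the chosen directory?
# (the executable name is an assumption)
netmhcpan_dir <- "/usr/bin/"
netmhcpan_found <- file.exists(file.path(netmhcpan_dir, "netMHCpan"))
message("netMHCpan found in ", netmhcpan_dir, ": ", netmhcpan_found)
```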

Run checknetMHCpan

Once the inputs are loaded, checknetMHCpan can be run with a single call.


results_checknetMHCpan_Human_Liver_AUTD17 <- checknetMHCpan(
  netmhcpan_directory = netmhcpan_dir,
  netmhcpan_alleles = alleles_Human_liver_AUTD17,
  peptide_rerun = db_rerun_Human_liver_AUTD17,
  HF_step1_output = HF_output_Human_liver_AUTD17,
  export_files = TRUE,
  export_dir = folder_Human_Liver_AUTD17
)

Output

The function returns a list composed of 3 elements:

  • the netMHCpan results in long format, that is, the binding affinity results are displayed for each peptide with a given allele from those chosen.
  • the netMHCpan results in wide format, that is, the binding affinity levels per peptide summarized for all HLA alleles chosen.
  • the database results with the respective potential splice types retrieved from step 1.
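The difference between the long and wide formats can be illustrated with a small base R sketch. The %Rank values are taken from the example output of this vignette for two peptides and three alleles; the cut-offs of 0.5 (strong) and 2 (weak) are the usual netMHCpan defaults, and whether checknetMHCpan applies exactly these cut-offs is an assumption.

```r
# Long format: one row per peptide-allele pair
long <- data.frame(
  Peptide = rep(c("AAMLDTVVFK", "AANPHSFVF"), each = 3),
  HLA     = rep(c("HLA-A*03:01", "HLA-A*24:02", "HLA-B*35:03"), times = 2),
  Rank    = c(0.1091, 18.3151, 8.8987, 8.6520, 1.2434, 0.3034),
  stringsAsFactors = FALSE
)

# Assign a binding level from %Rank (assumed cut-offs: 0.5 and 2)
long$BindLevel <- as.character(cut(
  long$Rank, breaks = c(-Inf, 0.5, 2, Inf),
  labels = c("Strong binder", "Weak binder", "Non binder"),
  right = FALSE
))

# For each peptide, collapse the alleles found at a given binding level
alleles_at <- function(level) {
  vapply(split(long, long$Peptide),
         function(d) paste(d$HLA[d$BindLevel == level], collapse = ","),
         character(1))
}

# Wide format: one row per peptide, alleles summarised per level
strong <- alleles_at("Strong binder")
wide <- data.frame(
  Peptide      = names(strong),
  strongBinder = unname(strong),
  weakBinder   = unname(alleles_at("Weak binder")),
  noneBinder   = unname(alleles_at("Non binder")),
  stringsAsFactors = FALSE
)
print(wide)
```

The long table is convenient for filtering on a single allele, while the wide table gives one summary row per peptide.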

Output 1: netMHCpan results in long format
#display netmhcpan output(long version)
print(head(results_checknetMHCpan_Human_Liver_AUTD17[[1]]))
#>    Pos         HLA    Peptide      Core Of Gp Gl Ip Il      Icore Identity
#> 5    1 HLA-A*03:01  DYENLFLKF DYENLFLKF  0  0  0  0  0  DYENLFLKF  PEPLIST
#> 6    1 HLA-A*03:01  RYFSTSVSW RYFSTSVSW  0  0  0  0  0  RYFSTSVSW  PEPLIST
#> 7    1 HLA-A*03:01 AFSHLLLTTM AFSHLLLTM  0  7  1  0  0 AFSHLLLTTM  PEPLIST
#> 8    1 HLA-A*03:01  PYMARVAFF PYMARVAFF  0  0  0  0  0  PYMARVAFF  PEPLIST
#> 9    1 HLA-A*03:01  LYRPTAAAF LYRPTAAAF  0  0  0  0  0  LYRPTAAAF  PEPLIST
#> 10   1 HLA-A*03:01 FPVELAKYYM FVELAKYYM  0  1  1  0  0 FPVELAKYYM  PEPLIST
#>        Score Aff(nM)   %Rank  BindLevel strongBinder weakBinder  noneBinder
#> 5  0.0232400 38883.6 68.5113 Non binder                         HLA-A*03:01
#> 6  0.0939940 18084.0 14.1393 Non binder                         HLA-A*03:01
#> 7  0.0874160 19418.0 15.5482 Non binder                         HLA-A*03:01
#> 8  0.0423440 31622.6 38.9120 Non binder                         HLA-A*03:01
#> 9  0.0728480 22733.1 19.7710 Non binder                         HLA-A*03:01
#> 10 0.0459210 30422.0 35.4191 Non binder                         HLA-A*03:01
#>    Potential_spliceType
#> 5                Linear
#> 6                 trans
#> 7                Linear
#> 8                Linear
#> 9                Linear
#> 10               Linear
Output 2: netMHCpan results in wide format
#display netmhcpan output tidied version (wide)
print(head(results_checknetMHCpan_Human_Liver_AUTD17[[2]]))
#>        Peptide                        strongBinder              weakBinder
#> 1  AAEYPSVTNYL                                                 HLA-A*24:02
#> 7    AAFFEEPEL                                     HLA-B*35:03,HLA-C*16:01
#> 13  AAMLDTVVFK                         HLA-A*03:01                        
#> 19   AANPHSFVF HLA-B*35:03,HLA-C*04:01,HLA-C*16:01             HLA-A*24:02
#> 25   AANPNGRYY                                                 HLA-C*16:01
#> 31  AAPPQLRALL                                                            
#>                                                                 noneBinder
#> 1              HLA-A*03:01,HLA-B*35:03,HLA-B*45:01,HLA-C*04:01,HLA-C*16:01
#> 7                          HLA-A*03:01,HLA-A*24:02,HLA-B*45:01,HLA-C*04:01
#> 13             HLA-A*24:02,HLA-B*35:03,HLA-B*45:01,HLA-C*04:01,HLA-C*16:01
#> 19                                                 HLA-A*03:01,HLA-B*45:01
#> 25             HLA-A*03:01,HLA-A*24:02,HLA-B*35:03,HLA-B*45:01,HLA-C*04:01
#> 31 HLA-A*03:01,HLA-A*24:02,HLA-B*35:03,HLA-B*45:01,HLA-C*04:01,HLA-C*16:01
#>    %Rank.HLA-B*35:03 %Rank.HLA-A*24:02 %Rank.HLA-C*04:01 %Rank.HLA-A*03:01
#> 1            13.2675            1.3806            2.8884           19.5261
#> 7             1.0785           27.1675           10.0025           43.1042
#> 13            8.8987           18.3151           11.6722            0.1091
#> 19            0.3034            1.2434            0.2483            8.6520
#> 25            7.0735           38.2717            3.5019            5.3500
#> 31            6.9323           10.8503            5.7910           44.3338
#>    %Rank.HLA-B*45:01 %Rank.HLA-C*16:01 strongBinder_count weakBinder_count
#> 1             5.5163           10.6324                  0                1
#> 7            45.5616            0.6543                  0                2
#> 13           10.4367            7.4449                  1                0
#> 19            9.1652            0.0156                  3                1
#> 25           21.7060            0.5745                  0                1
#> 31           73.9918            4.4549                  0                0
#>    noneBinder_count Potential_spliceType
#> 1                 5               Linear
#> 7                 4               Linear
#> 13                5               Linear
#> 19                2               Linear
#> 25                5               Linear
#> 31                6               Linear
Output 3: Database search results updated
#display the updated database search results with the categorizations from step1
print(head(results_checknetMHCpan_Human_Liver_AUTD17[[3]]))
#>     Peptide X.10lgP     Mass Length ppm      m.z Z   RT   Area Fraction    Id
#> 1 DYENLFLKF   23.98 1187.586      9 0.1 594.8004 2 73.2 243590        3 44426
#> 2 DYENLFLKF   23.72 1187.586      9 0.1 594.8004 2 73.2 243590        3 44427
#> 3 DYENLFLKF   23.13 1187.586      9 0.1 594.8004 2 73.2 243590        3 44428
#> 4 DYENLFLKF   22.98 1187.586      9 0.1 594.8004 2 73.2 243590        3 44429
#> 5 DYENLFLKF   22.70 1187.586      9 0.1 594.8004 2 73.2 243590        3 44430
#> 6 DYENLFLKF   20.88 1187.586      9 0.1 594.8004 2 73.2 243590        3 44432
#>       Scan from.Chimera
#> 1 F3:16451           No
#> 2 F3:16501           No
#> 3 F3:16546           No
#> 4 F3:16609           No
#> 5 F3:16633           No
#> 6 F3:16686           No
#>                                                 Source.File
#> 1 171002_AM_BD-ZH17_Liver_W_10%_DDA_#3_400-650mz_msms6.mzML
#> 2 171002_AM_BD-ZH17_Liver_W_10%_DDA_#3_400-650mz_msms6.mzML
#> 3 171002_AM_BD-ZH17_Liver_W_10%_DDA_#3_400-650mz_msms6.mzML
#> 4 171002_AM_BD-ZH17_Liver_W_10%_DDA_#3_400-650mz_msms6.mzML
#> 5 171002_AM_BD-ZH17_Liver_W_10%_DDA_#3_400-650mz_msms6.mzML
#> 6 171002_AM_BD-ZH17_Liver_W_10%_DDA_#3_400-650mz_msms6.mzML
#>                  Accession PTM AScore Found.By Peptide_no_mods
#> 1 |denovo_HF_fake_protein9            PEAKS DB       DYENLFLKF
#> 2 |denovo_HF_fake_protein9            PEAKS DB       DYENLFLKF
#> 3 |denovo_HF_fake_protein9            PEAKS DB       DYENLFLKF
#> 4 |denovo_HF_fake_protein9            PEAKS DB       DYENLFLKF
#> 5 |denovo_HF_fake_protein9            PEAKS DB       DYENLFLKF
#> 6 |denovo_HF_fake_protein9            PEAKS DB       DYENLFLKF
#>   Potential_spliceType
#> 1               Linear
#> 2               Linear
#> 3               Linear
#> 4               Linear
#> 5               Linear
#> 6               Linear

Export

If export_files is set to TRUE and a valid directory is provided in export_dir, then the results are exported in .csv, .tsv (tab-separated) and .csv format, respectively.

Even if the export parameters were not set in the original call, the returned results can still be exported later with the export_checknetMHCpan_results function, as long as the list returned by checknetMHCpan has been stored; that list is passed to the results_list parameter of export_checknetMHCpan_results.

step2_wo_netMHCpan

The step2_wo_netMHCpan function removes peptide modifications and prepares a peptide (.pep) file for use in the web version of netMHCpan, for cases where netMHCpan is not installed, the operating system is Windows, or the user would like to use other software. Additionally, the function matches peptide sequences in the database search rerun (the second database search, in which the merged proteome was used) with the predicted splice types obtained from step 1.

Description

The step2_wo_netMHCpan function removes peptide modifications and prepares the list of unique peptides of length 9 to 12 amino acids as netMHCpan-ready input, without running netMHCpan itself. Additionally, the function matches peptide sequences in the database search rerun (the second database search, in which the merged proteome was used) with the predicted splice types obtained from step 1.

Loading data

In order to run step2_wo_netMHCpan, two inputs must be provided:

  1. peptide_rerun: the database search results from the 2nd run loaded into R as a dataframe
  2. HF_step1_output: the dataframe of the first element of the HybridFinder output.
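
# database search results from the second run (merged proteome)
db_rerun_Human_liver_AUTD17 <- read.csv(file.path(folder_Human_Liver_AUTD17, "second_run", "DB search psm.csv"), sep = ",", header = TRUE, stringsAsFactors = FALSE)

# first element of the HybridFinder output
HF_output_Human_liver_AUTD17 <- results_HybridFinder_Human_Liver_AUTD17[[1]]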

Run step2_wo_netMHCpan

Once the inputs are loaded, step2_wo_netMHCpan can be run with a single call.


results_step2_Human_Liver_AUTD17 <- step2_wo_netMHCpan(
  peptide_rerun = db_rerun_Human_liver_AUTD17,
  HF_step1_output = HF_output_Human_liver_AUTD17,
  export_files = TRUE,
  export_dir = folder_Human_Liver_AUTD17
)

Output

The function returns a list composed of 2 elements:

  • a character vector containing the list of unique peptides from the database search rerun without modifications and of length 9 to 12 amino acids.
  • the database results with the respective potential splice types retrieved from step 1.
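The preparation of the first element can be sketched in base R as follows. The modification notation "(+15.99)" and the output file name are assumptions made for illustration, and the package may handle modifications differently.

```r
# Toy rerun peptides: one modified, one duplicate, one too short
rerun_peptides <- c("DYENLFLKF", "AFSHLLLTTM(+15.99)", "DYENLFLKF",
                    "LLDPHV", "RYFSTSVSW")

# Remove anything written in parentheses (assumed modification notation)
strip_mods <- function(p) gsub("\\([^)]*\\)", "", p)

# Unique, unmodified sequences of 9 to 12 amino acids
clean <- unique(strip_mods(rerun_peptides))
netmhcpan_input <- clean[nchar(clean) >= 9 & nchar(clean) <= 12]

# Write one peptide per line to a .pep file (hypothetical file name)
pep_file <- file.path(tempdir(), "peptides.pep")
writeLines(netmhcpan_input, pep_file)

print(netmhcpan_input)
```

A file written this way, with one peptide per line, is the kind of input that can be uploaded to the web version of netMHCpan.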

Output 1: netMHCpan-ready input
#display the netmhcpan-ready input / list of all peptides 9-12 aa, without 
#modifications
print(head(results_step2_Human_Liver_AUTD17[[1]]))
#>       Peptide
#> 1   DYENLFLKF
#> 13  RYFSTSVSW
#> 17 AFSHLLLTTM
#> 19  PYMARVAFF
#> 21  LYRPTAAAF
#> 31 FPVELAKYYM
Output 2: Database search results updated
#display the updated database search results table with the categorizations from 
#step1
print(head(results_step2_Human_Liver_AUTD17[[2]]))
#>     Peptide X.10lgP     Mass Length ppm      m.z Z   RT   Area Fraction    Id
#> 1 DYENLFLKF   23.98 1187.586      9 0.1 594.8004 2 73.2 243590        3 44426
#> 2 DYENLFLKF   23.72 1187.586      9 0.1 594.8004 2 73.2 243590        3 44427
#> 3 DYENLFLKF   23.13 1187.586      9 0.1 594.8004 2 73.2 243590        3 44428
#> 4 DYENLFLKF   22.98 1187.586      9 0.1 594.8004 2 73.2 243590        3 44429
#> 5 DYENLFLKF   22.70 1187.586      9 0.1 594.8004 2 73.2 243590        3 44430
#> 6 DYENLFLKF   20.88 1187.586      9 0.1 594.8004 2 73.2 243590        3 44432
#>       Scan from.Chimera
#> 1 F3:16451           No
#> 2 F3:16501           No
#> 3 F3:16546           No
#> 4 F3:16609           No
#> 5 F3:16633           No
#> 6 F3:16686           No
#>                                                 Source.File
#> 1 171002_AM_BD-ZH17_Liver_W_10%_DDA_#3_400-650mz_msms6.mzML
#> 2 171002_AM_BD-ZH17_Liver_W_10%_DDA_#3_400-650mz_msms6.mzML
#> 3 171002_AM_BD-ZH17_Liver_W_10%_DDA_#3_400-650mz_msms6.mzML
#> 4 171002_AM_BD-ZH17_Liver_W_10%_DDA_#3_400-650mz_msms6.mzML
#> 5 171002_AM_BD-ZH17_Liver_W_10%_DDA_#3_400-650mz_msms6.mzML
#> 6 171002_AM_BD-ZH17_Liver_W_10%_DDA_#3_400-650mz_msms6.mzML
#>                  Accession PTM AScore Found.By Peptide_no_mods
#> 1 |denovo_HF_fake_protein9            PEAKS DB       DYENLFLKF
#> 2 |denovo_HF_fake_protein9            PEAKS DB       DYENLFLKF
#> 3 |denovo_HF_fake_protein9            PEAKS DB       DYENLFLKF
#> 4 |denovo_HF_fake_protein9            PEAKS DB       DYENLFLKF
#> 5 |denovo_HF_fake_protein9            PEAKS DB       DYENLFLKF
#> 6 |denovo_HF_fake_protein9            PEAKS DB       DYENLFLKF
#>   Potential_spliceType
#> 1               Linear
#> 2               Linear
#> 3               Linear
#> 4               Linear
#> 5               Linear
#> 6               Linear

Export

If export_files is set to TRUE and a valid directory is provided in export_dir, then the results are exported in .csv, .csv and .csv format, respectively.

Even if the export parameters were not set in the original call, the returned results can still be exported later with the export_step2_results function, as long as the list returned by step2_wo_netMHCpan has been stored; that list is passed to the results_list parameter of export_step2_results.

References

Faridi, P., Li, C., Ramarathinam, S. H., Vivian, J. P., Illing, P. T., Mifsud, N. A., Ayala, R., Song, J., Gearing, L. J., Hertzog, P. J., Ternette, N., Rossjohn, J., Croft, N. P., & Purcell, A. W. (2018). A subset of HLA-I peptides are not genomically templated: Evidence for cis- and trans-spliced peptide ligands. Science Immunology, 3(28), eaar3947. DOI: 10.1126/sciimmunol.aar3947

Hanada K, Yewdell JW, Yang JC. Immune recognition of a human renal cancer antigen through post-translational protein splicing. Nature. 2004 Jan 15;427(6971):252-6. DOI: 10.1038/nature02240

Marcu A, Bichmann L, Kuchenbecker L, et al. HLA Ligand Atlas: a benign reference of HLA-presented peptides to improve T-cell-based cancer immunotherapy. Journal for ImmunoTherapy of Cancer 2021;9:e002071. DOI: 10.1136/jitc-2020-002071

Birkir Reynisson, Bruno Alvarez, Sinu Paul, Bjoern Peters, Morten Nielsen, NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data, Nucleic Acids Research, Volume 48, Issue W1, 02 July 2020, Pages W449–W454. DOI: 10.1093/nar/gkaa379

Jurtz V, Paul S, Andreatta M, Marcatili P, Peters B, Nielsen M. NetMHCpan-4.0: Improved Peptide-MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data. J Immunol. 2017 Nov 1;199(9):3360-3368. Epub 2017 Oct 4. PMID: 28978689; PMCID: PMC5679736. DOI: 10.4049/jimmunol.1700893

The UniProt Consortium, UniProt: the universal protein knowledgebase in 2021, Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D480–D489. DOI: 10.1093/nar/gkaa1100