Feature extraction or feature encoding is a fundamental step in the construction of high-quality machine learning-based models. Specifically, it is a key step in determining how effectively the trained models can make predictions. In the last two decades, a variety of feature encoding schemes have been proposed to exploit useful patterns from protein sequences. Such schemes are often based on sequence information or on the physicochemical properties of amino acids. Although direct features derived from the sequences themselves (such as amino acid compositions, dipeptide compositions, and counts of k-mers) are regarded as essential for training models, an increasing number of studies have shown that evolutionary information in the form of PSSM profiles is much more informative than sequence information alone. Accordingly, PSSM-based feature descriptors have been commonly used as indispensable primary features for constructing models, filling a major gap in current bioinformatics research. For example, PSSM-based feature descriptors have successfully improved the prediction of structural and functional properties of proteins across a wide spectrum of bioinformatics applications, including protein fold recognition and the prediction of protein structural classes, protein-protein interactions, protein subcellular localization, RNA-binding sites, and protein functions. At the same time, there has been no comprehensive, simple tool in the R programming language for extracting all of these features from a PSSM and returning them as output. The PSSMCOOL package was developed in R for this purpose. Figure 1 is a table listing all of the features implemented in this package together with their feature-vector lengths; each of these features is then explained in full detail.
The PSSMCOOL package is currently available on CRAN:
https://CRAN.R-project.org/package=PSSMCOOL
Issues with this package can be reported at:
https://github.com/BioCool-Lab/PSSMCOOL/issues
* The feature vector length depends on the choice of a parameter. ** These features produce a matrix of features whose dimension depends on the choice of a parameter.
library(PSSMCOOL)
This feature, which stands for auto-covariance transformation, calculates, for column j, the average of that column as shown in figure 2. It then subtracts this average from the elements in rows i and i + g of that column, multiplies the two differences, and sums the products as the variable i runs from 1 to L-g. Because the variable j ranges from 1 to 20 and the variable g ranges from 1 to 10, a feature vector of length 200 is eventually obtained [1].
\[\begin{equation} {PSSM-AC}_{g,j}=\frac{1}{(L-g)}\sum_{i=1}^{L-g}(S_{i,j}-\frac{1}{L}\sum_{k=1}^{L}S_{k,j})(S_{i+g,j}-\frac{1}{L}\sum_{k=1}^{L}S_{k,j}) \end{equation}\]
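As a rough illustration of this computation, the following minimal sketch applies the formula above to a toy numeric matrix standing in for a PSSM (in practice the PSSM is read from a PSI-BLAST output file; the object names `P` and `pssm_ac_toy` are only illustrative, not the package's own code):

set.seed(1)
P <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20)  # toy stand-in for an L x 20 PSSM (L = 40)
L <- nrow(P)
pssm_ac_toy <- sapply(1:10, function(g) {          # g = 1, ..., 10
  sapply(1:20, function(j) {                       # j = 1, ..., 20
    m <- mean(P[, j])                              # column mean
    sum((P[1:(L - g), j] - m) * (P[(1 + g):L, j] - m)) / (L - g)
  })
})
length(pssm_ac_toy)   # 200 = 20 columns x 10 lags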
Usage of this feature in PSSMCOOL package:
X<-pssm_ac(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] 0.4362 0.5243 1.2938 0.9478 0.1355 4.0726 0.5714 1.2965 1.8656
## [10] 1.4541 2.6448 0.1574 0.8410 4.5394 -0.0686 0.8545 0.5419 0.8503
## [19] 0.4201 0.8307 0.4234 0.1116 0.4955 0.3260 1.6343 3.7697 0.5782
## [28] 0.9432 1.2285 1.8984 2.3239 0.7709 1.5190 4.0621 3.0589 0.5373
## [37] 0.8191 -0.3130 -0.9802 1.2718 0.7315 0.5606 0.6318 1.1806 -0.4041
## [46] 3.3667 1.6450 0.4724 1.7022 1.1787
These are three feature vectors. AAC stands for amino acid composition, which is simply the mean of the PSSM columns and therefore has length 20. DPC stands for dipeptide composition, which multiplies values located in two consecutive rows and two different columns; these products are summed over all rows, and for each pair of columns the sum is divided by L-1. Since each result depends on two different columns, a feature vector of length 400 is eventually obtained, according to figure 3 and the following equation. AADP is the combination of the AAC and DPC feature vectors [2].
\[\begin{equation} y_{i,j}=\frac{1}{(L-1)}\sum_{k=1}^{L-1}S_{k,i}S_{k+1,j}, \\(1\leq{i,j}\leq{20}) \end{equation}\]
In the above equation, the \(S_{i,j}\)'s are the PSSM elements and the \(y_{i,j}\)'s are the elements of a 20x20 matrix; by placing the rows of this matrix next to each other, the DPC-PSSM feature vector of length 400 is obtained.
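The matrix of \(y_{i,j}\) values can be written compactly as a matrix product. The short sketch below, on the same kind of toy matrix as before, only illustrates the formula and is not the package's implementation:

set.seed(1)
P <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20)  # toy L x 20 PSSM
L <- nrow(P)
Y <- t(P[1:(L - 1), ]) %*% P[2:L, ] / (L - 1)      # Y[i, j] = (1/(L-1)) * sum_k S[k, i] * S[k+1, j]
dpc_vec <- as.vector(t(Y))                         # rows of Y placed side by side: length 400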
Usage of these features in PSSMCOOL package:
X<-aac_pssm(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] -0.7023 -1.4122 -1.2214 -1.4122 -2.1145 -0.3130 -1.1679 -1.3893 -1.2214
## [10] -1.8779 -1.7863 -1.1527 -1.3817 -1.3359 -1.7786 -0.6565 -1.3359 -2.3817
## [19] -1.7634 -1.6107
ss<-dpc_pssm(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] -0.7023 -1.4122 -1.2214 -1.4122 -2.1145 -0.3130 -1.1679 -1.3893 -1.2214
## [10] -1.8779 -1.7863 -1.1527 -1.3817 -1.3359 -1.7786 -0.6565 -1.3359 -2.3817
## [19] -1.7634 -1.6107
ss<-aadp_pssm(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] -0.7023 -1.4122 -1.2214 -1.4122 -2.1145 -0.3130 -1.1679 -1.3893 -1.2214
## [10] -1.8779 -1.7863 -1.1527 -1.3817 -1.3359 -1.7786 -0.6565 -1.3359 -2.3817
## [19] -1.7634 -1.6107
This feature vector is of length 8000 and is extracted from the PSSM. If we multiply the elements located in three consecutive rows and three different columns of the PSSM by each other, do this for every set of three consecutive rows, and then sum these products, we obtain one element of the 8000-dimensional feature vector, corresponding to the three selected columns. Because there are 20 different columns, the final feature vector has length 8000 = 20 * 20 * 20. Figure 4 shows these steps. For example, in this figure, for the three marked rows and columns, the numbers at their intersections (highlighted with blue dotted circles) are multiplied by each other [3].
\[\begin{equation} T_{m,n,r}=\sum_{i=1}^{L-2}P_{i,m}P_{i+1,n}P_{i+2,r} \end{equation}\]
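A naive way to read the formula is as a 20 x 20 x 20 array accumulated over consecutive row triples. The sketch below illustrates this on a toy matrix; it is not the optimized code used inside `trigrame_pssm`:

set.seed(1)
P <- matrix(abs(rnorm(40 * 20)), nrow = 40, ncol = 20)  # toy L x 20 PSSM
L <- nrow(P)
Tarr <- array(0, dim = c(20, 20, 20))
for (i in 1:(L - 2)) {
  # the nested outer products give the term P[i, m] * P[i+1, n] * P[i+2, r] for all (m, n, r)
  Tarr <- Tarr + outer(outer(P[i, ], P[i + 1, ]), P[i + 2, ])
}
length(as.vector(Tarr))   # 8000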
Usage of this feature in PSSMCOOL package:
X<-trigrame_pssm(paste0(system.file("extdata",package="PSSMCOOL"),"/C7GSI6.txt.pssm"))
head(X, n = 50)
## [1] 6.8369 3.7511 4.1531 2.6852 2.6406 4.3763 2.8829 4.2257 3.7228 5.1379
## [11] 5.2524 4.4484 5.1002 3.8583 5.1569 7.5018 5.2072 2.4309 2.8760 5.6669
## [21] 3.5204 3.0895 2.7871 1.9459 0.7295 3.4388 2.1569 2.4319 3.2735 1.6680
## [31] 1.9269 3.1770 2.1466 2.0473 3.1579 4.1613 2.8492 1.5444 1.8998 1.9281
## [41] 3.3102 2.3229 2.7297 1.6388 1.3118 3.0535 1.9647 2.2095 2.7645 2.8273
The length of this feature vector is 320. The first 20 numbers are the means of the 20 columns of the PSSM, and the remaining values, for each column, are the mean squares of the differences between the elements in rows i and i + lag of that column. Because the lag value varies between 1 and 15, the final feature vector has a length of 320. Figure 5 and the following equation show the process of this function and the corresponding mathematical relationship, respectively [4].
\[\begin{equation} p(k)=\frac{1}{L-lag}\sum_{i=1}^{L-lag}(p_{i,j}-p_{i+lag,j})^2 \\j=1,2,...,20,lag=1,2,...,15\\k=20+j+20(lag-1) \end{equation}\]
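A minimal sketch of the 320 components (20 column means followed by 20 x 15 lagged mean squared differences), again on a toy matrix with illustrative names only:

set.seed(1)
P <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20)   # toy L x 20 PSSM
L <- nrow(P)
aac <- colMeans(P)                                  # first 20 components
lagged <- sapply(1:15, function(lag)
  sapply(1:20, function(j)
    mean((P[1:(L - lag), j] - P[(1 + lag):L, j])^2)))
pse_toy <- c(aac, as.vector(lagged))
length(pse_toy)   # 320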
Usage of this feature in PSSMCOOL package:
X<-pse_pssm(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] -0.7023 -1.4122 -1.2214 -1.4122 -2.1145 -0.3130 -1.1679 -1.3893 -1.2214
## [10] -1.8779 -1.7863 -1.1527 -1.3817 -1.3359 -1.7786 -0.6565 -1.3359 -2.3817
## [19] -1.7634 -1.6107 6.4308 11.2692 6.3462 14.7692 29.1000 6.9231 10.2000
## [28] 11.0615 9.0769 8.5385 7.3538 12.3615 7.3154 12.4231 18.2615 4.6692
## [37] 4.2846 18.3692 12.0462 9.8462
This feature is almost identical to the DPC feature; in fact, the DPC feature is a special case of it (for k = 1). For two different columns, it considers rows that are a distance k apart [5].
\[\begin{equation} T_{m,n}(k)=\sum_{i=1}^{L-k}p_{i,m}p_{i+k,n}\quad ,(1\leq{m,n}\leq{20}) \end{equation}\]
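For a fixed k, the 20 x 20 matrix \(T(k)\) can again be expressed as a matrix product, as in this illustrative sketch (not the package code):

set.seed(1)
P <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20)   # toy L x 20 PSSM
L <- nrow(P)
k <- 5
Tk <- t(P[1:(L - k), ]) %*% P[(1 + k):L, ]          # Tk[m, n] = sum_i P[i, m] * P[i + k, n]
length(as.vector(Tk))   # 400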
Usage of this feature in PSSMCOOL package:
X<-k_separated_bigrams_pssm(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),5)
head(X, n = 50)
## [1] 17.8176 13.2148 15.5094 14.4319 8.3346 22.1273 16.1736 13.4963 13.7240
## [10] 11.5597 11.8734 14.7134 12.4310 17.7145 11.2181 18.3770 12.4118 8.3759
## [19] 10.2801 13.0504 14.3991 12.7802 13.3367 12.7504 6.5913 16.3527 13.2403
## [28] 9.4264 10.8272 8.9290 7.9045 12.3299 7.7908 8.3621 10.7674 15.1463
## [37] 10.1184 6.4232 6.2908 10.3672 14.7014 13.7350 16.3146 14.6568 7.6190
## [46] 20.1827 15.6412 13.8415 13.7924 8.5747
In this group of features, in order to represent proteins of different lengths with uniform dimensions, the average evolutionary score between adjacent residues is first calculated using the following equations:
\[\begin{equation} Aver_1=(p_{i-1,k}+p_{i,s})/2 \\ Aver_2=(p_{i,s}+p_{i+1,t})/2 \\ i=1,2,...,L\quad and \quad k,s,t=1,2,...,20 \end{equation}\]
Where L is equal to the length of the protein, \(Aver_1\) is the mean score of positions i and i-1, and \(Aver_2\) is the mean score of positions i and i+1. The evolutionary difference formula (EDF) is then defined as follows:
\[\begin{equation} EDF:x_{i-1,i+1}=(Aver_1-Aver_2)^2 \end{equation}\]
\(x_{i-1,i+1}\) represents the mean of evolutionary difference between the residues of a given protein sequence. According to EDF, a given protein can be expressed by a 20 x 20 matrix called ED-PSSM, which is defined by the equations
\[\begin{equation} ED-PSSM=(e_1,e_2,...,e_{20})\quad where \quad e_t=(e_{1,t},e_{2,t},...,e_{20,t})^T \\ e_{k,t}=\frac{1}{L-2}\sum_{i=2}^{L-1}x_{i-1,i+1} \quad , k,t=1,2,...,20 \end{equation}\]
Using this ED-PSSM, the three features EDP, EEDP, and MEDP are defined by the following equations. EDP and EEDP have lengths 20 and 400, respectively, and the MEDP feature is obtained by merging these two feature vectors [6]. The output of this function is a list of three feature vectors (EDP, EEDP, MEDP).
\[\begin{equation} EDP=[\psi_1,\psi_2,...,\psi_{20}]^T \quad where \quad \psi_t=\sum_{k=1}^{20}e_{k,t}/20 \quad ,t=1,2,...,20 \\ EEDP=[\psi_{21},\psi_{22},...,\psi_{420}]^T \quad where \quad \psi_u=e_{k,t} \quad ,u=21,22,...,420 \\ MEDP=[\psi_1,\psi_2,...,\psi_{420}]^T \end{equation}\]
Figure 7 also shows this process. It is noteworthy that in the following equation the value of \(p_{i,s}\) cancels during the subtraction of \(Aver_1\) and \(Aver_2\), and is therefore not shown in figure 7.
\[\begin{equation} Aver_1=(p_{i-1,k}+p_{i,s})/2 \\ Aver_2=(p_{i,s}+p_{i+1,t})/2 \end{equation}\]
Usage of this feature in PSSMCOOL package:
X<-EDP_EEDP_MEDP(paste0(system.file("extdata",package="PSSMCOOL"),"/C7GS61.txt.pssm"))
head(X[[3]], n = 50) # in here X[[3]] indicates MEDP feature vector
## [1] 1.7186 1.0074 1.2424 1.3901 1.1034 1.0319 1.1189 1.7631 1.2699 1.5474
## [11] 1.4929 1.0104 1.1456 1.5791 1.4819 1.9171 2.4254 1.7011 1.5154 1.7371
## [21] 0.9475 1.7500 1.7575 2.2350 1.4825 1.3800 1.5575 2.2200 2.0875 1.4950
## [31] 1.4900 1.4225 1.1150 2.1150 1.8875 1.5150 2.0100 3.0325 2.0200 1.2525
## [41] 1.7175 0.3400 0.7025 0.7600 0.5275 0.5600 0.5525 1.3700 0.5825 1.1400
In this feature, each protein sequence is first divided into 20 equal parts, each of which is called a block, and in each block the row vectors of the PSSM belonging to that block are added together. The resulting vector is then divided by the length of the block, which equals 5% of the protein length. Finally, by placing these 20 vectors side by side, the first feature vector of length 400 is obtained. Figure 8 depicts this process [7].
Usage of this feature in PSSMCOOL package:
X<- AB_PSSM(system.file("extdata","C7GRQ3.txt.pssm",package="PSSMCOOL"))
head(X[1], n = 50)
## [1] -0.6667
In this feature, a TPM matrix is first constructed from the PSSM and represented by a vector according to the following equation:
\[\begin{equation} Y_{TPM}=(y_{1,1},y_{1,2},...,y_{1,20},...,y_{i,1},...,y_{i,20},...,y_{20,1},...,y_{20,20})^T \end{equation}\]
Where the components are as follows:
\[\begin{equation}
y_{i,j}=(\sum_{k=1}^{L-1}P_{k,i}\times P_{k+1,j})/(\sum_{j=1}^{20}\sum_{k=1}^{L-1}P_{k+1,j}\times P_{k,i}) \\ 1\leq{i,j}\leq{20}
\end{equation}\]
In the above equation, the numerator is the same as in the equation for the DPC-PSSM feature, without its coefficient. By placing these components together, a TPC feature vector of length 400 is obtained, and if we prepend the AAC feature vector of length 20, which is the average of the PSSM columns, an AATP feature vector of length 420 is obtained [8]. The output is a list of two features (TPC and AATP).
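The TPC vector is the DPC-style numerator normalized row-wise. A minimal sketch of this normalization on a toy matrix (illustrative only, not the package's implementation) follows:

set.seed(1)
P <- matrix(abs(rnorm(40 * 20)), nrow = 40, ncol = 20)  # toy L x 20 PSSM (non-negative for clarity)
L <- nrow(P)
num  <- t(P[1:(L - 1), ]) %*% P[2:L, ]   # numerator: sum_k P[k, i] * P[k + 1, j]
tpc  <- num / rowSums(num)               # divide each row i by its total over j
aatp <- c(colMeans(P), as.vector(t(tpc)))
length(aatp)   # 420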
Usage of this feature in PSSMCOOL package:
X<-AATP_TPC(paste0(system.file("extdata",package="PSSMCOOL"),"/C7GQS7.txt.pssm"))
head(X[[2]], n = 50) #in here X[[2]] indicates AATP feature vector
## [1] -0.7023 -1.4122 -1.2214 -1.4122 -2.1145 -0.3130 -1.1679 -1.3893 -1.2214
## [10] -1.8779 -1.7863 -1.1527 -1.3817 -1.3359 -1.7786 -0.6565 -1.3359 -2.3817
## [19] -1.7634 -1.6107 0.0333 0.0565 0.0397 0.0148 0.0697 0.0227 0.0193
## [28] 0.0232 0.0641 0.0708 0.0711 0.0353 0.0453 0.0661 0.0243 0.0397
## [37] 0.0504 0.1092 0.0778 0.0666 0.0341 0.0605 0.0581 0.0620 0.0602
## [46] 0.0123 0.0372 0.0548 0.0490 0.0563
This feature consists of a combination of several types of features, and in general, the obtained feature vector would have a length of 700. Here all parts of this feature are described separately.
First, the consensus sequence is obtained from the PSSM according to the following equation; the subsequent feature vectors are then obtained from this consensus sequence.
\[\begin{equation}
\alpha(i)=argmax{P_{i,j}:\ 1\leq j\leq {20}} \quad ,1\leq i\leq L
\end{equation}\]
where \(\alpha(i)\) is the index of the largest element in row i of the PSSM, and the i-th component of the consensus sequence equals the \(\alpha(i)\)-th amino acid of the standard amino acid alphabet, with which the columns of the PSSM are labeled. Now, using the following equation, a feature vector of length 20 is obtained:
\[\begin{equation}
CSAAC=\frac{n(j)}{L}\quad ,1\leq j\leq{20}
\end{equation}\] Here \(n(j)\) shows the number of j-th amino acid occurrences in the consensus sequence
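A small sketch of building the consensus sequence and the CSAAC composition from a toy score matrix whose columns are named with the 20 amino acids (an assumption for illustration; the package reads these names from the PSSM file):

aa <- c("A","R","N","D","C","Q","E","G","H","I","L","K","M","F","P","S","T","W","Y","V")
set.seed(1)
P <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20, dimnames = list(NULL, aa))  # toy PSSM
consensus <- aa[apply(P, 1, which.max)]                       # alpha(i): column with the largest score in row i
csaac <- as.numeric(table(factor(consensus, levels = aa))) / nrow(P)
length(csaac)   # 20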
This feature vector is obtained using the following equation from the consensus sequence:
\[\begin{equation} CSCM=\frac{\sum_{j=1}^{n_i}n_{i,j}}{L(L-1)}\quad ,1\leq i\leq{20}\quad ,1\leq j\leq L \end{equation}\] Here \(n(i)\) shows the number of i-th amino acid occurrences in the consensus sequence and \(n_{i,j}\) indicates the j-th position of the i-th amino acid in the consensus sequence.
Here the PSSM is divided into n segments, which corresponds to dividing the initial protein sequence into n segments. If n = 2:
\[\begin{equation} L_1=round(L/2) \quad , L_2=L-L_1 \end{equation}\]
Where L represents the length of the initial protein and the subscripted L's indicate the lengths of the first and second segments, respectively. Now, using the following equations, the components of a feature vector of length 200 are obtained:
\[\begin{equation} \alpha_j^\lambda=\left\{\begin{array}{ll}\frac{1}{L_1}\sum_{i=1}^{L_1}P_{i,j} & j=1,2,...,20,\lambda=0 \\ \frac{1}{{L_1}-\lambda}\sum_{i=1}^{{L_1}-\lambda}(P_{i,j}-P_{i+\lambda,j})^2 & j=1,2,...,20,\lambda=1,2,3,4\end{array}\right. \end{equation}\] \[\begin{equation} \beta_j^\lambda=\left\{\begin{array}{ll}\frac{1}{L-L_1}\sum_{i=L_1+1}^{L}P_{i,j} & j=1,2,...,20,\lambda=0 \\ \frac{1}{L-L_1-\lambda}\sum_{i=L_1+1}^{L-\lambda}(P_{i,j}-P_{i+\lambda,j})^2 & j=1,2,...,20,\lambda=1,2,3,4\end{array}\right. \end{equation}\] And if n = 3 then we have:
\[\begin{equation} L_1=round(L/3) \quad , L_2=2L_1 \quad ,L_3=L-2L_1 \end{equation}\]
Therefore, using the following equations, the components of a feature vector with length 180 are obtained as follows:
\[\begin{equation} \theta_j^\lambda=\left\{\begin{array}{ll}\frac{1}{L_1}\sum_{i=1}^{L_1}P_{i,j} & j=1,2,...,20,\lambda=0 \\ \frac{1}{L_1-\lambda}\sum_{i=1}^{L_1-\lambda}(P_{i,j}-P_{i+\lambda,j})^2 & j=1,2,...,20,\lambda=1,2\end{array}\right. \end{equation}\]
\[\begin{equation} \mu_j^\lambda=\left\{\begin{array}{ll}\frac{1}{L_1}\sum_{i=L_1+1}^{2L_1}P_{i,j} & j=1,2,...,20,\lambda=0 \\ \frac{1}{L_1-\lambda}\sum_{i=L_1+1}^{2L_1-\lambda}(P_{i,j}-P_{i+\lambda,j})^2 & j=1,2,...,20,\lambda=1,2\end{array}\right. \end{equation}\]
\[\begin{equation} v_j^\lambda=\left\{\begin{array}{ll}\frac{1}{L-2L_1}\sum_{i=2L_1+1}^{L}P_{i,j} & j=1,2,...,20,\lambda=0 \\ \frac{1}{L-2L_1-\lambda}\sum_{i=2L_1+1}^{L-\lambda}(P_{i,j}-P_{i+\lambda,j})^2 & j=1,2,...,20,\lambda=1,2\end{array}\right. \end{equation}\]
In total, using the previous feature vector, a feature vector of length 380 is obtained for this group.
In this group, using the previous equations together with the following equations, a feature vector of length 280 is obtained. When n = 2:
\[\begin{equation} AC1_j^{lg}=\frac{1}{L_1-lg}\sum_{i=1}^{L_1-lg}(P_{i,j}-\alpha_j^0)(P_{i+lg,j}-\alpha_j^0)\\ AC2_j^{lg}=\frac{1}{L-L_1-lg}\sum_{i=L_1+1}^{L-lg}(P_{i,j}-\beta_j^0)(P_{i+lg,j}-\beta_j^0)\\ j=1,2,...,20,lg=1,2,3,4 \end{equation}\]
when n=3:
\[\begin{equation} AC1_j^{lg}=\frac{1}{L_1-lg}\sum_{i=1}^{L_1-lg}(P_{i,j}-\theta_j^0)(P_{i+lg,j}-\theta_j^0)\\ AC2_j^{lg}=\frac{1}{L_1-lg}\sum_{i=L_1+1}^{2L_1-lg}(P_{i,j}-\mu_j^0)(P_{i+lg,j}-\mu_j^0)\\ AC3_j^{lg}=\frac{1}{L-2L_1-lg}\sum_{i=2L_1+1}^{L-lg}(P_{i,j}-v_j^0)(P_{i+lg,j}-v_j^0)\\ j=1,2,...,20,lg=1,2 \end{equation}\]
If we concatenate all these feature vectors, we obtain a feature vector of length 700, which is reduced by PCA and used as input for a support vector machine classifier [9].
Usage of this feature in PSSMCOOL package:
X<-CS_PSe_PSSM(system.file("extdata", "C7GSI6.txt.pssm", package="PSSMCOOL"),"total")
head(X, n = 50)
## [1] 0.0833 0.0278 0.0694 0.0139 0.0556 0.0139 0.0000 0.0278 0.0139 0.1250
## [11] 0.0833 0.0417 0.0556 0.0556 0.1111 0.0833 0.0278 0.0139 0.0417 0.0556
## [21] 0.0350 0.0139 0.0299 0.0082 0.0401 0.0104 0.0000 0.0211 0.0012 0.0841
## [31] 0.0401 0.0335 0.0280 0.0411 0.0149 0.0327 0.0125 0.0121 0.0196 0.0358
## [41] 0.4777 0.2352 0.2671 0.1705 0.1904 0.2805 0.2288 0.2839 0.2522 0.3366
If we sum the numbers of each column in the PSSM, we get a feature vector of length 20 as follows:
\[\begin{equation}
D=(d_1,d_2,...,d_{20})
\end{equation}\]
If we remove the negative elements of the PSSM, call the resulting new matrix FPSSM, and then calculate this feature vector for FPSSM, the components of this vector will depend on the length of the original protein. To eliminate this dependency, we normalize the components of this vector using the following equation:
\[\begin{equation} d_i=\frac{d_i-min}{max\times L} \end{equation}\]
Where, min and max represent the smallest and largest values of the previous vector components, respectively, and L represents the length of the original protein. The second feature vector with length 400 is obtained as follows:
\[\begin{equation}
S=(s_1^{(1)},s_2^{(1)},...,s_{20}^{(1)},s_1^{(2)},s_2^{(2)},...,s_{20}^{(2)},...,s_1^{(20)},s_2^{(20)},...,s_{20}^{(20)})
\end{equation}\]
If we name the columns of the FPSSM from \(a_1\) to \(a_{20}\) in order from left to right, then \(S_j^{(i)}\) is equal to the sum of those entries in the j-th column of the FPSSM whose corresponding row amino acid is \(a_i\). Figure 9 schematically shows these steps [10]:
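A sketch of the second (length-400) vector, assuming for illustration that FPSSM is obtained by simply clamping negative scores to zero (one possible choice) and that `seq_aa` holds the protein's residue sequence; both names are hypothetical:

aa <- c("A","R","N","D","C","Q","E","G","H","I","L","K","M","F","P","S","T","W","Y","V")
set.seed(1)
P <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20, dimnames = list(NULL, aa))  # toy PSSM
seq_aa <- sample(aa, nrow(P), replace = TRUE)       # toy residue sequence matching the rows
fpssm <- pmax(P, 0)                                 # illustrative FPSSM: negatives clamped to zero
S <- t(sapply(aa, function(a) colSums(fpssm[seq_aa == a, , drop = FALSE])))
length(as.vector(S))   # 400: S[i, j] sums column j over rows whose residue is a_i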
Usage of this feature in PSSMCOOL package:
X<-FPSSM(system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL"),20)
head(X, n = 50)
## [[1]]
## [1] 62 65 48 87 105 124 61 73 75 49 63 74 48 116 83 53 21 80 55
## [20] 54
##
## [[2]]
## [1] 14 6 1 2 5 5 0 7 7 1 2 3 1 6 2 2 2 0 0 1 3 11 2 0 2
## [26] 0 0 0 3 0 1 1 1 0 0 1 1 0 1 0 9 1 11 12 0 3 1 14 0 1
## [51] 0 1 0 9 4 5 2 0 0 3 3 4 4 41 0 7 14 1 6 0 3 5 0 0 0
## [76] 3 0 0 0 0 0 0 0 0 87 0 0 0 0 0 0 0 0 5 0 2 0 0 0 0
## [101] 3 10 6 8 0 52 12 6 15 0 1 2 10 7 21 10 1 1 0 0 0 1 0 3 0
## [126] 4 5 4 0 0 0 0 0 0 1 0 0 1 0 0 6 0 7 2 1 4 2 23 2 2
## [151] 1 4 4 5 0 0 0 6 0 1 0 2 4 1 0 16 3 3 17 0 0 8 5 1 0
## [176] 0 0 0 0 0 1 0 0 0 1 0 0 0 0 16 9 6 4 4 0 0 0 1 0 9
## [201] 2 0 0 5 2 0 5 0 0 11 23 0 4 2 0 0 0 0 1 6 2 19 4 5 6
## [226] 7 10 3 7 0 0 40 0 0 0 5 2 6 2 0 0 0 0 0 0 0 0 0 0 0
## [251] 3 0 9 0 0 0 0 0 0 0 5 2 1 0 1 8 0 1 6 4 14 0 6 46 0
## [276] 0 1 12 11 4 5 1 1 7 0 5 0 10 3 0 0 0 0 1 52 1 0 7 1 0
## [301] 5 3 4 0 0 5 4 0 3 0 1 3 1 1 1 20 1 9 1 3 4 0 0 1 0
## [326] 1 1 1 1 0 0 0 0 5 1 3 11 2 2 0 0 0 0 0 0 0 0 0 0 2
## [351] 1 0 0 7 0 0 0 16 2 0 0 0 0 0 0 2 0 0 2 2 0 0 0 9 0
## [376] 0 0 16 30 3 0 5 3 0 0 5 4 0 3 10 4 1 3 8 1 1 0 3 4 24
To generate this feature vector, the consensus sequence corresponding to the protein sequence is first extracted using the PSSM. By placing these two sequences next to each other, a matrix with dimensions 2 x L is created. In the next step, each component in the upper row of this matrix is connected to two components in the lower row, so that a graph similar to a bipartite graph is created. In this graph, each path of length 2 specifies a 3-mer and each path of length 1 specifies a 2-mer of these two sequences. Now, if we consider a table of two rows and 8000 columns in which the first row contains all possible 3-mers of the 20 amino acids, then for every 3-mer obtained from this graph we put the number 1 in the cell below the corresponding 3-mer and 0 in the other cells. This gives us a vector of length 8000. For the 2-mers obtained from this graph, a vector of length 400 is obtained in a similar way. Figures 10 and 11 show these processes [11].
Usage of this feature in PSSMCOOL package:
X<- scsh2(system.file("extdata","C7GRQ3.txt.pssm",package="PSSMCOOL"),2)
head(X, n = 200)
## [1] 1 1 1 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0
## [38] 1 0 1 0 0 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 0 1 1 0 0 1 0 1
## [75] 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 1 1 1 1
## [112] 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 1
## [149] 1 1 0 1 1 1 0 1 1 1 1 0 1 0 1 0 1 0 1 0 1 1 0 1 0 1 0 1 1 0 1 1 1 1 1 1 0
## [186] 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1
If we represent the PSSM as follows:
\[\begin{equation} D=(P_A,P_R,P_N,P_D,P_C,P_Q,P_E,P_G,P_H,P_I,P_L,P_K,P_M,P_F,P_P,P_S,P_T,P_W,P_Y,P_V) \end{equation}\]
The indices will show the standard 20 amino acids. If we assume that our primary protein has length L, each of the above columns is as follows:
\[\begin{equation} P_A=(P_{1,A},P_{2,A},...,P_{L,A})^T \end{equation}\]
Now, using the following equations, we merge the columns of the PSSM and obtain a matrix with dimensions \(L\times 10\):
\[\begin{equation} P_1=\frac{P_F+P_Y+P_W}{3},\quad P_2=\frac{P_M+P_L}{2},\quad P_3=\frac{P_I+P_V}{2}\\ P_4=\frac{P_A+P_T+P_S}{3},\quad P_5=\frac{P_N+P_H}{2},\quad P_6=\frac{P_Q+P_E+P_D}{3}\\ P_7=\frac{P_R+P_K}{2},\quad P_8=P_C,\quad P_9=P_G,\quad P_{10}=P_P \end{equation}\]
\[\begin{equation} RD=\begin{pmatrix} -&1&2&3&4&5&6&7&8&9&10\\ a_1&p_{1,1}&p_{1,2}&p_{1,3}&p_{1,4}&p_{1,5}&p_{1,6}&p_{1,7}&p_{1,8}&p_{1,9}&p_{1,10}\\ a_2&p_{2,1}&p_{2,2}&p_{2,3}&p_{2,4}&p_{2,5}&p_{2,6}&p_{2,7}&p_{2,8}&p_{2,9}&p_{2,10}\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ a_L&p_{L,1}&p_{L,2}&p_{L,3}&p_{L,4}&p_{L,5}&p_{L,6}&p_{L,7}&p_{L,8}&p_{L,9}&p_{L,10} \end{pmatrix} \end{equation}\]
Now using this new matrix we get a feature vector of length 10 as follows:
\[\begin{equation} D_s=\frac{1}{L}\sum_{i=1}^L (p_{i,s}-\overline p_s)^2 \end{equation}\]
where
\[\begin{equation} \overline p_s=\frac{1}{L}\sum_{i=1}^L p_{i,s} \quad ,s=1,2,...,10,\ i=1,2,...,L \ ,p_{i,s} \in RD \end{equation}\]
Now, using the following equations, a feature vector of length 100 is created; by combining it with the feature vector of length 10 mentioned previously, the final feature vector of length 110 is generated [12].
\[\begin{equation} \begin{aligned} x_{i,i+1}&=(p_{i,s}-\frac{p_{i,s}+p_{i+1,t}}{2})^2+(p_{i+1,t}-\frac{p_{i,s}+p_{i+1,t}}{2})^2\\ &=\frac{(p_{i,s}-p_{i+1,t})^2}{2}\quad i=1,2,...,L-1 \quad ,s,t=1,2,...,10 \end{aligned} \end{equation}\]
\[\begin{equation} \begin{aligned} D_{s,t}&=\frac{1}{L-1}\sum_{i=1}^{L-1}x_{i,i+1} \\ &=\frac{1}{L-1}\sum_{i=1}^{L-1}[(p_{i,s}-\frac{p_{i,s}+p_{i+1,t}}{2})^2+(p_{i+1,t}-\frac{p_{i,s}+p_{i+1,t}}{2})^2] \\ &=\frac{1}{L-1}\sum_{i=1}^{L-1}\frac{(p_{i,s}-p_{i+1,t})^2}{2} \quad ,s,t=1,2,...,10 \end{aligned} \end{equation}\]
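A sketch of the column-merging step and the length-10 dispersion vector \(D_s\), using the grouping given above on a toy matrix with amino-acid column names (an assumption for illustration; `RD` and `Ds` are hypothetical names):

aa <- c("A","R","N","D","C","Q","E","G","H","I","L","K","M","F","P","S","T","W","Y","V")
set.seed(1)
P <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20, dimnames = list(NULL, aa))  # toy PSSM
groups <- list(c("F","Y","W"), c("M","L"), c("I","V"), c("A","T","S"),
               c("N","H"), c("Q","E","D"), c("R","K"), "C", "G", "P")
RD <- sapply(groups, function(g) rowMeans(P[, g, drop = FALSE]))   # L x 10 reduced PSSM
Ds <- apply(RD, 2, function(col) mean((col - mean(col))^2))        # length-10 part of the feature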
Usage of this feature in PSSMCOOL package:
X<-rpssm(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] 5.0936 3.8723 4.5861 4.7846 5.5044 6.0192 6.2483 11.3585 7.3509
## [10] 8.1547 4.6368 2.9221 3.8923 3.1175 4.9356 6.0795 5.1827 9.2192
## [19] 6.4115 7.3077 5.6502 3.5894 4.2192 3.5220 5.5144 6.4173 5.3500
## [28] 8.8788 7.2519 7.5250 3.9765 3.0890 3.8028 1.6051 2.8826 3.6774
## [37] 3.3970 8.5483 4.7816 6.2393 4.9053 4.8663 6.0731 3.4271 2.8433
## [46] 3.5276 4.6500 11.0904 4.7596 6.7250
This feature, which is similar to the PSSM-AC feature, stands for cross-covariance transformation. For column \(j_1\), it calculates the average of that column as shown in figure 12 and subtracts it from the value in row i of that column. Similarly, it calculates the average of column \(j_2\) and subtracts it from the value in row i + g of that column, and then multiplies the two differences. These products are summed as the variable i runs from 1 to L-g. Because the variable \(j_1\) ranges from 1 to 20 and the variable \(j_2\) ranges over the same interval (1,20) excluding the value chosen for \(j_1\), a feature vector of length 20*19*LG is eventually obtained [13].
\[\begin{equation} PSSM-CC_{j1,j2,g}={\frac {\sum_{i=1}^{L-g}(P_{i,j1}-\overline {P_{j1}})(P_{i+g,j2}-\overline {P_{j2}})}{L-g}}\\ 1\leq j1,j2\leq 20 \end{equation}\]
where \(j_1\), \(j_2\) are two different amino acids and \(\overline {P_{j1}}\) (\(\overline {P_{j2}}\)) is the average score of amino acid \(j_1\) (\(j_2\)) along the sequence. Since the PSSM-CC variables are not symmetric, the total number of PSSM-CC variables is 380*LG. The maximum value of LG is the length of the shortest sequence in the database studied, minus one. By default LG is 10 here, hence the feature vector has length 3800.
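One element of the PSSM-CC vector, computed directly from the formula for a given pair of columns and lag g (an illustrative sketch with a made-up helper name):

set.seed(1)
P <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20)   # toy L x 20 PSSM
pssm_cc_one <- function(P, j1, j2, g) {
  L <- nrow(P)
  sum((P[1:(L - g), j1] - mean(P[, j1])) *
      (P[(1 + g):L, j2] - mean(P[, j2]))) / (L - g)
}
pssm_cc_one(P, j1 = 1, j2 = 2, g = 3)   # one of the 380 * LG components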
Usage of this feature in PSSMCOOL package:
X<-pssm_cc(system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL"))
head(X, n = 50)
## [1] 0.2548 -0.5649 0.4391 0.4263 -0.2663 -0.3238 0.9284 0.6214 0.6801
## [10] 0.1856 0.2232 0.8791 -0.5687 0.6519 0.4574 1.3451 0.9119 0.7022
## [19] 0.4430 0.7217 0.6205 -0.5507 0.1707 -0.0249 0.3202 0.3633 -0.4055
## [28] -0.8444 0.7007 -0.1290 -1.0169 -0.0933 0.3972 0.5116 -0.1171 0.2789
## [37] -0.0206 -0.0038 0.5909 1.1034 -1.2932 1.8297 0.5340 1.3181 1.4671
## [46] -1.3588 -1.4603 0.5015 -0.1133 -0.1206
Discrete cosine transforms can be described as follows:
\[\begin{equation} DCT(u,v)=\rho(u)\rho(v)\sum_{x=0}^{M-1}\sum_{y=0}^{N-1}f(x,y)\cos\frac{(2x+1)u\pi}{2M}\cos\frac{(2y+1)v\pi}{2N}\\ 0\leq{u}\leq{M-1} \quad ,0\leq{v}\leq{N-1} \end{equation}\]
where:
\[\begin{equation} \rho(u)=\left\{\begin{array}{ll}\sqrt{\frac{1}{M}} & u=0\\ \sqrt{\frac{2}{M}} & 1\leq{u}\leq{M-1}\end{array}\right.\\ \rho(v)=\left\{\begin{array}{ll}\sqrt{\frac{1}{N}} & v=0\\ \sqrt{\frac{2}{N}} & 1\leq{v}\leq{N-1}\end{array}\right. \end{equation}\]
In the above equation, the matrix \(f(x,y)\in P^{N\times M}\) is the input signal and here represents the PSSM with dimensions \(N\times 20\). According to this equation, it is clear that the length of the resulting feature vector depends on the length of the original protein, so most articles that use this feature obtain the final DCT feature vector by encoding each protein sequence with the first 400 coefficients [14].
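A direct, unoptimized transcription of the 2-D DCT above (real implementations typically use a fast transform; the quadruple loop here is only meant to mirror the formula, and `dct2_naive` is a hypothetical name):

dct2_naive <- function(f) {
  M <- nrow(f); N <- ncol(f)
  rho <- function(k, n) ifelse(k == 0, sqrt(1 / n), sqrt(2 / n))
  D <- matrix(0, M, N)
  for (u in 0:(M - 1)) for (v in 0:(N - 1)) {
    s <- 0
    for (x in 0:(M - 1)) for (y in 0:(N - 1))
      s <- s + f[x + 1, y + 1] *
           cos((2 * x + 1) * u * pi / (2 * M)) *
           cos((2 * y + 1) * v * pi / (2 * N))
    D[u + 1, v + 1] <- rho(u, M) * rho(v, N) * s
  }
  D
}
set.seed(1)
P <- matrix(rnorm(15 * 20), nrow = 15, ncol = 20)   # small toy PSSM (kept small: the loop is slow)
coef <- dct2_naive(P)                               # the first coefficients would then be kept as features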
Usage of this feature in PSSMCOOL package:
X<-Discrete_Cosine_Transform(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] 3.7586 -0.8904 -1.8115 0.9043 1.3727 -1.3320 -0.1494 0.4772 0.3272
## [10] 0.2038 -0.3245 -0.9033 1.0473 0.2255 -1.2477 0.3651 1.1561 -0.8085
## [19] -0.7821 1.1143 9.4695 -1.3575 -0.4747 0.3296 1.2757 0.6152 -0.9732
## [28] 1.0494 -0.5279 0.2443 -0.7019 -0.8720 -1.4273 0.1932 -0.0210 0.5049
## [37] -0.8403 -1.4087 0.6963 -0.1914 6.2212 -0.6490 -0.3823 0.3074 0.6949
## [46] -0.7493 -0.3240 1.9688 -0.0276 0.2205
The wavelet transform (WT) is defined as the projection of the signal \(f(t)\) onto the wavelet function according to the following equation:
\[\begin{equation} T(a,b)=\sqrt{\frac{1}{a}}\int_0^t f(t)\psi{(\frac{t-b}{a})}dt \end{equation}\]
Where a is a scale variable and b is a translation variable, \(\psi{(\frac{t-b}{a})}\) is the analyzing wavelet function, and \(T(a,b)\) is the transform coefficient found for specific locations on the signal as well as specific wavelet periods. The discrete wavelet transform can decompose amino acid sequences into coefficients at different scales and then remove the noise component from the profile. Assuming that the discrete signal \(f(t)\) is equal to \(x[n]\), where the length of the discrete signal is N, we have the following equations:
\[\begin{equation} y_{j,low}[n]=\sum_{k=1}^N x[k]g[2n-k]\\ y_{j,high}[n]=\sum_{k=1}^N x[k]h[2n-k] \end{equation}\]
In these equations, g is the low-pass filter and h is the high-pass filter; \(y_{low}[n]\) contains the approximation coefficients (low-frequency components) of the signal and \(y_{high}[n]\) contains the detail coefficients (high-frequency components). This decomposition is repeated to further increase the frequency resolution: the approximation coefficients are decomposed again with high- and low-pass filters and then downsampled. By increasing the decomposition level j, more accurate characteristics of the signal can be seen. We use a level-4 DWT and calculate the maximum, minimum, mean, and standard deviation at the different scales (4 levels of both the high- and low-frequency coefficients). Because the high-frequency components carry considerable noise, the low-frequency components are the more important ones. A schematic diagram of a level-4 DWT is shown in figure 13:
The PSSM has 20 columns and therefore consists of 20 discrete signals of length L. We apply the level-4 DWT described above to each of these discrete signals (each column of the PSSM) to extract the PSSM-DWT feature vector, which has length 80 [15].
Usage of this feature in PSSMCOOL package:
X<-dwt_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] -1.2032 2.1352 0.0547 0.4630 -0.9770 2.0423 0.0309 0.4129 -0.7134
## [10] 2.3055 0.0884 0.4222 -0.6967 1.7063 0.0818 0.4345 -0.7764 1.0817
## [19] 0.0411 0.3327 -1.2524 3.9385 0.0943 0.5317 -0.8283 2.3753 0.0723
## [28] 0.4349 -0.8160 2.0562 0.0561 0.4058 -0.7764 2.6698 0.0822 0.4252
## [37] -0.6910 2.2393 0.0964 0.3850 -0.7962 2.7484 0.0878 0.4019 -0.6288
## [46] 1.9567 0.0677 0.4157 -0.7807 1.9298
For the purpose of predicting disulfide bonds in a protein, the total number of cysteine residues in the protein sequence is first counted and their positions in the sequence are identified. Then a sliding window of length 13 is moved over the PSSM from top to bottom so that the middle of the window lies on a cysteine residue; the resulting 13 x 20 sub-matrix of the PSSM is flattened by placing its rows next to each other, giving a feature vector of length 260 = 13 * 20 per cysteine. If the position of the first or last cysteine in the protein sequence is such that the middle of the sliding window cannot lie on the cysteine residue while moving over the PSSM, the required number of zero rows is added to the top or bottom of the PSSM. Thus, for every cysteine residue present in the protein sequence, a feature vector of length 260 is formed. All pairwise combinations of these cysteines are then written in the first column of a table, and next to each pairwise combination the two corresponding feature vectors are concatenated to obtain a feature vector of length 520 for that pair. The resulting table has one row per pairwise combination of cysteines and 521 columns (the first column holds the name of the pair). This table can easily be divided into training and testing data to predict the desired disulfide bonds between cysteines. Figure 14 shows a schematic of this process [16]:
Usage of this feature in PSSMCOOL package:
X<-disulfid(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X[,1:50])
## 1 2 3 4 5 6 7 8 9 10 11
## 1 c1c2 0.2689 0.982 0.7311 0.1192 0.018 0.9526 0.7311 0.0474 0.8808 0.0474
## 2 c1c3 0.2689 0.982 0.7311 0.1192 0.018 0.9526 0.7311 0.0474 0.8808 0.0474
## 3 c1c4 0.2689 0.982 0.7311 0.1192 0.018 0.9526 0.7311 0.0474 0.8808 0.0474
## 4 c1c5 0.2689 0.982 0.7311 0.1192 0.018 0.9526 0.7311 0.0474 0.8808 0.0474
## 5 c1c6 0.2689 0.982 0.7311 0.1192 0.018 0.9526 0.7311 0.0474 0.8808 0.0474
## 6 c1c7 0.2689 0.982 0.7311 0.1192 0.018 0.9526 0.7311 0.0474 0.8808 0.0474
## 12 13 14 15 16 17 18 19 20 21 22 23
## 1 0.1192 0.7311 0.5 0.1192 0.1192 0.7311 0.5 0.018 0.1192 0.0474 0.5 0.2689
## 2 0.1192 0.7311 0.5 0.1192 0.1192 0.7311 0.5 0.018 0.1192 0.0474 0.5 0.2689
## 3 0.1192 0.7311 0.5 0.1192 0.1192 0.7311 0.5 0.018 0.1192 0.0474 0.5 0.2689
## 4 0.1192 0.7311 0.5 0.1192 0.1192 0.7311 0.5 0.018 0.1192 0.0474 0.5 0.2689
## 5 0.1192 0.7311 0.5 0.1192 0.1192 0.7311 0.5 0.018 0.1192 0.0474 0.5 0.2689
## 6 0.1192 0.7311 0.5 0.1192 0.1192 0.7311 0.5 0.018 0.1192 0.0474 0.5 0.2689
## 24 25 26 27 28 29 30 31 32 33 34 35
## 1 0.8808 0.1192 0.0474 0.1192 0.0474 0.0474 0.2689 0.5 0.5 0.1192 0.5 0.8808
## 2 0.8808 0.1192 0.0474 0.1192 0.0474 0.0474 0.2689 0.5 0.5 0.1192 0.5 0.8808
## 3 0.8808 0.1192 0.0474 0.1192 0.0474 0.0474 0.2689 0.5 0.5 0.1192 0.5 0.8808
## 4 0.8808 0.1192 0.0474 0.1192 0.0474 0.0474 0.2689 0.5 0.5 0.1192 0.5 0.8808
## 5 0.8808 0.1192 0.0474 0.1192 0.0474 0.0474 0.2689 0.5 0.5 0.1192 0.5 0.8808
## 6 0.8808 0.1192 0.0474 0.1192 0.0474 0.0474 0.2689 0.5 0.5 0.1192 0.5 0.8808
## 36 37 38 39 40 41 42 43 44 45 46 47
## 1 0.2689 0.5 0.2689 0.9526 0.982 0.8808 0.2689 0.1192 0.2689 0.9526 0.0474 0.5
## 2 0.2689 0.5 0.2689 0.9526 0.982 0.8808 0.2689 0.1192 0.2689 0.9526 0.0474 0.5
## 3 0.2689 0.5 0.2689 0.9526 0.982 0.8808 0.2689 0.1192 0.2689 0.9526 0.0474 0.5
## 4 0.2689 0.5 0.2689 0.9526 0.982 0.8808 0.2689 0.1192 0.2689 0.9526 0.0474 0.5
## 5 0.2689 0.5 0.2689 0.9526 0.982 0.8808 0.2689 0.1192 0.2689 0.9526 0.0474 0.5
## 6 0.2689 0.5 0.2689 0.9526 0.982 0.8808 0.2689 0.1192 0.2689 0.9526 0.0474 0.5
## 48 49 50
## 1 0.8808 0.0474 0.2689
## 2 0.8808 0.0474 0.2689
## 3 0.8808 0.0474 0.2689
## 4 0.8808 0.0474 0.2689
## 5 0.8808 0.0474 0.2689
## 6 0.8808 0.0474 0.2689
The extraction of this feature is carried out by using the following equations from the PSSM:
\[\begin{equation} \begin{aligned} P_{DP-PSSM}^{\alpha}&=[T',G']=[p_1,p_2,...,p_{40+40\times{\alpha}}]\\ T'&=[\bar{T}_1^P,\bar{T}_1^N,\bar{T}_2^P,\bar{T}_2^N,...,\bar{T}_{20}^P,\bar{T}_{20}^N] \end{aligned} \end{equation}\]
\[\begin{equation} \left\{\begin{array}{ll}\bar{T}_j^P=\frac{1}{NP_j}\sum T_{i,j} & ,if \ T_{i,j}\geq 0\\ \bar{T}_j^N=\frac{1}{NN_j}\sum T_{i,j} & ,if \ T_{i,j}< 0 \end{array}\right. \end{equation}\]
\[\begin{equation} \begin{aligned} G'&=[G_1,G_2,...,G_{20}]\\ G_j&=[\bar{\Delta}_{1,j}^P,\bar{\Delta}_{1,j}^N,\bar{\Delta}_{2,j}^P,\bar{\Delta}_{2,j}^N,...,\bar{\Delta}_{\alpha,j}^P,\bar{\Delta}_{\alpha,j}^N] \end{aligned} \end{equation}\]
\[\begin{equation} \left\{\begin{array}{ll}\bar{\Delta}_{k,j}^P=\frac{1}{NDP_j}\sum [T_{i,j}-T_{i+k,j}]^2 & ,if \ T_{i,j}-T_{i+k,j}\geq 0\\ \bar{\Delta}_{k,j}^N=\frac{-1}{NDN_j}\sum [T_{i,j}-T_{i+k,j}]^2 & ,if \ T_{i,j}-T_{i+k,j}< 0 \end{array}\right.\\ 0<k\leq{\alpha} \end{equation}\]
In the above equations, \(T_{i,j}\) represents the value on the i-th row and j-th column of the normalized PSSM, which is denoted by \(M_T\). This matrix is constructed from the PSSM using the following equations:
\[\begin{equation} mean_i=\frac{1}{20}\sum_{k=1}^{20}E_{i,k}\\ STD_i=\sqrt{\frac{\sum_{u=1}^{20}[E_{i,u}-mean_i]^2}{20}}\\ T_{i,j}=\frac{E_{i,j}-mean_i}{STD_i} \end{equation}\]
In the above equations, \(\overline{T}_j^P\) represents the mean of the positive values of \(\{T_{i,j}|i=1,2,...,L\}\) and \(\overline T_j^N\) the mean of the negative values of this set. This set represents the j-th column of the matrix \(M_T\); \(NP_j\) indicates the number of positive values in \(\{T_{i,j}|i=1,2,...,L\}\) and \(NN_j\) the number of negative values. It is clear that this feature vector arises from concatenating the two vectors \(T'\) and \(G'\). According to the equations, the length of the first vector is 40 and the length of the second is \(\alpha\times 40\); choosing \(\alpha=2\), as in the original article, a feature vector of length 120 is created from the PSSM [17].
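A sketch of the row-wise normalization and the \(T'\) part (the means of positive and of negative normalized scores per column); the names are illustrative, and a column without any negative (or positive) entries would give NaN in this toy version:

set.seed(1)
E <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20)   # toy L x 20 PSSM
m <- rowMeans(E)
s <- sqrt(rowMeans((E - m)^2))                      # per-row standard deviation (dividing by 20)
Tn <- (E - m) / s                                   # normalized PSSM M_T
Tprime <- as.vector(apply(Tn, 2, function(col)
  c(mean(col[col >= 0]), mean(col[col < 0]))))      # T_j^P and T_j^N for each column
length(Tprime)   # 40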
Usage of this feature in PSSMCOOL package:
X<-DP_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] 0.8343 -0.3033 0.8988 -0.5809 0.7000 -0.5259 1.1522 -0.6925 1.8422
## [10] -0.9234 0.9965 -0.4440 0.7749 -0.6224 1.1720 -0.5579 0.9059 -0.5047
## [19] 0.9162 -0.7204 1.1064 -0.7280 0.9960 -0.4775 0.7133 -0.4963 1.2897
## [28] -0.8638 1.4462 -0.7080 0.6736 -0.2366 0.5071 -0.3118 1.5519 -0.9470
## [37] 0.9599 -0.6075 0.9613 -0.6132 1.6840 -1.4978 1.5630 -1.4420 1.5522
## [46] -1.3858 1.7386 -1.7475 1.8204 -1.6987
In this feature, each column of the PSSM is treated as a non-stationary time series. Assuming that \(\{x_i\}\) and \(\{y_i\}\), for i=1,2,…,L, represent two different columns of the PSSM, two cumulative time series X, Y of these columns are obtained according to the following equations:
\[\begin{equation} \left\{\begin{array}{ll}X_k=\sum_{i=1}^K x_i & k=1,2,...,L \\ Y_k=\sum_{i=1}^k y_i & k=1,2,...,L \end{array}\right. \end{equation}\]
Now, using these two series, two backward moving averages are obtained according to the following equations:
\[\begin{equation} \left\{\begin{array}{ll}\tilde{X}_{k,s}=\frac{1}{s}\sum_{i=-(s-1)}^0 X_{(k-i)} \\ \tilde{Y}_{k,s}=\frac{1}{s}\sum_{i=-(s-1)}^0 Y_{(k-i)} \end{array}\right.\\ 1<s\leq L \end{equation}\]
Finally, each element of the DFMCA feature vector is obtained using the above equations and the following formula:
\[\begin{equation} f_{DFMCA}^2(s)=\frac{1}{L-s+1}\sum_{k=1}^{L-s+1}(X_k-\tilde{X}_{k,s})(Y_k-\tilde{Y}_{k,s}) \end{equation}\]
According to the above equation, it is clear that each element of this feature vector is obtained by using two different columns of the PSSM, and since we have 20 different columns and the order of the columns does not matter, the length of the obtained feature vector will be equal to \(\binom{20}{2}=\frac{20\times 19}{2}=190\) [18]
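One element of the DFMCA vector, computed for two toy columns directly from the definitions above (the window in the moving-average sum runs over \(X_k,...,X_{k+s-1}\) as written; `dfmca_one` is a hypothetical name):

set.seed(1)
P <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20)   # toy L x 20 PSSM
dfmca_one <- function(x, y, s) {
  L <- length(x)
  X <- cumsum(x); Y <- cumsum(y)                    # cumulative series
  k <- 1:(L - s + 1)
  Xbar <- sapply(k, function(kk) mean(X[kk:(kk + s - 1)]))  # moving average over the window in the formula
  Ybar <- sapply(k, function(kk) mean(Y[kk:(kk + s - 1)]))
  mean((X[k] - Xbar) * (Y[k] - Ybar))
}
dfmca_one(P[, 1], P[, 2], s = 7)   # one of the choose(20, 2) = 190 components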
Usage of this feature in PSSMCOOL package:
X<-DFMCA_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),7)
head(X, n = 50)
## [1] 1.5671 1.0027 1.3320 1.1677 0.5471 2.3738 1.3395 1.1099 1.2298 1.0010
## [11] 1.1077 1.0803 0.9034 1.5549 0.8586 1.5949 0.8651 0.4748 0.5141 1.0397
## [21] 1.0206 1.1935 1.0423 0.7185 1.5056 1.1396 1.0414 1.0555 0.9881 0.9916
## [31] 1.0804 0.9890 1.1909 0.9159 1.4816 1.0687 0.6386 0.7624 1.0960 1.0048
## [41] 0.8881 0.4023 1.2723 1.0321 0.7996 0.8973 0.5799 0.5536 0.9635 0.6525
This function produces a feature vector of length 100. The first 20 components are the normalized frequencies of the 20 standard amino acids in the protein; the second 20 components are the averages of the 20 columns of the PSSM corresponding to the protein; and the grey system model method is used to define the last 60 components [19]. If we denote this feature vector of length 100 as follows:
\[\begin{equation} V=(\psi_1,\psi_2,...,\psi_{100}) \end{equation}\]
then the first 20 components are as follows:
\[\begin{equation} \psi_i=f_i \quad (i=1,2,...,20) \end{equation}\]
Where \(f_i\) is the normalized frequency of type i amino acids of the 20 standard amino acids in the protein chain. If we denote the entries of the PSSM by \(p_{i,j}\), then the next 20 components of this feature vector are obtained according to the following equation:
\[\begin{equation} \psi_{j+20}=\alpha_j \quad (j=1,2,...,20),\\ \alpha_j=\frac{1}{L}\sum_{i=1}^L p_{i,j} \end{equation}\]
The next 60 components are obtained from the following equations:
\[\begin{equation} \psi_{j+40}=\delta_j \quad (j=1,2,...,60) \end{equation}\]
In this equation, the \(\delta_j\)'s are obtained as follows:
\[\begin{equation} \left\{\begin{array}{ll} \delta_{3j-2}=f_ja_1^j\\ \delta_{3j-1}=f_ja_2^j \quad (j=1,2,...,20) \\ \delta_{3j}=f_jb^j \end{array}\right. \end{equation}\]
where \(f_j\)’s are as above and \(a_1^j\), \(a_2^j\), \(b^j\) are obtained as follows:
\[\begin{equation} \begin{bmatrix}a_1^j\\a_2^j\\b^j\end{bmatrix}=(B_j^TB_j)^{-1}B_j^TU_j \quad (j=1,2,...,20) \end{equation}\]
where:
\[\begin{equation} B_j=\begin{bmatrix} -p_{2,j}&-(p_{1,j}+0.5p_{2,j})&1\\ -p_{3,j}&-(\sum_{i=1}^2 p_{i,j}+0.5p_{3,j})&1\\ \vdots&\vdots&\vdots\\ -p_{L,j}&-(\sum_{i=1}^{L-1} p_{i,j}+0.5p_{L,j})&1 \end{bmatrix} \end{equation}\]
and:
\[\begin{equation} U_j=\begin{bmatrix} p_{2,j}-p_{1,j}\\ p_{3,j}-p_{2,j}\\ \vdots\\ p_{L,j}-p_{L-1,j} \end{bmatrix} \end{equation}\]
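The coefficients \(a_1^j\), \(a_2^j\), \(b^j\) are an ordinary least-squares solution. A minimal sketch for a single column (assuming the column has at least two rows and \(B_j^TB_j\) is invertible; `grey_coef` is a made-up name, not the package's internal function):

set.seed(1)
P <- matrix(abs(rnorm(40 * 20)), nrow = 40, ncol = 20)   # toy L x 20 PSSM
grey_coef <- function(p) {
  L <- length(p)
  cum <- cumsum(p)
  B <- cbind(-p[2:L], -(cum[1:(L - 1)] + 0.5 * p[2:L]), 1)  # rows i = 2, ..., L of B_j
  U <- diff(p)                                              # p[i] - p[i - 1]
  solve(crossprod(B), crossprod(B, U))                      # (B'B)^{-1} B'U = (a1, a2, b)
}
grey_coef(P[, 1])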
Usage of this feature in PSSMCOOL package:
X<-grey_pssm_pseAAC(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] 0.0534 0.0229 0.0611 0.0687 0.0611 0.1069 0.0153 0.0534 0.0305
## [10] 0.0382 0.0534 0.0840 0.0076 0.0763 0.0687 0.0611 0.0305 0.0153
## [19] 0.0305 0.0611 0.3773 0.2965 0.3371 0.3049 0.1816 0.4356 0.3395
## [28] 0.2866 0.3109 0.2576 0.2622 0.3150 0.2776 0.3187 0.2383 0.3846
## [37] 0.2755 0.1791 0.2180 0.2808 -0.0485 -0.0001 -0.0207 -0.0220 0.0000
## [46] -0.0051 -0.0432 0.0000 -0.0131 -0.0618
This feature has been used to predict RNA-binding sites in proteins, and therefore a separate feature vector is generated for each residue. To generate it, a matrix called the smoothed PSSM is first created from the PSSM using a parameter ws, called the smoothing window size, which has a default value of 7 and can take the values 3, 5, 7, 9, and 11. The i-th row of the smoothed PSSM, which corresponds to the i-th residue of the protein, is obtained by summing the ws row vectors around the i-th row. Expressed as an equation:
\[\begin{equation} V_{smoothed_i}=V_{i-\frac{(ws-1)}{2}}+...+V_i+...+V_{i+\frac{(ws-1)}{2}} \end{equation}\]
Here, the \(V_i\)'s represent the row vectors of the PSSM. To obtain the first and last rows of the smoothed PSSM, corresponding to the N-terminus and C-terminus of the protein, zero vectors are added to the beginning and end of the PSSM. Figure 15 represents these processes schematically.
Now, using another parameter w, called the sliding window size, whose default value is 11 and which can be changed in the interval (3,41) (step = 2), the feature vector for residue \(\alpha_i\) obtained from the smoothed PSSM has the following form:
\[\begin{equation} (V_{smoothed_{i-\frac{(w-1)}{2}}},...,V_{smoothed_i},...,V_{smoothed_{i+\frac{(w-1)}{2}}}) \end{equation}\]
Here, as in the previous case, if the residue in question is among the first or last residues of the protein, \((\frac{w-1}{2})\) zero vectors are added to the beginning or end of the smoothed matrix to obtain the feature vector. The parameter w therefore determines the length of the per-residue feature vector; if its value is 11, the length of the obtained feature vector equals 220 = 11 * 20. Finally, the feature vector values are normalized between -1 and 1 [20].
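A sketch of the smoothing step (zero padding plus summing ws rows around each position); the per-residue windows of width w are then taken from the rows of this smoothed matrix. The function name `smooth_pssm` is illustrative:

set.seed(1)
P <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20)   # toy L x 20 PSSM
smooth_pssm <- function(P, ws = 7) {
  half <- (ws - 1) / 2
  pad  <- matrix(0, half, ncol(P))
  Pp   <- rbind(pad, P, pad)                        # zero vectors appended to head and tail
  t(sapply(seq_len(nrow(P)), function(i)
    colSums(Pp[i:(i + ws - 1), , drop = FALSE])))   # sum of the ws rows centred on row i
}
dim(smooth_pssm(P))   # 40 x 20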
Usage of this feature in PSSMCOOL package:
X<-smoothed_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),7,11,c(2,3,8,9))
head(X[,1:50], n = 50)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636
## [2,] 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636
## [3,] -0.0909 0.1273 -0.3455 -0.5636 -0.3091 0.0909 -0.3455 -0.4182 -0.0182
## [4,] -0.0545 0.0909 -0.4182 -0.6727 -0.3455 0.0545 -0.4182 -0.4545 -0.0545
## [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19]
## [1,] 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636
## [2,] 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636
## [3,] 0.2000 0.4182 -0.0545 0.7091 0.1273 -0.4545 0.1273 -0.0182 0.1273 0.1636
## [4,] 0.2000 0.4182 -0.1273 0.6727 0.1636 -0.4545 0.2364 0.0182 0.1273 0.0909
## [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28]
## [1,] 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636
## [2,] 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636
## [3,] -0.0182 -0.0545 0.0909 -0.4182 -0.6727 -0.3455 0.0545 -0.4182 -0.4545
## [4,] -0.0545 0.0182 0.0909 -0.4182 -0.6727 -0.3455 0.0182 -0.4182 -0.4909
## [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
## [1,] 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636
## [2,] 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636
## [3,] -0.0545 0.2000 0.4182 -0.1273 0.6727 0.1636 -0.4545 0.2364 0.0182 0.1273
## [4,] -0.0909 0.3091 0.4182 -0.1636 0.4545 0.1636 -0.4182 0.2727 0.0545 0.2727
## [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47]
## [1,] 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636
## [2,] 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636 0.1636
## [3,] 0.0909 -0.0545 0.0182 0.0909 -0.4182 -0.6727 -0.3455 0.0182 -0.4182
## [4,] 0.1273 0.0545 0.0182 0.0545 -0.3455 -0.7091 -0.3818 -0.0909 -0.4909
## [,48] [,49] [,50]
## [1,] 0.1636 0.1636 0.1636
## [2,] 0.1636 0.1636 0.1636
## [3,] -0.4909 -0.0909 0.3091
## [4,] -0.5273 -0.2727 0.3818
To produce this feature vector, similarly to the smoothed-PSSM feature, the PSSM is first smoothed by appending zero vectors to its head and tail and using a sliding window of odd size. This smoothed PSSM is then condensed using the Kidera factors to produce a feature vector for each residue [21].
Usage of this feature in PSSMCOOL package:
X<-kiderafactor(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),c(2,3,8,9))
head(X[,1:50], n = 50)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 2 0.000 0.00 0.000 0.000 0.00 0.000 0.00 0.000 0.000 0.000 0.000 0.000
## 3 0.000 0.00 0.000 0.000 0.00 0.000 0.00 0.000 0.000 0.000 0.000 0.000
## 8 0.000 0.00 0.000 0.000 0.00 0.000 0.00 0.000 0.000 0.000 -1.072 0.650
## 9 -1.072 0.65 0.667 -0.199 0.44 -0.484 -0.03 -0.575 -0.617 -0.217 -1.120 0.953
## 13 14 15 16 17 18 19 20 21 22 23
## 2 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 8 0.667 -0.199 0.440 -0.484 -0.030 -0.575 -0.617 -0.217 -1.120 0.953 0.924
## 9 0.924 -0.099 0.508 -0.433 0.103 -0.637 -0.305 0.152 -1.298 1.250 1.203
## 24 25 26 27 28 29 30 31 32 33 34
## 2 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.00 0.000 0.000
## 3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.00 0.000 0.000
## 8 -0.099 0.508 -0.433 0.103 -0.637 -0.305 0.152 -1.298 1.25 1.203 -0.792
## 9 -0.792 0.346 -0.760 -0.068 -0.546 -0.492 -0.017 -1.334 1.06 1.274 -1.052
## 35 36 37 38 39 40 41 42 43 44 45
## 2 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 8 0.346 -0.760 -0.068 -0.546 -0.492 -0.017 -1.334 1.060 1.274 -1.052 0.247
## 9 0.247 -0.929 -0.177 -0.833 -0.541 -0.152 -1.399 1.138 1.512 -1.311 0.074
## 46 47 48 49 50
## 2 0.000 0.000 0.000 0.000 0.000
## 3 0.000 0.000 0.000 0.000 0.000
## 8 -0.929 -0.177 -0.833 -0.541 -0.152
## 9 -0.997 -0.406 -0.859 -0.767 -0.260
In this feature, three different autocorrelation descriptors based on the PSSM are adopted: the normalized Moreau-Broto, Moran, and Geary autocorrelation descriptors. An autocorrelation descriptor is a powerful statistical tool defined on the distribution of amino acid properties along the sequence; it measures the correlation, in terms of evolutionary scores, between two residues separated by a distance d [22].
Usage of this feature in PSSMCOOL package:
X<-MBMGACPSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] 0.377 0.296 0.337 0.305 0.182 0.436 0.340 0.287 0.311 0.258 0.262 0.315
## [13] 0.278 0.319 0.238 0.385 0.275 0.179 0.218 0.281 0.153 0.093 0.142 0.110
## [25] 0.043 0.253 0.126 0.110 0.120 0.094 0.118 0.099 0.091 0.152 0.066 0.168
## [37] 0.084 0.039 0.050 0.093 0.151 0.097 0.121 0.091 0.047 0.243 0.124 0.102
## [49] 0.118 0.101
This feature applies the linear predictive coding algorithm to each column of the PSSM. To produce this feature vector, the “lpc” function from the “phonTools” R package is used, which produces a 14-dimensional vector for each column. Since the PSSM has 20 columns, a 20*14 = 280-dimensional feature vector is eventually obtained for each PSSM [23].
Usage of this feature in PSSMCOOL package:
X<-LPC_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] 1.0000 0.9084 0.8390 0.5575 0.3737 0.4299 0.4070 0.4765 0.4369
## [10] 0.2830 0.3552 0.2212 0.1509 0.0496 1.0000 0.8380 0.7532 0.7463
## [19] 0.6135 0.3585 0.3674 0.2649 0.1610 0.2800 0.2347 0.2110 0.0763
## [28] 0.1057 1.0000 0.6738 0.6541 0.5619 0.5620 0.4152 0.4450 0.4506
## [37] 0.1943 0.1519 0.1683 0.0216 -0.1052 0.0215 1.0000 0.8553 0.8707
## [46] 0.6323 0.6506 0.7144 0.5384 0.5318
To generate this feature vector, for each of the standard amino acids we find the positions in the protein containing that amino acid and extract the corresponding rows of the PSSM to obtain a sub-matrix. We then calculate the column means of this matrix, so that for each amino acid a vector of length 20 is obtained. Finally, by putting these 20 vectors together, a feature vector of length 400 is obtained for each protein. As an example, figure 16 shows the PSSM rows corresponding to amino acid S [24].
Usage of this feature in PSSMCOOL package:
X<-pssm400(system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL"))
head(X, n = 50)
## [1] 3.5000 2.2778 2.1667 2.1667 2.1667 2.6111 2.2778 2.8333 2.5556 2.0556
## [11] 2.1111 2.3889 2.1111 2.2778 1.9444 2.7222 2.3333 1.5000 2.1111 2.1667
## [21] 1.0556 1.7778 1.1111 0.8333 1.0000 1.1111 0.8889 0.8333 1.2222 0.7778
## [31] 0.8889 1.1667 1.0000 0.7778 0.7222 1.1667 1.0556 0.5556 0.8889 0.7778
## [41] 3.3889 2.2222 3.6667 3.6667 1.7778 2.8889 2.8333 3.7778 2.4444 2.0556
The idea of this feature is similar to the probe concept used in microarray technologies, where probes are used to identify genes; for convenience, we call it the residue probing method. In this application, each probe is an amino acid, which corresponds to a particular column in the PSSM profile. For each probe, we average the PSSM scores of all the amino acids in the associated column with a PSSM value greater than zero in the sequence, which leads to a 1 x 20 feature vector. For the 20 probes together, the final feature vector for each protein sequence is a 1 x 400 vector [7].
Usage of this feature in PSSMCOOL package:
X<- RPM_PSSM(system.file("extdata","C7GRQ3.txt.pssm",package="PSSMCOOL"))
X
## [1] 4.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [11] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [21] 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [31] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [41] 0.000 0.500 2.077 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [51] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [61] 0.000 3.000 1.300 3.000 0.000 0.000 0.000 0.000 0.000 0.000
## [71] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [81] 2.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [91] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [101] 0.000 0.500 0.778 0.750 0.000 1.000 0.000 0.000 0.000 0.000
## [111] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [121] 0.000 2.000 1.200 2.167 0.000 2.000 2.000 0.000 0.000 0.000
## [131] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [141] 1.500 0.000 2.000 4.000 0.000 0.000 1.200 4.750 0.000 0.000
## [151] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [161] 0.000 0.500 1.500 0.500 0.000 0.000 0.333 0.000 3.000 0.000
## [171] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [181] 2.000 0.000 0.000 0.000 1.000 0.000 1.250 0.000 0.000 3.000
## [191] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [201] 0.000 0.000 0.000 0.000 0.000 0.000 2.000 0.000 0.000 0.500
## [211] 2.429 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [221] 0.000 1.000 1.143 0.200 0.000 1.500 0.800 0.000 2.000 0.000
## [231] 1.000 3.231 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [241] 0.000 0.000 0.000 0.000 0.000 0.000 2.500 0.000 1.000 2.500
## [251] 2.500 0.000 8.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [261] 0.000 0.000 1.000 0.000 0.000 0.000 0.500 0.000 0.000 1.667
## [271] 1.333 0.000 0.000 2.000 0.000 0.000 0.000 0.000 0.000 0.000
## [281] 1.000 1.000 1.500 0.000 0.000 0.000 1.000 0.000 0.000 0.000
## [291] 2.000 0.500 0.000 1.000 6.500 0.000 0.000 0.000 0.000 0.000
## [301] 1.000 1.000 1.636 1.429 0.000 1.500 1.000 1.500 3.000 1.000
## [311] 1.167 1.222 1.000 0.000 3.000 2.357 0.000 0.000 0.000 0.000
## [321] 0.000 1.000 1.300 0.333 0.000 1.000 0.857 0.000 0.000 1.000
## [331] 1.286 0.556 1.000 0.000 2.000 1.545 3.600 0.000 0.000 0.000
## [341] 0.000 0.000 0.000 0.000 1.000 0.000 3.000 0.000 0.000 0.000
## [351] 7.000 1.000 0.000 10.000 0.000 0.000 11.000 13.000 0.000 0.000
## [361] 0.000 0.000 0.500 0.000 0.000 0.000 4.000 0.000 3.000 1.667
## [371] 3.000 0.000 0.000 4.000 0.000 3.000 0.000 1.000 5.400 0.000
## [381] 6.000 0.000 0.000 0.000 0.000 0.000 0.667 0.000 0.000 2.889
## [391] 1.933 0.000 0.000 0.000 0.000 0.500 2.000 0.000 1.667 2.000
In this feature, the PSSM is first divided into blocks based on a number N entered by the user. For each block, the column means are computed to obtain a 20-dimensional vector, and the final feature vector is obtained by appending these vectors to each other [25].
Usage of this feature in PSSMCOOL package:
X<-PSSMBLOCK(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),5)
head(X, n = 50)
## [1] 0.3773 0.2965 0.3371 0.3049 0.1816 0.4356 0.3395 0.2866 0.3109 0.2576
## [11] 0.2622 0.3150 0.2776 0.3187 0.2383 0.3846 0.2755 0.1791 0.2180 0.2808
## [21] 0.4032 0.2792 0.3606 0.2637 0.1736 0.5488 0.2988 0.3013 0.3996 0.2927
## [31] 0.3335 0.2389 0.3550 0.4466 0.2563 0.4349 0.2933 0.2153 0.2450 0.2929
## [41] 0.3530 0.3048 0.3146 0.3467 0.1921 0.3139 0.3719 0.2744 0.2156 0.2261
To generate this feature vector, for column j the sum of all values in that column is first calculated and denoted by \(T_j\). Then, starting from the first row of the column, the values are added one by one until the partial sum reaches a number less than or equal to 25% of \(T_j\); the number of elements used in this sum is denoted by \(I_j^1\) and stored. Starting again from the first row, the values are added until the partial sum reaches a number less than or equal to 50% of \(T_j\); the number of elements in this sum is denoted by \(I_j^2\). In the same way, starting from the last row of column j and adding elements until the partial sum reaches at most 25% of \(T_j\) gives \(I_j^3\), and repeating this from the last row for 50% of \(T_j\) gives \(I_j^4\). Therefore, four numbers are obtained for each column, and since the PSSM has 20 columns, a feature vector of length 80 is obtained for each protein [26]. Figure 17 shows these steps schematically.
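A sketch of the four counts \(I_j^1,...,I_j^4\) for each column, assuming non-negative (FPSSM-like) scores so that the partial sums are monotone; the names and the clamping-free toy data are illustrative only:

set.seed(1)
P <- matrix(abs(rnorm(40 * 20)), nrow = 40, ncol = 20)   # toy non-negative L x 20 matrix
sd_counts <- function(col) {
  Tj  <- sum(col)
  fwd <- cumsum(col)         # partial sums from the first row downwards
  bwd <- cumsum(rev(col))    # partial sums from the last row upwards
  c(I1 = sum(fwd <= 0.25 * Tj), I2 = sum(fwd <= 0.50 * Tj),
    I3 = sum(bwd <= 0.25 * Tj), I4 = sum(bwd <= 0.50 * Tj))
}
sd_feat <- as.vector(apply(P, 2, sd_counts))   # 4 values per column -> length 80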
Usage of this feature in PSSMCOOL package:
X<-PSSM_SD(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [[1]]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 54 23 15 13 31 10 14 22 18 44 52 16 62 77
## [2,] 70 57 70 32 46 13 54 66 89 71 79 32 84 94
## [3,] 88 83 90 88 104 121 87 86 110 107 108 63 109 116
## [4,] 71 57 70 32 46 110 54 66 89 72 79 32 84 97
## [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 24 32 32 36 38 37
## [2,] 62 80 68 72 77 62
## [3,] 112 108 94 98 105 99
## [4,] 77 80 68 75 77 68
##
## [[2]]
## [1] 54 23 15 13 31 10 14 22 18 44 52 16 62 77 24 32 32 36 38
## [20] 37 70 57 70 32 46 13 54 66 89 71 79 32 84 94 62 80 68 72
## [39] 77 62 88 83 90 88 104 121 87 86 110 107 108 63 109 116 112 108 94
## [58] 98 105 99 71 57 70 32 46 110 54 66 89 72 79 32 84 97 77 80
## [77] 68 75 77 68
This feature, similar to the previous feature, divides each column into four parts and calculates the values for each column. Then, using the following equations, it calculates the values of Segmented Auto Covariance Features. The final feature vector length will be 100.
\[\begin{equation} PSSM-Seg_{n,j}=\frac{1}{(I_j^n-m)}\sum_{i=1}^{I_j^n-m}(P_{i,j}-P_{ave,j})(P_{(i+m),j}-P_{ave,j})\\ (n=1,2,3,4 \ and \ j=1,...,20 \ and \ m\in \{1,2,...,11\}) \end{equation}\]
In the above equation, \(P_{ave,j}\) represents the mean of column j in the PSSM and the number m is somehow a distance factor for each segment. Using the above equation, the feature of length 80 is obtained. Now the feature vector PSSM_AC is calculated using the previous factor m with length 20 and is added to the previous vector to get the final feature vector of length 100 [26].
\[\begin{equation} PSSM-AC_{m,j}=\frac{1}{(L-m)}\sum_{i=1}^{L-m}(P_{i,j}-P_{ave,j})(P_{(i+m),j}-P_{ave,j}) \end{equation}\]
L Represents the total length of the protein.
Usage of this feature in PSSMCOOL package:
X<-pssm_seg(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),3)
head(X, n = 50)
## [1] 0.0342 0.0306 0.0257 0.0307 0.0316 0.0238 0.0170 0.0238 0.0348
## [10] 0.0308 0.0219 0.0308 0.0706 0.0358 0.0275 0.0358 0.0343 0.0322
## [19] -0.0013 0.0322 -0.0070 0.0398 0.0602 0.0649 0.0691 0.0482 0.0426
## [28] 0.0482 0.0091 0.0332 0.0243 0.0332 0.0526 0.0400 0.0364 0.0400
## [37] 0.0931 0.0559 0.0311 0.0555 0.1182 0.0769 0.0584 0.0769 0.0283
## [46] 0.0371 0.0175 0.0371 0.0287 0.0242
This feature also considers each of the columns of the PSSM as a time series. If L represents the length of the protein, then the column j of the matrix can be thought of \(y(i), \ i=1,2,...,L\) as a time series. The SOMA algorithm is implemented in two steps using the following equations on the PSSM: - First, the moving average \(\overline{y_n}(i)\) for the time series \(y(i)\) is calculated according to the following equation:
\[\begin{equation} \overline{y_n}(i)=\frac{1}{n}\sum_{k=0}^{n-1}y(i-k) \end{equation}\]
Where n is the size of the moving average window, and if it tends to zero, the moving average will tend to original series, in other words: if \({n \to 0}\) then \({\overline{y_n}(i)\to y(i)}\). Next, for a moving average window size n which \(2\leq n<L\), the second-order difference of the time series \(y(i)\) with respect to the moving average \(\overline{y_n}(i)\) is defined according to the following equation:
\[\begin{equation} \sigma_{MA}^2=\frac{1}{L-n}\sum_{i=n}^L[y(i)-\overline{y_n}(i)]^2 \end{equation}\]
The number n must be smaller than the length of the smallest protein in the database under study. In the paper used by this algorithm, the length of the smallest protein is 10 and therefore the number n will vary from 2 to 9, so according to above equation, by putting the numbers \(\sigma_{MA}^2\) next to each other, 8 numbers are obtained for each column, and therefore the final feature vector will be of length 160 [27].
Usage of this feature in PSSMCOOL package:
X<-SOMA_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 50)
## [1] 0.2362 0.1898 0.2082 0.2175 0.1262 0.3151 0.2126 0.1880 0.2021 0.1703
## [11] 0.1830 0.1946 0.1593 0.2495 0.1656 0.2278 0.1300 0.1307 0.1247 0.1810
## [21] 0.7849 0.5708 0.7035 0.6571 0.3412 1.1477 0.6830 0.5934 0.6513 0.5323
## [31] 0.6020 0.5920 0.4988 0.8121 0.4674 0.7992 0.4326 0.3434 0.3525 0.5541
## [41] 1.6475 1.1293 1.4382 1.2852 0.6546 2.4681 1.3966 1.2121 1.3366 1.1003
Singular value decomposition is a general-purpose matrix factorization approach that has many useful applications in signal processing and statistics. In this feature SVD is applied to a matrix representation of a protein aimed to reduce its dimensionality. Given an input matrix Mat with dimensions N*M SVD is used to calculate its factorization of the form: \(Mat=U\Sigma V\) where \(\Sigma\) is a diagonal matrix whose diagonal entries are known as the singular values of Mat.The resulting descriptor is the ordered set of singular values: \(SVD\in\mathcal{R}^L\) where L=min(M,N). Since the PSSM has 20 columns, the final feature vector would be of length 20 [24].
Usage of this feature in PSSMCOOL package:
X<-SVD_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(X, n = 20)
## [1] 16.312 8.469 5.364 4.757 4.254 3.710 3.687 3.207 2.943 2.687
## [11] 2.565 2.262 2.244 2.073 1.832 1.755 1.621 1.507 1.441 1.061
# Installing PSSMCOOL and loading it
# install.packages("PSSMCOOL")
# library(PSSMCOOL)
# setting up working environment and downloading necessary files from GitHub
current_directory <- "/home/PSSMCOOL/" # Please provide your desired directory.
setwd(current_directory)
# Downloading the required PSSM files
pssm_url <- 'https://github.com/BioCool-Lab/PSSMCOOL/raw/main/classification-code-data/all_needed_pssms90.zip'
download.file(pssm_url, './all_needed_pssm90.zip', method = 'auto', quiet = FALSE)
unzip('all_needed_pssm90.zip', exdir = 'all_needed_pssm90')
PSSM_directory <- 'all_needed_pssm90/all_needed_pssms90/'
# Downloading positive data and loading it to R
url <- "https://raw.githubusercontent.com/BioCool-Lab/PSSMCOOL/main/classification-code-data/positive.csv"
download.file(url, './PositiveData.csv')
positive_data <- read.csv("./PositiveData.csv", header = TRUE)
# Downloading negative data and loading it to R
url <- "https://raw.githubusercontent.com/BioCool-Lab/PSSMCOOL/main/classification-code-data/negative.csv"
download.file(url, './NegativeData.csv')
negative_data <- read.csv("./NegativeData.csv", header = TRUE)
# ##################—Positive feature extraction—#######################
# Feature extraction
positiveFeatures<- c()
for(i in 1:dim(positive_data)[1]) {
ff<-FPSSM2(paste0(PSSM_directory, positive_data[i,1],'.fasta.pssm'),
paste0(PSSM_directory, positive_data[i,2],'.fasta.pssm'), 20)
positiveFeatures<-rbind(positiveFeatures, ff)
}
# Adding row names and class
positiveFirstColumn <- c()
for(i in 1:dim(positive_data)[1]) {
dd <- paste(positive_data[i,1], '-' ,positive_data[i,2])
positiveFirstColumn <- rbind(positiveFirstColumn, dd)
}
pos_class <- rep("Interaction", dim(positiveFeatures)[1])
positiveFeatures2 <- cbind(positiveFirstColumn, positiveFeatures, pos_class)
#################—Negative feature extraction—#####################
# Feature extraction
negativeFeatures <- c()
for(i in 1:dim(negative_data)[1]) {
ff2<-FPSSM2(paste0(PSSM_directory, negative_data[i,1],'.fasta.pssm'),
paste0(PSSM_directory, negative_data[i,2],'.fasta.pssm'), 20)
negativeFeatures<-rbind(negativeFeatures, ff2)
}
# Adding row names and class
negativeFirstColumn <- c()
for(i in 1:dim(negative_data)[1]) {
dd2 <- paste(negative_data[i,1], '-' ,negative_data[i,2])
negativeFirstColumn <- rbind(negativeFirstColumn, dd2)
}
neg_class <- rep("Non.Interaction", dim(negativeFeatures)[1])
negativeFeatures2 <- cbind(negativeFirstColumn, negativeFeatures, neg_class)
# Merging two feature vectors
mainDataSet <- rbind(positiveFeatures2, negativeFeatures2)
#################—Preparing data set for model training—##############
# In the following we are going to carry out classification on the data we have prepared so far (mainDataSet)
# First we need to install and load caret package and its dependencies
install.packages('caret', dependencies = TRUE)
library(caret)
bmp.R2.submission.data.df <- as.data.frame(mainDataSet)
colnames(bmp.R2.submission.data.df)[1] <- "interactions"
dim(bmp.R2.submission.data.df)#1730 102
# Assigning the Uniprot IDs for each protein pairs to the row name
rownames(bmp.R2.submission.data.df) <- bmp.R2.submission.data.df$interactions
# Removing the Uniprot IDs
bmp.R2.submission.data.df <-bmp.R2.submission.data.df[,-1]
View(bmp.R2.submission.data.df)
colnames(bmp.R2.submission.data.df) <- c(paste0('Frt', 1: dim(positiveFeatures)[2]), 'Class')
dim(bmp.R2.submission.data.df)#1730 101
table(bmp.R2.submission.data.df$Class)
# Interaction–Non-Interaction
# 865——————865
bmp.R2.submission.data.df$Class <-
as.factor(bmp.R2.submission.data.df$Class)
write.csv(bmp.R2.submission.data.df, 'DataSet.csv')
################—Training model with two classifier—##############
# setting.the.trainControl
bmp.R2.submission.data.df <- read.csv("DataSet.csv")
setting.the.trainControl.3 <- function()
{
#setting the trainControl function parameter: repeated CV; downsampling;
set.seed(100)
fitControl <- trainControl(## 10-fold CV
method = "cv",
returnData = TRUE,
classProbs = TRUE,
)
return(fitControl)
}
#########—setting cross validation parameters—#####################
trainControl.for.PSSM <- setting.the.trainControl.3()
# ######—10-fold cross-validation using “Bagged CART (treebag)” classifier—###
cross.validation.bulit.model.treebag <-
train(Class ~ ., data = bmp.R2.submission.data.df,
method = "treebag",
trControl = trainControl.for.PSSM,
verbose = FALSE)
print(cross.validation.bulit.model.treebag$results)
# parameter—Accuracy—–Kappa—–AccuracySD—-KappaSD
# 1-none—0.9965351—0.9930707—0.005582867—0.01116413
# ####—10-fold cross-validation using “Single C5.0 Tree (C5.0Tree)” classifier—##
cross.validation.bulit.model.C5.0Tree <-
train(Class ~ ., data = bmp.R2.submission.data.df,
method = "C5.0Tree",
trControl = trainControl.for.PSSM,
verbose = FALSE)
print(cross.validation.bulit.model.C5.0Tree$results)
# parameter—Accuracy—-Kappa—-AccuracySD—-KappaSD
# 1–none—0.9976911—0.9953822—0.004028016—0.008056142
# #############—SessionInFo in R—##################### # References
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.