The prediction of signal peptides in proteins with Grammatical Restrained Hidden Conditional Random Fields

Martelli, P; Indio, V; Savojardo, C; Fariselli, P; Casadio, R

Signal peptides are short, cleavable, N-terminal peptides (from 15 up to 50 residues long) that are present in most proteins targeted to the secretory pathway, including proteins that localize in intermediate compartments such as the endoplasmic reticulum and the Golgi apparatus. The annotation of signal peptides is therefore an important step in characterizing protein function and localization and in determining the sequence of the mature protein. In the absence of experimental data, prediction methods are useful tools that allow large-scale proteome annotation. We developed SPpred, a new predictor of signal peptides based on Grammatical Restrained Hidden Conditional Random Fields, a recently introduced machine-learning tool well suited to solving labeling problems (Fariselli et al., 2009). SPpred is trained on a non-redundant dataset of proteins in which the presence of a signal peptide was experimentally validated, comprising 1495 sequences from Eukaryotes, 417 from Gram-negative bacteria and 104 from Gram-positive bacteria. Prediction performance was evaluated in cross-validation on these 2016 positive examples together with non-redundant negative examples: 15,714 from Eukaryotes, 1741 from Gram-negative bacteria and 922 from Gram-positive bacteria. For all three classes, SPpred predicts the presence of a signal peptide with a Matthews correlation coefficient of 0.87 and an accuracy ranging between 97% and 98%. The accuracy in predicting the cleavage site ranges from 95% (in Gram-positive bacteria) to 97% (in Eukaryotes). Because signal peptides are hydrophobic in composition, a known problem of most available predictive methods is that they tend to misclassify many N-terminal transmembrane alpha-helices as signal peptides. In the case of SPpred, the rate of such false predictions is limited to 4.3% of the proteins endowed with a transmembrane alpha-helix in the first 50 residues of their sequence.
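For readers unfamiliar with the two detection measures quoted above, the following minimal Python sketch shows how accuracy and the Matthews correlation coefficient are conventionally computed from the counts of a binary confusion matrix (signal peptide present or absent). The function names and the example counts are hypothetical and chosen only for illustration; they are not taken from the SPpred evaluation.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of proteins whose class (signal peptide or not) is predicted correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient of a binary prediction.

    Ranges from -1 (total disagreement) to +1 (perfect prediction);
    returns 0.0 when the denominator vanishes, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, for illustration only.
    tp, tn, fp, fn = 90, 880, 20, 10
    print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
    print(f"MCC      = {matthews_corrcoef(tp, tn, fp, fn):.3f}")
```

Because negative examples greatly outnumber positive ones in datasets of this kind, accuracy alone can look high even when many signal peptides are missed, which is why the Matthews correlation coefficient is usually reported alongside it.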