The prediction of organelle targeting peptides in eukaryotic proteins with Grammatical Restrained Hidden Conditional Random Fields

Martelli, Pier Luigi; Indio, Valentina; Savojardo, Castrense; Fariselli, Piero; Casadio, Rita

Targeting peptides are the most important signal controlling the import of nuclear-encoded proteins into mitochondria and plastids. In the absence of experimental information, their prediction is an essential step in proteome annotation, for inferring both the localization and the sequence of mature proteins. We developed TPpred, a new predictor of organelle targeting peptides based on Grammatical Restrained Hidden Conditional Random Fields, a recently introduced machine-learning tool well suited to labeling problems (Fariselli et al., 2009). TPpred is trained on a non-redundant dataset of 297 proteins in which the presence of a targeting peptide was experimentally validated. When tested on the 297 positive and 8010 negative examples, TPpred outperforms available methods in both accuracy and Matthews correlation index (96% and 0.59, respectively). Given its very low false positive rate (3.0%), TPpred is well suited for large-scale analyses at the proteome level. We predicted that about 4% to 9% of the sequences in the human, Arabidopsis thaliana and yeast proteomes contain targeting peptides and are therefore likely to be localized in mitochondria or plastids. TPpred predictions correlate to a good extent with the experimental annotation of subcellular localization, when available. TPpred was also trained and tested to predict the cleavage site of the organelle targeting peptide; on this task, the average error of TPpred on mitochondrial and plastidic proteins is 7 and 15 residues, respectively. These values are lower than the errors reported for other currently available methods.
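
For readers less familiar with the scores quoted above, the following minimal sketch shows how accuracy, false positive rate and the Matthews correlation index are conventionally computed from a binary confusion matrix, and how a per-protein cleavage-site error could be averaged. It is not the TPpred code. The function names are hypothetical, the counts in the usage lines are invented for illustration and are not taken from the paper, and the reading of "average error" as a mean absolute difference in residues is an assumption.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, false positive rate and Matthews correlation
    for a binary confusion matrix (standard textbook definitions)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Fraction of true negatives wrongly predicted as positive.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the correlation is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "fpr": fpr, "mcc": mcc}


def mean_cleavage_error(predicted, observed):
    """Mean absolute distance, in residues, between predicted and
    observed cleavage positions (assumed definition of average error)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical counts and positions, chosen only to keep the arithmetic simple.
example = binary_metrics(tp=40, fp=10, tn=90, fn=10)
example_error = mean_cleavage_error([30, 45], [25, 55])
```

On a dataset as unbalanced as 297 positives against 8010 negatives, accuracy is dominated by the negative class, which is why a correlation measure and the false positive rate are reported alongside it.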