Prediction of cysteine bonding state with machine-learning methods

Fariselli P.;
2010-01-01

Abstract

In this paper we evaluate machine learning methods for predicting the bonding state of cysteines from protein sequences. This task is important as the first step in identifying disulfide bonds in proteins. We score three approaches: 1) Hidden Support Vector Machines (HSVM), which integrate SVM predictions with a Hidden Markov Model; 2) SVM-HMM, which discriminatively trains models that are isomorphic to a kth-order hidden Markov model; 3) Grammatical-Restrained Hidden Conditional Random Fields (GRHCRF), which we recently introduced. To evaluate present (and future) methods we built a new non-redundant dataset. We report performance using per-cysteine and per-protein indices. Furthermore, we evaluate two encoding schemes, based on the sequence profile and on the position-specific scoring matrix (PSSM) as computed with the PSI-BLAST program. The evaluation is carried out with different sizes of the local cysteine environment and with different Markov models. Our results show that all the methods perform better when the evolutionary information is encoded with PSSM than with the sequence profile. Finally, among the different methods, GRHCRFs perform slightly better than the others, achieving a per-protein accuracy of 87% with a correlation coefficient of 0.73. Considering that our dataset does not contain trivial cases (proteins with only one cysteine), this accuracy is among the best reported so far.
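The per-cysteine and per-protein indices mentioned in the abstract can be illustrated with a minimal Python sketch. The definitions used here are assumptions, since the abstract does not spell them out: per-cysteine accuracy and the Matthews correlation coefficient are computed over all cysteines pooled together, and a protein counts as correct only if every one of its cysteines is predicted correctly. The function names and the toy labels (1 = bonded, 0 = free) are hypothetical and are not data from the paper.

```python
from math import sqrt

def per_cysteine_scores(true_labels, pred_labels):
    """Return (accuracy, Matthews correlation) over pooled cysteines.

    Labels: 1 = bonded cysteine, 0 = free cysteine (assumed encoding).
    """
    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0.0 when the coefficient is undefined.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc

def per_protein_accuracy(proteins):
    """Fraction of proteins whose cysteines are all predicted correctly.

    `proteins` is a list of (true_labels, pred_labels) pairs, one per protein.
    """
    correct = sum(1 for t, p in proteins if list(t) == list(p))
    return correct / len(proteins)

# Toy example (invented labels, for illustration only).
proteins = [
    ([1, 1, 0], [1, 1, 0]),
    ([1, 1, 1, 1], [1, 1, 0, 1]),
    ([0, 0], [0, 0]),
    ([1, 1, 0, 0], [1, 1, 0, 1]),
]
all_true = [label for t, _ in proteins for label in t]
all_pred = [label for _, p in proteins for label in p]

acc, mcc = per_cysteine_scores(all_true, all_pred)
prot_acc = per_protein_accuracy(proteins)
print(f"per-cysteine accuracy={acc:.3f}, MCC={mcc:.3f}, per-protein accuracy={prot_acc:.2f}")
```

The per-protein index is the stricter of the two: a single wrong cysteine makes the whole protein count as an error, which is why it is usually lower than the per-cysteine accuracy.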
2010
7th INTERNATIONAL MEETING ON COMPUTATIONAL INTELLIGENCE METHODS FOR BIOINFORMATICS AND BIOSTATISTICS
Palermo (Italy)
September 16-18, 2010
Proceedings CIBB2010
1
10
Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1687460