Evolutionary information is the primary tool for detecting functional conservation in nucleic acid and protein. This information has been extensively used to predict structure, interactions and functions in macromolecules. Pathogenicity prediction models rely on multiple sequence alignment information at different levels. However, most accurate genome-wide variant deleteriousness ranking algorithms consider different features to assess the impact of variants. Here, we analyze three different ways of extracting evolutionary information from sequence alignments in the context of pathogenicity predictions at DNA and protein levels. We showed that protein sequence-based information is slightly more informative in the annotation of Clinvar missense variants than those obtained at the DNA level. Furthermore, to achieve the performance of state-of-the-art methods, such as CADD and REVEL, the conservation of reference and variant, encoded as frequencies of reference/alternate alleles or wild-type/mutant residues, should be included. Our results on a large set of missense variants show that a basic method based on three input features derived from the protein sequence profile performs similarly to the CADD algorithm which uses hundreds of genomic features. As expected, our method results in ~ 3% lower area under the receiver-operating characteristic curve (AUC). When compared with an ensemble-based algorithm (REVEL). Nevertheless, the combination of predictions of multiple methods can help to identify more reliable predictions. These observations indicate that for missense variants, evolutionary information, when properly encoded, plays the primary role in ranking pathogenicity.

Evaluating the relevance of sequence conservation in the prediction of pathogenic missense variants

Fariselli P.
2022-01-01

Abstract

Evolutionary information is the primary tool for detecting functional conservation in nucleic acid and protein. This information has been extensively used to predict structure, interactions and functions in macromolecules. Pathogenicity prediction models rely on multiple sequence alignment information at different levels. However, most accurate genome-wide variant deleteriousness ranking algorithms consider different features to assess the impact of variants. Here, we analyze three different ways of extracting evolutionary information from sequence alignments in the context of pathogenicity predictions at DNA and protein levels. We showed that protein sequence-based information is slightly more informative in the annotation of Clinvar missense variants than those obtained at the DNA level. Furthermore, to achieve the performance of state-of-the-art methods, such as CADD and REVEL, the conservation of reference and variant, encoded as frequencies of reference/alternate alleles or wild-type/mutant residues, should be included. Our results on a large set of missense variants show that a basic method based on three input features derived from the protein sequence profile performs similarly to the CADD algorithm which uses hundreds of genomic features. As expected, our method results in ~ 3% lower area under the receiver-operating characteristic curve (AUC). When compared with an ensemble-based algorithm (REVEL). Nevertheless, the combination of predictions of multiple methods can help to identify more reliable predictions. These observations indicate that for missense variants, evolutionary information, when properly encoded, plays the primary role in ranking pathogenicity.
2022
1
10
Capriotti E.; Fariselli P.
File in questo prodotto:
File Dimensione Formato  
Capriotti_Fariselli_hum_genetics2022.pdf

Open Access dal 01/02/2023

Tipo di file: POSTPRINT (VERSIONE FINALE DELL’AUTORE)
Dimensione 573.28 kB
Formato Adobe PDF
573.28 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2318/1844320
Citazioni
  • ???jsp.display-item.citation.pmc??? 0
  • Scopus 3
  • ???jsp.display-item.citation.isi??? 2
social impact