Topic Modeling Methods for the Analysis of Gene Expression Data

Valle, Filippo
2023-01-01

Abstract

Topic modeling is a widely used approach to extract relevant information from large datasets. Recently, the problem of finding a latent structure in a dataset was mapped to the community detection problem in network theory, and a new class of topic modeling strategies has been introduced to overcome some of the limitations of classical methods. We tested this approach on lung and breast cancer samples from the TCGA and METABRIC databases, using messenger RNA, microRNA and copy number variation data. The established cancer subtype organization is well reconstructed in the inferred latent topic structure. Moreover, the “topics” that the algorithm extracts correspond to genes involved in cancer development and are enriched in genes known to play a role in the corresponding disease; they are also strongly related to the survival probability of patients. In biology, integrating transcriptional data with other layers of information, such as the post-transcriptional regulation mediated by microRNAs, can be crucial in identifying the driver genes and the subtypes of complex and heterogeneous diseases such as cancer. More specifically, we show how an algorithm based on a hierarchical version of stochastic block modeling can be adapted to integrate any combination of data. We also show that including the microRNA layer significantly improves the accuracy of subtype classification. Finally, we show how, by operating in the low-dimensional topic space, one can predict the cancer subtype of a new, unseen expression sample.
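To make the mapping to community detection concrete: in the usual formulation, samples play the role of documents and genes the role of words, and the expression matrix is read as a weighted bipartite network on which a block model is then inferred. The following is a minimal sketch of that first step only, assuming a toy integer count matrix. The function names `bipartite_edges` and `node_strengths` and the `min_count` threshold are hypothetical illustrations, not the thesis implementation, and the stochastic block model inference itself is not shown.

```python
from collections import defaultdict


def bipartite_edges(counts, samples, genes, min_count=1):
    """Turn a samples-by-genes count matrix into weighted bipartite edges.

    Each entry >= min_count becomes an edge (sample, gene, weight).
    min_count is a hypothetical filter chosen for this illustration.
    """
    edges = []
    for i, sample in enumerate(samples):
        for j, gene in enumerate(genes):
            weight = counts[i][j]
            if weight >= min_count:
                edges.append((sample, gene, weight))
    return edges


def node_strengths(edges):
    """Sum edge weights per node, keeping the two node types separate."""
    strength = defaultdict(int)
    for sample, gene, weight in edges:
        strength[("sample", sample)] += weight
        strength[("gene", gene)] += weight
    return dict(strength)


# Toy data (made up for illustration, not taken from TCGA or METABRIC).
samples = ["s1", "s2", "s3"]
genes = ["geneA", "geneB", "geneC"]
counts = [
    [5, 0, 2],
    [0, 3, 1],
    [4, 0, 0],
]

edges = bipartite_edges(counts, samples, genes)
strengths = node_strengths(edges)
```

Edges only ever connect a sample to a gene, never two nodes of the same type. A community detection method run on such a network would group genes into blocks that play the role of topics and samples into clusters. Additional layers such as microRNAs could in principle be added as further node types linked to the same samples.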
Files in this record:

File: Valle_PhD-Thesis_.pdf
Access: Open access
Description: PhD Thesis Title: Topic Modeling Methods for the Analysis of Gene Expression Data / Author: Filippo Valle
File type: Publisher's PDF
Size: 8.23 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1906052