A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data

Valle, Filippo; Osella, Matteo; Caselle, Michele
2020-01-01

Abstract

Topic modeling is a widely used technique for extracting relevant information from large arrays of data. The problem of finding a topic structure in a dataset was recently recognized to be analogous to the community detection problem in network theory. Leveraging this analogy, a new class of topic modeling strategies has been introduced to overcome some of the limitations of classical methods. This paper applies these recent ideas to TCGA transcriptomic data on breast and lung cancer. The established cancer subtype organization is well reconstructed in the inferred latent topic structure. Moreover, we identify specific topics that are enriched in genes known to play a role in the corresponding disease and are strongly related to the survival probability of patients. Finally, we show that a simple neural network classifier operating in the low-dimensional topic space is able to predict the cancer subtype of a test expression sample with high accuracy.
Year: 2020
Volume: 12
Issue: 12
First page: 3799_1
Last page: 3799_27
URL: https://www.mdpi.com/2072-6694/12/12/3799
Keywords: gene expression; network theory; network-based cancer data analysis; stochastic block modeling; topic modeling
Files in this item:
File: cancers-12-03799-v3.pdf
Access: Open access
File type: Publisher's PDF
Size: 928.13 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1767492
Citations
  • PMC 5
  • Scopus 9
  • Web of Science 7