Topic Modeling Methods for the Analysis of Gene Expression Data

Valle, Filippo

Topic modeling is a widely used approach for extracting relevant information from large datasets. Recently, the problem of finding a latent structure in a dataset was mapped to the community detection problem in network theory, and a new class of topic modeling strategies was introduced to overcome some of the limitations of classical methods. We tested this approach on lung and breast cancer samples from the TCGA and METABRIC databases, using messenger RNA, microRNA, and copy number variation data. The established cancer subtype organization is well reconstructed in the inferred latent topic structure. Moreover, the “topics” that the algorithm extracts correspond to genes involved in cancer development and are enriched in genes known to play a role in the corresponding disease; they are also strongly related to patient survival probability. In biology, integrating transcriptional data with other layers of information, such as the post-transcriptional regulation mediated by microRNAs, can be crucial for identifying the driver genes and the subtypes of complex and heterogeneous diseases such as cancer. More specifically, we show how an algorithm based on a hierarchical version of stochastic block modeling can be adapted to integrate any combination of data. We also show that including the microRNA layer significantly improves the accuracy of subtype classification. Finally, we show how, by operating in the low-dimensional topic space, one can predict the cancer subtype of a new, unseen expression sample.