Identification of Interpretable Clusters and Associated Signatures in Breast Cancer Single-Cell Data: A Topic Modeling Approach

Malagoli, Gabriele; Valle, Filippo; Barillot, Emmanuel; Caselle, Michele; Martignetti, Loredana

doi:10.3390/cancers16071350

Simple Summary: Topic modeling, widely used in natural language processing, categorizes text documents into themes based on word frequency analysis. It has found success in various biological data analyses, including the accurate prediction of cancer subtypes and the simultaneous identification of genes, enhancers, and cell types from sparse single-cell data. Our study introduces a novel topic modeling approach for clustering single cells and detecting gene signatures in multi-omics single-cell datasets. Applied to study transcriptional heterogeneity in breast cancer cells resistant to chemotherapy and targeted therapy, it identifies protein-coding genes and long non-coding RNAs grouping cells into biologically similar clusters, effectively distinguishing between drug-sensitive and -resistant cancer types. Previous studies have interrogated long non-coding RNA (lncRNA) expression in single-cell data within breast cancer subtypes. Yet, the combined analysis of both lncRNA and mRNA expression in a cell type-specific manner remains to be explored. Compared to standard clustering methods, our approach offers a simultaneous optimal partitioning of genes and cells into topics and clusters, yielding easily interpretable results. Integrating mRNA and lncRNA data enhances cell classification accuracy.

Abstract: Topic modeling is a popular technique in machine learning and natural language processing, where a corpus of text documents is classified into themes or topics using word frequency analysis. This approach has proven successful in various biological data analysis applications, such as predicting cancer subtypes with high accuracy and identifying genes, enhancers, and stable cell types simultaneously from sparse single-cell epigenomics data. The advantage of using a topic model is that it not only serves as a clustering algorithm, but it can also explain clustering results by providing word probability distributions over topics.
Our study proposes a novel topic modeling approach for clustering single cells and detecting topics (gene signatures) in single-cell datasets that measure multiple omics simultaneously. We applied this approach to examine the transcriptional heterogeneity of luminal and triple-negative breast cancer cells using patient-derived xenograft models with acquired resistance to chemotherapy and targeted therapy. Through this approach, we identified protein-coding genes and long non-coding RNAs (lncRNAs) that group thousands of cells into biologically similar clusters, accurately distinguishing drug-sensitive and -resistant breast cancer types. In comparison to standard state-of-the-art clustering analyses, our approach offers an optimal partitioning of genes into topics and cells into clusters simultaneously, producing easily interpretable clustering outcomes. Additionally, we demonstrate that an integrative clustering approach, which combines the information from mRNAs and lncRNAs treated as disjoint omics layers, enhances the accuracy of cell classification.
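
The interpretability property mentioned in the abstract (a topic model clusters items and also explains the clusters through word probability distributions over topics) can be illustrated with a toy sketch. This is not the authors' method or code: the probabilities and the gene and cell names below are hypothetical placeholders, and assigning each cell to its single most probable topic is a simplifying assumption made only for illustration.

```python
# Toy illustration only: all probabilities and names are hypothetical,
# not values or identifiers from the study.

# P(gene | topic): each topic is a probability distribution over genes.
topic_gene = {
    "topic_0": {"mRNA_a": 0.6, "mRNA_b": 0.3, "lncRNA_x": 0.1},
    "topic_1": {"mRNA_a": 0.1, "mRNA_b": 0.2, "lncRNA_x": 0.7},
}

# P(topic | cell): each cell is described as a mixture of topics.
cell_topic = {
    "cell_1": {"topic_0": 0.9, "topic_1": 0.1},
    "cell_2": {"topic_0": 0.2, "topic_1": 0.8},
    "cell_3": {"topic_0": 0.7, "topic_1": 0.3},
}


def assign_clusters(cell_topic):
    """Assign each cell to its most probable topic (assumed hard clustering)."""
    return {cell: max(mix, key=mix.get) for cell, mix in cell_topic.items()}


def top_genes(topic_gene, n=2):
    """Return the n most probable genes of each topic, read as its signature."""
    return {
        topic: sorted(dist, key=dist.get, reverse=True)[:n]
        for topic, dist in topic_gene.items()
    }


clusters = assign_clusters(cell_topic)
signatures = top_genes(topic_gene)

for cell, topic in clusters.items():
    print(cell, "->", topic, "explained by", signatures[topic])
```

In this sketch, the cell-to-topic mixtures play the role of the clustering, while the gene distributions of each topic supply the human-readable explanation of why cells were grouped together.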

CINECA IRIS Institutional Research Information System