Identification of a minimum number of genes to predict triple-negative breast cancer subgroups from gene expression profiles

Akhouayri, Laila; Ostano, Paola; Mello-Grand, Maurizia; Gregnanin, Ilaria; Crivelli, Francesca; Laurora, Sara; Liscia, Daniele; Leone, Francesco; Santoro, Angela; Mulè, Antonino; Guarino, Donatella; Maggiore, Claudia; Carlino, Angela; Magno, Stefano; Scatolini, Maria; Di Leone, Alba; Masetti, Riccardo; Chiorino, Giovanna

doi:10.1186/s40246-022-00436-6

Background: Triple-negative breast cancer (TNBC) is a highly heterogeneous disease. Several gene expression and mutation profiling approaches have been used to classify it, and all converged on the identification of distinct molecular subtypes, with some overlap across approaches. However, a standardised tool to routinely classify TNBC in the clinic and guide personalised treatment is lacking. We aimed to define a specific gene signature for each of the six TNBC subtypes proposed by Lehmann et al. in 2011 (basal-like 1 (BL1); basal-like 2 (BL2); mesenchymal (M); immunomodulatory (IM); mesenchymal stem-like (MSL); and luminal androgen receptor (LAR)), so that the subtypes can be predicted accurately.

Methods: Lehmann's TNBCtype subtyping tool was applied to RNA-sequencing data from 482 TNBC samples (GSE164458), and a minimal subtype-specific gene signature was defined by combining two class comparison techniques with seven attribute selection methods. Several machine learning algorithms were used for subtype prediction, and the best classifier was applied to microarray data from 72 Italian TNBC samples and to the TNBC subset of the BRCA-TCGA data set.

Results: We identified two signatures, comprising the 120 and 81 top up- and downregulated genes, that define the six TNBC subtypes, with prediction accuracy ranging from 88.6 to 89.4% and improving further after removal of the least important genes. Network analysis was used to identify highly interconnected genes within each subgroup. Two druggable matrix metalloproteinases were found in the BL1 and BL2 subsets, and several druggable targets were complementary to androgen receptor or aromatase in the LAR subset. Several secondary drug-target interactions were found among the upregulated genes in the M, IM and MSL subsets.

Conclusions: Our study took full advantage of available TNBC data sets to stratify samples and genes into distinct subtypes, according to gene expression profiles.
The development of a data mining approach to acquire a large amount of information from several data sets has allowed us to identify a well-determined minimal number of genes that may help in the recognition of TNBC subtypes. These genes, most of which have been previously found to be associated with breast cancer, have the potential to become novel diagnostic markers and/or therapeutic targets for specific TNBC subsets.
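
The general shape of the workflow described in the Methods (select a small set of subtype-specific up- and downregulated genes, then train a classifier on them) can be illustrated with a minimal sketch. This is not the authors' pipeline: the study used TNBCtype labels, two class comparison techniques, seven attribute selection methods and several machine learning algorithms, whereas the sketch below substitutes a simple one-vs-rest mean difference for gene ranking and a nearest-centroid rule for prediction. The data are simulated, and every function name, gene name (`g0`, `g1`, ...) and parameter is hypothetical.

```python
import random
from statistics import mean

SUBTYPES = ["BL1", "BL2", "M", "IM", "MSL", "LAR"]

def simulate(rng, n_per_subtype=20, n_genes=30, shift=3.0):
    """Toy log-expression data (assumption, not real TNBC data):
    gene i is raised and gene 6+i is lowered in the i-th subtype."""
    labels, expr = [], {f"g{j}": [] for j in range(n_genes)}
    for i, subtype in enumerate(SUBTYPES):
        for _ in range(n_per_subtype):
            labels.append(subtype)
            for j in range(n_genes):
                value = rng.gauss(0.0, 0.5)
                if j == i:
                    value += shift
                elif j == len(SUBTYPES) + i:
                    value -= shift
                expr[f"g{j}"].append(value)
    return expr, labels

def top_genes(expr, labels, subtype, k):
    """Return the k most up- and k most downregulated genes for one subtype,
    ranked by mean(in subtype) - mean(all other samples)."""
    inside = [i for i, s in enumerate(labels) if s == subtype]
    outside = [i for i, s in enumerate(labels) if s != subtype]
    diff = {g: mean(v[i] for i in inside) - mean(v[i] for i in outside)
            for g, v in expr.items()}
    ranked = sorted(diff, key=diff.get)  # most down first, most up last
    return ranked[-k:][::-1], ranked[:k]

def build_signature(expr, labels, k=1):
    """Union of the per-subtype up/down gene lists, without duplicates."""
    genes = []
    for subtype in SUBTYPES:
        up, down = top_genes(expr, labels, subtype, k)
        for gene in up + down:
            if gene not in genes:
                genes.append(gene)
    return genes

def centroids(expr, labels, genes):
    """Mean expression of each signature gene within each subtype."""
    result = {}
    for subtype in SUBTYPES:
        idx = [i for i, s in enumerate(labels) if s == subtype]
        result[subtype] = [mean(expr[g][i] for i in idx) for g in genes]
    return result

def predict(sample, cents):
    """Assign the subtype whose centroid is closest (squared Euclidean)."""
    return min(cents, key=lambda s: sum((a - b) ** 2
                                        for a, b in zip(sample, cents[s])))

# Select genes and fit centroids on one simulated cohort, evaluate on another.
train_expr, train_labels = simulate(random.Random(0))
test_expr, test_labels = simulate(random.Random(1))
signature = build_signature(train_expr, train_labels)
cents = centroids(train_expr, train_labels, signature)
hits = sum(predict([test_expr[g][i] for g in signature], cents) == label
           for i, label in enumerate(test_labels))
accuracy = hits / len(test_labels)
print(f"{len(signature)} signature genes, held-out accuracy {accuracy:.2f}")
```

Selecting the genes on one cohort and scoring on a separate one mirrors, in a very reduced form, the idea of validating a signature on independent data sets; with real expression data the ranking statistic and classifier would need to be chosen and tuned far more carefully.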