15–17 Sept 2025
Centro Polifunzionale Studenti Università di Bari
Europe/Rome timezone
CSS/ITALY 2025

Topic modelling methods for the analysis of multi-omics data

16 Sept 2025, 10:00
30m
Centro Polifunzionale Studenti Università di Bari

Speaker

M. Caselle

Description

Topic models are a family of algorithms originally developed to extract latent variables from text corpora. The most popular of these is Latent Dirichlet Allocation (LDA), which in recent years has been applied successfully not only to text analysis but also in bioinformatics. Algorithms that try to identify the "topic" of a given document from its word usage face the same type of challenges we usually face when studying gene expression data. In this analogy the cancer samples play the role of the documents, the genes are the words, the number of times a particular word is used in a given document is the analogue of the expression level of a particular gene in a given sample, and the topics are the gene sets (the "signatures") we use to cluster samples into subtypes. The goal of topic modeling is to identify the "topic" of a given document from the word usage within that document; in exactly the same way, our goal is to identify the cancer subtype from the gene expression pattern. The major advantage of topic modeling methods over standard clustering approaches is that they allow a "fuzzy" type of clustering. The output of a typical topic modeling algorithm is a probability distribution of membership, i.e. the probability that a given document is composed of a given topic and, at the same time, the probability that a word characterizes a given topic. In our context this means that the output of our analysis is a set of values that quantify the probability that a given sample belongs to a particular cancer subtype and the relevance of a given gene in driving this identification.
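
The "fuzzy" membership described above can be illustrated with a minimal Python sketch. It is not LDA itself: full LDA also learns the topic-gene distributions and places Dirichlet priors on both sets of probabilities. Here the topic-gene table is assumed to be already known, and only the probability of each sample belonging to each topic is estimated, using a plain expectation-maximisation loop. The gene names, the probability table and the expression counts are all hypothetical toy values, not data from the talk.

```python
# Hypothetical toy data: none of these numbers come from the talk.
GENES = ["g1", "g2", "g3", "g4"]

# P(gene | topic): each row is one topic (a gene "signature") and sums to 1.
# Assumed known here; a real topic model would learn it from the data.
TOPIC_GENE = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0 is dominated by g1 and g2
    [0.05, 0.05, 0.45, 0.45],  # topic 1 is dominated by g3 and g4
]


def topic_membership(counts, topic_gene, n_iter=200):
    """Estimate P(topic | sample) by EM with the topic-gene table held fixed."""
    n_topics = len(topic_gene)
    total = sum(counts)
    theta = [1.0 / n_topics] * n_topics  # start from uniform membership
    for _ in range(n_iter):
        new_theta = [0.0] * n_topics
        for g, n in enumerate(counts):
            # E-step: share of gene g's counts attributed to each topic.
            weights = [theta[k] * topic_gene[k][g] for k in range(n_topics)]
            norm = sum(weights)
            for k in range(n_topics):
                new_theta[k] += n * weights[k] / norm
        # M-step: renormalise expected counts into a probability distribution.
        theta = [v / total for v in new_theta]
    return theta


# Expression counts per gene, in the order given by GENES.
samples = {
    "sample_A": [40, 38, 3, 4],    # mostly topic-0 genes
    "sample_B": [2, 5, 41, 44],    # mostly topic-1 genes
    "sample_C": [20, 22, 19, 21],  # mixed profile
}
memberships = {name: topic_membership(c, TOPIC_GENE) for name, c in samples.items()}
for name, theta in memberships.items():
    print(name, [round(p, 3) for p in theta])
```

Each sample receives a probability for every topic rather than a single hard label, so a sample with a mixed expression profile, such as the hypothetical sample_C, ends up with comparable weight on both topics.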

In this talk, after a general introduction to topic modeling and LDA, I will discuss a new set of algorithms, based on a hierarchical version of Stochastic Block Modeling (hSBM), which have recently been proposed to overcome some of the problems of LDA, and show a few applications to cancer gene expression and multi-omics data.
