Master's thesis of Hugues Van Assel
With recent advances in cellular biology and high-throughput sequencing, we now have access to an unprecedented wealth of data describing the whole distribution of gene expression in an entire population. In this setting, dimensionality reduction is a required step not only for visualization but also for making inference more efficient.
In recent years, efficient non-linear methods such as t-SNE (t-distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) have emerged. These techniques are widely used, but they rely on kernel-based heuristics that need to be calibrated, and their lack of strong statistical foundations makes this calibration difficult.
To build a statistical model for this type of method, we will explore the theory of random graph coupling, which can be seen as minimizing a divergence between Markov processes on graphs. The goal is to structure the original data on a graph and recover a low-dimensional latent space with a similar graph structure. In particular, we will study the effects of different priors on the graphs.
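The coupling idea can be illustrated with a toy sketch. It is not the model developed in this thesis: it assumes a fixed-bandwidth Gaussian kernel in the spirit of SNE, places no prior on the graphs, and compares the resulting random walks with a row-wise Kullback-Leibler divergence. The names `transition_matrix` and `kl_between_chains`, the `bandwidth` parameter and the toy data are all hypothetical.

```python
import math


def transition_matrix(points, bandwidth=1.0):
    """Row-normalized Gaussian kernel: a random walk on the data graph.

    Assumption: a single fixed bandwidth, chosen for illustration only.
    Requires at least two points, and a bandwidth large enough that the
    kernel weights of a row do not all underflow to zero.
    """
    n = len(points)
    rows = []
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                weights.append(0.0)  # no self-loops
            else:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights.append(math.exp(-sq_dist / (2.0 * bandwidth ** 2)))
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def kl_between_chains(P, Q, eps=1e-12):
    """Sum over rows i of KL(P[i] || Q[i]), with Q floored at eps."""
    return sum(
        p * math.log(p / max(q, eps))
        for row_p, row_q in zip(P, Q)
        for p, q in zip(row_p, row_q)
        if p > 0.0
    )


# Toy data: two well-separated pairs of points in 3 dimensions.
high_dim = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
# A 1-D embedding that keeps each pair together, and one that splits them.
good_embedding = [(0.0,), (0.1,), (3.0,), (3.1,)]
bad_embedding = [(0.0,), (3.0,), (0.1,), (3.1,)]

P = transition_matrix(high_dim)
loss_good = kl_between_chains(P, transition_matrix(good_embedding))
loss_bad = kl_between_chains(P, transition_matrix(bad_embedding))
print(loss_good < loss_bad)
```

In this sketch the embedding that preserves the neighborhood structure of the input graph yields the smaller divergence. A full method would optimize the latent positions to minimize that divergence rather than compare two fixed candidates.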
After defining the model, the objective will be to build an implementation with reasonable time complexity. The experiments will be conducted on high-throughput single-cell sequencing data.