Advances in experimental methods in biology have allowed researchers to gain an unprecedentedly high-resolution view of the molecular processes within cells, using so-called single-cell technologies. Every cell in the sample can be individually profiled — the amount of each type of protein or metabolite or other molecule of interest can be counted. Understanding the molecular basis that determines the differentiation of cell fates is thus the holy grail promised by these data.
However, the high-dimensional nature of the data, replete with correlations between features, noise, and heterogeneity means the computational work required to draw insights is significant. In particular, understanding the differences between cells requires a quantitative measure of similarity between the single-cell feature vectors of those cells. A vast array of existing methods, from those that cluster a given dataset to those that attempt to integrate multiple datasets or learn causal effects of perturbation, are built on this foundational notion of similarity.
In this dissertation, we delve into the question of similarity metrics for high-dimensional biological data generally, and single-cell RNA-seq data specifically. We work from a global perspective — where we find a distance function that applies across the entire dataset — to a local perspective — where each cell can learn its own similarity function. In particular, we first present Schema, a method for combining similarity information encoded by several types of data, which has proven useful in analyzing the burgeoning number of datasets which contain multiple modalities of information. We also present DensVis, a package of algorithms for visualizing single-cell data, which improve upon existing dimensionality-reduction methods that focus on local structure by accounting for density in high-dimensional space. Lastly, we zoom in on each datapoint, and show a new method for learning 𝑘-nearest neighbors graphs based on local decompositions.
Altogether, the works demonstrate the importance — through extensive validation on existing datasets — of understanding high-dimensional similarity.
Ph.D.