UMAP clustering in Python

[This article was first published on poissonisfish, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Embracing Python in this tutorial series has long been a matter of time. For the last five years I have been championing R mostly because of its wide applicability and quite frankly, my own convenience. However, there is little any programming language can do to singlehandedly solve a variety of statistical and computational challenges and R too – take my word – is not without handicaps. Meanwhile, scientific computing from programming powerhouses in the likes of Python and Julia quickly caught up and there is now great potential in bringing them all together. Therefore, as much as I hold R dear I decided to endorse different programming languages in this and future work. I hope this change will help covering more ground and appeal to a wider readership.

The aim of this short Python tutorial is to introduce the uniform manifold approximation and projection (UMAP) algorithm, using 76,533 single-cell expression profiles from the human primary motor cortex. The data are available from the Cell Types database, which is part of the Allen Brain Map platform.

The UMAP has quickly established itself as a go-to clustering tool well poised to expand our knowledge of various many things, including the human brain. I hope by the end of this tutorial you will have a broad understanding of the UMAP algorithm and how to implement it.

The UMAP algorithm

Uniform manifold approximation and projection (UMAP)1 is a scalable and efficient dimension reduction algorithm that performs competitively among state-of-the-art methods such as t-SNE2, and widely applied for unsupervised clustering.

To effectively approximate a uniformly distributed manifold in the data, this neighbour-graph technique first defines fuzzy simplicial sets using local metric spaces and then patches them together into a single topological structure. Of note, these local metric spaces are determined from the distance of each individual example to its k-th nearest neighbour, where the user-provided k balances the local-global structure representation. Next, a low-dimensional layout – the so-called embedding – is constructed from the fuzzy set cross-entropy that matches the largest edge weights from the topological structure with the shortest pairwise distances in the layout itself, and vice-versa.

In practice, to implement the UMAP users must provide the minimum distance to separate close observations in the embedding, as well as the numbers of training epochs, embedding dimensions and neighbours. Tweaking these hyperparameters may drastically change the resulting embedding, so these should be carefully tuned.

In addition to the main manuscript1 and this excellent walkthrough, you may find the presentation below from the first author himself just as enlightening.

Since its release, the UMAP has been widely applied to large, high-dimensional molecular data and particularly single-cell expression profiling3 4 5, a technology that will be introduced next.

Single-cell expression profiling

In the course of the last two decades, developments in molecular and cellular biology unlocked ever-increasing detail over developmental and metabolic processes in living organisms. Unlike DNA sequencing, RNA sequencing and quantification (RNA-Seq) captures the activity of tens of thousands of genes simultaneously and therefore more closely reflects the interplay among the various different cellular processes.

However, the earliest RNA-Seq workflows presented technical challenges that prevented their application to single cells and consequently access to cell-to-cell variability. Perhaps most critically, these workflows required large amounts of total RNA and lacked the ability to isolate and label single cells. Only recently, the advent of single-cell RNA-Seq (scRNA-Seq) solved these two fundamental problems and helped uncovering cell identity6 7. Among other things, we now have access to a much richer understanding of health and disease, which holds the key to discovering therapeutic targets for a range of diseases.

This widespread technology is largely based on microfluidic chips where RNA molecules of individual cells are tagged with unique barcode sequences. For a end-to-end overview of scRNA-Seq, below is a great explanation from an instrument and chip manufacturer.

For the present UMAP tutorial we will use scRNA-Seq expression profiles from the human primary motor cortex (M1). The dataset and some additional experimental information are available here – no need to download it manually!

From an IPython console such as that from Spyder, we start off with importing a handful of modules. Most can be installed from the conda-forge channel, while in contrast umap and datatable can be installed with the commands !pip install umap-learn and !pip install datatable, respectively. Besides the usual numpy and  pandas dependencies for handling structured data, datatable and seaborn are each imported to more quickly load the scRNA-Seq and easily produce a two-dimensional embedding scatterplot, respectively.

Next, we set up a working directory of our choice and download both the large scRNA-Seq dataset and the associated metadata, the latter to derive the cell subgroup labels and distinguish the embedded expression profiles. To this end I opted for a couple of wget Bash commands, but any equivalent method will also work.

 # Imports #!pip install datatable import os, umap import numpy as np import pandas as pd import datatable as dt import seaborn as sns os.chdir(‘PATH/TO/WDIR’) # Download expression matrix (matrix.csv) and metadata (metadata.csv) !wget https://idk–etl–prod–download–bucket.s3.amazonaws.com/aibs_human_m1_10x/matrix.csv !wget https://idk–etl–prod–download–bucket.s3.amazonaws.com/aibs_human_m1_10x/metadata.csv

Once the downloads are complete, you will certainly notice matrix.csv is approximately 7GB in size. We could try lifting this behemoth using pd.read_csv() with an appropriate choice of chunksize, but we will instead use the nimble dt.fread(). A file preview using the Bash command cut -d, -f-5 matrix.csv | head suggests that sample_name aside, all columns are encoded as integer type. Therefore, to facilitate the data ingestion I passed a Bash command to exclude sample_name prior to importing and apply a dt.int32 encoding on the filtered data. Please make sure to leverage parallelisation if possible, by setting and appropriate value of nthreads when executing dt.fread().

 # Check first five columns in matrix.csv #!cut -d, -f-5 matrix.csv | head # Import data with Bash command discarding first column matrix = dt.fread(cmd=‘cut -d, -f2- matrix.csv’, header=True, sep=‘,’, columns=dt.int32) # ~7 GB (76533, 50281) # Import metadata metadata = pd.read_csv(‘metadata.csv’)

The resulting object matrix comprises a total of 76,533 expression profiles across 50,281 genes or expression features. If your RAM allows, the to_numpy() and to_pandas() methods will directly convert the datatable to the familiar NumPy or Pandas formats, respectively. To learn more about how to manipulate datatable objects check out the official documentation.

Next, for the sake of the quality and runtime of the UMAP training, we use np.apply_along_axis() to identify and discard expression features that equal or exceed 50% of null expression levels. I leave any further data cleansing up to you.

 # Remove expression features with > 50% zero-valued expression levels is_expressed = np.apply_along_axis(lambda x: np.mean(x == 0) .5, arr=matrix, axis=0) matrix = matrix[:,is_expressed.tolist()] # Log2-transform matrix = np.log2(matrix.to_numpy() + 1)

Now that 4,078 expression features were selected and log-transformed, we can proceed with fitting the UMAP and examining the resulting two-dimensional embedding. For this purpose I employed a min_dist of 0.25, n_neighbors of 30 and the default Euclidean metric. One advantage of the Euclidean metric is that it implicitly factors in the absolute differences between every pair of expression profiles, under the expectation that similar cells match their profiles to the magnitude of expression. Finally, a fixed random_state will ensure we arrive at the same embedding.

Before moving on with the UMAP embedding visualisation, it would help to extract some experimental information from the metadata table in order to make sense of the embedding. One of the many interesting labelling options is cell subclass, which will be used below.

 # Define UMAP brain_umap = umap.UMAP(random_state=999, n_neighbors=30, min_dist=.25) # Fit UMAP and extract latent vars 1-2 embedding = pd.DataFrame(brain_umap.fit_transform(matrix), columns = [‘UMAP1’,‘UMAP2’]) # Produce sns.scatterplot and pass metadata.subclasses as color sns_plot = sns.scatterplot(x=‘UMAP1’, y=‘UMAP2’, data=embedding, hue=metadata.subclass_label.to_list(), alpha=.1, linewidth=0, s=1) # Adjust legend sns_plot.legend(loc=‘center left’, bbox_to_anchor=(1, .5)) # Save PNG sns_plot.figure.savefig(‘umap_scatter.png’, bbox_inches=‘tight’, dpi=500)

A grand total of 76,533 human brain cells are represented in the figure above, how cool is that? Under just a few minutes, the UMAP captured a topological representation of the single-cell expression dataset that – given no label information – neatly sorted the different cell subtypes from the human primary motor cortex. Any expert out there daring to comment whether the proximity of the clusters makes anatomical sense?

With that I hope you have a better understanding of the UMAP and how to apply it using Python. See you next time!