Cell Type Annotation in Single-cell Analysis: The Basics

Jan 6

Cell type annotation is often seen as one of the most complex steps in scRNA-seq data analysis. If you are new to this area, you may ask, “Where should I start if I want to annotate my clusters?” or “What methods should I use?” In this post, we share some basic know-how and practical tips that you can rely on when performing cell type annotation in single-cell analysis.

First and foremost: know your experiment and tissue

The “background” of the cells is important. This step may seem obvious, but it is crucial. Knowing which tissue or biological system your data comes from narrows down the decisions you need to make during annotation: your choice of method, reference data, and validation strategy all depend heavily on this context.

In single-cell RNA-seq, sequencing depth per cell can vary widely for both technical and biological reasons.

Why careful QC matters

Inadequate QC can leave low-quality cells in the dataset, and these cells tend to correlate poorly with reference data. Clusters may also end up labeled “unknown” if technical noise is too strong. In addition, cells with high mitochondrial content or stress signatures can cause problems, because these signals may dominate the expression profile and “hide” real marker genes.
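To make the QC idea concrete, here is a minimal Python sketch of a per-cell filter on the number of detected genes and the mitochondrial fraction. The function name `passes_qc`, the thresholds, and the toy counts are our own illustrative assumptions, not recommended values; suitable cutoffs depend on your tissue and protocol.

```python
def passes_qc(cell, min_genes=200, max_mito_fraction=0.2):
    """Return True if a cell has enough detected genes and a low mitochondrial fraction.

    `cell` maps gene symbol -> raw count. Both thresholds are illustrative
    assumptions and should be tuned per tissue and protocol.
    """
    total = sum(cell.values())
    if total == 0:
        return False  # an empty cell cannot be assessed, so drop it
    detected = sum(1 for count in cell.values() if count > 0)
    # Human mitochondrial genes conventionally carry the "MT-" prefix.
    mito = sum(count for gene, count in cell.items() if gene.upper().startswith("MT-"))
    return detected >= min_genes and mito / total <= max_mito_fraction


# Toy data (hypothetical counts); real cells have thousands of genes.
cells = {
    "healthy": {"CD3E": 5, "CD8A": 3, "ACTB": 10, "MT-CO1": 2},
    "stressed": {"CD3E": 1, "ACTB": 2, "MT-CO1": 12, "MT-ND1": 9},
}
# min_genes is lowered only because the toy cells contain four genes each.
kept = [name for name, cell in cells.items() if passes_qc(cell, min_genes=3)]
print(kept)
```

In this toy example the "stressed" cell is removed because most of its counts come from mitochondrial genes, which is exactly the kind of signal that can mask real markers.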

Automated cell type prediction: a good starting point

To speed up cell type annotation, you can use an automated predictor to get a general idea of which cell types are present in your clusters. Like manual annotation, automated prediction relies on gene expression profiles or on similarity to reference data. These initial predictions can always be refined later, either manually or with additional methods.

Some available automated approaches for cell type annotation in single-cell data analysis

(i) Marker gene-based

This approach annotates cells based on expert biological knowledge of marker genes. You start with a curated list of cell types and their marker genes (the knowledge base). You then identify the marker genes of each cluster in your dataset and match them against the knowledge base to determine cell identities. This method does not require a direct comparison of expression profiles between your cells and a reference dataset, but it depends heavily on the quality and specificity of the marker gene list.

Common methods include: scCATCH, SCSA, SCINA, CellAssign [1]

The marker gene-based approach involves annotation based on expert biological knowledge of marker genes. (Figure adapted from [1])
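The matching step described above can be sketched in a few lines of Python. This is a simplified illustration that scores overlap with the Jaccard index; it is not how scCATCH, SCSA, SCINA, or CellAssign score matches internally. The function name `annotate_by_markers`, the tiny knowledge base, and the cluster marker lists are hypothetical examples.

```python
def annotate_by_markers(cluster_markers, knowledge_base, min_score=0.0):
    """Assign each cluster the cell type whose marker set overlaps it most.

    Overlap is measured with the Jaccard index (intersection / union).
    Clusters whose best score does not exceed `min_score` stay "unknown".
    """
    labels = {}
    for cluster, markers in cluster_markers.items():
        best_type, best_score = "unknown", min_score
        for cell_type, known in knowledge_base.items():
            union = markers | known
            score = len(markers & known) / len(union) if union else 0.0
            if score > best_score:
                best_type, best_score = cell_type, score
        labels[cluster] = (best_type, round(best_score, 2))
    return labels


# A toy knowledge base of well-known markers (illustrative, far from complete).
knowledge_base = {
    "T cell": {"CD3D", "CD3E", "CD2"},
    "B cell": {"CD79A", "MS4A1", "CD19"},
    "Monocyte": {"CD14", "LYZ", "FCGR3A"},
}
# Hypothetical top markers per cluster, e.g. from a differential expression test.
cluster_markers = {
    "cluster_0": {"CD3D", "CD3E", "IL7R"},
    "cluster_1": {"MS4A1", "CD79A", "CD74"},
    "cluster_2": {"GAPDH", "ACTB"},
}
labels = annotate_by_markers(cluster_markers, knowledge_base)
print(labels)
```

Note how `cluster_2`, whose markers are housekeeping genes, stays "unknown": the result can only be as specific as the marker list you supply.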

(ii) Correlation-based

This approach compares the expression profiles of cells in your dataset with those in an existing annotated reference. As a result, the method is only as good as the reference used: the more accurate the reference, the better the annotation. SingleR and scmap fall into this category. SingleR also includes a fine-tuning step that reduces ambiguity between closely related cell types by focusing on the most relevant marker genes.

Performing cell type annotation in single-cell analysis using this approach involves comparing cells in your dataset with an existing annotated reference database. (Figure adapted from [1])
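As a rough illustration of the correlation idea, the sketch below assigns a query profile the label of the reference profile it correlates with best. We use Spearman rank correlation as one reasonable similarity measure. This is a simplified sketch, not the implementation of SingleR or scmap, and it omits steps such as gene selection and fine-tuning. All names and expression values are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def best_reference_label(query, reference):
    """Return the best-correlated reference label and all correlation scores."""
    scores = {label: spearman(query, profile) for label, profile in reference.items()}
    return max(scores, key=scores.get), scores


# Toy profiles over the same gene order: CD3E, CD79A, CD14, LYZ, MS4A1.
reference = {
    "T cell": [9, 0, 1, 2, 0],
    "B cell": [0, 8, 0, 1, 9],
    "Monocyte": [0, 1, 9, 10, 0],
}
query = [7, 1, 0, 1, 0]
label, scores = best_reference_label(query, reference)
print(label)
```

If the reference lacked a T cell profile, the query would still receive the "least bad" label, which is why reference quality and coverage matter so much for this family of methods.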

Label transfer methods using PCA or CCA align query and reference datasets in a shared embedding space and transfer cell labels based on similarity between cells in this space, making them another form of correlation-based annotation. You can learn more about the underlying mathematics and see a hands-on demonstration in one of our previously recorded webinars with Dr. Ming Tang (Director of Bioinformatics, AstraZeneca) here.
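To illustrate the transfer step only, the sketch below assumes that the query and reference cells have already been projected into a shared embedding (for example by PCA or CCA) and labels each query cell by majority vote among its nearest reference cells. Plain k-nearest-neighbor voting is a deliberately simple stand-in for the weighting schemes that real tools use. The function name `transfer_labels` and the 2D coordinates are hypothetical.

```python
import math
from collections import Counter


def transfer_labels(query_embedding, reference_embedding, reference_labels, k=3):
    """Label each query cell by majority vote among its k nearest reference cells.

    Returns a list of (label, vote_fraction) tuples, one per query cell.
    Distances are Euclidean distances in the shared embedding space.
    """
    predictions = []
    for q in query_embedding:
        distances = sorted(
            (math.dist(q, r), label)
            for r, label in zip(reference_embedding, reference_labels)
        )
        votes = Counter(label for _, label in distances[:k])
        label, count = votes.most_common(1)[0]
        predictions.append((label, count / k))
    return predictions


# Hypothetical 2D coordinates, assumed to come from a shared PCA/CCA embedding.
reference_embedding = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
reference_labels = ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"]
query_embedding = [(0.1, 0.1), (5.0, 5.1)]

predictions = transfer_labels(query_embedding, reference_embedding, reference_labels)
print(predictions)
```

The vote fraction gives a crude confidence score: a query cell whose neighbors disagree is a good candidate for manual review.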


(iii) Supervised classification-based

This approach uses a collection of labeled reference data to train a machine learning classifier that can predict cell types in new datasets. Unlike correlation-based methods, which rely on similarity with a particular reference dataset, supervised classifiers learn patterns from multiple reference datasets to assign cell labels automatically. The performance of this method depends on both the quality of the training data and the features selected for classification. Popular methods include SingleCellNet (uses a random forest) [2], scPred (uses feature selection and an SVM) [3], and Garnett (uses elastic net regression) [4].

Unlike correlation-based methods, which rely on similarity with a particular reference dataset, supervised classifiers learn patterns from multiple reference datasets to auto-assign cell labels. (Figure adapted from [1])
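The train-then-predict pattern can be sketched as follows. To keep the example short and dependency-free, we swap in a nearest-centroid classifier rather than the random forest, SVM, or elastic net models used by the tools named above, so treat it as an illustration of the workflow only. The class name, the two toy reference datasets, and the expression values are hypothetical.

```python
import math
from collections import defaultdict


class NearestCentroidClassifier:
    """Toy classifier: one mean expression profile (centroid) per cell type."""

    def fit(self, profiles, labels):
        grouped = defaultdict(list)
        for profile, label in zip(profiles, labels):
            grouped[label].append(profile)
        # Average each gene (column) across all training cells of a label.
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict(self, profiles):
        # Assign each cell to the label with the closest centroid.
        return [
            min(self.centroids_, key=lambda label: math.dist(p, self.centroids_[label]))
            for p in profiles
        ]


# Two hypothetical labeled references over the same genes: CD3E, CD79A, CD14.
reference_a = ([[9, 1, 0], [8, 0, 1]], ["T cell", "T cell"])
reference_b = ([[0, 9, 1], [1, 8, 0], [7, 1, 1]], ["B cell", "B cell", "T cell"])

# Pool the references so the model learns from both, then predict new cells.
train_profiles = reference_a[0] + reference_b[0]
train_labels = reference_a[1] + reference_b[1]
model = NearestCentroidClassifier().fit(train_profiles, train_labels)
predicted = model.predict([[6, 0, 0], [0, 7, 2]])
print(predicted)
```

Once trained, the model can be reused on new datasets without keeping the reference data around, which is one practical difference from correlation-based methods.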

(iv) Transformer-based methods

These approaches build on transformer neural networks, the architecture behind large language models. They are particularly useful for spatial data or large multi-study atlases, where cell types may mix within neighborhoods or rare cell types are difficult to identify. Although some transformer-based models are computationally expensive, recent approaches such as GenePT [5] build on existing large language models and use more efficient workflows to reduce resource requirements, making them worth exploring for large-scale or multi-tissue datasets. One of our partners, Miraomics, also provides a transformer-based solution called MiraTyper, and our curated database, Pythiomics, is expected to be powered by this tool in the coming year.

Tips if your method needs a reference

Reference-based methods are generally recommended for common cell types and for tissues with abundant public data. Tissue-matched or disease-matched datasets usually work well as references. It is also recommended to integrate multiple reference datasets into a combined reference to improve coverage.

Manual validation is key

It’s always important to validate the prediction results. At this stage, you can use known marker genes to examine the annotations and review the marker gene lists for each cluster to ensure they align with the predicted cell types.
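One simple way to support this review is to check, for each cluster, how many known markers of the predicted cell type actually appear among the cluster's marker genes, and to flag the clusters that fall short. The sketch below is a hypothetical helper for that check. The function name, the 0.5 cutoff, and the example data are our own assumptions, and a flag is a prompt for a closer look, not a verdict.

```python
def flag_unsupported_annotations(predictions, cluster_markers, known_markers, min_fraction=0.5):
    """Return clusters whose predicted label is weakly supported by known markers.

    For each cluster, compute the fraction of the predicted type's known markers
    that appear in the cluster's marker list. Clusters below `min_fraction` are
    returned together with the markers that were expected but missing.
    """
    flagged = {}
    for cluster, label in predictions.items():
        expected = known_markers.get(label, set())
        found = expected & cluster_markers.get(cluster, set())
        fraction = len(found) / len(expected) if expected else 0.0
        if fraction < min_fraction:
            flagged[cluster] = (label, sorted(expected - found))
    return flagged


# Hypothetical automated predictions and per-cluster marker genes.
predictions = {"cluster_0": "T cell", "cluster_1": "B cell"}
cluster_markers = {
    "cluster_0": {"CD3D", "CD3E", "IL7R"},
    "cluster_1": {"LYZ", "CD14", "S100A8"},
}
known_markers = {
    "T cell": {"CD3D", "CD3E"},
    "B cell": {"CD79A", "MS4A1"},
}
flagged = flag_unsupported_annotations(predictions, cluster_markers, known_markers)
print(flagged)
```

Here `cluster_1` is flagged: it was predicted as "B cell" but expresses monocyte-like genes and none of the expected B cell markers, so it deserves manual inspection.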

These are the basics of cell type annotation in single-cell RNA-seq, and we hope they help newcomers navigate this step more confidently. If you are looking for a platform to practice on, you can explore our C-DIAM Multi-Omics Studio. It offers an intuitive, no-code interface with SingleR and Azimuth integration, so beginners can easily get started with cell type prediction and explore insights from their single-cell data. You can send us a request here to try it out.


References

[1] Pasquini, Giovanni, et al. "Automated methods for cell type annotation on scRNA-seq data." Computational and Structural Biotechnology Journal 19 (2021): 961-969.

[2] Tan, Yuqi, and Patrick Cahan. "SingleCellNet: a computational tool to classify single cell RNA-Seq data across platforms and across species." Cell Systems 9.2 (2019): 207-213.

[3] Alquicira-Hernandez, Jose, et al. "scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data." Genome Biology 20.1 (2019): 264.

[4] Pliner, Hannah A., Jay Shendure, and Cole Trapnell. "Supervised classification enables rapid annotation of cell atlases." Nature Methods 16.10 (2019): 983-986.

[5] Chen, Yiqun, and James Zou. "GenePT: a simple but effective foundation model for genes and cells built from ChatGPT." bioRxiv (2024): 2023-10.