ScRNA-seq analysis is a complex workflow. Why?
Because ground-breaking technological advances have empowered scientists to generate datasets comprising thousands to hundreds of thousands of cells. Such massive datasets promise important insights, yet they are hard to handle because they are high-dimensional and noisy. At the same time, hundreds of tools and methods have been developed to process single-cell data, but there are few comprehensive benchmarks and minimal guidance on best practices.
With that in mind, we established this blog series to help answer your most common questions about different steps in the scRNA-seq analysis workflows and provide essential tips to avoid common pitfalls in quality control, normalization, batch correction, feature selection, dimensionality reduction, visualization, and so on.
In this article, let’s start with quality control, or filtering.
FAQ #1: Why is quality control essential to any scRNA-seq analysis project?
“Quality over quantity”, that’s what people usually say, and this holds true for any omics data analysis project.
Quality control (QC) is even more important in scRNA-seq analysis workflows, especially for droplet-based technologies, because the data themselves are noisy. A dataset may contain dying cells, empty droplets (droplets that captured no cell), and doublets or multiplets (droplets that captured more than one cell). In some cases, ambient RNA from contaminants or from unhealthy cells is enclosed in a droplet together with an intact cell. Together, these artifacts distort UMI counts and the downstream analysis of gene expression.
For example, a lack of proper QC typically ends up:
Distorting the clustering results
Giving a false impression about cell types and states in the data
Distorting the differential expression analysis results and comparison of different groups of cells
Doublets or multiplets, which exhibit absurdly high count values, may distort the gene expression color scales and make it hard to distinguish the gene expression values in other cells.
Cells with absurdly high counts form their own cluster in the clustering result. Its positive markers include CCNB2, CDC6, KIFC1, and BIRC5, which do not represent any discrete cell type. Picture from the CDIAM Platform.
This is why you always see QC at the top of any scRNA-seq analysis pipeline. The goal is to:
Generate metrics that help assess the sample quality and decide whether to proceed to downstream analyses
Ensure all cellular barcodes correspond to viable cells by removing poor-quality cells and noise that may confound analysis and interpretation without removing the biologically relevant cell types
FAQ #2: What are the commonly used QC metrics and their meanings?
Here are some common QC metrics used to set thresholds in scRNA-seq data analysis.
The minimum and maximum number of UMI counts/genes per barcode - to filter empty droplets, doublets and multiplets. Empty droplets often have very few counts, while doublets and multiplets exhibit aberrantly high counts.
The maximum percentage of counts from mitochondrial genes per barcode - to filter dying cells as they usually exhibit high mitochondrial percentages (Islam et al. 2014; Ilicic et al. 2016).
The minimum number of cells where a gene is expressed - to filter genes with low abundances.
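To make these metrics concrete, here is a minimal sketch of how they could be computed from a small cells-by-genes count matrix using plain Python. The function name `qc_metrics`, the toy matrix, and the `MT-` prefix for mitochondrial genes (the usual human gene-symbol convention) are assumptions for illustration, not part of any specific tool.

```python
def qc_metrics(counts, gene_names, mito_prefix="MT-"):
    """Per-barcode and per-gene QC metrics from a cells x genes count matrix.

    counts: list of rows, one row of UMI counts per barcode.
    mito_prefix: assumed prefix of mitochondrial gene symbols ("MT-" for human).
    """
    mito_idx = [j for j, g in enumerate(gene_names) if g.startswith(mito_prefix)]
    per_cell = []
    for row in counts:
        total = sum(row)                               # UMI counts per barcode
        n_genes = sum(1 for c in row if c > 0)         # genes detected per barcode
        mito = sum(row[j] for j in mito_idx)
        pct_mito = 100.0 * mito / total if total > 0 else 0.0
        per_cell.append(
            {"total_counts": total, "n_genes": n_genes, "pct_mito": pct_mito}
        )
    # Number of barcodes in which each gene has at least one count
    cells_per_gene = [
        sum(1 for row in counts if row[j] > 0) for j in range(len(gene_names))
    ]
    return per_cell, cells_per_gene


# Toy data (hypothetical values)
genes = ["MT-CO1", "CD3E", "MS4A1", "LYZ"]
counts = [
    [1, 5, 0, 4],  # looks like a healthy cell
    [8, 1, 0, 1],  # high mitochondrial fraction, possibly dying
    [0, 0, 0, 1],  # very few counts, possibly an empty droplet
]
per_cell, cells_per_gene = qc_metrics(counts, genes)
for metrics in per_cell:
    print(metrics)
print(cells_per_gene)
```

In practice these values usually come from your analysis toolkit; the sketch only shows what each metric measures.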
FAQ #3: What QC thresholds do my peers use?
You can refer to the filtering parameters used in the Seurat and Scanpy tutorials:
Filter out cells that express fewer than 200 genes
Filter out cells that have >5% mitochondrial counts
Filter out genes that are detected in fewer than 3 cells
Because suitable thresholds depend on many factors, such as the single-cell technology (plate-based vs. droplet-based) and the tissue type (solid tumor vs. PBMC), apply these values with caution or refer to FAQ #4.
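As a rough illustration, the three filters above could be applied to a count matrix as follows. This is a sketch in plain Python, not the Seurat or Scanpy implementation: the function name `filter_counts`, the `MT-` prefix, and the order of operations (cells first, then genes among the retained cells) are assumptions. The toy call uses much smaller thresholds so that the effect is visible on a 4 x 4 matrix.

```python
def filter_counts(counts, gene_names, min_genes=200, max_pct_mito=5.0,
                  min_cells=3, mito_prefix="MT-"):
    """Filter a cells x genes count matrix with three common QC thresholds."""
    mito_idx = [j for j, g in enumerate(gene_names) if g.startswith(mito_prefix)]

    # 1) Keep cells with enough detected genes and a low mitochondrial fraction
    keep_cells = []
    for i, row in enumerate(counts):
        total = sum(row)
        n_genes = sum(1 for c in row if c > 0)
        pct_mito = 100.0 * sum(row[j] for j in mito_idx) / total if total else 0.0
        if n_genes >= min_genes and pct_mito <= max_pct_mito:
            keep_cells.append(i)

    # 2) Keep genes detected in at least min_cells of the retained cells
    keep_genes = [
        j for j in range(len(gene_names))
        if sum(1 for i in keep_cells if counts[i][j] > 0) >= min_cells
    ]

    filtered = [[counts[i][j] for j in keep_genes] for i in keep_cells]
    return filtered, keep_cells, [gene_names[j] for j in keep_genes]


# Toy data (hypothetical values) with deliberately small thresholds
genes = ["MT-CO1", "CD3E", "MS4A1", "LYZ"]
counts = [
    [1, 5, 0, 4],  # kept
    [8, 1, 0, 1],  # removed: 80% mitochondrial counts
    [0, 0, 0, 1],  # removed: only one gene detected
    [0, 3, 2, 5],  # kept
]
filtered, kept_cells, kept_genes = filter_counts(
    counts, genes, min_genes=2, max_pct_mito=20.0, min_cells=2
)
print(kept_cells, kept_genes, filtered)
```

Keeping the thresholds as function arguments makes it easy to adjust them per technology or tissue instead of hard-coding one set of values.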
FAQ #4: How to define the best QC thresholds?
Plot the distribution of potential filtering metrics, if possible, and spot the “elbow”.
Plotting is good practice for understanding the distribution of the data and checking for outliers. Any plot that shows a distribution, such as a histogram, density plot, violin plot, or box plot, is a great choice. Look for the “elbow” point, where the distribution drops off sharply, and use it as a QC threshold.
Plotting number of counts per cell and finding the “elbow” point, using CDIAM Multi-Omics Studio
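If you want a numeric starting point in addition to eyeballing the plot, one simple heuristic is to sort barcodes by total counts, draw the curve on log-log axes, and take the point farthest from the straight line joining its two ends. The sketch below assumes that heuristic; it is only one of several ways to locate an elbow, and the function name `elbow_threshold` and the toy counts are hypothetical.

```python
import math


def elbow_threshold(total_counts):
    """Count value at the elbow of the log-log barcode-rank curve.

    Heuristic: the elbow is the point with the largest distance from the
    chord connecting the first and last points of the curve.
    """
    ranked = sorted((c for c in total_counts if c > 0), reverse=True)
    if len(ranked) < 3:
        raise ValueError("need at least three barcodes with non-zero counts")
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(c) for c in ranked]
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    norm = math.hypot(x1 - x0, y1 - y0)

    best_index, best_distance = 0, -1.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        # Perpendicular distance from (x, y) to the chord
        d = abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / norm
        if d > best_distance:
            best_index, best_distance = i, d
    return ranked[best_index]


# Toy data: 50 cell-containing barcodes and 200 near-empty droplets
total_counts = [5000 - 10 * i for i in range(50)] + [60 - i // 10 for i in range(200)]
threshold = elbow_threshold(total_counts)
n_kept = sum(1 for c in total_counts if c >= threshold)
print(threshold, n_kept)
```

Treat the returned value as a suggestion to check against the plot, not as a final threshold.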
However, the approach above can be subjective, and multiple tools have been developed to remove doublets programmatically, e.g. DoubletFinder, Scrublet, and Solo. They generate artificial doublets and calculate a doublet score by comparing the gene expression profile of each barcode in the data with those artificial doublets. You can also check out this benchmark by Xi & Li, 2021 as a great reference. Several software tools have also been developed to remove ambient RNA signal from single-cell RNA-seq data, including SoupX, DecontX, and CellBender.
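The idea behind simulation-based doublet scoring can be sketched in a few lines. The code below is a heavily simplified illustration and not the algorithm of DoubletFinder, Scrublet, or Solo: it sums random pairs of barcodes to create artificial doublets, normalizes each profile to proportions, and scores every observed barcode by the fraction of its k nearest neighbors that are artificial. The function name `doublet_scores`, the parameters, and the toy data are all assumptions.

```python
import math
import random


def doublet_scores(cells, n_sim=None, k=5, seed=0):
    """Fraction of simulated doublets among each barcode's k nearest neighbors."""
    rng = random.Random(seed)
    n = len(cells)
    n_sim = n_sim if n_sim is not None else 2 * n

    # Artificial doublets: add the counts of two randomly chosen barcodes
    sims = []
    for _ in range(n_sim):
        a, b = rng.sample(range(n), 2)
        sims.append([x + y for x, y in zip(cells[a], cells[b])])

    def normalize(profile):
        total = sum(profile) or 1
        return [x / total for x in profile]

    # Pool observed barcodes (flag False) and simulated doublets (flag True)
    points = [(normalize(c), False) for c in cells]
    points += [(normalize(s), True) for s in sims]

    scores = []
    for i in range(n):
        neighbors = sorted(
            (math.dist(points[i][0], profile), is_sim)
            for j, (profile, is_sim) in enumerate(points)
            if j != i
        )
        scores.append(sum(is_sim for _, is_sim in neighbors[:k]) / k)
    return scores


# Toy data: two cell types with distinct marker genes, plus one mixed barcode
type_a = [[10, 0] for _ in range(10)]
type_b = [[0, 10] for _ in range(10)]
cells = type_a + type_b + [[10, 10]]  # last barcode mimics an A+B doublet
scores = doublet_scores(cells)
print(scores)
```

Real tools work in a reduced-dimensional space and calibrate the score threshold, so use them rather than this sketch for actual data.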
It might also be worth considering automatic thresholding via the median absolute deviation (MAD), as described by Germain et al., 2020 and in the Quality Control chapter of Single-cell best practices (sc-best-practices.org).
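A minimal sketch of MAD-based thresholding: a barcode is flagged when its metric lies more than a chosen number of MADs away from the median. The function name `mad_outliers`, the cutoff of 5 MADs, and the toy values are assumptions; the metric is also often log-transformed before this step.

```python
import statistics


def mad_outliers(values, n_mads=5.0):
    """Flag values more than n_mads median absolute deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    lower, upper = med - n_mads * mad, med + n_mads * mad
    flags = [v < lower or v > upper for v in values]
    return flags, (lower, upper)


# Toy data (hypothetical): genes detected per barcode, with one extreme value
n_genes = [100, 102, 98, 101, 99, 103, 97, 500]
flags, (lower, upper) = mad_outliers(n_genes)
print(lower, upper)
print(flags)
```

Because the median and MAD are robust to extreme values, the single outlier barely shifts the bounds, which is the main appeal over mean and standard deviation.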
Key tips: Be flexible.
1. Apply different QC thresholds for different samples.
The technical variations between different samples can invalidate any set of standard QC thresholds. Plasschaert et al (2018) suggested that QC thresholds should be determined separately if the distribution of QC covariates differs between samples.
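To illustrate why per-sample thresholds matter, the sketch below derives an upper bound on mitochondrial percentage separately for each sample, here using median + 3 MADs. The rule, the function name `per_sample_upper_bounds`, and the sample labels and values are all hypothetical.

```python
import statistics
from collections import defaultdict


def per_sample_upper_bounds(values, samples, n_mads=3.0):
    """Upper QC bound (median + n_mads * MAD) computed separately per sample."""
    grouped = defaultdict(list)
    for value, sample in zip(values, samples):
        grouped[sample].append(value)

    bounds = {}
    for sample, vals in grouped.items():
        med = statistics.median(vals)
        mad = statistics.median(abs(v - med) for v in vals)
        bounds[sample] = med + n_mads * mad
    return bounds


# Toy data (hypothetical): mitochondrial percentages in two samples
pct_mito = [2, 3, 4, 3, 20, 8, 9, 10, 9, 30]
samples = ["pbmc"] * 5 + ["tumor"] * 5

bounds = per_sample_upper_bounds(pct_mito, samples)
keep = [v <= bounds[s] for v, s in zip(pct_mito, samples)]
print(bounds)
print(keep)
```

In this toy example a single global 5% cutoff would discard every barcode in the second sample, while the per-sample bounds remove only the one extreme barcode in each.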
2. Consider QC metrics with their underlying biological story.
If cells exhibit comparatively high fractions of mitochondrial counts, they may be dying. Yet they might also be involved in respiratory processes.
Cells that are larger in size may contain higher counts. So do cells with intermediate or continuous phenotypes. Meanwhile, cells with low counts might be smaller in size, or they can be a quiescent population.
Watch out for the underlying biological story so that you don’t remove viable cells after quality control.
3. Relax the QC thresholds first, look at downstream analysis and revisit QC later. Trust the process.
The general recommendation is to start with a relaxed set of QC parameters first and see how it goes.
We can’t emphasize enough the role of downstream analysis in assessing QC thresholds and outcomes. It is recommended to look at the downstream results (e.g. clustering, gene expression visualization, marker genes, DE results) and then revisit the QC thresholds, going back and forth as needed. This approach is especially relevant for datasets containing heterogeneous cell populations, where cell types or states may be misinterpreted as low-quality outlier cells (Luecken & Theis, 2019).
For doublet detection, the Scrublet authors also suggest that users visualize the doublet predictions in a 2-D embedding (e.g., UMAP or t-SNE) to assess the doublet score threshold. “Predicted doublets should mostly co-localize (possibly in multiple clusters). If they do not, you may need to adjust the doublet score threshold, or change the pre-processing parameters to better resolve the cell states present in your data.”
ScRNA-seq analysis made simple with CDIAM Multi-Omics Studio
If you are looking for a simple tool to analyze scRNA-seq data, check out the CDIAM Multi-Omics Studio, a web platform that supports easy analysis and integration of scRNA-seq and other omics data. CDIAM supports interactive visualization, marker finding, DEG analysis, cell type prediction, pseudobulk profile generation, pathway enrichment, target and biomarker prioritization workflows, multi-omics summaries, and more, with the support of the latest machine learning algorithms.
References
Luecken & Theis (2019). Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology (embopress.org)
Single-cell RNA-seq: Quality Control Analysis. Introduction to Single-cell RNA-seq, archived (hbctraining.github.io)
Ilicic et al. (2016). Classification of low quality cells from single-cell RNA-seq data. Genome Biology (biomedcentral.com)