
5 Common Pitfalls When Reusing Datasets from Public Single-cell Databases (and Tips to Avoid Them)

Updated: Nov 10

If you’ve ever tried reusing single-cell datasets from public databases, you’ve probably realized it’s not as plug-and-play as it sounds. What starts as “just download and analyze” can quickly turn into hours of figuring out inconsistent cell labels, missing metadata, or mysterious preprocessing steps. 


In this post, we’ll share 5 common pitfalls we’ve run into (and seen others face) when working with public single-cell data - and a few tips to make the process a lot smoother.


But first, what are single-cell databases?

A single-cell database is a repository that stores and organizes data from single-cell sequencing experiments. The data often comes in the form of a gene expression matrix, a FASTQ file, or a Seurat or Scanpy object, and may include metadata on cell types, tissues, species, treatments, etc.


What are some popular single-cell databases?

Some common databases include GEO, cellxgene, Human Cell Atlas, Broad Institute Single Cell Portal, and EBI Single Cell Expression Atlas.



1. GEO (Gene Expression Omnibus): A long-standing public repository from NCBI where thousands of transcriptomics datasets, including single-cell studies, are freely available. It’s often the first stop for anyone hunting for raw or processed data.


2. Broad Institute Single Cell Portal & Cellxgene: Interactive databases and data viewers that let you explore datasets directly in the browser (for example, checking UMAPs and plotting gene expression). Ideal for quick hypothesis validation.


3. Human Cell Atlas (HCA): A global project aiming to map every cell type in the human body.


4. EBI Single Cell Expression Atlas: A curated collection of single-cell datasets hosted by EMBL-EBI.


Common pitfalls in reusing public single-cell datasets

  1. Skipping quality checks

This is one of the most common mistakes beginners make. Most public single-cell datasets are shared as raw data, meaning no quality filtering has been applied yet. But sometimes, authors upload filtered data - where low-quality cells or genes have already been removed.


If you skip the quality control (QC) step assuming it’s raw data, or if you apply QC again on already filtered data, you risk distorting your dataset.


Solution: This is a basic pitfall, but luckily it's also an easy one to solve. Go back to the paper, check the final number of cells the authors report after filtering, and see whether it matches the number of cells deposited in the database. If the deposited data has not been QC-ed yet, apply your own quality filters.
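As a quick illustration, here is a minimal sketch of that sanity check in Python with Scanpy. The folder name, the expected cell count, and the QC thresholds are all hypothetical placeholders; replace them with the values reported in the paper you are reusing.

```python
import scanpy as sc

# Hypothetical folder containing matrix.mtx, barcodes.tsv and features.tsv
adata = sc.read_10x_mtx("GSEXXXXXX_raw/")

# Hypothetical number of cells reported after filtering in the paper's methods
n_cells_reported = 8500
print(f"Cells in deposited matrix: {adata.n_obs} (paper reports {n_cells_reported})")

if adata.n_obs != n_cells_reported:
    # Counts don't match -> the deposit is probably unfiltered; apply your own QC
    sc.pp.filter_cells(adata, min_genes=200)        # example threshold
    sc.pp.filter_genes(adata, min_cells=3)          # example threshold
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
    adata = adata[adata.obs["pct_counts_mt"] < 20].copy()   # example threshold
```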


  2. Double-normalizing the data


In single-cell analysis, normalization is the process of adjusting the raw counts so you can fairly compare gene expression between different cells (since some cells naturally have more total RNA than others). If you are new to the normalization process, you can find a detailed explanation in one of our related blog posts here.


As we mentioned above, most public datasets are uploaded as raw counts. But occasionally, you’ll find normalized or log-transformed data instead. 


If you miss this detail and normalize it again, you'll end up "double-normalizing" - which can drastically affect downstream analyses such as clustering or differential expression. The data becomes distorted and loses its original meaning and scale.


Solution: The core solution is to inspect the data values immediately after downloading a public dataset. If they are integers (e.g., 0, 12, 21), the matrix likely contains raw counts and needs normalization. If they are decimals (e.g., 0.0, 1.45, 5.89), the data has likely already been normalized.
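If you want to automate that inspection, a simple sketch in Python is to test whether the stored values are all integers (the file name here is a hypothetical placeholder):

```python
import numpy as np
import scipy.sparse as sp
import scanpy as sc

adata = sc.read_h5ad("downloaded_dataset.h5ad")   # hypothetical file name

X = adata.X
values = X.data if sp.issparse(X) else np.asarray(X)

if np.allclose(values, np.round(values)):
    print("Integer values -> likely raw counts; normalize before analysis.")
else:
    print("Non-integer values -> likely already normalized; do not normalize again.")
```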



Always check if the data has been normalized or not.


  3. Ignoring duplicated gene names

This one’s easy to overlook but can cause a lot of confusion later on. Each gene is supposed to have a unique identifier, known as the Ensembl ID. These IDs usually start with “ENSG” for human genes (for example, ENSG00000123456) and serve as permanent, standardized references that can be used consistently across studies and datasets - basically, the “social security numbers” for genes.


Unlike Ensembl IDs, gene symbols (or gene names) aren’t unique - several different Ensembl genes can share the same symbol (e.g. two Ensembl IDs ENSG00000117461 and ENSG00000278139 are mapped to the same gene symbol PIK3R3). When authors analyze their data in R (e.g., with Seurat) and encounter duplicated gene symbols, the software automatically adds suffixes to avoid conflicts, resulting in names like PIK3R3 and PIK3R3.1.


The problem arises when authors publish count matrices using gene symbols only instead of Ensembl IDs. As a secondary data user, there’s no way for us to tell which specific Ensembl IDs PIK3R3 and PIK3R3.1 actually represent.


Solution: Most public repositories (like GEO) that host 10x raw count data require the authors to upload all 3 associated files: (i) matrix.mtx (the counts), (ii) barcodes.tsv (the cell IDs), (iii) features.tsv (or genes.tsv). The features.tsv file contains the original, ordered mapping of Ensembl IDs to Gene Symbols, so you can use it to restore the correct mapping.
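As an illustration, here is a rough sketch in Python of rebuilding that mapping from features.tsv. The file path is a hypothetical placeholder, and the suffixing step only approximates what R's make.unique() does for duplicated symbols.

```python
import pandas as pd

# features.tsv(.gz) from the 10x triplet: Ensembl ID, gene symbol, feature type,
# in the same row order as matrix.mtx
features = pd.read_csv("GSEXXXXXX_raw/features.tsv.gz", sep="\t", header=None,
                       names=["ensembl_id", "gene_symbol", "feature_type"])

# Re-create the suffixed symbols (PIK3R3, PIK3R3.1, ...) the way make.unique would,
# then map each suffixed symbol back to its Ensembl ID
dup_rank = features.groupby("gene_symbol").cumcount()
features["unique_symbol"] = features["gene_symbol"].where(
    dup_rank == 0, features["gene_symbol"] + "." + dup_rank.astype(str)
)
symbol_to_ensembl = dict(zip(features["unique_symbol"], features["ensembl_id"]))

print(symbol_to_ensembl.get("PIK3R3"), symbol_to_ensembl.get("PIK3R3.1"))
```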


However, if the authors didn’t provide the features.tsv file, you then have to download the raw FASTQ files and re-run the entire alignment pipeline yourself to regenerate that original mapping file.


  4. Trusting that the file extension truly represents the file content

Sometimes, the file name you see in the public databases doesn’t tell the full story. You might download a file called matrix.mtx expecting a Matrix Market file (the standard output from 10x Genomics), but when you load it with Read10X() or read_mtx(), it throws a format error.


Often, the problem is simple - the file isn’t really an .mtx at all. Open it in a text editor and you might find it’s actually a CSV, TSV, or even a dense table mislabeled as .mtx.


Solution:

(i) Always check the file content manually before loading

(ii) If the file is actually a TSV, use functions like read.table() (R) or read_csv() (with sep='\t') instead of read_mtx(), then convert the parsed table into an AnnData or Seurat object (see the sketch after this list).

(iii) Look for alternative sources. Sometimes the data on the main repository (like GEO) is poorly formatted, but the authors' lab website or a supplementary repository (like Figshare) hosts correctly formatted files.
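Here is a minimal sketch of points (i) and (ii) in Python, assuming the mislabeled file turns out to be a tab-separated genes-by-cells table (the file name is a hypothetical placeholder):

```python
import pandas as pd
import anndata as ad

path = "GSEXXXXXX_matrix.mtx"   # hypothetical file name

# (i) Peek at the first lines: a genuine Matrix Market file starts with "%%MatrixMarket"
with open(path) as fh:
    for _ in range(3):
        print(fh.readline().rstrip())

# (ii) If it is really a TSV, parse it as a table and wrap it in an AnnData object
df = pd.read_csv(path, sep="\t", index_col=0)   # verify: rows = genes, columns = cells
adata = ad.AnnData(df.T)                        # AnnData expects cells x genes
print(adata)
```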


  5. Not verifying the reproducibility of the paper's insights

This pitfall covers cases where a paper's insights are not supported by its own data. For example, a paper claims that gene X is a marker for T cells, but when you download the dataset and run your own downstream analysis, it turns out that T cells show no expression of gene X (or gene X is expressed somewhere else). This means the paper's key "insight" is not reproducible from the very data it provides. This could be due to incorrect data processing, flawed interpretation, or even data manipulation.


Solution: This is why we always emphasize the importance of being critical when reusing public single-cell datasets, and of checking the data before trusting a paper's conclusions. After downloading the data, the very first step is to re-run the main analysis pipeline (normalization, clustering, UMAP). You can then use functions like FeaturePlot (Seurat) or pl.umap (Scanpy) to check whether the key marker genes reported in the paper actually match the cell clusters in your re-analysis.
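For example, a minimal Scanpy sketch of that reproducibility check might look like the following; the file name, the pipeline parameters, and the marker gene (CD3E, a canonical T-cell marker) are illustrative assumptions, not the original paper's actual pipeline.

```python
import scanpy as sc

adata = sc.read_h5ad("downloaded_dataset.h5ad")   # hypothetical raw-count dataset

# Re-run a standard pipeline: normalize, log-transform, reduce, cluster, embed
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.leiden(adata)          # requires the leidenalg package
sc.tl.umap(adata)

# Do the paper's reported markers light up in the expected clusters?
sc.pl.umap(adata, color=["leiden", "CD3E"])
```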



Always check the reproducibility of key insights.


Pythia’s approach to reliable public single-cell data


Above are common pitfalls that single-cell data curators might encounter, along with our suggested solutions. The process of transforming a messy raw dataset into an insightful outcome can, therefore, be very time-consuming – especially if you're new to the work. 


At Pythia, we’ve developed a standardized and rigorously quality-controlled single-cell database called Pythiomics. Public single-cell datasets are curated following a detailed SOP designed to address all the common pitfalls mentioned above - from verifying data integrity and restoring gene IDs, to standardizing metadata and ensuring reproducibility. 


The Pythiomics curation pipeline


Each dataset in Pythiomics passes through multiple QC layers to ensure it’s clean, consistent, and ready for downstream analysis - so you can focus on exploring biology instead of troubleshooting data.





Need more tailoring?

If you’re looking for specific datasets or tailored support, our expert data curation team is here to help. We handle requests with full documentation and traceability, ensuring you get high-quality, reproducible data that fits your research needs.



Explore our curation services

