
12 Public Single-Cell Databases, Compared: Which One Should You Use?

  • Jun 11
  • 7 min read

If you've ever spent an afternoon jumping between GEO, SRA, CELLxGENE, and half a dozen other websites trying to find the right single-cell dataset, you're not alone.


The single-cell ecosystem has grown rapidly over the past few years, bringing with it an impressive collection of databases and data portals. But while having more data is great, finding relevant and usable data is often more challenging than it sounds!


A researcher looking for a single-cell dataset on a particular disease may encounter dozens of databases, each offering different levels of data processing, annotation quality, and accessibility. Some primarily store raw sequencing files and metadata, while others provide harmonized cell-type annotations, interactive visualization tools, or curated collections focused on specific tissues, diseases, or therapeutic areas.


To make things easier, we've put together a practical comparison of 12 public single-cell databases, covering what they offer, where they excel, and the limitations you should be aware of before diving in.



What Kinds of Single-Cell Databases Are Out There?

There are many ways to categorize single-cell databases, but two approaches are particularly useful: what biological data they focus on and what they allow researchers to do with that data.


By Biological Focus

According to Gondal et al. (2024), single-cell databases can be broadly categorized as:

  • General databases – broad repositories covering many tissues, diseases, and biological systems

  • Tissue-specific databases

  • Disease-specific databases

  • Cancer-focused databases

  • Cell type-focused databases


By Data Type and Functionality

From a practical standpoint, most public single-cell resources can be grouped into three broad categories:


  • Archives

    Archives are primary repositories where researchers deposit data generated in their studies. These resources typically contain raw sequencing files, count matrices, processed matrices, and associated metadata.


  • Standardized Databases

    These resources aggregate datasets from multiple studies and apply standardized processing pipelines, harmonized metadata, and consistent cell-type annotations. Such standardization makes it easier to compare datasets across studies and perform large-scale analyses.


  • Interactive Discovery Portals

    These databases provide web-based interfaces that allow users to search, visualize, and explore public single-cell datasets without downloading and processing the data themselves.


It's worth noting that these categories are not mutually exclusive. For example, some resources combine standardized datasets with interactive exploration tools, allowing researchers to both access harmonized data and explore it through a user-friendly interface.
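Because one resource can sit in several functional categories at once, the overlap can be modeled as combinable flags rather than a single label. The sketch below is only an illustration: it encodes the functional category labels this article assigns to each resource in its comparison tables, and the names `Role`, `RESOURCES`, and `with_roles` are made up for the example.

```python
from enum import Flag, auto


class Role(Flag):
    """Functional categories from this article; a resource can hold several."""
    ARCHIVE = auto()
    STANDARDIZED = auto()
    PORTAL = auto()  # interactive discovery portal


# Functional category labels as assigned in this article's comparison tables.
RESOURCES = {
    "GEO": Role.ARCHIVE,
    "SRA": Role.ARCHIVE,
    "Zenodo": Role.ARCHIVE,
    "EBI Single Cell Expression Atlas": Role.ARCHIVE | Role.PORTAL,
    "Human Cell Atlas (HCA)": Role.STANDARDIZED,
    "HuBMAP": Role.STANDARDIZED | Role.PORTAL,
    "DISCO": Role.STANDARDIZED | Role.PORTAL,
    "CELLxGENE": Role.STANDARDIZED | Role.PORTAL,
    "Broad Institute Single Cell Portal": Role.STANDARDIZED | Role.PORTAL,
}


def with_roles(required: Role) -> list[str]:
    """Return the resources that carry every role in `required`."""
    return [name for name, roles in RESOURCES.items() if required in roles]


# Resources that are both standardized and interactively explorable.
print(with_roles(Role.STANDARDIZED | Role.PORTAL))
```

Using flags keeps the "not mutually exclusive" point explicit: a query for two roles only returns resources that hold both.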


Popular Single-Cell Databases: What They Do Well (and Where They Fall Short)

Let's start with archive databases. These provide the broadest coverage of public single-cell datasets. However, their metadata, file formats, annotations, and processing methods are often inconsistent across studies, which makes cross-dataset comparison and integration challenging.


GEO
Category: Archive / General
Description: The most widely used repository for gene expression data in general. Single-cell studies typically provide raw count matrices (MTX, CSV, TSV, TXT, H5), metadata, and occasionally processed objects such as H5AD or RDS files.
Pros:
  • Massive dataset collection
  • Often the first place new studies appear
  • Easy to find associated publications
Cons:
  • Metadata, file formats, and data structures vary widely. Some datasets are raw while others are normalized or partially processed
  • No interactive visualization or analysis tools

SRA
Category: Archive / General
Description: Repository for raw sequencing data, primarily FASTQ files generated from sequencing experiments.
Pros:
  • Access to original sequencing reads; ideal for custom processing pipelines
Cons:
  • Requires significant computational resources and bioinformatics expertise to explore

Zenodo
Category: Archive / General
Description: A general-purpose open-access repository developed by CERN and supported by the European Commission. Researchers often use Zenodo to share single-cell datasets, processed objects (e.g., H5AD, Seurat RDS), analysis code, supplementary files, and data not deposited in specialized repositories.
Pros:
  • Often contains processed datasets and analysis-ready files
  • Supports code, workflows, and supplementary materials alongside data
Cons:
  • Metadata and file organization vary widely between studies
  • Limited search and filtering capabilities for biological datasets

EBI’s Single Cell Expression Atlas
Category: Archive / Interactive Discovery Portal / General
Description: A repository maintained by EMBL-EBI and, alongside GEO, one of the commonly used resources for depositing public single-cell data. It includes a searchable browser for discovering datasets by species, project, assay, and so on, as well as basic interactive tools for exploring gene expression patterns.
Pros:
  • Easy to navigate and find relevant datasets with search tools and filters (although these are quite limited)
  • Provides basic interactive visualization
Cons:
  • Not fully harmonized across studies (primarily standardized to assay ontologies and species)
  • Limited data coverage (~10 million cells as of June 2026)
  • Limited visualization and downstream analysis capabilities
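Since archive deposits arrive in whatever mix of formats the authors chose, a first triage step is often just sorting a study's files by format before deciding how to load them. The following is a minimal sketch using only the Python standard library; the file names are hypothetical, and the extension table covers only the formats named in the archive entries above.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Extensions for the formats named above, mapped to a coarse format label.
FORMATS = {
    ".mtx": "MTX", ".csv": "CSV", ".tsv": "TSV", ".txt": "TXT",
    ".h5": "H5", ".h5ad": "H5AD", ".rds": "RDS",
    ".fastq": "FASTQ", ".fq": "FASTQ",
}
COMPRESSION = {".gz", ".bz2", ".zip"}


def detect_format(filename: str) -> str:
    """Guess a file's format from its extension, ignoring compression suffixes."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    while suffixes and suffixes[-1] in COMPRESSION:
        suffixes.pop()
    return FORMATS.get(suffixes[-1], "unknown") if suffixes else "unknown"


def group_by_format(filenames):
    """Group file names by detected format, preserving input order."""
    groups = defaultdict(list)
    for name in filenames:
        groups[detect_format(name)].append(name)
    return dict(groups)


# Hypothetical supplementary files from one deposited study.
files = ["example_matrix.mtx.gz", "example_barcodes.tsv.gz",
         "example_atlas.h5ad", "example_seurat.RDS", "README"]
for fmt, names in group_by_format(files).items():
    print(fmt, names)
```

An extension only hints at the container, not at whether the values inside are raw counts or normalized, so the study's own documentation still needs checking.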


Databases with a focused topic are great for finding data relevant to a particular tissue or therapeutic area, but they cover far less data. Some examples:


Allen Brain Cell Types Atlas
Category: Tissue-specific
Description: Comprehensive atlas of brain cell types generated using transcriptomic and multimodal profiling.
Pros:
  • Excellent resource for neuroscience research
Cons:
  • Limited data coverage

TISCH2
Category: Cancer-focused
Description: Specialized resource for tumor microenvironment single-cell datasets with curated annotations.
Pros:
  • Excellent resource for cancer and immuno-oncology research
Cons:
  • Limited data coverage


Standardized databases aim to solve one of the biggest challenges in public single-cell research: data heterogeneity. These single-cell databases harmonize metadata, annotations, and data formats across studies, making it much easier to discover relevant datasets, compare cohorts, and perform cross-study analyses. Most also provide interactive visualization tools that allow researchers to explore data without extensive bioinformatics expertise.
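To make the idea of harmonization concrete, here is a minimal sketch of mapping free-text cell-type labels from different studies onto one canonical vocabulary. It is an illustration only, not how any particular database does it: the synonym table and the names `CANONICAL`, `normalize`, and `harmonize` are hypothetical, and real resources typically map to ontology terms rather than plain strings.

```python
import re

# Hypothetical synonym table: normalized label variants -> one canonical label.
CANONICAL = {
    "t cell": "T cell",
    "t cells": "T cell",
    "t lymphocyte": "T cell",
    "b cell": "B cell",
    "b cells": "B cell",
    "b lymphocyte": "B cell",
}


def normalize(label: str) -> str:
    """Lowercase, turn hyphens/underscores into spaces, collapse whitespace."""
    return re.sub(r"[\s_\-]+", " ", label).strip().lower()


def harmonize(label: str) -> str:
    """Map a raw label to its canonical form, flagging anything unmapped."""
    return CANONICAL.get(normalize(label), f"UNMAPPED:{label}")


# Labels as three hypothetical studies might have written them.
for raw in ["T-cells", "B_Lymphocyte", "t  CELL", "NK"]:
    print(raw, "->", harmonize(raw))
```

Flagging unmapped labels instead of guessing keeps the gaps visible, which matters because unharmonized fields are exactly where cross-study comparisons go wrong.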


Here are some available standardized single-cell databases to consider:

Human Cell Atlas (HCA)
Category: Standardized Database / General
Description: A repository of datasets generated by the Human Cell Atlas initiative, an international effort to create comprehensive reference maps of all human cell types.
Pros:
  • Home to some of the largest human cell atlases currently available
  • Valuable source of healthy tissue reference datasets
  • Supports both single-cell and spatial omics data
  • Standardized metadata schemas and cell-type annotation frameworks improve dataset consistency
Cons:
  • Not all terminologies are standardized; harmonized metadata is missing for treatments, animal strains, comorbidities, etc.
  • Coverage is strongest for Human Cell Atlas projects and participating consortia rather than all published single-cell studies

HuBMAP
Category: Standardized Database / Interactive Discovery Portal / General
Description: A consortium-led repository and data portal that aims to map the human body at cellular resolution. HuBMAP hosts a wide range of single-cell, spatial transcriptomics, imaging, and other multimodal datasets across multiple human tissues and organs.
Pros:
  • Large collection of healthy human tissue datasets
  • Strong emphasis on spatial biology and multimodal data integration
Cons:
  • Not all terminologies are standardized
  • Missing important harmonized metadata related to treatments, cell lines, sample types/morphology, etc.
  • Primarily designed as a multimodal human atlas rather than a dedicated single-cell repository, so single-cell dataset coverage is smaller

DISCO
Category: Standardized Database / Interactive Discovery Portal / General
Description: A curated, harmonized data repository and interactive platform for visualization.
Pros:
  • Includes a function to integrate samples across studies
  • Integrated atlases are available for instant access
Cons:
  • Focused primarily on human datasets
  • Limited downstream analysis capabilities compared with dedicated analysis platforms

CELLxGENE
Category: Standardized Database / Interactive Discovery Portal / General
Description: Collection of harmonized single-cell datasets with search and visualization capabilities.
Pros:
  • Standardized terminologies used across datasets for key variables like cell types, assays, diseases, and tissues
  • Standardized data formats (downloadable in H5AD)
  • Easy to navigate and find relevant datasets with search tools and filters
  • User-friendly; supports visualization without coding
Cons:
  • Not all terminologies are standardized
  • Missing important harmonized metadata related to treatments, ages, cell lines, and sample types/morphology (‘normal’ labels can mean both healthy controls and normal tissues in diseased patients)
  • Limited visualization and downstream analysis functions

Broad Institute Single Cell Portal
Category: Standardized Database / Interactive Discovery Portal / General
Description: One of the most widely used repositories for single-cell data with interactive exploration tools.
Pros:
  • Standardized terminologies used across datasets for key variables like cell types, assays, diseases, and tissues
  • User-friendly; supports visualization without coding
  • Easy to navigate and find relevant datasets with search tools and filters
Cons:
  • Not all terminologies are standardized
  • Missing important harmonized metadata related to treatments, cell lines, sequencing platforms, animal strains, etc.
  • Limited visualization and downstream analysis functions
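The ‘normal’ ambiguity noted in the CELLxGENE entry above is a good example of why a missing metadata field matters in practice. The sketch below shows one way the two meanings could be separated when a donor-level disease field is available. It is an assumption-laden illustration: `sample_label` and `donor_disease` are hypothetical field names, not the schema of any database discussed here.

```python
def resolve_normal(sample: dict) -> str:
    """Split an ambiguous 'normal' label using the donor's disease status.

    `sample_label` and `donor_disease` are hypothetical field names.
    """
    label = sample["sample_label"].strip().lower()
    if label != "normal":
        return label  # nothing to disambiguate
    donor = sample.get("donor_disease")
    if donor is None:
        return "normal (donor status unknown)"
    if donor.strip().lower() in {"healthy", "none"}:
        return "healthy control"
    return "normal tissue from diseased donor"


# Hypothetical sample records.
samples = [
    {"sample_label": "normal", "donor_disease": "healthy"},
    {"sample_label": "Normal", "donor_disease": "lung adenocarcinoma"},
    {"sample_label": "normal"},
    {"sample_label": "tumor", "donor_disease": "lung adenocarcinoma"},
]
for s in samples:
    print(resolve_normal(s))
```

When the donor field is absent the label stays explicitly unresolved, since treating such samples as healthy controls would silently bias a case-versus-control comparison.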





How Pythiomics Takes a Different Approach

As discussed above, existing public single-cell databases each address part of the challenge but leave important gaps:


  • Data from large archive repositories are often difficult to reuse, especially for AI/ML modeling, because metadata, annotations, and file formats are not standardized. Raw data requires significant effort to process before it is ready to explore.


  • Standardized databases improve consistency, but harmonization is typically limited to a relatively small number of metadata categories, making it difficult to compare cohorts across studies using variables such as treatments, treatment history, demographics, or clinical characteristics.


  • In addition, many databases focus primarily on data discovery and basic visualization, offering limited downstream analysis capabilities.


We developed the Pythiomics database with these challenges in mind. It is designed as a centralized repository that provides truly ready-to-use data for AI/ML modeling and for target and biomarker discovery. Pythiomics combines extensive data coverage, careful manual curation and documentation, thorough metadata harmonization, and rich analytical functionality within a single platform.


Overview of Pythiomics curation pipeline and data statistics

Pythiomics

Category: Standardized Database / Interactive Discovery Portal / General

Data source: Manually curated from a wide range of repositories such as GEO, CELLxGENE, Zenodo, HCA, Broad Institute Single Cell Portal, EBI Single Cell Expression Atlas, lab websites, and more.

Data curation pipeline:
  • Automated and manual QC help identify and remove low-quality cells while ensuring dataset completeness and reproducibility of key findings reported in the original publication.
  • Standardized preprocessing workflows ensure datasets are delivered in consistent formats, making it easier to compare and analyze data across studies.
  • Detailed curation logs and documentation provide full transparency into how each dataset was processed, including the rationale behind key decisions and any modifications made during curation.

Data coverage (as of June 2026):
  • 161 million cells
  • 433 diseases
  • 516 tissues
  • 38,039 donors

Data accessibility:
  • Interactive exploration through the C-DIAM web platform, with downstream analyses including DEA, enrichment, trajectory analysis, cell-cell communication, marker detection, and more.
  • Programmatic access directly from native bioinformatics environments (RStudio, Jupyter Notebook) using the Pythiomics APIs.
  • Cross-dataset normalization and database-wide analytics for large-scale insight validation, such as global differential expression analysis and global gene expression queries.

Data harmonization: 50+ harmonized metadata fields covering patient IDs, sample IDs, diseases, tissues, cell types, assays, treatments, treatment history, sample types/morphology, genders, sequencing platforms, animal strains, comorbidities, cell lines, and many more.
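For readers unfamiliar with what an automated QC step in a curation pipeline like the one above typically involves, here is a generic sketch of threshold-based cell filtering. This is not the Pythiomics pipeline: the metrics, the `CellStats` structure, and the cutoffs (`min_genes=200`, `max_mito_fraction=0.2`) are illustrative assumptions, and appropriate thresholds depend on tissue and assay.

```python
from dataclasses import dataclass


@dataclass
class CellStats:
    barcode: str
    n_genes: int        # genes detected in the cell
    total_counts: int   # total counts across all genes
    mito_counts: int    # counts from mitochondrial genes


def passes_qc(cell, min_genes=200, max_mito_fraction=0.2):
    """Illustrative thresholds only; real cutoffs depend on tissue and assay."""
    if cell.total_counts == 0:
        return False
    mito_fraction = cell.mito_counts / cell.total_counts
    return cell.n_genes >= min_genes and mito_fraction <= max_mito_fraction


def filter_cells(cells, **thresholds):
    """Return kept cells plus the barcodes removed, so removals can be logged."""
    kept = [c for c in cells if passes_qc(c, **thresholds)]
    removed = [c.barcode for c in cells if not passes_qc(c, **thresholds)]
    return kept, removed


# Hypothetical per-cell summary statistics.
cells = [
    CellStats("AAAC", n_genes=1500, total_counts=5000, mito_counts=250),
    CellStats("AAAG", n_genes=120, total_counts=300, mito_counts=15),
    CellStats("AAAT", n_genes=900, total_counts=2000, mito_counts=800),
]
kept, removed = filter_cells(cells)
print([c.barcode for c in kept], removed)
```

Returning the removed barcodes alongside the kept cells is what makes a step like this auditable, in the spirit of the curation logs described above.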




Wrapping Up

The best database ultimately depends on your research goals. If you need raw sequencing data, archive repositories may be the right choice. If you are focused on a particular disease or tissue, specialized resources can provide valuable domain-specific insights. For researchers looking to rapidly discover, compare, and analyze datasets across studies, standardized databases offer the most efficient workflow.


Pythiomics builds on this foundation by combining large-scale public data coverage, deep metadata harmonization, and integrated analytics in a single platform. By reducing the time spent searching, cleaning, and integrating datasets, researchers can focus on what matters most: generating biological insights and accelerating discovery.
