Noisy Data, Noisy AI: Why So Many AI Drug Discovery Models Fail

May 10

Lately we’ve been thinking a lot about the current AI boom in biopharma, and honestly, it feels both exciting and unsettling.

On one hand, we are seeing an explosion of AI drug discovery tools that promise to transform the industry. Models can now predict drug targets, simulate biological systems, design molecules, screen compounds, and surface insights faster than ever before.

But there is one thing that keeps nagging at us: AI is only as good as the data we feed it. And right now, that is still a major problem.

The Many Jobs AI Can Take On

Everywhere we look, AI is being positioned as the answer: faster pipelines, better targets, lower costs, smarter clinical trials. To be fair, the potential is real.

A recent review, “The future of pharmaceuticals: Artificial intelligence in drug discovery and development”, summarizes just how broad AI’s role in drug discovery has become, from molecular characterization and target discovery to clinical trial optimization and drug repurposing.

Some of the most promising applications include:

Biopharma area	AI applications
1. Drug–target interaction prediction	Predicting interactions between compounds and protein targets to estimate efficacy, safety, and target relevance.
2. Target discovery and validation	Mining multi-omics data, clinical data, literature, patents, and clinical reports to identify disease-relevant pathways, proteins, biomarkers, and therapeutic targets.
3. Systems biology and causal network modeling	Integrating genomics, proteomics, lipidomics, metabolomics, and clinical data to construct causal inference networks and identify disease drivers.
4. Biomedical knowledge graphs	Building knowledge graphs connecting genes, drugs, diseases, clinical trials, and mechanisms to support target discovery, mechanistic analysis, and drug repurposing.
5. Virtual screening	Screening large compound libraries computationally to prioritize candidates before experimental validation. Includes both structure-based and ligand-based virtual screening.
6. Lead optimization	Optimizing candidate molecules for potency, selectivity, binding affinity, ADMET properties, and synthetic feasibility.
7. ADMET prediction	Predicting absorption, distribution, metabolism, excretion, and toxicity to reduce late-stage failure and prioritize safer compounds.
8. Toxicity and side-effect prediction	Predicting clinical toxicity and adverse reactions using chemical features, target features, pathway analysis, and transcriptomic perturbation data.
9. Clinical trial outcome prediction	Predicting phase I/II trial success, toxicity risk, efficacy likelihood, and possible trial failure using drug response, side-effect, and pathway data.
10. Patient stratification	Identifying subgroups of patients likely to respond to treatment or experience adverse events using omics, clinical, and pathway data.
11. Patient–trial matching and recruitment	Using NLP and EHR/EMR mining to match eligible patients to clinical trials and improve recruitment efficiency.
12. Drug repurposing / repositioning	Identifying new indications for approved or investigational drugs by integrating systems biology, NLP, EHRs, knowledge graphs, and multi-omics data.

Multi-Omics Data: A Goldmine… in Theory

If AI is the engine, then multi-omics data can be the fuel.

Genomics, transcriptomics, proteomics, metabolomics, spatial data, clinical metadata, and imaging data can be layered together to create a much richer view of biology. In theory, these datasets give AI drug discovery models what they need to uncover disease mechanisms, identify therapeutic targets, prioritize biomarkers, and explain patient-level variation.

And we are not lacking data.

If anything, we are drowning in it.

Public repositories are filled with single-cell datasets, bulk sequencing studies, proteomics profiles, clinical annotations, and disease-specific resources. Every new study adds another layer of biological information. Every consortium contributes another atlas. Every lab generates more data.

Number of multi-omics publications from 2002 to 2023. With its first indexing in the National Library of Medicine in 2002 (PubMed search “multi-omics”; 31 December 2023), in recent years, together with AI drug discovery development, we have seen a rapidly increasing interest in the application of multi-omics alongside individual omics layers, more than doubling scientific publication numbers.

But more data does not automatically mean better AI. Because not all data is created equal.

The Data Readiness Gap

There is a big difference between available data and AI-ready data.

A dataset can be public, large, and scientifically valuable, but still be very difficult to use in a reliable AI drug discovery workflow.

Take single-cell omics data as an example. In practice, we keep seeing the same problems again and again:

Inconsistencies in quality

Not all datasets are reliable enough for an AI drug discovery workflow. Data often come from different studies and experiments, with variation in sampling, sequencing protocols, and sequencing depth, as well as batch effects. On top of that come artifacts such as doublets, ambient RNA contamination, and other sources of noise. Without proper filtering, data quality may be insufficient, and the model may learn spurious patterns (so-called “shortcut learning”, where the model picks up on technical artifacts or noise instead of the true underlying biological structure).
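To make the filtering step concrete, here is a minimal sketch of per-cell quality control in plain Python. The metric names, the `MT-` prefix convention for mitochondrial genes, and every threshold are our own illustrative assumptions, not a recommended standard. Doublet detection and ambient RNA correction need dedicated tools and are not covered here.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    cell_id: str
    total_counts: int     # total counts observed in the cell
    genes_detected: int   # genes with at least one count
    mito_fraction: float  # share of counts from mitochondrial genes


def compute_qc(cell_id, counts):
    """Compute basic QC metrics from a {gene_symbol: count} mapping."""
    total = sum(counts.values())
    detected = sum(1 for c in counts.values() if c > 0)
    # Assumption: mitochondrial genes are named with an "MT-" prefix.
    mito = sum(c for gene, c in counts.items() if gene.upper().startswith("MT-"))
    return CellQC(cell_id, total, detected, mito / total if total else 0.0)


def passes_qc(qc, min_genes=200, max_counts=50_000, max_mito=0.2):
    """Illustrative thresholds only; real cutoffs depend on tissue and assay."""
    return (
        qc.genes_detected >= min_genes
        and qc.total_counts <= max_counts
        and qc.mito_fraction <= max_mito
    )


# Toy data with a handful of genes, so min_genes is lowered for the demo.
cells = {
    "healthy": {"ACTB": 50, "GAPDH": 30, "CD3E": 10, "MT-CO1": 5},
    "stressed": {"ACTB": 5, "GAPDH": 2, "MT-CO1": 40, "MT-ND1": 30},
    "empty": {"ACTB": 1, "GAPDH": 0, "MT-CO1": 0},
}
kept = [
    cell_id
    for cell_id, counts in cells.items()
    if passes_qc(compute_qc(cell_id, counts), min_genes=3)
]
print(kept)  # ['healthy']
```

In this toy example the "stressed" cell is dropped for its high mitochondrial fraction and the "empty" cell for having too few detected genes. In a real pipeline the cutoffs would be chosen per dataset and recorded alongside the data.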

Inconsistencies in terminology

One of the most important issues is inconsistent cell type annotation. This arises partly from the lack of a common naming convention for single-cell annotations and partly from the absence of a standardized reference for hierarchical relationships among cell types.[1] We see this as a major issue to address: single-cell data analysis has become more user-friendly, yet much time is still spent tuning clustering parameters and assigning cell types. A research team that trained a deep learning model for cell type prediction across 265 human cell types also found that inconsistent labeling in existing databases (generated by different labs) directly caused model errors.[2]

They identified two major types of issue leading to poor performance. The first is “synonym error”, where the same cells are annotated under different labeling systems across datasets. For example, “blood vessel endothelial cell” and “vein endothelial cell” were treated as different cell types in different datasets, even though they refer to the same cells. As a result, the AUC (area under the curve, a metric of how well a model tells two classes apart) for the blood vessel endothelial cell was estimated to be much lower than it actually is. The second is “hierarchical error”, where a broad cell type contains multiple subtypes. For example, cells labeled “T cell” may be further subcategorized into alpha-beta T cells, CD4-positive T cells, and so on. This was also reported to cause both wrong predictions and an underestimation of performance.
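Both error types can be handled before evaluation by resolving synonyms and walking up a cell type hierarchy. The sketch below is only an illustration: the `SYNONYMS` and `PARENT` tables are hypothetical, hand-written stand-ins that follow the examples above, and a real pipeline would derive them from an ontology such as the Cell Ontology rather than hard-coding them.

```python
# Hypothetical lookup tables for illustration only.
SYNONYMS = {
    "vein endothelial cell": "blood vessel endothelial cell",
}
PARENT = {
    "CD4-positive T cell": "alpha-beta T cell",
    "alpha-beta T cell": "T cell",
}


def canonical(label):
    """Resolve a synonym to its canonical label (addresses synonym errors)."""
    label = label.strip()
    return SYNONYMS.get(label, label)


def ancestors(label):
    """Return the canonical label followed by its ancestors, most specific first."""
    chain = [canonical(label)]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def is_consistent(predicted, annotated):
    """Count a prediction as consistent if the two labels are equal after
    synonym resolution, or if one is an ancestor of the other
    (addresses hierarchical errors)."""
    p, a = canonical(predicted), canonical(annotated)
    return p in ancestors(a) or a in ancestors(p)


print(is_consistent("vein endothelial cell", "blood vessel endothelial cell"))  # True
print(is_consistent("CD4-positive T cell", "T cell"))  # True
print(is_consistent("T cell", "blood vessel endothelial cell"))  # False
```

Scoring with a check like `is_consistent` instead of exact string matching avoids penalizing a model for predicting a correct but differently named or more specific label. Whether a broader or narrower prediction should count as fully correct is a design choice that depends on the use case.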

Non-standardized metadata is another critical problem in a dataset. Metadata supports data discovery and integration, and enables reproducibility, reusability, and secondary analysis. One study highlights the lack of standardized metadata reporting,[3] a gap that makes it harder to integrate and compare data from different cohorts. The authors give a concrete example from their own sepsis study: when they looked at raw datasets, even basic details like tissue type weren’t reported consistently. Some studies labeled it as “source”, others as “tissue”, and the formats varied, showing how messy things can get by the time you reach secondary analysis.
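The “source” versus “tissue” problem is easy to picture in code. The sketch below maps inconsistent field names onto one canonical field. The alias table is an assumption for illustration: only “source” and “tissue” come from the example above, and the other alias is hypothetical.

```python
# Alias table mapping normalized field names to a canonical field.
# Only "source" and "tissue" come from the example in the text;
# "tissue_type" is a hypothetical addition.
FIELD_ALIASES = {
    "tissue": "tissue",
    "source": "tissue",
    "tissue_type": "tissue",
}


def normalize_key(key):
    """Lowercase a field name and unify separators."""
    return key.strip().lower().replace(" ", "_").replace("-", "_")


def harmonize_record(record):
    """Rename known field aliases and keep unknown fields under 'unmapped'.

    Limitation of this sketch: if two aliases of the same field appear in
    one record, the later one silently overwrites the earlier one.
    """
    clean, unmapped = {}, {}
    for key, value in record.items():
        norm = normalize_key(key)
        target = FIELD_ALIASES.get(norm)
        if target is None:
            unmapped[norm] = value
        else:
            clean[target] = value.strip().lower() if isinstance(value, str) else value
    if unmapped:
        clean["unmapped"] = unmapped
    return clean


study_a = {"Source": " Whole Blood "}
study_b = {"tissue": "whole blood"}
print(harmonize_record(study_a) == harmonize_record(study_b))  # True
```

Keeping unrecognized fields under `unmapped` rather than dropping them means nothing is lost silently, and a curator can later decide whether a new alias belongs in the table.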

Example of inconsistent metadata and cell type annotations across studies. Both datasets profile cells from the mouse brain and contain overlapping cell populations (e.g. oligodendrocyte progenitor cells, endothelial cells, astrocytes). However, the terminology differs substantially between studies (e.g. NG2/OPC vs. OPC, Oligodend1 vs. OLG, Astrocyte vs. AST1/AST2), making cross-dataset integration and comparison in AI drug discovery more challenging.

Lack of documentation

Another issue to acknowledge is how important traceable documentation is for AI drug discovery. Comprehensive metadata documentation, paired with raw omics data, plays a pivotal role in promoting reproducibility. Without this documentation (how the data was processed, what parameters were used to filter it, and so on), a dataset is not a robust basis for training models.

What We Need Before Better AI: AI-ready Data

We believe the next stage of AI drug discovery will depend less on building bigger models and more on building better data foundations. That means investing in:

Quality control. It is essential to first and foremost ensure low-quality cells and technical artifacts are removed. Good AI drug discovery starts from careful data preprocessing and curation.
Metadata standardization and ontology mapping. Using consistent fields for tissue, disease, assay, treatment, donor, sample, and clinical context. Harmonizing cell types, diseases, tissues, assays, and treatments to standard vocabularies so datasets can be compared across studies.
Documentation. Comprehensive documentation captures the "how" and "why" behind the data transformation and ensures that every dataset is reproducible and extensible. In simple terms, this should be so thorough that a different team, three years from now, could reconstruct the exact dataset from the raw files and understand the clinical context of every cell without needing to contact the original authors.
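As a rough illustration of what such documentation can look like in machine-readable form, the sketch below builds a small provenance record: a checksum of the raw input, the pipeline version, and each processing step with its parameters and rationale. The field names, step names, and values are hypothetical and are not a description of any particular database's format.

```python
import hashlib
import json


def build_provenance(raw_bytes, steps, pipeline_version):
    """Build a provenance record tying processed data back to its raw input."""
    return {
        # The checksum lets a later team confirm they start from the same raw file.
        "raw_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "pipeline_version": pipeline_version,
        "steps": steps,
    }


record = build_provenance(
    raw_bytes=b"toy raw count matrix",  # stand-in for the raw file contents
    steps=[
        {
            "name": "filter_cells",
            "params": {"min_genes": 200, "max_mito": 0.2},
            "rationale": "remove low-quality and stressed cells",
        },
        {
            "name": "map_cell_types",
            "params": {"vocabulary": "Cell Ontology"},
            "rationale": "harmonize labels across studies",
        },
    ],
    pipeline_version="0.1.0",
)

# Sorted keys give a stable text form that is easy to diff and to version.
serialized = json.dumps(record, indent=2, sort_keys=True)
print(serialized)
```

Storing a record like this next to every processed dataset is one simple way to make the “how” and “why” recoverable without contacting the original authors.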

That’s also the standard that our team is anchored to when building the Pythiomics database—a multi-omics database that currently contains 143 million cell profiles and is growing. In particular, to ensure the attributes mentioned above, Pythiomics is built on:

A Standard Operating Process (SOP). Each dataset follows a defined framework outlining QC criteria, processing steps, naming conventions, and how to handle edge cases.
Harmonized metadata. Diseases, tissues, treatments, cell types, genders, sampling ages, and more are all mapped to standardized vocabularies (e.g. using EBI’s ontologies like MONDO, UBERON, NCIT, CL, CLO, EFO, and others, in combination with our in-house ontologies) for consistency across studies.
Rigorous QC. Automated checks handle scale, while manual review ensures accuracy and catches what algorithms might miss.
Documentation and versioning. Every step is recorded — from input data types and QC thresholds to the rationale and code used — with full version histories for transparency and traceability.
Accessible through both APIs and GUI. Final processed datasets, UMAPs, and harmonized metadata can be explored interactively through our C-DIAM web platform or accessed programmatically via Pythiomics APIs from bioinformatics-native environments. This allows both computational and non-computational teams to work from the same standardized data foundation, improving accessibility, reproducibility, and cross-team collaboration.

Beyond supporting AI drug discovery, we want to help patients by helping scientists. Take a closer look at Pythiomics here.

Learn more about Pythiomics

References

[1] Bian, H., Chen, Y., Wei, L., & Zhang, X. (2025). uHAF: a unified hierarchical annotation framework for cell type standardization and harmonization. Bioinformatics (Oxford, England), 41(4), btaf149. https://doi.org/10.1093/bioinformatics/btaf149.

[2] Dong, S., Deng, K., & Huang, X. (2024). Single-cell type annotation with deep learning in 265 cell types for humans. Bioinformatics Advances, 4(1), vbae054. https://doi.org/10.1093/bioadv/vbae054.

[3] Huang, Y.-N., Munteanu, V., Love, M. I., Ronkowski, C. F., Deshpande, D., Wong-Beringer, A., et al. (2025). Perceptual and technical barriers in sharing and formatting metadata accompanying omics studies. Cell Genomics, 5(5), 100845. https://doi.org/10.1016/j.xgen.2025.100845.