6 practices to begin with multi-omics data integration

Oct 10, 2023

Updated: Oct 22, 2023

“Integrated multi-omics is more than the sum of its parts.”

Multi-omics data integration has recently become an exciting pillar of bioinformatics. More and more scientists are adopting this approach, and it’s easy to understand why.

Like bringing together photos of an object taken from different angles, an integrated multi-omics approach involves collecting and analyzing different omics data (such as genomics, epigenomics, transcriptomics, proteomics, and metabolomics) to study a biological process, disease, or condition. It delivers a bigger and clearer picture of the biology than any single-omics analysis could provide on its own.

However, integrating multi-omics data is challenging by nature. Batch effects and inconsistent data formats, metrics, and metadata are just a few of the many problems one needs to overcome.

Although there are no established guidelines for multi-omics data integration, the good news is there are some sound practices that we can employ to avoid common mistakes.

So, here are six of them.

1. Start with asking the right questions

In any analysis, the starting point is asking the right questions. For what purpose are you integrating multi-omics data? Are you trying to characterize cell or disease subtypes, discover novel targets, work on drug repositioning, or validate biomarkers (and if so, diagnostic or prognostic ones)?

Different biological questions can steer a project in very different directions. They affect which omics technologies we choose, which datasets we curate, and which analysis methods we employ.

Asking a solid question really helps define the whole multi-omics integration project. Compare “I want to find the biomarkers of colorectal cancer” with “I want to find the prognostic biomarkers of colorectal cancer in response to PD-1/PD-L1 blockade therapy”. The two questions already signpost different data collection strategies and different groups for comparison: colorectal cancer vs. healthy for the first question, and non-responders vs. responders for the second.

2. Be selective: consider what omics to use, and what not

Make sure the technology you choose is appropriate for your biological question. What are the pros and cons of the data it generates, compared with data from other technologies?

Different “omes” and the biological layers they represent

Take the question of which pathways are significantly enriched as an example. A classic input for pathway enrichment analysis is transcriptomics data, as transcripts are amplifiable and easier to quantify [1].

You can consider integrating transcriptomics with other omics for a more comprehensive view of enriched pathways. However, make sure that you are aware of the trade-offs. Proteomics datasets generated by mass spectrometry may be biased towards detecting highly expressed proteins, causing variation between experiments [2,3]. As for metabolomics, high-throughput compound annotation is a major bottleneck, which makes metabolomic profiles sparser and more ambiguous than transcriptomic ones [4]. SNPs from GWAS can also be used as input for pathway enrichment analysis, yet they cannot always be mapped to genes because they can fall in either coding or noncoding regions. SNPs are also not evenly distributed across the genome [5].

Each technology comes with different advantages and shortcomings that affect how statistical methods should be selected for pathway enrichment analysis. This leads to the 3rd point: make sure that you employ the right methods to treat such data types.

3. Analysis methods? It’s NOT one-size-fits-all.

Many of us naturally gravitate towards the favorite tools we use for single-omics analysis. However, we should keep an open mind and never blindly reuse the same tools just to get a result.

For example, single-cell RNA-seq data is much more complex than its predecessor, bulk RNA-seq. The number of samples (cells) can be 1,000 times larger (a single-cell RNA-seq dataset may contain hundreds of thousands of cells), meaning more information and more noise. Bulk RNA-seq data, on the other hand, typically covers a few dozen samples and reports average expression. It is simpler, but many consider it less informative because cell-type-specific signals can often be lost. Both require appropriate methods for QC, visualization, and analysis that account for their respective limitations.

As for pathway enrichment analysis, some methods suit a particular omics type better than others because of the nature of the data. You can refer to the summary by Zhao and Rhee (2023) for a list of pathway enrichment analysis methods and the omics data they are suited to.

Representative tools for performing pathway enrichment analysis using different types of omics datasets (Zhao and Rhee, 2023)

4. Value the data quality over quantity

The more data, the more information we obtain. However, you should prioritize the data quality over the quantity. Make sure that the data employed in the analysis come from a carefully QC-ed study or experiment.

A good practice is to look at the Methods section and see how the authors collected and preprocessed the data, what tools they used, and whether the study has been carefully peer-reviewed. Did they follow best practices for data collection, processing, and annotation, and comply with common standards and formats for data representation and sharing? Does the experimental design contain biases (skewed towards one gender, for example)? In single-cell RNA-seq, for instance, a high percentage of mitochondrial gene expression is one indicator of poor sample quality, so it’s good to check that ratio when curating a single-cell dataset.
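The mitochondrial check can be sketched in a few lines of plain Python. This is a minimal illustration rather than a replacement for a dedicated single-cell QC toolkit. The function name `percent_mito`, the `MT-` gene-symbol prefix (the usual convention for human mitochondrial genes), the toy counts, and the 20% cutoff are all assumptions made for the example; a sensible cutoff depends on the tissue and protocol.

```python
def percent_mito(counts, gene_names, prefix="MT-"):
    """Return the percentage of counts from mitochondrial genes for each cell.

    counts: one list of raw counts per cell, aligned with gene_names.
    prefix: assumed naming convention for mitochondrial genes.
    """
    # Upper-casing the symbol lets the same prefix match "mt-" style names too.
    mito_idx = [i for i, g in enumerate(gene_names) if g.upper().startswith(prefix)]
    result = []
    for cell in counts:
        total = sum(cell)
        mito = sum(cell[i] for i in mito_idx)
        # Cells with no counts at all get 0.0 instead of a division error.
        result.append(100.0 * mito / total if total else 0.0)
    return result


genes = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH"]
cells = [
    [5, 5, 40, 50],    # 10 of 100 counts are mitochondrial
    [30, 30, 20, 20],  # 60 of 100 counts are mitochondrial
]

pct = percent_mito(cells, genes)

# Hypothetical cutoff, used here only to show how cells could be flagged.
MAX_PCT_MITO = 20.0
flagged = [i for i, p in enumerate(pct) if p > MAX_PCT_MITO]
print(pct, flagged)
```

Looking at the distribution of this percentage across the whole dataset, rather than relying on a single fixed cutoff, is usually the safer way to judge sample quality.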

5. Make sure you compare apples with apples

Pay attention to the experimental design of each dataset and make sure they are compatible for integration. We don’t want to compare apples with oranges.

Are they studying the same population of interest? Some studies compare the diseased tissue with “adjacent normal tissue”, while others use a “healthy control” or “peripheral blood” as the reference. This may lead to discrepancies in the results when you try to integrate these datasets.

Therefore, the research background and experimental design of each omics dataset are very important. You may want to carefully read through the contextual data and metadata (such as gender, age, treatment, time, location, and others) to ensure the input data are compatible for multi-omics integration [6].
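One way to make this metadata review more systematic is to compare the values each dataset reports for the fields you care about. The sketch below is only an illustration: the function name `metadata_mismatches`, the field names, and the two toy studies are hypothetical, and real metadata usually needs cleaning before a comparison like this is meaningful.

```python
def metadata_mismatches(datasets, fields):
    """Return the fields whose sets of values differ between datasets.

    datasets: {dataset_name: [per-sample metadata dict, ...]}
    fields:   metadata keys to compare across datasets.
    """
    mismatches = {}
    for field in fields:
        # Collect the distinct values each dataset reports for this field.
        values = {
            name: {sample.get(field) for sample in samples}
            for name, samples in datasets.items()
        }
        # More than one distinct value set means the datasets disagree.
        if len({frozenset(v) for v in values.values()}) > 1:
            mismatches[field] = values
    return mismatches


# Toy metadata, invented for illustration.
datasets = {
    "study_A": [
        {"tissue": "colon", "control_type": "adjacent normal"},
        {"tissue": "colon", "control_type": "adjacent normal"},
    ],
    "study_B": [
        {"tissue": "colon", "control_type": "healthy control"},
    ],
}

report = metadata_mismatches(datasets, ["tissue", "control_type"])
for field, values in report.items():
    print(field, values)
```

A mismatch does not automatically rule out integration, but it marks the places where the experimental designs need a closer look.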

6. Standardize and harmonize comprehensively

Different studies and technologies can render data in different formats, units, ontologies, etc. You may stumble upon many kinds of data: raw FASTQ files, raw count matrices, processed matrices, analysis result tables, and so on. Authors can call the same condition or cell type by different names, filter genes, proteins, or metabolites using different criteria, and apply different normalization methods. Therefore, we should standardize and harmonize the data carefully so that they are comparable.
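The naming problem can be illustrated with a simple synonym lookup. This is a minimal sketch under stated assumptions: the `SYNONYMS` table and the function `harmonize_labels` are invented for the example, and in practice the target terms would come from a controlled vocabulary or ontology rather than a hand-written dictionary.

```python
# Hypothetical synonym table: normalized author label -> preferred term.
SYNONYMS = {
    "t cell": "T cell",
    "t-cell": "T cell",
    "t lymphocyte": "T cell",
    "crc": "colorectal cancer",
    "colorectal cancer": "colorectal cancer",
}


def harmonize_labels(labels, synonyms):
    """Map author-supplied labels to preferred terms.

    Returns (mapped, unmapped). Labels without a known mapping become None
    in `mapped` and are collected in `unmapped` for manual review.
    """
    mapped, unmapped = [], []
    for label in labels:
        # Normalize case and collapse whitespace before the lookup.
        key = " ".join(label.lower().split())
        if key in synonyms:
            mapped.append(synonyms[key])
        else:
            mapped.append(None)
            unmapped.append(label)
    return mapped, unmapped


labels = ["T-cell", "  t  lymphocyte", "CRC", "Treg"]
mapped, unmapped = harmonize_labels(labels, SYNONYMS)
print(mapped)
print(unmapped)
```

Returning the unmapped labels separately, instead of guessing a match, keeps ambiguous names visible so a curator can decide how to handle them.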

While harmonization involves mapping data to the same ontologies, standardization means ensuring the data are consistent and compatible in the way they are collected, processed, and stored. This can involve several steps, such as filtering data with consistent criteria, normalizing data, and converting data to a comparable scale or unit of measurement [7]. Transforming expression values into ranks is also a common practice that helps alleviate batch effects.
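The rank transformation can be sketched as follows. This is an illustrative version only: the function name `rank_scale` and the toy values are assumptions, ties receive their average rank, and ranks are rescaled to the 0-1 range so that datasets of different sizes stay comparable.

```python
def rank_scale(values):
    """Replace each value by its rank, scaled to the range 0..1.

    Tied values share the average of the ranks they span.
    """
    n = len(values)
    if n <= 1:
        return [0.0] * n
    # Indices of the values, sorted from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j to the end of the current group of tied values.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0  # zero-based average rank of the group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]


# The same three genes measured in two batches on very different scales.
batch_a = [10, 200, 50]
batch_b = [1.0, 20.0, 5.0]

scaled_a = rank_scale(batch_a)
scaled_b = rank_scale(batch_b)
print(scaled_a, scaled_b)
```

Because only the ordering within each batch is kept, a batch-wide shift or rescaling of the raw values no longer affects the result. The price is that the magnitude of the differences between values is discarded.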

A screenshot of the Multi-omics Target Prioritization pipeline in our CDIAM Platform, which combines different omics data to identify potential targets for prioritization. In each experiment, targets are scored, and their scores are normalized to a 0-1 scale so they can be integrated with the results from other experiments.

Summary

These are six tips for avoiding common mistakes in multi-omics data integration. We hope you find them helpful. Still, this is a very challenging analysis approach, and we recommend that everyone keep abreast of new developments and methods.

About Pythia

Pythia Biosciences is a multi-omics software and service company. Pythia believes "every scientist is important because every patient is important", and with this vision we help scientists and their companies bring medicines to the patients who need them. We are committed to offering pharmaceutical scientists intuitive, cutting-edge software and services that empower them to overcome multi-omics data analysis challenges, uncover important insights for therapeutic and biomarker development, and achieve these goals faster.

Learn more about our multi-omics analysis platform CDIAM here.
