A Guide to Single Cell Multiomics Data Analysis
Once I have sequenced my single-cell libraries, how do I analyze the data?
September 17, 2024
This is a common question on many single-cell researchers' minds.
Getting started with single-cell data analysis can be daunting, but with the right resources and guidance you can analyze your data effectively and extract meaningful insights. Here's a step-by-step guide to help you get started.
1. Understanding Your Data
Familiarize yourself with the structure and characteristics of your single-cell dataset. Understand the experimental design, sequencing technology and configuration used, and the type of data generated (e.g. gene expression, chromatin accessibility, protein expression).
Typically, single-cell multiomics data are provided as FASTQ files comprising R1 (Read 1), R2 (Read 2), and index read files. Explore the FASTQ file names and associated metadata to understand sample annotations, experimental conditions, and quality control metrics.
Sometimes the index reads are appended to the headers of R1/R2 rather than provided as a separate file. R1 files usually contain the cell label (barcode) and UMI, while R2 files contain the actual transcript information.
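As a minimal sketch of how R1 carries the cell label and UMI, the Python below reads a four-line FASTQ record and slices the sequence at fixed positions. The 16-base cell label, the 12-base UMI, and the function names are assumptions chosen for illustration only; real read structures differ between platforms and chemistries, so check the documentation for your library type.

```python
import io
from typing import Iterator, NamedTuple

# Hypothetical layout for illustration only: real read structures vary by platform.
CELL_LABEL_LEN = 16
UMI_LEN = 12


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from an uncompressed FASTQ text stream (4 lines per read)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield FastqRecord(header[1:], sequence, quality)


def split_r1(record: FastqRecord):
    """Return (cell_label, umi) from an R1 read under the assumed fixed layout."""
    cell_label = record.sequence[:CELL_LABEL_LEN]
    umi = record.sequence[CELL_LABEL_LEN:CELL_LABEL_LEN + UMI_LEN]
    return cell_label, umi


# A single made-up R1 read: 16-base label, 12-base UMI, 4 trailing bases.
example_r1 = io.StringIO(
    "@read1\n"
    "ACGTACGTACGTACGT" "TTTTGGGGCCCC" "AAAA\n"
    "+\n"
    + "I" * 32 + "\n"
)
labels_and_umis = [split_r1(rec) for rec in read_fastq(example_r1)]
```

For real data you would open the (usually gzip-compressed) R1 file with `gzip.open(path, "rt")` and pass that handle to `read_fastq` instead of the in-memory example.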
2. Preprocessing and Quality Control
Perform preprocessing steps to clean and filter your single-cell data. This may involve trimming adapter sequences and removing low-quality or short reads. FastQC and MultiQC are commonly used to QC FASTQ files: FastQC generates QC metrics for each file, and MultiQC aggregates those reports across samples. Neither program makes any changes to the FASTQ files.
After analyzing the QC metrics, programs like Trimmomatic, Cutadapt, or fastp can be used to filter out poor-quality reads and trim poor-quality bases from your samples. Some modern aligners and pipelines have built-in trimming, so check what your downstream read-processing steps do before trimming any reads.
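To make the idea of quality trimming and filtering concrete, here is a simplified sketch in Python. It assumes Phred+33 quality encoding, and the thresholds (minimum quality 20, minimum length 20) and function names are illustrative assumptions. Dedicated tools such as those named above use more sophisticated strategies (for example sliding windows and adapter matching), so treat this as an illustration of the concept rather than a replacement for them.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(sequence, quality, min_quality=20):
    """Remove trailing bases whose Phred score is below min_quality."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def passes_filter(sequence, quality, min_length=20, min_mean_quality=20):
    """Keep a read only if it is long enough and its mean quality is acceptable."""
    if not sequence or len(sequence) < min_length:
        return False
    scores = phred_scores(quality)
    return sum(scores) / len(scores) >= min_mean_quality


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "IIIIIIII##")
keep_read = passes_filter(trimmed_seq, trimmed_qual, min_length=5)
```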
3. Read Alignment and Quantification
Once sequence quality reaches an acceptable level, reads can be mapped to the genome (for genomic or epigenomic datasets) or to the annotated transcriptome (for transcriptomic datasets). If a proteomic readout is included, it usually maps to a known reference of the oligos that were conjugated to the antibodies of interest. Once reads are aligned to the target, a sorted SAM/BAM file is typically generated (some aligners do not generate SAM/BAM files by default, so this option may need to be turned on). This file contains alignment details for all attempted reads, including the genomic coordinates (chromosome, start, end, strand) where sequences matched, if at all. It also records insertions, deletions, and mismatches between input and target sequences. These details, along with genomic coordinates from gene/transcript models, are used to quantify the number of reads sequenced from a gene/transcript.
Many tools are available for read alignment and quantification, and one of the most common is STAR. The BD Rhapsody™ Sequence Analysis Pipeline is built around STAR and is optimized for BD Rhapsody™ Single Cell Multiomics System data.
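For UMI-based libraries, quantification generally means counting distinct UMIs per cell and gene rather than raw reads, so that PCR duplicates are counted once. The sketch below illustrates that idea in Python. The input format (one tuple of cell label, UMI, and assigned gene per aligned read) and the function name are assumptions for illustration, and the sketch performs no UMI error correction, which real pipelines typically add.

```python
from collections import defaultdict


def count_umis(assignments):
    """Count distinct UMIs per (cell label, gene).

    `assignments` is an iterable of (cell_label, umi, gene) tuples, one per
    aligned read; gene is None when the read did not map to an annotated feature.
    """
    umis_seen = defaultdict(set)
    for cell_label, umi, gene in assignments:
        if gene is None:
            continue  # skip reads without a gene assignment
        umis_seen[(cell_label, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


# Made-up reads for illustration.
reads = [
    ("CELL_A", "AAAC", "CD3E"),
    ("CELL_A", "AAAC", "CD3E"),   # same cell, gene, and UMI: a duplicate
    ("CELL_A", "GGTT", "CD3E"),
    ("CELL_B", "AAAC", "MS4A1"),
    ("CELL_B", "CCCA", None),     # unassigned read
]
counts = count_umis(reads)
```

Here `counts` holds two molecules of CD3E for CELL_A and one molecule of MS4A1 for CELL_B, even though five reads were supplied.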
4. Normalization and Batch Correction
Normalize your single-cell data to account for differences in sequencing depth and other technical artifacts. Common normalization methods include total count normalization, library size scaling, and log transformation.
Address batch effects and other sources of technical variation by applying batch correction algorithms or incorporating batch covariates into downstream analyses. Consider using tools such as LIGER, Harmony, or Seurat's integration methods for batch correction. Dedicated tools can similarly be used to remove ambient RNA, detect doublets, and filter on mitochondrial or ribosomal gene content.
The BD Rhapsody™ Sequence Analysis Pipeline has RSEC and DBEC UMI adjustment algorithms built in to remove the effect of UMI errors on counts. Many popular single-cell tools have functions for additional normalization, such as the NormalizeData function in Seurat and the normalize_total and log1p functions in Scanpy.
At this point, it's a good idea to conduct additional quality control checks to assess the overall quality and integrity of your dataset. Evaluate metrics such as the number of UMIs per cell, the number of features (genes) per cell, the percentage of reads mapped to the genome/transcriptome, and mitochondrial gene content to understand the quality of your data.
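The arithmetic behind total count normalization followed by log transformation can be sketched in a few lines of plain Python. The function name and the target sum of 10,000 counts per cell are assumptions for illustration; in practice you would normally use the Seurat or Scanpy functions, whose defaults and options may differ from this sketch.

```python
import math


def normalize_total_log1p(counts, target_sum=10_000):
    """Scale each cell to the same total count, then apply log(1 + x).

    `counts` is a list of per-cell lists (cells x genes) of raw UMI counts.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with zero counts are left as all zeros.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in cell])
    return normalized


# Two made-up cells with the same composition but a 10x difference in depth.
raw = [[1, 3, 0], [10, 30, 0]]
norm = normalize_total_log1p(raw)
```

After normalization the two cells have identical values, which is the point of the step: differences in sequencing depth no longer masquerade as differences in expression.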
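The per-cell metrics mentioned here are straightforward to compute from a count matrix. The sketch below shows one way to do so for a single cell. The function name is hypothetical, and identifying mitochondrial genes by an "MT-" symbol prefix is an assumption that suits human gene symbols (mouse symbols use "mt-", which the case-insensitive match also covers); adjust it for your annotation.

```python
def cell_qc_metrics(gene_names, cell_counts, mito_prefix="MT-"):
    """Compute basic QC metrics for one cell from its per-gene UMI counts."""
    total_umis = sum(cell_counts)
    genes_detected = sum(1 for count in cell_counts if count > 0)
    mito_umis = sum(
        count
        for gene, count in zip(gene_names, cell_counts)
        if gene.upper().startswith(mito_prefix)
    )
    pct_mito = 100.0 * mito_umis / total_umis if total_umis else 0.0
    return {
        "total_umis": total_umis,
        "genes_detected": genes_detected,
        "pct_mito": pct_mito,
    }


# Made-up counts for one cell across four genes.
genes = ["CD3E", "MS4A1", "MT-CO1", "MT-ND1"]
metrics = cell_qc_metrics(genes, [5, 0, 3, 2])
```

Cells with very few UMIs or genes, or with an unusually high mitochondrial percentage, are common candidates for removal, but suitable thresholds depend on the tissue and protocol.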
5. Dimensionality Reduction and Visualization
Reduce the dimensionality of your single-cell data to facilitate visualization and exploratory analysis. Techniques such as principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) can be used to project high-dimensional data into lower-dimensional space.
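As a rough illustration of what PCA does, the sketch below projects a small cells-by-genes matrix onto its top principal components using NumPy's singular value decomposition. The function name and the simulated two-group data are assumptions for illustration; single-cell toolkits provide their own PCA, t-SNE, and UMAP implementations that are better suited to large sparse matrices.

```python
import numpy as np


def pca(matrix, n_components=2):
    """Project a cells x genes matrix onto its top principal components."""
    centered = matrix - matrix.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal axes.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained_variance = (s ** 2) / (matrix.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return scores, ratio[:n_components]


# Simulated data: two groups of 20 cells that differ in the first two genes.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.1, size=(20, 5))
group_b = rng.normal(0.0, 0.1, size=(20, 5)) + np.array([3.0, 3.0, 0.0, 0.0, 0.0])
expression = np.vstack([group_a, group_b])

pc_scores, variance_ratio = pca(expression, n_components=2)
```

In this toy example the first component captures the difference between the two groups, so plotting `pc_scores[:, 0]` against `pc_scores[:, 1]` would show two separated clouds of cells.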
Visualize your single-cell data using scatter plots, heatmaps, and trajectory plots to identify cell clusters, infer cell types, and explore transcriptional dynamics.
6. Clustering and Cell Type Identification
Perform clustering analysis to identify distinct cell populations within your single-cell dataset. Use clustering algorithms such as k-means, hierarchical clustering, or graph-based clustering to partition cells into groups based on their transcriptional profiles. Assign cell types to clusters based on known marker genes, differential gene expression analysis, or reference datasets. Resources such as PanglaoDB, Tabula Muris, and CellMarker can be used for manual annotation, while Azimuth and scType can be used for automated reference-based and marker-based annotation, respectively.
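Marker-based annotation can be illustrated with a deliberately simple scoring scheme: average the expression of each cell type's marker genes within a cluster and pick the highest-scoring type. This sketch is not the algorithm used by any of the tools named above; the function name, the marker lists, and the expression values are all assumptions for illustration.

```python
def annotate_cluster(mean_expression, marker_sets):
    """Return the best-scoring cell type label for one cluster.

    mean_expression: dict mapping gene -> mean normalized expression in the cluster
    marker_sets: dict mapping cell type -> list of marker genes
    """
    type_scores = {}
    for cell_type, marker_genes in marker_sets.items():
        # Genes missing from the cluster profile count as zero expression.
        values = [mean_expression.get(gene, 0.0) for gene in marker_genes]
        type_scores[cell_type] = sum(values) / len(values) if values else 0.0
    best_type = max(type_scores, key=type_scores.get)
    return best_type, type_scores


# Illustrative marker lists and made-up cluster averages.
markers = {"T cell": ["CD3D", "CD3E"], "B cell": ["MS4A1", "CD79A"]}
cluster_means = {"CD3D": 2.1, "CD3E": 1.8, "MS4A1": 0.1, "CD79A": 0.0}
label, type_scores = annotate_cluster(cluster_means, markers)
```

A real workflow would also check how close the top two scores are and leave ambiguous clusters unlabeled for manual review.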
7. Other Analyses
Identify genes that are differentially expressed between cell populations or experimental conditions using statistical tests such as the Wilcoxon rank-sum test, likelihood ratio test, or negative binomial regression. Explore the biological significance of differentially expressed genes by conducting functional enrichment analysis using gene ontology (GO) annotations, pathway databases, or transcription factor binding site analysis.
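To show what the Wilcoxon rank-sum test computes for a single gene, here is a small pure-Python sketch. It uses the normal approximation without tie or continuity corrections, so it is an illustration under those simplifying assumptions rather than a production implementation; in practice a library routine such as `scipy.stats.mannwhitneyu` is the safer choice, and p-values across many genes need multiple-testing correction. The function names and expression values are made up.

```python
import math


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def wilcoxon_rank_sum(group_a, group_b):
    """Two-sided rank-sum test via the normal approximation (no tie correction)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = rank_with_ties(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    u_stat = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_stat - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_stat, z, p_value


# Made-up normalized expression of one gene in two clusters.
cluster_1 = [5.1, 4.8, 6.0, 5.5, 5.9]
cluster_2 = [0.2, 0.0, 1.1, 0.4, 0.9]
u_stat, z, p_value = wilcoxon_rank_sum(cluster_1, cluster_2)
```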
Trajectory analysis reconstructs dynamic cellular processes, including differentiation, maturation, responses to stimuli, and cell cycle progression, which reflect the true biological nature of cells. Tools like Monocle3 and Slingshot can model developmental trajectories in single-cell RNA sequencing data, whereas tools like velocyto and scVelo are used for RNA velocity. While traditional trajectory inference methods reconstruct cellular dynamics from a population of cells of varying maturity, RNA velocity relies on a model of splicing dynamics that compares unspliced and spliced transcript abundance within each cell.
8. Integration with other Omics Data
Integrate your single-cell data with other omics datasets (e.g., proteomics, chromatin accessibility, methylation assays) to gain a comprehensive understanding of cellular biology and regulatory mechanisms. Seurat and Scanpy have built-in tools that can help with integrating multi-modal datasets. After integrating a new dataset, re-run the initial normalization and dimensionality reduction steps and redo any necessary analyses on your multi-modal dataset.
9. Documentation and Reproducibility
Document your data analysis workflow, including software packages used, parameter settings, and analysis scripts, to ensure transparency and reproducibility. Organize your data, code, and results in a structured and well-documented format for easy sharing with collaborators and dissemination to the scientific community. Depositing single-cell multiomics data in publicly accessible repositories is crucial for promoting transparency, reproducibility, and data sharing. Prominent repositories where researchers deposit their single-cell multiomics data include the Gene Expression Omnibus (GEO), the Sequence Read Archive (SRA), and the European Nucleotide Archive (ENA).
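One lightweight way to document software versions and parameter settings is to write them to a small JSON record alongside your results. The sketch below does this with the Python standard library. The function name, the package list, and the parameter names and values are hypothetical placeholders; substitute the packages and settings your own workflow actually uses.

```python
import json
import platform
from importlib import metadata


def build_run_record(packages, parameters):
    """Collect interpreter, platform, package versions, and analysis parameters."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
        "parameters": parameters,
    }


# Hypothetical package list and parameter settings for illustration.
record = build_run_record(
    packages=["numpy", "scanpy"],
    parameters={"min_genes_per_cell": 200, "n_principal_components": 50},
)
record_json = json.dumps(record, indent=2, sort_keys=True)
```

Saving `record_json` next to each set of results, and keeping the analysis scripts under version control, makes it much easier to trace how a figure or table was produced.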
Check out the BD Rhapsody™ Sequence Analysis Pipeline.