Cell Type Prediction via MetaReference

BioTuring Team
March 4, 2025

Cell type annotation is a critical step in scRNA-seq data analysis: it lets scientists identify the cell types that drive a biological process. However, accurately assigning cell type labels to individual cells remains a challenge. It requires navigating high-dimensional gene expression data and biological variability, and it demands robust, scalable computational methods. Manual annotation using a traditional marker gene-based approach is both subjective and time-consuming. As current knowledge of cell types expands and the need to annotate rarer cell types becomes more apparent, we become more dependent on computational approaches that automatically annotate cells based on reference datasets.

To address this, we’ve developed a cell type prediction method that automatically labels cells based on a comprehensive, manually curated database of scRNA-seq data spanning over 100 million cells, 2,500 studies, and 2 species.

Methods

Cell type prediction via MetaReference uses a model that searches for relevant scRNA-seq studies on Talk2Data, our database of manually curated scRNA-seq datasets. Our data curation team curates each study by standardizing the ontology terms within the authors’ metadata categories. These include, but are not limited to, cell type, disease, and tissue. As a result, you can filter for relevant studies to use in the prediction. The model then assigns a cell type label based on how closely your cell clusters’ expression profiles match those of cell clusters in the relevant studies. You can further refine the cell type labels by reviewing each cell type’s similarity score, its marker genes, and the weighted Log2FC of each marker gene.

Step 1. Database filtration

You can define which studies to include in the cell type prediction model by filtering for cell type, tissue, condition, and suspension type. The model will restrict its search to those studies, producing more precise cell type annotation results. For example, if your dataset contains colon tissue from a colon cancer patient, you may filter for studies containing only colon tissue affected by colon cancer.
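The filtering step can be pictured as a simple metadata match. The sketch below is only an illustration: the study records, the field names (`tissue`, `condition`, `suspension_type`) and the helper `filter_studies` are hypothetical and do not reflect the actual Talk2Data schema or API.

```python
# Hypothetical study records; field names are illustrative only,
# not the real Talk2Data schema.
studies = [
    {"id": "study_A", "tissue": "colon", "condition": "colon cancer", "suspension_type": "cell"},
    {"id": "study_B", "tissue": "colon", "condition": "normal", "suspension_type": "cell"},
    {"id": "study_C", "tissue": "lung", "condition": "lung cancer", "suspension_type": "nucleus"},
]


def filter_studies(studies, **criteria):
    """Keep only the studies whose metadata matches every given criterion."""
    return [
        study
        for study in studies
        if all(study.get(field) == value for field, value in criteria.items())
    ]


# Restrict the reference to colon tissue affected by colon cancer.
relevant = filter_studies(studies, tissue="colon", condition="colon cancer")
print([study["id"] for study in relevant])
```

Because the ontology terms are standardized during curation, an exact match on each field is enough in this toy version; unstandardized free-text metadata would need fuzzier matching.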

Step 2. Marker gene list and weighted Log2FC calculation 

The cell type prediction model requires two inputs: a gene expression matrix and a metadata category containing cell clusters (typically obtained via Louvain or Leiden clustering). For each cluster in your dataset, the model outputs a list of marker genes, which are calculated using a weighted log2 fold change approach.

Weighted log2 fold change (wLog2FC) is a method used to identify marker genes by quantifying the difference in gene expression between cell clusters. While log2 fold change (Log2FC) alone highlights gene expression differences, it may not accurately reflect marker gene specificity. Therefore, wLog2FC incorporates the difference between the percentage of cells expressing the gene within a cluster (g1, group 1) and outside it (g2, group 2), which favors genes that are highly enriched in a specific cluster. Additionally, to further emphasize genes with minimal expression in other clusters, we include the complement of the percentage of expressing cells outside the cluster (100 – outside coverage percentage). In short, the score for a marker gene considers both its expression level and the percentage of cells that express it inside and outside the cluster. Together, these terms provide a more robust measure for marker gene selection.

[Figure: weighted log2FC formula]
Fig 1. Formula for calculating the weighted log2FC of a marker gene. pct = percentage. g1 = group 1. g2 = group 2.
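To make the ingredients of wLog2FC concrete, here is a minimal sketch for a single gene. It is not the formula from Fig 1. It assumes that the three terms described above (Log2FC, the percentage difference, and the complement of the outside percentage) are simply multiplied, and it assumes a pseudocount of 1 when computing Log2FC. The function name `weighted_log2fc` is hypothetical.

```python
import math


def weighted_log2fc(in_cluster, out_cluster, pseudocount=1.0):
    """Score one gene from its expression values inside / outside a cluster.

    ASSUMPTION: the terms are combined as a plain product; the real
    formula (Fig 1) may weight or scale them differently.
    """
    mean_g1 = sum(in_cluster) / len(in_cluster)
    mean_g2 = sum(out_cluster) / len(out_cluster)
    # Plain log2 fold change, with a pseudocount (an assumption) to avoid log(0).
    log2fc = math.log2((mean_g1 + pseudocount) / (mean_g2 + pseudocount))

    # Percentage of cells with non-zero expression in each group.
    pct_g1 = 100 * sum(v > 0 for v in in_cluster) / len(in_cluster)
    pct_g2 = 100 * sum(v > 0 for v in out_cluster) / len(out_cluster)

    # Log2FC x (difference in coverage) x (complement of outside coverage).
    return log2fc * (pct_g1 - pct_g2) * (100 - pct_g2)


# A gene expressed almost only inside the cluster...
specific = weighted_log2fc([4, 5, 6, 5], [0, 0, 0, 1])
# ...versus a gene that is also widely expressed outside it.
broad = weighted_log2fc([4, 5, 6, 5], [2, 3, 0, 3])
print(specific > broad)
```

In this toy product form, a gene depleted in the cluster (negative Log2FC and negative percentage difference) would also receive a positive score, so the sketch is only meaningful for genes that are up-regulated in the cluster.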

Step 3. Similarity score calculation

Next, the model calculates a similarity score between your list of marker genes and each relevant study’s list of marker genes using the Rank-biased Overlap (RBO) method. This method measures the similarity of the two marker gene lists while taking their order into account, so agreement near the top of the lists counts more than agreement further down.

[Figure: rank-biased overlap formula]
Fig 2. Formula for calculating the similarity score, using the rank-biased overlap method.
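As an illustration of how rank-biased overlap behaves, the sketch below implements the standard truncated form of RBO, (1 − p) · Σ p^(d−1) · A_d, where A_d is the fraction of shared genes among the top d entries of both lists. The persistence parameter `p = 0.9`, the truncation at the shorter list’s length, and the function name are assumptions; the variant and settings used by the tool may differ.

```python
def rank_biased_overlap(list_a, list_b, p=0.9):
    """Truncated RBO of two ranked lists (assumed to contain no duplicates).

    ASSUMPTION: p=0.9 and truncation at the shorter list are illustrative
    choices, not the tool's documented settings.
    """
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        agreement = len(seen_a & seen_b) / d  # overlap of the two top-d prefixes
        score += p ** (d - 1) * agreement     # deeper ranks get smaller weights
    return (1 - p) * score


query = ["CD3E", "CD3D", "IL7R", "CCR7"]
swap_at_bottom = ["CD3E", "CD3D", "CCR7", "IL7R"]
swap_at_top = ["CD3D", "CD3E", "IL7R", "CCR7"]

# The same four genes appear in both references, but a disagreement
# at the top of the list is penalized more than one at the bottom.
print(rank_biased_overlap(query, swap_at_bottom) > rank_biased_overlap(query, swap_at_top))
```

Note that this truncated form never reaches 1 for finite lists: two identical lists of length k score 1 − p^k. Extrapolated variants of RBO correct for this.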

Step 4. Cell type labeling 

Finally, the model uses a majority voting approach to label the cell type for each cluster. 
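A majority vote can be sketched in a few lines. The example assumes that each of the best-matching reference clusters contributes one vote with its curated cell type label; how the tool actually selects and weights the voters is not described here, so treat the inputs and the helper `majority_vote` as hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label and the fraction of votes it received."""
    counts = Counter(labels)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(labels)


# Hypothetical labels of the reference clusters most similar to one query cluster.
votes = ["T cell", "T cell", "NK cell", "T cell", "B cell"]
label, support = majority_vote(votes)
print(label, support)
```

Returning the vote fraction alongside the label gives a rough indication of how contested the call was. In this sketch, ties are broken in favor of the label that appears first in the input.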

Advantages

Automatic annotation with manual refinement

The cell type prediction tool labels cells quickly and accurately, while still giving you the flexibility to refine the results so that they better reflect the nuances of your dataset. You simply need to define how many cell clusters you would like to annotate, then filter for relevant studies.

Based on a large relevant reference dataset

The cell type prediction tool draws on a large database of manually curated scRNA-seq data in which ontology terms are standardized for easier filtering. Filtering only for relevant studies, for example those that reflect a tissue type or condition similar to your data, can deliver more precise cell type labels.

Identification of rare cell types

Additionally, manual refinement of the initial prediction allows users to identify rarer cell types. This requires both refining the clustering resolution and inspecting the marker genes of each cluster to identify cell types that studies may not have labelled.

Limitations

Gaps in the single-cell database

While our repository of single-cell RNA-seq data is extensive, some cell types may not have standardized ontologies. If the database does not contain a certain cell type, the cell type prediction tool cannot predict that cell type. Manually fine-tuning the prediction results based on marker genes therefore helps bridge the gaps in the database.

Quality of clustering

The results of cell type prediction rely heavily on the quality of the clustering technique (Louvain, Leiden, etc.). When performing cell type prediction on the clustering results, each cluster will be annotated as a single cell type; however, depending on the resolution and quality of the clustering results, one cluster may not be biologically representative of one cell type. Thus, to address this limitation, we offer multiple methods of clustering with flexible parameters.
