BioTuring Single-cell Browser Guidebook

Learn how to use BioTuring Single-cell Browser with our step-by-step instructions.

Introduction

About the software

BioTuring Browser, or BBrowser, is a desktop application that performs analyses on sequencing data. The software is also connected to a database hosting sequencing data from the latest publications. Users can use BBrowser to analyze their own data or analyze the public data available.

The software allows scientists, even ones without programming experience, to quickly investigate massive amounts of sequencing data from in-house and published work and compare them together. All data submitted by users and data downloaded from the BBrowser database is stored and secured on the local computer.

The application was first released in October of 2018, running on Windows, macOS, and Ubuntu.

System requirements

Operating systems
● Windows: 7 or higher (64-bit only)
● macOS: OS X El Capitan (10.11) or higher
● Linux: Ubuntu 16.04 only

Network
● Ethernet connection (LAN) or a wireless adapter (Wi-Fi)

Hard Drive
● Minimum 10 GB
● Recommended 20 GB or more

Memory (RAM)
● Minimum 8 GB
● Recommended 16 GB or above

Notice
● For an optimal experience with large datasets containing more than 100,000 cells, or when performing analyses from raw data (FASTQ files), please use a computer with the recommended configuration.
● If you experience difficulties in running the software even after fulfilling all these requirements, please contact support@bioturing.com.

The application is only available on Ubuntu 16.04 (Xenial Xerus).

After downloading the .deb package, you can install the software through either the graphical user interface or the command-line interface, as follows:

Graphical User Interface:

  • Navigate to the location of the downloaded .deb file, double click on the file or right-click and select “Open with Software Installer”.
  • Click on the “Install” button on the Software Installer window.

Command Line Interface:

  • Open a terminal with Ctrl + Alt + T and navigate to the location of the downloaded .deb file.
  • Type: sudo dpkg -i <name of the package>.deb. This installs the package. Note that dpkg does not download missing dependencies by itself.
  • If the installation reports dependency errors, type
    sudo apt install -f to install the missing dependencies and complete the installation.

Please note that the software may open a local web socket to execute R commands for certain analyses. Local TCP access to port 9004 must be permitted for the software to run properly.
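If you want to check ahead of time that the port is not already taken, a quick test like the one below may help. This is a minimal sketch and not part of BBrowser: the function name can_bind_local_port is hypothetical, and it assumes the software listens on the loopback address 127.0.0.1, which this guide does not state.

```python
import socket


def can_bind_local_port(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port right now.

    Assumption: the software listens on the loopback interface; the guide
    only names the port (9004), not the address.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # The port is already in use, or binding is not permitted.
            return False
    return True


if __name__ == "__main__":
    print("port 9004 bindable:", can_bind_local_port(9004))
```

Run the check while BBrowser is closed. A result of False suggests that another process is using the port or that a local policy blocks it.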

To install BBrowser on macOS, after downloading the installation package:

  • Navigate to the downloaded .dmg package.
  • Double click the .dmg package to open the installation window.
  • Follow the on-screen instructions to complete the installation process.

There are 2 options for running BBrowser on Windows: portable version or installed version.

Portable version

The portable version does not require any installation and is only available for Windows. Users who do not have the necessary privileges to modify the system registry can download this version to use instantly. There is no difference between the portable version and the installed version of the BBrowser in terms of the interface and functionality.

The portable version is provided in a zipped file. After downloading, you need to unzip the file and double-click on it to run the software.

Alternatively, to launch BBrowser, users can run the executable binary, BBrowser2.exe, located in a folder called BBrowser2-win32-x64.

If you want to move the software storage, you must move the entire BBrowser2-win32-x64 folder. Modifying or removing any files in this folder may cause BBrowser to stop working. In either case, your single-cell data is not affected as it is stored in a different location.

Installation

If you want to install BBrowser to your computer, download the installer .exe file and run it with administrator permission. An installation window will guide you through the installation.

Although the program can be installed anywhere on your computer, we highly recommend putting it in the usual Program Files folder. However, this requires administrator permission.

If your computer has more than one account using the software, each account can only access its own data.

To install BBrowser on CentOS 7, you first need to install some dependencies:

yum install libgfortran libXScrnSaver

Then, use the following command to install BioTuring Browser:

rpm -iU BBrowser-xxx.x86_64.rpm

Please replace “xxx” with the version that you downloaded. Installation of the software and its dependencies may require root access. After installation, BBrowser can be found in Applications > Accessories.

Program Interface

Login page

The BBrowser Login page appears when you open the software for the first time. Once the software has recorded your credentials, it logs you in automatically the next time you start BBrowser.

Please enter your credentials and declare your academic or non-academic status to access the corresponding set of features, then press Enter or click Login.

Login credentials are encrypted and stored separately for each user when multiple users share the same computer.

If you are using a network with a proxy, please configure the Proxy settings at this point, before any connection to the BioTuring server is made. The software needs correct proxy settings in order to connect to our server and verify your credentials, as well as to access our public database.

Home page

BBrowser Home page shows you all data that you can download to the local computer, including public data from BioTuring server and data from your remote repositories.

From BBrowser 2.1.3, the Home page offers another feature for you to look for gene expression levels across studies in the BioTuring database.

On the Search studies tab, choose the library you want to access by the drop-down on the top left. There are 2 libraries available:

  • Public studies: list of scRNA-seq data from publications, curated by BioTuring team.

For details about this database, please refer to Section 8 of this document.

Internet connection is needed to download data from BioTuring server. Once you are connected to the internet, new datasets will be automatically updated.

  • Remote repository: list of studies available on your organization shared network.

To access and download data from here, please configure your network on the Settings page.

To search for single or multiple gene expression across all studies in BioTuring public database, click on the Search genes tab.

Data page

BBrowser Data page shows you all data that you have downloaded or submitted to the local computer. You can refer to this page as your local database. You also need to go here to submit a new dataset.

  • Clicking Add new study opens a pop-up window for data submission.
  • Moving the mouse over a study highlights it, and clicking the study opens it in the Analysis dashboard.
  • By default, all datasets are sorted by Last modified date. You can choose to sort it by title, species, size or number of cells.
  • Rename and Delete buttons are available for each dataset.
  • The search box (Find your study) on the right can search for the study’s title, species, date, etc. However, there is no filter box or tags available in the Data page.

Settings page

BBrowser Settings page helps you to:

  • Check for software version
  • Activate and view your license and its expiration date
  • Change proxy server settings
  • Configure the shared network for remote repository data hosting
  • Manage tags to classify and organize the data you share
  • Change data storage location
  • Choose to launch app automatically after computer login
  • Allow sending log data to BioTuring server

Analysis dashboard

When you click on a study on Data page or click on Explore a study from Home page, that dataset will open in the Analysis dashboard.

Here is where you can visualize the data and perform all the analyses.

The main visualization is a scatter plot of dimensionality reduction, with each point representing a single cell. Cell color, size, and shape change when you run different analyses. The scatter plot is interactive, allowing you to zoom, move, rotate (in 3D mode), or select cells.

Inside the main visualization window are some function boxes:

  • Info box: shows a hierarchical list of sub-clusters, the cell type prediction result and gene information. The cluster box can be minimized by clicking on the main cluster. The 2 other boxes can be closed by clicking on the (x) icon below the dialogues.
  • Gene query: allows searching for single/ dual gene expression by gene name or Ensembl ID
  • Mini map: shows the whole cell population of the study with selected cells highlighted when the main scatter plot is showing gene expression or a sub-cluster
  • Navigation tools: a list of interactive visualization tools like zoom, 2D/ 3D, move, pan select, lasso select and reset the scatter plot. At the bottom is access to the clonotype dashboard, gene gallery and scatter plot export.

On the right of the main visualization window are main function tabs

There are 5 tabs here, each in a small window that can be expanded or collapsed. These tabs either give you more insight into the data or provide additional visualization:

  • Color by: This tab controls the color of the scatter plot. By changing the way cells are colored, you can visualize different clustering/ cell annotation results. You can import your annotation matrix from a file in this tab.
  • Shape by: This tab controls the shape of the cells on the scatter plot. When this tab is activated, the cell’s shape will be changed from a dot to a number or letter.
  • Composition: This tab shows the percentage of different groups of cells in a selected population. When this tab is activated, cells in the selected population are colored in the scatter plot, while unselected cells are left in gray. This tab also helps you run differential expression analysis.
  • Marker genes: This tab runs and presents results from finding marker genes function.
  • Enrichment analysis: This tab runs and presents results from enrichment analysis function

At the bottom of the function tabs is information about the study input/ output and the visualization and analysis settings.

The other 2 interfaces, the Sub-clustering dashboard and the Differential expression dashboard, are described in their own sections.

If you need help while doing analysis, press Alt (on Windows) or move your mouse to the top left of the screen (on macOS) and click on Help to view our tutorials or to contact us.

BioTuring public database

Massive amounts of single-cell RNA sequencing data generated have opened avenues for exploration, yet also brought up new challenges to standardize data formats, systematically access transcription profiles of cell types across studies and integrate multiple datasets.

Hence, in BioTuring Browser, we have indexed published single-cell RNA sequencing data from multiple formats to our platform to remove that barrier. All data are processed and annotated to be instantly accessed and explored in a uniform visualization and analytics interface.

In addition to that, we have developed our set of marker genes for over 200 cell types and use that gene list to verify the author’s annotations and re-label the cell types to BioTuring cell ontology to systemize cell types available in our database.

Users can also query the expression of a single gene or multiple genes across all datasets in the database and see how the genes are expressed in different clusters without downloading any dataset.

The section below explains how we index published data and how the gene query across the database works.

Curation method

Step 1. Data collection

Single-cell gene expression matrices or Seurat/Scanpy objects are obtained from the authors or from public repositories. If Seurat or Scanpy objects are available, we preserve the analysis results and move directly to the annotation step (Step 6).

Step 2. Filtering and normalization

Cells and genes from the submitted matrices are filtered to avoid drop-out, doublets, and apoptotic cells. Data are then subjected to log normalization and selection of highly variable genes. QC criteria follow the authors’ descriptions.

If details of the filtering and normalization are not available, we process the data ourselves to obtain results as close as possible to those in the publication.

Step 3. Batch effect correction

We follow the methods used in each study. If not provided, we will apply CCA correction.

Step 4. Dimensionality reduction and clustering

We use the first 30 components of PCA to calculate 2D and 3D t-SNE or UMAP, the parameters of which are taken from the authors’ descriptions.

Step 5. Clustering

The dataset will go through both graph-based clustering by the igraph package (Csardi and Nepusz, 2006) and k-means clustering (Neter et al., 1998).

Step 6. Annotation and standardization of cell type labels

Cell type annotation matrices are obtained from authors and loaded in BioTuring Browser, together with metadata of the experimental design. We then manually verify cell type annotations using known markers and unify the terminology based on our internal cell ontology.

If annotation and metadata are not available, we will extract information directly from the publications.

List of studies

Users of BBrowser can view all studies in the public database when opening the Home page of the software. You can also access the list of available studies on the BioTuring website: https://bioturing.com/bbrowser/datasets

We select the studies to index based on the needs of our users and community.

If you have a study of interest and would like it to be indexed by the BioTuring team, please contact us at support@bioturing.com.

If you are an author, we are very happy to distribute your data on BBrowser for public access. Please also contact us at support@bioturing.com.

Query gene expression across the database

In version 2.1.3, we introduced a special search engine to help you look at the expression of one gene or multiple genes across every public dataset of BBrowser.

BBrowser is currently connected to 126 studies with a total of more than 5.5 million cells. Without downloading anything from the server, the gene search engine lets you skim through a huge amount of information in the most efficient way.

You can find the gene search engine in Home page > Search genes tab

  • Type in or copy and paste your list of genes to the search box.
  • Click on the Search button to start searching

If you search for one gene, the result of this search engine is a series of violin plots, each of which is the gene expression in a public dataset of BBrowser. On the plot:

  • x-axis: By default, this is the graph-based clustering result. You can change it to any annotation in that dataset by clicking on the annotation name and selecting another category from the drop-down.
  • y-axis: This is the log-2 of the expression value. The unit depends on the kind of data provided by the authors. In most cases, this is the UMI count.

All violin plots are interactive. You can hover your mouse over the plot to get the statistics (e.g. quantiles, median, mean, etc.), or drag to enlarge an area of the plot. Double-clicking on any part of the plot restores the original view.

On the top right of each dataset, there is a horizontal bar telling the percentage of cells that express the gene. The search result is sorted descending based on this number.
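To illustrate the ordering described above, the sketch below computes the percentage of expressing cells per dataset and sorts the datasets in descending order. It is only an illustration under stated assumptions: a cell is taken to "express" a gene when its value is greater than zero, and the dataset names and values are hypothetical. How BBrowser computes this number internally is not documented here.

```python
def percent_expressing(values):
    """Percentage of cells whose expression value is greater than zero."""
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if v > 0) / len(values)


def rank_datasets(expression_by_dataset):
    """Sort (dataset, percentage) pairs by percentage, highest first."""
    scored = {name: percent_expressing(vals)
              for name, vals in expression_by_dataset.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-cell expression values of one gene in three datasets.
example = {
    "study_A": [0, 0, 3, 1],  # 2 of 4 cells express the gene
    "study_B": [2, 5, 1, 0],  # 3 of 4 cells express the gene
    "study_C": [0, 0, 0, 0],  # no cell expresses the gene
}
ranking = rank_datasets(example)
print(ranking)
```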

Information about the study and option to Download are the same as in the Search studies tab.

 

If you search for multiple genes, the results will be a series of heatmaps, each of which is from one dataset. Each heatmap shows:

  • x-axis: By default, this is the graph-based clustering result. You can change it into other annotations in that dataset.
  • y-axis: This is the list of genes you queried, in no particular order.
  • The log-2 expression value is shown in color scale. The unit depends on what kind of data provided by the authors. In most cases, this is the UMI count.

You can get a dataset on BBrowser by downloading it from BioTuring server or from your internal server or by importing the data from your local computer.

Currently, BBrowser supports analyzing data from human (Homo sapiens) and mouse (Mus musculus). If you input data from a species other than those, the software can still process the data (except for the transcript quantification step). However, some features that are related to gene information will be disabled, such as gene-set enrichment analysis and gene functional reminder.

BioTuring Browser hosts a public database of published studies that are selected, processed, verified and uniformly labeled by the BioTuring team. You can view the list of studies in this database in BBrowser Home page.

To download a study from BBrowser public database, you need to be connected to the internet.

  • Go to Home page > Search studies and choose Public studies.
  • Use the Search box and Filter box to look for your studies of interest.
  • Filter tags classify studies based on field of research, tissues and related diseases.
  • Search results can be sorted by created date, study title, number of downloads, number of cells and the size of the data.
  • Each study comes with author’s information, abstract, species used and GSE number.
  • Click on Download to get the study to your local computer.
  • Once the study is downloaded, click on Explore to open the study or Redownload to get the most updated version of the study.

Alternatively, you can go to Home page > Search genes tab and look for your gene(s) of interest. All studies having that gene expressed will be shown in the result. You can scroll down to find the study you want to download and click on the Download button next to that study.

Notice:

  • The Redownload button removes the old original data of a study and replaces it with the updated version. Analyses that have been done on the study (differential expression analysis, sub-clustering, etc.) will be kept. If a study in the BBrowser database gets updated, we'll send you an email to notify you to redownload it.

Given that you have configured your shared network access to the remote repository, to download a study from the remote repository:

  • Go to Home page and choose Remote repository.
  • Use the Search box and Filter box to look for your studies of interest.

Filter tags can be based on the field of research, tissues and related diseases. They can also be customized depending on your settings. Go to the Settings page to set up filter tags.

Search results can be sorted by created date, study title, number of downloads, number of cells and the size of the data.

  • Click on Download to get the study to your local computer.
  • Once the study is downloaded, click on Explore to open the study or Redownload to get the most updated version of the study.

Import FASTQ file

To import a study with raw sequencing data, you need to provide a folder with your FASTQ files.

  • Go to Data page -> Click on Add new study. On the popup window, click on Raw data.
  • Move the folder containing your FASTQ files into the input box or click anywhere inside the box to open File Explorer and select the input folder.
  • Type in the dataset title and choose the reference index compatible with your data and the sample preparation platform used.
  • Define your quality control parameters and the dimensionality reduction plot you want to view.
  • Finally, click on Start.
  • BBrowser will start processing the data and a bar will appear to show you the progress.
  • Once the processing is done, the Analysis dashboard will be open with the data you imported.

Notice:

  • FASTQ files can be unzipped (.fastq/.fq) or zipped in gzip (.fastq.gz/.fq.gz) format.
  • All input FASTQ files need to be in the same folder and at the same level (no subfolder).
  • Currently, we only support paired end reads that are in two distinct FASTQ files. The software automatically pairs the FASTQ into runs based on the file names. Please make sure that two FASTQ files of one pair have the same prefix, and it is different from other pairs’ prefix. For example: nuclei_900_S1_L001_R1_001 and nuclei_900_S1_L001_R2_001 will make a pair.
  • The alignment and quantification process is run by Hera-T (Tran et al. 2018) and needs at least 12 GB RAM. Hera-T supports raw data prepared by 10X Chromium Chemistry V2 and V3. The output of Hera-T automatically goes through the analysis pipeline as a single MTX file and can be exported in the Analysis dashboard.
  • Currently BBrowser supports mouse mm10 and human 88.p12/ GRCh37 index.
  • To download the reference index, sufficient storage space is also required. Depending on the reference index you choose, the software will notify you of the free space needed (5 GB on average for each reference file). You only need to download the reference index once, when importing FASTQ files for the first time. After that, the reference is stored on your computer and is loaded whenever needed.
  • If an error occurs while importing data, click on Send log to send the error records to our development team. We will contact you shortly to troubleshoot.
  • To edit or remove the file you have imported, hover your mouse over the file name. A gray box will appear around the data together with Edit and Remove options.
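The pairing rule in the notice above (two files of a pair share a prefix and differ only in R1/R2) can be checked before import with a small script. The sketch below is not BBrowser's own pairing logic. It assumes the naming scheme of the example file names, <prefix>_R1_<suffix> and <prefix>_R2_<suffix>, and the function name pair_fastq_files is hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming scheme, modeled on nuclei_900_S1_L001_R1_001:
# <prefix>_R1<rest> pairs with <prefix>_R2<rest>.
READ_TAG = re.compile(r"^(?P<prefix>.+)_R(?P<read>[12])(?P<rest>[_.].*)$")
FASTQ_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def pair_fastq_files(file_names):
    """Return (pairs, unpaired): a list of (R1, R2) tuples and leftovers."""
    runs = defaultdict(dict)
    for name in file_names:
        if not name.endswith(FASTQ_EXTENSIONS):
            continue  # ignore files that are not FASTQ
        match = READ_TAG.match(name)
        if match:
            key = (match.group("prefix"), match.group("rest"))
            runs[key][match.group("read")] = name
    pairs, unpaired = [], []
    for key in sorted(runs):
        reads = runs[key]
        if set(reads) == {"1", "2"}:
            pairs.append((reads["1"], reads["2"]))
        else:
            unpaired.extend(reads.values())
    return pairs, unpaired


files = [
    "nuclei_900_S1_L001_R1_001.fastq.gz",
    "nuclei_900_S1_L001_R2_001.fastq.gz",
    "nuclei_900_S1_L002_R1_001.fastq.gz",  # its R2 mate is missing
    "notes.txt",
]
pairs, unpaired = pair_fastq_files(files)
print(pairs, unpaired)
```

Any file reported as unpaired is worth renaming or removing before you start the import.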

Import Expression matrix (MTX, TSV, CSV)

BBrowser supports importing expression matrices as MTX, TSV, and CSV files. The expression matrix files can be unzipped or zipped in gzip format.

Import MTX file(s)

To import a study from a single or multiple MTX files, you need to provide, for each MTX file, a folder with exactly 3 files: genes.tsv (or features.tsv), barcodes.tsv and matrix.mtx.

When multiple folders containing data from multiple batches are submitted, options for selecting batch correction methods will be available.
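Before submitting, you may want to confirm that each input folder holds the three expected files. The following is a minimal sketch, not a BBrowser utility: the function name missing_mtx_parts is hypothetical, and accepting ".gz" variants of each file name is an assumption based on the statement that matrix files may be gzip-compressed. It only reports missing parts and does not enforce that the folder contains nothing else.

```python
from pathlib import Path

# File names taken from this guide; the ".gz" variants are an assumption.
REQUIRED = {
    "matrix": ("matrix.mtx",),
    "barcodes": ("barcodes.tsv",),
    "genes": ("genes.tsv", "features.tsv"),
}


def missing_mtx_parts(folder):
    """Return the names of the required parts that are not in the folder."""
    present = {p.name for p in Path(folder).iterdir() if p.is_file()}
    missing = []
    for part, candidates in REQUIRED.items():
        accepted = list(candidates) + [name + ".gz" for name in candidates]
        if not any(name in present for name in accepted):
            missing.append(part)
    return missing
```

For example, missing_mtx_parts("path/to/batch1") returns an empty list when the folder is complete, where "path/to/batch1" stands for one of your own input folders.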

  • Go to Data page -> Click on Add new study. On the popup window, click on Expression matrix tab
  • Choose the input folders by dragging and dropping them into the input box.
  • Or you can click the + button and select the file format as MTX. A File Explorer window will open. Navigate to the directory and select the input folder(s) containing your 3 files: genes/features.tsv, barcodes.tsv and matrix.mtx
  • Choose the species of your data, the batch correction method preferred (optional), quality control parameters and the dimensionality reduction plot you want to view.
  • Type in the dataset title and finally, click on Start.
  • Once the processing is done, the Analysis dashboard will be open with the data you imported.

If multiple folders were submitted, the Analysis dashboard will contain an input metadata classification whose cluster names are the names of the input folders. This helps you visualize (by color and shape) which batch each cell comes from.

The three files barcodes.tsv, features.tsv (or genes.tsv), and matrix.mtx are the standard files from 10X CellRanger. Below, we describe some more details of the data format that will affect the analysis.

  • matrix.mtx: This is the sparse version of the expression matrix. Currently, BBrowser only accepts UMI counts or other non-negative values. Centered values are not allowed and will not pass the analysis pipeline.
  • barcodes.tsv: Each line in this file is a barcode. Barcodes sometimes end with a number, e.g. AAACCTGAGGGTCTCC-1, which serves as a subject identifier. However, demultiplexing is not available in BBrowser, which means metadata cannot be generated from the subject identifiers. For multiplexed sequencing, users are highly recommended to provide an annotation table after submitting the data.
  • features.tsv (or genes.tsv): This file describes the rows of the count matrix. Originally, these were the genes. With the recent introduction of multi-omic protocols, the file can now contain multiple feature types (the 3rd column of the file) or even multiple species (prefixes in the 1st column). In BBrowser, if you have more than one feature type, only Gene Expression features are selected. If you have more than one species, only the species that appears first in the file is selected.
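The feature-type selection described above can be pictured with the short sketch below. It is an illustration only, not BBrowser's parser: the function name select_gene_expression is hypothetical, it assumes the CellRanger convention that the 3rd column reads "Gene Expression" for genes, and the example rows are made up for demonstration.

```python
import csv
import io


def select_gene_expression(features_tsv_text):
    """Keep rows whose 3rd column is 'Gene Expression'.

    If no row has a 3rd column (older two-column genes.tsv), all rows
    are kept unchanged.
    """
    reader = csv.reader(io.StringIO(features_tsv_text), delimiter="\t")
    rows = [row for row in reader if row]
    if not any(len(row) >= 3 for row in rows):
        return rows
    return [row for row in rows if len(row) >= 3 and row[2] == "Gene Expression"]


# Hypothetical features.tsv content with two feature types.
example = (
    "ENSG00000141510\tTP53\tGene Expression\n"
    "CD3_TotalSeqB\tCD3\tAntibody Capture\n"
)
kept = select_gene_expression(example)
print(kept)
```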

Import TSV, CSV file(s)

To import a study by single or multiple CSV/ TSV files:

  • Go to Data page -> Add new study. On the popup window, click on Expression matrix
  • Choose the input files by dragging and dropping them into the input box.
  • Or you can click the + button and select the file format as TSV, CSV. A File Explorer window will be opened. Navigate the directory and select your input file(s). Make sure that all files submitted are in the same format (either TSV or CSV)
  • Choose the species of your data, the batch correction method preferred (if you submit multiple files), quality control parameters and the dimensionality reduction plot you want to view.
  • Type in the dataset title and finally, click on Start.

A .tsv or .csv file is simply a table in which values are separated by a delimiter: a tab (in .tsv) or a comma (in .csv). Table editors such as Excel, LibreOffice, or Google Sheets can export your table in either .csv or .tsv format.

BBrowser requires a strict format in order to parse the information correctly. Please make sure that the first column of the table has the gene names / Ensembl identifiers, and the first row of the table has the barcodes.

For users who want to export a matrix using R, please be careful: writing a matrix with write.table can drop the first cell of the first row. For example, given a matrix object with 1000 rows and 500 columns:

> str(m)
num [1:1000, 1:500] 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:1000] "ENSG1111111111" "ENSG1111111112" "ENSG1111111113" "ENSG1111111114" ...
..$ : chr [1:500] "CTGGTCCGGTGTTATCAG" "TTACTGGGACGACTCGGG" "ACGAGGAGACCCGAGATA" "CTTTGCAGTAGGGGCAAC" …

Writing a .csv file in this way loses the first cell. The first row of the file contains only 500 values, while the other rows contain 501:

write.table(m, 'matrix.csv', sep=",", col.names=T, row.names=T)

Please use this command instead. It is much easier:

write.csv(m, 'matrix.csv')

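For a .tsv file, one way is to use write.table as usual and then manually insert one tab at the beginning of the first row. Alternatively, write.table(m, 'matrix.tsv', sep="\t", col.names=NA) makes R write the empty leading cell itself.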
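Whatever tool produced the file, you can verify that the header row is not one cell short before importing it. The check below is a minimal sketch with a hypothetical function name, header_is_aligned. It only compares field counts and makes no claim about how BBrowser itself parses the file.

```python
import csv


def header_is_aligned(path, delimiter=","):
    """Return True if the header row has as many fields as every data row."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    if len(rows) < 2:
        return False  # a header and at least one data row are expected
    width = len(rows[0])
    return all(len(row) == width for row in rows[1:])
```

Use delimiter="\t" for .tsv files. A result of False on a file written from R usually means the first cell of the header is missing, as described above.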

Import Seurat, Scanpy object

To import a study from a Seurat object (.rds) or a Scanpy object (.h5ad/.h5):

  • Go to Data page -> Add new study. On the popup window, click on Other
  • Move the file you want to import in the input box or click anywhere inside the box to open File Explorer and select the input file.
  • Data format will be automatically adjusted for the compatible input file.
  • Choose the species of your data.

Quality control parameters and the dimensionality reduction method are not needed because these steps have been done on the Seurat/ scanpy object.

  • Type in the dataset title and finally, click on Start.

A Seurat or scanpy object must contain an expression matrix with information on barcodes and genes. BBrowser can also adopt some analysis results in the object. These results include, but are not limited to:

  • Integrated expression matrix with batch effects corrected
  • PCA results
  • t-SNE/UMAP coordinates
  • Graph-based clustering results
  • Metadata of the cells

Upon receiving the Seurat or Scanpy object, BBrowser reads all available data and runs analyses to obtain the missing information.

Notice:

  • BBrowser reads all metadata fields that have fewer than 50 categories. If you have already annotated your data, you can add the annotations to the metadata under the name “annotations”.
  • BBrowser does not support importing multiple objects. Please combine your batches into one object before importing it into the software.
  • Other single-cell object formats can be converted to Seurat objects by following the tutorial from the Satija lab: https://satijalab.org/seurat/v3.1/conversion_vignette.html
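To see in advance which metadata fields fall under the 50-category limit mentioned in the notice, a check along these lines can be used. It is a sketch only: the function name importable_metadata is hypothetical, and counting distinct values per field is an assumption about how categories are counted.

```python
def importable_metadata(metadata, max_categories=50):
    """Return the fields with fewer than max_categories distinct values.

    metadata maps a field name to the list of per-cell values.
    """
    return [name for name, values in metadata.items()
            if len(set(values)) < max_categories]


# Hypothetical metadata for 200 cells.
metadata = {
    "sample": ["a", "b"] * 100,               # 2 categories
    "cell_id": [str(i) for i in range(200)],  # 200 categories
}
print(importable_metadata(metadata))
```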

Seurat object

BBrowser is able to read a Seurat object stored in .rds format. To create a .rds file from Seurat, you can use the saveRDS function in R. We will not go into detail about the structure since the software does not require any specific modification of original Seurat structure. If you can load the .rds as a normal Seurat object, the software can do the same.

Scanpy object

For users who analyze with Python via the scanpy library, the final AnnData object should be saved in .h5/.h5ad format using the object's own .write function. Unfortunately, HDF5 is very general, and there are many variations of the structure in which the information is recorded. BBrowser expects the following structure:

  • Expression matrix (required): a matrix at X or X. It can be a normal or a sparse matrix. For a sparse matrix, it must have the 3 standard columns: indices, indptr, and data.
  • Gene IDs (required): a column named gene_ids or index at var or var
  • Barcodes (required): a column at obs/index
  • PCA (optional): a data frame at obsm/X_pca
  • t-SNE (optional): a data frame at obsm/X_tsne
  • UMAP (optional): a data frame at obsm/X_umap
  • Metadata (optional): a data frame at obs
  • Graph-based clustering (optional): a column in metadata named louvain

Data analysis pipeline

We are fully aware that different datasets are generated under different experimental designs and may need to be treated individually, both to represent all biological variation in the samples and, for public studies, to reproduce the published results as faithfully as possible. The long-term plan for BioTuring Browser is therefore to maintain its speed and ease of use while enhancing the flexibility of the analyses. All public datasets and imported data go through the same pipeline, whose steps are discussed in this section.

Transcript quantification

Transcript quantification is only applied when you create a new study from raw sequencing files (FASTQ). The process is run by Hera-T (version 1.2.0) (Tran et al. 2018), a new algorithm developed by the BioTuring team. It is applied to data generated by the 10X protocol on Chromium v2 and v3. Processing is 10 to 100 times faster than CellRanger 3.0, with better accuracy (Tran et al. 2018). The output of transcript quantification is an expression matrix in MTX file format, which is then submitted to the processing steps below.

Quality control

The process from quality control to dimensionality reduction is applied to public and in-house datasets imported in MTX, TSV or CSV files.

Quality control filters out poor-quality cells in terms of gene expression and redundant non-expressed genes in the data.

In public datasets without a detailed processing script from the author, genes that have at least 1 UMI count in fewer than 3 cells are excluded. Then, cells with fewer than 200 genes having at least 1 UMI count, and cells with more than 5% of mitochondrial genes, are excluded. The process creates a new expression matrix that may have fewer cells than the original data, and BBrowser only takes the cells and genes of this filtered matrix for the next processing steps.

For in-house data, BBrowser allows users to define the cut-off for quality control or to skip any filtering steps. In the data import pop-up, you can

  • untick the checkbox before any filtering criteria to skip that step
  • change the number of cells/ genes/ mitochondria gene ratio/ top variable genes to apply the new parameters to the filtering step.
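The filtering described in this section can be sketched as follows. This is not BBrowser's implementation. It assumes that the mitochondrial criterion means the share of a cell's UMI counts coming from genes whose names start with "MT-", and the function name quality_control and its parameter names are hypothetical. The defaults mirror the thresholds stated for public datasets, and they are parameters because in-house imports let you change them.

```python
def quality_control(counts, gene_names, min_cells=3, min_genes=200,
                    max_mito_percent=5.0, mito_prefix="MT-"):
    """Filter a genes x cells count matrix given as a list of rows.

    Returns (kept_gene_indices, kept_cell_indices).
    """
    n_cells = len(counts[0]) if counts else 0

    # Step 1: drop genes detected (count >= 1) in fewer than min_cells cells.
    kept_genes = [g for g, row in enumerate(counts)
                  if sum(1 for v in row if v >= 1) >= min_cells]

    # Step 2: drop cells with too few detected genes or too high a
    # mitochondrial share, computed on the genes that survived step 1.
    kept_cells = []
    for c in range(n_cells):
        detected = sum(1 for g in kept_genes if counts[g][c] >= 1)
        total = sum(counts[g][c] for g in kept_genes)
        mito = sum(counts[g][c] for g in kept_genes
                   if gene_names[g].upper().startswith(mito_prefix))
        mito_percent = 100.0 * mito / total if total else 0.0
        if detected >= min_genes and mito_percent <= max_mito_percent:
            kept_cells.append(c)
    return kept_genes, kept_cells


# Toy example with relaxed thresholds so that 4 genes x 4 cells is enough.
names = ["GeneA", "GeneB", "MT-CO1", "GeneC"]
toy = [
    [5, 3, 4, 0],
    [2, 6, 1, 0],
    [0, 1, 10, 1],
    [1, 0, 0, 0],
]
genes, cells = quality_control(toy, names, min_cells=3, min_genes=2,
                               max_mito_percent=20.0)
print(genes, cells)
```

In the toy example, GeneC is dropped because it is detected in only one cell, the third cell is dropped for its high mitochondrial share, and the fourth cell is dropped for having too few detected genes.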

Batch effect removal

This process is applied when multiple MTX, TSV or CSV files are submitted, usually from multiple batches of sample preparation and sequencing. The software considers each file as a batch and will try to scale all batches with the chosen method.

Currently, we provide 3 methods to remove batch effects:

  • MNN correction (Haghverdi et al. 2018) ⁠: This method is based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space, which are highly variable genes detected by the Seurat package (Gribov et al. 2010). To select the initial group, unbiased graph-based (louvain) clustering is run on the first 30 components of the PCA results to sort the batches in order, then the batch in first order is selected. The method assumes that there is at least a subset of the population is shared by all the batches. It is effective on repetitive measurement, but the computational time is expensive, compared to other methods.
  • CCA (Butler et al. 2018): This method is widely used in scRNA-seq via the Seurat package (Gribov et al. 2010). The idea of this method is to use an adaptive version of manifold alignment and anchor all the batches in this adaptive space. It is simple, fast, and at the same time, very effective when applying on data of different technologies.
  • harmony (Korsunsky et el., 2019): This method is similar to CCA since it also projects cells into a shared embedding. But by considering cell type rather than dataset-specific conditions, it can simultaneously account for multiple experimental and biological factors. Application and the speed of this method is comparable to CCA.

Dimensionality reduction

In the current version, there are two ways to run dimensionality reduction: t-SNE and UMAP.

t-SNE (Maaten and Hinton 2008)

The first 30 components of the PCA are used for the t-SNE.

The parameters for t-SNE depend on the size of the data:

  • Less than 100: perplexity = 10, theta = 0.0, max_iter = 2000
  • Less than 1000: perplexity = 20, theta = 0.0, max_iter = 1500
  • Less than 10000: perplexity = 30, theta = 0.4, max_iter = 1000
  • Other: perplexity = 50, theta = 0.5, max_iter = 1000

The analysis is done with the Rtsne package (Krijthe 2015).
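The size-dependent defaults listed above amount to a simple lookup. The following Python sketch restates them; the function name is hypothetical, and it assumes that "size of the data" means the number of cells.

```python
def tsne_parameters(n_cells):
    """Return the t-SNE settings listed above for a dataset of `n_cells` cells."""
    if n_cells < 100:
        return {"perplexity": 10, "theta": 0.0, "max_iter": 2000}
    if n_cells < 1000:
        return {"perplexity": 20, "theta": 0.0, "max_iter": 1500}
    if n_cells < 10000:
        return {"perplexity": 30, "theta": 0.4, "max_iter": 1000}
    return {"perplexity": 50, "theta": 0.5, "max_iter": 1000}
```

The keys match the argument names `perplexity`, `theta` and `max_iter` of the `Rtsne` function, where `theta = 0.0` requests exact t-SNE and larger values trade accuracy for speed.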

UMAP (McInnes and Healy 2018)

The first 30 components of the PCA are used for the UMAP.

The number of neighbors is set at 15.

The analysis is done with the uwot package (Melville 2018).

Clustering methods

This analysis runs on the PCA results. For every dataset, the software calculates both Louvain (graph-based) and k-means clustering.

  • Louvain clustering: The graph-based method is done with the igraph package (Csardi and Nepusz 2006), using a flexible number of nearest neighbors. This number is no larger than 20 and is estimated by the elbow method of k-means clustering on the PCA results.
  • k-means clustering: k-means is run for a series of k values ranging from 2 to 10. The software pre-calculates and records the outcome for all k values so that users can instantly switch to a different k and view the clustering result in the scatter plot.

Finding marker genes

BBrowser uses a non-parametric approach, called Venice, to detect marker genes. It is an open-source algorithm that runs efficiently on large amounts of data and is reported to be more accurate than other methods (Hy et al. 2019).

We first defined marker genes of a group of cells in a data set as the genes that can be used to distinguish such cells from the rest. From this idea, we used the accuracy of classification as a metric to score the significance of a marker gene.

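Considering each gene separately, we denote a cell as $(x, y)$, where $x$ is its expression level and $y$ is the label of a group of cells: $y = 1$ if the cell is in the group of interest (group 1 - the group that we want to find the marker genes for) and $y = 2$ if the cell is not in the group of interest (group 2 - the rest of the data). We denote $\bar{g}$ as the complement group of $g$.

The probability for a cell being in group $g$, given its expression level $x$, is:

$$P(g \mid x) = \frac{P(x \mid g)\,P(g)}{P(x \mid g)\,P(g) + P(x \mid \bar{g})\,P(\bar{g})}$$

In most cases, the group of interest is much smaller than the rest of the data, which can generate a sampling bias. To avoid this bias of sample size, we set:

$$P(g) = P(\bar{g}) = \frac{1}{2}$$

Hence:

$$P(g \mid x) = \frac{P(x \mid g)}{P(x \mid g) + P(x \mid \bar{g})}$$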

Accuracy of the classifier is:


The accuracy of prediction is:

For the robustness of the calculation, we divide the expression into $k$ intervals:


Where $n_{ig}$ is the number of cells of group $g$ in interval $i$, and $n_g$ is the number of cells in group $g$. For each gene, we can estimate the accuracy of using this gene to predict whether cells are inside or outside the cluster, and use this as a metric for ranking the marker genes.
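To make the idea concrete, here is a minimal Python sketch of an accuracy-style score for one gene. It is one plausible reading of the description above, not the Venice implementation. It assumes equal-width expression intervals, estimates the probability of each interval within a group as the fraction of that group's cells falling in it, applies the equal-prior rule, and averages the expected accuracy over the two groups. The function name and the default of 10 intervals are hypothetical.

```python
def binned_accuracy(expr_in, expr_out, n_bins=10):
    """Expected accuracy of an equal-prior classifier built on binned expression.

    expr_in:  expression values of the gene in the group of interest.
    expr_out: expression values of the gene in the rest of the data.
    Returns a value between 0.5 (uninformative gene) and 1.0 (perfect separation).
    """
    values = list(expr_in) + list(expr_out)
    lo, hi = min(values), max(values)
    if lo == hi:
        return 0.5  # constant expression cannot separate the groups
    width = (hi - lo) / n_bins

    def bin_fractions(xs):
        # Fraction of the group's cells falling in each interval.
        counts = [0] * n_bins
        for x in xs:
            idx = min(int((x - lo) / width), n_bins - 1)
            counts[idx] += 1
        return [c / len(xs) for c in counts]

    p_in = bin_fractions(expr_in)
    p_out = bin_fractions(expr_out)

    accuracy = 0.0
    for a, b in zip(p_in, p_out):
        if a + b == 0:
            continue  # empty interval contributes nothing
        # With equal priors, P(group | interval) = a / (a + b) for the group
        # of interest and b / (a + b) for the rest; weight each by the
        # group's own mass in the interval and average the two groups.
        accuracy += 0.5 * (a * a + b * b) / (a + b)
    return accuracy
```

A gene with identical distributions in both groups scores 0.5 in this sketch, and a gene whose values never share an interval between the two groups scores 1.0, regardless of how unbalanced the group sizes are.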

We tested Venice on both real and simulated datasets. The benchmark considered the performance on 2 different sequencing technologies (full-length and UMI count), 4 different kinds of marker genes (including transitional genes), and 2 different kinds of null genes. Venice exhibited the best performance and accuracy in all cases. It could effectively detect different types of marker genes and avoid false-positive results while keeping a modest running time.

Venice is also incorporated in Signac, a single-cell analytics package developed by BioTuring. The package is available at https://www.github.com/bioturing/signac

Gene set enrichment analysis

This analysis is adopted from the GSEA method (Subramanian et al. 2005), a common analysis for selecting potential biological terms given a sorted list of genes. The software performs GSEA on 4 different categories of terms: biological process, molecular function, cellular component, and biological pathway. The first 3 are from the Gene Ontology (Consortium 2004), and the last one is from the Reactome database (Joshi-Tope et al. 2005).

Enrichment analysis can be found in both the Analysis dashboard and the Differential expression dashboard.

  • The Analysis dashboard: The gene list used for GSEA is the sorted list of marker genes. The genes are sorted in the Marker genes tab based on the p-value discussed previously.
  • The Differential expression dashboard: The gene list used for GSEA comes from the result of the differential expression analysis. Genes are sorted by p-value.
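For readers unfamiliar with GSEA, the core of the method is a running-sum statistic computed along the sorted gene list. The Python sketch below shows the unweighted form of that statistic only. It is an illustration under that assumption, and it does not reproduce the weighting or the significance estimation that BBrowser applies. The function name is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of `gene_set` in `ranked_genes`.

    Walks down the ranked list, stepping up at each gene in the set and down
    at each gene outside it, and returns the largest deviation from zero.
    """
    members = set(gene_set) & set(ranked_genes)
    n_total, n_hit = len(ranked_genes), len(members)
    if n_hit == 0 or n_hit == n_total:
        return 0.0  # score is undefined without both hits and misses

    hit_step = 1.0 / n_hit
    miss_step = 1.0 / (n_total - n_hit)

    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += hit_step if gene in members else -miss_step
        if abs(running) > abs(best):
            best = running
    return best
```

A score near 1 means the gene set is concentrated at the top of the sorted list, and a score near -1 means it is concentrated at the bottom.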

Cell-type prediction

This feature shows you the suggested cell type for a group of cells. When you make a selection by clicking a cluster/annotation or using the Select cell tool, the software picks the genes that are expressed in at least 35% of the group. These genes are not selected from the whole transcriptome, but from a list of cell-type markers in our curated knowledge base. The software then uses that gene profile to estimate the correlation with each cell-type profile. A cut-off of 0.5 is applied to remove unlikely candidates. The remaining cell types undergo a tree search to find their common parents. Parents that have less weight (e.g. distinct from the rest) are removed. This process is repeated until only one cell type is left. The whole analysis usually takes 1-3 seconds to finish, so it is triggered automatically.

Differential expression analysis

BBrowser supports finding the differentially expressed genes between two groups of cells, where each group must have at least 3 cells. It finds differentially expressed genes using Venice, the same method used for finding marker genes. Users can switch to edgeR, a more common method that takes at least 5 times longer.

  • Venice: The algorithm has been described previously in the section on marker genes. In this case, instead of comparing one group to the rest of the data, the software looks for genes that differentiate the two selected groups.
  • edgeR: In the pre-processing step, the software keeps only non-spike-in genes that are expressed in at least 45% of the cells in one group. Then the edgeR package (Robinson, McCarthy, and Smyth 2010) is used to fit a quasi-likelihood negative binomial generalized log-linear model to the UMI-count data. To test for significance, gene-wise statistical tests are conducted, producing an adjusted p-value for each gene.

For the log2FC value of each gene, we use the same method as the Seurat package (Gribov et al. 2010). The detailed formula is given below:

Notations:


 

Adjusting data visualization

Interactive 2D - 3D plot

By default, the dimensionality reduction result is shown in a 2-D scatter plot.

You can interact with the plot by zooming in and out, switching between 2D and 3D, moving and rotating the plot, and resetting it to the original state.

On the bottom right corner of the scatter plot, there are several buttons that control the visualization as well as how a user can define a selection.

●   Reset: this button resets the scatter plot to the original state without any selection. Cells are colored by the last clustering factor/annotation used.

●   Pencil tool (lasso selection): this button activates the free selection mode.

●   Hand tool: this button activates the navigation mode: moving and rotating the plot, as well as whole cluster selection.

●   2D / 3D: these buttons switch between the 2-D and 3-D scatter plots. Rotation is only enabled for the 3D plot. For a Seurat/scanpy object whose dimensionality reduction was calculated in 2D but not in 3D, BBrowser can calculate the 3D coordinates based on the PCA results, and vice versa.

●   Zoom (plus/minus): these buttons zoom in and out. The point size of the scatter plot remains unchanged when zooming. Alternatively, you can use your mouse wheel to zoom.

●   Capture: this button takes a screenshot of the current scatter plot with its cluster labels and exports it as an image.

Customize the plot

Users can customize the scatter plot in terms of visualization (theme, point size, transparency, color palette) and dimensionality reduction method (t-SNE or UMAP).

  • Go to Settings on the bottom right corner.
  • Choose Visualization to change the appearance of the plot or Analysis to change the plot type.
  • Define your preferences for visualization/settings, then click Apply.

Options for altering the scatter plot appearance include:

  • Point opacity: Adjust the opacity and make the cells transparent. This option is helpful when you want to view a rare population with low density surrounded by cells from different groups. Since the density is low and the number of cells is small, the cluster might be hidden and only revealed when all cells are transparent.
  • Point size: Depending on the screen resolution, you may need to adjust the point size to fit your screen. By default, the software automatically estimates the point size based on the screen size and resolution.
  • Equal size: Data generated by UMI-count technologies (such as 10X Genomics) may have a lot of zeros (the dropout issue). When you map expression values to the scatter plot, cells whose expression value is zero are shown in grey, and there will be lots of them. They can cover non-grey cells (cells with a non-zero expression value) and produce a misleading visualization. To tackle this problem, the software makes grey cells much smaller than the original size when you query a gene. If you turn this option off, the software will keep everything as is.
  • Theme: This option is “Dark” by default (black background). This is the optimal theme for analysis. If you want to make a figure or give a presentation on a projected screen, you may consider using the “Light” theme (white background).
  • Color palette: This option controls the way colors are assigned to groups of cells or gene expression.

Color by and Shape by

The Color by tab and the Shape by tab let you color and shape the cells in the scatter plot to your preference. You decide which group of clusters you want to visualize, which changes the way cells are colored and shaped. Cells with the same color and shape belong to the same cluster.

The software offers various classification methods: unbiased graph-based clustering, k-means clustering, classification by input metadata, or by your own definition and annotation. You can import your annotation matrix from a file in Color by tab.

Color by tab is always activated.

  • Click on the drop down to choose the clustering or annotation result you would want to visualize.
  • Each cluster is listed with its name and number of cells, and is represented by a bar that shares its color with the cells belonging to that cluster in the scatter plot. The length of the bar depends on the number of cells in the cluster. The order of the clusters is the order in which they were input.
  • Double-clicking on any cluster shows only the cells in that cluster and minimizes cells from other clusters. Double-clicking again shows the whole cell population of the study.
  • You can also untick any cluster that you don’t want to see in the scatter plot.

Shape by tab is activated by choice.

When this tab is activated, the cell’s shape changes from a dot to a number or letter. With 2 layers of visualization (color and shape), you can view how one cell appears in 2 different classifications (cell type in patient, cell type in treatment, etc.)

  • The selection of clustering/annotation results and of clusters to visualize works the same way as in Color by.

To save the t-SNE or UMAP plot with the current cell colors and shapes, click on the camera button at the bottom of the plot to save the image and its legends.

Query gene expression

Single mode

To see how one gene is expressed in the given dataset, type a gene name or its Ensembl ID in the gene query box at the top right corner of the scatter plot and press Enter.

Cells will be colored based on their expression level of that gene, according to the sequential color scale in Settings. Gene information will be displayed in the info box on the top left corner.

  • To save the scatter plot showing the selected gene expression to the Gene gallery, click on the arrow below the color scale and choose Add to Gallery.
  • Saved scatter plots can be viewed by opening the Gene gallery at the bottom right of the main visualization window.
  • To save the t-SNE or UMAP plot with the given gene expression level, click on the Camera button at the bottom of the plot to save the image and its legends.
  • To view the expression of the selected gene across clusters of the current classification, click on the arrow below the color scale and click Plot.
  • The software will create a violin plot of the gene expression across all the clusters.

The x-axis of the plot shows the names of the clusters, which can come from a custom annotation or a clustering result, together with the ratio of cells with positive expression to the total number of cells in the cluster. You can sort the clusters alphabetically, by the number of cells in the cluster, by the number of cells expressing the selected gene, or by the mean expression of the selected gene. Click on the Settings icon on the top left corner and tick your preference.

The Settings can help you further customize the violin plot. You can change the plot into a box plot or a bar chart. It also allows you to add data points, with each point representing a cell.

Notice:

  • To get the violin plot of gene expression across clusters in a specific annotation, you need to choose the annotation first in the Color by tab. After that, search for the gene and click Plot.
  • You can also plot only certain clusters (not all clusters in an annotation) by selecting the clusters you want in the Color by tab and unticking the other clusters before querying the gene expression.
  • To save the plot you made, click on the camera button at the top right to save the image and its legends.

Dual mode

To see how two genes are expressed in the given dataset at the same time:

  • Type the first gene name or its Ensembl ID in the gene query box and press Enter.
  • Click on the arrow below the color scale and choose Dual mode.
  • Type the second gene name or its Ensembl ID in the gene query box and press Enter.
  • Cells will be colored based on their expression level of the 2 genes, according to the sequential color scale chosen.

Now, if you click on the Plot button, a density plot will appear, showing the expression of the two genes across the entire population. To get the density plot for a single cluster or for certain clusters, select the clusters you want in the Color by tab and untick the other clusters before querying the gene expression.

You can change the density plot to a scatter plot by going to Settings and checking the scatter plot option. By default, cells that express neither gene are excluded, but an option to include them in the plot is available. You can also drag to select any part of the plot that you want to view, with flexible maximum and minimum values.

To save the plot you made, you can click on the camera button at the top right to save the image and its legends.

Select a cell population

For other analyses (adding an annotation, viewing the compositional breakdown, etc.), you first need to select the cells. Selected cells are colored in white.

Hand tool & Pencil tool

The most common way to select cells is by using the Hand tool and Pencil tool.

  • Hand tool should be used to select cells that are already clustered. Choosing the Hand tool and clicking on a cluster will select all cells that belong to that cluster.
  • Pencil tool can be used to select any cells, whether they belong to a cluster or multiple clusters. Use the pencil tool to draw a selection border around the cells you want to select. If the border you draw is not closed, a straight line will be added to join the 2 ends.

Color by & Shape by filters

You can also select cells that are already clustered or annotated from the Color by and Shape by tabs.

To select cells in one cluster:

  • Go to the Color by tab and choose the group that contains the cluster you want to select, so that its clusters are visible in the scatter plot.
  • Single-click on the cluster name to select all cells in that cluster. The cluster bar will be highlighted by a gray box and the selected cells will turn white.

To select cells that belong to 2 clusters of 2 different classifications:

For example, cells that belong to cell type 1 in the cell type classification and to patient A in the patient classification.

  • Go to Color by tab and choose the first classification. Choose your first cluster of interest by single-click.
  • Go to Shape by tab and choose the second classification. Choose your second cluster of interest by single-click.
  • Selected cells in white are cells that belong to both clusters.
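Conceptually, combining a Color by cluster with a Shape by cluster is an intersection of two classifications. The short Python sketch below illustrates this with plain dictionaries that map cell IDs to cluster names. The data layout and the function name are assumptions made for the example, not BBrowser's internal representation.

```python
def select_in_both(color_by, shape_by, color_cluster, shape_cluster):
    """Return the sorted IDs of cells in `color_cluster` AND in `shape_cluster`.

    color_by / shape_by: dicts mapping cell ID -> cluster name in each classification.
    """
    return sorted(
        cell for cell, cluster in color_by.items()
        if cluster == color_cluster and shape_by.get(cell) == shape_cluster
    )


# Example: cells of "cell type 1" that come from "patient A".
cell_type = {"c1": "cell type 1", "c2": "cell type 1", "c3": "cell type 2"}
patient = {"c1": "patient A", "c2": "patient B", "c3": "patient A"}
selected = select_in_both(cell_type, patient, "cell type 1", "patient A")
```

In this example only `c1` satisfies both conditions, which corresponds to the cells shown in white after the two single-clicks.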

Select cells by gene expression

You can select cells that share the same expression level of a given gene:

  • Type in the gene name or its Ensembl ID in the gene query box.
  • Click on the color scale to select all cells expressing that gene.
  • Adjust the 2 black dots at the 2 ends of the color scale to select cells within a defined maximum and minimum level of expression of the given gene.
  • Or click on the maximum and minimum values and type in new values.

Cell type prediction tool

Default cell type prediction

Whenever a group of cells is selected, BBrowser will try to predict the cell type of that cluster. Cell type prediction results will appear in the infobox on the top left corner of the scatter plot. Cell type prediction result includes cell type name, marker genes’ information and the publication used for reference.

By default, cell type prediction is applied to data with fewer than 50,000 cells because of the long processing time for large datasets. You can enable or disable the function by increasing or decreasing the cell number limit in Settings > Analysis > Cell-type prediction limit. The prediction is calculated from the curated database of cell types. This database consists of more than 200 cell types with their marker genes and the related reference publications.

Custom cell type prediction

Users can also create their own definition of a cell type by a set of marker genes and use that custom knowledge base for cell type prediction.

  • To create a custom database of cell types, go to Settings > Analysis and, in the Cell-type prediction knowledge section, check Custom.
  • Type in the cell type name and the positive and/or negative gene markers. Press Enter after you type in a gene name, or click on the gene name suggested by auto-complete, to add the gene marker.
  • Click on the Plus icon (+) to add your definition of a cell type.
  • Click Apply to save your settings.
  • Now, when you circle a group of cells, cell type prediction calculation will be based on your custom knowledge base.

Find marker genes and enriched processes

Finding marker genes and enriched processes in a group of cells helps you see the genes and processes that are differentially expressed in the selected group, compared to the rest of the cell population. This information is essential for defining which cell type the cluster belongs to. To run the analysis:

  • Select a group of cells.
  • Go to the Marker genes tab on the right and click on Find marker genes.
  • The algorithm will run and find the marker genes as well as the enriched processes, so both results will be available and ready to be explored.
  • Alternatively, you can click on Run enrichment analysis, which gives the same result with both marker genes and enriched processes.
  • Each gene or process comes with its p-value and related biological details. You can use the Search box to look for a gene or a process of interest.

Details on the marker genes and enrichment analysis include:

Marker genes:

  • By default, marker genes are sorted in order of significance (p-value), with the most significant gene first. Each page shows 10 marker genes. To continue browsing, go to the next page.
  • Together with the gene name and p-value, the software also shows the type of marker gene, dissimilarity, log2FC, Ensembl ID, protein class, gene type, transcript count and GC content. All of these criteria can be used to sort the marker genes in increasing or decreasing order. To sort the marker genes, click on the icon next to the column name.
  • Type of marker genes: up-regulated, down-regulated or transitional.
  • Transitional marker genes are genes that are not exclusively expressed or repressed in the given cluster but are expressed in multiple clusters, with an expression level that is distinctive for each cluster. The classification is taken from Venice.
  • Dissimilarity: this score indicates whether the selected cells are different and can be separated from the non-selected population by constructing a simple classifier based on the given gene expression. If the classifier can determine whether a cell comes from the selected or non-selected group with 100% accuracy, dissimilarity will be 1.
  • Log2FC: the log2 fold change of each gene is the log2 of the ratio between the mean expression of that gene in the selected cells and its mean expression in the rest of the cell population.
  • Other details about protein class, gene type, transcript count and GC content are taken from the Ensembl database and are available for human and mouse genes.

Enrichment analysis:

  • By default, enriched biological processes are displayed in order of significance (p-value), with the most significant process first. Each page shows 15 enriched processes. To continue browsing, go to the next page.
  • To view enriched molecular functions, cellular components and pathways, click on the drop-down next to the first column name.
  • To view details about each process or pathway, click on the icon in the Source column. This will connect you to the database (Gene Ontology or Reactome) and take you directly to the specific page of the chosen process.

Add an annotation

Add an annotation

You can add multiple annotations to a cell, regarding cell type, subtype, expression level of a gene or set of genes, clonotype, etc. There are 2 ways to add an annotation:

  • Import an annotation matrix from a file and apply it to the whole dataset
  • Add an annotation manually for each cluster

For each annotation, you need to enter a Group name, which is the name of the classification (cell type, sub-type, T cell sub-type …), and a cluster name, which is the name of the cluster (macrophage, microglia, COL1A4+ fibroblast, …).

To import an annotation matrix by a file:

  • Go to the Color by tab and click on the drop-down for the type of classification.
  • Click on Add annotation from a file.
  • A File explorer window will open. Navigate to the annotation file and click Open.
  • The annotation matrix needs to be in TSV or CSV format, with the cell ID in the first column and the annotation linked to each cell in the next column. The column name is taken as the Group name and the cluster names are taken from the annotation of each cell. The annotation must be non-numerical and have fewer than 50 types.
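Before importing, it can help to check an annotation file against the rules listed above. The Python sketch below is a hypothetical pre-check, not part of BBrowser: it takes the file content as text, assumes a header row, and enforces the two stated constraints (non-numerical labels, fewer than 50 types). The function name and error messages are illustrative.

```python
import csv
import io


def read_annotation(text, delimiter="\t", max_types=50):
    """Parse annotation text and return (group_name, {cell_id: cluster_name}).

    Raises ValueError if a label is numerical or there are `max_types` or more labels.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    if len(header) < 2:
        raise ValueError("expected a cell ID column and an annotation column")
    group_name = header[1]  # the column name becomes the Group name

    annotation = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        cell_id, label = row[0], row[1]
        try:
            float(label)
        except ValueError:
            annotation[cell_id] = label  # non-numerical label: accepted
        else:
            raise ValueError(f"numerical annotation for cell {cell_id}: {label}")

    if len(set(annotation.values())) >= max_types:
        raise ValueError(f"annotation must have fewer than {max_types} types")
    return group_name, annotation


example = "cell_id\tCell type\nAAACCTG-1\tmacrophage\nAAACGGG-1\tmicroglia\n"
group, cells = read_annotation(example)
```

For a CSV file, pass `delimiter=","`. To check a file on disk, read it first, for example with `open(path, newline="").read()`.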

To manually annotate each cluster:

  • Select a group of cells (refer to section 6.3)
  • Click on the + icon to create an annotation for that group of cells      

Fill in Group name and cluster name to create a new group and cluster or choose an existing group and cluster to add the selected cells to that cluster.

Click OK to implement.

 

Edit an annotation

After an annotation is added, you can edit it by changing its name, merging 2 clusters together, or deleting a cluster or the whole group.

  • Go to Color by tab, choose the group you want to edit.
  • Click on the Pencil icon next to the group name.

  • A pop-up window will be opened.
  • Put in a new name for the clusters or group as you prefer.
  • Click on the trash icon to delete the clusters or group.
  • To merge 2 clusters, hover the mouse over one cluster to select it (it will be surrounded by a gray box), then drag it onto another cluster to merge them together.
  • Click on Save to keep all changes or Cancel to discard the changes.

Study cellular composition

BBrowser supports cellular composition analysis for any group of cells, whether annotated or not. You define the group of cells whose composition you want to view and the type of classification. The software then calculates the percentage of each cluster of the chosen classification within the selected cells and sorts the clusters from largest to smallest share.

To study the cellular composition:

  • Select a group of cells (refer to section 6.3)
  • Open the Composition tab on the right
  • Choose the type of classification you want to analyze by the dropdown.

The results of cellular composition will be displayed in a stacked bar chart.

The colors in the scatter plot change accordingly: non-selected cells become grey, and selected cells take the color of the cluster they belong to, matching the bar chart.
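The percentage breakdown described in this section can be sketched as follows. This is a minimal Python illustration that assumes the classification is available as a dictionary from cell ID to cluster name. The function name is hypothetical.

```python
from collections import Counter


def composition(selected_cells, classification):
    """Return [(cluster, percentage)] for the selected cells, largest share first.

    classification: dict mapping cell ID -> cluster name.
    Cells missing from the classification are ignored.
    """
    counts = Counter(
        classification[cell] for cell in selected_cells if cell in classification
    )
    total = sum(counts.values())
    if total == 0:
        return []
    return [(cluster, 100.0 * n / total) for cluster, n in counts.most_common()]


cell_type = {"c1": "T cell", "c2": "T cell", "c3": "B cell", "c4": "T cell", "c5": "B cell"}
breakdown = composition(["c1", "c2", "c3", "c4"], cell_type)
```

Here three of the four selected cells are T cells, so the breakdown lists T cells at 75% followed by B cells at 25%, which is the order the stacked bar chart would use.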

Differential expression (DE) analysis

Performing differential expression analysis on any given two clusters will help you to find out the genes that cause differences between 2 clusters and processes associated with them.

Select 2 groups of cells for DE

You can run DE analysis on 2 clusters in the same group or in different groups.

To run DE analysis on 2 clusters in the same group:

  • Select the group of cells that have cells from both clusters.
  • Go to Composition tab, choose the classification that has both clusters.
  • Select 2 clusters in the bar chart by clicking on the clusters’ names or on the parts of the bar chart that display those clusters.
  • A button to Run DE analysis will be activated.
  • Clicking on that button will start the calculation process and the DE dashboard will be opened showing differentially expressed genes.

To run the DE analysis on 2 clusters in 2 different groups:

  • Select the first cluster you want to analyze. Create a new annotation for that cluster (e.g. Group name – DE analysis, cluster name – group A)
  • Select the second cluster you want to analyze. Create a new annotation for that cluster in the same group with the first cluster.
  • Select the 2 clusters by free selection (pencil icon)
  • Go to Composition tab, choose the group that you’ve just created with both clusters in it.
  • Select the 2 clusters and click on Run DE analysis

The DE analysis dashboard

After you run the DE analysis on two clusters of interest, the software opens the DE dashboard, which shows the differentially expressed genes in a volcano plot of all genes, a box plot of a single gene's expression, a table of genes and enriched processes, and a scatter plot of the cells in the two clusters.

  • The volcano plot: shows genes that are expressed in more than 45% of the cells in the two clusters. Each point represents a gene. Genes up-regulated in the first cluster versus the second cluster are colored in red, while down-regulated genes are colored in blue.
  • This plot is interactive. You can hover the mouse over any gene to view its fold-change value and p-value. If you click on a gene, the box plot will show that gene’s expression levels across the two clusters. Right-clicking on a gene shows or hides its name on the volcano plot.
  • The box plot: shows the selected gene’s expression levels across the two clusters.
  • This box plot is interactive and shows the median, mean, max and min values when you hover the mouse over it.
  • By clicking on the Settings icon on the top left corner of the box plot, you can customize the plot: change the box plot into a violin plot or a bar chart; sort the boxes alphabetically, by the number of cells in the cluster, by the number of cells expressing the selected gene, or by the mean expression of the selected gene; and add all data points or only outliers.
  • DE genes tab: shows the list of genes with their log2(fold-change) values and p-values. The genes are sorted by p-value. If you click on any gene, the box plot will change to show that gene's expression levels.
  • Enrichment analysis tab: shows the enriched biological processes, molecular functions, cellular components and pathways, together with their p-values and links to details in the Gene Ontology and Reactome databases (for human data).
  • The scatter plot of cells in two clusters: By default, cells are colored based on the cluster they belong to, so the two clusters are shown in two different colors in the scatter plot.

If you click on a gene on the volcano plot or the table, the scatter plot will show the selected gene’s expression. You can also query a specific gene expression by filling the gene name in the top right box.

  • The DE dashboard toolbar:
  • Reset button brings the plot to the original state (colored by cluster).
  • The Split view button divides the scatter plot horizontally, with each half showing cells from one cluster. This aids the visualization of gene expression in the two clusters.
  • The Export button saves each plot and table in the dashboard separately.
  • The Return button leaves the DE dashboard and goes back to the Analysis dashboard of the entire dataset.

Save and view previous DE analysis results

DE analysis results are automatically saved right after you run it, so you do not have to perform the analysis again in the future.

To review a DE analysis result, click on the Result button at the bottom of the function tabs and select the DE analysis result that you want to review.

Sub-clustering

Sub-clustering is an advanced feature that takes out a group of cells and treats them as a new set of data. The software calculates new principal components and dimensionality reduction results to plot the selected cells in a new scatter plot. The cells are also re-clustered with the Louvain and k-means clustering methods.

Focusing on a subset of the data with fewer cells than the original helps you identify more principal components, including components that are significant only for this group of cells. You can therefore further divide the cells into smaller clusters with distinct expression profiles. This feature is suitable for analyzing clusters with large heterogeneity.

Run sub-clustering

To run sub-clustering, first select a group of cells (refer to section 6.3) and click on the Sub-clustering icon. Name the sub-cluster as you like and click on Apply.

Re-calculation for the sub-cluster usually takes a few minutes. After that, the Sub-clustering dashboard opens automatically.

The Sub-clustering dashboard is similar to the Analysis dashboard and can be used to query gene expression, find marker genes and enriched processes, study cellular composition, etc., but not to run differential expression analysis. A Mini map at the bottom left of the dashboard shows the main scatter plot with all cells of the sub-cluster highlighted in white.

To go back to the main Analysis dashboard, click on the name of the sub-cluster at the top left corner and choose Main cluster from the drop-down.

Annotation of sub-clusters

Adding an annotation in the Sub-clustering dashboard works the same way as in the Analysis dashboard.

First, select a group of cells, then click on Create an annotation and define the Group name and Cluster name.

An annotation created in the Sub-clustering dashboard is treated the same as one created in the main dashboard. You can therefore view your sub-clusters in the main scatter plot, or annotate sub-clusters in different Sub-clustering dashboards under the same group name.

Study clonotype

Sequencing the T cell receptor (TCR) is a powerful way to dissect the complexity and diversity of the T cell response repertoire. By associating the TCR with gene expression, BBrowser can provide an unbiased classification of a population of interest and link the transcriptional landscape of each cell to its TCR.

Getting started

In BBrowser, clicking on the Clonotype button at the bottom of the main scatter plot opens the Clonotype dashboard. All cells in the main scatter plot turn gray and their spot size decreases. A mini map pops up showing the previous coloring of the scatter plot.

Now, you can add TCR sequencing data by clicking on Upload.

If your data comes from multiple batches, the TCR sequencing data must be submitted for each batch individually. In that case, clicking the Upload data button opens a pop-up where you select the input file for each batch.

Cells with a recognized TCR sequence are now colored according to their clonotype, and their spot size returns to normal. Hovering the mouse over a clonotype name highlights and enlarges its cells. The number of cells in each clonotype and the relevant antigen information are displayed in a table.

On the left side of the dashboard, you can change the clonotype data, change the clonotype counting method, or create an annotation for cells with a TCR sequence. Once the clonotypes are converted to an annotation, you can run any analysis on the different clonotypes, including marker gene detection, enrichment analysis, composition, and differential expression analysis.

Accepted data format

TCR sequencing results can be imported as a TSV or CSV file.

The input table must contain enough information for a typical V(D)J annotation. BBrowser only reads columns whose names appear in the list below. All other columns are ignored.

  • v_gene: name of the V gene
  • j_gene: name of the J gene
  • cdr3: CDR3 amino acid sequence
  • barcode: barcode of the cell having this clonotype
  • raw_clonotype_id: the clonotype ID
  • full_length: whether the chain has valid V and J annotations
  • productive: whether the transcript translates to a protein with a CDR3 region

The software only keeps clonotypes that are both full_length and productive. The CDR3 amino acid sequences are matched against VDJdb (Shugay et al. 2017) to retrieve information on relevant epitopes.
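The filtering rule above can be sketched in Python. This is only an illustration, not BBrowser's own code: the helper names `is_true` and `load_valid_clonotypes` are hypothetical, the file is assumed to have a header row, and the full_length and productive flags are assumed to be written as text such as True or False.

```python
import csv
import io

# Assumption: spellings that count as "true" in the flag columns.
TRUE_VALUES = {"true", "1", "yes"}


def is_true(value):
    """Interpret a text flag such as 'True' or 'False'."""
    return str(value).strip().lower() in TRUE_VALUES


def load_valid_clonotypes(handle, delimiter=","):
    """Return the rows that are both full_length and productive."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [
        row
        for row in reader
        if is_true(row.get("full_length", "")) and is_true(row.get("productive", ""))
    ]


# Made-up example table with a few of the accepted columns.
example = io.StringIO(
    "barcode,v_gene,j_gene,raw_clonotype_id,full_length,productive\n"
    "AAAC-1,TRBV5-1,TRBJ2-1,clonotype1,True,True\n"
    "AAAG-1,TRAV12-2,TRAJ33,clonotype2,True,False\n"
    "AACT-1,TRBV7-9,TRBJ1-2,clonotype3,False,True\n"
)
rows = load_valid_clonotypes(example)
print([row["barcode"] for row in rows])  # ['AAAC-1']
```

For a TSV file, pass `delimiter="\t"` instead. Running a check like this before uploading shows how many cells are likely to keep a clonotype after import.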

Clonotype counting

There are two ways to count clonotypes:

  • Clonotype: This is the default method. The software assigns a cell to a clonotype if the cell carries that clonotype ID, with the exact sequence for both chains of the TCR. You can convert this counting result to an annotation to examine how the clonotypes are composed with respect to other factors.
  • TCR chain: With this option, each row in the table is a single TCR chain, so cells are grouped if they share at least one chain. One cell can therefore appear in several groups at once, and this result cannot be converted into an annotation.
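The difference between the two counting methods can be illustrated with a small Python sketch. The records, sequences, and function names below are made up for illustration, and a chain is assumed to be identified by its sequence alone. BBrowser's internal logic may differ.

```python
from collections import defaultdict

# Hypothetical records: (cell barcode, clonotype ID, chain sequence).
# Each cell contributes one record per TCR chain.
records = [
    ("cell1", "clonotype1", "CASSLGQ"),
    ("cell1", "clonotype1", "CAVRDSN"),
    ("cell2", "clonotype1", "CASSLGQ"),
    ("cell2", "clonotype1", "CAVRDSN"),
    ("cell3", "clonotype2", "CASSLGQ"),
    ("cell3", "clonotype2", "CALSEAG"),
]


def count_by_clonotype(records):
    """Count distinct cells per clonotype ID (each cell lands in one group)."""
    cells = defaultdict(set)
    for barcode, clonotype_id, _chain in records:
        cells[clonotype_id].add(barcode)
    return {key: len(value) for key, value in cells.items()}


def count_by_chain(records):
    """Count distinct cells per chain (a cell can land in several groups)."""
    cells = defaultdict(set)
    for barcode, _clonotype_id, chain in records:
        cells[chain].add(barcode)
    return {key: len(value) for key, value in cells.items()}


print(count_by_clonotype(records))  # {'clonotype1': 2, 'clonotype2': 1}
print(count_by_chain(records))      # {'CASSLGQ': 3, 'CAVRDSN': 2, 'CALSEAG': 1}
```

The chain counts add up to six although there are only three cells. Each cell belongs to more than one chain group, which is why a chain-based count cannot serve as an annotation that assigns every cell to exactly one group.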

Export and Data sharing

Export graph

BBrowser can export different graphs as image files:

  • Box plots, violin plots, density plots, and scatter plots of gene expression can be exported as SVG files.
  • Scatter plots showing clustering results or gene expression over the whole cell population, and volcano plots showing DE genes, can be exported as PNG images.
  • To save the image and its legends, click on the camera button at the top right or bottom right of the plot.

Export TSV file and MTX file

You can export many types of data to a TSV file, for example graph-based and k-means clustering results, all annotations and input metadata, clonotype counts, and the lists of marker genes and enriched processes with their corresponding p-values.

To save the annotation file, click on the Export button and choose the type of data you want to save. For the gene table and the enriched processes table, you can find the download button next to the table.

The expression matrix is saved in a folder named after the study, containing the matrix.mtx, genes.tsv, and barcodes.tsv files. This is the sparse matrix of the data after pre-processing steps such as filtering or batch-effect removal.
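If you want to inspect an exported folder outside BBrowser, the three files can be read with a few lines of Python. The sketch below uses only the standard library and rests on several assumptions that are not confirmed by this guide: matrix.mtx is in Matrix Market coordinate format, rows are genes and columns are cells, and the first column of genes.tsv and barcodes.tsv holds the name. The helper names and the tiny demo files are hypothetical.

```python
import os
import tempfile


def read_first_column(path):
    """Read the first tab-separated field of every non-empty line."""
    with open(path) as handle:
        return [line.rstrip("\n").split("\t")[0] for line in handle if line.strip()]


def read_mtx(path):
    """Read a Matrix Market coordinate file into {(row, col): value}, 0-based."""
    entries = {}
    with open(path) as handle:
        lines = (line for line in handle if not line.startswith("%"))
        n_rows, n_cols, _n_entries = (int(x) for x in next(lines).split())
        for line in lines:
            if not line.strip():
                continue
            row, col, value = line.split()
            entries[(int(row) - 1, int(col) - 1)] = float(value)
    return (n_rows, n_cols), entries


# Build a tiny stand-in for an exported study folder, then read it back.
with tempfile.TemporaryDirectory() as study_dir:
    with open(os.path.join(study_dir, "genes.tsv"), "w") as f:
        f.write("CD3E\nCD8A\n")
    with open(os.path.join(study_dir, "barcodes.tsv"), "w") as f:
        f.write("AAAC-1\nAAAG-1\nAACT-1\n")
    with open(os.path.join(study_dir, "matrix.mtx"), "w") as f:
        f.write("%%MatrixMarket matrix coordinate real general\n")
        f.write("2 3 3\n1 1 2.0\n2 1 1.5\n1 3 4.0\n")

    genes = read_first_column(os.path.join(study_dir, "genes.tsv"))
    barcodes = read_first_column(os.path.join(study_dir, "barcodes.tsv"))
    shape, entries = read_mtx(os.path.join(study_dir, "matrix.mtx"))

# Label each non-zero entry with its gene and cell barcode.
expression = {(genes[r], barcodes[c]): v for (r, c), v in entries.items()}
print(shape)                            # (2, 3)
print(expression[("CD3E", "AACT-1")])   # 4.0
```

For real datasets, a sparse-matrix reader such as `scipy.io.mmread` is a better fit than a dictionary. The plain version here is only meant to show how the three files relate to each other.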

To combine two studies, export the expression matrix from each study to a folder, then import both folders into a merged study with the batch correction method of your choice.

Share your data to the remote repository

If you want to share both the data and the analysis results with your colleagues, you can use your existing private network protocols, such as SSH, to create an internal data repository. Once a study is shared to the remote repository, your colleagues can download it to their local computers, view all the available information (annotations, differential expression analysis results, gene gallery, sub-clusters), and continue working on the dataset locally. The uploaded version remains unchanged in the remote repository.

Please note that this feature only supports data sharing; BioTuring does not provide a server itself. To use the remote repository, users need an internal server and access to it.

Setting up your BBrowser

On BBrowser, go to Settings > Remote repository data to add the server and access details.

Enter your server and account information (username and password) in the boxes. The software stores it and logs in automatically if your account is valid and has been granted access to the default directory.

  • Default directory: This is where your data is uploaded. The software also creates a metadata file here, which contains summaries of all datasets that have been uploaded to the remote repository. You can view the summaries at Home page > Remote repository. Only the user who uploaded a study can add a summary for it.
  • Host: The address of your server, e.g. mydata.bioturing.com
  • Port: The port on which your internal server provides SFTP. The default SFTP port is 22.

After providing all the information, click the Apply setting button to finish.
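If the login fails, it can help to confirm that the host and port are reachable from your computer before checking the account details. The following Python sketch is one way to do that. The function name `sftp_port_is_reachable` is hypothetical, and the check only opens a TCP connection: it does not verify that an SFTP service is listening, nor that your username and password are valid.

```python
import socket


def sftp_port_is_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False


# Example call (replace with your own server details):
# sftp_port_is_reachable("mydata.bioturing.com", 22)
```

A result of False usually points to a wrong host name, a wrong port, or a firewall between your computer and the server.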

Export data

Now, you can try exporting a dataset from your computer to your remote repository:

  • Open the dataset you want to share in BBrowser. This takes you to the Analysis dashboard of that dataset.
  • Click on the Export button at the bottom of the function tabs.
  • A pop-up window appears with details about the Remote repository export.
  • Fill in the boxes with details about the study: title, authors, abstract, and tags.

Use a comma (,) to separate the authors.

There is no character limit for the abstract (summary), and you can use Enter to separate paragraphs.

You can only add pre-defined tags in this window. To create new tags, see the next section.

Please note that uploading creates a clone of the current version of your dataset in the remote repository. Your dataset remains on the local computer and you can continue working on it. However, the software does not sync any later changes to the uploaded copy. Other users who download the dataset from the remote repository see the data exactly as it was when uploaded. If you want to share your changes, upload the dataset again.

Add custom tag

On the BBrowser Home page, you can find a study of interest through a list of tags. This helps users quickly find data from a given tissue or category. The tags of a study are set by the user who uploaded it.

When you upload a dataset to the remote repository, you can add or remove tags in the Export window. By default, BBrowser provides a list of tags that are commonly used to classify data, and only allows users to choose tags from this list.

If you want to add a new tag, go to the Settings page. Under the Custom tags section, you can add or remove any tag you want. Note that any change to the list affects all local datasets. Tags of data that has already been uploaded to the remote repository remain unchanged.


References

Azizi, E., Carr, A. J., Plitas, G., Cornish, A. E., Konopacki, C., Prabhakaran, S., ... & Choi, K. (2018). Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell, 174(5), 1293-1308.

Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. "Integrating single-cell transcriptomic data across different conditions, technologies, and species." Nature biotechnology 36, no. 5 (2018): 411.

Gene Ontology Consortium. 2004. “The Gene Ontology (GO) Database and Informatics Resource.” Nucleic acids research 32(suppl_1): D258–D261.

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal, Complex Systems 1695(5): 1–9.

Gribov, Alexander et al. 2010. “SEURAT: Visual Analytics for the Integrated Analysis of Microarray Data.” BMC medical genomics 3(1): 21.

Haghverdi, Laleh, Aaron T L Lun, Michael D Morgan, and John C Marioni. 2018. “Batch Effects in Single-Cell RNA-Sequencing Data Are Corrected by Matching Mutual Nearest Neighbors.” Nature biotechnology 36(5): 421.

Joshi-Tope, G et al. 2005. “Reactome: A Knowledgebase of Biological Pathways.” Nucleic acids research 33(suppl_1): D428–D432.

Korsunsky, Ilya, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-Ru Loh, and Soumya Raychaudhuri. "Fast, sensitive, and flexible integration of single cell data with Harmony." BioRxiv (2018): 461954.

Korthauer, K. D., Chu, L. F., Newton, M. A., Li, Y., Thomson, J., Stewart, R., & Kendziorski, C. (2016). A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome biology, 17(1), 222.

Krijthe, J H. 2015. “Rtsne: T-Distributed Stochastic Neighbor Embedding Using Barnes-Hut Implementation.” R package version 0.13, URL https://github.com/jkrijthe/Rtsne.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome biology 15(12): 550.

Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing Data Using T-SNE.” Journal of machine learning research 9(Nov): 2579–2605.

McInnes, Leland, and John Healy. 2018. “Umap: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv preprint arXiv:1802.03426.

Melville, James. 2018. “Uwot: The Uniform Manifold Approximation and Projection (UMAP) Method for Dimensionality Reduction.” https://github.com/jlmelville/uwot.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26(1): 139–40.

Shugay, M., Bagaev, D. V., Zvyagin, I. V., Vroomans, R. M., Crawford, J. C., Dolton, G., ... & Eliseev, A. V. (2017). VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic acids research, 46(D1), D419-D427.

Subramanian, Aravind et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102(43): 15545–50.

Tran, Thang, Thao Truong, Hy Vuong, and Son Pham. 2019. "Hera-T: An Efficient And Accurate Approach For Quantifying Gene Abundances From 10X-Chromium Data With High Rates Of Non-Exonic Reads." doi:10.1101/530501.

Wang, T., Li, B., Nelson, C. E., & Nabavi, S. (2019). Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC bioinformatics, 20(1), 40.