BioTuring Single-cell Browser Guidebook
Learn how to use BioTuring Single-cell Browser with our step-by-step instructions.
1.1. About the software
BioTuring Browser, or BBrowser, is a desktop application that performs analyses on sequencing data. The software is also connected to a database hosting sequencing data from the latest publications. Users can use BBrowser to analyze their own data or analyze the public data available.
The software allows scientists, even ones without programming experience, to quickly investigate massive amounts of sequencing data from in-house and published work and compare them together. All data submitted by users and data downloaded from the BBrowser database is stored and secured on the local computer.
The application was first released in October of 2018, running on Windows, macOS, and Ubuntu.
1.2. System requirements
Operating system:
- Windows: 7 or higher (64-bit only)
- macOS: OS X El Capitan (10.11) or higher
- Linux: Ubuntu 18.04 or CentOS 7
Internet connection:
- Ethernet connection (LAN) or a wireless adapter (Wi-Fi)
Free disk space:
- Minimum 10 GB
- Recommended 20 GB or more
RAM:
- Minimum 8 GB
- Recommended 16 GB or above
- For an optimal experience with large datasets containing more than 100,000 cells or performing analyses from raw data (FASTQ files), please use a computer with the recommended configuration.
- If you experience difficulties in running the software even after fulfilling all these requirements, please contact email@example.com
2. Installation Guide
2.1. Ubuntu 18.04
The application is only available on Ubuntu 18.04 (Bionic Beaver).
After downloading the .deb package, you can install the software through either the graphical user interface or the command-line interface, as follows:
Graphical User Interface:
- Navigate to the location of the downloaded .deb file, double click on the file or right-click and select “Open with Software Installer”.
- Click on the “Install” button on the Software Installer window.
Command Line Interface:
- Open a terminal with Ctrl + Alt + T, navigate to the location of the downloaded .deb file, and run sudo dpkg -i <name of the package>.deb to install the software.
- If the installation reports dependency errors, run sudo apt install -f to resolve the missing dependencies and complete the installation.
Please note that the software may open a local web socket to execute R commands for certain analyses. Internal tcp permissions for port 9004 need to be available for the software to run properly.
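If you suspect that port 9004 is blocked or already taken by another program, a quick check can help narrow the problem down. The snippet below is a minimal sketch, not part of BBrowser: the function name port_is_free is hypothetical, and it only tests whether a local process can bind to the port, which is an assumption about what "available" means here.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a local TCP socket can be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Another program is using the port, or binding is not permitted.
            return False
        return True


if __name__ == "__main__":
    # 9004 is the port named in this guide.
    print("Port 9004 can be bound locally:", port_is_free(9004))
```

Run the check while BBrowser is closed. If it prints False, another application or a local security policy may be occupying or blocking the port.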
2.2. macOS
To install BBrowser on macOS, after downloading the installation package:
- Navigate to the downloaded .dmg package.
- Double click the .dmg package to open the installation window.
- Follow the on-screen instructions to complete the installation process.
2.3. Windows
There are two options for running BBrowser on Windows: the portable version or the installed version.
The portable version does not require any installation and is only available for Windows. Users who do not have the necessary privileges to modify the system registry can download this version to use instantly. There is no difference between the portable version and the installed version of the BBrowser in terms of the interface and functionality.
The portable version is provided in a zipped file. After downloading, you need to unzip the file and double-click on it to run the software.
Alternatively, to launch BBrowser, users can run the executable binary, BBrowser2.exe, located in a folder called BBrowser2-win32-x64.
If you want to move the software storage, you must move the entire BBrowser2-win32-x64 folder. Modifying or removing any files in this folder may cause BBrowser to stop working. In either case, your single-cell data is not affected as it is stored in a different location.
If you want to install BBrowser to your computer, download the installer .exe file and run it with administrator permission. An installation window will guide you through the installation.
Although the program can be installed anywhere on your computer, we highly recommend putting it in the usual Program Files folder. However, this requires administrator permission.
If your computer has more than one account using the software, each account can only access its own data.
2.4. CentOS 7
To install BBrowser on CentOS 7, you first need to install some dependencies:
yum install libgfortran libXScrnSaver
Then, use the following command to install BioTuring Browser:
rpm -iU BBrowser-xxx.x86_64.rpm
Please replace “xxx” with the version that you downloaded. Installation of the software and its dependencies may require root access. After installation, BBrowser can be found in Applications > Accessories.
3. Program Interface
3.1. Login page
The BBrowser Login page appears when you open the software for the first time. Once the software has successfully recorded your credentials, it will log in automatically the next time you start BBrowser.
Please enter your credentials and declare your academic or non-academic status to access the corresponding set of features, then press Enter or click Login.
Log in credentials are encrypted and stored individually for each user if multiple users are using the same computer.
If you are using a network with proxy, please configure Proxy settings at this point, before any connection to BioTuring server is made. The software needs to have correct proxy settings in order to connect to our server and verify your credentials, as well as to get access to our public database.
3.2. Home page
BBrowser Home page shows you all data that you can download to the local computer, including public data from BioTuring server and data from your remote repositories.
From BBrowser 2.1.3, the Home page offers another feature for you to look for gene expression levels across studies in the BioTuring database.
On the Search studies tab, choose the library you want to access by the drop-down on the top left. There are 2 libraries available:
- Public studies: list of scRNA-seq data from publications, curated by BioTuring team.
For details about this database, please refer to Section 4 of this document.
Internet connection is needed to download data from BioTuring server. Once you are connected to the internet, new datasets will be automatically updated.
- Remote repository: list of studies available on your organization shared network.
To access and download data from here, please configure your network on the Settings page.
To search for single or multiple gene expression across all studies in BioTuring public database, click on the Search genes tab.
3.3. Data page
BBrowser Data page shows you all data that you have downloaded or submitted to the local computer. You can refer to this page as your local database. You also need to go here to submit a new dataset.
- Clicking Add new study opens a pop-up window for data submission.
- Hovering over a study highlights it, and clicking the study opens it in the Analysis dashboard for exploration.
- By default, all datasets are sorted by Last modified date. You can also sort them by title, species, size, or number of cells.
- Rename and Delete buttons are available for each dataset.
- The search box (Find your study) on the right can search for the study’s title, species, date, etc. However, there is no filter box or tags available in the Data page.
3.4. Settings page
BBrowser Settings page helps you to:
- Check for software version
- Activate and view your license and its expiration date
- Change proxy server settings
- Configure the shared network for remote repository data hosting
- Manage tags to classify and organize the data you share
- Change data storage location
- Choose to launch app automatically after computer login
- Allow sending log data to BioTuring server
3.5. Analysis dashboard
When you click on a study on Data page or click on Explore a study from Home page, that dataset will open in the Analysis dashboard.
Here is where you can visualize the data and perform all the analyses.
The main visualization is a scatter plot of dimensionality reduction, with each point representing a single cell. Cell color, size, and shape change when you run different analyses. The scatter plot is interactive, allowing you to zoom, move, rotate (in 3D mode), or select cells.
Inside the main visualization window are some function boxes:
- Info box: shows a hierarchical list of sub-clusters, the cell search/ cell type prediction result and gene information. The cluster box can be minimized by clicking on the main cluster, the 2 other boxes can be closed by clicking on the (x) icon below the dialogues.
- Gene query: allows searching for single/dual gene expression by gene name or Ensembl ID
- Mini map: shows the whole cell population of the study with selected cells highlighted when the main scatter plot is showing gene expression or a sub-cluster
- Navigation tools: a list of interactive visualization tools like zoom, 2D/ 3D, move, pan select, lasso select and reset the scatter plot. At the bottom is access to the clonotype dashboard, gene gallery and scatter plot export.
On the right of the main visualization window are the main function tabs.
Each tab comes in a small window that can be expanded or collapsed. These tabs either give you more insight into the data or provide additional visualization:
- Color by: This tab controls the color of the scatter plot. By changing the way cells are colored, you can visualize different clustering/ cell annotation results. You can import your annotation matrix from a file in this tab.
- Shape by: This tab controls the shape of the cells on the scatter plot. When this tab is activated, the cell’s shape will be changed from a dot to a number or letter.
- Composition: This tab shows the percentage of different groups of cells in a selected population. When this tab is activated, cells in the selected population are colored in the scatter plot while unselected cells are left in gray. This tab also helps you run differential expression analysis.
- Differential expression: This panel helps you freely select 2 groups of cells and run differential expression analysis between them.
- Marker genes: This tab runs the marker gene search and presents its results.
- Enrichment analysis: This tab runs the enrichment analysis and presents its results.
At the bottom of the function tabs is information about study input/output and the visualization and analysis settings.
The other two interfaces, the Sub-clustering dashboard and the Differential expression dashboard, are described in their own sections.
If you need help while doing analysis, press Alt (on Windows) or hover your mouse to the top left of the screen (on macOS) and click on Help to view our tutorials or to contact us.
4. BioTuring public database
Massive amounts of single-cell RNA sequencing data generated have opened avenues for exploration, yet also brought up new challenges to standardize data formats, systematically access transcription profiles of cell types across studies and integrate multiple datasets.
Hence, in BioTuring Browser, we have indexed published single-cell RNA sequencing data from multiple formats to our platform to remove that barrier. All data are processed and annotated to be instantly accessed and explored in a uniform visualization and analytics interface.
In addition to that, we have developed our set of marker genes for over 200 cell types and use that gene list to verify the author’s annotations and re-label the cell types to BioTuring cell ontology to systemize cell types available in our database.
Users can also query the expression of one or multiple genes across all datasets in the database and see how the genes are expressed in different clusters without downloading any dataset.
The section below explains how we index published data and how the gene query across the database works.
4.1. Curation method
Step 1. Data collection
Single-cell gene expression matrices or Seurat/Scanpy objects are obtained from the authors or public repositories. If Seurat or Scanpy objects are available, we preserve the analysis results and move directly to the annotation step (Step 6).
Step 2. Filtering and normalization
Cells and genes from the submitted matrices are filtered to avoid drop-out, doublets, and apoptotic cells. Data are then subjected to log normalization and highly variable genes selection. QC criteria are subject to the authors’ descriptions.
If details of the filtering and normalization are not available, we process the data ourselves to obtain results as close as possible to those in the publication.
Step 3. Batch effect correction
We follow the methods used in each study. If not provided, we will apply CCA correction.
Step 4. Dimensionality reduction
We use the first 30 components of PCA to calculate 2D and 3D t-SNE or UMAP, with parameters taken from the authors' descriptions.
Step 5. Clustering
The dataset will go through both graph-based clustering by the igraph package (Csardi and Nepusz, 2006) and k-means clustering (Neter et al., 1998).
Step 6. Annotation and standardization of cell type labels
Cell type annotation matrices are obtained from authors and loaded in BioTuring Browser, together with metadata of the experimental design. We then manually verify cell type annotations using known markers and unify the terminology based on our internal cell ontology.
If annotation and metadata are not available, we will extract information directly from the publications.
4.2. List of studies
Users of BBrowser can view all studies in the public database when opening the Home page of the software. You can also access the list of studies on the BioTuring website: https://bioturing.com/bbrowser/datasets
We select the studies to index based on the needs of our users and community.
If you have a study of interest and would want it to be indexed by BioTuring team, please contact us at firstname.lastname@example.org
If you are an author, we are very happy to distribute your data on BBrowser for public access. Please also contact us at email@example.com
4.3. Query gene expression across the database
In version 2.1.3, we introduced a search engine that lets you look at the expression of one or multiple genes across every public dataset in BBrowser.
BBrowser is currently connected to 126 studies with a total of more than 5.5 million cells. Without downloading anything from the server, the gene search engine lets you skim through a huge amount of information in the most efficient way.
You can find the gene search engine in Home page > Search genes tab
- Type in or copy and paste your list of genes to the search box.
- Type in key words about the tissue, disease or studies you’d want to look for
- Click on the Search button to start searching
If you search for one gene, the result of this search engine is a series of violin plots, each of which is the gene expression in a public dataset of BBrowser. On the plot:
- x-axis: By default, this is the graph-based clustering result. You can change it to any annotation in that dataset by clicking on the annotation name and selecting another category from the drop-down.
- y-axis: This is the log-2 of the expression value. The unit depends on the kind of data provided by the authors. In most cases, this is the UMI count.
All violin plots are interactive. You can hover your mouse over the plot to get the statistics (e.g. quantiles, median, mean), or drag to enlarge an area of the plot. Double-clicking any part of the plot restores the original view.
On the top right of each dataset, there is a horizontal bar telling the percentage of cells that express the gene. The search result is sorted descending based on this number.
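The ordering described above can be illustrated with a small sketch. This is not BBrowser's implementation: the function names and the toy data are hypothetical, and a cell is assumed to "express" a gene when its count is greater than zero.

```python
def percent_expressing(values):
    """Percentage of cells whose expression value is greater than zero."""
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if v > 0) / len(values)


def rank_studies(expression_by_study):
    """Sort studies by the percentage of expressing cells, highest first."""
    scored = {
        study: percent_expressing(values)
        for study, values in expression_by_study.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy per-cell counts of one gene in three made-up studies.
    toy = {"study_a": [0, 0, 1, 2], "study_b": [1, 1, 1, 0], "study_c": [0, 0, 0, 0]}
    for study, pct in rank_studies(toy):
        print(f"{study}: {pct:.1f}% of cells express the gene")
```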
Information about the study and option to Download are the same as in the Search studies tab.
If you search for multiple genes, the results will be a series of heatmaps, each of which is from one dataset. Each heatmap shows:
- x-axis: By default, this is the graph-based clustering result. You can change it into other annotations in that dataset.
- y-axis: These are the genes you queried, in no specific order.
- The log-2 expression value is shown by the color scale. The unit depends on the kind of data provided by the authors. In most cases, this is the UMI count.
5. Get your data
You can get a dataset on BBrowser by downloading it from BioTuring server or from your internal server or by importing the data from your local computer.
Currently, BBrowser supports analyzing data from human (Homo sapiens) and mouse (Mus musculus). If you input data from any other species, the software can still process the data (except for the transcript quantification step). However, some features that rely on gene information, such as gene-set enrichment analysis and the gene functional reminder, will be disabled.
5.1. Search and Download a public study
BioTuring Browser hosts a public database of published studies that are selected, processed, verified and uniformly labeled by the BioTuring team. You can view the list of studies in this database in the BBrowser Home page.
To download a study from BBrowser public database, you need to be connected to the internet and follow these steps.
To search for studies by general terms such as title, authors, and other keywords:
- Go to Home page > Search studies and choose BioTuring database for public data.
- Use the Search box to look for your studies by using the title, authors' name and other keywords.
- Filter by tags: Search for studies by using some tags that are pre-defined by BioTuring team. You can find the tags below each study, indicating the field of research, tissues and related diseases.
- Filter by Bioturing - Cell type: Search for studies that contain a specific cell type of interest.
- Search results can be sorted by created date, study title, number of downloads, number of cells, and the size of the data.
- Each study comes with the author’s information, abstract, species used, and GSE number.
To search for studies that express your gene(s) of interest, go to Home page > Search genes:
- Type a gene name or a list of genes in the Search genes box on the left. To narrow the results, type keywords related to the studies of interest in the Search box on the right. Then click the Search button.
- Results are sorted by the expression of the queried gene(s). Scroll down to find the study you want, then click the Download button, or use Capture to save an image of the study's plot.
- For expression in specific groups within a study, scroll down to the corresponding plot, or see Section 4.3 (Query gene expression across the database).
Download the data
- Click Download to get the study to your local computer.
- Once the study is downloaded, click on Explore to open the study or Redownload to get the most updated version of the study
- The Redownload button removes the old original data of a study and replaces it with the updated version. Analyses that have been run on the study (differential expression analysis, sub-clustering, etc.) will be kept, but your annotations will be lost.
5.2. Import FASTQ file
To import a single-cell RNA sequencing study with raw data, you need to provide a folder containing all your FASTQ files.
- Go to the Data page -> Click on Add new study. On the popup window, click on the Raw sequencing files
- Move the folder containing your FASTQ files into the input box or click anywhere inside the box to open File Explorer and select the input folder.
- Type in the dataset title and choose the reference index compatible with your data and the sample preparation platform used.
- Define your quality control parameters and the dimensionality reduction plot you want to view.
- Finally, click on Start.
- BBrowser will start processing the data and a bar will appear to show you the progress.
- Once the processing is done, the Analysis dashboard will open with the data you imported.
- FASTQ files can be unzipped (.fastq/.fq) or zipped in gzip (.fastq.gz/.fq.gz) format.
- All input FASTQ files need to be in the same folder and at the same level (no subfolder).
- Currently, we only support paired-end reads that are in two distinct FASTQ files. The software automatically pairs the FASTQ into runs based on the file names. Please make sure that two FASTQ files of one pair have the same prefix, and it is different from other pairs’ prefix. For example, nuclei_900_S1_L001_R1_001 and nuclei_900_S1_L001_R2_001 will make a pair.
- BBrowser supports mouse mm10 and human 88.p12/ GRCh37 index.
- Downloading the reference index also requires sufficient storage space. Depending on the reference index you choose, the software will notify you of the free space needed (5 GB on average for each reference file). You only need to download the reference index once, when importing FASTQ files for the first time. After that, the reference is stored on your computer and reused whenever needed.
- The alignment and quantification process is run by Hera-T (Tran et al. 2018) and needs at least 12 GB of RAM. Hera-T supports raw data prepared with 10X Chromium Chemistry V2 and V3. The output of Hera-T automatically goes through the analysis pipeline as a single MTX file and can be exported from the Analysis dashboard.
- Raw data from single-nuclei sequencing experiments cannot be processed by Hera-T.
- To edit or remove the file you have imported, hover your mouse over the file name. A gray box will appear around the data together with Edit and Remove options.
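Before importing, it can be useful to check that every FASTQ file in the folder has a partner. The sketch below groups file names into pairs by shared prefix. It is only an illustration: the _R1/_R2 naming pattern is inferred from the example file names above, and the function name pair_fastq is hypothetical. BBrowser's own pairing logic may differ.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <prefix>_R1[_NNN].fastq[.gz] and <prefix>_R2[_NNN].fastq[.gz]
FASTQ_RE = re.compile(
    r"^(?P<prefix>.+)_R(?P<read>[12])(?P<suffix>(_\d+)?)\.(fastq|fq)(\.gz)?$"
)


def pair_fastq(file_names):
    """Return (pairs, unpaired): pairs maps a run name to (R1 file, R2 file)."""
    groups = defaultdict(dict)
    for name in file_names:
        match = FASTQ_RE.match(name)
        if match is None:
            continue  # not a FASTQ file under the assumed naming scheme
        key = (match.group("prefix"), match.group("suffix"))
        groups[key][match.group("read")] = name

    pairs, unpaired = {}, []
    for (prefix, suffix), reads in sorted(groups.items()):
        if "1" in reads and "2" in reads:
            pairs[prefix + suffix] = (reads["1"], reads["2"])
        else:
            unpaired.extend(reads.values())
    return pairs, unpaired


if __name__ == "__main__":
    names = [
        "nuclei_900_S1_L001_R1_001.fastq.gz",
        "nuclei_900_S1_L001_R2_001.fastq.gz",
    ]
    print(pair_fastq(names))
```

Any file reported as unpaired should be renamed or removed from the input folder before starting the import.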
5.3. Import Expression matrix (MTX, TSV, CSV)
BBrowser supports importing expression matrices as MTX, TSV, and CSV files with integer counts.
- scRNA-seq data can be imported in any of these file formats.
- CITE-seq data can only be imported as MTX.
The expression matrix files can be uncompressed or compressed with gzip.
5.3.1. Import MTX file(s)
To import a study from one or multiple MTX datasets, you need to provide, for each dataset, a folder with exactly 3 files: genes.tsv (or features.tsv), barcodes.tsv, and matrix.mtx.
When multiple folders containing data from multiple batches are submitted, options for selecting batch correction methods will be available.
- Go to Data page -> Click on Add new study. On the popup window, click on Count matrix tab
- Choose the input folders by dragging and dropping them into the input box.
- Or you can click the + button and select the file format as MTX. A File Explorer window will open. Navigate to the directory and select the input folder(s) containing your 3 files: genes.tsv (or features.tsv), barcodes.tsv, and matrix.mtx.
- Choose the species of your data, the batch correction method preferred (optional), quality control parameters and the dimensionality reduction plot you want to view.
- Type in the dataset title and finally, click on Start.
- Once the processing is done, the Analysis dashboard will open with the data you imported.
If multiple folders were submitted, the Analysis dashboard includes an input metadata classification whose cluster names are the input folders' names. This lets you color and shape the cells according to the batch they come from.
The three files barcodes.tsv, features.tsv (or genes.tsv), and matrix.mtx are the standard files from 10X CellRanger. Below, we describe some more details of the data format that will affect the analysis.
- matrix.mtx: This is the sparse version of the expression matrix. Currently, BBrowser only accepts UMI counts or other non-negative values. Centered values are not allowed and will not pass the analysis pipeline.
- barcodes.tsv: Each line in this file is a barcode. The barcodes sometimes end with a number, e.g. AAACCTGAGGGTCTCC-1, which serves as a subject identifier. However, demultiplexing is not available in BBrowser, which means metadata cannot be generated from the subject identifiers. For multiplexed sequencing, users are strongly recommended to provide an annotation table after submitting the data.
- features.tsv (or genes.tsv): This file describes the rows of the count matrix. Originally, these were the genes. With the recent introduction of multi-omic protocols, there can now be multiple feature types (the 3rd column of the file) or even multiple species (prefixes in the 1st column). BBrowser only reads Gene Expression and Antibody Capture features. If the file contains more than one species, only the species that appears first in the file is selected.
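To confirm that an input folder contains the three expected files before submitting it, a short check such as the following can be used. It is a sketch under assumptions: the function name check_mtx_folder is hypothetical, gzip-compressed files are assumed to carry a .gz extension, and the check reports missing files only (it does not reject extra files).

```python
from pathlib import Path

# Accepted base names for each required file, as listed in this section.
REQUIRED = {
    "matrix": ("matrix.mtx",),
    "barcodes": ("barcodes.tsv",),
    "features": ("features.tsv", "genes.tsv"),
}


def check_mtx_folder(folder):
    """Return {role: file name found}; raise FileNotFoundError if one is missing."""
    present = {p.name for p in Path(folder).iterdir() if p.is_file()}
    found = {}
    for role, candidates in REQUIRED.items():
        # Accept both plain and gzip-compressed variants (assumed ".gz" suffix).
        options = [name + ext for name in candidates for ext in ("", ".gz")]
        hits = [name for name in options if name in present]
        if not hits:
            raise FileNotFoundError(f"{folder}: missing {' or '.join(candidates)}")
        found[role] = hits[0]
    return found
```

For example, check_mtx_folder("batch_1") returns the file chosen for each role, or raises an error naming the missing file. The folder name batch_1 is a placeholder.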
5.3.2. Import TSV, CSV file(s)
To import a study by single or multiple CSV/ TSV files:
- Go to Data page -> Add new study. On the popup window, click on Count matrix
- Choose the input files by dragging and dropping them into the input box.
- Or you can click the + button and select the file format as TSV, CSV. A File Explorer window will be opened. Navigate the directory and select your input file(s). Make sure that all files submitted are in the same format (either TSV or CSV)
- Choose the species of your data, the batch correction method preferred (if you submit multiple files), quality control parameters and the dimensionality reduction plot you want to view.
- Type in the dataset title and finally, click on Start.
A .tsv or .csv file is simply a table in which values are separated by a delimiter: a tab (in .tsv) or a comma (in .csv). Table editors such as Excel, LibreOffice, or Google Sheets can export your table to either .csv or .tsv format.
BBrowser requires a strict format in order to parse the information correctly. Please make sure that the first column of the table has the gene names / Ensembl identifiers, and the first row of the table has the barcodes.
For users who want to export a matrix using R, please be careful: writing a matrix with write.table can drop the first cell of the first row. For example, given a matrix object with 1000 rows and 500 columns:
num [1:1000, 1:500] 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:1000] "ENSG1111111111" "ENSG1111111112" "ENSG1111111113" "ENSG1111111114" ...
..$ : chr [1:500] "CTGGTCCGGTGTTATCAG" "TTACTGGGACGACTCGGG" "ACGAGGAGACCCGAGATA" "CTTTGCAGTAGGGGCAAC" …
Writing a .csv file with the following command loses the first cell (the header row contains only 500 values, while every other row contains 501):
write.table(m, 'matrix.csv', sep=",", col.names=T, row.names=T)
Please use this command instead. It is much easier:
For a .tsv file, the simplest approach is to use write.table as usual, then manually insert one tab at the beginning of the first row.
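For users who prepare the table in Python rather than R, the same layout (gene identifiers in the first column, barcodes in the first row, and an empty top-left cell so that every row has the same number of fields) can be written with the standard csv module. This is a minimal sketch. The function name write_matrix_csv and the toy values are illustrative only.

```python
import csv


def write_matrix_csv(path, genes, barcodes, rows, delimiter=","):
    """Write a genes-by-cells count table with an empty top-left corner cell."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        # Header row: empty corner cell followed by one barcode per column.
        writer.writerow([""] + list(barcodes))
        for gene, values in zip(genes, rows):
            # Each data row: gene identifier followed by its counts.
            writer.writerow([gene] + list(values))


if __name__ == "__main__":
    write_matrix_csv(
        "matrix.csv",
        genes=["ENSG1111111111", "ENSG1111111112"],
        barcodes=["CTGGTCCGGTGTTATCAG", "TTACTGGGACGACTCGGG"],
        rows=[[0, 3], [1, 0]],
    )
```

Passing delimiter="\t" writes the same table as a .tsv file.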
5.4. Import Seurat, scanpy object
BBrowser supports importing processed scRNA-seq and CITE-seq data by Seurat (.rds) and Scanpy objects (.h5ad/ h5) with integer counts (raw counts or rounded counts).
- Go to Data page -> Add new study. On the popup window, click on Processed data.
- Move the file you want to import into the input box, or click anywhere inside the box to open File Explorer and select the input file.
- Data format will be automatically adjusted for the compatible input file.
- Choose the species of your data.
Quality control parameters and the dimensionality reduction method are not needed because these steps have been done on the Seurat/ scanpy object.
- Type in the dataset title and finally, click on Start.
A Seurat or scanpy object must contain an expression matrix with information on barcodes and genes. BBrowser can also adopt some analysis results in the object. These results include, but are not limited to:
- Integrated expression matrix with batch effects corrected
- PCA results
- t-SNE/UMAP coordinates
- Graph-based clustering results
- Metadata of the cells
Upon receiving the Seurat or Scanpy object, BBrowser reads all available data and runs analyses to fill in the missing information.
BBrowser can read a Seurat object stored in .rds format. To create a .rds file from Seurat, you can use the saveRDS function in R. We will not go into detail about the structure since the software does not require any specific modification of the original Seurat structure. The most critical information in each object is the count matrix, which should be stored in @assays$RNA@counts for gene expression data and @assays$ADT@counts for antibody capture data.
For users who analyze with Python via the scanpy library, the final AnnData class should be stored in .h5/.h5ad format using the .write function within the class itself. Unfortunately, hdf5 is too general and there are many variations of the structure in which the information is recorded. BBrowser expects the following structure:
- Expression matrix (required): a parse matrix at X or X. It can be a normal or a sparse matrix. For a sparse matrix, it must have the 3 standard columns: indices, indptr, and data.
- Gene IDs (required): a column named gene_ids or index at var or var
- Barcodes (required): a column at obs/index
- PCA (optional): a data.frame at obsm/X_pca
- t-SNE (optional): a data frame at obsm/X_tsne
- UMAP (optional): a data frame at obsm/X_umap
- Metadata (optional): a data frame at obs
- Graph-based clustering (optional): a column in metadata named louvain
- BBrowser reads all metadata with fewer than 50 categories. If you have already annotated your data, you can add the annotation to the @meta.data slot of the object.
- BBrowser does not support importing multiple objects. Please combine your batches into one object before importing it into the software.
- Other single-cell object formats can be converted to Seurat objects by following the tutorial from the Satija lab: https://satijalab.org/seurat/v3.1/conversion_vignette.html
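To anticipate which metadata columns will be read, you can count the distinct values in each column before exporting the object. The sketch below applies the "fewer than 50 categories" rule to a plain dictionary of columns. The function name and the dictionary layout are assumptions for illustration, not BBrowser's internal logic.

```python
def importable_metadata(metadata, max_categories=50):
    """Return the names of columns with fewer than max_categories distinct values."""
    return [
        name
        for name, values in metadata.items()
        if len(set(values)) < max_categories
    ]


if __name__ == "__main__":
    # Toy metadata: one column per annotation, one value per cell.
    toy = {
        "cell_type": ["T cell", "B cell", "T cell", "NK cell"],
        "barcode": ["AAAC", "AAAG", "AAAT", "AACA"],
    }
    print(importable_metadata(toy, max_categories=4))
```

Columns with many unique values, such as per-cell identifiers, would be skipped under this rule.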
6. Data analysis pipeline
We are fully aware that different datasets are generated under different experimental designs and may need to be treated individually in order to represent all biological variation in the samples and, for public studies, to reproduce the published results as faithfully as possible. The long-term plan for BioTuring Browser is therefore to maintain speed and ease of use while increasing the flexibility of the analyses. All public datasets and imported data go through the same pipeline, whose steps are discussed in this section.
6.1. Transcript quantification
Transcript quantification is only applied when you create a new study with raw sequencing files (FASTQ). The process is run by Hera-T (version 1.2.0) (Tran et al. 2018), a new algorithm developed by BioTuring team. This is applied to data generated by 10X protocol on Chromium v2 and v3. The processing speed is up to 10 - 100 times faster than CellRanger 3.0 with better accuracy (Tran et al. 2018). The output of transcript quantification is an expression matrix in MTX file format and the file will be submitted for further processing steps below.
6.2. Quality control
The process from quality control to dimensionality reduction is applied to public and in-house datasets imported in MTX, TSV or CSV files.
Quality control filters out poor-quality cells in terms of gene expression and redundant non-expressed genes in the data.
In public datasets without a detailed processing script from the authors, genes having at least 1 UMI count in fewer than 3 cells are excluded. Then, cells with fewer than 200 genes having at least 1 UMI count and more than 5% of mitochondrial genes are excluded. The process creates a new expression matrix that may have fewer cells than the original data, and BBrowser only takes the cells and genes of this filtered matrix for the next processing steps.
For in-house data, BBrowser allows users to define the cut-off for quality control or to skip any filtering steps. In the data import pop-up, you can
- untick the checkbox before any filtering criteria to skip that step
- change the number of cells/ genes/ mitochondria gene ratio/ top variable genes to apply the new parameters to the filtering step.
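The default filtering described in this section can be sketched in a few lines of plain Python. This is an illustration, not BBrowser's code, and it rests on several assumptions: the mitochondrial criterion is read as the fraction of a cell's counts that come from mitochondrial genes, mitochondrial genes are recognized by an "MT-" name prefix, a cell is removed if it fails either criterion, and the function name quality_control is hypothetical.

```python
def quality_control(counts, gene_names, min_cells=3, min_genes=200,
                    max_mito_fraction=0.05, mito_prefix="MT-"):
    """Filter a genes-by-cells count table.

    counts[g][c] is the count of gene g in cell c.
    Returns (kept_gene_indices, kept_cell_indices).
    """
    n_cells = len(counts[0]) if counts else 0

    # Step 1: keep genes detected (count >= 1) in at least min_cells cells.
    kept_genes = [
        g for g, row in enumerate(counts)
        if sum(1 for v in row if v >= 1) >= min_cells
    ]

    # Step 2: keep cells with enough detected genes and a low mitochondrial share.
    kept_cells = []
    for c in range(n_cells):
        detected = sum(1 for g in kept_genes if counts[g][c] >= 1)
        total = sum(counts[g][c] for g in kept_genes)
        mito = sum(
            counts[g][c] for g in kept_genes
            if gene_names[g].upper().startswith(mito_prefix)
        )
        mito_fraction = mito / total if total else 0.0
        if detected >= min_genes and mito_fraction <= max_mito_fraction:
            kept_cells.append(c)
    return kept_genes, kept_cells
```

The default arguments mirror the cut-offs quoted above (3 cells, 200 genes, 5%). Changing them corresponds to editing the parameters in the data import pop-up.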
6.3. Batch effect removal
This process is applied when multiple MTX, TSV or CSV files are submitted, usually from multiple batches of sample preparation and sequencing. The software considers each file as a batch and will try to scale all batches with the chosen method.
Currently, 3 methods are available to remove batch effects:
- MNN correction (Haghverdi et al. 2018): This method is based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space, whose dimensions are the highly variable genes detected by the Seurat package (Gribov et al. 2010). To select the initial group, unbiased graph-based (Louvain) clustering is run on the first 30 components of the PCA results to sort the batches in order, then the first batch in that order is selected. The method assumes that at least a subset of the population is shared by all the batches. It is effective on repeated measurements, but it is computationally expensive compared to the other methods.
- CCA (Butler et al. 2018): This method is widely used in scRNA-seq via the Seurat package (Gribov et al. 2010). The idea of this method is to use an adaptive version of manifold alignment and anchor all the batches in this adaptive space. It is simple, fast, and at the same time very effective when applied to data from different technologies.
- Harmony (Korsunsky et al. 2019): This method is similar to CCA since it also projects cells into a shared embedding. But by considering cell type rather than dataset-specific conditions, it can simultaneously account for multiple experimental and biological factors. The applications and speed of this method are comparable to those of CCA.
6.4. Dimensionality reduction
On BBrowser, you can choose to run dimensionality reduction by t-SNE or UMAP.
- The first 30 components of the PCA are used for calculation.
- The perplexity can be changed under Dimensionality reduction settings.
- The same parameters are applied to both the initial processing and the sub-clustering of a dataset (except for sub-clustering of fewer than 100 cells, where a perplexity of 5 is applied).
t-SNE (Maaten and Hinton 2008):
The analysis is done by the Rtsne package (Krijthe 2015). The default perplexity for t-SNE is set at 30.
UMAP (McInnes and Healy 2018):
The analysis is done by the uwot package (Melville 2018). The number of neighbors is set at 30.
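The parameter rules listed in this section can be summarized as a small helper. This is only a sketch of the stated defaults; the function name `embedding_settings` and the dictionary keys are hypothetical and do not correspond to any BBrowser, Rtsne or uwot API.

```python
def embedding_settings(n_cells, method="tsne", perplexity=30, n_neighbors=30,
                       n_pcs=30, is_subcluster=False):
    """Return the parameters that would be used for one embedding run."""
    settings = {"method": method, "n_pcs": n_pcs}
    if method == "tsne":
        # Sub-clustering of fewer than 100 cells falls back to perplexity 5.
        small_subcluster = is_subcluster and n_cells < 100
        settings["perplexity"] = 5 if small_subcluster else perplexity
    elif method == "umap":
        settings["n_neighbors"] = n_neighbors
    else:
        raise ValueError("method must be 'tsne' or 'umap'")
    return settings
```

For example, `embedding_settings(80, is_subcluster=True)` yields a perplexity of 5, while the same call for a full dataset keeps the value chosen under Dimensionality reduction settings.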
6.5. Clustering methods
This analysis runs on the PCA results. For every dataset, the software calculates both Louvain (graph-based) and k-means clustering.
- Louvain clustering: The graph-based method is done by the igraph package (Csardi and Nepusz 2006) with a flexible number of nearest neighbors. This number is no larger than 20 and is estimated by the elbow method of k-means clustering on the PCA results.
- k-means clustering: This method is run for a series of k values ranging from 2 to 10. The software pre-calculates and records the outcome for all k values so that users can instantly switch to a different k and view the clustering result in the scatter plot.
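The idea of pre-calculating k-means for every k so that switching is instant can be sketched as a simple cache. The code below uses a plain Lloyd's algorithm written for illustration; it is not the implementation used by BBrowser, and the names `kmeans` and `precompute_kmeans` are hypothetical.

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assign every point to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Move every non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def precompute_kmeans(points, k_min=2, k_max=10):
    """Cache the labels for every k so a viewer can switch k instantly."""
    return {k: kmeans(points, k)
            for k in range(k_min, k_max + 1) if k <= len(points)}
```

Here `points` would be the PCA coordinates of the cells (a list of tuples). After `cache = precompute_kmeans(points)`, showing the result for another k is just a dictionary lookup such as `cache[5]`.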
6.6. Finding marker genes
BBrowser uses a non-parametric approach, called Venice, to detect marker genes. It is an open-source algorithm that runs efficiently on large amounts of data and is reported to outperform other methods in accuracy (Hy et al. 2019).
We first defined marker genes of a group of cells in a data set as the genes that can be used to distinguish such cells from the rest. From this idea, we used the accuracy of classification as a metric to score the significance of a marker gene.
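Considering each gene separately, we denote a cell as $(x, y)$, where $x$ is its expression level and $y \in \{1, 2\}$ is the label of a group of cells: $y = 1$ if the cell is in the group of interest (group 1, the group that we want to find the marker genes for) and $y = 2$ if the cell is not in the group of interest (group 2, the rest of the data). We denote $\bar{i}$ as the complement group of $i$.
The probability for a cell being in group $i$, given its expression level $x$, is:
$$P(y = i \mid x) = \frac{P(x \mid y = i)\,P(y = i)}{P(x \mid y = i)\,P(y = i) + P(x \mid y = \bar{i})\,P(y = \bar{i})}$$
In most cases, the group of interest is much smaller than the rest of the data, which can generate a sampling bias. To avoid this bias of sample size, we set:
$$P(y = 1) = P(y = 2) = \frac{1}{2}$$
With equal priors, the accuracy of the classifier for a cell in group $i$ with expression level $x$ is:
$$P(y = i \mid x) = \frac{P(x \mid y = i)}{P(x \mid y = i) + P(x \mid y = \bar{i})}$$
The accuracy of prediction is:
$$A = \frac{1}{2} \sum_{i \in \{1, 2\}} \int P(y = i \mid x)\, P(x \mid y = i)\, dx$$
Intuitively, $0.5 \le A \le 1$. For the robustness of the calculation, we divide the expression into intervals:
$$A = \frac{1}{2} \sum_{j} \sum_{i \in \{1, 2\}} \frac{n_{ij}}{n_i} \cdot \frac{n_{ij} / n_i}{n_{ij} / n_i + n_{\bar{i}j} / n_{\bar{i}}}$$
Where $n_{ij}$ is the number of cells of group $i$ in interval $j$, and $n_i$ is the number of cells in group $i$. For each gene, we can estimate the accuracy measure for using this gene to predict cells inside or outside the cluster and use this as a metric for ranking the marker genes.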
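To make the interval-based accuracy concrete, here is a small sketch for a single gene. It is an illustration under stated assumptions, not the Venice or Signac implementation: it assumes equal priors for the two groups, equal-width intervals, and a classifier that assigns a cell to a group with probability proportional to that group's within-interval frequency. The function name `binned_accuracy` and the parameter `n_bins` are hypothetical.

```python
def binned_accuracy(expr_in, expr_out, n_bins=10):
    """Classification accuracy of one gene for 'in group' vs 'rest'.

    expr_in / expr_out: expression values of the gene in the group of
    interest and in the rest of the data. Returns a value in [0.5, 1].
    """
    lo = min(min(expr_in), min(expr_out))
    hi = max(max(expr_in), max(expr_out))
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant genes

    def frequencies(values):
        # Fraction of the group's cells falling in each interval (n_ij / n_i).
        hist = [0] * n_bins
        for v in values:
            j = min(int((v - lo) / width), n_bins - 1)
            hist[j] += 1
        return [count / len(values) for count in hist]

    p_in, p_out = frequencies(expr_in), frequencies(expr_out)
    accuracy = 0.0
    for a, b in zip(p_in, p_out):
        if a + b > 0:
            # Equal priors: each group contributes 0.5 * p * p / (p + q).
            accuracy += 0.5 * (a * a + b * b) / (a + b)
    return accuracy
```

A gene with identical distributions in both groups scores 0.5 (no better than chance), and a gene whose values never overlap between the groups scores 1; genes can then be ranked by this score.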
We tested Venice on both real and simulated datasets. The benchmark considered the performance on 2 different sequencing technologies (full-length and UMI count), 4 different kinds of marker genes (including transitional genes), and 2 different kinds of null genes. Venice exhibited the best performance and accuracy in all cases. It could effectively detect different types of marker genes and avoid false-positive results while keeping a modest running time.
Venice is also incorporated in Signac, a single-cell analytics package developed by BioTuring. The package is available at https://www.github.com/bioturing/signac
6.7. Gene set enrichment analysis
This analysis is adopted from the GSEA method (Subramanian et al. 2005), a common analysis for selecting potential biological terms given a sorted list of genes. The software performs GSEA on 4 different term categories: biological process, molecular function, cellular component, and biological pathway. The first 3 are from the Gene Ontology (Consortium 2004), and the last one is from the Reactome database (Joshi-Tope et al. 2005).
Enrichment analysis can be found in both the Analysis dashboard and the Differential expression dashboard.
- The Analysis dashboard: The gene list used for GSEA is the sorted list of marker genes. The genes are sorted in the Marker genes tab based on the p-value score discussed previously.
- The Differential expression dashboard: The gene list used for GSEA comes from the result of the differential expression analysis. Genes are sorted by p-values.
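The core of GSEA is a running-sum statistic over the sorted gene list. The sketch below shows the unweighted form of that statistic (the Kolmogorov-Smirnov-like special case of the method of Subramanian et al.). Which weighting BBrowser uses is not stated here, so this choice is an assumption, and the function name `enrichment_score` is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted GSEA-style enrichment score for one gene set.

    ranked_genes: genes sorted from most to least significant.
    gene_set: set of genes belonging to one biological term.
    Returns the running-sum value with the largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a member of the set, step down otherwise.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A term whose genes are concentrated at the top of the sorted list gets a score close to 1, and a term whose genes sit at the bottom gets a score close to -1.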
6.8. Cell-type prediction
This feature suggests a cell type for a group of cells. When a user makes a selection by clicking a cluster/annotation or using the Select cell tool, the software picks genes that are expressed in at least 35% of the group. Genes are not selected from the whole transcriptome, but from a list of cell-type markers in our curated knowledge base. The software then uses that gene profile to estimate the correlation with each cell-type profile. A cut-off of 0.5 is applied to remove unlikely candidates. The remaining cell types undergo a tree search to find their common parents. Parents that have less weight (e.g. distinct from the rest) are removed. This process is repeated until only one cell type is left. The whole analysis usually takes 1-3 seconds to finish, so it is triggered automatically.
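The first two steps of this procedure (picking expressed marker genes, then filtering candidates by correlation) can be sketched as follows. This is an illustration only. The structure of the knowledge base, the use of binary marker profiles and the use of Pearson correlation are assumptions made for the example, the tree search is not covered, and all function names are hypothetical.

```python
from math import sqrt

def expressed_markers(counts_by_gene, marker_genes, min_fraction=0.35):
    """Marker genes expressed (> 0) in at least min_fraction of the cells."""
    selected = set()
    for gene in marker_genes:
        values = counts_by_gene.get(gene)
        if values and sum(1 for v in values if v > 0) / len(values) >= min_fraction:
            selected.add(gene)
    return selected

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def candidate_cell_types(selected, knowledge_base, cutoff=0.5):
    """Keep cell types whose marker profile correlates with the selection."""
    universe = sorted(set().union(*knowledge_base.values()))
    observed = [1.0 if g in selected else 0.0 for g in universe]
    candidates = {}
    for cell_type, markers in knowledge_base.items():
        reference = [1.0 if g in markers else 0.0 for g in universe]
        r = pearson(observed, reference)
        if r >= cutoff:
            candidates[cell_type] = r
    return candidates
```

Here `knowledge_base` maps a cell-type name to its set of marker genes, and `counts_by_gene` maps a gene to its counts in the selected cells.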
6.9. Differential expression analysis
BBrowser supports finding differentially expressed genes between two groups of cells, where each group must have at least 3 cells. By default, it uses Venice, the same method used for finding marker genes. Users can switch to edgeR, a more common method that takes at least 5 times longer.
- Venice: The algorithm has been described previously under the section on marker genes. In this case, instead of comparing one group to the rest of the data, the software looks for genes that differentiate the two selected groups.
- Wilcoxon, Likelihood-ratio test, T-test, Poisson, Negative binomial and Logistic regression: These are DE methods available in the Seurat package (Gribov et al. 2010). Each method keeps the genes that are expressed in at least 45% of the cells in one group, excludes spike-in genes, and fits the count data to a statistical model to identify DE genes.
The Poisson and negative binomial tests should only be used for UMI count data. To test for significance, gene-wise statistical tests are conducted to produce the p-adjusted value for each gene.
For the log2FC value of each gene, we use the same method as the Seurat package (Gribov et al. 2010). Below is the detailed formula:
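As an illustration, the sketch below computes a log2 fold change following one convention used by Seurat for log-normalized data: the values are transformed back with expm1, averaged per group, and a pseudocount of 1 is added before taking the log2 ratio. Whether BBrowser uses exactly this variant is an assumption, and the function name `log2_fold_change` is hypothetical.

```python
from math import expm1, log2

def log2_fold_change(lognorm_a, lognorm_b, pseudocount=1.0):
    """log2FC of one gene between group A and group B.

    lognorm_a / lognorm_b: log-normalized expression values (natural log
    of normalized count + 1) of the gene in the cells of each group.
    """
    mean_a = sum(expm1(v) for v in lognorm_a) / len(lognorm_a)
    mean_b = sum(expm1(v) for v in lognorm_b) / len(lognorm_b)
    return log2(mean_a + pseudocount) - log2(mean_b + pseudocount)
```

A positive value means the gene is more highly expressed in group A than in group B; the pseudocount keeps the ratio finite when a gene is absent from one group.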
7. Adjusting data visualization
7.1. Visualization methods
Depending on the data available in your study, you can choose between several visualization methods:
- t-SNE plot based on gene expression
- UMAP plot based on gene expression
- t-SNE plot based on protein expression
- feature plot based on the expression of 2 genes or 2 proteins
By default, the main plot is calculated by gene expression. Go to the Dimensionality reduction method box at the bottom of the screen to switch between different methods.
To generate a feature plot, type in your gene/ protein of interest for the X and Y axes. Both axes must be either genes or proteins.
7.2. Interactive 2D - 3D plot
t-SNE/UMAP plots of gene expression can be viewed in 2D or 3D, while other plots are shown in 2D only.
You can interact with the plot by zooming in and out, switching between 2D and 3D, moving and rotating the plot, and resetting it to its original state.
On the bottom right corner of the scatter plot, there are several buttons that control the visualization as well as how a user can define a selection.
● Reset: this button resets the scatter plot to its original state, without any selection and with cells colored by the last clustering factor/annotation used.
● Pencil tool (lasso selection): this button activates the free selection mode.
● Hand tool: this button activates the navigation mode: moving and rotating the plot, as well as whole cluster selection.
● 2D / 3D: these buttons help you switch between the 2D and 3D scatter plot. Rotation is only enabled for the 3D plot. For a Seurat/scanpy object whose dimensionality reduction contains 2D but not 3D coordinates, BBrowser can calculate the 3D coordinates based on the PCA results, and vice versa.
● Zoom (plus/minus): these buttons help you zoom in and out. The point size of the scatter plot remains unchanged when zooming. Alternatively, you can use your mouse wheel to zoom.
● Capture: this button takes a screen capture of the current scatter plot with its cluster labels and exports it as an image.
7.3. Customize the plot
Users can customize theme, point size, transparency and color palette of the main plot.
- Go to Settings on the bottom right corner:
- Choose Visualization to change the appearance of the plot or Analysis to change the plot type.
- Define your preferences for visualization/settings, then click Apply.
Options for altering the scatter plot appearance include:
- Point opacity: Adjust the opacity and make the cells transparent. This option is helpful when you want to view a rare population with low density surrounded by cells from different groups. Since the density is low and the number of cells is small, the cluster might be hidden and only revealed when all cells are transparent.
- Point size: Depending on the screen resolution, you may need to adjust the point size to fit your screen. By default, the software automatically estimates the point size based on the screen size and resolution.
- Equal size: Data generated by UMI-count technologies (such as 10X Genomics) may have a lot of zeros (the dropout issue). When a user maps the expression values to the scatter plot, cells with an expression value of zero are shown in grey, and there will be lots of them. They can cover the non-grey cells (those with a non-zero expression value) and cause a misleading visualization. To tackle this problem, the software makes grey cells much smaller than the original size when a user queries a gene. If you turn this option off, the software will keep everything as is.
- Theme: This option is “Dark” by default (black background). This is the optimal theme for analysis. If you want to make a figure or give a presentation on a projected screen, you may consider using the “Light” theme (white background).
- Color palette: This option controls the way colors are assigned to groups of cells or gene expression.
7.4. Color by and Shape by
The Color by and Shape by tabs let you color and shape the cells in the scatter plot to your preference. You decide which group of clusters to visualize, which changes the way cells are colored and shaped. Cells with the same color and shape belong to the same cluster.
The software offers various classification methods: unbiased graph-based clustering, k-means clustering, classification by input metadata, or by your own definition and annotation. You can import your annotation matrix from a file in Color by tab.
Color by tab is always activated.
- Click on the drop-down to choose the clustering or annotation result you would want to visualize.
- Each cluster is shown with its name and number of cells, and is represented by a bar sharing the same color as the cells of that cluster in the scatter plot. The length of the bar depends on the number of cells in the cluster. The order of the clusters is the order in which they were input.
- Double-clicking on any cluster shows only the cells in that cluster and minimizes cells from other clusters. Double-clicking again shows the whole cell population of the study.
- You can also untick any cluster that you don’t want to see in the scatter plot.
Shape by tab is activated by choice.
When this tab is activated, the cell’s shape will be changed from a dot to a number or letter. With 2 layers of visualization (color and shape), you can view how one cell appears in 2 different classifications (cell type in patient, cell type in treatment, etc.)
- The selection of clustering/annotation results and of clusters to visualize works the same way as in Color by.
To save the scatter plot with the given type of cell’s color and shape, you can click on the download button at the bottom of the plot to save the image and its legends.
8. Query gene or protein expression
8.1. For a single gene or protein
To see how a gene or protein is expressed in the given dataset, type the gene/protein name, its Ensembl ID or an alias into the gene/protein query box at the top right corner of the scatter plot and press Enter.
Upon querying a gene/protein, BBrowser provides two ways to visualize its expression:
- Scatter plot showing expression levels across all cells: Cells will be colored based on their expression level of that gene or protein, according to the sequential color scale in Settings. Gene information will be displayed in the info box on top left corner.
- Violin plot, Box plot or Bar plot of expression levels across clusters: to create this kind of plot, click on the arrow below the color scale and click Plot.
- The x-axis of the plot contains the names of the clusters, which can come from a custom annotation or a clustering result, together with the ratio of cells having positive expression versus the total number of cells in the cluster.
- You can sort the clusters alphabetically, by the number of cells in the cluster, by the number of cells expressing the selected gene or by the mean expression of the selected gene. Click on the Settings icon on the top left corner and tick your preference.
- The Settings icon can help you further customize the violin plot. You can change the plot into a box plot or a bar chart. It also allows you to add data points, with each point representing a cell.
- Export and review options:
- You can save the gene/protein search to the Gene gallery for a quick review later: click on the arrow below the color scale and choose the Add to Gallery button.
- You can export the scatter plot in PNG format by clicking on the export icon at the bottom right corner of the screen.
- For violin plots, box plots, and bar plots, BBrowser supports exporting the figures in SVG format and the data to draw the figures in a TSV file. To do this, just click on the export icon at the top right corner of the plot.
- By default, the unit of gene expression is Log normalized value and the unit of protein expression is Normalized value (CLR normalization).
- You can change the unit of gene expression to Raw value or Log2 of raw value under Settings > Analysis > Gene expression unit.
If the expression values are stretched over a large range, you can choose to visualize only the 5th to 95th percentile of the data to eliminate outlier points.
Images exported when querying a gene's expression, showing all expression values (left) or the 5th-to-95th percentile of expression values (right).
- You can also make different violin plots for different annotations (e.g. conditions, patients, samples, or your own annotations). First, define the annotation in the Color by tab, then query genes/proteins and click Plot.
- To make a plot of only some clusters (not all clusters in an annotation), select only the clusters you want in the Color by tab and untick the other clusters before you query for gene expression.
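The effect of restricting the color scale to the 5th-to-95th percentile can be sketched as below. The sketch assumes that values outside the range are clamped to the range limits (rather than hidden) and that percentiles are linearly interpolated; both are assumptions, and the function names are hypothetical.

```python
def percentile(sorted_values, q):
    """q-th percentile (0-100) of a sorted list, with linear interpolation."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction

def clip_to_percentiles(values, low=5, high=95):
    """Clamp expression values to the [low, high] percentile range."""
    ordered = sorted(values)
    low_value = percentile(ordered, low)
    high_value = percentile(ordered, high)
    return [min(max(v, low_value), high_value) for v in values]
```

With a few extreme cells in the data, the clamped values spread the color scale over the bulk of the cells instead of letting the outliers compress it.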
8.2. For two genes or proteins
You can type in 2 gene/ protein names in the query box to simultaneously see their expression in the given dataset.
Cells will be colored based on the log2 ratio of the 2 genes/ proteins expressions, using the sequential color scale chosen.
If you now click on the Plot button, a density plot will appear, showing the expression of the two genes across the entire population. To get the density plot for a single cluster or a subset of clusters, select the clusters you want in the Color by tab and untick the other clusters before you query for gene expression.
You can change the density plot to a scatter plot by going to Settings and ticking the scatter plot option. By default, cells that express neither gene are excluded, but an option to include them in the plot is available. You can also drag to select any part of the plot that you want to view, with flexible maximum and minimum values.
Settings of expression unit, range and cell populations to be visualized are as instructed above.
To save the plot you made, you can click on the download button.
8.3. For multiple genes or proteins
From version 2.3.6, BBrowser supports viewing the expression of multiple genes or proteins, in the scatter plot or in a heatmap.
To query for multiple genes/ proteins, just continue to type the gene names or protein names after each other. Or, if you have a list of genes in a column of a worksheet, you can paste them right into the gene/protein query box.
You can query an unlimited number of genes or proteins. However, you cannot query a gene and a protein at the same time, due to different units and computation methods.
In the main scatter plot, cells are colored based on their gene set total expression level, according to the sequential color scale. The gene set total expression is the sum of the UMI counts of all the queried genes, divided by the sum of all UMI counts in the cell (Pont et al. 2019).
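The gene set total expression defined above is a simple ratio per cell. A minimal sketch, assuming each cell is given as a dictionary of UMI counts per gene (the function name `gene_set_total_expression` is hypothetical):

```python
def gene_set_total_expression(cell_counts, gene_set):
    """Sum of UMI counts of the queried genes divided by all UMI counts.

    cell_counts: dict mapping gene name -> UMI count for one cell.
    gene_set: iterable of queried gene names.
    """
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0  # empty cell: no expression to report
    return sum(cell_counts.get(gene, 0) for gene in set(gene_set)) / total
```

For a cell with counts `{"CD3D": 2, "CD3E": 3, "ACTB": 5}`, querying CD3D and CD3E gives 5 / 10 = 0.5.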
To generate the heatmap, click on the Plot button under the color scale.
By default, the heatmap shows the Z-scores of gene expression / protein expression measurements across the clusters. You can also click on the Settings icon at the top left corner to use expression values to draw the heatmap.
When saving the multiple genes/ proteins query to Gene gallery, you can name the search to your preference for easier review later.
Settings of expression unit, range and cell populations to be visualized are as instructed above.
9. Select a cell population
For other analyses, such as adding an annotation or viewing a compositional breakdown, you first need to select the cells. Selected cells are colored in white.
9.1. Hand tool & Pencil tool
The most common way to select cells is by using the Hand tool and Pencil tool.
- The Hand tool should be used to select cells that are already clustered. Choosing the Hand tool and clicking on a cluster selects all cells that belong to that cluster.
- The Pencil tool can be used to select any cells, whether they belong to one cluster or to multiple clusters. Use the Pencil tool to draw a selection border around the cells you want to select. If the border you draw is not closed, a straight line will be added to join the 2 ends.
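The lasso behavior described for the Pencil tool (an open border is closed by a straight line between its two ends) can be sketched with a standard ray-casting test. This is a generic point-in-polygon illustration, not BBrowser's code; the function name `points_in_lasso` is hypothetical.

```python
def points_in_lasso(path, points):
    """Indices of the points that fall inside a hand-drawn lasso.

    path: list of (x, y) vertices in drawing order; the last vertex is
    joined to the first one by a straight line, which closes the border.
    points: list of (x, y) cell coordinates in the scatter plot.
    """
    selected = []
    n = len(path)
    for index, (px, py) in enumerate(points):
        inside = False
        for i in range(n):
            x1, y1 = path[i]
            x2, y2 = path[(i + 1) % n]  # wraps around to close the border
            # Does a horizontal ray from the point cross this edge?
            if (y1 > py) != (y2 > py):
                x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                if px < x_cross:
                    inside = not inside
        if inside:
            selected.append(index)
    return selected
```

A point is inside when the ray crosses the border an odd number of times, which also works for irregular, non-convex lasso shapes.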
9.2. Color by & Shape by filters
You can also select cells that are already clustered/annotated from the Color by and Shape by tabs.
To select cells in one cluster:
- Go to the Color by tab and choose the group that contains the cluster you want to select, to make its clusters visible in the scatter plot.
- Single-click on the cluster name to select all cells in that cluster (the cluster bar will be highlighted by a gray box and the selected cells will turn white).
To select cells that belong to 2 clusters of 2 different classifications:
For example, cells that belong to cell type 1 in the cell type classification and to patient A in the patient classification.
- Go to Color by tab and choose the first classification. Choose your first cluster of interest by single-click.
- Go to Shape by tab and choose the second classification. Choose your second cluster of interest by single-click.
- Selected cells in white are cells that belong to both clusters.
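Selecting with both tabs amounts to intersecting two classifications. A minimal sketch, assuming each classification is a dictionary from cell ID to cluster name (the function name `select_in_both` and the data layout are hypothetical):

```python
def select_in_both(color_by, shape_by, color_cluster, shape_cluster):
    """Cells that are in color_cluster AND in shape_cluster.

    color_by / shape_by: dicts mapping cell ID -> cluster name in the
    first and second classification.
    """
    return [cell for cell, cluster in color_by.items()
            if cluster == color_cluster and shape_by.get(cell) == shape_cluster]
```

For example, with a cell type classification in `color_by` and a patient classification in `shape_by`, `select_in_both(color_by, shape_by, "cell type 1", "patient A")` returns the cells that would be highlighted in white.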
9.3. Select cells by gene expression
You can select cells that share the same expression level of a given gene:
- Type the gene name or its Ensembl ID in the gene query box.
- Click on the color scale to select all cells expressing that gene.
- Adjust the black dots at the ends of the color scale to select cells within the specified minimum and maximum level of expression when querying one gene, or within the specified range of the log2 ratio of dual gene expression when querying two genes.
- Or click on the maximum and minimum values and type in new values.
10. Cell type prediction tool
BBrowser's cell-type prediction tool takes a list of marker genes defined by the user as the reference and evaluates the expression of those marker genes in the selected population to predict the cell type. The process runs automatically whenever a cell population is selected. The cell type prediction result appears in the info box on the top left corner of the scatter plot. It includes the cell type name and the marker genes’ information.
- To activate this function, go to Settings > Cell type prediction knowledge and click on Custom.
- Type in a cell type name and the positive and/or negative gene markers. Press Enter after you type in a gene name, or click on the gene name suggested by auto-complete, to add the gene marker.
- Click on the Plus icon (+) to add your definition of a cell type.
- Click Apply to save your settings.
- Now, when you circle a group of cells, the cell type prediction calculation will be based on your custom knowledge base.
By default, cell type prediction is applied only to data with fewer than 50,000 cells because of the long processing time needed for a large dataset. You can enable the function for larger data by increasing the cell number limit in Settings > Analysis > Cell-type prediction limit.
11. Cell search (beta version)
12. Find marker genes and enriched processes
Finding marker genes and enriched processes in a group of cells helps you see the genes and processes that are differentially expressed in the selected group, compared to the rest of the cell population. This information is essential to define which cell type the cluster belongs to. To run the analysis:
- Select a group of cells.
- Go to the Marker genes tab on the right and click on Find marker genes.
- The algorithm will run and find the marker genes as well as the enriched processes, so both results will be available to explore.
- Alternatively, you can click on Run enrichment analysis to get the same result, with both marker genes and enriched processes.
- Each gene or process comes with the p-value and biological details related to it. You can use the Search box to look for a gene or a process of interest.
Details on the marker genes and enrichment analysis include:
- By default, marker genes are sorted in order of significance (p-value), with the most significant gene first. Each page shows 10 marker genes; to continue browsing, go to the next page.
- Together with the gene name and p-value, the software also shows the type of marker gene, dissimilarity, log2FC, Ensembl ID, protein class, gene type, transcript count and GC content. All of these criteria can be used to sort the marker genes in increasing or decreasing order. To sort the marker genes, click on the icon next to the column name.
- Type of marker genes: up-regulated, down-regulated or transitional.
- Transitional marker genes are genes that are not exclusively expressed or repressed in the given cluster but are expressed in multiple clusters, with an expression level that is distinctive for each cluster. The classification is taken from Venice.
- Dissimilarity: this score indicates whether the selected cells are different and can be separated from the non-selected population by constructing a simple classifier based on the expression of the given gene. If the classifier can determine whether a cell comes from the selected or the non-selected group with 100% accuracy, the dissimilarity is 1.
- Log2FC: the log2-fold-change of each gene is the log2 of the ratio of the mean expression of that gene in the selected cells to its mean expression in the rest of the cell population.
- Other details about protein class, gene type, transcript count and GC content are taken from the Ensembl database and available for human and mouse genes.
- By default, enriched biological processes are displayed in order of significance (p-value), with the most significant process first. Each page shows 15 enriched processes; to continue browsing, go to the next page.
- To view enriched molecular functions, cellular components and pathways, click on the drop-down next to the first column name.
- To view details about each process or pathway, click on the icon in the Source column. This opens the source database (Gene Ontology or Reactome) directly on the page of the chosen process.
13. Add an annotation
13.1. Add an annotation
You can add multiple annotations to a cell, regarding cell type, subtype, expression level of a gene or set of genes or clonotype, etc. There are 2 ways to add an annotation:
- Add an annotation matrix from a file, applied to the whole dataset
- Add an annotation manually for each cluster
For each annotation, you need to enter a Group name, which is the name of the classification (cell type, sub-type, T cell sub-type, ...), and a cluster name, which is the name of the cluster (macrophage, microglia, COL1A4+ fibroblast, ...).
To import an annotation matrix from a file:
- Go to the Color by tab and click on the drop-down for the type of classification.
- Click on Add annotation from a file.
- A File explorer window will open. Navigate to the annotation file and click Open.
- The annotation matrix needs to be in TSV or CSV format, with the cell ID in the first column and the annotation linked to each cell in the next column. The column name is taken as the Group name and the cluster names are taken from the annotation of each cell. The annotation must be non-numerical and have fewer than 50 types.
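The file requirements above can be checked before importing. The sketch below parses a two-column CSV or TSV and applies the stated constraints. It is only an illustration: the column names and cell IDs are made up, "non-numerical" is interpreted here as "not every label parses as a number", and the function name `read_annotation` is hypothetical.

```python
import csv
import io

def read_annotation(text, delimiter=","):
    """Parse an annotation file and check the stated constraints.

    Returns (group_name, {cell_id: cluster_name}).
    Use delimiter="\t" for TSV files.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter)
            if len(row) >= 2]
    header, body = rows[0], rows[1:]
    group_name = header[1]                       # column name -> Group name
    annotation = {row[0]: row[1] for row in body}
    labels = set(annotation.values())            # distinct cluster names

    def is_number(label):
        try:
            float(label)
            return True
        except ValueError:
            return False

    if labels and all(is_number(label) for label in labels):
        raise ValueError("annotation must be non-numerical")
    if len(labels) >= 50:
        raise ValueError("annotation must have fewer than 50 types")
    return group_name, annotation
```

A valid file could look like this (hypothetical content):

```python
sample = "cell_id,Cell type\nAAACCTG-1,macrophage\nAAACGGG-1,microglia\n"
group_name, annotation = read_annotation(sample)
```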
To manually annotate each cluster:
- Select a group of cells (refer to section 9)
- Click on the + icon to create an annotation for that group of cells
Fill in Group name and cluster name to create a new group and cluster or choose an existing group and cluster to add the selected cells to that cluster.
Click OK to implement.
13.2. Edit an annotation
After an annotation is added, you can edit it by renaming it, merging 2 clusters, or deleting a cluster or the whole group.
- Go to Color by tab, choose the group you want to edit.
- Click on the Pencil icon next to the group name.
- A pop-up window will be opened.
- Put in a new name for the clusters or group as you prefer.
- Click on the trash icon to delete the clusters or group.
- To merge 2 clusters, hover the mouse over one cluster to select it (it will be surrounded by a gray box), then move it onto another cluster to merge them.
- Click on Save to keep all changes or Cancel to discard the changes.
14. Study cellular composition
BBrowser supports cellular composition analysis for any group of cells, whether annotated or not. You define the group of cells whose composition you want to view and the type of classification. The software calculates the percentage of each cluster of the chosen classification within the selected cells and sorts the clusters from largest to smallest.
For standard function
- Select a cell population. There are 3 ways to select cells for composition analysis: by clicking on a group in the Color by panel, by gene expression levels, and by a lasso tool (pencil tool). (Please go to the Select a cell population section to learn how to select cells in BBrowser.)
- Open the Composition box. Choose the type of classification you want to analyze from the drop-down.
- The results of composition will be displayed in a stacked bar chart.
For Normalized composition
We recommend the Normalized by total tool to reduce the bias caused by unbalanced sample sizes.
For example, suppose the total number of cells from disease is double that from non-disease. If you look at the percentages of disease and non-disease cells among macrophages, the disease percentage is likely to dominate simply because of sample size, which gives a biased proportion of macrophages from the two groups. With the Normalized by total tool, the result shows the cell composition without the bias coming from sample size.
- To apply normalization for composition, go to the Settings below the composition tab and choose Normalized by total.
The normalizing formulation is as follows:
Total number of cells from group 1 (Disease): A;
Total number of cells from group 2 (Non-disease): B;
Number of cells from group 1 (Disease) in the selected population (Macrophage): a
Number of cells from group 2 (Non-disease) in the selected population (Macrophage): b
Normalized percentage of group 1 in the selected population: $\frac{a / A}{a / A + b / B} \times 100\%$
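The difference between the standard and the normalized composition can be sketched as follows. The normalization assumed here divides each group's count in the selection by that group's total number of cells and then rescales the results to 100%; this reading of Normalized by total is an assumption, and the function names are hypothetical.

```python
def composition(selected_counts):
    """Raw percentage of each group among the selected cells."""
    total = sum(selected_counts.values())
    if total == 0:
        return {}
    return {group: 100.0 * n / total for group, n in selected_counts.items()}

def normalized_composition(selected_counts, total_counts):
    """Percentages after dividing each count by the group's total size.

    selected_counts: cells of each group inside the selection (a, b, ...).
    total_counts: cells of each group in the whole dataset (A, B, ...).
    """
    rates = {group: selected_counts.get(group, 0) / total
             for group, total in total_counts.items() if total > 0}
    norm = sum(rates.values())
    return {group: (100.0 * rate / norm if norm else 0.0)
            for group, rate in rates.items()}
```

With 2000 disease and 1000 non-disease cells in total, a macrophage selection of 200 disease and 100 non-disease cells is about 67% disease in the raw composition but 50% disease after normalization, because both groups contribute 10% of their cells.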
15. Differential expression (DE) analysis
Performing differential expression analysis on two given clusters helps you find the genes that differ between the 2 clusters and the processes associated with them.
15.1. Running DE in the Composition panel
You can run DE analysis on 2 clusters of the same annotation in the Composition panel:
- Select a group of cells that contains cells from both clusters, using the selection tools, filters or single/dual gene expression.
- Go to the Composition tab and choose the annotation that contains both clusters.
- Select the 2 clusters in the bar chart by clicking on the clusters’ names or on the parts of the bar chart displaying those clusters.
- The Run DE analysis button will be activated.
- Clicking on that button will start the calculation process and when it is finished the DE dashboard will be opened showing the differentially expressed genes.
15.2. Running DE in Differential Expression panel
You can run DE analysis on any 2 selected groups of cells.
- Select the first group of cells by selection tools, filters or single/ dual gene expression (selected cells highlighted in white)
- Go to the Differential Expression tab, click on + button on the left to add the group to the comparison (group A)
- Select the second group of cells and click on the + button on the right to add that group to the comparison (group B). In this comparison, an upregulated gene is defined as a gene with expression in group A higher than in group B.
- You can change the names of the two groups by clicking on the pencil icon at the top right corner. Click on the check mark after you enter the new name to save the change.
- The Run DE analysis button will be activated. Click on that to start the analysis.
BBrowser offers 7 methods to run DE analysis: our in-house algorithm Venice and 6 differential expression analysis algorithms from the Seurat package: Wilcoxon, likelihood-ratio test, Poisson, negative binomial, logistic regression and t-test.
To choose a method, go to Settings > Analysis > Differential expression analysis.
15.3. The DE analysis dashboard
After you run DE analysis on two clusters of interest, the software opens the DE dashboard. It presents the differentially expressed genes through a volcano plot of all genes, a box plot of a single gene’s expression, a table of genes and enriched processes, and a scatter plot of the cells in the two clusters.
- The volcano plot: shows the genes that are expressed in more than 45% of the cells in the two clusters. Each point represents a gene. Genes up-regulated in the first cluster relative to the second are colored red, while down-regulated genes are colored blue.
- This plot is interactive. Hover the mouse over any gene to view its fold-change and p-value. Click on a gene to show its expression levels across the two clusters in the box plot. Right-click on a gene to show or hide its name on the volcano plot.
- The box plot: shows the selected gene’s expression levels across the two clusters.
- This box plot is interactive: hovering the mouse over it shows the median, mean, maximum, and minimum values.
- Click the Settings icon at the top left corner of the box plot to customize it: change the box plot into a violin plot or a bar chart; sort the boxes alphabetically, by the number of cells in the cluster, by the number of cells expressing the selected gene, or by the mean expression of the selected gene; and show all data points or only outliers.
- DE genes tab: lists the genes with their log2(fold-change) values and p-values, sorted by p-value. Click on any gene to show its expression levels in the box plot.
- Enrichment analysis tab: lists the enriched biological processes, molecular functions, cellular components, and pathways, together with their p-values and links to details in the Gene Ontology and Reactome databases (for human data).
- The scatter plot of cells in the two clusters: by default, cells are colored by the cluster they belong to, so the two clusters appear in two different colors.
If you click on a gene in the volcano plot or the table, the scatter plot shows the selected gene’s expression. You can also query a specific gene by typing its name in the box at the top right.
- The DE dashboard toolbar:
- The Reset button brings the plot back to its original state (colored by cluster).
- The Split view button divides the scatter plot horizontally, with each half showing the cells from one cluster. This aids the visualization of gene expression in the two clusters.
- The Export button saves each plot and table in the dashboard separately.
- The Return button exits the DE dashboard and goes back to the Analysis dashboard of the entire dataset.
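For readers who export the DE table and want to rebuild a similar view outside BBrowser, the rules described in this section can be sketched as follows. This is an illustration under stated assumptions, not the software's code: the GeneResult record, the function names, and the gene names are hypothetical, the 45% threshold and the red/blue coloring come from the description above, and the gray color for a fold change of exactly zero is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class GeneResult:
    name: str
    log2_fc: float              # first cluster vs second cluster
    p_value: float
    fraction_expressing: float  # share of cells (both clusters) expressing the gene


def volcano_points(results, min_fraction=0.45):
    """Return (name, x, y, color) for genes expressed in more than min_fraction of cells."""
    points = []
    for r in results:
        if r.fraction_expressing <= min_fraction:
            continue  # gene is not shown on the volcano plot
        if r.log2_fc > 0:
            color = "red"    # up-regulated in the first cluster
        elif r.log2_fc < 0:
            color = "blue"   # down-regulated in the first cluster
        else:
            color = "gray"   # assumption: no change
        # Clamp the p-value so that log10 is defined even for p = 0.
        y = -math.log10(max(r.p_value, 1e-300))
        points.append((r.name, r.log2_fc, y, color))
    return points


def de_table(results):
    """Sort genes by p-value, as in the DE genes tab."""
    return sorted(results, key=lambda r: r.p_value)


results = [
    GeneResult("GENE_A", 2.0, 1e-10, 0.80),
    GeneResult("GENE_B", -1.5, 1e-4, 0.60),
    GeneResult("GENE_C", 3.0, 1e-20, 0.10),  # expressed in too few cells
]
points = volcano_points(results)
table = de_table(results)
```

Note that GENE_C still appears in the sorted table in this sketch; only the volcano plot applies the expression filter here.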
15.4. Save and view previous DE analysis results
DE analysis results are saved automatically right after each run, so you do not have to repeat the analysis later. To review a result, go to the Differential Expression panel > View previous results.
To rename or delete an analysis, click at the top right corner of its entry, then click Save/Confirm to apply the change.
Sub-clustering is an advanced feature that takes out a group of cells and treats them as a new dataset. The software calculates new principal components and dimensionality reduction results to plot the selected cells in a new scatter plot. The cells are also re-clustered with the Louvain and k-means clustering methods.
Focusing on a subset with fewer cells than the original dataset helps you identify additional principal components, including components that are significant only for this group of cells. You can therefore split the cells into smaller clusters with distinct expression profiles. This feature is suitable for analyzing clusters with large heterogeneity.
16.1. Run sub-clustering
To run sub-clustering, first select a group of cells (refer to section 6.3) and click on the Sub-clustering icon. Name the sub-cluster as you like and click on Apply.
Re-calculation for the sub-cluster usually takes a few minutes. After that, the Sub-clustering dashboard opens automatically.
The Sub-clustering dashboard is similar to the Analysis dashboard and can be used to query gene expression, find marker genes and enriched processes, study cellular composition, and so on, but not to run differential expression analysis. A mini map at the bottom left of the dashboard shows the main scatter plot with all cells of the sub-cluster highlighted in white.
To go back to the main Analysis dashboard, click on the name of the sub-cluster at the top left corner and choose Main cluster from the drop-down.
16.2. Annotation of sub-clusters
Adding an annotation in the Sub-clustering dashboard works the same way as in the Analysis dashboard.
First, select a group of cells, then click on Create an annotation and define the Group name and Cluster name.
An annotation created in the Sub-clustering dashboard is treated the same as one created in the main dashboard. Hence, you can view your sub-clusters in the main scatter plot, or annotate sub-clusters from different Sub-clustering dashboards under the same group name.
17. Study clonotype
Sequencing the T-cell receptor (TCR) is a powerful way to dissect the complexity and diversity of the T cell response repertoire. By associating the TCR with gene expression, BBrowser can provide an unbiased classification of a population of interest and link the transcriptional landscape of each cell to its TCR.
17.1. Getting started
In BBrowser, clicking the Clonotype button at the bottom of the main scatter plot opens the Clonotype dashboard. All cells in the main scatter plot turn gray and their spot size is reduced. A mini map pops up showing the previous coloring of the scatter plot.
Now you can add TCR sequencing data by clicking Upload.
If your data come from multiple batches, submit the TCR sequencing data for each batch separately. In that case, clicking the Upload data button opens a pop-up where you select the input file for each batch.
Cells with a recognized TCR sequence are now colored by clonotype and return to the normal spot size. Hovering the mouse over a clonotype name highlights and enlarges its cells. The number of cells in each clonotype and the relevant antigen information are displayed in a table.
On the left side of the dashboard, you can change the clonotype data, choose the clonotype counting method, and create an annotation for cells with a TCR sequence. Once the clonotypes are converted to an annotation, you can run any analysis on different clonotypes, including marker gene detection, enrichment analysis, composition, and differential expression analysis.
17.2. Accepted data format
TCR sequencing results can be imported as a TSV or CSV file.
The input table must contain enough information for typical V(D)J annotations. BBrowser only reads columns whose names appear in the list below. Columns that are not in this list are ignored.
- v_gene: name of the V gene
- j_gene: name of the J gene
- cdr3: amino acid sequence of the CDR3 region
- barcode: barcode of cell having this clonotype
- raw_clonotype_id: the clonotype ID
- full_length: Whether it has valid V and J annotations
- productive: Whether the transcript translates to a protein with a CDR3 region
The software only uses clonotypes that are both full_length and productive. The CDR3 amino acid sequences are mapped against VDJdb (Shugay et al. 2017) to retrieve information about the relevant epitopes.
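The column selection and filtering described in this section can be sketched as follows. This is not BBrowser's parser: the function names are hypothetical, the CDR3 column is assumed to be named cdr3 (adjust it to match your file), and the flags are assumed to be written as True/False in any letter case. The sample rows are made up.

```python
import csv
import io

# Columns read by the sketch; any other column is ignored.
ACCEPTED_COLUMNS = [
    "v_gene", "j_gene", "cdr3", "barcode",
    "raw_clonotype_id", "full_length", "productive",
]


def is_true(value):
    """Interpret 'True', 'true', 'TRUE' as true; anything else as false."""
    return str(value).strip().lower() == "true"


def read_tcr_rows(handle, delimiter=","):
    """Keep accepted columns of rows that are both full_length and productive."""
    kept = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        if not (is_true(row.get("full_length")) and is_true(row.get("productive"))):
            continue
        kept.append({col: row[col] for col in ACCEPTED_COLUMNS if col in row})
    return kept


# Made-up sample with one extra column ("chain") that is not in the list.
SAMPLE = """barcode,chain,v_gene,j_gene,cdr3,full_length,productive,raw_clonotype_id
AAAC-1,TRA,TRAV1,TRAJ1,CAVRDF,True,True,clonotype1
AAAC-1,TRB,TRBV2,TRBJ2,CASSLF,True,True,clonotype1
AAAG-1,TRB,TRBV3,TRBJ1,CASSQF,True,False,clonotype2
AAAT-1,TRA,TRAV5,TRAJ9,CAGGYF,False,True,clonotype3
"""

rows = read_tcr_rows(io.StringIO(SAMPLE))
```

For a TSV file, pass delimiter="\t". Running such a check before uploading can help explain why some cells stay gray in the Clonotype dashboard.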
17.3. Clonotype counting
There are two ways to count clonotypes:
- Clonotype: This is the default method. The software assigns a cell to a clonotype if the cell carries that clonotype ID, with the exact sequence for both chains of the TCR. You can convert this counting result to an annotation to capture the composition of other factors.
- TCR chain: With this option, each row in the table is a single TCR chain, and cells are grouped together if they share at least one chain. Hence, one cell can appear in several groups at once, and you cannot convert this result into an annotation.
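The difference between the two counting modes can be seen on a tiny example. The sketch below is an illustration only: the function names are hypothetical, a chain is identified here by its (v_gene, j_gene, cdr3) triple, which is an assumption, and the rows are made up.

```python
from collections import defaultdict


def count_by_clonotype(rows):
    """Number of distinct cells per clonotype ID (each cell has one clonotype)."""
    cells = defaultdict(set)
    for row in rows:
        cells[row["raw_clonotype_id"]].add(row["barcode"])
    return {clonotype: len(barcodes) for clonotype, barcodes in cells.items()}


def group_by_chain(rows):
    """Cells grouped by single chain; a cell joins one group per chain it carries."""
    cells = defaultdict(set)
    for row in rows:
        chain = (row["v_gene"], row["j_gene"], row["cdr3"])
        cells[chain].add(row["barcode"])
    return dict(cells)


# Two cells that share the same alpha chain but have different beta chains.
rows = [
    {"barcode": "cell1", "raw_clonotype_id": "clonotype1",
     "v_gene": "TRAV1", "j_gene": "TRAJ1", "cdr3": "CAVRDF"},
    {"barcode": "cell1", "raw_clonotype_id": "clonotype1",
     "v_gene": "TRBV2", "j_gene": "TRBJ2", "cdr3": "CASSLF"},
    {"barcode": "cell2", "raw_clonotype_id": "clonotype2",
     "v_gene": "TRAV1", "j_gene": "TRAJ1", "cdr3": "CAVRDF"},
    {"barcode": "cell2", "raw_clonotype_id": "clonotype2",
     "v_gene": "TRBV3", "j_gene": "TRBJ1", "cdr3": "CASSQF"},
]

clonotype_counts = count_by_clonotype(rows)
chain_groups = group_by_chain(rows)
shared_alpha = chain_groups[("TRAV1", "TRAJ1", "CAVRDF")]
groups_with_cell1 = [c for c, cells in chain_groups.items() if "cell1" in cells]
```

With clonotype counting, each cell lands in exactly one group, so the result can serve as an annotation. With chain grouping, cell1 belongs to two groups at once, which is why that result cannot be converted into an annotation.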
18. Export and Data sharing
18.1. Export graph
BBrowser supports exporting graphs as image files or as data tables in TSV format.
- Box plots, violin plots, density plots, and scatter plots of gene expression can be exported as SVG files.
- Scatter plots showing clustering results or gene expression over the whole cell population, and volcano plots showing DE genes, can be exported as PNG images.
- To save the image and its legends, click on the camera button or download button at the top right or bottom right of the plot and choose Export image (if applicable).
- To save the data table, click on the download button and choose Export data.
18.2. Export TSV file and MTX file
You can export many types of data to a TSV file, for example, graph-based and k-means clustering results, metadata, your annotations, clonotype counts, and lists of marker genes and enriched processes with the corresponding p-values.
To save your annotations and the clonotype counts, click on the Export button and choose the type of data you want to save. For the gene table and the enriched processes table, use the download button next to the table.
To export the metadata of public studies, you might need to enter a Content license.
The expression matrix is saved in a folder named after the study, containing the matrix.mtx, genes.tsv, and barcodes.tsv files. This is the sparse matrix of the data after pre-processing steps such as filtering or batch-effect removal.
To combine 2 studies, export the expression matrix of each study to a folder, then import both folders into a merged study with the batch correction method of your choice.
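The exported folder (matrix.mtx, genes.tsv, barcodes.tsv) can be read back with standard-library Python. The sketch below assumes the common Matrix Market coordinate layout with genes as rows and barcodes as columns, and takes the first column of genes.tsv as the gene identifier; these are assumptions about the export, and the function names and demo data are hypothetical.

```python
import os
import tempfile


def read_lines(path):
    """Return the non-empty lines of a text file without trailing newlines."""
    with open(path) as fh:
        return [line.rstrip("\n") for line in fh if line.strip()]


def read_mtx(path):
    """Parse a Matrix Market coordinate file.

    Returns (n_rows, n_cols, entries), where entries maps 0-based
    (row, col) pairs to values. The file itself uses 1-based indices.
    """
    shape = None
    entries = {}
    with open(path) as fh:
        for line in fh:
            if line.startswith("%") or not line.strip():
                continue  # header and comment lines
            parts = line.split()
            if shape is None:
                shape = (int(parts[0]), int(parts[1]))  # size line: rows cols nnz
                continue
            entries[(int(parts[0]) - 1, int(parts[1]) - 1)] = float(parts[2])
    return shape[0], shape[1], entries


def load_export(folder):
    """Load genes, barcodes, and sparse entries, checking that the shapes agree."""
    genes = [line.split("\t")[0] for line in read_lines(os.path.join(folder, "genes.tsv"))]
    barcodes = read_lines(os.path.join(folder, "barcodes.tsv"))
    n_rows, n_cols, entries = read_mtx(os.path.join(folder, "matrix.mtx"))
    if (n_rows, n_cols) != (len(genes), len(barcodes)):
        raise ValueError("matrix shape does not match genes.tsv / barcodes.tsv")
    return genes, barcodes, entries


# Demo on a made-up 2-gene x 3-cell export written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "genes.tsv"), "w") as fh:
        fh.write("ID_A\tGeneA\nID_B\tGeneB\n")
    with open(os.path.join(folder, "barcodes.tsv"), "w") as fh:
        fh.write("AAAC-1\nAAAG-1\nAAAT-1\n")
    with open(os.path.join(folder, "matrix.mtx"), "w") as fh:
        fh.write("%%MatrixMarket matrix coordinate real general\n%\n")
        fh.write("2 3 3\n1 1 5\n1 3 2\n2 2 3.5\n")
    genes, barcodes, entries = load_export(folder)
```

A shape check like the one in load_export is a quick way to confirm that two exported folders are intact before importing them into a merged study.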
18.3. Share your data to the remote repository
If you want to share both the data and the analysis results with your colleagues, you can use your existing private network protocols, such as SSH, to create an internal data repository. Once a study is shared to the remote repository, your colleagues can download it to their local computers, view all the available information (annotations, differential expression analysis results, gene gallery, sub-clusters), and continue working on the dataset locally. The uploaded version remains unchanged in the remote repository.
Please note that this feature only supports data sharing. BioTuring does not provide a server itself. Users need to have an internal server, and access to it, in order to use the remote repository.
18.3.1. Setting up your BBrowser
In BBrowser, go to Settings > Remote repository data to add the server and access details.
Enter your server and account information (username and password) in the boxes. The software stores it and logs in automatically if your account is available and has been granted access to the default directory.
- Default directory: This is where your data is uploaded.
- The software also creates a metadata file here, which contains summaries of all datasets that have been uploaded to the remote repository. You can view the summaries at Home page > Remote repository. Only the user who uploads a study can add a summary for that study.
- Host: The address of your server, e.g.: mydata.bioturing.com
- Port: The port number on which your internal server provides SFTP. The default port for SFTP is 22.
After providing all the information, click the Apply setting button to finish.
18.3.2. Export data
Now, you can try exporting a dataset from your computer to your remote repository:
- Open the dataset you want to share in BBrowser. This will lead you to the Analysis dashboard of that dataset.
- Click the Export button at the bottom of the function tabs.
- A pop-up window will appear with details about the Remote repository export.
- Fill in the boxes with details about the study: title, authors, abstract, and tags.
Use a comma (,) to separate the authors.
There is no character limit for the abstract (summary), and you can press Enter to separate paragraphs.
You can only add pre-defined tags in this window. To create new tags, see the next section.
Please note that this upload creates a clone of the current version of your dataset in the remote repository. You still have your dataset on the local computer and can continue working on it. However, the software does not sync any changes made to the dataset after uploading. Other users who have access and download the dataset from the remote repository will see the data exactly as it was when uploaded. If you want to share your changes, please upload the dataset again.
18.3.3. Add custom tag
On the BBrowser Home page, you can find a study of interest using a list of tags. This helps users quickly find data from a given tissue or category. The tags of a study are determined by the user who uploaded it.
When you upload a dataset to the remote repository, you can add or remove tags in the Export window. By default, BBrowser provides a list of tags commonly used to classify data and only allows users to choose tags from this list.
If you want to add a new tag, go to the Settings page. Under the Custom tags section, you can add or remove any tag you want. Note that any change to the list affects all local datasets. Tags of data already uploaded to the remote repository remain unchanged.
19. Frequently asked questions
My computer has 8GB RAM, can I process large data?
We recommend a computer with 16 GB RAM for data with more than 100,000 cells or data processed from FASTQ files. However, on a computer with 8 GB RAM, you can still open large Seurat objects if they are fully processed with PCA and dimensionality reduction results (tested with a 300,000-cell object). If you want to submit count matrices, 8 GB RAM can smoothly process data of 30,000 cells.
I got the message “Cannot connect to server”. What can I do?
If you connect through a proxy server, this message may appear when you try to log in, because the connection to the BioTuring server that verifies your credentials cannot be made through the proxy. Click on Proxy settings at the bottom of the login screen and configure your proxy server.
What file formats can I import to BBrowser?
You can import FASTQ, MTX, TSV, CSV, .H5, .H5AD, and .RDS files to BBrowser.
For details about the structure of each file, please refer to section 4.
Does BBrowser support importing a dataset downloaded from GEO?
It depends on the format and structure of the file.
If the file fulfills all the requirements of the software, you can import it to BBrowser.
Otherwise, if the author of the study is willing to share their annotations, the BioTuring team would be happy to consider hosting the data on our platform and indexing it according to our standard process.
Why is the scatter plot in BBrowser different from the plot in the publication?
Since we cannot obtain all the parameters of the data processing steps from the authors, for some steps, our default parameters may be different from those of the authors.
How can I combine multiple datasets?
To combine multiple datasets, first make sure they are in the same format (MTX, TSV, or CSV).
Then go to BBrowser > Data > Add new study to import all the files, select your batch correction method, name the study, and click Start to run the processing.
How can I generate an image for publication?
BBrowser supports exporting multiple graphs: scatter plots, box plots, violin plots, etc. in either SVG or PNG format with a fixed design and layout.
If you want to customize the color of the graph, go to Settings > Visualization and change the color scale there.
An alternative is to export the graph’s data to TSV and rebuild the graph with your preferred tools outside BBrowser. The BioTuring team also offers a drag-and-drop data visualization tool called BioVinci.
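As a starting point for rebuilding a box plot from exported data, the sketch below computes the per-cluster summary values (minimum, median, mean, maximum) from a TSV table. The column names cluster and expression are hypothetical, as are the function name and sample values; check the header of your exported file and adjust them.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize(handle, group_col="cluster", value_col="expression"):
    """Return min, median, mean, max, and cell count for each group in a TSV table."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row[group_col]].append(float(row[value_col]))
    return {
        name: {
            "n": len(values),
            "min": min(values),
            "median": statistics.median(values),
            "mean": statistics.mean(values),
            "max": max(values),
        }
        for name, values in groups.items()
    }


# Made-up export: one row per cell.
SAMPLE = (
    "cluster\texpression\n"
    "T cells\t0\n"
    "T cells\t2\n"
    "T cells\t4\n"
    "B cells\t1\n"
    "B cells\t3\n"
)

summary = summarize(io.StringIO(SAMPLE))
```

The resulting dictionary can be passed to any plotting tool of your choice to draw the boxes with custom colors and layout.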
How can I compare a gene’s expression in different groups?
To compare gene expression across different clusters, first choose the annotation with the clusters you are interested in. Then type the gene name or Ensembl ID in the gene query box and press Enter to query the gene expression. Click on the arrow at the bottom of the color scale to extend the box, and click on the Plot button to generate a box plot of gene expression across the clusters.
Azizi, E., Carr, A. J., Plitas, G., Cornish, A. E., Konopacki, C., Prabhakaran, S., ... & Choi, K. (2018). Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell, 174(5), 1293-1308.
Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. "Integrating single-cell transcriptomic data across different conditions, technologies, and species." Nature biotechnology 36, no. 5 (2018): 411.
Consortium, Gene Ontology. 2004. “The Gene Ontology (GO) Database and Informatics Resource.” Nucleic acids research 32(suppl_1): D258–D261.
Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal, Complex Systems 1695(5): 1–9.
Gribov, Alexander et al. 2010. “SEURAT: Visual Analytics for the Integrated Analysis of Microarray Data.” BMC medical genomics 3(1): 21.
Haghverdi, Laleh, Aaron T L Lun, Michael D Morgan, and John C Marioni. 2018. “Batch Effects in Single-Cell RNA-Sequencing Data Are Corrected by Matching Mutual Nearest Neighbors.” Nature biotechnology 36(5): 421.
Joshi-Tope, G et al. 2005. “Reactome: A Knowledgebase of Biological Pathways.” Nucleic acids research 33(suppl_1): D428–D432.
Korsunsky, Ilya, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-Ru Loh, and Soumya Raychaudhuri. "Fast, sensitive, and flexible integration of single cell data with Harmony." BioRxiv (2018): 461954.
Korthauer, K. D., Chu, L. F., Newton, M. A., Li, Y., Thomson, J., Stewart, R., & Kendziorski, C. (2016). A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome biology, 17(1), 222.
Krijthe, J H. 2015. “Rtsne: T-Distributed Stochastic Neighbor Embedding Using Barnes-Hut Implementation.” R package version 0.13, URL https://github.com/jkrijthe/Rtsne.
Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome biology 15(12): 550.
Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing Data Using T-SNE.” Journal of machine learning research 9(Nov): 2579–2605.
McInnes, Leland, and John Healy. 2018. “Umap: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv preprint arXiv:1802.03426.
Melville, James. 2018. “Uwot: The Uniform Manifold Approximation and Projection (UMAP) Method for Dimensionality Reduction.” https://github.com/jlmelville/uwot.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26(1): 139–40.
Shugay, M., Bagaev, D. V., Zvyagin, I. V., Vroomans, R. M., Crawford, J. C., Dolton, G., ... & Eliseev, A. V. (2017). VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic acids research, 46(D1), D419-D427.
Subramanian, Aravind et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102(43): 15545–50.
Tran, Thang, Thao Truong, Hy Vuong, and Son Pham. 2019. "Hera-T: An Efficient And Accurate Approach For Quantifying Gene Abundances From 10X-Chromium Data With High Rates Of Non-Exonic Reads." doi:10.1101/530501.
Wang, T., Li, B., Nelson, C. E., & Nabavi, S. (2019). Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC bioinformatics, 20(1), 40.