The .deb package is only available for Ubuntu 16.04. After downloading it, you can install it either with the Software Installer or with terminal commands.
Graphical User Interface:
- Navigate to the location of the downloaded .deb file, then double-click the file, or right-click it and select “Open with Software Installer”.
- Click the “Install” button in the Software Installer window.
Command Line Interface:
- Open a terminal with Ctrl + Alt + T, navigate to the location of the downloaded package, and run:
sudo dpkg -i <name of the package>.deb
This installs the software. Note that dpkg itself does not download missing dependencies.
- If the installation results in a dependency error, run:
sudo apt install -f
This resolves the missing dependencies and completes the installation.
Please note that the software may open a local WebSocket to execute R commands for certain analyses. Local TCP connections on port 9004 must be allowed for the software to run properly.
On macOS, install the software from the .dmg package:
- Navigate to the downloaded .dmg package.
- Double-click the .dmg package to open the installation window.
- Follow the on-screen instructions to complete the installation process.
Please note that from version 1.1.5 onward, some users will receive a message at the beginning of the installation. This is a warning saying that BBrowser is from an unidentified developer.
In that case, please go to Security & Privacy and allow the installer to open anyway.
Due to an unexpected incident, Apple is currently not able to provide a proper signature for our issued account. We hope to have this problem resolved as soon as possible. Above all, we continue to protect our users’ privacy as declared in our terms.
On Windows, after downloading the installer, run the .exe file with administrator permission to open the installation window, which will guide you through the installation.
If more than one account on your computer uses the software, each account can only access its own data. The software has some extra requirements on Windows machines:
- Visual C++ Redistributable 2010 and Visual C++ Redistributable 2015 are needed for compatibility reasons. Both are installed automatically during the installation process.
- Local TCP and UDP connections on ports 8088 and 9004 must be allowed for the software to run properly. The software uses these local ports via an independent executable file called bioanaserve.exe. In some cases, your antivirus software may show a warning, and you may have to add this file to its exceptions.
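Before launching the software, it can help to confirm that no other program is already using the local ports mentioned above. The Python sketch below is not part of BioTuring Browser. The helper name tcp_port_is_free is hypothetical, and the check only tests whether a TCP port on 127.0.0.1 can be bound at this moment. It does not cover UDP, firewall rules, or antivirus settings.

```python
import socket


def tcp_port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Another process already holds the port (or binding is not permitted).
            return False
    return True


if __name__ == "__main__":
    for port in (8088, 9004):  # the ports named in this guide
        state = "free" if tcp_port_is_free(port) else "already in use"
        print(f"TCP port {port} is {state}")
```

Run this while the software is closed. Once the software is running, the same ports are expected to show as in use.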
On RPM-based Linux distributions, you need to install some dependencies first. If your computer already has these libraries, you can skip this step.
yum install libgfortran libXScrnSaver
Then, use the following command to install BioTuring Browser:
rpm -iU BBrowser-xxx.x86_64.rpm
Please replace “xxx” with the version that you downloaded. Installation of the software and its dependencies may require root access. After installing, BBrowser can be found in Applications > Accessories.
Get a dataset
BioTuring Browser is an analytics package for single-cell sequencing data. After this section, you will be able to perform a basic analysis using this software.
When you first open BioTuring Browser, it looks like a search page of publications. This software provides a database of published single-cell datasets. The database is updated weekly with important studies in biomedical research.
Although the software lets you analyze your own dataset as well, for the sake of simplicity, this section will use a public dataset: GSE114725. This is a study of immune cells in breast cancer (Azizi et al. 2018).
To begin, you need to have this study downloaded on your BioTuring Browser. The study has been indexed in BioTuring Browser, which means you can look it up in the software. You can put the study id “GSE114725” into the search bar. When the study shows up, click “Download” and you are ready to go.
To open this study, you need to have the Single-cell Addon installed on your BioTuring Browser. If you have never installed this addon before, please follow the instructions at this link.
Learn about the dataset
When you open a dataset, the best way to get to know it is to click “Study info”. The software will bring up some general information about the data.
For this particular example, you are looking at a t-SNE plot of 45,549 human cells. This is a “multiple batches” dataset, which means that more than one expression matrix was used to create it. These batches are processed with MNN correction to remove the batch effect.
To learn about the experimental design of this data, you can try coloring the t-SNE plot with the metadata. To change the color, you should get to know the “Color by” panel. This panel is on the right-hand side. When you first open a dataset, the panel has “Graph-based clustering” selected, which means the color of the t-SNE plot is based on the clustering result.
In a “multiple batches” study, there is a special coloring factor called “Input metadata”. It holds the names of the batches. In the “Color by” panel, you can open the dropdown and select “Input metadata”.
Once you finish, the data points on the t-SNE plot will be colored by batch name. You can see that this dataset has 8 breast-cancer (BC) patients. Other metadata can be accessed similarly in “Your annotations”. For this dataset, we already added “location” and “Cell types”. You can try coloring your t-SNE plot with these metadata to see the 4 different tissues and the cell types of these 8 patients.
BioTuring Browser integrates state-of-the-art analyses through a system of in-app packages called “add-ons”, each of which focuses on a different scope in the field. In our latest release, BBrowser introduces two add-ons:
- RNA-Seq Explorer: This add-on performs alignment, transcript quantification, and all common analyses in bulk RNA sequencing data. Some major analyses are: quality control, dimensionality reduction, enrichment analysis, and differential expression analysis.
- Single-cell Add-on: This add-on is specialized for single-cell sequencing data. Users can perform alignment, transcript quantification, and batch-effect removal. For downstream analysis, it provides a variety of interactive data visualizations, analyses, and a curated knowledge base that can help users annotate cell types based on marker genes.
When you choose to access a published dataset, BBrowser will tell you which add-on is suitable for viewing it. If you want to import your own data for analysis, you only need to select the add-on that suits your kind of data. In the next section, we will discuss the features in each add-on in more detail.
How to enter the Single-cell dashboard
To go to the add-on, you can either go to the Addons section and open the Single-cell Add-on, or open a scRNA-Seq study.
If you go via the Addons section, you can access the front page. It shows a list of previous studies that have been opened, a list of latest news in scRNA-Seq studies, and also allows you to create a new study.
You can directly access the Single-cell Add-on from the main dashboard, by opening a particular single-cell study. To exit, you can click any button on the top menu, or click the Exit button (at the bottom right corner of the dashboard).
Import data for analysis
In addition to letting you view and analyze published data, BioTuring Browser also supports importing your private data for analysis.
There are 2 ways to import data:
- Navigate to the Data page, then click on New study and choose SingleCell.
- Go to the Addons page and open the Single-cell Add-on, then click the Create a new study button.
To create your own studies, you need to buy a license or request a trial beforehand.
Import an expression matrix
If there is only one sample in your dataset, you just need to import a single matrix to gain insights from your own data.
To create your own study from a single matrix, you need to provide exactly 3 files: genes.tsv, barcodes.tsv and matrix.mtx.
File genes.tsv may contain one or two columns. If it has one column, that column is the gene name. If it has two columns, the first is the Ensembl ID and the second is the gene name. File barcodes.tsv contains one column (the barcode). File matrix.mtx contains the expression matrix in Matrix Market format, with the number of columns equal to the number of barcodes in barcodes.tsv and the number of rows equal to the number of genes in genes.tsv. Note that barcodes.tsv and genes.tsv do not contain headers.
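To catch a mismatch before importing, you can compare the three files yourself. The following Python sketch is only an illustration and is not part of BioTuring Browser. The function names are hypothetical, and it assumes uncompressed text files and the sparse “coordinate” layout of Matrix Market, where the first non-comment line holds the number of rows, columns, and entries.

```python
from pathlib import Path


def count_rows(path):
    """Count non-empty lines in a headerless TSV file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def read_mtx_shape(path):
    """Return (rows, columns, entries) from the Matrix Market size line."""
    with open(path, encoding="utf-8") as handle:
        header = handle.readline()
        if not header.startswith("%%MatrixMarket"):
            raise ValueError("missing %%MatrixMarket header line")
        for line in handle:
            # Comment lines start with '%'; the first other line is the size line.
            if line.startswith("%") or not line.strip():
                continue
            rows, cols, entries = (int(value) for value in line.split())
            return rows, cols, entries
    raise ValueError("no size line found")


def check_study_folder(folder):
    """Compare matrix.mtx dimensions with genes.tsv and barcodes.tsv."""
    folder = Path(folder)
    n_genes = count_rows(folder / "genes.tsv")
    n_barcodes = count_rows(folder / "barcodes.tsv")
    rows, cols, _ = read_mtx_shape(folder / "matrix.mtx")
    problems = []
    if rows != n_genes:
        problems.append(f"matrix has {rows} rows but genes.tsv has {n_genes} lines")
    if cols != n_barcodes:
        problems.append(f"matrix has {cols} columns but barcodes.tsv has {n_barcodes} lines")
    return problems
```

An empty list from check_study_folder means the row and column counts agree. Each string in a non-empty list describes one mismatch.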
After you finish, simply click on Start.
Import multiple matrices
If you want to compare two or more samples against each other, you need to import multiple matrices.
You need to switch to Multiple MTX mode, then import at least 2 matrices, each in its own folder containing the 3 files genes.tsv, barcodes.tsv and matrix.mtx.
Import raw data
To import a study with raw sequencing data, you need to provide the FASTQ files. These files can be either uncompressed or compressed in gzip format. The importing interface of this function can be accessed by clicking the Create a new study button and opening the Raw data tab.
The add-on requires all FASTQ files to be in the same folder and at the same level (no subfolders). At the moment, we only support paired-end reads that are in two distinct FASTQ files. After you select a folder, the add-on automatically pairs the FASTQ files into runs based on the file names. To make this process easier, please make sure that the two FASTQ files of one pair share a prefix that is different from the prefixes of other pairs.
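If you want to preview how your files would group into pairs before importing, a small script can help. The Python sketch below is only an illustration and does not reproduce the add-on’s actual matching rule, which is not described here. It assumes a naming convention of the form <prefix>_R1<suffix> and <prefix>_R2<suffix>, and the function name pair_fastq_names is hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming convention: <prefix>_R1<suffix> and <prefix>_R2<suffix>,
# where <suffix> is an optional "_<digits>" followed by .fastq/.fq and optional .gz.
READ_PATTERN = re.compile(
    r"^(?P<prefix>.+)_R(?P<read>[12])(?P<suffix>(?:_\d+)?\.(?:fastq|fq)(?:\.gz)?)$"
)


def pair_fastq_names(file_names):
    """Group FASTQ file names into (R1, R2) pairs by shared prefix and suffix."""
    runs = defaultdict(dict)
    unmatched = []
    for name in sorted(file_names):
        match = READ_PATTERN.match(name)
        if match is None:
            unmatched.append(name)
            continue
        key = (match["prefix"], match["suffix"])
        runs[key][match["read"]] = name
    pairs = []
    for reads in runs.values():
        if "1" in reads and "2" in reads:
            pairs.append((reads["1"], reads["2"]))
        else:
            # Only one mate was found for this prefix.
            unmatched.extend(reads.values())
    return pairs, sorted(unmatched)
```

Any file returned in the second list either does not follow the assumed convention or is missing its mate, so it is worth renaming or checking before you select the folder.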
The alignment and quantification process is run with Hera-T (Tran et al. 2018), which is at least 30 times faster with better accuracy and less memory cost than other tools. Hera-T supports raw data from 10X Chromium Chemistry V2 and V3. The output of Hera-T automatically goes through the analysis pipeline as a single MTX.
Batch effect removal
The addon supports batch effect removal with several methods. This option is only available when you submit a study with multiple MTX files. These methods are among the most effective or most widely used strategies. We provide full details on the packages in Section 17.
For dimensionality reduction, t-SNE is applied to the PCA of the expression matrix by default. At the moment, the addon supports t-SNE and UMAP. You can choose the method before or after a study is created.
To change this setting before creating a study, you need to declare the method in the advanced settings, which can be accessed by clicking the More settings button above the study title.
To change this setting after creating a study, you can use the Settings button on the bottom right of the dashboard. Then go to Dimensionality reduction and choose the method that you want to use. If you have never used the selected method on the current dataset, it may take a minute to apply changes.
In this section, we introduce the structure and general navigation inside the addon.
While working with BioTuring Browser, you may see up to two dialog boxes at the top left of the window, displaying the following information:
- Cell type prediction: after you click on a cluster/column or annotate a cell population, you will see a box displaying the number of cells that you have selected and the cell type prediction results.
- Gene information: the box underneath the cell type prediction box shows the main information of a selected marker gene.
To hide these boxes, you can click on the (x) icon below the dialogues.
The scatter plot
The main visualization is a scatter plot of dimensionality reduction, in which each point represents a single cell. By default, cells are colored by the graph-based clustering result in the sidebar on the right-hand side. The scatter plot is interactive, which means you can drag, rotate (3D plot), or select cells.
In normal mode, if you click a data point on the plot, you will select the whole group that includes that cell. Groups in the scatter plot are defined by the way you are coloring the plot.
Clicking and dragging in normal mode moves the plot (in 2D) or rotates it (in 3D).
Scrolling the mouse wheel zooms in and out.
In selection mode, clicking and dragging draws a selection zone. After you stop, the software automatically joins the start and end of the curve with a straight line to create a closed area. Cells that fall in this area are selected. Please note that if you start another action, the previous selection zone will be erased.
The control panels
Control panels are on the right side and at the bottom of the interface. They come in several small windows which can be expanded or collapsed. These panels either give you more insight into the data or provide additional visualizations. Each of them has a unique function and is tightly connected to the others. There are 6 panels:
- Color by: This panel controls the color of the scatter plot. Basically, it helps you define how to group the cells.
- Shape by: This panel functions similarly to Color by, except that it labels the cells by changing their shape.
- Composition: This panel shows the percentage of the second coloring factor of the scatter plot. It also helps you run differential expression analysis.
- Marker genes: This panel can detect marker genes given a group of cells.
- Enrichment analysis: This panel runs enrichment analysis on several types of terms: molecular function, cellular component, biological process, and pathway.
- Clonotype (at the bottom): This panel shows the clonotype counting result and also provides relevant epitopes with supporting articles.
In the next few sections, we describe the function of each panel in detail.
Controls of scatter plot
Switch to 2-D and 3-D mode
In the first view, the dimensionality reduction is shown in a 2-D scatter plot. To go to a three-dimensional plot, you can click on the 3D button (at the bottom right corner of the plot).
Interact with the plot
On the bottom right corner of the scatter plot, there are several buttons that control the visualization as well as how you define a selection.
- Move (hand icon): this button activates the navigation mode. It is active by default. While in this mode, you can drag to move (in 2-D plot) or rotate (in 3D plot). Clicking any cell in this mode also selects the whole cluster that contains the clicked cell.
- Lasso selection (pencil icon): this button activates the selection mode. After triggering this button, you can drag the mouse over the plot to draw a selection curve around the targeted cells.
- Reset (circling-arrow icon): this button resets the coordinates of the scatter plot to the original state. The selection may be lost after resetting. If the plot is using gene expression as color, it will change back to the last clustering factor / annotation that was used.
- 2D / 3D: these buttons help you switch between 2-D and 3-D scatter plot. In 3-D mode, you may also rotate the plot.
- Zoom (plus/minus icon): these buttons help you zoom in and out. You can also use the wheel to perform this action. The point size of the scatter plot remains unchanged.
- Export image (camera icon): this button creates a snapshot of the scatter plot that can be exported in PNG format.
- Gene gallery: this button opens the gene gallery, in which you can use the gene expression shortcuts that you have created.
There are also 2 hidden buttons that will only pop up after you make a selection:
- Create an annotation (plus icon): this button opens a panel that helps you label the selected cells. Normally, this panel will pop up every time you use the Lasso selection. From version 1.2.0, we introduce this button so that you can label with any kind of selection.
- Subcluster (hierarchical diagram): this button will perform sub-clustering on the selected cells. Please note that it will only work when you are on the main data (not on a sub-clustering result) and your selection is at least 50 cells. The sub-clustering process uses the same analysis pipeline as the original data.
Customize the plot
There are several options to help you customize the scatter plot.
- Point-size: Depending on the screen resolution, you may need to adjust the point size to fit your screen. By default, the software automatically estimates the point size based on the screen size and resolution.
- Equal size: Data that is generated by UMI-count technologies (such as 10x Genomics) may have a lot of zeros (the dropout issue). When you map expression values to the scatter plot, cells whose expression value equals zero are shown in grey, and there will be lots of them. This sometimes hides the non-grey cells (those with a non-zero expression value) and makes the visualization misleading. To tackle this problem, the software makes grey cells much smaller than the original size when you query a gene. If you turn this option off, the software will keep everything as is.
- Theme: This option is “Dark” by default. This is the optimal theme for analyzing. If you want to make a figure or a live demo which uses a screen projection, you may consider using the “Light” theme.
- Color palette: This option controls the way colors are assigned to groups of cells or gene expression.
The Color by panel controls how the main scatter plot is colored. By default, the plot is colored by the graph-based clustering result. You can select other options as well:
- K-means clustering: It is shown as a slider of numbers ranging from 2 to 10. You can use this slider to set the value of k. This analysis runs on the PCA result. You can find more detail about the method at the end of this document.
- Your annotations: This is where you can color the scatter plot by your own definitions. Adding an annotation can be done manually by selecting and labeling groups of cells, or by importing a table of metadata.
- Graph-based clustering: This is the default coloring method. The clustering analysis uses the Louvain method on the PCA result.
- Input metadata: You can only see this option when you used multiple MTX files to create the study. It is equivalent to the batch names.
- Add annotation from a file: This is where you can open a file browser and select a TSV file of metadata. To make it possible for the software to match the table with the current study, the first column of the table must contain the barcodes.
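If your annotations live in a script or a notebook, you can write them to a TSV file with the barcodes in the first column. The Python sketch below is only an illustration and is not part of BioTuring Browser. The function name write_metadata_tsv is hypothetical. This guide does not state whether the software expects a header row, so the header line written here is an assumption that you may need to adjust.

```python
import csv


def write_metadata_tsv(path, barcodes, columns):
    """Write a tab-separated table whose first column is the cell barcode.

    columns maps an annotation name to a dict of barcode -> label.
    Barcodes without a label get an empty cell.
    """
    names = list(columns)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["barcode"] + names)  # header row is an assumption
        for barcode in barcodes:
            labels = [columns[name].get(barcode, "") for name in names]
            writer.writerow([barcode] + labels)
```

For example, write_metadata_tsv("meta.tsv", barcodes, {"cell_type": labels}) produces one row per barcode with a single annotation column.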
For any option that you select, groups are shown as horizontal bars whose length reflects the number of cells. You can untick a group to temporarily remove its cells from the scatter plot. This function is useful when too many groups are mixed with the cells you want to select. Double-clicking a group will untick or tick all other groups.
You can only customize your own annotations. For those, you will see a pencil icon next to the option. Clicking this icon opens a window where you can delete a group, delete the whole annotation, merge groups together, or change names.
The Shape by panel controls the shape of the data points. All functions are similar to those of the Color by panel, except that it maps the metadata to shapes instead of colors. This feature helps you observe 2 metadata fields at a time and create a selection based on their intersection.
For example, the picture below illustrates the selection of cells coming from cluster 12 (Color by) and patient BC10 (Shape by). The feature offers a more flexible way to annotate and make a comparison.
Shape by differs from Color by in that you cannot modify, delete, or import annotations in this panel. These actions have to be done in the Color by panel. After that, changes will be automatically synced to the Shape by panel.
Marker genes and Enrichment analysis
The 2 panels are updated every time you make a selection.
The Marker genes panel shows a table of genes that are uniquely expressed in the cluster. Genes are evaluated based on the difference in the ratio of cells expressing the gene and in the average expression. The table is interactive, so you can immediately color the scatter plot by expression value by clicking on a gene. Genes are sorted by their uniqueness.
The Enrichment analysis panel shows a table of biological terms in which the marker genes are enriched. The analysis is run with GSEA (Subramanian et al. 2005) on gene ontology and pathway databases. The table is sorted in decreasing order of GSEA score.
By default, the software has already run marker-gene detection and enrichment analysis for all clustering results. For annotations or an arbitrary selection, you need to click the Find marker genes or the Run enrichment result button. It takes at most 10 seconds. The result is then saved, so you will not have to rerun the analysis.
Show gene expression
To look up a gene, you can type a gene name or Ensembl ID in the search bar at the top right corner.
By default, the dashboard allows you to view the expression of a single gene when you search for that gene. To view the expression of the selected gene across all groups, click on the Plot button.
It will create a box plot of the gene expression across all the groups. The x-axis of the plot depends on the previous coloring factor, which can be a custom annotation or a clustering result. To change the x-axis, you need to turn off the box plot, then change the coloring factor to the one you want to use in the box plot. After that, search for the gene and hit the Plot button again.
To customize the box plot, you can use the Settings icon on the top left corner. It will open a menu from which you can change this box plot into a violin plot or a bar chart. It also lets you change the order of the boxes or add data points.
You can follow these steps to view the expression of 2 genes at the same time:
- Look up a gene.
- Click on the arrow underneath the search bar and switch to Dual mode.
- Type in the name of the second gene then press Enter.
When you click on the Plot button, a density plot will appear, showing the expression of those genes across all groups.
Gene gallery and Mini map
To save a screenshot of the expression of a specific gene across all the cells, you can click on that gene and select Add to gene gallery.
You also see a mini map at the bottom left corner when selecting a gene.
Cell type prediction
One of the most distinctive features in BioTuring Browser is real-time cell type prediction.
Whenever you click on a cluster, a box showing cell type prediction result will appear on the left-hand side. You can also see marker genes’ information in the box underneath by clicking on a marker gene listed in the right panel or a gene that you search for.
The prediction is based on a curated database of cell types. This database consists of more than 200 cell types with marker genes and the related publications. However, the definition of cell types is getting blurry as more and more subtypes and states of cells are reported. The software therefore allows users to create their own definitions of cell types.
To create a custom database of cell types, you can select the Settings button and look under the Cell-type prediction knowledge base section.
After choosing Custom mode, you can start entering the name and the positive/negative markers for a cell type. Remember to click the plus icon to stage your changes before hitting the Apply button.
BioTuring Browser allows you to annotate any cluster in a few simple steps. You can use the selection tool to draw around the cluster that you want to annotate. Then, just enter the group name and cluster name, and press OK.
Group name is the name of the annotation. Cluster name is the name of the cluster. For example, if you select some cells and want to label them “Neuron”, you should create an annotation with the group name “cell type” and the cluster name “Neuron”.
Single-cell technology is a great tool for looking at cells as individuals. This high resolution poses a challenge for dimensionality reduction methods, as a dataset can have many substructures. Hence, in many cases, the whole story in single-cell data cannot easily be shown in one scatter plot. BBrowser allows users to quickly select and run sub-clustering on any group of cells without manually extracting the expression matrix.
How to run sub-clustering
To run sub-clustering, you can either select a cluster, or select a group in the annotation then hit the Sub-clustering icon listed on the top of the Toolbox.
In a moment, the sub-clusters of the selected cell population will appear. Choose the Main cluster icon listed at the top left corner to get back to the original plot.
Review the sub-clusters later
If you want to review the sub-clustering results later without redoing the analysis, you can click on the Main cluster icon (to the top left of the scatter plot) and select the sub-clusters of interest.
Annotation of sub-clusters
When you create or edit an annotation in a sub-cluster, the software immediately updates all other scatter plots, so that you can mark the cells of interest and see where they are located in another population.
The Composition panel shows the composition of the cells as a stacked bar chart. Please note that opening this tab changes the coloring of the scatter plot: non-selected cells become grey and selected cells are colored by a composition factor. You can choose a composition factor from your custom annotation list or from your metadata. This tab also runs differential expression analysis with edgeR (Robinson, McCarthy, and Smyth 2010) if there are at least two groups in your composition plot.
It can help you view the components of a specific population: what percentage of the cells comes from a specific group in the metadata or from a group in your annotation.
Here are the steps to view the composition of a population:
- Select a population: you can either click the Pencil icon (in the toolbox) and circle the population of interest, or click a group of cells from the graph-based clustering result/ k-means clustering result/input metadata.
- Click Composition (located on the right panel) to view the composition of the selected population. Here you can choose to view the composition by Input metadata, or by the groups in your annotation.
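The percentages in the composition chart are simple proportions. The Python sketch below shows the arithmetic only, and it is not the software’s own code. The function name composition and the dictionary group_of (barcode to group name) are hypothetical.

```python
from collections import Counter


def composition(selected_barcodes, group_of):
    """Return the percentage of selected cells that fall into each group."""
    counts = Counter(group_of[b] for b in selected_barcodes if b in group_of)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: 100.0 * n / total for group, n in counts.items()}
```

For a selection of four cells where three come from one batch and one from another, the result is 75% and 25%.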
Differential expression (DE) analysis
In many cases, you may find that two clusters with extremely similar expression patterns are clearly separated from each other on the scatter plot. Performing differential expression analysis on these two clusters will help explore the genes that cause the difference.
How to run DE analysis for 2 groups in metadata
Suppose you have 4 groups of cells in the metadata, and want to run DE analysis to find the genes that cause the difference of 2 groups out of 4 in a specific cluster.
- Select or circle that cluster.
- Click Composition (on the right panel) and choose to view the composition of that cluster by Metadata.
- Select 2 groups that you want to run DE analysis on (by clicking on the groups’ names)
- Click Run DE analysis to proceed
Similarly, if you want to run DE analysis on two groups of cells in the metadata across the entire dataset (not just in a specific cluster), just circle the whole plot and follow steps 2 to 4.
How to run DE analysis on 2 groups in an annotation
Suppose you detect two clusters of cells with a similar gene expression pattern, and want to compare these two to find the genes that cause the separation.
Here are steps to run the DE analysis between any two populations:
- Annotate the clusters of interest: In order to run DE analysis, the two groups of cells have to be annotated (if there’s no metadata). Just circle the first population, input the group name (eg. DE) and cluster’s label (Group A). Then circle the second population, input the same group name (DE), but with a different cluster’s label (Group B).
- Circle the region that contains both of the clusters of interest: the region that you circle does not have to contain exactly two groups that you want to compare. This region can contain these two groups, and other cells from other groups/clusters.
- Click Composition (on the right panel) and choose to view the composition by the group name that you assign for the two populations of interest (which is “DE” for this example). You will see the percentage of Group A and Group B in the region you are looking at.
- Click on Group A and Group B under the Composition section.
- Click Run DE analysis.
The DE analysis dashboard
After you run the DE analysis on two groups of interest, the software will proceed to the DE dashboard, where you can see: (1) a volcano plot at the upper right corner, (2) a box plot of gene expression under the volcano plot, (3) a table of important genes at the bottom right corner, (4) a scatter plot of cells in two groups on the left, and (5) a toolbar to the bottom right of the scatter plot.
- The volcano plot: shows genes that pass the threshold (ones that are expressed in more than 45% of the cells in the two groups), which are the important genes that cause the difference between the two groups. The up-regulated genes are colored in red, while down-regulated ones are colored in blue.
The plot is interactive: hover the mouse over any gene to view its fold-change value and p-value. If you click on any gene, the box plot will show that gene’s expression levels across the two groups of interest. Right-click any gene to show or hide its name on the volcano plot.
- The box plot: showing the selected gene’s expression levels across two groups of interest. The box plot is interactive, which allows showing median, mean, max, min when hovering over it. You can freely change the box plot into a violin plot or a bar chart by clicking on the Settings icon on the top left corner of the box plot. This also allows you to add/ remove data points.
- The table of important genes: showing the list of important genes with log2(fold-change) values and p-values. The genes are sorted by the p-values. If you click on any gene, the box plot will also change. Here you can also view the Gene ontology and Pathway (for human data).
- The scatter plot of cells in two groups: By default, cells from two groups will be shown in two different colors. If you click on any gene from the volcano plot or the table, this scatter plot will show the selected gene’s expression across two groups of interest. The redder, the higher the expression level. You can also query a specific gene expression by filling the gene name in the top right box.
- The toolbar: here you will find a button to reset the plot to the original state (colored by the groups of interest, not by gene expression). You can also choose Split view, which separates the cells of the two groups from each other so that you can easily see the gene expression in each group. Another option, Return back, returns to the original plot of the entire dataset.
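To make the volcano plot described above more concrete, the Python sketch below shows how DE results are conventionally turned into volcano-plot points. It is an illustration under stated assumptions, not the software’s implementation. It assumes the usual axes (log2 fold-change on the x-axis, negative log10 of the p-value on the y-axis), and it assumes the 45% threshold is computed over the cells of both groups combined. The function name and the dictionary keys are hypothetical.

```python
import math


def volcano_points(results, min_fraction=0.45):
    """Turn DE results into (gene, x, y, color) volcano-plot points.

    results: iterable of dicts with keys 'gene', 'log2fc', 'p_value' and
    'fraction_expressed' (share of cells in the two groups expressing the gene).
    """
    points = []
    for row in results:
        if row["fraction_expressed"] <= min_fraction:
            continue  # below the expression threshold, not plotted
        x = row["log2fc"]
        # Guard against log10(0) when a p-value underflows to zero.
        y = -math.log10(max(row["p_value"], 1e-300))
        if x > 0:
            color = "red"  # up-regulated
        elif x < 0:
            color = "blue"  # down-regulated
        else:
            color = "grey"  # no change; this color choice is an assumption
        points.append((row["gene"], x, y, color))
    return points
```

Genes far to the left or right and high on the y-axis are the ones with both a large fold-change and a small p-value.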
Save and name the DE analysis results
By default, the DE analysis results are automatically saved right after you run it. However, you may need to rename the DE analysis results for future review.
To rename the DE analysis result, please return to the original plot screen of the whole dataset. Click on Result (located to the bottom right corner), and click on the Pencil icon to rename it.
Review the previous DE analysis results
Once again, the DE analysis results are automatically saved right after you run it. Therefore, you do not have to perform the analysis again in the future. To review the DE analysis result, just click on the Result button (located to the bottom right corner of the single-cell dashboard), and select the DE analysis result that you want to review.
The Clonotype panel is located at the bottom of the scatter plot, next to the Gene gallery. If you provide a table of V(D)J annotations, this panel can list the most common clonotypes in your dataset and let you define how clonotypes are counted.
When you first open the clonotype tab, everything turns grey. The mini map will pop up and remind you of the previous colorful scatter plot. This is because BBrowser is trying to color your scatter plot by clonotype, and if it cannot find anything (maybe because you have not added any clonotype data yet), it turns grey.
In the next section, we explain in more detail how you can add clonotype data to a current study. After that, every time you expand the clonotype tab, it shows you the clonotypes of your cells. Take a look at the picture below as an example.
Cells having the same color have the same clonotype. Grey cells have no clonotype information. The table in this tab summarizes the clonotype counts and their relevant antigen information. It is interactive, which means that if you hover over a clonotype in the table, the scatter plot will enlarge the cells having that clonotype.
On the left, there is a control panel for the table. It helps you filter the table or change the clonotype counting method. We discuss these options in more detail in another section. Another helpful function of this panel is that it allows you to change the clonotype data and to convert the table to an annotation. After this conversion, you can run any analysis that can be run on an annotation, including marker gene detection, enrichment analysis, composition, and differential expression analysis.
The input table must have enough information for typical V(D)J annotations. The column names and their meanings are listed below:
- v_gene: name of the V gene
- j_gene: name of the J gene
- cdr3: CDR3 sequence in terms of amino acids
- barcode: barcode of cell having this clonotype
- raw_clonotype_id: the clonotype ID
- full_length: Whether it has valid V and J annotations
- productive: Whether the transcript translates to a protein with a CDR3 region
The software only chooses clonotypes that are both full_length and productive. The CDR3 amino acid sequences are matched against VDJdb (Shugay et al. 2017) to find information about the relevant epitopes.
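The selection rule above can be illustrated with a short Python sketch. It is only an illustration under assumptions: the table is assumed to be comma-separated, and the full_length and productive columns are assumed to hold the text values True or False. The function name `usable_contigs` and the example rows are hypothetical.

```python
import csv
import io


def usable_contigs(rows):
    """Keep only rows flagged as both full_length and productive."""
    def is_true(value):
        # Assumption: booleans are stored as the text "True"/"False".
        return str(value).strip().lower() == "true"

    return [r for r in rows
            if is_true(r["full_length"]) and is_true(r["productive"])]


# Hypothetical example table with only the columns needed for the filter.
example = io.StringIO(
    "barcode,raw_clonotype_id,full_length,productive\n"
    "AAAC-1,clonotype1,True,True\n"
    "AAAG-1,clonotype2,True,False\n"
    "AAAT-1,clonotype1,False,True\n"
)
kept = usable_contigs(csv.DictReader(example))
print([r["barcode"] for r in kept])
```

Only the first row passes, because it is the only one where both flags are true.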
There are two ways to perform clonotype count:
- Clonotype: This is the default method. The software simply counts based on the clonotype ID, which means that cells are grouped if they share both chains of the TCR. You can convert this counting result to an annotation to capture the composition of other factors.
- TCR chain: With this option, each row in the table is a single TCR chain, so cells are grouped if they share at least one chain. Hence, one cell can appear in several groups at a time, and you cannot convert this result into an annotation.
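The difference between the two counting methods can be sketched in Python as follows. This is a minimal illustration, not the software's implementation: the data layout (a dictionary from barcode to a clonotype ID and a set of chain labels) and the function names are assumptions made for the example.

```python
from collections import defaultdict


def count_by_clonotype(cells):
    """Group barcodes by clonotype ID: each cell lands in exactly one group."""
    groups = defaultdict(set)
    for barcode, (clonotype_id, _chains) in cells.items():
        groups[clonotype_id].add(barcode)
    return dict(groups)


def count_by_chain(cells):
    """Group barcodes by single chain: a cell joins one group per chain."""
    groups = defaultdict(set)
    for barcode, (_clonotype_id, chains) in cells.items():
        for chain in chains:
            groups[chain].add(barcode)
    return dict(groups)


# Hypothetical cells: barcode -> (clonotype ID, set of chain sequences).
cells = {
    "cell1": ("clonotype1", {"TRA:CAVRDF", "TRB:CASSLF"}),
    "cell2": ("clonotype1", {"TRA:CAVRDF", "TRB:CASSLF"}),
    "cell3": ("clonotype2", {"TRA:CAVRDF", "TRB:CASSQF"}),
}
by_clonotype = count_by_clonotype(cells)
by_chain = count_by_chain(cells)
```

Here cell3 shares one chain with cell1 and cell2, so it appears in two chain groups. That overlap is why the chain-based result cannot be turned into an annotation, where each cell needs a single label.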
Clonotype data for multiple batches
It is very common for different batches to share some barcodes, and even more common when you combine data from different studies. Therefore, when you are inside a dataset with multiple batches and want to submit clonotype data, the software needs to know which batch that clonotype data belongs to. In this case, the user interface is slightly different when you click the Upload data button.
Please note that you should not merge the clonotype data of different batches. We highly recommend that users keep these data as they are because of the duplicate barcode issue. The software has its own strategy for marking barcodes uniquely, and it will not accept any other customizations.
This function is included in BBrowser from version 1.1.0. It utilizes your existing private network protocols such as FTP or SFTP (SSH) to create an internal data repository. Please note that this feature only supports data sharing. Users using BBrowser should be able to access the internal FTP server to use the private repository.
Setting up your server
As for FTP services, we support passive mode connections. The ports used are 20 and 21 (FTP) and 22 (SFTP). While you can use any FTP/SFTP server, we suggest trying vsftpd for Linux and macOS, or FileZilla for Windows.
Below are instructions for installing, configuring, and creating an account for sftp on an Ubuntu system. Note that your IT team can set up a different system (Red Hat, Debian, CentOS, etc.) to host a private repository. For other information, please refer to this link.
sudo apt install vsftpd
Create a repository directory. This directory will contain all the data in the private data repository.
mkdir DIRECTORY_TO_STORE_DATA ftp
Edit file /etc/vsftpd.conf with the following content:
$USERLIST_FILE contains the users’ name, separated by a line break. The default location of this file is at /etc/vsftpd.userlist
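Once the server is running, one way to check it from a client machine is a short Python script using the standard `ftplib` module. This is only a sketch for plain FTP in passive mode (`ftplib` does not speak SFTP); the function names, the host, and the credentials in the usage comment are placeholders, not values from this manual.

```python
from ftplib import FTP

# Default ports mentioned in this section (control connections only).
DEFAULT_PORTS = {"FTP": 21, "SFTP": 22}


def default_port(protocol):
    """Return the default control port for a protocol name."""
    return DEFAULT_PORTS[protocol.upper()]


def list_repository(host, user, password, directory="/", timeout=10):
    """Log in over plain FTP in passive mode and list a directory."""
    ftp = FTP()
    ftp.connect(host, default_port("FTP"), timeout=timeout)
    try:
        ftp.login(user, password)
        ftp.set_pasv(True)  # passive mode, as required above
        ftp.cwd(directory)
        return ftp.nlst()
    finally:
        ftp.close()


# Example call (placeholder values, requires a reachable server):
# print(list_repository("ftp.example.org", "alice", "secret", "/data"))
```

If the call returns a list of file names, the account, the port, and the passive mode settings are working for that user.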
Setting up your BBrowser
To export to or download from FTP or SFTP, you first need to provide your account and host information. You can do this in the Settings section, which can be accessed via the Settings button on the top menu of the home page.
If you are inside a dataset, you can quickly access this setting panel by clicking the Settings button on the top menu as well.
In the Settings panel, you need to fill in all the information. Those are:
- Protocol: FTP or SFTP. This determines the type of your file transfer protocol
- Host: Where your server is hosted, e.g. mydata.bioturing.com
- Encryption: Whether you want to use file encryption in your protocol
- Logon type: The FTP protocol may allow anonymous access, in which case you do not need an account and password. If you choose the Ask for password option, you only need to type in your account information once. The software will log in automatically as long as your account exists and has the right permissions.
- Default directory: The software will create a metadata file, which gives the software a summary of all datasets that have been uploaded from BBrowser to the remote repository. This file is created in the default directory. This is also where your dataset is uploaded if you choose to export it to the remote repository.
After providing all the information, please click the Apply setting button to finish.
To export a dataset from your computer to your remote repository, you need to open the dataset in BBrowser. On the menu at the bottom right corner, click the Export button. You will see the export repository on the left side of the export window. You need to fill in all the information before uploading.
Please note that this function creates a clone of the current version of your dataset. Changes made to the dataset after uploading are not applied to the uploaded copy. If other users download that dataset from the remote repository, they will only see the exact version that was uploaded, and their changes will not sync to the remote version either.
Tag your data
On the front page of BBrowser, you can see the list of tags, which helps you quickly find data from a particular tissue or category. When you upload a dataset to your remote repository, you can label it with tags as well.
To tag a dataset, open that dataset and hit Export. Under the Tag(s) section, you can click to add or remove tags.
By default, BBrowser initializes several tags that are commonly used to classify data. When you are inside a dataset, you can only add tags to or remove tags from that dataset.
If you want to add a tag that is not yet included in the list, go to the Settings of BBrowser. Under the Custom tags section, you can add or remove any tag you want. Remember that any change to the list affects all local datasets. Datasets that have already been uploaded to the remote repository remain unchanged.
This list of tags (and classes of tags) is updated automatically when the software finds studies with new tags. The newly added tags immediately appear in the filtering drop-down list (on the front page).
To download data from the remote repository, go to the home page. Next to the search box, there is an option that allows you to view studies on the remote repository. You will then see all the datasets that have been uploaded, and you can download them just like public datasets.
Export data and plots
BioTuring Browser Single-cell supports exporting analysis results and graphs for publication.
Clustering results and annotation
With BioTuring Browser, you can export such data in TSV files. To do this, just click on Export (located underneath the right panel, to the bottom right of the scatter plot).
Here you can select among various export options.
To export the scatter plot, just click on the Camera icon underneath the plot, and choose a folder to save the file. The plot will be saved in PNG.
Differential expression mode
To export data in the differential expression mode, you can use the Export button when you are inside that mode. This button can help you export the gene table, enrichment analysis result, and all the plots inside.
Starting from version 1.1.4, the software can let you export the expression matrix from any dataset (public or private). The data will be saved in sparse matrix format, which includes three files: matrix.mtx, genes.tsv, and barcodes.tsv. This is the data that has gone through the preprocessing steps such as filtering or batch-effect removal (if applied).
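The three exported files can be read back with a few lines of Python. The sketch below uses only the standard library and handles the MatrixMarket coordinate layout (a `%%MatrixMarket` header, optional `%` comments, a size line, then 1-based `row column value` entries). It assumes that rows are genes and columns are barcodes, which is the common convention for this file trio but is not stated in this manual; the function name and the example values are hypothetical.

```python
def read_mtx(lines):
    """Parse MatrixMarket coordinate text into (n_rows, n_cols, entries).

    entries maps 0-based (row, column) pairs to values.
    """
    data = [ln for ln in lines if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in data[0].split())
    entries = {}
    for ln in data[1:]:
        row, col, value = ln.split()
        entries[(int(row) - 1, int(col) - 1)] = float(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries


# Hypothetical file contents, given inline instead of being read from disk.
mtx_lines = [
    "%%MatrixMarket matrix coordinate integer general",
    "3 2 3",
    "1 1 5",
    "3 1 2",
    "2 2 7",
]
genes = ["GeneA", "GeneB", "GeneC"]   # one entry per line of genes.tsv
barcodes = ["AAAC-1", "AAAG-1"]       # one entry per line of barcodes.tsv

n_rows, n_cols, entries = read_mtx(mtx_lines)
expression = {(genes[r], barcodes[c]): v for (r, c), v in entries.items()}
```

With real files, `mtx_lines` would come from `open("matrix.mtx")`, and the two name lists from the first column of each TSV file.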
We are fully aware that different datasets were generated under different experimental designs and may have to be treated individually in order to reproduce the published results as faithfully as possible. Supporting this is part of the long-term plan for the addon: it should be flexible enough to run more dataset-specific analyses without confusing users. At the moment, all public datasets go through the same pipeline, each step of which is discussed in this section.
For public datasets, we do not rerun the quantification process. These datasets were processed using the expression matrix reported in their papers.
Transcript quantification is only applied when you create a new study with raw sequencing files. The process is run by Hera-T (version 1.2.0) (Tran et al. 2019), a new algorithm that runs at least 30 times faster than CellRanger 3.0 with better accuracy.
This process removes cells with poor-quality gene expression and redundant non-expressed genes from the data. The data undergo the gene filter first, in which genes that have at least 1 UMI count in fewer than 4 cells are excluded. Then, cells with fewer than 200 genes having at least 1 UMI count are excluded. The process creates a new expression matrix, which may have fewer cells than the original data, and the addon only records the number of cells and genes of this filtered matrix. This is why, in some public datasets, the number of cells may differ slightly from the one reported.
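The two filtering steps can be sketched in Python as below. This is a minimal illustration, not the addon's code: the nested-dictionary layout and the function name are assumptions, and the sketch assumes that the per-cell gene count is taken after the gene filter, following the order described above.

```python
def filter_matrix(counts, min_cells=4, min_genes=200):
    """Apply the gene filter, then the cell filter.

    counts: dict gene -> dict cell -> UMI count (zeros may be omitted).
    """
    # Step 1: keep genes with at least 1 UMI in at least min_cells cells.
    detected_in = {
        gene: {cell: n for cell, n in per_cell.items() if n >= 1}
        for gene, per_cell in counts.items()
    }
    kept_genes = {g: pc for g, pc in detected_in.items() if len(pc) >= min_cells}

    # Step 2: keep cells with at least min_genes detected genes
    # (assumption: counted among the genes that survived step 1).
    genes_per_cell = {}
    for per_cell in kept_genes.values():
        for cell in per_cell:
            genes_per_cell[cell] = genes_per_cell.get(cell, 0) + 1
    kept_cells = {c for c, n in genes_per_cell.items() if n >= min_genes}

    return {
        gene: {cell: n for cell, n in per_cell.items() if cell in kept_cells}
        for gene, per_cell in kept_genes.items()
    }


# Toy data with small thresholds so the effect is visible.
toy = {
    "g1": {"c1": 3, "c2": 1, "c3": 0},
    "g2": {"c1": 2, "c2": 5},
    "g3": {"c3": 4},
}
filtered = filter_matrix(toy, min_cells=2, min_genes=2)
```

In the toy data, g3 is dropped because it is detected in a single cell, and c3 is then dropped because no retained gene is detected in it.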
Batch effect removal
This process is applied on request by the user. You can only enable it by creating a new study with multiple MTX files. The process treats each MTX file as a batch and tries to scale all batches with the chosen method. Currently, BBrowser provides 3 ways to remove batch effects:
- MNN correction (Haghverdi et al. 2018): This method is based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space, which is defined by the highly variable genes detected by the Seurat package (Gribov et al. 2010). To select the initial group, the process sorts the batches by the order from a graph-based clustering analysis and then picks the first batch. This analysis is run with the Louvain clustering method on the first 30 components of the PCA result. The method assumes that at least a subset of the population is shared by all the batches. It is effective on repeated measurements, but its computational time is high compared to the other methods.
- CCA (Butler et al. 2018): This method is widely used in scRNA-Seq via the Seurat package (Gribov et al. 2010). The idea of this method is to use an adaptive version of manifold alignment and anchor all the batches. It is simple, fast, and at the same time very effective when applied to data from different technologies.
- harmony (Korsunsky et al. 2018): This method is similar to CCA since it also projects cells into a shared embedding. But by considering cell type rather than dataset-specific conditions, it can simultaneously account for multiple experimental and biological factors. The applicability and speed of this method are comparable to those of CCA.
In the current version, there are two ways to run dimensionality reduction.
t-SNE (Maaten and Hinton 2008)
The first 30 components of the PCA are used for the t-SNE. The parameters for t-SNE depend on the size of the data:
- Less than 100: perplexity = 10, theta = 0.0, max_iter = 2000
- Less than 1000: perplexity = 20, theta = 0.0, max_iter = 1500
- Less than 10000: perplexity = 30, theta = 0.4, max_iter = 1000
- Other: perplexity = 50, theta = 0.5, max_iter = 1000
The analysis is done with the Rtsne package (Krijthe 2015).
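The size-dependent parameter table above can be written as a small lookup function. The Python sketch below only mirrors the listed values; it assumes that "size of the data" means the number of cells, and the function name is hypothetical.

```python
def tsne_parameters(n_cells):
    """Return the t-SNE settings listed above for a dataset of n_cells cells."""
    if n_cells < 100:
        return {"perplexity": 10, "theta": 0.0, "max_iter": 2000}
    if n_cells < 1000:
        return {"perplexity": 20, "theta": 0.0, "max_iter": 1500}
    if n_cells < 10000:
        return {"perplexity": 30, "theta": 0.4, "max_iter": 1000}
    return {"perplexity": 50, "theta": 0.5, "max_iter": 1000}


print(tsne_parameters(2500))
```

Smaller datasets get exact t-SNE (theta = 0.0) with more iterations, while larger ones use a coarser approximation with a higher perplexity.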
UMAP (McInnes and Healy 2018)
The first 30 components of the PCA are used for the UMAP. The number of neighbours is set to 15. The analysis is done with the uwot package (Melville 2018).
This analysis runs on the PCA result. For every dataset, the addon uses both Louvain (graph-based) and k-means clustering.
- Louvain clustering: The graph-based method is done with the igraph package (Csardi and Nepusz 2006) with a flexible number of nearest neighbours. This number is no larger than 20 and is estimated by the elbow method of k-means clustering on the PCA result.
- k-means clustering: This clustering is generated for a series of k values ranging from 2 to 10. The addon records the outcome for all k values so that the user can instantly switch to a different k when using the scatter plot.
Finding marker genes
We first defined marker genes of a group of cells in a data set as the genes that can be used to distinguish such cells from the rest. From this idea, we used the accuracy of classification as a metric to score the significance of a marker gene.
Considering each gene separately, we denote a cell as $c^{g}$, where $g$ is the label of a group of cells: $g = 1$ if the cell is in the group of interest (group 1, the group that we want to find the marker genes for), and $g = 2$ if the cell is not in the group of interest (group 2, the rest of the data). We denote $\bar{g}$ as the complement group of $g$.
The probability for a cell being in group $g$, given its expression level $x$, is:
$$P(g \mid x) = \frac{P(x \mid g)\,P(g)}{P(x \mid g)\,P(g) + P(x \mid \bar{g})\,P(\bar{g})}$$
In most cases, the group of interest is much smaller than the rest of the data, which can generate a sampling bias. To avoid this bias of sample size, we set:
$$P(g) = P(\bar{g}) = \frac{1}{2}$$
Accuracy of the classifier is:
The accuracy of prediction is:
Intuitively, For the robustness of the calculation, we divide the expression into intervals:
Where is the number of cells of group in group , and is the number of cells in group . For each gene, we can estimate the accuracy measure for using this gene to predict cells inside or outside the cluster and use this as a metric for ranking the marker genes.
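To make the idea concrete, here is a minimal Python sketch of an accuracy-style score for one gene. It is an illustration under assumptions, not the published scoring formula: expression values are split into equal-width intervals, both groups get equal prior weight, each interval is assigned to the group with the larger share of its cells, and the score is the resulting balanced accuracy. The function name, the number of intervals, and the assignment rule are all choices made for this example.

```python
def marker_accuracy(expr_in, expr_out, n_bins=10):
    """Balanced accuracy of a one-gene, interval-based classifier.

    expr_in:  expression values of cells in the group of interest
    expr_out: expression values of the remaining cells
    """
    values = list(expr_in) + list(expr_out)
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # avoid zero width for constant genes

    def bin_fractions(xs):
        # Fraction of a group's cells that fall in each interval.
        counts = [0] * n_bins
        for x in xs:
            k = min(int((x - low) / width), n_bins - 1)
            counts[k] += 1
        return [c / len(xs) for c in counts]

    p_in = bin_fractions(expr_in)
    p_out = bin_fractions(expr_out)
    # Equal priors: each interval votes for the group with the larger fraction.
    return 0.5 * sum(max(a, b) for a, b in zip(p_in, p_out))


perfect = marker_accuracy([5.0, 6.0, 7.0], [0.0, 0.0, 0.0, 0.0])
useless = marker_accuracy([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Under these assumptions, a gene that separates the two groups completely scores close to 1, and a gene with identical distributions in both groups scores 0.5, regardless of how unequal the group sizes are.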
We tested Venice on both real and simulated datasets. The benchmark considered the performance on 2 different sequencing technologies (full-length and UMI count), 4 different kinds of marker genes (including transitional genes), and 2 different kinds of null genes. Venice exhibited the best performance and accuracy in all cases. It could effectively detect different types of marker genes and avoid false positive results while keeping a modest running time.
Venice is now incorporated in Signac, a single-cell analytics package developed by BioTuring. The package is available at https://www.github.com/bioturing/signac
Gene set enrichment analysis
This analysis is adopted from the GSEA method (Subramanian et al. 2005), a common analysis for selecting potential biological terms given a sorted list of genes. The addon performs GSEA on 4 different kinds of terms: biological process, molecular function, cellular component, and biological pathway. The first 3 are from the Gene Ontology (Consortium 2004), and the last one is from the Reactome database (Joshi-Tope et al. 2005).
In the addon, there are two places you see this analysis:
- The Data panel: The gene list used for GSEA is the sorted list of marker genes. The genes are sorted in the Marker tab by the score discussed previously.
- The differential expression dashboard: The gene list used for GSEA comes from the result of the differential expression analysis. Genes are sorted by p-value.
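The core of a GSEA-style analysis is a running-sum statistic over the sorted gene list. The Python sketch below shows the simplest, unweighted form: walking down the list, the sum goes up at each gene in the set and down at each gene outside it, and the enrichment score is the largest deviation from zero. This is a simplification for illustration; the published method also weights steps by each gene's ranking metric and estimates significance by permutation, and nothing here is taken from the addon's code.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for a ranked gene list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # the score is undefined without both hits and misses

    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the signed maximum deviation
    return best


ranked = ["A", "B", "C", "D"]
top_heavy = enrichment_score(ranked, {"A", "B"})
bottom_heavy = enrichment_score(ranked, {"C", "D"})
```

A gene set concentrated at the top of the list gets a score near 1, and one concentrated at the bottom gets a score near -1.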
This addon can perform a quick cell-type prediction for a group of cells. When the user makes a selection by clicking a cluster/annotation or using the Select cell mode, the addon picks genes that are expressed in at least 35% of the group. This process does not select from the whole transcriptome, but from a list of cell-type markers in our curated knowledge base. Then, it uses that group’s profile to estimate the correlation with each cell type’s profile. A cut-off of 0.5 is applied to remove unlikely candidates. The remaining cell types undergo a tree search to find their common parents. Parents that have less weight (e.g. distinct from the rest) are removed. This process is repeated until only one cell type is left. The whole analysis usually takes 1-3 seconds to finish, so it is triggered automatically.
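The first step of this prediction, picking marker genes expressed in at least 35% of the selected cells, can be sketched in Python as follows. The data layout, the function name, and the gene names in the example are assumptions for illustration; the correlation and tree-search steps are not covered here.

```python
def expressed_markers(group_counts, marker_genes, min_fraction=0.35):
    """Return the marker genes detected in at least min_fraction of the cells.

    group_counts: dict gene -> list of counts, one per selected cell.
    marker_genes: candidate genes (stand-in for a curated marker list).
    """
    selected = []
    for gene in marker_genes:
        counts = group_counts.get(gene, [])
        if not counts:
            continue  # gene not measured in this dataset
        fraction = sum(1 for n in counts if n > 0) / len(counts)
        if fraction >= min_fraction:
            selected.append(gene)
    return selected


# Hypothetical counts for a selection of four cells.
group_counts = {
    "CD3E": [1, 0, 2, 0],
    "MS4A1": [0, 0, 0, 1],
    "ACTB": [5, 5, 5, 5],
}
picked = expressed_markers(group_counts, ["CD3E", "MS4A1", "CD8A"])
```

ACTB is expressed in every cell but is ignored because it is not in the candidate marker list, which reflects the point that genes are chosen from the marker list rather than from the whole transcriptome.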
Differential expression analysis
We support finding the differentially expressed genes between two groups of cells. Each group must have at least 3 cells. In the pre-processing step, we keep only non-spike-in genes that are expressed in at least 45% of the cells in one group. Then we use the edgeR package (Robinson, McCarthy, and Smyth 2010) to fit a quasi-likelihood negative binomial generalized log-linear model to the UMI-count data. To test for significance, we conduct genewise statistical tests and produce the adjusted p-value for each gene.
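The pre-processing rules in the paragraph above can be sketched in Python as follows. This is an illustration under assumptions: "at least 45% of cells in one group" is read as "in at least one of the two groups", spike-in genes are passed in as an explicit set, and the function names and data layout are hypothetical. The model fitting itself is done by edgeR and is not reproduced here.

```python
def fraction_expressed(counts):
    """Fraction of cells with a non-zero count for one gene."""
    return sum(1 for n in counts if n > 0) / len(counts)


def de_prefilter(group_a, group_b, spike_ins=frozenset(),
                 min_fraction=0.45, min_cells=3):
    """Return the genes that pass the pre-processing rules.

    group_a, group_b: dict gene -> list of UMI counts, one per cell.
    """
    n_a = len(next(iter(group_a.values())))
    n_b = len(next(iter(group_b.values())))
    if n_a < min_cells or n_b < min_cells:
        raise ValueError("each group needs at least %d cells" % min_cells)

    kept = []
    for gene in group_a:
        if gene in spike_ins or gene not in group_b:
            continue
        best = max(fraction_expressed(group_a[gene]),
                   fraction_expressed(group_b[gene]))
        if best >= min_fraction:
            kept.append(gene)
    return kept


# Hypothetical groups of 4 and 3 cells.
group_a = {"G1": [1, 0, 0, 0], "G2": [0, 0, 0, 0], "ERCC-1": [3, 3, 3, 3]}
group_b = {"G1": [0, 0, 0], "G2": [1, 1, 0], "ERCC-1": [2, 2, 2]}
genes_to_test = de_prefilter(group_a, group_b, spike_ins={"ERCC-1"})
```

G1 is dropped because it is detected in only 25% of one group and in none of the other, and ERCC-1 is dropped as a spike-in even though it is detected everywhere.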
For the log2FC value of each gene, we use the same method as the Seurat package (Gribov et al. 2010). Below is the detailed formula:
Azizi, E., Carr, A. J., Plitas, G., Cornish, A. E., Konopacki, C., Prabhakaran, S., ... & Choi, K. (2018). Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell, 174(5), 1293-1308.
Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. "Integrating single-cell transcriptomic data across different conditions, technologies, and species." Nature biotechnology 36, no. 5 (2018): 411.
Consortium, Gene Ontology. 2004. “The Gene Ontology (GO) Database and Informatics Resource.” Nucleic acids research 32(suppl_1): D258--D261.
Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal, Complex Systems 1695(5): 1–9.
Gribov, Alexander et al. 2010. “SEURAT: Visual Analytics for the Integrated Analysis of Microarray Data.” BMC medical genomics 3(1): 21.
Haghverdi, Laleh, Aaron T L Lun, Michael D Morgan, and John C Marioni. 2018. “Batch Effects in Single-Cell RNA-Sequencing Data Are Corrected by Matching Mutual Nearest Neighbors.” Nature biotechnology 36(5): 421.
Joshi-Tope, G et al. 2005. “Reactome: A Knowledgebase of Biological Pathways.” Nucleic acids research 33(suppl_1): D428--D432.
Korsunsky, Ilya, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-Ru Loh, and Soumya Raychaudhuri. "Fast, sensitive, and flexible integration of single cell data with Harmony." BioRxiv (2018): 461954.
Korthauer, K. D., Chu, L. F., Newton, M. A., Li, Y., Thomson, J., Stewart, R., & Kendziorski, C. (2016). A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome biology, 17(1), 222.
Krijthe, J H. 2015. “Rtsne: T-Distributed Stochastic Neighbor Embedding Using Barnes-Hut Implementation.” R package version 0.13, URL https://github.com/jkrijthe/Rtsne.
Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome biology 15(12): 550.
Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing Data Using T-SNE.” Journal of machine learning research 9(Nov): 2579–2605.
McInnes, Leland, and John Healy. 2018. “Umap: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv preprint arXiv:1802.03426.
Melville, James. 2018. “Uwot: The Uniform Manifold Approximation and Projection (UMAP) Method for Dimensionality Reduction.” https://github.com/jlmelville/uwot.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26(1): 139–40.
Shugay, M., Bagaev, D. V., Zvyagin, I. V., Vroomans, R. M., Crawford, J. C., Dolton, G., ... & Eliseev, A. V. (2017). VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic acids research, 46(D1), D419-D427.
Subramanian, Aravind et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102(43): 15545–50.
Tran, Thang, Thao Truong, Hy Vuong, and Son Pham. 2019. "Hera-T: An Efficient And Accurate Approach For Quantifying Gene Abundances From 10X-Chromium Data With High Rates Of Non-Exonic Reads." doi:10.1101/530501.
Wang, T., Li, B., Nelson, C. E., & Nabavi, S. (2019). Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC bioinformatics, 20(1), 40.