Hera

Fast and accurate transcript quantification for RNA-seq data

Meet Hera 2.0, a tool for transcript quantification with accuracy and speed.

Hera 2.0 can generate accurate read alignments in BAM format, while its time and memory usage remain comparable to those of pseudo-alignment approaches. For more information on how Hera 2.0 addresses the problems in RSEM’s EM algorithm, visit here.

A detailed benchmark of Hera 2.0 is available below.

Get Hera 2.0 binary package

  • The package can be installed on most macOS, Windows, and Linux distributions without dependencies.
  • Hera 2.0 is free for academic use. For commercial purposes or patent registration, you need to get a license.

Benchmark Data

We benchmarked Hera 2.0 and other transcript quantification tools (Hera, Kallisto, Salmon, RSEM+Bowtie2) on 20 simulated data sets generated as in the Kallisto paper (using this script), and on the most recent benchmark data from the SMC-RNA DREAM challenge.

Benchmark results

SMC-RNA DREAM Challenge sim51

Tools         Spearman  Pearson  Log-pearson  MAE(asinh)  False positive  False negative  Max false neg  Max false pos  Real time (s)
Hera 2.0      0.94176   0.99844  0.97477      0.02520     3767            2051            15.93          745.71         355
Hera          0.91420   0.94944  0.94971      0.08953     5434            2638            290.95         961.3          239
Kallisto      0.89005   0.98809  0.94535      0.06715     9175            2321            31.43          872.99         276
Salmon        0.88731   0.98997  0.94365      0.06842     9149            2526            955.18         953.72         252
RSEM+bowtie2  0.92906   0.99704  0.95929      0.06061     4424            2179            29.54          842.17         13294

References and Annotations

For SMC-RNA DREAM Challenge data, we use:

Genome reference: GRCh37.75
ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna_sm.primary_assembly.fa.gz

Gene Annotation: GRCh37.75
ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

For Kallisto simulations, we use:

Genome reference: GRCh38
ftp://ftp.ensembl.org/pub/release-80/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

Gene Annotation: GRCh38.80
ftp://ftp.ensembl.org/pub/release-80/gtf/homo_sapiens/Homo_sapiens.GRCh38.80.gtf.gz

The transcriptome FASTA file is generated with an RSEM script, given the above genome and gene annotation as input.
Log-pearson: Pearson correlation between log-transformed TPM values, with an offset of 0.01.
MAE(asinh): mean absolute error of asinh-transformed TPM values (transcripts with zero TPM in the ground truth and in the prediction are filtered out).
False positive: the number of unexpressed transcripts that the program predicts to be expressed.
False negative: the number of expressed transcripts that the program predicts to be unexpressed.
Max false neg: the maximum ground-truth TPM value among the expressed transcripts that the program predicts to be unexpressed.
Max false pos: the maximum predicted TPM value among the unexpressed transcripts.
Real time: wall-clock time from start to finish of the call (in seconds).

Note that both ground-truth and predicted TPM values are rounded to 2 decimal digits; highlighted numbers in the table are chosen as the best values with epsilon = 0.005.
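As a rough illustration of how the metrics defined above could be computed, here is a minimal Python sketch. It is not the script used to produce the table. The function names (`pearson`, `benchmark_metrics`), the dict-based inputs, the use of the natural logarithm (the correlation does not depend on the log base), and the reading of the MAE filter as "zero in both ground truth and prediction" are all assumptions made for this example.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def benchmark_metrics(truth, pred, offset=0.01, digits=2):
    """Compute table-style metrics from two dicts: transcript id -> TPM.

    Assumption: a transcript is "expressed" when its rounded TPM is > 0,
    and a transcript missing from either dict has TPM 0.
    """
    ids = sorted(set(truth) | set(pred))
    # Round both sides to 2 decimal digits, as stated in the note above.
    t = [round(truth.get(i, 0.0), digits) for i in ids]
    p = [round(pred.get(i, 0.0), digits) for i in ids]

    # Log-pearson: Pearson correlation of log(TPM + offset).
    log_pearson = pearson([math.log(x + offset) for x in t],
                          [math.log(x + offset) for x in p])

    # MAE(asinh): skip transcripts that are zero in both truth and prediction.
    kept = [(a, b) for a, b in zip(t, p) if not (a == 0 and b == 0)]
    mae_asinh = (sum(abs(math.asinh(a) - math.asinh(b)) for a, b in kept)
                 / len(kept)) if kept else 0.0

    # False positives: truly unexpressed, predicted expressed (keep predicted TPM).
    false_pos = [b for a, b in zip(t, p) if a == 0 and b > 0]
    # False negatives: truly expressed, predicted unexpressed (keep true TPM).
    false_neg = [a for a, b in zip(t, p) if a > 0 and b == 0]

    return {
        "log_pearson": log_pearson,
        "mae_asinh": mae_asinh,
        "false_positive": len(false_pos),
        "false_negative": len(false_neg),
        "max_false_pos": max(false_pos, default=0.0),
        "max_false_neg": max(false_neg, default=0.0),
    }
```

Calling `benchmark_metrics(truth, pred)` with two `{transcript_id: tpm}` dicts returns one dict per tool, which can then be compared column by column.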

Machine specs

Spec 1 - for SMC simulated samples
CPU: 40 cores Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
RAM: 96 GiB DDR3 1866 MHz

Spec 2 - for the 20 samples simulated with the Kallisto paper’s script
CPU: 32 cores Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
RAM: 64 GiB DDR3 1333 MHz

Tools used

Hera 2.0

Run with 32 CPU cores…

Bowtie 2 + RSEM

Bowtie 2 (version 2.3.4.1)
RSEM (version 1.3.0)

Run command:

Index:

    $BIN/rsem-prepare-reference \
        --gtf $GENE_GTF \
        --bowtie2 --bowtie2-path $BIN \
        -p 32 \
        $GENOME_FASTA \
        $INDEX_DIR/rsem/genome

Quant:

    $BIN/rsem-calculate-expression \
        --bowtie2 --bowtie2-path $BIN \
        -p 32 \
        --paired-end \
        $READS_1 $READS_2 \
        $INDEX_DIR/rsem/genome \
        $OUTPUT_DIR/rsem/result
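To compare RSEM's output with the ground truth, the per-transcript TPM values have to be read back from its results table. With the output prefix used in the Quant command, RSEM writes a tab-separated file named `$OUTPUT_DIR/rsem/result.isoforms.results` that includes `transcript_id` and `TPM` columns. The sketch below shows one way to load it; the function name `read_rsem_tpm` and the example rows are made up for illustration and are not benchmark output.

```python
import csv
import io


def read_rsem_tpm(handle):
    """Return {transcript_id: TPM} from an RSEM *.isoforms.results table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["transcript_id"]: float(row["TPM"]) for row in reader}


# Tiny in-memory example with made-up values (not real benchmark output).
example = io.StringIO(
    "transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct\n"
    "ENST_A\tENSG_X\t1500\t1320.50\t42.00\t12.34\t10.01\t100.00\n"
    "ENST_B\tENSG_Y\t900\t720.50\t0.00\t0.00\t0.00\t0.00\n"
)
tpm = read_rsem_tpm(example)
```

For a real run, the same function can be given an open file instead of the `io.StringIO` object.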
                        

We are happy to release Hera, a fast and accurate algorithm that maps spliced RNA-seq reads to a genome while simultaneously estimating transcript abundances, detecting gene fusions, and outputting alignment files for visualization and variant calling.

In the time STAR takes to output a SAM alignment, Hera can output a BAM file (with base-to-base alignment), transcript quantification (in TPM), and a list of fusion genes.

Hera's quantification algorithm obtained the best ranking in a round of the SMC-RNA DREAM challenge: https://www.synapse.org/#!Synapse:syn2813589/wiki/423306

Running Hera is simple:

    hera quant -i index/ -t 32 read1.fastq read2.fastq

./hera/build/hera quant -i path/to/index_directory [OPTIONAL] read1.fastq read2.fastq

[OPTIONAL]:

  • -o [output directory] (default: ./)
  • -t [number of running threads] (default: 1)
  • -z [level of BAM file compression (1 - 9)] (default: -1)
  • -b [number of bootstraps] (default: 1)
  • -w [output BAM file, 0: true, 1: false] (default: 0)
  • -f [Genome fasta file]
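When Hera is called from a pipeline, it can help to assemble the command line programmatically. The following Python sketch builds the argument list for `hera quant` using only the flags listed above. The helper name `hera_quant_command` and its keyword names are hypothetical, and the default executable name `hera` assumes the binary is on the PATH.

```python
def hera_quant_command(index_dir, read1, read2, hera_bin="hera",
                       output_dir=None, threads=None, bam_compression=None,
                       bootstraps=None, write_bam=None, genome_fasta=None):
    """Build the argument list for `hera quant` (hypothetical helper).

    Options left as None are omitted, so Hera's own defaults apply.
    """
    cmd = [hera_bin, "quant", "-i", index_dir]
    optional = [
        ("-o", output_dir),       # output directory
        ("-t", threads),          # number of running threads
        ("-z", bam_compression),  # level of BAM file compression
        ("-b", bootstraps),       # number of bootstraps
        ("-f", genome_fasta),     # genome FASTA file
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    if write_bam is not None:
        # The option list encodes -w as 0: true, 1: false.
        cmd += ["-w", "0" if write_bam else "1"]
    # Read files come last, after all options.
    cmd += [read1, read2]
    return cmd


cmd = hera_quant_command("index/", "read1.fastq", "read2.fastq", threads=32)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell quoting problems with file paths.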

Hera is available for both academics and industry labs. Its source code is released under the MIT license. For more information: https://github.com/bioturing/hera

You can download Hera here: https://github.com/bioturing/hera/releases