Welcome to NORA

A tool for transcript quantification where accuracy matters


We introduce Nora, a tool for transcript quantification where accuracy matters. Nora is much more accurate than Hera, an algorithm that we released last year (Hera obtained the best ranking in the latest round of the SMC-RNA DREAM challenge). Nora also outputs accurate read alignments in BAM format, while its time and memory consumption are similar to those of pseudo-alignment approaches.

Nora’s source code is written in C, following the Linux kernel coding style. The compiled package is tiny (262 KB) and can be installed on most macOS and Linux distributions without dependencies. Nora’s binary package has not yet been released to the public; to use Nora, you need an invitation.


Benchmark Data

We benchmarked Nora and other transcript quantification tools (Hera, Kallisto, Salmon, RSEM+Bowtie2) on 20 simulated data sets generated with the script from the Kallisto paper, and on the most recent benchmark data from the SMC-RNA DREAM challenge.

Benchmark results

References and Annotations

For SMC-RNA DREAM Challenge data, we use:

Genome reference: GRCh37.75

ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna_sm.primary_assembly.fa.gz

Gene Annotation: GRCh37.75

ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

For Kallisto simulations, we use:

Genome reference: GRCh38

ftp://ftp.ensembl.org/pub/release-80/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

Gene Annotation: GRCh38.80

ftp://ftp.ensembl.org/pub/release-80/gtf/homo_sapiens/Homo_sapiens.GRCh38.80.gtf.gz

The transcriptome FASTA file is generated with an RSEM script, given the above genome and gene annotation as input.

Log-Pearson: Pearson correlation between log-transformed TPM values, computed with an offset of 0.01.

MAE(asinh): mean absolute error of asinh-transformed TPM values (transcripts with zero TPM in both the ground truth and the prediction are filtered out).

False positive: the number of unexpressed transcripts that the program predicts to be expressed.

False negative: the number of expressed transcripts that the program predicts to be unexpressed.

Max false neg: the maximum ground-truth TPM value among expressed transcripts that the program predicts to be unexpressed.

Max false pos: the maximum predicted TPM value among unexpressed transcripts.

Real time: wall-clock time, measured from start to finish of the call (in seconds).

Note that both ground-truth and predicted TPM values are rounded to 2 decimal digits. Highlighted numbers in the table are chosen as the best value with epsilon = 0.005.
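The metric definitions above can be sketched in Python as follows. This is only an illustrative sketch, not the evaluation script used for the benchmark. The names round_tpm, pearson and evaluate are hypothetical. It assumes that a transcript counts as expressed when its rounded TPM is greater than zero, that MAE(asinh) skips transcripts that are zero in both vectors, and that the natural logarithm is used (the Pearson correlation does not depend on the log base).

```python
import math

OFFSET = 0.01  # offset added before the log transform (from the Log-Pearson definition)


def round_tpm(values):
    """Round TPM values to 2 decimal digits before any comparison."""
    return [round(v, 2) for v in values]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def evaluate(truth, pred):
    """Compute the benchmark metrics for one sample (hypothetical helper)."""
    truth, pred = round_tpm(truth), round_tpm(pred)
    pairs = list(zip(truth, pred))

    # Log-Pearson: correlate log(tpm + 0.01) of ground truth and prediction.
    log_t = [math.log(t + OFFSET) for t in truth]
    log_p = [math.log(p + OFFSET) for p in pred]

    # MAE(asinh): assumed reading, drop transcripts that are zero in both vectors.
    kept = [(t, p) for t, p in pairs if not (t == 0 and p == 0)]
    mae = (
        sum(abs(math.asinh(t) - math.asinh(p)) for t, p in kept) / len(kept)
        if kept
        else 0.0
    )

    # False positives: unexpressed in the ground truth, predicted as expressed.
    false_pos = [p for t, p in pairs if t == 0 and p > 0]
    # False negatives: expressed in the ground truth, predicted as unexpressed.
    false_neg = [t for t, p in pairs if t > 0 and p == 0]

    return {
        "log_pearson": pearson(log_t, log_p),
        "mae_asinh": mae,
        "false_positive": len(false_pos),
        "false_negative": len(false_neg),
        "max_false_pos": max(false_pos, default=0.0),
        "max_false_neg": max(false_neg, default=0.0),
    }


if __name__ == "__main__":
    # Toy vectors, made up purely for illustration.
    truth = [0.0, 10.0, 5.0, 0.0, 2.0]
    pred = [0.0, 9.0, 0.0, 3.0, 2.004]
    for name, value in evaluate(truth, pred).items():
        print(f"{name}: {value}")
```

Rounding happens first, so a predicted TPM below 0.005 is treated as zero and does not count as a false positive in this sketch.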

Machine specs

Spec 1 - for SMC simulated samples

CPU: 40 cores Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz

RAM: 96 GiB DDR3 1866 MHz

Spec 2 - for the 20 samples simulated with the Kallisto paper’s script

CPU: 32 cores Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

RAM: 64 GiB DDR3 1333 MHz

Tools used

Nora (coming soon…)

Run with 32 CPU cores…

Bowtie 2 + RSEM

Bowtie 2 (version 2.3.4.1) (Home page)

RSEM (version 1.3.0) (Home page)

Run command:


        
            Index: $BIN/rsem-prepare-reference                          \
                        --gtf                       $GENE_GTF           \
                        --bowtie2 --bowtie2-path    $BIN                \
                        -p                          32                  \
                        $GENOME_FASTA                                   \
                        $INDEX_DIR/rsem/genome
        
    

        
            Quant: $BIN/rsem-calculate-expression               \
                    --bowtie2 --bowtie2-path    $BIN            \
                    -p                          32              \
                    --paired-end                                \
                    $READS_1 $READS_2                           \
                    $INDEX_DIR/rsem/genome                      \
                    $OUTPUT_DIR/rsem/result
        
    

Kallisto (Home page)

Version 0.44.0

Run command:


        
            Index: $BIN/kallisto index -i $INDEX_DIR/kallisto $RSEM_TRANSCRIPT
        
    

        
            Quant: $BIN/kallisto quant  -i          $INDEX_DIR/kallisto     \
                                        -o          $OUTPUT_DIR/kallisto    \
                                        -t          32                      \
                                        $READS_1 $READS_2
        
    

Salmon (Home page)

Version 0.9.1

Run command:


    
        Index: $BIN/salmon index    --index         $INDEX_DIR/salmon   \
                                    --transcripts   $RSEM_TRANSCRIPT
    

    
        Quant: $BIN/salmon quant    --index         $INDEX_DIR/salmon   \
                                    --libType       A                   \
                                    -1              $READS_1            \
                                    -2              $READS_2            \
                                    -p              32                  \
                                    --output        $OUTPUT_DIR/salmon
    

Hera (Home page)

Version 1.2

Run command:


    
        Index: $BIN/hera_build  --fasta         $GENOME_FASTA       \
                                --gtf           $GENE_GTF           \
                                --outdir        $INDEX_DIR/hera/genome
    

    
        Quant: $BIN/hera quant      -i              $INDEX_DIR/hera     \
                                    -1              $READS_1            \
                                    -2              $READS_2            \
                                    -t              32                  \
                                    -o              $OUTPUT_DIR/hera