RNASEQR Manual (Last revision: 2012.02.07)

What is RNASEQR? [back to top]

RNASEQR is a nucleotide sequence mapper/aligner and is designed specifically for RNA-seq data analysis. It takes advantage of annotated transcripts and genomic reference sequences to obtain high quality mapping/alignment results.

RNASEQR is very fast and highly accurate. It significantly improves the mapping results, especially on transcripts containing smaller exons, which results in a more accurate assessment of gene-expression profiles and better transcript structures. The RNASEQR pipeline also significantly reduces false identification of single nucleotide variants (SNVs) near the splice junctions.

RNASEQR output is a standard SAM file and can be converted BAM files using SAMTools. Such standard SAM is compatible with many downstream analysis tool except a few, e.g. Cufflinks. To get the Cufflinks compatiable SAM ouput, please read Output. RNASEQR and its open source codes are available at http://hood.systemsbiology.net/rnaseqr.php.

Warning: Please note that the files, such as sequence files, gene annotation files etc, obtained from different databases are likely to have headers and/or identifiers in different formats which may cause problems in running RNASEQR.

Installation and system requirements [back to top]

RNASEQR was written in Python 2.7, and shall run on most platforms as long as a proper Python interpreter (v2.7+) is installed. Please make sure the following programs are also properly installed before running RNASEQR.

  • Bowtie v.0.12.7
  • BLAT v.34
  • Python 2.7
RNASEQR has been tested on the above programs with the indicated version. RNASEQR was developed on the 64-bit Cent OS Linux and shall run on both the 64- and the 32-bit Linux systems.

To install Bowtie, BLAT, and Python, please go to the developers’ websites.

To install RNASEQR, download the RNASEQR source codes and decompress it to the directory of your choice.

# tar zxvf RNASEQR_v1.0.2.tgz

Run test dataset [back to top]

Before analyzing your own RNA-seq data using RNASEQR, please run a test dataset to make sure everything is properly setup. Please download the prebuilt indices and transcriptomic annotation, as well as the test dataset and configuration files (for both single-end and paired-end reading).

  • Prebuilt BLAT index for human genome
  • Prebuilt Bowtie index for human genome
  • Prebuilt Bowtie index for human transcriptome (UCSC Known Gene hg19)
  • UCSC Known Gene annotation (hg19, BED format)
  • Test RNA-seq dataset and configuration files
  • To decompress these files, please enter

    # tar zxvf [file name].tgz
    

    The sample dataset is a small portion of the RNA-seq data in database GEO with accession number GSE19166. The two FASTA files are paired-end; each contains 10,000 RNA-seq reads. The pre-computed result is also available here.

  • Mapping/alignment result of the test dataset
  • The RNA-seq sequence files described in the RNASEQR manuscripts is also available in database GEO with accession number GSE33328. Please read the RNASEQR manuscript for more information.

    Build your own indices for RNASEQR [back to top]

    RNASEQR requires three index files: two for Bowtie that record both the transcriptomic and the genomic reference sequences; and one for BLAT that indexes genomic reference sequences. To build these indices, you need the FASTA-format sequence files for both genome and transcriptome.

    genome [back to top]

    To build the genomic indices for Bowtie and BLAT, you will need to download the genomic reference file. Go to the ENSEMBL FTP site and click the link in the DNA column. Download the FASTA files for each chromosome, e.g. Homo_sapiens.GRCh37.65.dna.chromosome.(1-22, X, Y, and MT).fa.gz for human. Once all the chromosome are downloaded, decompress these files and concatenate these files to a single file. For example, human.

    # gzip -d Homo_sapiens.GRCh37.65.dna.chromosome.*.fa.gz
    # cat Homo_sapiens.GRCh37.65.dna.chromosome.*.fa > Homo_sapiens.GRCh37.65.genome.fasta
    

    After appending all chromosomes to a single file, run create_map_and_tables.pl to create the map_and_tables.py script which contains the chromosome information of the species in your dataset. The original map_and_tables.py provided in the RNASEQR package is meant for human. Warning: Inaccurate chromosome information written in the map_and_tables.py will screw up the mapping result completely.

    # perl create_map_and_tables.pl Homo_sapiens.GRCh37.65.genome.fasta

    Once having the correct map_and_tables.py, build the genomic index for Bowtie.
    # /directory_path/bowtie-0.12.7/bowtie-build Homo_sapiens.GRCh37.65.genome.fasta Homo_sapiens.GRCh37.65.genome
    
    Then build the index for BLAT,
    # /directory_path/blat_v34/faToTwoBit Homo_sapiens.GRCh37.65.genome.fasta Homo_sapiens.GRCh37.65.genome.2bit
    
    You shall now have all the genomic indices listed below for both Bowtie and BLAT.
  • Homo_sapiens.GRCh37.65.genome.2bit
  • Homo_sapiens.GRCh37.65.genome.1.ebwt
  • Homo_sapiens.GRCh37.65.genome.2.ebwt
  • Homo_sapiens.GRCh37.65.genome.3.ebwt
  • Homo_sapiens.GRCh37.65.genome.4.ebwt
  • Homo_sapiens.GRCh37.65.genome.rev.1.ebwt
  • Homo_sapiens.GRCh37.65.genome.rev.2.ebwt
  • transcriptome [back to top]

    RNASEQR supports the UCSC GENE and ENSEMBL GENE transcriptome annotations. Both the UCSC and the ENSEMBL annotations record more gene transcripts than other projects, such as the NCBI RefSeq and the Consensus CDS (CCDS) project, and also give a better result in RNASEQR. Please read RNASEQR manuscript for more information. Please follow the steps to download the FASTA transcriptomic references and gene annotation either from the ENSEMBL or the UCSC webpages.

    In ENSEMBL, go to the ENSEMBL FTP site and download the FASTA-format cDNA sequences (in the cDNA column). Please download the GTF gene annotation file which is also required by RNASEQR. Decompress the cDNA sequence file, and run filter_alternative_loci_ENS.pl to remove the transcripts annotated to the alternative genomic loci.

    # gzip -d Homo_sapiens.GRCh37.65.cdna.all.fa.gz
    # perl  filter_alternative_loci_ENS.pl  Homo_sapiens.GRCh37.63.genome.fasta  Homo_sapiens.GRCh37.65.cdna.all.fa Homo_sapiens.GRCh37.65.cdna.cleaned.fa
    

    You can also get both the transcriptomic sequence files and annotation files at the UCSC Table Browser. Select the 'Genes and Gene Prediction Tracks' group. In the track option, select 'Ensembl Genes' for the Ensembl gene annotation or 'UCSC Genes' for the UCSC gene annotation. Select 'sequence' or 'BED' to get the FASTA sequence files or the BED gene annotation files. Warning: the GTF format is not supported by current RNASEQR. Once the download is completed, run filter_alternative_loci_UCSC_BED.pl to remove the transcripts annotated to the alternative genomic loci.

    Usage: perl  filter_alternative_loci_UCSC_BED.pl  Genomic_fasta_file  UCSC_bed_file Transcriptomic_fasta_file  Output_fasta_file_name
    For examples:
    # perl  filter_alternative_loci_UCSC_BED.pl  Homo_sapiens.GRCh37.65.genome.fasta  Homo_sapiens.UCSC.hg19.bed  Homo_sapiens.UCSC.hg19.fa Homo_sapiens.UCSC.hg19_cleaned.fa
    

    Once the transcripts annotated in the alternative genomic loci are removed using filter_alternative_loci_ENS.pl or filter_alternative_loci_UCSC_BED.pl, depending on from which website you obtained the sequence files and annotation. Use Bowtie to create the transcriptomic index. For exmaple, human.

    # /directory_path/bowtie-0.12.7/bowtie-build Homo_sapiens.GRCh37.65.cdna.cleaned.fa Homo_sapiens.GRCh37.65.transcriptome
    

    Once completed, you shall have the transcriptomic index files listed below for Bowtie.

  • Homo_sapiens.GRCh37.65.transcriptome.1.ebwt
  • Homo_sapiens.GRCh37.65.transcriptome.2.ebwt
  • Homo_sapiens.GRCh37.65.transcriptome.3.ebwt
  • Homo_sapiens.GRCh37.65.transcriptome.4.ebwt
  • Homo_sapiens.GRCh37.65.transcriptome.rev.1.ebwt
  • Homo_sapiens.GRCh37.65.transcriptome.rev.2.ebwt
  • color-spaced index [back to top]

    You will need a color space specific index for Bowtie. To build the color space indices for human genome and transcriptome. Use -C to instruct Bowtie for color-spaced index. For example:

    # /directory_path/bowtie-0.12.7/bowtie-build -C Homo_sapiens.GRCh37.63.genome.fasta Homo_sapiens.GRCh37.63.genome
    # /directory_path/bowtie-0.12.7/bowtie-build -C -o 0 Homo_sapiens.GRCh37.63.cdna.all.fa Homo_sapiens.GRCh37.63.transcriptome
    

    Please note that the current version of RNASEQR does not support color-spaced sequence mapping/alignment using BLAT. This will affect the detection of novel splicing junctions in your RNA-seq dataset.

    Instead of creating a color-space index for Bowtie, you can convert your csFASTA and sequence quality files to FASTQ using some publicly available tools and run RNASEQR.

    Execute RNASEQR [back to top]

    RNASEQRp.py is the main script, and can be executed either solely in command line or with a configuration file specifying arguments, or a combination of both. Warning: Please note that the command line augments will override the same arguments specified in the configuration file.

    For example, specify all arguments solely in the command line.

    # python /directory_path/RNASEQRp.py -a /PATH/bowtie-0.12.7/bowtie -m 3 -b 1 -p 2 -s Illumina1.5+ -t BED -i /PATH/seq_indexes/Homo_sapiens.GRCh37.59.UCSC.transcriptome -f /PATH/seq_indexes/UCSC_knownGene_hg19.bed -g /PATH/seq_indexes/Homo_sapiens.GRCh37.59.genome -r 75 -q /PATH/example.fastq

    To list all available arguments, use the option -h.
    # python RNASEQRp.py -h
    usage: RNASEQRp.py [-h] [-a PATH] [-m Integer] [-b Integer] [-p Integer]
                       [-s TYPE] [-l PATH] [-n PATH] [-t TYPE] [-i PATH] [-q PATH]
                       [-f PATH] [-g PATH] [-r Integer] [-e Integer] [-x Integer]
                       [-C Integer] [-G String] [-o Integer] [--version]
                       [--debug Integer]
                       [setting.config]
    
    positional arguments:
      setting.config   configuration file. Please note that the command-line
                       arguments will override the parameters specified in this
                       file.
    
    optional arguments:
      -h, --help       show this help message and exit
      -a PATH          aligner_path
      -m Integer       align_mismatch
      -b Integer       segment align_mismatch
      -p Integer       number_of_thread
      -s TYPE          phred_type: "Sanger","Solexa","Illumina1.3","Illumina1.5+"
      -l PATH          local_aligner_path
      -n PATH          genome_index_prefix_for_local_aligner
      -t TYPE          annotation_type
      -i PATH          annotation_index_prefix
      -q PATH          Input FASTQ file. If the input is paired-end data, use
                       comma to separate them in order.
      -f PATH          annotation_file
      -g PATH          genome_index_prefix
      -r Integer       The length of short read
      -e Integer       The number of hit anchors to be valid
      -x Integer       The length of anchors
      -C Integer       If use colorspace, set the value of this option as 1.
      -G String        [Paired-end Only] The insertion length of paired-end reads.
                       The format is Integer-Integer, e.g., 200-400.
      -o Integer       Maximum number of overlaping exons in annotation
      --version        show program's version number and exit
      --debug Integer  Set value 1 to open the debug mode
    

    Arguments and descriptions.

    -aDirectory where Bowtie is installed.
    -mNumber of mismatches allowed in an RNA-seq read.
    -bNumber of mismatches allowed in a segmented RNA-seq read. The default length of a segmented read is 25bp.
    -pNumber of CPU threads assigned to RNASEQR. Warning: Please make sure the maximal CPU threads available in your system.
    -sPhred format mark of the sequencing qaulity in the input FASTQ RNA-seq file. Please specify one of the followings: "Sanger","Solexa","Illumina1.3", and "Illumina1.5+".
    -lDirectory where BLAT is installed.
    -nFile location of the genomic indices for BLAT.
    -qFile location of the RNA-seq sequence file(s).
    -fFile location of the transcriptomic annotation.
    -tFile format of the transcriptome annotation. RNASEQR currently supports BED and GTF formats.
    -iFile location of the transcriptomic index (for Bowtie).
    -gFile location of the genomic indices (for Bowtie).
    -rRNA-seq sequence read length
    -CWrite 1 when the input RNA-seq data is colorpace format.
    -GInsert size in the paired-end reading.
    -oMaximum number of overlapped transcripts which constitute the same exon in full or in partitial.

    configuration file [back to top]

    Configuration file records specified arguments in a single file. Please specify each parameter in the form of VARIABLE_NAME=VALUE, one per line.

    Warning: Please note that the parameters specified in the configuration file will be overrode by the same parameters, if exists, in the command line.

    Example configuration file RNASEQR.cfg.

    input_fastq=/directory_path/data_1.fq,/directory_path/data_2.fq
    paired_end_gap=200-400
    short_read_length=50
    phred_type=Illumina1.5+
    align_mismatch=2
    seg_align_mismatch=4
    aligner_path=/directory_path/bowtie-0.12.7/bowtie
    local_aligner_path=/directory_path/blat_v34/blat
    number_of_thread=2
    annotation_type=BED
    annotation_file=/directory_path/seq_indexes/Homo_sapiens.UCSC_knownGene_hg19.bed
    annotation_index_prefix=/directory_path/Homo_sapiens.GRCh37.59.UCSC.transcriptome.o0
    genome_index_prefix=/directory_path/Homo_sapiens.GRCh37.59.genome.fa
    genome_index_prefix_for_local_aligner=/directory_path/Homo_sapiens.GRCh37.59.genome.fa.2bit
    

    Command line arguments and their VARIABLE_NAME counterpart.

    OptionVarible name used in configuration file
    -aaligner_path
    -malign_mismatch
    -bseg_align_mismatch
    -pnumber_of_thread
    -sphred_type
    -iannotation_index_prefix
    -fannotation_file
    -ggenome_index_prefix
    -rshort_read_length
    -llocal_aligner_path
    -ngenome_index_prefix_for_local_aligner
    -tannotation_type
    -Ccolorspace
    -Gpaired_end_gap

    RNA-seq sequence library [back to top]

    RNASEQR reads the FASTQ format RNA-seq library single-ended by default unless otherwise specified. Please not that different platforms and output from different versions of software may apply different Phred quality score encodings. Specify the quality scoring with the option -s in the command line or phred_type in the configuration file.

    paired-end RNA-seq library [back to top]

    The paired-end RNA-seq data is usually written in two FASTQ files: one for each end (mate). To analyze a paired-end library, the two FASTQ files should be separated by comma after the option -q. For example,

    # python  RNASEQRp.py  -q  end_1.fastq,end_2.fastq  RNASEQR.cfg

    The mapping/alignment result of each end will be recorded in separate lines. The result of the upstream end is written ahead of that of the downstream end.

    Warning: RNASEQR was initially developed under the older Illumina CASAVA versions. Since CASAVA version 1.8, Illumina has changed the denotation of the header line for paired-end reads which causes a problem to RNASEQR. Please run convert_CASAVA1.8_headers.pl to fix this issue.

    # perl  convert_CASAVA1.8_headers.pl  paired_end_file_1  paired_end_file_2  output_paired_end_file_1 output_paired_end_file_2
    Meanwhile, if you have the Illumina export text files rather than FASTQ files, you will need to convert to the FASTQ format. Please run convert_CASAVA_export.py to convert the CASAVA export text files to FASTQ.

    Color space sequence file [back to top]

    RNASEQR processes color space FASTQ files when the option -C is specified. In the meantime, the index files need to be specifically built in color space format. The csfasta files can also be converted to fastq files using the publicly available third-party programs.

    RNASEQR output [back to top]

    Only uniquely mapped RNA-seq reads are reported in the current version of RNASEQR. The mapping/alignment output is written in SAM format and has 12 columns, delimited by tab. Each line records a sequence and its mapping/alignment result. For more detail about the SAM file format, please go to SAM format specification page.

    1. Read name

    2. Sum of all applicable flags. Flags relevant to Bowtie are:

      1

      The read is one of a pair

      2

      The alignment is one end of a proper paired-end alignment

      4

      The read has no reported alignments

      8

      The read is one of a pair and has no reported alignments

      16

      The alignment is to the reverse reference strand

    3. Name of reference sequence where alignment occurs, or ordinal ID if no name was provided.

    4. 1-based offset into the forward reference strand where leftmost character of the alignment occurs.

    5. Mapping quality.

    6. CIGAR string representation of alignment.

    7. Name of reference sequence where mate's alignment occurs. Set to = if the mate's reference sequence is the same as this alignment's, or * if there is no mate.

    8. 1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate.

    9. Inferred insert size. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if there is no mate.

    10. Read sequence (reverse-complemented if aligned to the reverse strand).

    11. ASCII-encoded read qualities (reverse-complemented if the read aligned to the reverse strand). The encoded quality values are on the Phred quality scale and the encoding is ASCII-offset by 33, similarly to a FASTQ file.

    12. Optional fields.

    compatibility to Cufflinks [back to top]

    Cufflinks requires an additional tag for the spliced alignments. To use RNASEQR's output SAM files as inputs to Cufflinks, please run convert_RNASEQRSAM_to_CufflinkSAM.py script.

    # Python /directory_path/convert_RNASEQRSAM_to_CufflinkSAM.py -s (your RNASEQR SAM file) (genome sequence file (the concatenated FASTA file)) -o (file name of the Cufflinks compatible SAM file)

    Change logs [back to top]

    • Version 1.0.2

      • add support to the gene annotation and gene reference sequences from UCSC Table Browser download.
      • add filter_alternative_loci_ENS.pl and filter_alternative_loci_UCSC_BED.pl scripts to fix an issue due to the transcriptomic index when running paired-end library.
      • add create_map_and_tables.pl script to fix hard-coded chromosome information stored in the map_and_tables.py.
    • Version 1.0.1

      • add convert_CASAVA1.8_headers.pl to fix an issue caused by CASAVA 1.8 header when running paired-end libraries.
      • add convert_CASAVA_export.py to convert CASAVA export text file to FASTQ format.
      • add convert_RNASEQRSAM_to_CufflinkSAM.py to support the Cufflinks analysis.