RNASEQR Manual (Last revision: 2012.02.07)
- What is RNASEQR?
- Installation and system requirements
- Run test dataset
- Build your own indices for RNASEQR
- Execute RNASEQR
- RNA-seq sequence library
- RNASEQR output
- Change logs
What is RNASEQR? [back to top]
RNASEQR is a nucleotide sequence mapper/aligner and is designed specifically for RNA-seq data analysis. It takes advantage of annotated transcripts and genomic reference sequences to obtain high quality mapping/alignment results.
RNASEQR is very fast and highly accurate. It significantly improves the mapping results, especially on transcripts containing smaller exons, which results in a more accurate assessment of gene-expression profiles and better transcript structures. The RNASEQR pipeline also significantly reduces false identification of single nucleotide variants (SNVs) near the splice junctions.
RNASEQR output is a standard SAM file and can be converted BAM files using SAMTools. Such standard SAM is compatible with many downstream analysis tool except a few, e.g.
Cufflinks. To get the Cufflinks compatiable SAM ouput, please read Output. RNASEQR and its open source codes are available at http://hood.systemsbiology.net/rnaseqr.php.
Warning: Please note that the files, such as sequence files, gene annotation files etc, obtained from different databases are likely to have headers and/or identifiers in different formats which may cause problems in running RNASEQR.
Installation and system requirements [back to top]
RNASEQR was written in Python 2.7, and shall run on most platforms as long as a proper Python interpreter (v2.7+) is installed. Please make sure the following programs are also properly installed before running RNASEQR.
- Bowtie v.0.12.7
- BLAT v.34
- Python 2.7
To install Bowtie, BLAT, and Python, please go to the developers’ websites.
To install RNASEQR, download the RNASEQR source codes and decompress it to the directory of your choice.
# tar zxvf RNASEQR_v1.0.2.tgz
Run test dataset [back to top]
Before analyzing your own RNA-seq data using RNASEQR, please run a test dataset to make sure everything is properly setup. Please download the prebuilt indices and transcriptomic annotation, as well as the test dataset and configuration files (for both single-end and paired-end reading).
BLATindex for human genome
Bowtieindex for human genome
Bowtieindex for human transcriptome (UCSC Known Gene hg19)
UCSC Known Gene annotation (hg19, BED format) Test RNA-seq dataset and configuration files
To decompress these files, please enter
# tar zxvf [file name].tgz
The sample dataset is a small portion of the RNA-seq data in database GEO with accession number GSE19166. The two FASTA files are paired-end; each contains 10,000 RNA-seq reads. The pre-computed result is also available here.GSE33328. Please read the RNASEQR manuscript for more information.
Build your own indices for RNASEQR [back to top]
RNASEQR requires three index files: two for
Bowtie that record both the transcriptomic and the genomic reference sequences; and one for
BLAT that indexes genomic reference sequences. To build these indices, you need the FASTA-format sequence files for both genome and transcriptome.
genome [back to top]
To build the genomic indices for
BLAT, you will need to download the genomic reference file. Go to the ENSEMBL FTP site and click the link in the DNA column. Download the FASTA files for each chromosome, e.g. Homo_sapiens.GRCh37.65.dna.chromosome.(1-22, X, Y, and MT).fa.gz for human. Once all the chromosome are downloaded, decompress these files and concatenate these files to a single file. For example, human.
# gzip -d Homo_sapiens.GRCh37.65.dna.chromosome.*.fa.gz # cat Homo_sapiens.GRCh37.65.dna.chromosome.*.fa > Homo_sapiens.GRCh37.65.genome.fasta
After appending all chromosomes to a single file, run
create_map_and_tables.pl to create the map_and_tables.py script which contains the chromosome information of the species in your dataset. The original
map_and_tables.py provided in the RNASEQR package is meant for human. Warning: Inaccurate chromosome information written in the
map_and_tables.py will screw up the mapping result completely.
Once having the correct map_and_tables.py, build the genomic index for
# perl create_map_and_tables.pl Homo_sapiens.GRCh37.65.genome.fasta
Then build the index for
# /directory_path/bowtie-0.12.7/bowtie-build Homo_sapiens.GRCh37.65.genome.fasta Homo_sapiens.GRCh37.65.genome
You shall now have all the genomic indices listed below for both
# /directory_path/blat_v34/faToTwoBit Homo_sapiens.GRCh37.65.genome.fasta Homo_sapiens.GRCh37.65.genome.2bit
Homo_sapiens.GRCh37.65.genome.2bit Homo_sapiens.GRCh37.65.genome.1.ebwt Homo_sapiens.GRCh37.65.genome.2.ebwt Homo_sapiens.GRCh37.65.genome.3.ebwt Homo_sapiens.GRCh37.65.genome.4.ebwt Homo_sapiens.GRCh37.65.genome.rev.1.ebwt Homo_sapiens.GRCh37.65.genome.rev.2.ebwt
transcriptome [back to top]
RNASEQR supports the UCSC GENE and ENSEMBL GENE transcriptome annotations. Both the UCSC and the ENSEMBL annotations record more gene transcripts than other projects, such as the NCBI RefSeq and the Consensus CDS (CCDS) project, and also give a better result in RNASEQR. Please read RNASEQR manuscript for more information. Please follow the steps to download the FASTA transcriptomic references and gene annotation either from the ENSEMBL or the UCSC webpages.
In ENSEMBL, go to the ENSEMBL FTP site and download the FASTA-format cDNA sequences (in the cDNA column). Please download the GTF gene annotation file which is also required by RNASEQR. Decompress the cDNA sequence file, and run
filter_alternative_loci_ENS.pl to remove the transcripts annotated to the alternative genomic loci.
# gzip -d Homo_sapiens.GRCh37.65.cdna.all.fa.gz # perl filter_alternative_loci_ENS.pl Homo_sapiens.GRCh37.63.genome.fasta Homo_sapiens.GRCh37.65.cdna.all.fa Homo_sapiens.GRCh37.65.cdna.cleaned.fa
You can also get both the transcriptomic sequence files and annotation files at the UCSC Table Browser. Select the 'Genes and Gene Prediction Tracks' group. In the track option, select 'Ensembl Genes' for the Ensembl gene annotation or 'UCSC Genes' for the UCSC gene annotation. Select 'sequence' or 'BED' to get the FASTA sequence files or the BED gene annotation files. Warning: the GTF format is not supported by current RNASEQR. Once the download is completed, run
filter_alternative_loci_UCSC_BED.pl to remove the transcripts annotated to the alternative genomic loci.
For examples:Usage: perl filter_alternative_loci_UCSC_BED.pl Genomic_fasta_file UCSC_bed_file Transcriptomic_fasta_file Output_fasta_file_name
# perl filter_alternative_loci_UCSC_BED.pl Homo_sapiens.GRCh37.65.genome.fasta Homo_sapiens.UCSC.hg19.bed Homo_sapiens.UCSC.hg19.fa Homo_sapiens.UCSC.hg19_cleaned.fa
Once the transcripts annotated in the alternative genomic loci are removed using
filter_alternative_loci_UCSC_BED.pl, depending on from which website you obtained the sequence files and annotation. Use
Bowtie to create the transcriptomic index. For exmaple, human.
# /directory_path/bowtie-0.12.7/bowtie-build Homo_sapiens.GRCh37.65.cdna.cleaned.fa Homo_sapiens.GRCh37.65.transcriptome
Once completed, you shall have the transcriptomic index files listed below for
Homo_sapiens.GRCh37.65.transcriptome.1.ebwt Homo_sapiens.GRCh37.65.transcriptome.2.ebwt Homo_sapiens.GRCh37.65.transcriptome.3.ebwt Homo_sapiens.GRCh37.65.transcriptome.4.ebwt Homo_sapiens.GRCh37.65.transcriptome.rev.1.ebwt Homo_sapiens.GRCh37.65.transcriptome.rev.2.ebwt
color-spaced index [back to top]
You will need a color space specific index for
Bowtie. To build the color space indices for human genome and transcriptome. Use
-C to instruct
Bowtie for color-spaced index. For example:
# /directory_path/bowtie-0.12.7/bowtie-build -C Homo_sapiens.GRCh37.63.genome.fasta Homo_sapiens.GRCh37.63.genome # /directory_path/bowtie-0.12.7/bowtie-build -C -o 0 Homo_sapiens.GRCh37.63.cdna.all.fa Homo_sapiens.GRCh37.63.transcriptome
Please note that the current version of RNASEQR does not support color-spaced sequence mapping/alignment using BLAT. This will affect the detection of novel splicing junctions in your RNA-seq dataset.
Instead of creating a color-space index for
Bowtie, you can convert your csFASTA and sequence quality files to FASTQ using some publicly available tools and run RNASEQR.
Execute RNASEQR [back to top]
RNASEQRp.py is the main script, and can be executed either solely in command line or with a configuration file specifying arguments, or a combination of both. Warning: Please note that the command line augments will override the same arguments specified in the configuration file.
For example, specify all arguments solely in the command line.
To list all available arguments, use the option
# python /directory_path/RNASEQRp.py -a /PATH/bowtie-0.12.7/bowtie -m 3 -b 1 -p 2 -s Illumina1.5+ -t BED -i /PATH/seq_indexes/Homo_sapiens.GRCh37.59.UCSC.transcriptome -f /PATH/seq_indexes/UCSC_knownGene_hg19.bed -g /PATH/seq_indexes/Homo_sapiens.GRCh37.59.genome -r 75 -q /PATH/example.fastq
# python RNASEQRp.py -h usage: RNASEQRp.py [-h] [-a PATH] [-m Integer] [-b Integer] [-p Integer] [-s TYPE] [-l PATH] [-n PATH] [-t TYPE] [-i PATH] [-q PATH] [-f PATH] [-g PATH] [-r Integer] [-e Integer] [-x Integer] [-C Integer] [-G String] [-o Integer] [--version] [--debug Integer] [setting.config] positional arguments: setting.config configuration file. Please note that the command-line arguments will override the parameters specified in this file. optional arguments: -h, --help show this help message and exit -a PATH aligner_path -m Integer align_mismatch -b Integer segment align_mismatch -p Integer number_of_thread -s TYPE phred_type: "Sanger","Solexa","Illumina1.3","Illumina1.5+" -l PATH local_aligner_path -n PATH genome_index_prefix_for_local_aligner -t TYPE annotation_type -i PATH annotation_index_prefix -q PATH Input FASTQ file. If the input is paired-end data, use comma to separate them in order. -f PATH annotation_file -g PATH genome_index_prefix -r Integer The length of short read -e Integer The number of hit anchors to be valid -x Integer The length of anchors -C Integer If use colorspace, set the value of this option as 1. -G String [Paired-end Only] The insertion length of paired-end reads. The format is Integer-Integer, e.g., 200-400. -o Integer Maximum number of overlaping exons in annotation --version show program's version number and exit --debug Integer Set value 1 to open the debug mode
Arguments and descriptions.
-a Directory where
-m Number of mismatches allowed in an RNA-seq read. -b Number of mismatches allowed in a segmented RNA-seq read. The default length of a segmented read is 25bp. -p Number of CPU threads assigned to
RNASEQR. Warning: Please make sure the maximal CPU threads available in your system.
-s Phred format mark of the sequencing qaulity in the input FASTQ RNA-seq file. Please specify one of the followings: "Sanger","Solexa","Illumina1.3", and "Illumina1.5+". -l Directory where
-n File location of the genomic indices for
-q File location of the RNA-seq sequence file(s). -f File location of the transcriptomic annotation. -t File format of the transcriptome annotation.
-i File location of the transcriptomic index (for
-g File location of the genomic indices (for
-r RNA-seq sequence read length -C Write
1when the input RNA-seq data is colorpace format.
-G Insert size in the paired-end reading. -o Maximum number of overlapped transcripts which constitute the same exon in full or in partitial.
configuration file [back to top]
Configuration file records specified arguments in a single file. Please specify each parameter in the form of
VARIABLE_NAME=VALUE, one per line.
Warning: Please note that the parameters specified in the configuration file will be overrode by the same parameters, if exists, in the command line.
Example configuration file RNASEQR.cfg.
input_fastq=/directory_path/data_1.fq,/directory_path/data_2.fq paired_end_gap=200-400 short_read_length=50 phred_type=Illumina1.5+ align_mismatch=2 seg_align_mismatch=4 aligner_path=/directory_path/bowtie-0.12.7/bowtie local_aligner_path=/directory_path/blat_v34/blat number_of_thread=2 annotation_type=BED annotation_file=/directory_path/seq_indexes/Homo_sapiens.UCSC_knownGene_hg19.bed annotation_index_prefix=/directory_path/Homo_sapiens.GRCh37.59.UCSC.transcriptome.o0 genome_index_prefix=/directory_path/Homo_sapiens.GRCh37.59.genome.fa genome_index_prefix_for_local_aligner=/directory_path/Homo_sapiens.GRCh37.59.genome.fa.2bit
Command line arguments and their
Option Varible name used in configuration file -a aligner_path -m align_mismatch -b seg_align_mismatch -p number_of_thread -s phred_type -i annotation_index_prefix -f annotation_file -g genome_index_prefix -r short_read_length -l local_aligner_path -n genome_index_prefix_for_local_aligner -t annotation_type -C colorspace -G paired_end_gap
RNA-seq sequence library [back to top]
RNASEQR reads the FASTQ format RNA-seq library single-ended by default unless otherwise specified. Please not that different platforms and output from different versions of software may apply different Phred quality score encodings. Specify the quality scoring with the option
-s in the command line or
phred_type in the configuration file.
paired-end RNA-seq library [back to top]
The paired-end RNA-seq data is usually written in two FASTQ files: one for each end (mate). To analyze a paired-end library, the two FASTQ files should be separated by comma after the option
-q. For example,
# python RNASEQRp.py -q end_1.fastq,end_2.fastq RNASEQR.cfg
The mapping/alignment result of each end will be recorded in separate lines. The result of the upstream end is written ahead of that of the downstream end.
Warning: RNASEQR was initially developed under the older Illumina CASAVA versions. Since CASAVA version 1.8, Illumina has changed the denotation of the header line for paired-end reads which causes a problem to RNASEQR. Please run
convert_CASAVA1.8_headers.pl to fix this issue.
Meanwhile, if you have the Illumina export text files rather than FASTQ files, you will need to convert to the FASTQ format. Please run
# perl convert_CASAVA1.8_headers.pl paired_end_file_1 paired_end_file_2 output_paired_end_file_1 output_paired_end_file_2
convert_CASAVA_export.pyto convert the CASAVA export text files to FASTQ.
Color space sequence file [back to top]
RNASEQR processes color space FASTQ files when the option
-C is specified. In the meantime, the index files need to be specifically built in color space format. The csfasta files can also be converted to fastq files using the publicly available third-party programs.
RNASEQR output [back to top]
Only uniquely mapped RNA-seq reads are reported in the current version of
RNASEQR. The mapping/alignment output is written in SAM format and has 12 columns, delimited by tab. Each line records a sequence and its mapping/alignment result. For more detail about the SAM file format, please go to SAM format specification page.
Sum of all applicable flags. Flags relevant to
The read is one of a pair
The alignment is one end of a proper paired-end alignment
The read has no reported alignments
The read is one of a pair and has no reported alignments
The alignment is to the reverse reference strand
Name of reference sequence where alignment occurs, or ordinal ID if no name was provided.
1-based offset into the forward reference strand where leftmost character of the alignment occurs.
CIGAR string representation of alignment.
Name of reference sequence where mate's alignment occurs. Set to
=if the mate's reference sequence is the same as this alignment's, or
*if there is no mate.
1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate.
Inferred insert size. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if there is no mate.
Read sequence (reverse-complemented if aligned to the reverse strand).
ASCII-encoded read qualities (reverse-complemented if the read aligned to the reverse strand). The encoded quality values are on the Phred quality scale and the encoding is ASCII-offset by 33, similarly to a FASTQ file.
compatibility to Cufflinks [back to top]
# Python /directory_path/convert_RNASEQRSAM_to_CufflinkSAM.py -s (your RNASEQR SAM file) (genome sequence file (the concatenated FASTA file)) -o (file name of the Cufflinks compatible SAM file)
Change logs [back to top]
- Version 1.0.2
- add support to the gene annotation and gene reference sequences from UCSC Table Browser download.
filter_alternative_loci_UCSC_BED.plscripts to fix an issue due to the transcriptomic index when running paired-end library.
create_map_and_tables.plscript to fix hard-coded chromosome information stored in the
- Version 1.0.1
convert_CASAVA1.8_headers.plto fix an issue caused by CASAVA 1.8 header when running paired-end libraries.
convert_CASAVA_export.pyto convert CASAVA export text file to FASTQ format.
convert_RNASEQRSAM_to_CufflinkSAM.pyto support the Cufflinks analysis.