BAM files

Below are guidelines to consider when creating a BAM file(s) for a miRspring document:

General workflow to create a BAM file

The basic workflow to create a mapped BAM (Binary alignment/Map) file is shown below. The links to the left contain more detailed information for specific steps.

Firstly lets consider a miRNA sequence tag, where the insert (i.e. the miRNA) is 22nt. This means that the sequence in the output file will be a chimeric of both the miRNA sequence plus the 3' adapter (highlighted in A of figure). Therefore we cannot map the map the entire output sequence against the reference as the adapter will not map.

There are at least two ways of removing adaptor, these include:
1) identify and trim off the adaptor.
2) Align a short fragment (eg 18nt) of insert against reference and then extend the alignment as far as possible.
Both methods benefit from knowing the adaptor sequence (highlighted in B of figure)

After removing the adaptor the output modified sequence file will only contain insert sequences (highlighted in C of figure). These insert sequences are mapped to the reference using a program called an aligner (highlighted in D of figure). The final output (highlighted in E of figure) is a sequence alignment map file (SAM) which can be converted into a binary equivalent (i.e. BAM file).

BAM workflow

Trimming adaptor sequencers from sequencing data


The 3' adaptor sequence is very likely to be present in the sequencing data files, and if not removed will cause significant alignment problems to the reference. Therefore correctly identifying the adaptor and removing it will significantly enhance the mappability of the data set. There are a large number of software available to do this task, the bioscholar web site has collated a thorough list of such tools. Below is an example on how to use the cutadapt software.

cutadapt -a ATCTCGTATGCCGTCTTC -m 18 InputFile.fastq > OutputFile.fastq

The above example will remove the 3' illumina adaptor. The "-m 18" option will discards all sequences having a length less than 18nt (after adapter removal).

Using an aligner to create a BAM file

There are a number of software aligner tools available to make a BAM file. If you have access to a high performance computer setup you will most likely utilise the aligner via the command line. Below are two popular aligners used to map small RNA libraries and examples of the command line syntax. Please note there are various parameters not listed below that may be more suitable for your project, therefore consider this a quick start guide to get you up and running. Please refer to the linked home pages for each aligner for detailed instructions on their use.

BOWTIE
Before using BOWTIE aligner you need to build a reference index using the "bowtie-build" program.
The syntax is as follows:
bowtie-build reference.fa ref_index

where:  reference.fa is your reference sequence in fasta format.
        ref_index is the output name for the index file. 

Note that a number of output files will be generated and the name you set in this step will be needed for 
the subsequence mapping steps.

Below is an example of a bowtie command that takes a fastq data set ("seqfile.fastq") and aligns it against a reference index ("ref_index") and outputs the result in sam format. This example then pipes ("|") the sam output into samtools which converts it into bam format into the file "out.bam".
bowtie -p 4 -n 1 -l 21 --nomaqround --best --chunkmbs 256 ref_index seqfile.fastq -S | samtools view -bS -o out.bam -
where: -p number of processes -n number of mismatches allowed in the first number of bases defined by -l. So the above example is allowing 1 mismatch in the first 21 nucleotides. --best list the best alignment and not the first valid alignment --chunkmbs is the amount of memory set aside to help in assigning best location. --nomaqround No rounding of the Phred score to the nearest 10

Note that for multimapping miRNAs this results in a random assignment of mapped locations. If you want to assign multimappers to all locations then add "-k 100" to the above command line options. This will assign up to 100 valid alignments. Note that doing this will result in multimappers being present in the final miRspring document and will distort the seed analysis tools.

Note the samtools command receives the SAM output from the bowtie command and converts into BAM format.

Note if you are mapping colour space data must add the -C flag to specify this.

Don't forget to sort and index the bam file (refer link on side menu) before attempting to generate a miRspring document

BWA
Before using this aligner you need to create an index for your reference
bwa index reference.fa
bwa aln reference.fa short_read.fq > aln_sa.sai (map the data to the reference)
bwa samse reference.fa aln_sa.sai short_read.fq > aln-se.sam (create a sam file)
(create a bam file from sam file)
(Index and sort the bam file. Now file is ready to be transformed into a miRspring file)


Alternatively you can use the above mentioned aligners via the GALAXY web based analysis tool box.

Want to learn more about aligners? Below is a list of useful links:
List of publications on NGS aligners.
Biowulf list of programs,packages and databases.
Wikipedia listing of aligners
Recent article by Thomas Tuschl group published in Methods (October 2012) on the bioinformatics of small RNAs

What reference to map against to create a BAM file


There are basically two reference choices for miRNA mapping, whole genome or miRbase. There are cetain benefits of using one or the other. Of these two references miRBase is a considerably smaller and requires less computing resources to align sequence files. Additionally there is considerably less multimapping. The downside is that it is possible for a sequence tag to map to a miRbase entry even if it aligns better to another genome coordinate. Another downside is that miRbase precursor entries contain the predicted stem loop sequence. It is possible that the stem loop structure extends further than what is incorporated in miRBase and processed miRNA from these regions (eg MOR miRNAs) will not be reported in the BAM file.

Conversly using the genome will require more computing resources and possibly resulting in false negative mapping events. For example A miRNA sequence tag containing an error/editing event may map to a different part of the genome. It is also possible that a multi-mapping sequence may not be reported (especially if aligner settings not set correctly). However the benefits is that you can potentially discover new miRNAs that are not annotated in miRbase.

Implementation note: miRspring relies on the BAM header to define what is in the reference. Therefore if you decide to align to miRBase make sure the reference file is in fasta format and that each sequence header has a precursor naming convention (i.e. species code - miR - number). The sequence header names in the fasta file must match the names of the miRBase files used by miRspring.

Sorting / indexing a BAM file

The BAM files used to generate a miRspring document must be indexed (i.e. must have corresponding bai file), and therefore sorted. This is to allow miRspring-document generating scripts to interrogate specific genomic regions that correspond to miRNAs or other small RNAs of interest. Below are samtools command line examples to sort and index BAM files:

Create a sorted bamfile:
samtools sort inputfile.bam outputfile
					
Note: the output file will append a ".bam" to the end of the given filename.

TCreate an index for a bam file. You need to state the name of the index file, and recommend that you use the exact same input name but add a ".bai" at the end.
samtools index inputfile.bam inputfile.bam.bai

Mismatches, length and multimappers

Aligning short sequences to a reference is a bioinformatic challenge. While the quality of sequencing is continuously improving there is still a good chance that a significant number of sequence reads will contain an sequencing error. There is always a delicate balance in trying to recover more information without incorporating to much noise. Additionally modifying the accepted length and mismatches will increase the number of multimappers and therefore accuracy and computing resources.

MISMATCHES: If you are interested in identifying RNA editing events or non-templated additions you will need to accept mismatches in your data set. MiRspring is designed to display mismatches but will not discriminate between sequencing errors and RNA editing/modification events. Note that the miRspring pipeline identifies mismatches through the NM tag in the BAM file. The exact mismatch position is then identified by comparing the sequence against the reference. Therefore to correctly identify the mismatch position the sequence tag of a BAM file must have a non-masked tag-sequence field. (i.e. no masking of sequences with 'N'). Please refer to the FAQ if you are having troubles with this feature.

LENGTH: The miRspring document is primarily designed for miRNA sequences. The biggest challenge in aligning these short sequences is the ability to uniquely map a miRNA. Sequences that align to multiple locations are problematic in identifying the transcriptional genomic point of origin. Sequences that map/align more than one time are referred to as a multimappers and consume more computer resources. Supplementary Figure 1 of Cloonan et al paper highlights the mulitmapping issue of short sequences. Many popular aligners provide options to limit the computing resources devoted to multimappers. For example you can set parameters to not consider sequences that map more than 3 times (and these would then be considered unmapped).

Given that miRNA sequences are short it is worth considering a number of entries will result in multimapping. While the miRspring document can display sequences of any length up to 35nt you should be aware of the problematic nature of correctly mapping short sequences. We recommended not to mapped sequence tags shorter than 18nt. Remember the average length of a mature miRNA is 22nt.

Multimappers: A number of entries in miRbase are known to occur at multiple loci. It is wise to be familiar with the miRbase naming convention as this will indicate which miRNAs are likely to multimap. As a rough guide and miRNA with three "-" in the precursor name is a candidate eg hsa-miR-133a-1.

Other interesting examples also exist:
hsa-mir-625

Publically available small RNA BAM files


You do not need your own data sets to use the miRspring pipeline as there are a large resource of BAM files provided by the ENCODE project. Please note that while you can access very recent data sets there a strict time limit imposed on using recent data sets for publications (9 months). Also please be aware of the ENCODE data policy.

COLOR SPACE


Below are some commands to build and then align color space data:
bowtie-build -C reference.fa reference_index

where:   -C              indicates to make the reference in color space



bowtie -C -f --integer-quals -l 21 -n 1 --nomaqround --chunkmbs 256 --best --col-cqual --col-keepends 
--sam --quals input.qual referemce_index input.csfasta output.sam
					
where:  -C               indicates color space data
        -f 				 indicates files in fasta format
		-l               seed length (same as --seedlen)
		-n               Number of mismatches in seed (same as --seedmms) 
		--best           will list the best alignment and not the first valid alignment
		--nomaqround     No rounding of the Phred score to the nearest 10.
        --interger-quals indicate quality values in integer format
		--nomaqround     no rounding of the quality values
		--chunkmbs 		 is the amount of memory set aside to help in assigning best location.
		--col-cqual      quality values are in colorspace
		--col-keepends   do not trim 1nt from 5' and 3' ends of sequence
		--sam            produce sam output
		--quals          the input quality file


© Victor Chang Cardiac Research Institute 2012