sirFAST is a cache-oblivious mapper designed to map short reads (Complete Genomics reads) to a reference genome. sirFAST maps short reads with respect to a user-defined error threshold and is designed to find 'all' the mappings for a given set of reads. This manual shows how to choose the parameters and tune sirFAST with respect to the library settings.


General

To work with sirFAST, download the latest version from our download page and unzip the downloaded file. Run 'make' to build sirFAST.

Mapping: sirFAST works in two steps: 1) it generates the index of the reference genome(s), and 2) it maps the reads to the reference genome.

Parallelism: The best way to optimize sirFAST is to split the reads into chunks that fit into the memory of the cluster nodes. The number of reads per chunk is approximately ((M-600)/(4*L)) million, where M is the memory size of the cluster node (in MB) and L is the read length. If you have more nodes, you can make the chunks smaller to use the nodes efficiently. For example, if the read length is 50bp and the memory of each node is 2GB (taken as 2000MB), each chunk should contain about (2000-600)/(4*50) = 7 million reads.
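
The rule of thumb above can be sketched as a small shell helper. This is only an illustration: the function names reads_per_chunk and split_reads are hypothetical and not part of sirFAST, and the splitting step assumes the reads file holds one read per line with no header lines.

```shell
# Reads per chunk from the rule of thumb: ((M - 600) / (4 * L)) million.
# $1 = node memory in MB (M), $2 = read length in bp (L).
reads_per_chunk() {
    mem_mb=$1
    read_len=$2
    if [ "$mem_mb" -le 600 ]; then
        echo "memory must exceed 600 MB" >&2
        return 1
    fi
    # Multiply before dividing so integer arithmetic keeps the precision.
    echo $(( (mem_mb - 600) * 1000000 / (4 * read_len) ))
}

# Split a reads file into chunks of that many lines using the standard
# 'split' utility.
# Assumption: one read per line and no header lines in the file.
# $1 = reads file, $2 = memory in MB, $3 = read length, $4 = chunk prefix.
split_reads() {
    n=$(reads_per_chunk "$2" "$3") || return 1
    split -l "$n" "$1" "$4"
}
```

For instance, "split_reads reads.tsv 2000 50 chunk_" would write chunk_aa, chunk_ab, and so on, and each chunk could then be passed to sirFAST with "--seq" on a separate node.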

To see the list of options, use "-h" or "--help".
To see the current version of sirFAST, use "-v" or "--version".

Indexing

sirFAST's indices can be generated in two modes (single, batch). In single mode, sirFAST indexes a fasta file (which may contain one or more reference genomes). By default, sirFAST uses a window size of 10 characters to generate its index. Please be advised that if you do not choose the window size carefully, you will lose sensitivity.

Single Mode:
To index a reference genome like "refgen.fasta" run the following command:

$./sirFAST --index refgen.fasta


Upon the completion of the indexing phase, you can find "refgen.fasta.index" in the same directory as "refgen.fasta".

The indexing done in sirFAST depends on the read format you are using. We use window sizes of 10 and 9 for read formats 5-10-10-10 and 10-9-N-10, respectively.

Mapping

sirFAST can map single-end reads and paired-end reads to a reference genome. sirFAST can map in either single or batch mode. In single mode, it maps to only one index. In batch mode, it maps to a list of indices. sirFAST supports both fasta and fastq formats.

Single-end Reads - Single Mode
To map single-end reads to a reference genome in single mode, run the following command. Use "--seq" to specify the input file. refgen.fa and refgen.fa.index should be in the same folder.

$./sirFAST --search refgen.fa --seq reads.tsv



The reported locations are saved to "output" by default. To save them somewhere else, use "-o" to specify another file. sirFAST can report the unmapped reads in fasta/fastq format.

$./sirFAST --search refgen.fasta --seq reads.tsv -o my.map 


The number of mismatches allowed by sirFAST is 2 by default. You can change this number with "-e". sirFAST has no best-mapping option.

$./sirFAST --search refgen.fasta --seq reads.tsv -e 3


Paired-end Reads

To map paired-end reads, use the "--pe" option. The mapping can be done in single or batch mode. If the reads are in two different files, use "--seq1" and "--seq2" to indicate the files. If the reads are interleaved, use "--seq" to indicate the file. The distance allowed between the paired-end reads should be specified with "--min" and "--max", which give the minimum and maximum of the inferred size (the distance between the outer edges of the mapped mates).

$./sirFAST --search refgen.fasta --pe --seq reads.tsv --min 150 --max 250 


$./sirFAST -b --search index.list --pe --seq1 reads1.tsv --seq2 reads2.tsv --min 50 --max 75
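
To make the inferred size concrete, here is a minimal sketch of the size check implied by "--min" and "--max". It is not sirFAST's code: the function names are hypothetical, the coordinates are assumed to be 1-based with inclusive ends, and only the size is checked (mate orientation is ignored).

```shell
# Inferred size: distance between the outer edges of the two mapped mates.
# Assumption: 1-based leftmost positions and inclusive end positions.
# $1 = start of mate 1, $2 = mapped length of mate 1,
# $3 = start of mate 2, $4 = mapped length of mate 2.
inferred_size() {
    end1=$(( $1 + $2 - 1 ))
    end2=$(( $3 + $4 - 1 ))
    left=$(( $1 < $3 ? $1 : $3 ))
    right=$(( end1 > end2 ? end1 : end2 ))
    echo $(( right - left + 1 ))
}

# Succeeds when the inferred size lies within [min, max].
# $1..$4 as above, $5 = min, $6 = max.
within_window() {
    size=$(inferred_size "$1" "$2" "$3" "$4")
    [ "$size" -ge "$5" ] && [ "$size" -le "$6" ]
}
```

Under these assumptions, two 35bp mates starting at positions 1000 and 1165 span positions 1000 to 1199, an inferred size of 200, which falls inside "--min 150 --max 250" but outside "--min 50 --max 75".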

Discordant Mapping

sirFAST can report discordant mappings for use with Variation Hunter (work in progress). The "--min" and "--max" options define the minimum and maximum inferred size for concordant mapping.

$./sirFAST --search refgen.fasta --pe --discordant-vh --seq reads.tsv --min 50 --max 75


Output Format
sirFAST output is in SAM format. For details about the definition of the fields, please refer to the SAM manual.
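
Because sirFAST reports all mappings, a read may appear on several output lines. The sketch below counts the reported locations per read name using standard SAM conventions (header lines start with "@", column 1 is the read name, column 3 is the reference name, and "*" there means unmapped). The function name count_mappings is hypothetical, and for paired-end output both mates of a pair are counted under the same name.

```shell
# Count reported mapping locations per read name in a SAM file.
# $1 = SAM file (for example the default "output" file).
count_mappings() {
    awk -F '\t' '
        /^@/      { next }        # skip SAM header lines
        $3 != "*" { n[$1]++ }     # count records that have a reference name
        END       { for (r in n) print r "\t" n[r] }
    ' "$1" | sort
}
```

Running "count_mappings output" prints one line per read name followed by a tab and its number of reported locations.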

Citation

Please cite the following accompanying article if you would like to use or build upon the tools:

Donghyuk Lee, Farhad Hormozdiari, Hongyi Xin, Faraz Hach, Onur Mutlu, and Can Alkan,
"Fast and Accurate Mapping of Complete Genomics Reads" Methods, October 22, 2014.