Mapping Times and Accuracies
random10k test set - randomly generated "genome" of just 10k. Reads are one 25-mer at each
position. Percentage numbers in table are the proportion of reads that were mapped by the various
programs using the given settings. All programs completed this short test quickly. This test highlights splat's capability of handling single base insertions and deletions as well as the up to
two substitutions per read that most of the programs handle.
sub ins del splat splat soap soap soap maq eland eland bowtie Mosaik
per per per default maxGap=0 default -g 1 -g 1 default default --multi default -mm 2
read read read -s 6 -hs 12
-----------------------------------------------------------------------------------------------------
0 0 0 100.00% 100.00% 100.00% 100.00% 100.00% 99.86% 100.00% 100.00% 100.00% 100.00%
1 0 0 100.00% 100.00% 100.00% 100.00% 100.00% 99.86% 100.00% 100.00% 100.00% 100.00%
2 0 0 100.00% 100.00% 96.37% 96.37% 97.70% 99.88% 100.00% 100.00% 100.00% 65.80%
0 1 0 100.00% 27.08% 28.04% 91.10% 90.97% 22.69% 22.69% 22.69% 22.69% 100.00%
1 1 0 100.00% 19.85% 17.45% 17.52% 17.47% 15.61% 15.61% 15.61% 15.61% 57.96%
0 0 1 100.00% 31.90% 22.69% 82.20% 82.92% 27.94% 28.04% 28.04% 28.04% 100.00%
1 0 1 100.00% 21.73% 15.61% 19.60% 19.60% 17.37% 17.45% 17.45% 17.45% 81.45%
hardSim22 test set - chromosome 22 with simulated errors mapped back to chromosome 22.
This was generated with the maq simulate command with a relatively high error rate of about
2 errors per read on average. 10% of errors are indels. Splat maps significantly more reads
than the others. The run-times are minutes:seconds.
splat soap soap maq eland eland bowtie Mosaik bfast
default default -g 1 default default --multi=100 default -mm 2 -hs 12 3 14/24 index
reads time mapped time mapped time mapped time mapped time mapped time mapped time mapped time mapped -x 2
-----------------------------------------------------------------------------------------------------------------------------------------------
1 0:10 100.00% 0:09 100.00% 0:08 100.00% 0:55 100.00% 0:22 100.00% 0:25 100.00% 0:02 100.00% 0:19 100.00%
10 0:11 90.00% 0:08 80.00% 0:09 90.00% 0:55 80.00% 0:22 70.00% 0:24 80.00% 0:02 80.00% 0:28 20.00%
100 0:10 84.00% 0:08 71.00% 0:08 72.00% 0:54 74.00% 0:21 40.00% 0:24 70.00% 0:02 71.00% 0:19 34.00%
1000 0:11 82.90% 0:08 71.50% 0:10 72.30% 1:13 73.20% 0:22 46.30% 0:23 70.60% 0:02 71.80% 0:24 29.40%
10000 0:15 81.00% 0:13 69.68% 0:14 70.99% 1:32 72.29% 0:23 43.67% 0:23 68.96% 0:03 70.13% 0:18 29.67%
100000 0:56 81.24% 0:43 70.07% 0:58 71.66% 2:02 72.49% 0:31 44.21% 0:30 68.65% 0:11 70.54% 0:22 30.36%
1000000 8:23 81.29% 5:59 70.07% 8:49 71.69% 6:36 72.53% 1:46 44.14% 1:46 61.83% 1:24 70.54% 0:53 30.40%
tenMegVs22 test set - 10,000,000 chr22 reads with simulated errors averaging one per read,
10% of errors are indels.
splat soap soap maq eland eland bowtie Mosaik
default default -g 1 default default --multi=100 default -mm 2 -hs 12
reads time mapped time mapped time mapped time mapped time mapped time mapped time mapped time mapped
------------------------------------------------------------------------------------------------------------------------------
10000000 114:28 96.72% 43:16 91.58% 49:52 93.27% 62:09 92.78% 13:11 75.28% 13:49 90.24% 8:47 91.91% 6:11 32.39%
14vs22 test set - part of chromosome 14 that is a duplicon of chr22 repeat masked, and
non-masked parts shredded into non-overlapping 25-mers and mapped to chr22 (10252 reads)
splat soap soap maq eland eland bowtie
default default -g 1 default default --multi=100 default
reads time mapped time mapped time mapped time mapped time mapped time mapped time mapped
---------------------------------------------------------------------------------------------------------------
10252 0:11 96.33% 0:09 95.77% 0:09 96.03% 1:23 95.79% 0:18 90.17% 0:21 95.71 0:02 95.80%
bigChrom test set - 1 million simulated reads (same as in first test set) with simulated
error averaging 2 per read, 10% of errors are indels, against bigger and bigger sets of chromosomes.
splat soap soap maq eland eland bowtie
default default -g 1 default default --multi=100 default
size chroms time mapped time mapped time mapped time mapped time mapped time mapped time mapped
--------------------------------------------------------------------------------------------------------------------
35M 22 8:23 81.29% 5:59 70.07% 8:49 71.69% 6:36 72.53% 1:46 44.14% 1:46 61.83% 1:24 70.54%
69M 21,22 14:20 81.80% 11:38 70.04% 14:48 72.07% 12:27 72.94% 2:31 43.75% 2:20 69.60% 1:47 70.74%
142M 18,21,22 26:43 82.54% 20:10 71.01% 26:20 72.57% 21.42 73.50% 4:56 43.17% 4:33 69.72% 2:13 70.60%
273M 2,22 ??:?? 83.42% 34:19 71.64% 53.16 73.15% 34.05 74.14% 6:35 42.17% 7:21 69.95% 2:20 70.09%
3080M all n/a n/a n/a n/a n/a n/a 4:26 69.77%
splat soap soap maq eland eland bowtie
default default -g 1 default default --multi=100 default
size chroms time mapped time mapped time mapped time mapped time mapped time mapped time mapped
--------------------------------------------------------------------------------------------------------------------
35M 22 8:48 81.29% 5:59 70.07% 8:49 71.69% 6:36 72.53% 1:46 44.14% 1:46 61.83% 1:24 70.54%
69M 21,22 13:39 81.80% 11:38 70.04% 14:48 72.07% 12:27 72.94% 2:31 43.75% 2:20 69.60% 1:47 70.74%
142M 18,21,22 23:48 82.54% 20:10 71.01% 26:20 72.57% 21.42 73.50% 4:56 43.17% 4:33 69.72% 2:13 70.60%
273M 2,22 48:11 83.42% 34:19 71.64% 53.16 73.15% 34.05 74.14% 6:35 42.17% 7:21 69.95% 2:20 70.09%
3080M all n/a n/a n/a n/a n/a n/a 4:26 69.77%
Database and Index Times and Sizes
Some of the programs need either a sequence database, or a sequence index to be built.
This table looks at the time and build sizes for these for chromosome 22, some data sets
that are roughly 2x, 4x, and 8x as large as chromosome 22, and for the entire human genome.
Splat spends significantly longer on this index/database building phase than the other
programs (nothing comes for free) but still, this only needs to be done once per genome.
DNA splix i16 sufx itsa soap maq eland bowtie
size time size time size time size time size time size time size time size time size
------------------------------------------------------------------------------------------------------------------
35M 0:59 953M 3:27 4484M 0:59 328M 1:03 664M 0 0 0:02 25M 0:01 24M 1:45 37M
69M 1:39 1820M 4:36 4667M 2:02 649M 1:47 984M 0 0 0:04 48M 0:02 47M 3:36 66M
143M 3:30 3688M 7:10 5042M 3:24 1322M 4:42 1658M 0 0 0:07 86M 0:03 83M 8:21 127M
272M 7:16 6901M 9:34 5678M 6:04 2473M 7:04 2809M 0 0 0:12 146M 0:05 140M 18:30 233M
2858M 80:00* 72000M* ?:?? ?????M 183:00 25945M 202:05 26280M 0 0 2:05 1554M 0:40 1470M
* Estimated - need machine with more RAM than we have (about the same amount of ram as the file size)
to make or use the index. Currently this restricts splat to be run one chromosome at a time on most
machines.
Notes on Traversable Suffix Arrays
I'm experimenting around with a traversable suffix array (sufx) as an alternative to the "splix" index
that splat currently users. The advantage is that the sufx index is less than half the size, taking up
9x as many byte as there are bases in the genove vs. 24x as many bytes for the splix index. If it can
be made to be competitive in terms of speed therefore it would be very helpful, allowing us to do genome-wide
runs on machines with 32 gig of RAM, that these days are relatively cheap (we have a cluster of 8 of them already).
10000 reads vs. chromosome 22 just taking the alignment time in milliseconds, not the index load time.
This seems be enough reads to get accurate timing information for the local suffix array vs. splat timings.
Note at the moment splat is searching both strands while the other programs are just searching one strand.
Since both strands really do need to be searched, the times on the non-splat programs are doubled in the
table below. Tmes are a median of three runs in milliseconds.
chr22 18,21,22 2,22 hg18
program mismatch time time time time
------------------------------------------------
i16Find 0 70 106 318
itsaFind 0 68 79 102 573
sufxFind 0 115 140 171 632*
sufxFind 1 1037 1588 1883 6336*
sufxFind 2 8480 12892 16846 46545*
splat 0 397 998 2108 21080**
splat 1 600 1164 1956 19560**
splat 2 2878 6270 11245 112450**
splat 0+gap 452 1132
splat 1+gap 628
splat 2+gap 3809
* Based just on a single run, not median of 3.
**Estimates - based on having to run it in 10 parts.
The exact matches are a little too fast to measure accurately, so going to 100,000 reads.
chr22 21,22 18,21,22 2,22 hg18
program time time time time time
------------------------------------------------
i16Find 886
itsaFind 689
sufxFind 933
splat 3878 5689