Mapping Times and Accuracies


random10k test set - randomly generated "genome" of just 10k.  Reads are one 25-mer at 
each position.  Percentage numbers in table are the proportion of reads that were mapped by 
the various programs using the given settings.  All programs completed this short test quickly. 
This test highlights splat's capability of handling single base insertions and deletions as well 
as the up to two substitutions per read that most of the programs handle.

sub  ins  del   splat    splat    soap     soap     soap     maq      eland    eland  bowtie   Mosaik    bfast
per  per  per   default  maxGap=0 default  -g 1     -g 1     default  default --multi default   -mm 2    4 14/24 index
read read read                                      -s 6                                        -hs 12   -x 2
-------------------------------------------------------------------------------------------------------------------
0    0    0     100.00%  100.00%  100.00%  100.00%  100.00%  99.86%  100.00% 100.00%  100.00%  100.00%   100.00%
1    0    0     100.00%  100.00%  100.00%  100.00%  100.00%  99.86%  100.00% 100.00%  100.00%  100.00%	 100.00%
2    0    0     100.00%  100.00%   96.37%   96.37%   97.70%  99.88%  100.00% 100.00%  100.00%   65.80%    98.99%
0    1    0     100.00%   27.08%   28.04%   91.10%   90.97%  22.69%   22.69%  22.69%   22.69%  100.00%    90.79%
1    1    0     100.00%   19.85%   17.45%   17.52%   17.47%  15.61%   15.61%  15.61%   15.61%	57.96%    70.77%
0    0    1     100.00%   31.90%   22.69%   82.20%   82.92%  27.94%   28.04%  28.04%   28.04%  100.00%    94.60%
1    0    1     100.00%   21.73%   15.61%   19.60%   19.60%  17.37%   17.45%  17.45%   17.45%   81.45%    81.62%

hardSim22 test set - chromosome 22 with simulated errors mapped back to chromosome 22.
This was generated with the maq simulate command with a relatively high error rate of about 
2 errors per read on average. 10% of errors are indels.  Splat maps significantly more reads
than the others.  The run-times are minutes:seconds. The (differ) figures if present show the
proportion of reads that map someplace _other_ than where the original unmutated version in 
the simulation came from.  

               splat          soap          soap           maq           eland          eland          bowtie       Mosaik          bfast 
	     default         default         -g 1          default        default     --multi=100      default     -mm 2 -hs 12   4 14/24 index
reads      time mapped    time mapped    time mapped    time mapped    time mapped    time mapped    time mapped   time mapped     time mapped 
               (differ)                      (differ)
-----------------------------------------------------------------------------------------------------------------------------------------------
1          0:10 100.00%   0:09 100.00%   0:08 100.00%   0:55 100.00%   0:22 100.00%   0:25 100.00%   0:02 100.00%   0:19 100.00%   2:06 100.00% 
10         0:11  90.00%   0:08  80.00%   0:09  90.00%   0:55  80.00%   0:22  70.00%   0:24  80.00%   0:02  80.00%   0:28  20.00%   2:05  90.00%
100        0:10  84.00%   0:08  71.00%   0:08  72.00%   0:54  74.00%   0:21  40.00%   0:24  70.00%   0:02  71.00%   0:19  34.00%  10:20  91.00%
1000       0:11  82.90%   0:08  71.50%   0:10  72.30%   1:13  73.20%   0:22  46.30%   0:23  70.60%   0:02  71.80%   0:24  29.40%  program crash
10000      0:15  81.00%   0:13  69.68%   0:14  70.99%   1:32  72.29%   0:23  43.67%   0:23  68.96%   0:03  70.13%   0:18  29.67%
100000     0:56  81.24%   0:43  70.07%   0:58  71.66%   2:02  72.49%   0:31  44.21%   0:30  68.65%   0:11  70.54%   0:22  30.36%
1000000    8:23  81.29%   5:59  70.07%   8:49  71.69%   6:36  72.53%   1:46  44.14%   1:46  61.83%   1:24  70.54%   0:53  30.40%


With the simulation we know where the reads came from before they were mutated.  The following 
table shows the proportion of reads that are mapped, but not to the place they came from.  This 
can be due to several things.  For splat the predominant factor seems to be that after mutation 
there is another spot the actually matches more bases.  For maq the main reason is that it only 
picks one place when a read maps multiple times, and this one place is only coincedentally the 
place the read came from.  For bowtie the story is similar to maq, but there may be, according 
to the documentation on bowtie, cases where the second-best mapping is chosen over the best.  For 
soap I suspect that it is also predominantly a case of it only taking one of many possible 
mappings in the multiple-mapping case.  It looks like the -w option might change this.


               splat          soap          soap           maq           eland          eland          bowtie       Mosaik          bfast 
reads         default         default         -g 1          default        default     --multi=100      default     -mm 2 -hs 12   4 14/24 index
-----------------------------------------------------------------------------------------------------------------------------------------------
1                (0.00%)                       (0.00%)        (0.00%)        (0.00%)        (0.00%)        (0.00%)                       (0.00%)
10              (10.00%)                      (10.00%)       (10.00%)        (0.00%)        (0.00%)        (0.00%)                       (0.00%)
100             (11.00%)                      (30.00%)       (27.00%)        (2.00%)       (13.00%)       (29.00%)                      (12.00%)
1000             (9.40%)                      (22.90%)       (23.20%)        (3.00%)        (7.50%)       (24.30%)
10000            (7.71%)                      (23.22%)       (24.17%)        (2.53%)        (6.64%)       (23.67%)
100000           (7.45%)       (22.80%)                      (23.54%)        (2.58%)        (6.14%)       (23.35%)
1000000          (7.41%)       (22.87%)                      (24.56%)        (2.55%)        (6.05%)       (23.30%)


tenMegVs22 test set - 10,000,000 chr22 reads with simulated errors averaging one per read, 
10% of errors are indels.
            
               splat          soap          soap           maq           eland          eland          bowtie      Mosaik
	      default       default         -g 1          default        default      --multi=100      default    -mm 2 -hs 12
reads      time mapped    time mapped    time mapped    time mapped    time mapped    time mapped    time mapped  time mapped
------------------------------------------------------------------------------------------------------------------------------
10000000 114:28  96.72%  43:16  91.58%  49:52  93.27%  62:09  92.78%  13:11  75.28%  13:49  90.24%   8:47 91.91%  6:11  32.39%


14vs22 test set - part of chromosome 14 that is a duplicon of chr22 repeat masked, and
non-masked parts shredded into non-overlapping 25-mers and mapped to chr22 (10252 reads)

               splat          soap          soap           maq           eland          eland          bowtie
	      default       default         -g 1          default        default      --multi=100      default
reads      time mapped    time mapped    time mapped    time mapped    time mapped    time mapped    time mapped
---------------------------------------------------------------------------------------------------------------
10252      0:11  96.33%   0:09  95.77%   0:09  96.03%   1:23  95.79%   0:18  90.17%   0:21  95.71    0:02 95.80%

bigChrom test set - 1 million simulated reads (same as in first test set) with simulated
error averaging 2 per read, 10% of errors are indels, against bigger and bigger sets of chromosomes.

                    splat          soap          soap           maq           eland          eland          bowtie
 	           default       default         -g 1          default        default      --multi=100      default
size    chroms   time mapped    time mapped    time mapped    time mapped    time mapped    time mapped   time mapped
--------------------------------------------------------------------------------------------------------------------
  35M       22   7:13 81.29%    5:59  70.07%    8:49 71.69%   6:36  72.53%   1:46  44.14%   1:46  61.83%   1:24  70.54%
  69M    21,22   9:23 81.80%   11:38  70.04%   14:48 72.07%  12:27  72.94%   2:31  43.75%   2:20  69.60%   1:47  70.74%
 142M 18,21,22  17:25 82.54%   20:10  71.01%   26:20 72.57%  21.42  73.50%   4:56  43.17%   4:33  69.72%   2:13  70.60%
 273M     2,22  31:54 83.42%   34:19  71.64%   53.16 73.15%  34.05  74.14%   6:35  42.17%   7:21  69.95%   2:20  70.09%
3080M      all      n/a            n/a           n/a            n/a              n/a            n/a        4:26  69.77%

Database and Index Times and Sizes

Some of the programs need either a sequence database, or a sequence index to be built. This table looks at the time and build sizes for these for chromosome 22, some data sets that are roughly 2x, 4x, and 8x as large as chromosome 22, and for the entire human genome. Splat spends significantly longer on this index/database building phase than the other programs (nothing comes for free) but still, this only needs to be done once per genome.

DNA      splat             i16           sufx         itsa         soap          maq         eland        bowtie      bfast
size   time  size      time   size    time  size   time  size   time  size   time  size   time  size   time  size   time  size
------------------------------------------------------------------------------------------------------------------------------
  35M  0:59    953M    3:27  4484M   0:59   328M    1:03   664M     0   0     0:02   25M   0:01   24M   1:45   37M 118:50  9312M 
  69M  1:39   1820M    4:36  4667M   2:02   649M    1:47   984M     0   0     0:04   48M   0:02   47M   3:36   66M 247:70 10019M 
 143M  3:30   3688M    7:10  5042M   3:24  1322M    4:42  1658M     0   0     0:07   86M   0:03   83M   8:21  127M
 272M  7:16   6901M    9:34  5678M   6:04  2473M    7:04  2809M     0   0     0:12  146M   0:05  140M  18:30  233M 
2858M 80:00* 72000M*   ?:?? ?????M 183:00 25945M  202:05 26280M     0   0     2:05 1554M   0:40 1470M

* Estimated - need machine with more RAM than we have (about the same amount of ram as the file size) 
  to make or use the index. Currently this restricts splat to be run one chromosome at a time on most
  machines.

Notes on Traversable Suffix Arrays

I'm experimenting around with a traversable suffix array (sufx) as an alternative to the "splix" index that splat currently users. The advantage is that the sufx index is less than half the size, taking up 9x as many byte as there are bases in the genove vs. 24x as many bytes for the splix index. If it can be made to be competitive in terms of speed therefore it would be very helpful, allowing us to do genome-wide runs on machines with 32 gig of RAM, that these days are relatively cheap (we have a cluster of 8 of them already).

10000 reads vs. chromosome 22 just taking the alignment time in milliseconds, not the index load time.
This seems be enough reads to get accurate timing information for the local suffix array vs. splat timings.
Note at the moment splat is searching both strands while the other programs are just searching one strand.
Since both strands really do need to be searched, the times on the non-splat programs are doubled in the
table below.  Times are a median of three runs in milliseconds.

                    chr22   18,21,22 2,22   hg18
program mismatch    time    time     time   time
------------------------------------------------
i16Find    0         140     212      636   6360**
itsaFind   0         136     158      204   1146 
itsaFind   1         704    1150     1422   5922#
sufxFind   0         220     288      342   1264*
sufxFind   1        2134    2438     4231   6336*
sufxFind   2       15638   25784    33692  93090*
splat      0         177     321      484  21080**
splat      1         600    1164     1956  19560**
splat      2        2878    6270    11245 112450**
splat	   0+gap     452    1132     2312
splat	   1+gap     628    1477     2558
splat	   2+gap    3809    8558    15247

* Based just on a single run, not median of 3.
**Estimates - based on having to run it in 10 parts.
# Redo after more debugging (output isn't quite right)

The exact matches are a little too fast to measure accurately, so going to 100,000 reads.

          chr22   21,22    18,21,22 2,22     hg18
program   time    time     time     time     time
-------------------------------------------------
i16Find	  1772    2356     3986     7436     n/a
itsaFind  1314    1378     1648
sufxFind  1866    2146     2504
splat     3878    5689    12236