Mapping Times and Accuracies


random10k test set - randomly generated "genome" of just 10k bases.  Reads are one 25-mer starting
at each position.  The percentages in the table are the proportion of reads that each program
mapped using the given settings.  All programs completed this short test quickly.  This test
highlights splat's ability to handle single-base insertions and deletions in addition to the up to
two substitutions per read that most of the programs handle.

sub  ins  del   splat    splat    soap     soap     soap     maq      eland    eland  bowtie
per  per  per   default  maxGap=0 default  -g 1     -g 1     default  default --multi default
read read read                                      -s 6
----------------------------------------------------------------------------------------------
0    0    0     100.00%  100.00%  100.00%  100.00%  100.00%  99.86%  100.00% 100.00%  100.00%
1    0    0     100.00%  100.00%  100.00%  100.00%  100.00%  99.86%  100.00% 100.00%  100.00%
2    0    0     100.00%  100.00%   96.37%   96.37%   97.70%  99.88%  100.00% 100.00%  100.00%
0    1    0     100.00%   27.08%   28.04%   91.10%   90.97%  22.69%   22.69%  22.69%   22.69%
1    1    0     100.00%   19.85%   17.45%   17.52%   17.47%  15.61%   15.61%  15.61%   15.61%
0    0    1     100.00%   31.90%   22.69%   82.20%   82.92%  27.94%   28.04%  28.04%   28.04%
1    0    1     100.00%   21.73%   15.61%   19.60%   19.60%  17.37%   17.45%  17.45%   17.45%
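
A minimal Python sketch of how a test set like random10k could be built, assuming uniformly
random bases and one edit applied at a chosen offset in each read.  The function names, the
seeds, and the way insertions and deletions keep each read at 25 bases are illustrative
assumptions, not the script actually used to produce the table above.

```python
import random

READ_LEN = 25

def random_genome(length=10_000, seed=0):
    """Uniformly random A/C/G/T string (assumed base composition)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))

def reads_at_each_position(genome, read_len=READ_LEN):
    """One read starting at every position where a full read still fits."""
    return [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]

def substitute(read, pos, rng):
    """Replace the base at pos with a different base."""
    choices = [b for b in "ACGT" if b != read[pos]]
    return read[:pos] + rng.choice(choices) + read[pos + 1:]

def insert_base(read, pos, rng):
    """Insert a random base before pos, then trim back to the original length."""
    return (read[:pos] + rng.choice("ACGT") + read[pos:])[:len(read)]

def delete_base(genome, start, pos, read_len=READ_LEN):
    """Take read_len + 1 genome bases from start and drop the one at offset pos."""
    window = genome[start:start + read_len + 1]
    return window[:pos] + window[pos + 1:]

genome = random_genome()
reads = reads_at_each_position(genome)
rng = random.Random(1)
one_sub = [substitute(r, READ_LEN // 2, rng) for r in reads]
one_ins = [insert_base(r, READ_LEN // 2, rng) for r in reads]
```

Rows of the table that combine a substitution with an insertion or deletion would chain these
helpers on the same read.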

hardSim22 test set - chromosome 22 reads with simulated errors, mapped back to chromosome 22.
This was generated with the maq simulate command at a relatively high error rate of about 
2 errors per read on average; 10% of errors are indels.  Splat maps more reads than the others
(76.33% of the 1,000,000 reads, versus 72.53% for the next best, maq).  Run times are
minutes:seconds.

               splat          soap          soap           maq           eland          eland          bowtie
              default       default         -g 1          default        default      --multi=100      default
reads      time mapped    time mapped    time mapped    time mapped    time mapped    time mapped    time mapped
----------------------------------------------------------------------------------------------------------------
1          0:11 100.00%   0:09 100.00%   0:08 100.00%   0:55 100.00%   0:22 100.00%   0:25 100.00%   0:02 100.00%
10         0:11  90.00%   0:08  80.00%   0:09  90.00%   0:55  80.00%   0:22  70.00%   0:24  80.00%   0:02  80.00%
100        0:13  79.00%   0:08  71.00%   0:08  72.00%   0:54  74.00%   0:21  40.00%   0:24  70.00%   0:02  71.00%
1000       0:11  78.20%   0:08  71.50%   0:10  72.30%   1:13  73.20%   0:22  46.30%   0:23  70.60%   0:02  71.80%
10000      0:16  75.84%   0:13  69.68%   0:14  70.99%   1:32  72.29%   0:23  43.67%   0:23  68.96%   0:03  70.13%
100000     0:33  76.34%   0:43  70.07%   0:58  71.66%   2:02  72.49%   0:31  44.21%   0:30  68.65%   0:11  70.54%
1000000    3:15  76.33%   5:59  70.07%   8:49  71.69%   6:36  72.53%   1:46  44.14%   1:46  61.83%   1:24  70.54%
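
To compare speeds across columns it can help to convert the minutes:seconds times into
throughput.  The following Python sketch does this for the 1,000,000-read row above; the helper
names are illustrative only.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '3:15' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def reads_per_second(n_reads, mmss):
    """Average throughput over the whole run, including any start-up cost."""
    return n_reads / to_seconds(mmss)

# Times copied from the 1,000,000-read row of the hardSim22 table.
TIMES = {
    "splat": "3:15",
    "soap": "5:59",
    "soap -g 1": "8:49",
    "maq": "6:36",
    "eland": "1:46",
    "eland --multi=100": "1:46",
    "bowtie": "1:24",
}

for program, elapsed in TIMES.items():
    print(f"{program:18s} {reads_per_second(1_000_000, elapsed):8.0f} reads/s")
```

The rows with only a few reads are dominated by fixed start-up time, so throughput figures are
only meaningful for the larger runs.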

tenMegVs22 test set - 10,000,000 chr22 reads with simulated errors averaging one per read; 
10% of errors are indels.
            
               splat          soap          soap           maq           eland          eland          bowtie
              default       default         -g 1          default        default      --multi=100      default
reads      time mapped    time mapped    time mapped    time mapped    time mapped    time mapped    time mapped
----------------------------------------------------------------------------------------------------------------
10000000  39:58  95.49%  43:16  91.58%  49:52  93.27%  62:09  92.78%  13:11  75.28%  13:49  90.24%   8:47 91.91%

14vs22 test set - a part of chromosome 14 that is a duplicon of chr22 was repeat-masked, and the
non-masked parts were shredded into non-overlapping 25-mers and mapped to chr22 (10252 reads).

               splat          soap          soap           maq           eland          eland          bowtie
              default       default         -g 1          default        default      --multi=100      default
reads      time mapped    time mapped    time mapped    time mapped    time mapped    time mapped    time mapped
---------------------------------------------------------------------------------------------------------------
10252      0:14  96.20%   0:09  95.77%   0:09  96.03%   1:23  95.79%   0:18  90.17%   0:21  95.71%   0:02 95.80%
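
The shredding step for the 14vs22 set can be sketched in Python as follows.  This assumes the
repeat-masked bases appear as anything other than uppercase A, C, G, or T (for example N or
lowercase letters); the function name and that masking convention are assumptions, not details
taken from the original procedure.

```python
import re

def shred_unmasked(sequence, read_len=25):
    """Cut each unmasked run into non-overlapping reads of read_len bases.

    A leftover tail shorter than read_len at the end of a run is discarded.
    """
    reads = []
    for run in re.finditer(r"[ACGT]+", sequence):
        segment = run.group()
        for i in range(0, len(segment) - read_len + 1, read_len):
            reads.append(segment[i:i + read_len])
    return reads

# Tiny made-up example: 60 unmasked bases, a masked gap, then two short runs.
example = "A" * 60 + "N" * 10 + "C" * 25 + "N" + "G" * 24
print(len(shred_unmasked(example)))
```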

bigChrom test set - 1 million simulated reads (the same reads as the 1,000,000-read hardSim22 run)
with simulated errors averaging 2 per read, 10% of them indels, mapped against bigger and bigger
sets of chromosomes.

                    splat          soap          soap           maq           eland          eland          bowtie
                   default       default         -g 1          default        default      --multi=100      default
size    chroms   time mapped    time mapped    time mapped    time mapped    time mapped    time mapped   time mapped
--------------------------------------------------------------------------------------------------------------------
  35M       22   3:15 76.33%    5:59  70.07%    8:49 71.69%   6:36  72.53%   1:46  44.14%   1:46  61.83%   1:24  70.54%
  69M    21,22   4:20 76.70%   11:38  70.04%   14:48 72.07%  12:27  72.94%   2:31  43.75%   2:20  69.60%   1:47  70.74%
 142M 18,21,22   6:38 77.20%   20:10  71.01%   26:20 72.57%  21:42  73.50%   4:56  43.17%   4:33  69.72%   2:13  70.60%
 273M     2,22  10:24 77.80%   34:19  71.64%   53:16 73.15%  34:05  74.14%   6:35  42.17%   7:21  69.95%   2:20  70.09%
3080M      all      n/a            n/a           n/a            n/a              n/a            n/a        4:26  69.77%

Database and Index Times and Sizes

Some of the programs need either a sequence database or a sequence index to be built. This table shows the build times and the resulting sizes for chromosome 22, for data sets roughly 2x, 4x, and 8x as large as chromosome 22, and for the entire human genome. Splat spends significantly longer on this phase than soap, maq, and eland, and its index is by far the largest (nothing comes for free); only bowtie takes longer to build. Still, this only needs to be done once per genome.

DNA      splat          soap          maq         eland         bowtie
size   time  size     time  size   time  size   time  size   time  size
------------------------------------------------------------------------
  35M  0:59    953M      0   0     0:02   25M   0:01   24M   1:45   37M   
  69M  1:39   1820M      0   0     0:04   48M   0:02   47M   3:36   66M
 143M  3:30   3688M      0   0     0:07   86M   0:03   83M   8:21  127M
 272M  7:16   6901M      0   0     0:12  146M   0:05  140M  18:30  233M 
2858M 80:00* 72000M*     0   0     2:05 1554M   0:40 1470M

* Estimated - making or using the index needs a machine with more RAM than we have (about as much
  RAM as the index file size).  Currently this restricts splat to running on one chromosome at a
  time on most machines.
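
As a rough check on the starred estimate, the measured splat index sizes can be extrapolated
linearly to the whole genome.  The Python sketch below assumes the DNA sizes are in megabases and
the index sizes in megabytes, and it scales from the largest measured data set; this is one
plausible way to arrive at such a figure, not necessarily how the estimate in the table was made.

```python
# (DNA size, splat index size) pairs copied from the measured rows of the table.
# The units (megabases and megabytes) are an assumption.
MEASURED = [(35, 953), (69, 1820), (143, 3688), (272, 6901)]

def index_units_per_base(dna_size, index_size):
    """Index size divided by DNA size for one measured row."""
    return index_size / dna_size

def extrapolate_index_size(target_dna_size, measured=MEASURED):
    """Scale linearly from the largest measured data set."""
    dna_size, index_size = max(measured)
    return target_dna_size * index_units_per_base(dna_size, index_size)

for dna_size, index_size in MEASURED:
    print(f"{dna_size:5d}M  {index_units_per_base(dna_size, index_size):5.1f} per base")
print(f"whole genome: about {extrapolate_index_size(2858):.0f}M")
```

The ratio falls slowly from about 27 to about 25 as the data sets grow, and the linear
extrapolation lands close to the 72000M shown in the table.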