biobambam: tools for read pair collation based algorithms on BAM files

Table 6 Run-time comparison of bammarkduplicates2 and alternatives on compute farm nodes (part b)

Run-time comparison for BAM duplicate marking on server blades
Data set	Program	Memory/GB	Run-time/minutes
ERR328876	biobambam	0.45	212.38±2.22
	Picard	15.74	443.66±1.77
	Picard_3,16	15.74	253.92±1.67
	Picard_3,64	52.41	252.21±1.11
	bamUtil	1.20	207.29±2.33
ERR054938	biobambam	0.45	210.16±2.62
	Picard	15.87	575.35±3.07
	Picard_3,16	15.87	287.02±2.19
	Picard_3,64	54.87	285.06±1.27
	bamUtil	7.12	401.90±1.92
ERR328190	biobambam	0.45	289.00±2.34
	Picard		≥1440
	bamUtil	16.73	914.81±6.16
SRP017681	biobambam₂₀	0.45	388.82±2.57
	biobambam₂₄	6.31	332.38±3.32
	Picard	15.90	363.18±2.63
	Picard_3,16	15.90	288.95±1.32
	Picard_3,64	63.39	290.72±1.62
	bamUtil		≥1400
ERP001231	biobambam	0.45	729.98±3.36
	biobambam₈	0.45	674.99±11.93
	Picard		≥1440
	bamUtil	≥22.35
	bamUtil₈	23.85	916.62±4.71

Run-time comparison of biobambam’s bammarkduplicates2, Picard’s MarkDuplicates and bamUtil’s dedup on compute farm nodes for the data sets ERR328876, ERR054938, ERR328190, SRP017681 and ERP001231 described in Table 2. For the data set SRP017681, bammarkduplicates2 was run with the default hash table size of 2²⁰ and, for comparison, with an increased size of 2²⁴. For ERP001231, bamUtil could only process the file using 23.85 GB ≈ 25.6 · 10⁹ ≥ 24 · 10⁹ bytes of memory (taking 1 GB as 2³⁰ bytes). As a consequence, we had to reduce the number of concurrently running processes from 10 to 8. For comparison, the table also contains the run-time of bammarkduplicates2 with 8 instances running in parallel. Picard failed to process the data sets ERR328190 and ERP001231 within the 24 hour limit due to inefficient I/O. We verified that these issues persist when larger amounts of memory are provided. Picard used close to the offered 16 GB of memory for the data sets ERR328876, ERR054938 and SRP017681. We verified that using more memory gave no significant improvement in speed. For this purpose, we ran Picard on these data sets with 16 and 64 GB of memory and a reduced concurrency of 3 identical processes running in parallel.
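The unit conversions behind the caption can be checked with a short sketch. It assumes that "GB" in the table denotes 2³⁰ bytes, which is the reading under which 23.85 GB comes to about 25.6 · 10⁹ bytes. The names GIB, gb_to_bytes and LIMIT_MINUTES are illustrative and not part of any of the tools compared.

```python
# Assumption: "GB" in the table means 2**30 bytes (binary gigabytes).
GIB = 2**30


def gb_to_bytes(gb: float) -> float:
    """Convert a memory figure from the table (in GB) to bytes."""
    return gb * GIB


# Memory bamUtil needed for ERP001231, as reported in the table.
bamutil_bytes = gb_to_bytes(23.85)
print(f"{bamutil_bytes / 1e9:.1f}e9 bytes")  # 25.6e9 bytes
print(bamutil_bytes >= 24e9)                 # True

# The 24 hour limit expressed in the table's run-time unit (minutes).
LIMIT_MINUTES = 24 * 60
print(LIMIT_MINUTES)                         # 1440
```

The last value corresponds to the "≥1440" entries used in the table for runs that did not finish within the 24 hour limit.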

ISSN: 1751-0473