biobambam: tools for read pair collation based algorithms on BAM files

Background Sequence alignment data is often ordered by coordinate (id of the reference sequence plus position on the sequence where the fragment was mapped) when stored in BAM files, as this simplifies the extraction of variants between the mapped data and the reference or of variants within the mapped data. In this order paired reads are usually separated in the file, which complicates some other applications like duplicate marking or conversion to the FastQ format which require to access the full information of the pairs. Results In this paper we introduce biobambam, a set of tools based on the efficient collation of alignments in BAM files by read name. The employed collation algorithm avoids time and space consuming sorting of alignments by read name where this is possible without using more than a specified amount of main memory. Using this algorithm tasks like duplicate marking in BAM files and conversion of BAM files to the FastQ format can be performed very efficiently with limited resources. We also make the collation algorithm available in the form of an API for other projects. This API is part of the libmaus package. Conclusions In comparison with previous approaches to problems involving the collation of alignments by read name like the BAM to FastQ or duplication marking utilities our approach can often perform an equivalent task more efficiently in terms of the required main memory and run-time. Our BAM to FastQ conversion is faster than all widely known alternatives including Picard and bamUtil. Our duplicate marking is about as fast as the closest competitor bamUtil for small data sets and faster than all known alternatives on large and complex data sets.

This will place the new tools in the /usr/bin directory. The LaunchPad version comes with support for converting CRAM files to FastQ via the Staden package's io lib.
libmaus can be compiled on MacOS X in way very similar to the one shown for Linux. This may require installation of the boost libraries (see [1]).

Compiling programs using libmaus
The compiler and linker flags necessary for using libmaus can be obtained using the pkg-config tool. If libmaus is not installed in a system directory via LaunchPad, then pkg-config needs to be informed of it's location via the PKG CONFIG PATH environment variable. An example for the bash shell is export PKG_CONFIG_PATH=$HOME/libmaus/lib/pkgconfig:$PKG_CONFIG_PATH The compilation flags can then be obtained using The first two lines include header files from libmaus. The next five lines simplify notation in the following.
The collating input class can then be instantiated using collator_type C(cin,"tmpfile"); to read from the standard input channel cin. The second argument specifies the name of the file used to write alignments out to disk when the list L described in the main text overflows. The temporary file can be removed after all alignments have been extracted from the input stream. For the sake of convenience this can also be done automatically using TempFileRemovalContainer::addTempFile("tmpfile"); After the instantiation of the collator object pairs can be extracted using pair<alignment_ptr_type,alignment_ptr_type> P; while ( C.tryPair(P) ) if ( P.first && P.second ) { /* process pair */ cout << "Found pair with name " << P.first->getName() << endl; } The function tryPair of the collator class tries to extract pairs from the input BAM file. It returns true if any data could be extracted. The pair P will contain two pointers to alignments if this extraction was successful. In case there are single or orphan reads in the input one of the two pointers may be a null pointer (an orphan read is a read end such that the other end is missing from the file). A list with accessor functions for alignments with their respective return types is shown in Table 1. Header information like the length and name of reference sequences can be obtained by calling methods of the header object in the collation class.

BamHeader const & header = C.getHeader();
Some methods and return types of the BamHeader class can be found in Table 2.

A sample program for converting BAM to FastQ
The following code is a complete program for converting an input BAM file to FastQ while keeping only complete pairs.

Running bamtofastq
The bamtofastq tool reads it's input file from standard input by default. An input file name may be given which can be obtained using the faidx command of SAMtools. The name of the temporary file used can be given using the T key. By default this file is generated in the current working directory. The size of this file can be significant for high depth input files. If the gz key is set to 1, then all output streams will be compressed in gzip format. A comprehensive list of options can be obtained by calling

Running bammarkduplicates2
The tool bammarkduplicates2 can be run using bammarkduplicates2 I=input.bam O=output.bam M=output.metrics where input.bam is the input file, the output is written in BAM format to the file output.bam and a file containing some statistics about the number of duplicates detected is written to the file output.metrics.
bammarkduplicates2 can read an input file from standard input, but as it needs to perform more than one scan over the input it will effectively create a copy of the input file in a temporary file in this case. The prefix for the temporary files used can be set using the tmpfile key. Setting rmdup=1 causes bammarkduplicates2 to remove duplicate alignments when writing the output file. A non default compression level can be set using the level key. This may be helpful if the output is not stored but passed to another program. Calling