Modular and configurable optimal sequence alignment software: Cola
© Zamani et al.; licensee BioMed Central Ltd. 2014
Received: 18 February 2014
Accepted: 4 June 2014
Published: 9 June 2014
The fundamental challenge in optimally aligning homologous sequences is to define a scoring scheme that best reflects the underlying biological processes. Maximising the overall number of matches in the alignment does not always reflect the patterns by which nucleotides mutate. Efficiently implemented algorithms that can be parameterised to accommodate more complex non-linear scoring schemes are thus desirable.
We present Cola, alignment software that implements different optimal alignment algorithms, also allowing for scoring contiguous matches of nucleotides in a nonlinear manner. The latter places more emphasis on short, highly conserved motifs, and less on the surrounding nucleotides, which can be more diverged. To illustrate the differences, we report results from aligning 14,100 sequences from 3' untranslated regions of human genes to 25 of their mammalian counterparts, where we found that a nonlinear scoring scheme is more consistent than a linear scheme in detecting short, conserved motifs.
Cola is freely available under LPGL from https://github.com/nedaz/cola.
The fundamental question for optimal alignment of genomic sequences is how to define "optimality", with particular regard to biological relevance. Mathematically, optimality is determined by the underlying scoring function, and different linear and limited nonlinear schemes have been proposed [1–7]. To date, the Smith-Waterman gap affine (SWGA) method , a modification of the Smith-Waterman (SW) method  that applies additional penalties to alignment gaps regardless of the gap size itself, has remained the method of choice, and is utilised by common alignment tools . Here, we extend this method (SWGA+) by allowing for any arbitrary nonlinear match scoring function, enabling us to give higher weights to consecutive matching nucleotides, rather than optimising the total number of matches. The software we have developed, Cola (Contiguous optimal local aligner), is a C++ implementation of this algorithm, which, by applying a linear match scoring function, degenerates into the SWGA and SW schemes. Notable features include a constant factor difference between generalised and linear scoring, both in runtime and memory consumption. Also, banded alignments, i.e. limiting the search space to a band around the diagonal, allows for scaling linearly in time with sequence length. We also utilise a variation of an algorithm that significantly improves the space complexity from O(N2) to O(2 N). This is based on the check-pointing method introduced by Powell et al.  as an extension of the divide-and-conquer method introduced by Hirschberg .
Cola provides an Application Programmer’s Interface (API) for ease of integration into larger software packages, and is e.g. included in the recent versions of the whole genome synteny aligner Satsuma, and the universal genome coordinate mapper Kraken (Zamani et al., submitted). Depending on parameterisation, Cola is bit-compatible with the SW and SWGA methods.
The last element in Eq. 1, i.e., 0 is required for finding the local alignment as opposed the global. It translates into starting the alignment from any point that maximises the overall alignment score. Removing this element from the recurrence turns the algorithm into a global aligner rather than a local one.
To conceptualise the effect of the aforementioned recurrences, we can think of extending the SWGA edit-graph structure by appending nodes that correspond to the cost of getting from cell (i,j) to cell (i', j') with n contiguous matches. This translates to adding one node for each value of n at every cell (i,j) in the graph. Conceptually, the underlying edit-graph becomes three-dimensional, with the third dimension containing the nodes that track the path and cost of contiguous matching. For memory efficiency, Cola dynamically allocates the third dimension only in an on-demand fashion (see Figure 1b), and limits the depth to 1 in case a linear function is applied (e.g. to match the SW and SWGA schemes). The depth at each cell (i,j) demonstrates the various lengths of contiguity of matches for which the best path can be obtained from cell (0,0) to cell (i,j). Knowing the length of contiguity of matches at every given node allows for applying a scoring mechanism based on an arbitrary function of the matches. Cola implements a cubic function f (n) = n3 by default, as manually examined examples suggest good balance between a too steeply rising function (e.g. exponential function) and a lower degree polynomial (e.g. quadratic function) in combination with the penalties ρ m = 8, ρ o = 200, and ρ e = 20, which we determined empirically (2 consecutive matches equal 1 mismatch penalty, 3 consecutive matches score higher than a single gap extension penalty, etc.).
Performance and accuracy
While the true accuracy of a pairwise alignment is difficult to assess, we define several measures to highlight the difference as a result of linear or non-linear scoring: (i) the number of positions in which all or most genomes agree with the reference, representing events that could be less likely to occur by random chance; and (ii) the number of elements consisting of n consecutive nucleotides matching all or most genomes. As a test set, we choose 14,100 3' Un-translated Regions (3'UTRs) of human coding genes that are between 100 and 1000 in size, with 4.9 million nucleotides in total sequence, and their orthologous sequences in at least 25 mammalian genomes . 3'UTRs are known to contain short (6 nucleotides), and in some cases highly conserved recognition motifs required for post-translational processing of RNAs, including the poly-adenylation signal AATAAA.
Comparison of SW, SWGA, SW + and SWGA+ with regards to alignment consistency
Number of motifs
Sequence in motifs
Software availability and requirements
Cola is an object-oriented software written in C++, compiled using gcc, and runs on Linux operating systems. The software is modular and provides a collection of utilities for input data conversion. The source code is designed to be easily configurable allows for easy integration as the back-end into e.g. seed-finding methods, such as BLAST (Stephen F ).
The source code is freely available under LPGL. For code usage guide, command line parameters, and output file formats, see website at https://github.com/nedaz/cola.
This work was funded by a start-up grant from the Science for Life Laboratory (MGG), with support from the Bioinformatics Infrastructure for Life Sciences in Sweden (MPH).
- Altschul S, Erickson B: Optimal sequence alignment using affine gap costs. Bull Math Biol. 1986, 48 (5–6): 603-616. doi:10.1016/S0092-8240(86)90010-8View ArticlePubMedGoogle Scholar
- Altschul SF: Generalized affine gap costs for protein sequence alignment. Proteins. 1998, 32 (1): 88-96. 10.1002/(SICI)1097-0134(19980701)32:1<88::AID-PROT10>3.0.CO;2-J.View ArticlePubMedGoogle Scholar
- Gotoh O: An improved algorithm for matching biological sequences. J Mol Biol. 1982, 162 (3): 705-708. 10.1016/0022-2836(82)90398-9.View ArticlePubMedGoogle Scholar
- Miller W, Myers EW: Sequence comparison with concave weighting functions. Bull Math Biol. 1988, 50 (2): 97-120. 10.1007/BF02459948. doi:10.1007/BF02459948View ArticlePubMedGoogle Scholar
- Mott R: Local sequence alignments with monotonic gap penalties. Bioinformatics (Oxford, England). 1999, 15 (6): 455-462. 10.1093/bioinformatics/15.6.455.View ArticleGoogle Scholar
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48 (3): 443-453. 10.1016/0022-2836(70)90057-4.View ArticlePubMedGoogle Scholar
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147 (1): 195-197. 10.1016/0022-2836(81)90087-5. doi:10.1016/0022-2836(81)90087-5View ArticlePubMedGoogle Scholar
- Zachariah MA, Crooks GE, Holbrook SR, Brenner SE: A generalized affine gap model significantly improves protein sequence alignment accuracy. Proteins. 2005, 58 (2): 329-338. doi:10.1002/prot.20299View ArticlePubMedGoogle Scholar
- Powell DR, Allison L, Dix TI: A versatile divide and conquer technique for optimal string alignment. Inform Process Lett. 1999, 70 (3): 127-139. 10.1016/S0020-0190(99)00053-8. doi:10.1016/s0020-0190(99)00053-8View ArticleGoogle Scholar
- Hirschberg DS: A linear space algorithm for computing maximal common subsequences. Commun ACM. 1975, 18 (6): 341-343. 10.1145/360825.360861. doi:10.1145/360825.360861View ArticleGoogle Scholar
- Grabherr MG, Russell P, Meyer M, Mauceli E, Alföldi J, Di Palma F, Lindblad-Toh K: Genome-wide synteny through highly sensitive sequence alignment: Satsuma. Bioinformatics. 2010, 26 (9): 1145-1151. 10.1093/bioinformatics/btq102. doi:10.1093/bioinformatics/btq102PubMed CentralView ArticlePubMedGoogle Scholar
- Lindblad-Toh K, Wade CM, Mikkelsen TS, Karlsson EK, Jaffe DB, Kamal M, Zody MC: Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature. 2005, 438 (7069): 803-819. 10.1038/nature04338. doi:10.1038/nature04338View ArticlePubMedGoogle Scholar
- Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Miller W: Human-mouse alignments with BLASTZ. Genom Res. 2003, 13 (1): 103-107. 10.1101/gr.809403. doi:10.1101/gr.809403View ArticleGoogle Scholar
- Altschul SF: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389. doi:10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.