CONTIGuator: a bacterial genomes finishing tool for structural insights on draft genomes
© Galardini et al; licensee BioMed Central Ltd. 2011
Received: 18 March 2011
Accepted: 21 June 2011
Published: 21 June 2011
Recent developments in sequencing technologies have given the opportunity to sequence many bacterial genomes with limited cost and labor, compared to previous techniques. However, a limiting step of genome sequencing is the finishing process, needed to infer the relative position of each contig and close sequencing gaps. An additional degree of complexity is given by bacterial species harboring more than one replicon, which are not contemplated by the currently available programs. The availability of a large number of bacterial genomes allows geneticists to use complete genomes (possibly from the same species) as templates for contigs mapping.
Here we present CONTIGuator, a software tool for mapping contigs onto a reference genome. It produces a visual map of the contigs that highlights the loss and/or gain of genetic elements, and it supports the finishing of multipartite genomes. CONTIGuator was tested on four genomes and showed improved performance compared with currently available programs.
Our approach appears efficient and offers a clear visualization, allowing the user to perform comparative structural genomics analyses on draft genomes. CONTIGuator is a Python script for Linux environments that runs on normal desktop machines and can be downloaded from http://contiguator.sourceforge.net.
Keywords: Genomics, Genome finishing, Software, Structural genomics
In recent years, the dropping cost of sequencing technologies has allowed biologists to easily increase the number of genomic sequences available to the scientific community, especially for bacterial species. The number of phylogenetically related genomes has also increased dramatically: 29 genera have more than 10 complete genomic sequences and, interestingly, all 12 species with more than 10 fully sequenced genomes are bacterial (GOLD database, November 2010), which points to the great value of closely related genomes in so-called bacterial comparative genomics. However, among ongoing or draft genome projects, the 14 bacterial species with more than 50 running projects show a lack of finishing efforts, suggesting that closing a genome (even a bacterial one) is still time-consuming and cannot be easily automated. In fact, closing the gaps of a draft genome requires a series of PCR reactions designed in an iterative fashion.
To overcome these problems, many programs have recently been developed. They map all the contigs obtained from the first automated assembly run to a closed reference genome (usually from the same species, or as closely related as possible) and design a series of PCR primers to fill the putative gaps between the contigs, in an iterative fashion. These programs are Projector2, PGA4genomics, OSLay, ABACAS and other algorithms included in genome annotation tools. All of them can be used both for genome finishing and simply to infer the relative position of the contigs, but none of them helps the user find regions that diverge between the reference and the new genome, which could be useful for preliminary structural analysis. Moreover, for a genome composed of several replicons, there is no automated procedure that handles the multipartite genomic organization and avoids placing one contig in more than one replicon.
To address these problems and help genomic scientists perform comparative structural analyses while reducing the time-consuming PCR-based finishing process, we developed CONTIGuator. The script combines the routines of one of the most widely used tools, ABACAS, and refines the results with a contig profile viewable with the Artemis comparison tool (ACT), in which the putative PCR products for the subsequent step of the finishing process are also shown.
The outputs of the program are divided into separate directories (one for each replicon), containing the primer sequences (if the option was selected) and a series of input files for the Artemis comparison tool (ACT): "Reference.fsa", containing the reference sequence; "PseudoContig.fsa", containing the pseudocontig sequence; and "PseudoContig.crunch", which is the Artemis comparison file. Once these files are loaded into ACT, the user can add the "ReferenceHits.tab" and "PseudoContig.tab" entry files to show the blast hits on the reference genome and the position of the contigs in the pseudocontig molecule; if the primers were generated, the "PCRProducts.tab" entry file can be added to the pseudocontig molecule to show the putative PCR products. Finally, if the ptt files of the reference genome were present, the user can add the "ReferenceProteinHits.tab" entry file to show the position of the tblastn hits on the reference genome.
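As an illustration of the output layout described above, the following minimal Python sketch reports which of the named files are present in one replicon directory before it is opened in ACT. The file names are those given in the text; the function name `summarize_replicon_dir`, the grouping of files into three categories and the assumption that all files sit directly inside the replicon directory are hypothetical and are not part of CONTIGuator itself.

```python
from pathlib import Path

# File names as listed in the text. Grouping them this way is an
# illustrative assumption, not something CONTIGuator defines.
ACT_CORE_FILES = ("Reference.fsa", "PseudoContig.fsa", "PseudoContig.crunch")
ACT_ENTRY_FILES = ("ReferenceHits.tab", "PseudoContig.tab")
OPTIONAL_FILES = ("PCRProducts.tab", "ReferenceProteinHits.tab")


def summarize_replicon_dir(directory):
    """Report which expected output files exist in one replicon directory.

    Assumes (hypothetically) that the files lie directly in `directory`.
    """
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    return {
        # Core files needed to open the comparison in ACT.
        "missing_core": [f for f in ACT_CORE_FILES if f not in present],
        # Entry files that can be added on top of the comparison.
        "entry_files": [f for f in ACT_ENTRY_FILES if f in present],
        # Files produced only with primer picking or when ptt files exist.
        "optional": [f for f in OPTIONAL_FILES if f in present],
    }
```

An empty `missing_core` list would suggest that the directory can be loaded into ACT; the `optional` list shows whether the PCR products and the tblastn hits are available for display.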
Results and discussion
Table 1: Comparison between ABACAS and CONTIGuator performances over the four test genomes (column: gaps putatively closed).
Using the four complete genomes, we compared the three programs that are able to generate PCR primers (CONTIGuator, ABACAS, Projector2) in terms of how many gaps the generated set of PCR primers would putatively close (Table 1). We checked whether the relative position of the contigs on the reference genome was the same when the contigs were mapped on the complete genome: a gap was considered "putatively closed" when two contigs were mapped near each other on the reference genome and a PCR pair was designed between the two adjacent contigs. CONTIGuator performed better than ABACAS, leading to 10 more putative gap closures; the main reason is that the higher number of PCR pairs generated (12) was able to close more gaps. The other gaps, which ABACAS could not close, were due to a different relative placement of the contigs on the reference genome, a placement that in CONTIGuator was automatically corrected by the blast-based contig profiling. In comparison with Projector2, CONTIGuator closed far more gaps in Y. enterocolitica and S. meliloti, slightly fewer in B. microti and fewer in L. lactis; it should be pointed out that the graphical output of Projector2 lacks the contig profiler approach of CONTIGuator and therefore no detailed structural features can be seen prior to genome finishing. Moreover, as pointed out earlier, ABACAS (and therefore also CONTIGuator) ensures the uniqueness of the primer pairs generated, thus putatively removing any ambiguous reactions. A higher number of gaps closed (putatively, in this simulation) could lead to a lower number of iterations (input contigs, CONTIGuator annealing, PCR reactions, new set of contigs) and therefore could strongly reduce the effort needed to close a genome. In the cases analyzed in this study, the apparent contradiction of designing more PCR pairs in the first run of the program may lead to less time being needed to close all the gaps in the draft genome.
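The "putatively closed" criterion can be made concrete with a short Python sketch. It is not the evaluation script used in this study: the function names, the representation of each mapping as an ordered list of contig identifiers, and the simplification that adjacency ignores contig orientation and gap size are all assumptions made for illustration.

```python
def adjacent_pairs(order, circular=False):
    """Return the set of unordered neighbouring contig pairs in a layout."""
    pairs = {frozenset(pair) for pair in zip(order, order[1:])}
    if circular and len(order) > 2:
        # On a circular replicon the last contig also neighbours the first.
        pairs.add(frozenset((order[-1], order[0])))
    return pairs


def putatively_closed(reference_order, complete_order, primer_pairs,
                      circular=False):
    """Gaps counted as putatively closed under the simplified criterion.

    A gap between two contigs is counted when (1) the contigs are adjacent
    in the mapping on the reference genome, (2) they are also adjacent in
    the mapping on the complete genome, and (3) a PCR primer pair was
    designed between them.
    """
    on_reference = adjacent_pairs(reference_order, circular)
    on_complete = adjacent_pairs(complete_order, circular)
    with_primers = {frozenset(pair) for pair in primer_pairs}
    return on_reference & on_complete & with_primers
```

For example, with a reference-based order `c1, c2, c3, c4`, a true order `c1, c2, c4, c3` and primers designed for every neighbouring pair in the reference-based order, only the `c1`/`c2` and `c3`/`c4` gaps would be counted: the `c2`/`c3` primers span two contigs that are not neighbours in the complete genome.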
CONTIGuator is a powerful and fast algorithm for the bacterial genome finishing process: it provides a larger set of PCR primers able to close more gaps and gives clues on the relative position of the contigs. Moreover, CONTIGuator contig profiling provides a high-resolution map (viewable with the popular ACT program) that highlights regions of the reference genome that diverge from the assembled contigs. CONTIGuator thus represents, before the end of the finishing phase, a powerful method for investigating structural genomics based on draft genome sequences.
Availability and requirements
CONTIGuator is a Python script for Linux environments and can be downloaded from http://contiguator.sourceforge.net under the GNU GPL v3.0 license. It requires the Python interpreter with the BioPython package, the Perl interpreter, a copy of ABACAS (available at http://abacas.sourceforge.net/), Blast+, MUMmer and primer3; all these programs must be reachable from the command line. The Artemis comparison tool is needed to view the output files. All the software packages were tested with their latest version, although no malfunction was reported with older versions. Performance was tested on a normal desktop machine (Linux Ubuntu 10.04, 4-CPU Intel 2.50 GHz, 4 GB RAM), with run times on the order of a few minutes (about 6 minutes with primer picking and about 30 seconds without).
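Since all the external programs must be reachable from the command line, a quick pre-flight check may save a failed run. The sketch below is not part of CONTIGuator. The executable names are assumptions based on the usual names shipped with each package (`blastn`, `tblastn` and `makeblastdb` for Blast+, `nucmer` for MUMmer, `primer3_core` for primer3), and ABACAS is left out because the text does not give the name of its script.

```python
import importlib.util
import shutil

# Executable names are assumptions (typical names shipped by each package);
# adjust them to match the local installation.
REQUIRED_EXECUTABLES = {
    "Perl": ["perl"],
    "Blast+": ["blastn", "tblastn", "makeblastdb"],
    "MUMmer": ["nucmer"],
    "primer3": ["primer3_core"],
}


def missing_executables(required=REQUIRED_EXECUTABLES, which=shutil.which):
    """Return {package: [executables not found on PATH]}."""
    missing = {}
    for package, names in required.items():
        absent = [name for name in names if which(name) is None]
        if absent:
            missing[package] = absent
    return missing


def has_biopython():
    """True if the Biopython package (imported as `Bio`) can be found."""
    return importlib.util.find_spec("Bio") is not None
```

Calling `missing_executables()` with no arguments inspects the current `PATH`; an empty dictionary means every listed executable was found. The `which` parameter exists only so that the lookup can be replaced, for instance when testing.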
We are grateful to Stephane Audic and Holger C Scholz, who provided us with the unassembled contigs of Brucella microti; to Rakin Alexander and Julia Batzilla for the unassembled contigs of Yersinia enterocolitica; to Roland Siezen for the unassembled contigs of Lactococcus lactis; to Matteo Brilli for the critical reading of the manuscript; to Marco Fondi and Florent Lassalle for the test sessions; and to Samuel Assefa for the feedback on ABACAS.
- Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides NC: The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res. 2006, 34: D332-334. 10.1093/nar/gkj145.
- van Hijum SA, Zomer AL, Kuipers OP, Kok J: Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005, 33: W560-566. 10.1093/nar/gki356.
- Zhao F, Hou H, Bao Q, Wu J: PGA4genomics for comparative genome assembly based on genetic algorithm optimization. Genomics. 2009, 94: 284-286. 10.1016/j.ygeno.2009.06.006.
- Richter DC, Schuster SC, Huson DH: OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics. 2007, 23: 1573-1579. 10.1093/bioinformatics/btm153.
- Assefa S, Keane TM, Otto TD, Newbold C, Berriman M: ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics. 2009, 25: 1968-1969. 10.1093/bioinformatics/btp347.
- Kislyuk AO, Katz LS, Agrawal S, Hagen MS, Conley AB, Jayaraman P, Nelakuditi V, Humphrey JC, Sammons SA, Govil D: A computational genomics pipeline for prokaryotic sequencing projects. Bioinformatics. 26: 1819-1826.
- Carver T, Berriman M, Tivey A, Patel C, Bohme U, Barrell BG, Parkhill J, Rajandream MA: Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database. Bioinformatics. 2008, 24: 2672-2676. 10.1093/bioinformatics/btn529.
- Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC Bioinformatics. 2009, 10: 421. 10.1186/1471-2105-10-421.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol. 2004, 5: R12. 10.1186/gb-2004-5-2-r12.
- Koressaar T, Remm M: Enhancements and modifications of primer design program Primer3. Bioinformatics. 2007, 23: 1289-1291. 10.1093/bioinformatics/btm091.
- Audic S, Lescot M, Claverie JM, Scholz HC: Brucella microti: the genome sequence of an emerging pathogen. BMC Genomics. 2009, 10: 352. 10.1186/1471-2164-10-352.
- Lazaro FG, Rodriguez-Tarazona RE, Garcia-Rodriguez JA, Munoz-Bellido JL: Fluoroquinolone-resistant Brucella melitensis mutants obtained in vitro. Int J Antimicrob Agents. 2009, 34: 252-254. 10.1016/j.ijantimicag.2008.12.013.
- Batzilla J, Hoper D, Antonenka U, Heesemann J, Rakin A: Complete genome sequence of Yersinia enterocolitica subsp. palearctica serogroup O:3. J Bacteriol.
- Thomson NR, Howard S, Wren BW, Holden MT, Crossman L, Challis GL, Churcher C, Mungall K, Brooks K, Chillingworth T, et al: The complete genome sequence and comparative genome analysis of the high pathogenicity Yersinia enterocolitica strain 8081. PLoS Genet. 2006, 2: e206. 10.1371/journal.pgen.0020206.
- Siezen RJ, Bayjanov J, Renckens B, Wels M, van Hijum SA, Molenaar D, van Hylckama Vlieg JE: Complete genome sequence of Lactococcus lactis subsp. lactis KF147, a plant-associated lactic acid bacterium. J Bacteriol. 192: 2649-2650.
- Bolotin A, Mauger S, Malarme K, Ehrlich SD, Sorokin A: Low-redundancy sequencing of the entire Lactococcus lactis IL1403 genome. Antonie Van Leeuwenhoek. 1999, 76: 27-76. 10.1023/A:1002048720611.
- Galardini M, Mengoni A, Brilli M, Pini F, Fioravanti A, Lucas S, Lapidus A, Cheng J-F, Goodwin L, Pitluck S, et al: Exploring the symbiotic pangenome of the nitrogen-fixing bacterium Sinorhizobium meliloti. BMC Genomics. 2011, 12: 253. 10.1186/1471-2164-12-253.
- Galibert F, Finan TM, Long SR, Puhler A, Abola P, Ampe F, Barloy-Hubler F, Barnett MJ, Becker A, Boistard P, et al: The composite genome of the legume symbiont Sinorhizobium meliloti. Science. 2001, 293: 668-672. 10.1126/science.1060966.
- Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009, 25: 1422-1423. 10.1093/bioinformatics/btp163.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.