- Software review
- Open Access
CONTIGuator: a bacterial genomes finishing tool for structural insights on draft genomes
Source Code for Biology and Medicine volume 6, Article number: 11 (2011)
Recent developments in sequencing technologies have made it possible to sequence many bacterial genomes with limited cost and labor compared to previous techniques. However, a limiting step of genome sequencing is the finishing process, which is needed to infer the relative position of each contig and to close sequencing gaps. An additional degree of complexity is given by bacterial species harboring more than one replicon, which are not handled by the currently available programs. The availability of a large number of bacterial genomes allows geneticists to use complete genomes (possibly from the same species) as templates for contig mapping.
Here we present CONTIGuator, a software tool that maps contigs onto a reference genome, allows the visualization of a contig map, highlights loss and/or gain of genetic elements, and permits the finishing of multipartite genomes. The functionality of CONTIGuator was tested using four genomes, demonstrating its improved performance compared to currently available programs.
Our approach appears efficient and offers a clear visualization, allowing the user to perform comparative structural genomics analysis on draft genomes. CONTIGuator is a Python script for Linux environments that runs on normal desktop machines and can be downloaded from http://contiguator.sourceforge.net.
In recent years, the dropping cost of sequencing technologies has allowed biologists to easily widen the number of genomic sequences available to the scientific community, especially for bacterial species. Moreover, the number of phylogenetically related genomes has also dramatically increased: 29 genera have more than 10 complete genomic sequences and, interestingly, all 12 species with more than 10 fully sequenced genomes belong to bacteria (GOLD database, November 2010), which points out the great value of closely related genomes in so-called bacterial comparative genomics. However, looking at the ongoing or draft genomic projects, a lack of finishing efforts can be seen within the 14 bacterial species with more than 50 running genomic projects, suggesting that closing a genome (even a bacterial one) is still time-consuming and cannot be easily automated. In fact, to close the gaps of a draft genome, a series of PCR reactions has to be designed in an iterative fashion.
To overcome these problems, many programs have recently been developed. They share an approach in which all the contigs obtained by the first automated assembly run are mapped to a closed reference genome (usually from the same species, or as close as possible) and a series of PCR primers is designed to fill the putative gaps between the contigs, in an iterative fashion. These programs are Projector2, PGA4genomics, OSlay, ABACAS and other algorithms present in genome annotation tools. All of them can be used both for genome finishing and simply to infer the relative position of the contigs, but none of them helps the user find regions that diverge between the reference and the new genome, which could be useful for a preliminary structural analysis. Moreover, for a genome composed of many replicons, there is no automated procedure that handles the multipartite genomic organization and avoids placing one contig in more than one replicon.
To address these problems and help genomic scientists perform comparative structural analyses while reducing the time-consuming PCR-based finishing process, we developed CONTIGuator, a script that combines the routines of one of the most used tools, ABACAS, and refines the results with a contig profiling viewable with the Artemis comparison tool (ACT), in which the putative PCR products for the subsequent step of the finishing process are also shown.
The CONTIGuator pipeline is detailed in Figure 1a. The user provides a fasta file containing all the contigs and one or more fasta files with the replicon(s) of the reference genome. The first step of the analysis is a megablast run (either the new Blast+ or the old Blast implementation can be installed), which provides the contig profiling. This step is useful for highlighting the parts of the reference and of the contigs that are homologous or divergent (some examples are given in Figure 1b, c and d); if the target genome is expected to harbour more than one replicon, the program ensures that each contig is mapped only once. The intermediate output is then used as input for the ABACAS perl script, run once for each reference replicon, which performs a MUMmer run (nucmer, delta-filter and show-tiling executables) and creates a first pseudocontig molecule. CONTIGuator then corrects this first pseudocontig molecule to create the final pseudocontig molecule and the map. The contigs placed across the starting point of the molecule are split in order to generate a clearer map, and the position of each contig is adjusted according to the blast output. The corrected map has shorter gaps, so more PCR primers will be found in the subsequent step; as in ABACAS, the gap between each pair of adjacent contigs is filled with Ns. The following step uses the Primer3 suite to create a set of primers to fill the remaining gaps; the uniqueness of the primers is ensured by a nucmer scan over all the contigs given to the program.
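The N-filling step described above can be illustrated with a minimal Python sketch. It is not taken from the CONTIGuator or ABACAS sources: the function name `build_pseudocontig`, the placement tuples, the 0-based half-open coordinates and the `min_gap` parameter are all assumptions made for illustration.

```python
def build_pseudocontig(placements, contigs, min_gap=1):
    """Join contigs in reference order, padding each gap with Ns.

    placements: list of (contig_id, ref_start, ref_end) tuples, with
                hypothetical 0-based, half-open reference coordinates.
    contigs:    dict mapping contig_id to its nucleotide sequence.
    min_gap:    assumed minimum spacer length, used when two contigs
                touch or overlap on the reference.
    """
    pieces = []
    prev_end = None
    # Walk the contigs from left to right along the reference.
    for contig_id, ref_start, ref_end in sorted(placements, key=lambda p: p[1]):
        if prev_end is not None:
            # The spacer length mirrors the distance on the reference.
            gap = max(ref_start - prev_end, min_gap)
            pieces.append("N" * gap)
        pieces.append(contigs[contig_id])
        prev_end = ref_end if prev_end is None else max(prev_end, ref_end)
    return "".join(pieces)


if __name__ == "__main__":
    toy_contigs = {"c1": "ACGT", "c2": "TTGG"}
    toy_placements = [("c2", 10, 14), ("c1", 0, 4)]
    print(build_pseudocontig(toy_placements, toy_contigs))
```

In this toy case the two contigs sit 6 bp apart on the reference, so the spacer is six Ns; shifting a contig closer to its neighbour, as the blast-based correction does, directly shortens the run of Ns that the primers must span.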
The last step is optional and requires the reference genome ptt files (usually available in the NCBI ftp repository): the protein sequences from non-homologous regions of the reference genome are used for a tblastn run against the excluded contigs. The output helps the user find those regions of the reference genome that may still be present in the draft genome, although with a lower level of homology and/or a different arrangement.
The outputs of the program are divided into separate directories (one for each replicon), each containing the primer sequences (if the option was selected) and a series of input files for the Artemis comparison tool (ACT): "Reference.fsa", containing the reference sequence; "PseudoContig.fsa", containing the pseudocontig sequence; and "PseudoContig.crunch", which is the Artemis comparison file. As soon as these files are loaded into ACT, the user can add the "ReferenceHits.tab" and "PseudoContig.tab" entry files to show the blast hits on the reference genome and the position of the contigs in the pseudocontig molecule; if the primers were generated, the "PCRProducts.tab" entry file can be added to the pseudocontig molecule to see the putative PCR products. Finally, if the reference genome ptt files were present, the user can add the "ReferenceProteinHits.tab" entry file to show the position of the tblastn hits on the reference genome.
Results and discussion
CONTIGuator was tested on four closed genomes for which the first draft assembly was available: the 1539 draft contigs from the Brucella microti genome, which comprises two replicons of about 2 Mb and 1 Mb, respectively, against the reference Brucella melitensis strain ATCC 23457 genome, which comprises two replicons of the same sizes; the 864 draft contigs from the Yersinia enterocolitica subsp. palearctica Y11 genome against the 4.6 Mb long reference Yersinia enterocolitica subsp. enterocolitica 8081 genome; the 547 draft contigs from the Lactococcus lactis subsp. lactis KF147 genome against the 2.6 Mb long reference Lactococcus lactis subsp. lactis Il1403 genome; and the 158 draft contigs from the Sinorhizobium meliloti BL225C genome against the reference S. meliloti Rm1021 genome, which comprises three replicons of about 3.7 Mb, 1.7 Mb and 1.4 Mb, respectively. A test run of CONTIGuator with the default parameters was performed and compared with ABACAS, PGA4genomics, OSlay and Projector2; primer picking was compared for CONTIGuator, ABACAS and Projector2. The results of this test are presented in Table 1. With respect to ABACAS, CONTIGuator mapped 257 kbp more on average; since the pseudocontig map is corrected with the megablast output, there is also a substantial gain in the number of PCR primers obtained (21 more PCR pairs on average). With respect to PGA4genomics, CONTIGuator mapped 176 kbp more on average. CONTIGuator mapped 182 kbp less than OSlay; however, OSlay does not produce a set of primers. In the comparison with Projector2, CONTIGuator mapped 1841 kbp more: such a dramatic difference is mainly due to a putative malfunction of Projector2 when analyzing the genome of S. meliloti BL225C, although CONTIGuator mapped more base pairs for the other genomes as well. This was offset by the larger number of PCR primers generated by Projector2 (52 more PCR pairs on average), which could be due to the fact that both ABACAS and CONTIGuator ensure the uniqueness of the primers. Moreover, in the B. microti and S. meliloti tests, CONTIGuator ensured that no contig was assigned to more than one replicon, while the other programs assigned many contigs to more than one replicon. The ACT visualization, as shown in Figure 1(b, c, d, e, f, g), allows the user to see which portions of the contigs and of the reference genome are divergent and which are syntenic, giving a first glance at the structural features of the new genome; Figure 1f and 1g report two examples of ACT visualizations generated by CONTIGuator, showing experimental verification of the assembly prediction with the full genome sequence.
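One way to guarantee that a contig is never placed on more than one replicon is sketched below in Python. The text does not state which criterion CONTIGuator applies, so the rule used here (pick the replicon with the largest total aligned length) is an assumption, as are the function name `assign_contigs` and the simplified hit tuples.

```python
from collections import defaultdict


def assign_contigs(hits):
    """Assign every contig to exactly one replicon.

    hits: iterable of (contig_id, replicon_id, aligned_length) tuples,
          a simplified stand-in for parsed megablast hits.
    Assumed rule: the replicon with the largest summed aligned length
    wins; ties go to the replicon seen first.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for contig_id, replicon_id, aligned_length in hits:
        totals[contig_id][replicon_id] += aligned_length
    # Keep a single replicon per contig, so no contig is placed twice.
    return {
        contig_id: max(per_replicon, key=per_replicon.get)
        for contig_id, per_replicon in totals.items()
    }


if __name__ == "__main__":
    toy_hits = [
        ("contig_1", "chromosome", 500),
        ("contig_1", "plasmid_A", 200),
        ("contig_1", "plasmid_A", 200),
        ("contig_2", "plasmid_A", 300),
    ]
    print(assign_contigs(toy_hits))
```

The replicon names in the toy data are placeholders. Any other deterministic rule (best single hit, highest identity) would give the same guarantee, since the point is only that the mapping from contig to replicon is a function.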
Using the four complete genomes, we compared the three programs that are able to generate PCR primers (CONTIGuator, ABACAS, Projector2) in terms of how many gaps the generated set of PCR primers would putatively close (Table 1). We checked whether the relative position of the contigs on the reference genome was the same when the contigs were mapped on the complete genome: a gap was considered "putatively closed" when two contigs were mapped near each other on the reference genome and a PCR pair was designed between the two adjacent contigs. CONTIGuator performed better than ABACAS, since it led to 10 more putative gap closures; the main reason is that the higher number of PCR pairs generated (12) was able to close more gaps. The other gaps, which ABACAS could not close, were due to a different relative placement of the contigs on the reference genome, a placement that in CONTIGuator was automatically corrected by the blast-based contig profiling. In comparison with Projector2, CONTIGuator closed far more gaps in Y. enterocolitica and S. meliloti, slightly fewer in B. microti and fewer in L. lactis; it should be pointed out that the graphical output of Projector2 lacks the contig profiler approach of CONTIGuator, so no detailed structural features can be seen prior to genome finishing. Moreover, as pointed out earlier, ABACAS (and therefore also CONTIGuator) ensures the uniqueness of the primer pairs generated, thus putatively removing any ambiguous reactions. A higher number of gaps closed (putatively, in this simulation) could lead to a lower number of iterations (input contigs, CONTIGuator annealing, PCR reactions, new set of contigs) and could therefore strongly reduce the effort needed to close a genome. In the cases analyzed in this study, designing more PCR pairs in the first run of the program, although apparently contradictory, may reduce the time needed to close all the gaps in the draft genome.
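The "putatively closed" criterion can be expressed as a short Python sketch. This is an illustration of the definition given above, not the script used for Table 1: the function name, the representation of the complete genome as an ordered list of contig identifiers and the `circular` option are assumptions.

```python
def count_putatively_closed(true_order, primer_pairs, circular=True):
    """Count primer pairs that bridge contigs adjacent in the finished genome.

    true_order:   contig ids in the order they occur on the complete genome.
    primer_pairs: (left_contig, right_contig) tuples, one for each PCR pair
                  designed between two contigs placed next to each other
                  on the reference.
    circular:     assumed option that treats the replicon as circular, so
                  the last and first contigs also count as neighbours.
    """
    adjacent = set()
    for left, right in zip(true_order, true_order[1:]):
        adjacent.add(frozenset((left, right)))
    if circular and len(true_order) > 2:
        adjacent.add(frozenset((true_order[-1], true_order[0])))
    # A gap counts as putatively closed only if the two contigs joined
    # by the PCR pair are true neighbours on the complete genome.
    return sum(1 for left, right in primer_pairs
               if frozenset((left, right)) in adjacent)


if __name__ == "__main__":
    order = ["c1", "c2", "c3", "c4"]
    pairs = [("c1", "c2"), ("c2", "c4"), ("c4", "c1")]
    print(count_putatively_closed(order, pairs))
```

The sketch ignores contig orientation and the "near each other" distance threshold, which the text does not quantify.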
CONTIGuator is a powerful and fast algorithm for the finishing of bacterial genomes: it provides a larger PCR primer set able to close more gaps and gives clues on the relative position of the various contigs. Moreover, the CONTIGuator contig profiling provides a high resolution map (viewable with the popular ACT program) that highlights the regions of the reference genomes that diverge from the assembled contigs. CONTIGuator therefore represents, before the end of the finishing phase, a powerful method for investigating structural genomics based on draft genome sequences.
Availability and requirements
CONTIGuator is a Python script for Linux environments and can be downloaded from http://contiguator.sourceforge.net, under a GNU GPL v.3.0 license. It requires the Python interpreter with the BioPython package, the Perl interpreter, a copy of ABACAS (available here: http://abacas.sourceforge.net/), Blast+, MUMmer and primer3; all these programs must be reachable from the command line. The Artemis comparison tool is needed to view the output files. All the software packages were tested with their latest version, although no malfunction was reported with older versions. The program performance was tested on a normal desktop machine (Linux Ubuntu 10.04, 4 CPU Intel 2.50 GHz, 4 GB RAM), with run times in the order of a few minutes (about 6 minutes with primer picking and about 30 seconds without).
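Whether the external programs are reachable from the command line can be checked before a run with a few lines of Python. This sketch is not part of CONTIGuator; the executable names in `REQUIRED_TOOLS` are assumptions about a typical installation (for example, `blastn` for Blast+ and `primer3_core` for primer3) and may need adjusting.

```python
import shutil

# Assumed executable names; adjust them to match the local installation.
REQUIRED_TOOLS = [
    "perl",
    "blastn",
    "nucmer",
    "delta-filter",
    "show-tiling",
    "primer3_core",
]


def missing_tools(names=REQUIRED_TOOLS):
    """Return the executables that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not reachable from the command line: " + ", ".join(missing))
    else:
        print("All required executables were found.")
```

`shutil.which` mimics the shell lookup, so a tool reported as missing here would also fail when invoked by name from a script.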
Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides NC: The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res. 2006, 34: D332-334. 10.1093/nar/gkj145.
van Hijum SA, Zomer AL, Kuipers OP, Kok J: Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005, 33: W560-566. 10.1093/nar/gki356.
Zhao F, Hou H, Bao Q, Wu J: PGA4genomics for comparative genome assembly based on genetic algorithm optimization. Genomics. 2009, 94: 284-286. 10.1016/j.ygeno.2009.06.006.
Richter DC, Schuster SC, Huson DH: OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics. 2007, 23: 1573-1579. 10.1093/bioinformatics/btm153.
Assefa S, Keane TM, Otto TD, Newbold C, Berriman M: ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics. 2009, 25: 1968-1969. 10.1093/bioinformatics/btp347.
Kislyuk AO, Katz LS, Agrawal S, Hagen MS, Conley AB, Jayaraman P, Nelakuditi V, Humphrey JC, Sammons SA, Govil D: A computational genomics pipeline for prokaryotic sequencing projects. Bioinformatics. 26: 1819-1826.
Carver T, Berriman M, Tivey A, Patel C, Bohme U, Barrell BG, Parkhill J, Rajandream MA: Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database. Bioinformatics. 2008, 24: 2672-2676. 10.1093/bioinformatics/btn529.
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC Bioinformatics. 2009, 10: 421. 10.1186/1471-2105-10-421.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol. 2004, 5: R12. 10.1186/gb-2004-5-2-r12.
Koressaar T, Remm M: Enhancements and modifications of primer design program Primer3. Bioinformatics. 2007, 23: 1289-1291. 10.1093/bioinformatics/btm091.
Audic S, Lescot M, Claverie JM, Scholz HC: Brucella microti: the genome sequence of an emerging pathogen. BMC Genomics. 2009, 10: 352. 10.1186/1471-2164-10-352.
Lazaro FG, Rodriguez-Tarazona RE, Garcia-Rodriguez JA, Munoz-Bellido JL: Fluoroquinolone-resistant Brucella melitensis mutants obtained in vitro. Int J Antimicrob Agents. 2009, 34: 252-254. 10.1016/j.ijantimicag.2008.12.013.
Batzilla J, Hoper D, Antonenka U, Heesemann J, Rakin A: Complete genome sequence of Yersinia enterocolitica subsp. palearctica serogroup O:3. J Bacteriol.
Thomson NR, Howard S, Wren BW, Holden MT, Crossman L, Challis GL, Churcher C, Mungall K, Brooks K, Chillingworth T, et al: The complete genome sequence and comparative genome analysis of the high pathogenicity Yersinia enterocolitica strain 8081. PLoS Genet. 2006, 2: e206. 10.1371/journal.pgen.0020206.
Siezen RJ, Bayjanov J, Renckens B, Wels M, van Hijum SA, Molenaar D, van Hylckama Vlieg JE: Complete genome sequence of Lactococcus lactis subsp. lactis KF147, a plant-associated lactic acid bacterium. J Bacteriol. 192: 2649-2650.
Bolotin A, Mauger S, Malarme K, Ehrlich SD, Sorokin A: Low-redundancy sequencing of the entire Lactococcus lactis IL1403 genome. Antonie Van Leeuwenhoek. 1999, 76: 27-76. 10.1023/A:1002048720611.
Galardini M, Mengoni A, Brilli M, Pini F, Fioravanti A, Lucas S, Lapidus A, Cheng J-F, Goodwin L, Pitluck S, et al: Exploring the symbiotic pangenome of the nitrogen-fixing bacterium Sinorhizobium meliloti. BMC Genomics. 2011, 12: 253. 10.1186/1471-2164-12-253.
Galibert F, Finan TM, Long SR, Puhler A, Abola P, Ampe F, Barloy-Hubler F, Barnett MJ, Becker A, Boistard P, et al: The composite genome of the legume symbiont Sinorhizobium meliloti. Science. 2001, 293: 668-672. 10.1126/science.1060966.
Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009, 25: 1422-1423. 10.1093/bioinformatics/btp163.
We are grateful to Stephane Audic and Holger C Scholtz, who provided us with the unassembled contigs of Brucella microti, to Rakin Alexander and Julia Batzilla for the unassembled contigs of Yersinia enterocolitica, to Roland Siezen for the unassembled contigs of Lactococcus lactis, to Matteo Brilli for the critical reading of the manuscript, to Marco Fondi and Florent Lassalle for the test sessions and to Samuel Assefa for the feedback on ABACAS.
The authors declare that they have no competing interests.
MG wrote the script, tested it and contributed to drafting the manuscript. EGB, MB and AM conceived the idea and contributed to drafting the manuscript. All authors have given final approval of the version to be published.