SNIT: SNP identification for strain typing
© Satya et al; licensee BioMed Central Ltd. 2011
Received: 31 May 2011
Accepted: 8 September 2011
Published: 8 September 2011
With ever-increasing numbers of microbial genomes being sequenced, efficient tools are needed to perform strain-level identification of any newly sequenced genome. Here, we present the SNP identification for strain typing (SNIT) pipeline, a fast and accurate software system that compares a newly sequenced bacterial genome with other genomes of the same species to identify single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels). Based on this information, the pipeline analyzes the polymorphic loci present in all input genomes to identify the genome that has the fewest differences with the newly sequenced genome. Similarly, for each of the other genomes, SNIT identifies the input genome with the fewest differences. Results from five bacterial species show that the SNIT pipeline identifies the correct closest neighbor with 75% to 100% accuracy. The SNIT pipeline is available for download at http://www.bhsai.org/snit.html
Rapid and accurate identification of an infectious agent is of the utmost importance for the surveillance and treatment of infectious diseases. Traditionally, strain typing has been performed using assays that probe a few previously known polymorphic loci. However, due to the inherent limitations of using only a few loci, these methods offer low specificity.
Because of the rapid decrease in costs of genome sequencing, strain typing can now be performed in silico by first sequencing the sample, and then comparing the genome sequence with other available genomes of the same species to identify the closest strain. This approach has the potential to offer a much higher specificity because it uses the entire genome rather than a few pre-selected loci. Moreover, a comprehensive listing of all polymorphisms in a newly sequenced genome might also be useful in predicting the virulence or pathogenicity of the new strain.
Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation. Many previous methods have used "in-house" pipelines to identify and catalog the SNPs between pathogens of the same species [2, 3]. However, with the exception of SNPsFinder and inGAP, these pipelines are seldom publicly available. The SNPsFinder pipeline is a Web-based application that requires users to upload the genome sequences that need to be compared, which might be time consuming when a large number of genomes are involved. In addition, the use of a public server is not desirable if confidentiality of the data is a concern. The inGAP pipeline provides many useful functionalities for the analysis of next-generation sequencing data; however, its SNP identification routines do not scale well with the number of genomes because of their reliance on multiple sequence alignments. In our comparative investigation, inGAP successfully produced SNPs for four Shigella flexneri genomes, but repeatedly crashed when run on seven Burkholderia mallei genomes (the Results Section contains details of the configuration of the systems on which these comparisons were performed).
Here, we present the SNP Identification for Strain Typing (SNIT) pipeline, a computationally efficient, light-weight application that analyzes multiple genomes and identifies SNPs and small indels. The pipeline has many advantages: 1) it is a stand-alone application with a graphical user interface (GUI) that runs on the user's workstation, thus eliminating issues of data confidentiality; 2) it is accurate, fast, and highly scalable, owing to the use of pairwise alignments to achieve the basic functionality of SNP finding; and 3) it automatically identifies the closest neighbor for each genome without the need for manual processing of the SNP data.
The polymorphisms from the individual pairwise alignments are then combined into a single table that contains the position of each polymorphism in the reference genome and the individual variants in each of the input genomes. In compiling these tables, any position in the query genome that is not part of a filtered alignment with the reference is considered as missing (i.e., part of a large insertion or deletion) in the query genome. Various filters can be applied in building this table, including requirements on the length of conserved sequence on either side of a polymorphism and the selection of only those polymorphisms that are present in all input genomes. The numbers of differentiating SNP/indel loci between each pair of input genomes are computed by comparing the corresponding columns in this table. For each input genome, the pipeline analyzes the polymorphic loci present in all input genomes and reports the genome with the fewest differences as the closest neighbor.
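The closest-neighbor step described above can be sketched in a few lines. This is a minimal illustration in Python, not SNIT's actual Perl implementation: the table layout (reference position mapped to one variant per genome), the `MISSING` marker, the genome names, and the function names are all hypothetical, and the sketch assumes that only loci present in every input genome are compared.

```python
from itertools import combinations

MISSING = "-"  # hypothetical marker: position not covered by a filtered alignment


def shared_loci(table):
    """Keep only loci at which every genome has a variant call (no missing data)."""
    return {pos: row for pos, row in table.items() if MISSING not in row.values()}


def pairwise_differences(table):
    """Count differing loci for every pair of genomes by comparing their columns."""
    loci = shared_loci(table)
    genomes = sorted(next(iter(table.values())))
    diffs = {}
    for a, b in combinations(genomes, 2):
        n = sum(1 for row in loci.values() if row[a] != row[b])
        diffs[(a, b)] = diffs[(b, a)] = n
    return diffs


def closest_neighbors(table):
    """For each genome, report the other genome with the fewest differences.

    Ties are broken by genome name, which is an arbitrary choice of this sketch.
    """
    genomes = sorted(next(iter(table.values())))
    diffs = pairwise_differences(table)
    return {
        g: min((h for h in genomes if h != g), key=lambda h: diffs[(g, h)])
        for g in genomes
    }


# Toy table: reference position -> variant observed in each genome.
table = {
    101: {"ref": "A", "s1": "A", "s2": "G", "new": "G"},
    250: {"ref": "C", "s1": "T", "s2": "C", "new": "C"},
    300: {"ref": "G", "s1": "G", "s2": MISSING, "new": "T"},  # skipped: missing in s2
    412: {"ref": "T", "s1": "T", "s2": "T", "new": "C"},
}

diffs = pairwise_differences(table)
neighbors = closest_neighbors(table)
print(neighbors["new"])  # closest neighbor of the newly sequenced genome
```

In this toy table, locus 300 is excluded because it is missing in one genome, so the comparison rests on the three remaining loci.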
Accuracy and efficiency with draft and complete genomes
Input parameters used for testing SNIT: minimum MUMmer cluster length, minimum MUMmer exact match, maximum MUMmer gap, minimum large indel size, and minimum conserved flank length.
Summary of the results for five different bacterial species
The SNIT pipeline correctly identified the closest neighbors for 100% of the genomes in four out of the five test cases, including clonal species, such as Bacillus anthracis and Francisella tularensis. For the fifth test case, B. pseudomallei, the accuracy was 75% (15 out of 20). The lower accuracy for B. pseudomallei is not surprising, because the strains of this species are highly divergent, with horizontal transfer playing a significant role in their evolutionary history. A more sophisticated approach than simple SNP and indel counts would be necessary for accurate typing of such highly divergent species as B. pseudomallei. The details of the genomes and phylogenies used in these comparisons are provided in Additional file 1.
Accuracy with next-generation sequencing data
To test the applicability of SNIT to assemblies generated from next-generation sequencing (NGS) data, we selected the recently published Yersinia pestis KIM D27 genome. The Y. pestis KIM D27 strain is a derivative of the Y. pestis KIM 10 strain (accession no. NC_004088). The Y. pestis KIM D27 draft genome (accession no. ADDC00000000) was generated from a hybrid assembly of reads from the 454 XLR Titanium and Illumina Genome Analyzer IIx platforms. We configured a SNIT run with a total of 21 Y. pestis genomes, which included both draft and finished genomes. In the first run, we selected the Y. pestis KIM D27 draft genome as the reference. In this run, SNIT correctly identified the Y. pestis KIM 10 strain as the closest neighbor for Y. pestis KIM D27. Next, we repeated the run with Y. pestis CO92 selected as the reference genome. Again, SNIT correctly identified Y. pestis KIM 10 as the closest neighbor. These results suggest that the SNIT pipeline can be applied to assemblies generated from NGS data.
Performance on larger data sets
To test the efficiency of the pipeline on even larger data sets, we ran SNIT with 50 arbitrarily selected Escherichia coli genomes downloaded from PATRIC. For these 50 genomes, the pipeline completed the analysis (with the TRF option selected) in 145 min. However, nearly 110 of these 145 min were spent in running TRF on the input genomes. The pipeline completed the analysis in less than 32 min without the TRF option. These results indicate that the SNIT pipeline can handle large data sets of 50 (or more) genomes.
In principle, the SNIT pipeline can be applied to contigs obtained from the sequencing of clinical samples, to perform strain-level identification of the pathogens present in the sample. The accuracy of such analysis will depend on the fraction of the target pathogen's genome covered by the contigs and the overall diversity among the different strains of the pathogen. However, the provided options to filter low-quality bases should reduce the effect of sequencing errors and, because SNIT's SNP identification is relative to the compared sequenced genomes, any remaining sequencing errors in the target sequence should not constitute a significant problem.
The efficiency of the SNIT pipeline stems from the use of pairwise alignments based on exact matches. However, this approach limits the application of the pipeline to bacterial and eukaryotic genomes. Due to the high variability in viral genomes, multiple genome alignments, possibly in the amino acid domain, will be necessary to identify discriminative polymorphisms for strain identification. Similar to other reference-based pairwise alignment approaches, such as SNPsFinder, the SNIT pipeline can only report SNP loci that can be mapped to the reference genome. While this capability is sufficient for strain typing, it should be noted that the pipeline is not intended to provide a comprehensive list of all SNPs among the input genomes. For instance, in the case of two genomes that share a large insertion compared with the reference genome, the variations within this large insertion would not be reported by SNIT unless one of them was selected as the reference. Hence, the SNIT pipeline is not ideal for use with strains with significant contributions from large insertions, deletions, or horizontal transfer events in their evolutionary history.
In general, we do not expect the performance of the pipeline to be drastically different on NGS data. While it is true that the error rate is higher for NGS data, sequencing errors should only have minimal, second-order effects on the overall results. This is because SNIT performs and reports the results of relative analysis, and it is highly unlikely that the same sequencing error would be repeated in two genomes, to make them appear closer than they should be. In addition, SNIT provides options to ignore variations at low-quality bases and at either end of contigs, which would help eliminate at least some of the sequencing errors from the analysis.
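The two filtering options mentioned above (ignoring variations at low-quality bases and near either end of a contig) can be illustrated with a short sketch. This is an assumption-laden example in Python rather than SNIT's own code: the `Variant` record, the `filter_variants` function, and the default thresholds (`min_quality=20`, `end_margin=50`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    contig: str        # name of the contig carrying the variant
    position: int      # 0-based position within the contig
    base_quality: int  # quality score of the variant base


def filter_variants(variants, contig_lengths, min_quality=20, end_margin=50):
    """Drop variants at low-quality bases or within end_margin bases of a contig end.

    Both thresholds are illustrative defaults, not values taken from SNIT.
    """
    kept = []
    for v in variants:
        length = contig_lengths[v.contig]
        near_end = v.position < end_margin or v.position >= length - end_margin
        if v.base_quality >= min_quality and not near_end:
            kept.append(v)
    return kept


contig_lengths = {"contig1": 1000}
variants = [
    Variant("contig1", 10, 40),   # dropped: too close to the contig start
    Variant("contig1", 500, 10),  # dropped: low base quality
    Variant("contig1", 500, 35),  # kept
    Variant("contig1", 980, 40),  # dropped: too close to the contig end
]
kept = filter_variants(variants, contig_lengths)
print(len(kept))
```

Filtering before the pairwise comparison keeps likely sequencing errors from being counted as differences between genomes.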
The results presented here indicate that the SNIT pipeline is highly accurate in identifying the closest neighbor even in cases of clonal species, such as Bacillus anthracis, Francisella tularensis, and B. mallei. Therefore, the pipeline can be useful as a rapid, automated tool for identifying the closest neighbor of a newly sequenced genome. The SNP identification modules from SNIT have been incorporated as part of the TOFI and TOPSI pipelines for designing pathogen diagnostic assays with strain-specific signatures.
Availability and requirements
Project name: SNIT
Project home page: http://www.bhsai.org/snit.html
Operating systems: Linux
Programming language: Perl
Other requirements: MUMmer 3.22 or greater, BioPerl, Tandem Repeat Finder (TRF) and Java Runtime Environment (JRE) 1.5 or greater
This work was sponsored by the U.S. DoD High Performance Computing Modernization Program, under the High Performance Computing Software Applications Institutes Initiative.
The opinions and assertions contained herein are the private views of the authors and are not to be construed as official or as reflecting the views of the U.S. Army or the U.S. Department of Defense. This paper has been approved for public release with unlimited distribution.
- Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, Urwin R, Zhang Q, Zhou J, Zurth K, Caugant DA, et al: Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci USA. 1998, 95: 3140-3145. 10.1073/pnas.95.6.3140.
- Foster JT, Beckstrom-Sternberg SM, Pearson T, Beckstrom-Sternberg JS, Chain PS, Roberto FF, Hnath J, Brettin T, Keim P: Whole-genome-based phylogeny and divergence of the genus Brucella. J Bacteriol. 2009, 191: 2864-2870. 10.1128/JB.01581-08.
- Pearson T, Giffard P, Beckstrom-Sternberg S, Auerbach R, Hornstra H, Tuanyok A, Price EP, Glass MB, Leadem B, Beckstrom-Sternberg JS, et al: Phylogeographic reconstruction of a bacterial species with high levels of lateral gene transfer. BMC Biol. 2009, 7: 78. 10.1186/1741-7007-7-78.
- Song J, Xu Y, White S, Miller KW, Wolinsky M: SNPsFinder--a web-based application for genome-wide discovery of single nucleotide polymorphisms in microbial genomes. Bioinformatics. 2005, 21: 2083-2084. 10.1093/bioinformatics/bti176.
- Qi J, Zhao F, Buboltz A, Schuster SC: inGAP: an integrated next-generation genome analysis pipeline. Bioinformatics. 2010, 26: 127-129. 10.1093/bioinformatics/btp615.
- Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999, 27: 573-580. 10.1093/nar/27.2.573.
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol. 2004, 5: R12. 10.1186/gb-2004-5-2-r12.
- Champion MD, Zeng Q, Nix EB, Nano FE, Keim P, Kodira CD, Borowsky M, Young S, Koehrsen M, Engels R, et al: Comparative genomic characterization of Francisella tularensis strains belonging to low and high virulence subspecies. PLoS Pathog. 2009, 5: e1000459. 10.1371/journal.ppat.1000459.
- Larsson P, Elfsmark D, Svensson K, Wikstrom P, Forsman M, Brettin T, Keim P, Johansson A: Molecular evolutionary consequences of niche restriction in Francisella tularensis, a facultative intracellular pathogen. PLoS Pathog. 2009, 5: e1000472. 10.1371/journal.ppat.1000472.
- Van Ert MN, Easterday WR, Huynh LY, Okinaka RT, Hugh-Jones ME, Ravel J, Zanecki SR, Pearson T, Simonson TS, U'Ren JM, et al: Global genetic population structure of Bacillus anthracis. PLoS ONE. 2007, 2: e461. 10.1371/journal.pone.0000461.
- Ye C, Lan R, Xia S, Zhang J, Sun Q, Zhang S, Jing H, Wang L, Li Z, Zhou Z, et al: Emergence of a new multidrug-resistant serotype X variant in an epidemic clone of Shigella flexneri. J Clin Microbiol. 2010, 48: 419-426. 10.1128/JCM.00614-09.
- Losada L, Varga JJ, Hostetler J, Radune D, Kim M, Durkin S, Schneewind O, Nierman WC: Genome sequencing and analysis of Yersina pestis KIM D27, an avirulent strain exempt from select agent regulation. PLoS ONE. 2011, 6: e19054. 10.1371/journal.pone.0019054.
- Snyder EE, Kampanya N, Lu J, Nordberg EK, Karur HR, Shukla M, Soneja J, Tian Y, Xue T, Yoo H, et al: PATRIC: the VBI PathoSystems Resource Integration Center. Nucleic Acids Res. 2007, 35: D401-406. 10.1093/nar/gkl858.
- Vijaya Satya R, Zavaljevski N, Kumar K, Bode E, Padilla S, Wasieloski L, Geyer J, Reifman J: In silico microarray probe design for diagnosis of multiple pathogens. BMC Genomics. 2008, 9: 496. 10.1186/1471-2164-9-496.
- Vijaya Satya R, Kumar K, Zavaljevski N, Reifman J: A high-throughput pipeline for the design of real-time PCR signatures. BMC Bioinformatics. 2010, 11: 340. 10.1186/1471-2105-11-340.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.