A software pipeline for processing and identification of fungal ITS sequences
Source Code for Biology and Medicine volume 4, Article number: 1 (2009)
Fungi from environmental samples are typically identified to species level through DNA sequencing of the nuclear ribosomal internal transcribed spacer (ITS) region for use in BLAST-based similarity searches in the International Nucleotide Sequence Databases. These searches are time-consuming and regularly require a significant amount of manual intervention and complementary analyses. We here present software – in the form of an identification pipeline for large sets of fungal ITS sequences – developed to automate the BLAST process and several additional analysis steps. The performance of the pipeline was evaluated on a dataset of 350 ITS sequences from fungi growing as epiphytes on building material.
The pipeline was written in Perl and uses a local installation of NCBI-BLAST for the similarity searches of the query sequences. The variable subregion ITS2 of the ITS region is extracted from the sequences and used for additional searches of higher sensitivity. Multiple alignments of each query sequence and its closest matches are computed, and query sequences sharing at least 50% of their best matches are clustered to facilitate the evaluation of hypothetically conspecific groups. The pipeline proved to speed up the processing, as well as enhance the resolution, of the evaluation dataset considerably, and the fungi were found to belong chiefly to the Ascomycota, with Penicillium and Aspergillus as the two most common genera. The ITS2 was found to indicate a different taxonomic affiliation than did the complete ITS region for 10% of the query sequences, though this figure is likely to vary with the taxonomic scope of the query sequences.
The present software readily assigns large sets of fungal query sequences to their respective best matches in the international sequence databases and places them in a larger biological context. The output is highly structured to be easy to process, although it still needs to be inspected and possibly corrected for the impact of the incomplete and sometimes erroneously annotated fungal entries in these databases. The open source pipeline is available for UNIX-type platforms, and updated releases of the target database are made available biweekly. The pipeline is easily modified to operate on other molecular regions and organism groups.
The fungal kingdom is thought to comprise upwards of 1.5 million species, a figure that contrasts sharply with the 97,000 species of fungi described so far [1, 2]. The discrepancy between the estimated and the known fungal diversity is largely ascribable to the inconspicuous nature of fungal life; most species are readily visible, and thus amenable to collection, only when they form fruiting bodies or other propagation structures [3, 4]. Yet fungi are ubiquitously present in all biota as mycelia or other, non-hyphal life stages, and studies have revealed that fungi may account for as much as 90% of the living biomass of some ecosystems. The important ecological roles played by fungi in terms of nutrient recycling, symbiosis, and parasitism call for a deeper, taxonomy-oriented understanding of the mycoflora of the world, and one far more detailed than any fruiting-body-based field inventory could ever bring about [6, 7]. Mycologists of today thus face the daunting task of locating and identifying large sets of fungi from non-trivial substrates such as soil, decaying wood, and plant debris, often in the complete absence of any physical manifestation of the fungi present.
The last ten years have seen a surge in the use of DNA sequences as a means of identifying fungi from various substrates and in different life stages to species level [8, 9]. The most frequently sequenced region for such purposes is the internal transcribed spacer (ITS) region of the nuclear ribosomal DNA. Though it does not come without complications, this ~550 base-pair (bp), transcribed but non-coding region is present in upwards of 250 copies per cell, which makes it relatively straightforward to amplify even from scanty input material. The high level of variability in two of its three subregions (ITS1 and ITS2), coupled with the very low variability of the intercalary third subregion (5.8S), makes the region an advantageous choice for assignment of a query sequence to species level or a higher fungal taxon [13, 14]. Such inferences are typically achieved through similarity searches using BLAST against the International Nucleotide Sequence Databases (INSD: GenBank, EMBL, DDBJ).
While BLAST is a powerful and versatile software package, its use over the web is dependent on server load and web traffic intensity. Compiling and interpreting its output for multiple query sequences tends to be time-consuming and to involve a significant amount of manual intervention. It is furthermore often necessary to re-run a certain proportion of the query sequences after the removal of overly conserved sequence segments – here the nuclear small subunit (nSSU/16S/18S), the 5.8S, and the nuclear large subunit (nLSU/25S/28S) – in order to obtain fine-grained detail [13, 17–19]. It may even be necessary to compile a multiple alignment of certain query sequences and their respective best matches and analyse the alignment under more powerful optimality criteria than similarity – such as phylogenetic analysis – in order to obtain sufficient resolution [cf. 20–23]. Processing an authentic set of fungal sequences obtained through environmental sampling thus often proves lengthy and laborious. The present study attempts a remedy in the form of a Perl-based software pipeline for rapid BLAST processing of large sets (10–100,000s) of fungal query sequences. The pipeline streamlines functions such as automated ITS2-based BLAST runs for finer precision, the computation of multiple alignments for subsequent use, and the generation of comprehensible and easily analysed summaries of the results. The pipeline is designed to work in any UNIX-type environment (including MacOS X, Linux, and BSD) for which NCBI-BLAST, HMMER, and one or more of Clustal W, DIALIGN-TX, and MAFFT are available.
To evaluate the performance of the pipeline on a large, authentic dataset, 350 ITS sequences were obtained from fungi growing on buildings in Sweden. Fungal growth on building material causes significant economic loss and is frequently connected to the Sick Building Syndrome. The present sampling was carried out as a part of a larger study to characterize such fungal communities; detailed knowledge of the taxonomic affiliations of these fungi is likely to form a key element in assessing health risks caused by mass occurrence of fungi in moisture-damaged buildings.
Implementation and methods
The pipeline [Additional file 1] was written in Perl 5 and acts as a wrapper for NCBI-BLAST, HMMER, and Clustal W/DIALIGN-TX/MAFFT. By default, all fungal ITS sequences annotated as such in INSD are used as the target database (approximately 110,000 sequences as of December 2008). Optionally, the user may choose to include only sequences identified to species level in the database. Tailored, biweekly updated versions of these databases are made available by the authors at the project home page (see Availability and requirements).
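The optional restriction of the target database to fully identified sequences can be illustrated with a minimal sketch. It is written in Python for illustration rather than in the Perl of the pipeline, and it is not the authors' implementation. The marker strings in `UNIDENTIFIED_MARKERS`, the function names, and the example headers are all hypothetical; the text does not state which criteria the authors use to decide that an entry is identified to species level.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical markers of entries that are NOT identified to species level.
# The criteria actually used for the authors' databases are not given.
UNIDENTIFIED_MARKERS = ("uncultured", "unidentified", " sp.", "environmental sample")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_fully_identified(header: str) -> bool:
    """Return True if none of the assumed markers occurs in the header."""
    lowered = header.lower()
    return not any(marker in lowered for marker in UNIDENTIFIED_MARKERS)


def filter_identified(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep only the records whose header passes is_fully_identified."""
    return [(h, s) for h, s in read_fasta(lines) if is_fully_identified(h)]


example = [
    ">seq1 Penicillium chrysogenum",
    "ACGT",
    "TTGA",
    ">seq2 Penicillium sp. B12",
    "ACGA",
    ">seq3 uncultured fungus",
    "GGCC",
]
print(filter_identified(example))
```

A header-based filter of this kind is only as good as the annotations it reads, which is the same caveat the article raises for the database entries themselves.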
For each query sequence, blastn of the NCBI-BLAST package is run locally, and the result is parsed to retrieve the 15 (default; adjustable) best (topmost) matches in INSD. An attempt is then made to locate and extract the ITS2 from the same query sequence using HMMER and the Markov models employed by Nilsson et al. If successful, the ITS2 of the query sequence is used as the query in a new BLAST run against the same database, again with the 15 (default) best matches saved. The 15 best matches of the first BLAST run are aligned jointly with the query sequence using Clustal W (default), DIALIGN-TX, or MAFFT to present the sequences in a more integrative context than the pairwise alignments of the BLAST output. The default number of sequences to retain in the above steps was set to 15 as a product of the present focus on very similar sequences and the computational capacity at hand.
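The step of retaining the topmost matches per query can be sketched as follows, in Python rather than the Perl of the pipeline. The sketch assumes that the BLAST results are available in the standard 12-column tabular format (as produced by `blastall -m 8` or by `-outfmt 6` in BLAST+); the text does not say which output format the pipeline actually parses. The function name `top_hits` and the example identifiers are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_hits(tabular_lines: Iterable[str], n: int = 15) -> Dict[str, List[Tuple[str, float]]]:
    """Return up to n distinct best subjects per query, in BLAST's own order.

    Assumes 12 tab-separated columns per line: query id, subject id,
    % identity, alignment length, mismatches, gap opens, query start,
    query end, subject start, subject end, e-value, bit score.
    """
    hits: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for line in tabular_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # A subject may be reported several times (multiple local
        # alignments); only its first, topmost occurrence is kept.
        if any(seen == subject for seen, _ in hits[query]):
            continue
        if len(hits[query]) < n:
            hits[query].append((subject, bitscore))
    return dict(hits)


example = [
    "q1\ts1\t99.0\t500\t5\t0\t1\t500\t1\t500\t0.0\t900",
    "q1\ts1\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t150",
    "q1\ts2\t98.0\t500\t10\t0\t1\t500\t1\t500\t0.0\t880",
    "q1\ts3\t97.0\t500\t15\t0\t1\t500\t1\t500\t0.0\t860",
    "q2\ts2\t96.0\t480\t19\t0\t1\t480\t1\t480\t0.0\t700",
]
print(top_hits(example, n=2))
```

The sketch relies on BLAST listing matches from best to worst, which corresponds to the "best (topmost)" wording above. Running it once on the full-ITS results and once on the ITS2 results would give the two hit lists that the later steps compare.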
Upon completion of the BLAST runs, all query sequences that share at least 50% (default; adjustable) of their 15 best BLAST matches are clustered and aligned accordingly. Conspecific and closely related sequences are likely to be grouped into clusters in this step. Similarly, all sequences not a member of any such group are listed – in all likelihood, these singleton sequences represent species present only once among the query sequences. Ample screen output is given for each of the query sequences, and the results are summarized in a joint tab-separated file for viewing in any spreadsheet-type program (e.g., Open Office or Excel). The pipeline is released under the GNU-GPL software license version 2 and makes use of only freely available software. Although distributed over the web, the pipeline does not require an Internet connection to run.
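The grouping of query sequences by shared best matches can be sketched in Python (again, not the pipeline's own Perl code). Two points are assumptions that the text leaves open: the shared proportion is computed here relative to the shorter of the two hit lists, and groups are formed by single linkage, so that a query joins a cluster if it passes the threshold against any one member. The names `shared_fraction` and `cluster_queries` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shared_fraction(a: Set[str], b: Set[str]) -> float:
    """Proportion of shared matches, relative to the smaller set (assumed)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def cluster_queries(
    best_matches: Dict[str, Set[str]], threshold: float = 0.5
) -> Tuple[List[List[str]], List[str]]:
    """Single-linkage grouping of queries; returns (clusters, singletons)."""
    parent = {q: q for q in best_matches}

    def find(q: str) -> str:
        # Union-find lookup with path halving.
        while parent[q] != q:
            parent[q] = parent[parent[q]]
            q = parent[q]
        return q

    for a, b in combinations(best_matches, 2):
        if shared_fraction(best_matches[a], best_matches[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for q in best_matches:
        groups.setdefault(find(q), []).append(q)

    clusters = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    singletons = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, singletons


example = {
    "q1": {"a", "b", "c", "d"},
    "q2": {"a", "b", "x", "y"},
    "q3": {"x", "y", "z", "w"},
    "q4": {"m", "n"},
}
print(cluster_queries(example))
```

In the example, q1 and q3 share no matches but end up in the same group through q2. Whether such chaining is desirable depends on the data; a stricter linkage rule would split the group and is an equally plausible reading of the description above.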
The fungi were isolated from different kinds of building materials, e.g. wood, fibreboard, and linoleum mat, with varying extent of mould growth on the surface. The surface of the sample was scraped with a sterile needle and the scraped product was mixed with distilled water. The suspension was spread out on Petri dishes (Ø 9 cm) containing 2% malt extract agar (MEA) or dichloran 18% glycerol agar (DG18). The samples were incubated in daylight at room temperature (20°C). Subsequently, fungal colonies were sub-cultured to new Petri dishes (Ø 6 cm) as part of the cleaning process. A total of 350 sequences were obtained: DNA was extracted using the DNeasy Plant Mini Kit (QIAGEN, Hilden); amplified using READY-TO-GO PCR beads (QIAGEN) and the ITS1F and ITS4 primers; purified using the QiaQuick Spin Procedure Kit (QIAGEN); and sequenced using the ITS1F and ITS4 primers and the CEQ 2000XL DNA Analysis System (Beckman Coulter, Fullerton). Full details of the laboratory procedures are available in . The sequences were pooled and fed to the pipeline using its default settings. Taxonomic affiliations were decided by considering the results from both the full ITS region and the ITS2 region, including inspection of the alignments in non-trivial cases, with priority given to the ITS2 data where the results differed.
Results and discussion
The 350 query sequences from mould growth on building material were found to be distributed among 42 distinct groups, each composed of mutually similar sequences to the exclusion of the other groups, and 27 singletons that did not match any other query sequence closely. The vast majority (98%) of the fungal species recovered were found to belong to the Ascomycota, with a modest 2% of the sequences belonging to the Zygomycota (Fig. 1; Additional file 2). The two most common genera, Penicillium and Aspergillus, dominated the dataset, accounting for 58% and 16% of the sequences, respectively. The most frequently occurring Penicillium species were P. chrysogenum, P. corylophilum, P. glabrum, and P. commune. The three latter species have as a rule not been identified to species level – but rather only to genus level – in prior biodiversity studies of building material due to the considerable difficulties in telling them apart using morphological analysis [33, 34], a shortcoming that molecular data and the present software package are set to address. Aspergillus versicolor, A. niger, and A. ustus were the three most common species of the genus Aspergillus, which again represents a finer view than that presented in many surveys based on traditional techniques. Indeed, the view of the building mycoflora as revealed by the software pipeline shares many overall similarities with those from similar studies but stands out through the level of resolution obtained on the species level [36, 37]. The taxonomic and nomenclatural issues surrounding Penicillium and Aspergillus are far from trivial, however, and the lack of inclusive modern revisions and well-identified reference sequences is likely to have influenced the results of the present and many other efforts.
The evaluation dataset took two hours to run on a single-core 2.3 GHz Pentium 4 workstation. This is a considerable improvement in speed over any attempt at performing the corresponding BLAST runs over the web at INSD – particularly if complementary analyses have to be performed for some subset of the query sequences – and additional speed gains are to be attained if the pipeline is run on multiple-core processors. Manual inspection of the results revealed that the function to extract, and base the BLAST run on, the ITS2 gave a different taxonomic result than did the complete ITS region for about 10% of the cases (Additional file 2). This observation highlights the usefulness of excluding the very conserved genes encoding the ribosomal small subunit, 5.8S, and large subunit, particularly when sequences from species not present in the target database are used as query. The magnitude of the discrepancy between the taxonomic signal obtained from the ITS2 and from the entire ITS region (including any portions of the flanking genes) can be expected to vary among different groups of fungi.
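The comparison between the ITS2-based and the full-ITS-based results was made by manual inspection in the study. A first screening pass of the same kind could be sketched as below (in Python, for illustration). The sketch assumes that each query has already been reduced to the taxon name of its best match in each of the two runs; the function name, the dictionaries, and the example taxa are hypothetical and do not reproduce the evaluation dataset.

```python
from typing import Dict, List, Tuple


def discrepancy_rate(
    full_its: Dict[str, str], its2: Dict[str, str]
) -> Tuple[float, List[str]]:
    """Compare best-match taxa from the full ITS and the ITS2 runs.

    Only queries present in both dictionaries are compared, since the
    ITS2 cannot be extracted from every query sequence.
    Returns (fraction of differing queries, sorted list of those queries).
    """
    shared = [q for q in full_its if q in its2]
    differing = sorted(q for q in shared if full_its[q] != its2[q])
    rate = len(differing) / len(shared) if shared else 0.0
    return rate, differing


# Made-up example: ten queries, one of which changes its best match.
full_run = {f"q{i}": "Penicillium chrysogenum" for i in range(10)}
its2_run = dict(full_run)
its2_run["q3"] = "Penicillium commune"

rate, flagged = discrepancy_rate(full_run, its2_run)
print(f"{rate:.0%} of compared queries differ: {flagged}")
```

A string comparison of this sort flags differences in naming as well as genuine taxonomic conflicts (synonyms, for instance), so the flagged queries would still need the manual inspection described above.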
The largest shortcoming of the present pipeline, as indeed with all sequence-based identification procedures, is the wanting state of taxonomic coverage and reliability in the public sequence databases. These vary among different taxonomic groups but form a tangible problem for the fungal entries. A mere 13,300 (14%) of the 97,000 described species of fungi are represented by at least one fully identified ITS sequence in INSD, a number that, in turn, pales in comparison with the estimated 1.5 million extant species of fungi. More than 10% of the fungal INSD sequences under study have, furthermore, been shown to be incorrectly identified to species level [38, 39]. Adding to the complexity, 39% of the fungal ITS sequences in INSD are not identified to species level in the first place, and their sheer number often serves to mask the presence of explanatory, fully identified sequences in the BLAST hit list [40–42]. This is, however, a problem to which the present study offers a partial remedy in the form of the option to use only fully identified sequences in the BLAST reference database.
We see the presented pipeline as a way to speed up the processing of large numbers of query sequences and to obtain extended functionality helpful for complementary, in-depth analyses. As with BLAST itself, however, the pipeline should be used on the understanding that definite assignment of a query sequence to species level is not amenable to algorithmic interpretation other than in a trivial sense, and in consideration of the fact that additional analysis steps may be needed, particularly for problematic query sequences.
Availability and requirements
Project name: FungalITSPipeline
Project home page: http://www.emerencia.org/fungalitspipeline.html
Operating systems: UNIX-type (MacOS X, Linux, UNIX/BSD)
Programming language: Perl
Other requirements: NCBI-BLAST, HMMER, optionally at least one of Clustal W, DIALIGN-TX, and MAFFT
Licence: GNU GPL 2
Any restrictions to use by non-academics: None other than those imposed by GNU GPL 2
Notes: The pipeline is available at the above address together with its documentation, the evaluation dataset, and biweekly updated BLAST database files. All these files are also bundled as additional data files with this manuscript. The pipeline is straightforward to adapt for other DNA regions and organism groups; new BLAST databases need to be assembled and the ITS2 mode should be turned off (or HMMs for the new gene need to be constructed).
References
Hawksworth DL: The magnitude of fungal diversity: the 1.5 million species estimate revisited. Mycol Res. 2001, 105: 1422-1432. 10.1017/S0953756201004725.
Kirk PM, Cannon PF, Minter DW, Stalpers JA: Dictionary of the Fungi. 2008, Wallingford: CABI Publishing, 10th edition.
Horton TR, Bruns TD: The molecular revolution in ectomycorrhizal ecology: peeking in the black-box. Mol Ecol. 2001, 10: 1855-1871. 10.1046/j.0962-1083.2001.01333.x.
Hibbett DS: After the gold rush, or before the flood? Evolutionary morphology of mushroom-forming fungi (Agaricomycetes) in the early 21st century. Mycol Res. 2007, 111: 1001-1018. 10.1016/j.mycres.2007.01.012.
Moore D, Novak LA: Essential Fungal Genetics. 2002, New York: Springer-Verlag
Blackwell M, Hibbett DS, Taylor JW, Spatafora JW: Research coordination networks: a phylogeny for kingdom Fungi (Deep Hypha). Mycologia. 2006, 98: 829-837. 10.3852/mycologia.98.6.829.
Taylor DL, McCormick MK: Internal transcribed spacer primers and sequences for improved characterization of basidiomycetous orchid mycorrhizas. New Phytol. 2008, 177: 1020-1033. 10.1111/j.1469-8137.2007.02320.x.
Kõljalg U, Larsson K-H, Abarenkov K, Nilsson RH, Alexander IJ, Eberhardt U, Erland S, Høiland K, Kjøller R, Larsson E, Pennanen T, Sen R, Taylor AFS, Tedersoo L, Vrålstad T, Ursing BM: UNITE: a database providing web-based methods for the molecular identification of ectomycorrhizal fungi. New Phytol. 2005, 166: 1063-1068. 10.1111/j.1469-8137.2005.01376.x.
Seifert KA, Samson RA, deWaard JR, Houbraken J, Levesque CA, Molcalvo J-M, Louis-Seize G, Hebert PDN: Prospects for fungus identification using CO1 DNA barcodes, with Penicillium as a test case. Proc Natl Acad Sci USA. 2007, 104: 3901-3906. 10.1073/pnas.0611691104.
Hibbett DS, Nilsson RH, Snyder M, Fonseca M, Costanzo J, Shonfeld M: Automated phylogenetic taxonomy: an example in the homobasidiomycetes (mushroom-forming fungi). Syst Biol. 2005, 54: 660-668. 10.1080/10635150590947104.
Avis PG, Dickie IA, Mueller GM: A 'dirty' business: testing the limitations of terminal restriction fragment length polymorphism (TRFLP) analysis of soil fungi. Mol Ecol. 2006, 15: 873-882. 10.1111/j.1365-294X.2005.02842.x.
Vilgalys R, Gonzalez D: Organization of ribosomal DNA in the basidiomycete Thanathephorus praticola. Curr Genet. 1990, 18: 277-280. 10.1007/BF00318394.
Bruns TD, Shefferson RP: Evolutionary studies of ectomycorrhizal fungi: recent advances and future directions. Can J Bot. 2004, 82: 1122-1132. 10.1139/b04-021.
Nilsson RH, Kristiansson E, Ryberg M, Hallenberg N, Larsson K-H: Intraspecific ITS variability in the kingdom Fungi as expressed in the international sequence databases and its implications for molecular species identification. Evol Bioinform Online. 2008, 4: 193-201. [http://www.la-press.com/intraspecific-its-variability-in-the-kingdom-fungi-as-expressed-in-the-a823]
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res. 2008, 35: D25-D30.
Leaw SN, Chang HC, Sun HF, Barton R, Bouchara J-P, Chang TC: Identification of medically important yeast species by sequence analysis of the internal transcribed spacer regions. J Clin Microbiol. 2006, 44: 693-699. 10.1128/JCM.44.3.693-699.2006.
Hau J, Muller M, Pagni M: HitKeeper, a generic software package for hit list management. Source Code Biol Med. 2007, 2: 2. 10.1186/1751-0473-2-2.
Gollapudi R, Revanna KV, Hemmerich C, Schaack S, Dong Q: BOV – a web-based BLAST output visualization tool. BMC Genomics. 2008, 9: 414. 10.1186/1471-2164-9-414.
Nilsson RH, Larsson K-H, Ursing BM: galaxie – CGI scripts for sequence identification through automated phylogenetic analysis. Bioinformatics. 2004, 20: 1447-1452. 10.1093/bioinformatics/bth119.
Nilsson RH, Rajashekar B, Larsson KH, Ursing BM: galaxieEST: addressing EST identity through automated phylogenetic analysis. BMC Bioinformatics. 2004, 5: 87. 10.1186/1471-2105-5-87.
Giancarlo R, Siragusa A, Siragusa E, Utro F: A basic analysis toolkit for biological sequences. Algorithms Mol Biol. 2007, 2: 10. 10.1186/1748-7188-2-10.
Hanekamp K, Bohnebeck U, Beszteri B, Valentin K: PhyloGena – a user-friendly system for automated phylogenetic annotation of unknown sequences. Bioinformatics. 2007, 23: 793-801. 10.1093/bioinformatics/btm016.
Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14: 755-763. 10.1093/bioinformatics/14.9.755.
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 1997, 25: 4876-4882. 10.1093/nar/25.24.4876.
Subramanian AR, Kaufmann M, Morgenstern B: DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms Mol Biol. 2008, 3: 6. 10.1186/1748-7188-3-6.
Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005, 33: 511-518. 10.1093/nar/gki198.
Skorge TD, Eagan TML, Eide GE, Gulsvik A, Bakke S: Indoor exposures and respiratory symptoms in a Norwegian community sample. Thorax. 2005, 60: 937-942. 10.1136/thx.2004.025973.
Flannigan B: Air sampling for fungi in indoor environments. J Aerosol Sci. 1997, 28: 381-392. 10.1016/S0021-8502(96)00441-7.
Wall L, Christiansen T, Orwant J: Programming Perl. 2000, Sebastopol: O'Reilly Media, 3rd edition.
The Fungal ITS Pipeline. [http://www.emerencia.org/fungalitspipeline.html]
Paulus B, Nilsson RH, Hallenberg N: Phylogenetic studies in Hypochnicium (Basidiomycota), with special emphasis on species from New Zealand. New Zeal J Bot. 2007, 45: 139-150.
Beguin H, Nolard N: Mould biodiversity in homes I. Air and surface analysis of 130 dwellings. Aerobiologia. 1994, 10: 157-166. 10.1007/BF02459231.
Hyvärinen A, Reponen T, Husman T, Ruuskanen J, Nevalainen A: Compostion of fungal flora in mold problem houses determined with four different methods. Proceedings of the 6th International Conference on Indoor Air Quality and Climate (Particles, microbes, radon): 4–8 July 1993; Helsinki. Edited by: Kalliokoski P, Jantunen M, Seppänen O. 1993, University of Technology, 4: 273-278.
Pietarinen VM, Rintala H, Hyvärainen H, Lignell U, Karkkainen P, Nevalainen A: Quantitative PCR analysis of fungi and bacteria in building materials and comparison to culture-based analysis. J Environ Monit. 2008, 10: 655-663. 10.1039/b716138g.
Gravesen S, Nielsen PA, Iversen R, Nielsen KF: Microfungal contamination of damp buildings – examples of risk constructions and risk materials. Environ Health Perspect. 1999, 107: 505-508.
Hyvärinen A, Meklin T, Vepsäläinen A, Nevalainen A: Fungi and actinobacteria in moisture-damaged building materials – concentrations and diversity. Int Biodeterior Biodegrad. 2002, 49: 27-37. 10.1016/S0964-8305(01)00103-2.
Bridge PD, Roberts PJ, Spooner BM, Panchal G: On the unreliability of published DNA sequences. New Phytol. 2003, 160: 43-48. 10.1046/j.1469-8137.2003.00861.x.
Nilsson RH, Ryberg M, Kristiansson E, Abarenkov K, Larsson K-H, Kõljalg U: Taxonomic reliability of DNA sequences in public sequences databases: a fungal perspective. PLoS ONE. 2006, 1: e59. 10.1371/journal.pone.0000059.
Nilsson RH, Kristiansson E, Ryberg M, Larsson KH: Approaching the taxonomic affiliation of unidentified sequences in public databases – an example from the mycorrhizal fungi. BMC Bioinformatics. 2005, 6: 178. 10.1186/1471-2105-6-178.
Ryberg M, Nilsson RH, Kristiansson E, Jacobsson S, Larsson E: Mining metadata from unidentified ITS sequences in GenBank: a case study in Inocybe (Basidiomycota). BMC Evol Biol. 2008, 8: 50. 10.1186/1471-2148-8-50.
Ryberg M, Kristiansson E, Sjökvist E, Nilsson RH: An outlook on the fungal ITS sequences in GenBank and the introduction of a web-based tool for the exploration of fungal diversity. New Phytol. 2009, 181: 471-477. 10.1111/j.1469-8137.2008.02667.x.
Acknowledgements
Financial support from FORMAS to NH and GB, and from Kapten Carl Stenholms Donationsfond to RHN and MR is gratefully acknowledged. Figure 1 was designed in collaboration with Bomb Mediaproduktion. RHN acknowledges infrastructure support from the Fungi in Boreal Forests network. Five anonymous referees are acknowledged for valuable input on the manuscript and software. Only freely available software was used in the making of this study.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
RHN wrote the chief part of the pipeline, to which MR and EK contributed with subroutines and suggestions. GB and NH undertook the sampling and DNA sequencing of the fungi of the evaluation dataset and were responsible for the study design. All authors contributed to the manuscript and approved its final version.
Nilsson, R.H., Bok, G., Ryberg, M. et al. A software pipeline for processing and identification of fungal ITS sequences. Source Code Biol Med 4, 1 (2009) doi:10.1186/1751-0473-4-1
- Internal Transcribed Spacer
- Additional Data File
- Query Sequence
- Software Pipeline
- Evaluation Dataset