- Software review
- Open Access
MALINA: a web service for visual analytics of human gut microbiota whole-genome metagenomic reads
Source Code for Biology and Medicine volume 7, Article number: 13 (2012)
Whole-genome sequencing of environmental samples is producing data at an increasing pace. With the advent of high-throughput next-generation sequencing (NGS) technologies, deeper insight into the phylogenetic and functional composition of metagenomes has become feasible. The research community needs robust data analysis tools that allow efficient description of composition, classification and clustering coupled with comprehensive visualization of results, while providing means for comparative analysis within the context of all accumulated metagenomic data for the same type of environment. There are a number of existing web services (including CAMERA, IMG/M, MG-RAST, METAGENassist and others) and stand-alone applications (including QIIME and SmashCommunity) that integrate data visualization and statistical analysis functionalities with databases of publicly available metagenomic data, allowing users to compare their own samples with those of other researchers. However, the number of pre-loaded human gut metagenomic samples in the repertoire of these tools is limited.
The human gut microbiome is one of the most extensively studied subjects in metagenomic research. It is of particular interest to scientists because of its significant role in host health status. Representative reference genomes for many taxa have been sequenced, and a catalogue of prevalent gut microbial genes has already been established. MALINA exploits this accumulated knowledge in the form of reference sequence sets to provide a means for analyzing human gut whole-genome reads within the context of publicly available metagenomic datasets from around the world. The inclusion of a vast set of existing human gut metagenomic datasets allows users to check which datasets are most similar to their own data and, where metadata are present, to examine them and pose hypotheses based on the similarities. Table 1 compares the features of MALINA with those of existing software for analyzing human gut whole-genome metagenomic reads.
The MALINA workflow is shown in Figure 1. As input, MALINA accepts short nucleotide reads of 35 bp or longer. Color-space (SOLiD) reads, as well as long reads (such as Sanger and 454), are supported. To our knowledge, MALINA is the first metagenomic analysis web service supporting SOLiD color-space reads. This is beneficial, considering the increasing volume of metagenomic data sequenced using this technology. Files with reads are uploaded by FTP. Through the web interface, the user creates groups of samples, with each sample including one or more read sets. The files for a given sample are associated with the appropriate read sets and prepared for analysis.
The MALINA analysis pipeline characterizes metagenomic composition in two ways, phylogenetic and functional, by assessing the relative abundance of microbial genera and genes, respectively. In the case of genes, the total metabolic potential of all microbes is described. Quantitative profiling is based on alignment of reads to the reference genome and gene catalogues. The genome catalogue contains more than 440 genomes of human gastrointestinal bacteria obtained from HMP, NCBI and relevant studies of the human gut microbiome. The gene catalogue of prevalent human gut microbial genes discovered by the MetaHIT project consists of 3.3 million genes. After the reads are aligned to the reference set, the resulting position-wise coverage of each sequence is normalized by the sequence length and by the total number of reads in the read set. Summed over genera (for genomes) or over functional groups (clusters of orthologous groups, COGs, for genes), this yields the relative abundance of phylogenetic and functional units. For functional profiling, the COG annotation from the MetaHIT gene catalogue is used. Each metagenomic read set is thus described by two feature vectors.
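The normalization described above can be illustrated with a minimal Python sketch. This is not the MALINA implementation: the function name `relative_abundance`, its argument layout and the toy numbers are assumptions made for illustration. The sketch takes, for each reference sequence, the position-wise coverage summed over the sequence, divides it by the sequence length and by the total number of reads in the read set, and then sums the result over groups (genera for genomes, COGs for genes).

```python
from collections import defaultdict


def relative_abundance(coverage, lengths, total_reads, group_of):
    """Sum length- and depth-normalized coverage over groups.

    coverage:    dict, sequence id -> position-wise coverage summed over the sequence
    lengths:     dict, sequence id -> sequence length in bp
    total_reads: total number of reads in the read set
    group_of:    dict, sequence id -> genus (for genomes) or COG (for genes)

    All names here are hypothetical; they are not taken from MALINA.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    abundance = defaultdict(float)
    for seq_id, cov in coverage.items():
        # Normalize by sequence length, then by sequencing depth.
        normalized = cov / lengths[seq_id] / total_reads
        abundance[group_of[seq_id]] += normalized
    return dict(abundance)


# Toy example: two genomes from one genus, one genome from another.
coverage = {"g1": 1000.0, "g2": 500.0, "g3": 2000.0}
lengths = {"g1": 100, "g2": 50, "g3": 400}
group_of = {"g1": "Bacteroides", "g2": "Bacteroides", "g3": "Prevotella"}

profile = relative_abundance(coverage, lengths, 10, group_of)
print(profile)
```

The resulting dictionary plays the role of one feature vector for the read set. Whether the values are further rescaled (for example, to sum to 100%) is not stated in the text, so the sketch leaves them as they are.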
Feature vectors of the read sets selected by the user are subjected to statistical analysis and visualization: boxplots of the most abundant genera/COGs, principal components analysis (PCA), clustering (partitioning around medoids [PAM] and hierarchical clustering), multidimensional scaling (MDS) and between-class analysis (BCA) for the results of clustering. The PCA plot shows a 2D projection of the feature vectors along the directions of maximum variance in the data, with arrows showing the genera “drivers” that contribute most strongly to variation between samples. For COG groups, PCA plots are constructed separately for several functional classes: antibiotic resistance (COGs collected from the ARDB database), transcription factors (COGs selected from the total COG list according to their description, i.e. “transcription regulator/factor/repressor”) and vitamin metabolism (COGs from KEGG vitamin synthesis pathways). PAM clustering calculates the optimal number of clusters and assigns the samples to the clusters. BCA is a special case of PCA with respect to an instrumental variable (represented here by the cluster number) and produces a plot that highlights differences between the clusters. The second implemented clustering algorithm, hierarchical clustering, produces a dendrogram with a heatmap of abundance. Sample visualizations produced by MALINA are shown in Figure 2. In addition, the statistical analysis includes detection of genera and gene categories that discriminate among the clusters, using the Mann–Whitney test and the Random Forests algorithm, as well as taxa co-occurrence analysis based on correlation of the abundance values.
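To make the discriminatory-feature step more concrete, the following Python sketch computes the Mann–Whitney U statistic for the abundance of a single genus (or COG) in two clusters. It is only an illustration under stated assumptions: MALINA performs this analysis in R, the function names `midranks` and `mann_whitney_u` are hypothetical, and the sketch stops at the U statistic without deriving a p-value.

```python
def midranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples of abundance values.

    U_x counts the pairs (x_i, y_j) with x_i > y_j, with ties counted as 0.5.
    """
    n_x, n_y = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n_x])
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    u_y = n_x * n_y - u_x
    return u_x, u_y


# Toy abundances of one genus in two clusters (made-up numbers).
cluster_a = [1, 2, 3]
cluster_b = [4, 5, 6]
print(mann_whitney_u(cluster_a, cluster_b))
```

A U value close to 0 or to the product of the two sample sizes indicates that the genus separates the clusters almost completely; in practice a library routine that also returns a p-value would be applied to every genus and COG in turn.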
The plots can be downloaded as PDF files, and the relative abundance, clustering results and other output tables can be downloaded as tab-delimited text files. All features of MALINA are available without registration, via a guest account. Users can register a dedicated account for free to keep their uploaded data and results private and to receive “analysis complete” notifications by e-mail.
An important feature of MALINA is that the user’s own data can be co-analyzed with cohorts from large existing human gut metagenomic studies: 85 Illumina samples and 37 Sanger samples from the MetaHIT study, as well as 139 Illumina samples from HMPDACC and 96 SOLiD samples from a new, previously unpublished Russian metagenomic study. The clustering functionality is thus of particular interest to researchers exploring human gut microbiota in relation to the concept of enterotypes across a large number of samples. A stand-alone analysis of the pre-loaded datasets is also available.
The user interface is implemented using the Ext JS framework. Read alignment is performed using Bowtie. In the interest of performance, MALINA does not filter reads by raw quality score, as our experience showed that such filtering does not significantly increase the fraction of mapped reads. However, users can perform this preprocessing manually before upload. Coverage statistics are calculated using BEDtools. Statistical analysis is implemented in R using the ade4, cluster, ecodist, fpc and randomForest packages. The pipeline steps are integrated using an Oracle database, the Microsoft .NET framework and Python.
MALINA offers an easy and intuitive way to infer metagenomic composition from reads and to analyze the similarity of samples and their organization into clusters within the global context of human gut metagenomic datasets. Its features include statistical analysis methods such as clustering and group comparison, as well as illustrative visualizations of phylogenetic and functional composition. The support for color-space SOLiD reads is a unique feature that makes MALINA a particularly valuable service to the growing community of researchers using SOLiD technology for metagenomic analysis.
The reference gene catalogue used in MALINA will be updated regularly as new versions of the MetaHIT data become available. In the future, additional detailed metadata will be associated with the samples, allowing users to check whether newly sequenced samples are similar to certain groups distinguished by medical, ethno-geographic or dietary factors. Further development of the web service will include updates to the database of human gut samples from the Russian population as well as from other new studies. Support for types of environment other than the human gut, together with additional methods for statistical analysis and visualization, will also be added.
Availability and requirements
Project name: MALINA
Operating system: platform independent web site
Other requirements: None
Any restrictions to use by non-academics: None
This work was supported by State Contracts 16.512.11.2111, 16.552.11.7034 and RFBR Grant 12-07-90008.
HMPDACC: Human Microbiome Project Data Acquisition and Coordination Center
PCA: Principal Components Analysis
Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M: CAMERA: a community resource for metagenomics. PLoS Biol. 2007, 5: e75-10.1371/journal.pbio.0050075.
Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, Dalevi D, Chen IA, Grechkin Y, Dubchak I, Anderson I, Lykidis A, Mavromatis K, Hugenholtz P, Kyrpides NC: IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2008, 36: D534-D538.
Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA: The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008, 9: 386-10.1186/1471-2105-9-386.
Arndt D, Xia J, Liu Y, Zhou Y, Guo AC, Cruz JA, Sinelnikov I, Budwill K, Nesbo CL, Wishart DS: METAGENassist: a comprehensive web server for comparative metagenomics. Nucleic Acids Res. 2012, 40: 88-95. 10.1093/nar/gkr734.
Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Pena AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, Mcdonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R: QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010, 7: 335-336. 10.1038/nmeth.f.303.
Arumugam M, Harrington ED, Foerstner KU, Raes J, Bork P: SmashCommunity: a metagenomic annotation and analysis tool. Bioinformatics. 2010, 26: 2977-2978. 10.1093/bioinformatics/btq536.
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, Mende DR, Li J, Xu J, Li S, Li D, Cao J, Wang B, Liang H, Zheng H, Xie Y, Tap J, Lepage P, Bertalan M, Batto JM, Hansen T, Le Paslier D, Linneberg A, Nielsen HB, Pelletier E, Renault P: A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010, 464: 59-65. 10.1038/nature08821.
Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science. 1997, 278: 631-637. 10.1126/science.278.5338.631.
Kaufman L, Rousseeuw PJ: Finding groups in data: An introduction to Cluster Analysis. 1990, New York: Wiley
Torgerson WS: Theory & methods of scaling. 1958, New York: Wiley
Dray S, Dufour AB: The ade4 package: implementing the duality diagram for ecologists. J Stat Softw. 2007, 22 (4): 1-20.
KEGG: Kyoto Encyclopedia of Genes and Genomes. http://www.genome.jp/kegg.
Wilcoxon F: Individual comparisons by ranking methods. Biom Bull. 1945, 1 (6): 80-83. 10.2307/3001968.
Breiman L: Random forests. Mach Learn. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.
Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, Mende DR, Fernandes GR, Tap J, Bruls T, Batto JM, Bertalan M, Borruel N, Casellas F, Fernandez L, Gautier L, Hansen T, Hattori M, Hayashi T, Kleerebezem M, Kurokawa K, Leclerc M, Levenez F, Manichanh C, Nielsen HB, Nielsen T, Pons N, Poulain J, Qin J, Sicheritz-Ponten T, Tims S: Enterotypes of the human gut microbiome. Nature. 2011, 473: 174-180. 10.1038/nature09944.
Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-10.1186/gb-2009-10-3-r25.
Quinlan A, Hall I: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26: 841-842. 10.1093/bioinformatics/btq033.
R Development Core Team: R: A Language and Environment for Statistical Computing. 2010, Vienna, Austria: R Foundation for Statistical Computing
Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K: Cluster: cluster analysis basics and extensions. http://cran.r-project.org/web/packages/cluster/index.html.
Goslee SC, Urban DL: The ecodist package for dissimilarity-based analysis of ecological data. J Stat Softw. 2007, 22 (7): 1-19.
Hennig C: fpc: Flexible procedures for clustering. http://cran.r-project.org/web/packages/fpc/index.html.
Liaw A, Wiener M: Classification and regression by randomForest. R News. 2002, 2 (3): 18-22.
We thank Prof. Vadim Govorun, who participated in study design and coordination. Russian metagenomic samples were sequenced by the Genomic Center of the Research Institute of Physico-Chemical Medicine.
The authors declare that they have no competing interests.
ESK, OVS, AKL and IYK extracted metagenomic DNA, prepared the libraries for sequencing and performed sequencing on SOLiD 4, yielding the read sets used as part of the web service. DGA, AVT and MSB designed the pipeline, performed coding and processed the sequenced data. IAA designed the database and developed the web interface. ASP maintained the database and developed the statistical analysis and visualization module. AVT, AVP and ASP wrote the manuscript text. All authors read and approved the final manuscript.