MADAM - An open source meta-analysis toolbox for R and Bioconductor

Background Meta-analysis is a major theme in biomedical research. In the present paper we introduce a package for R and Bioconductor that provides useful tools for performing this type of work. One idea behind the development of MADAM was that many meta-analysis methods, which are available in R, are not able to use the capacities of parallel computing yet. In this first version, we implemented one meta-analysis method in such a parallel manner. Additionally, we provide tools for combining the results from a set of methods in an ensemble approach. Functionality for visualization of results is also provided. Results The presented package enables the carrying out of meta-analysis either by providing functions directly or by wrapping them to existing implementations. Overall, five different meta-analysis methods are now usable through MADAM, along with another three methods for combining the corresponding results. Visualizing the results is eased by three included functions. For developing and testing meta-analysis methods, a mock up data generator is integrated. Conclusions The use of MADAM enables a user to focus on one package, in turn enabling them to work with the same data types across a set of methods. By making use of the snow package, MADAM can be made compatible with an existing parallel computing infrastructure. MADAM is open source and freely available within CRAN http://cran.r-project.org.


Background
R http://www.r-project.org and the associated Bioconductor [1] project have become one of the most widespread tools for bioinformatics in recent years. At approximately the same time, dozens of newly developed meta-analysis methods have been published [2][3][4][5][6]. For most of these methods, an implementation exists, but some still lack publicly available implementions in R. In the present paper, we present the MADAM package, which provides new implementations of two methods [5,6]. In addition, we also implemented wrappers to the existing meta-analysis methods [2,4].
Since new methods and algorithms need to be tested and validated as well, we integrated a validation model for generating simulated data. For integrating results of different meta-analysis methods using an ensemble approach, three of these methods are also implemented in MADAM. For visualizing the meta-analysis results, we implemented three further methods.
The Meta-Analysis Data Aggregation Methods (MADAM) toolbox is helpful to both developers of meta-analysis methods as well as to biologists, physicians, and bioinformaticians who conduct research across studies. Here, we present version 1.0 of the package, in turn describing a sample work flow and illustrating its use via some examples. MADAM is available using CRAN, and can be installed using R itself: > install.packages("MADAM") After the installation it can be loaded into the current workspace: > library(MADAM)

Creation of random validation data
The implementation and validation of meta-analysis methods are eased when referring to data with known content. If the rate of true and false findings can be assessed, parameterizing and fine tuning can be facilitated more effectively. The MADAM package facilitates the creation of mock up data in a simple way. First, we create a sample set of studies that will be re-used further on in the present paper. We assume to have three studies with two experimental factors ("group 1", "group 2"), and a number of 10, 5, and 27 samples in group 1, respectively 17, 6, and 33 in the second group. Furthermore, we will only use 100 features to be simulated, for performance and demonstration reasons. We additionally need a data structure to store the class labels within, which will be referred to as cl further on. A vector cl.val contains the labels that are being used.
To make the example reproducible we set a seed first. > set.seed(1234) > gg <-100 > ii <-3 > kk <-c(10, 17, 5, 6, 27, 33) > cl.val <-c(1, 2) > A <-generateRandomMAData(g = gg, i = ii, k = kk) > cl <-lapply(A, function(a) { + factor(as.numeric(a$group)) The list A now contains a list of ExpressionSet objects that can be re-used for further analysis in the MADAM package. We now can also see that five features are up regulated in the second group (U) and another five are down regulated (D).

Checking the quality of the study data
When it comes to biomedical data, one is often confronted with data thate are of imperfect quality. We included a set of methods for dealing with this problem in the present package. In the current implementation, two methods are provided in order to deal with the problems of data with no variance and missing values for a feature of interest.
Sometimes variance of a feature is zero. Such non existing variation is biological highly unlikely and might indicate problematic data. Since in standard single data set analysis features with a low variance are likely to be filtered they normally do not cause any problems. In the case of meta-analysis, where typically no such filtering is applied, these seldom events may cause problems for further analysis. Performing a t-test would result in non numeric results, since the standard error would be zero as well. We, therefore, added a method corrVar that enables a user to correct for variances of zero. For this correction, we add a vector θ to the feature value a, where θ~N(μ, s), for each group separately. The default for μ is zero, and s is the mean of all of the features' standard deviations within the corresponding group.
To simulate this behavior, we manually set some of the entries in A to have a variance of zero: > Another problem that often occurs in the data of biomedical measurements are missing values. The sources of missing values are widespread, potentially originating from measurements being below the limit of detection or due to other technical issues. Simply not having measured a single feature by an assay is another potential source for missing values. In any case, we distinguish here between all the measurements missing for a certain feature (indicating a problem of a systematic nature) or just a sparse distribution of missing values through out the whole frame of measurements. While we do not yet provide solutions for the first type of error in the MADAM package, we do provide functions for dealing with sparsely distributed missing values.
The function corrNA provides two different imputation approaches for missing values. The first method substitutes missing values up to a ratio of r within one group. For this substitution, we replace all of the missing values again by θ, but this time μ being the feature's mean value within the group, and s being the standard deviation within the group. The other option for detecting features with replaceable missing values, is to allow all but a number of j features to be missing within a group, which of course can be a rather soft criterion. For both cases, we remove the feature from further analysis if the criterion is not met. For demonstration purposes, we replace some of the values within the first ExpressionSet with missing values (detonated as NA in R).
> As additional method for imputing values, a knn based imputation method was enabled in MADAM, as implemented in the impute package [7]. This methods imputes missing values based on information from neighbouring features. This method is selected when calling the corrNA function with the method="knn" parameter: > A.knn <-corrNA(A, method="knn", exclude = TRUE) starting NA correction... For more details on this imputation method, see the corresponding help file in R.

Significance based meta-analysis methods
The MADAM package contains two meta-analysis methods that are based on exploiting information about a feature's significance [5,6]. Both of these methods make use of the formula that was given by * ( ) , with n being the number of studies [8].
We implemented two methods for assessing the significance for S. The first method permutates the p-values as reported by the single studies and calculates a newŜ with these random combinations [6]. The ratio ofŜ larger than S divided by the number of permutations than gives the p-value. Alternatively, it is possible to derive the p-value directly from S as it follows a c 2 distribution with 2n degrees of freedom [5]. Since the first method is computationally demanding, we enabled it to be used on a cluster environment by using the snow package by Luke Tierney, which enables the carrying out of this ofthis meta-analysis in a time saving way http://www.cs. uiowa.edu/~luke/R/cluster/cluster.html. For both methods we correct for multiple testing using the FDR [9].
For deriving the p-values, we implemented a simple method for performing t-tests on a list of Expression-Sets. One can choose between the standard null hypothesis "greater", "less", or "two.sided". The resulting pvalues can be stored in a matrix for later use. For replacing the p-values of zero one can give a value of choice, otherwise 1 * 10 6 is used.
For the presented example, we assume a cluster with 10 available nodes and running an MPI based system.
For saving time, we reduced the number of permutations.

Performing an ensemble approach
It is possible to combine the results from a set of metaanalysis methods in an ensemble approach. In the MADAM package, we implemented three methods for performing this task. One of the methods uses rank products: We multiply all of the ranks and divide them by the number of genes in the meta-analysis. We then shuffle the ranks again randomly and count the number of random rank products that are smaller than the expected one. Dividing this ratio by the number of permutations gives the corresponding p-value [4]. Again, this p-value was then corrected for multiple testing by using the qvalue package [9]. In the following example, we re-used the ranks reported for under expressed features in the second group as reported from the Effect Size method and one of the significance based methods. The permutation parameters are again set very low, but this is only carried out for demonstration of the methods. We re-use the cluster object that was created before, as the ensemble methods are implemented to work on such an infrastructure as well. A second ensemble method, using information on significances in order to integrate the results from the three methods, is demonstrated hereinafter.

Visualization of the results
We integrated three methods for visualizing the results of the meta-analyses. The first one clusters the ranks of For visualizing the outcome of a meta-analysis, one often wants to plot the reported significances against their ranks. The function plotFDR enables this to be carried out in a simple manner, combining information from various methods. Coloring and selection of the line types is carried out automatically according to whether a reported result comes from a single study analysis, a meta-analysis, or an ensemble approach.

Figure 2
Volcano plot for meta-analysis. Similar to the volcano plots that are used in classical microarray analysis, the function plotMAVolcano enables the plotting of a volcano plot for metaanalysis. This is performed by plotting an effect size against a significance. Since two separate null hypotheses might be investigated in the meta-analysis, the form of the plot differs from the classical volocano-like shape. A user might also provide a list of interesting features to be highlighted automatically. the different methods hierarchically by using euclidean distances. The second one plots significances versus the rank of the specific significance. One can provide the type of the method, by which the significances were derived, thereby changing the coloring and line types automatically as can been seen in Figure 1: > signi <-cbind(Effect Size' = res.es.g2down$FDR, Rhodes = res.rhodes.g2down$q.value, + 'Rank Product' = res.rp.g2down$q.value, 'Ensemble RP' = res.ens.rp$q.value, + Ensemble FM' = res.ens.fm$q.value) > plotFDR(signi, title="Significance VS rank (down regulation)", + types = c("m", "m", "m", "e", "e")) plotting FDR... Another plotting function implemented in the MADAM package enables the plotting of a degenerated volcano plot (Figure 2). Degeneration in this context refers to the fact that the testing of two separate null hypotheses results in plots that lose their typical volcano like shape. Nevertheless, since in this plot the effect size is plotted against a significance, we still refer to it as a volcano plot: > volc <-data.frame(res.es.g2down$MUvals, res.rhodes. g2down$p.value, + row.names = rownames(res.es.g2down)) > volc.points <-volc [grep("_D", rownames(volc)), ] > plotMAVolcano(volc, title="Volcano plot (down regulation)", + points = volc.points)

Conclusions
The MADAM package helps those researchers in biology and medicine who utilize meta-analysis methods in their research. This package enables one to use the rich arsenal of the available methods in R, by simultaneously working with one data type only. The possibility to use parallel computing infrastructure facilitates the efficient use of meta-analysis even more. To make use of the supercomupting abilites within MADAM, the user has to establish an infrastructure compatible to the technologies enabled in the snow package. The sample infrastructure used in this paper was based on 4 × quadcore CPUs, running on Ubuntu 9.04 with Open MPI 1.3.2.

Outlook
We are currently planning the implementation of some other existing meta-analysis methods in a parallel manner so that they can be used in various cluster environments. Furthermore, the methods for dealing with missing values will be enhanced.