 Software review
 Open Access
Permutation-based statistical tests for multiple hypotheses
Source Code for Biology and Medicine, volume 3, Article number: 15 (2008)
Abstract
Background
Genomics and proteomics analyses regularly involve the simultaneous testing of hundreds of hypotheses, on either numerical or categorical data. To correct for the occurrence of false positives, validation tests based on multiple testing corrections, such as Bonferroni and Benjamini and Hochberg, and on resampling, such as permutation tests, are frequently used. Despite the known power of permutation-based tests, most available tools offer such tests for either the t-test or ANOVA only. Less attention has been given to tests for categorical data, such as the Chi-square test. This project takes a first step by developing an open-source software tool, Ptest, that addresses the need for public software tools incorporating these and other statistical tests with options for correcting for multiple hypotheses.
Results
This study developed a public-domain, user-friendly software tool whose purpose was twofold: first, to estimate test statistics for categorical and numerical data; and second, to validate the significance of the test statistics via Bonferroni, Benjamini and Hochberg, and a permutation test of numerical and categorical data. The tool allows the calculation of the Chi-square test for categorical data, and the ANOVA test, Bartlett's test and t-test for paired and unpaired data. Once a test statistic is calculated, the Bonferroni, Benjamini and Hochberg, and permutation tests are implemented, independently, to control for Type I errors. An evaluation of the software using different public data sets is reported, which illustrates the power of permutation tests for multiple hypotheses assessment and for controlling the rate of Type I errors.
Conclusion
The analytical options offered by the software can be applied to support a significant spectrum of hypothesis testing tasks in functional genomics, using both numerical and categorical data.
Background
Current statistical inference problems in areas such as genomics and proteomics regularly involve the simultaneous testing of hundreds of null hypotheses. This strategy has allowed scientists to unveil important clues about the mechanisms involved in the development of deadly diseases. For example, Barth et al. (2006) [1] analysed gene expression patterns related to dilated cardiomyopathy (DCM) and identified specific gene regulatory relationships relevant to this disease condition. By means of Significance Analysis of Microarrays (SAM) and Nearest Shrunken Centroid (NSC), they identified 27 genes whose expression profiles were sufficient to differentiate between DCM and non-failing heart samples. Mathur et al. (2005) [2] analysed antibody arrays and identified potential candidates for ischemic preconditioning-associated vascular growth pathways. Potential candidates were identified by applying a cut-off threshold value that filtered out non-significant probes. When dealing with these and related types of data, many hypotheses are tested, and each test has a specified probability of committing a Type I (i.e. false positive) error [3]. Therefore, it is important to define an appropriate Type I error threshold, and to select an effective multiple testing procedure that controls this error rate and accounts for the joint distribution of the test statistics.
To correct for the occurrence of false positives, validation tests based on multiple testing corrections and resampling techniques (i.e. permutation-based tests) are frequently used. Although both strategies aim to control Type I error, they implement different approaches to estimating errors and rejecting null hypotheses. Traditional multiple-testing corrections, such as Bonferroni and its variations, adjust the P-values derived from multiple statistical tests to correct for the occurrence of false positives [4]. The Benjamini and Hochberg (B&H) procedure ranks the P-values in ascending order, multiplies them by the number of features, and divides them by their corresponding rank [5]. The permutation test resamples the observations in a population sample N times to build an empirical estimate of the null distribution from which the test statistic has been drawn [6]. In the end, the application of these methods leads to either the rejection or the acceptance of the null hypothesis. The Bonferroni correction is known to be extremely conservative. It can lead to unacceptable levels of Type II (i.e. false negative) errors, which may contribute to publication bias and to the exclusion of potentially relevant hypotheses (e.g. significant differential expression between patient groups or genotype-phenotype associations) [7]. In contrast, B&H is less stringent, which may lead to the selection of more false positives [5]. Unlike Bonferroni and B&H, permutation tests do not use individual association scores based on family-wise corrections [8]. Instead, permutation-based tests estimate statistical significance directly from the data being analysed. More importantly, irregularities of the observed data are maintained in the permuted data sets and are included in the estimation of the permutation probability [9].
To date, permutation tests have become widely accepted and recommended in studies that involve multiple statistical testing [3, 6, 7]. Despite their power, currently available tools, such as TIGR MeV [10], offer permutation tests to estimate P-values for either the t-test or ANOVA only. Another example is GeneSpring [11], which offers a permutation test for multiple testing for either t-test or ANOVA test statistics only. These and other published tools do not offer multiple-testing solutions for tests on categorical data, such as the Chi-square test. This test is appropriate for the analysis of SNP (single nucleotide polymorphism) data to identify significant patterns of genetic variability, i.e. variation-phenotype associations. Another important statistical significance assessment technique not available in well-known open-source tools is the Bartlett test, which may be used for testing equality of variances or the significance of data dispersion differences across groups. Moreover, the Bartlett test should be applied before attempting the calculation of either ANOVA or the t-test, as both assume that variances are equal across groups or samples.
Given the evident need to offer software tools incorporating such statistical tests with options for correcting for multiple tests, this study takes a first step by developing a public-domain, user-friendly software tool with the following functionality. The tool allows the calculation of the Chi-square test for categorical data, and the ANOVA test, Bartlett's test and t-test for paired and unpaired data. Once a test statistic is calculated, the Bonferroni, B&H and permutation tests are implemented, independently, to control for Type I errors. P-values from the permutation test were estimated as follows, using the data encoding format shown in Figure 1:
First, the test statistic and its corresponding P-value are calculated on the original data set. Second, the data are permuted at random B times and the test statistic is calculated on each permuted data set. Third, the permutation P-value is calculated by counting the number of times (K) that the statistic obtained from the original data set was smaller than the statistic obtained from a permuted data set, and dividing that count by the number of random permutations, i.e. K/B. Results are stored in a text file for subsequent analyses. Table 1 offers guidelines for the selection of the most appropriate statistical test under this system.
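The three steps above can be sketched in Python as follows. This is a minimal illustration for a single feature, not the tool's Java source: the function names, the use of the absolute difference between two group means as the test statistic, and the fixed random seed are all assumptions made for the example. The counter follows the strict inequality described in the text.

```python
import random
from statistics import mean


def mean_difference(values, labels):
    """Illustrative statistic: absolute difference between the means of groups 0 and 1."""
    group0 = [v for v, g in zip(values, labels) if g == 0]
    group1 = [v for v, g in zip(values, labels) if g == 1]
    return abs(mean(group0) - mean(group1))


def permutation_p_value(values, labels, statistic, n_permutations, seed=0):
    """Return K/B: the fraction of label shuffles whose statistic exceeds the observed one."""
    rng = random.Random(seed)  # fixed seed (an assumption) so that runs are repeatable
    observed = statistic(values, labels)  # step 1: statistic on the original data
    shuffled = list(labels)
    exceed = 0  # K in the text
    for _ in range(n_permutations):  # step 2: B random permutations of the labels
        rng.shuffle(shuffled)
        if observed < statistic(values, shuffled):
            exceed += 1
    return exceed / n_permutations  # step 3: K/B
```

With two well-separated groups no shuffle can exceed the observed statistic, so the returned value is 0; some authors prefer a non-strict comparison or the estimate (K+1)/(B+1) to avoid reporting an exact zero, which would be a small change to the last two lines of the loop and the return statement.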
Implementation
The software is a Java-based, command-line tool [see Additional files 1 and 2]. Input data are presented in a plain text file, where rows represent samples and columns represent features (Figure 1). The maximum number of groups to be compared is two, with two exceptions: the Chi-square test, for categorical data, and the ANOVA test, for numerical data, which permit the comparison of more than two groups. These requirements have been defined because they cover most of the typical multiple-testing applications in gene expression and SNP data analysis. New functionality (e.g. a Windows interface or other relevant tests) could be added based on future requirements and additional external user feedback.
Statistical tests
The tool places the following statistical tests at the user's disposal: Student's t-test for numerical data, two classes; Bartlett's test for numerical data, two classes; the ANOVA test for numerical data, more than two classes; and the Chi-square test for categorical data, two or more classes. For detailed information about each test, please refer to the National Institute of Standards and Technology/Semiconductor Manufacturing Technology e-Handbook of Statistical Methods [12].
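As an illustration of one of the listed tests, the sketch below computes Bartlett's statistic using the standard textbook formula (pooled variance, log-variance contrast and small-sample correction factor). It is a hedged example rather than the tool's implementation: the function name and the list-of-lists input are assumptions, and every group is assumed to have at least two observations and a non-zero variance.

```python
import math
from statistics import variance


def bartlett_statistic(groups):
    """Bartlett's test statistic for k groups of numerical observations."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    total = sum(sizes)
    variances = [variance(g) for g in groups]  # sample variances (n - 1 denominator)
    # Pooled variance: group variances weighted by their degrees of freedom.
    pooled = sum((n - 1) * s for n, s in zip(sizes, variances)) / (total - k)
    numerator = (total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(s) for n, s in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (total - k)) / (3 * (k - 1))
    return numerator / correction
```

Under the null hypothesis of equal variances the statistic is approximately Chi-square distributed with k - 1 degrees of freedom, so for two groups a value above roughly 3.84 corresponds to P < 0.05.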
Multiple hypotheses testing procedures
Given the number of samples (N), the number of classes (C), the number of features (F), the significance level (S), the number of permutations (B), and the test statistic (T(obs)), the validation methods are as described below:
Multiple testing correction: P-values, according to the test statistic and degrees of freedom (N-2), were obtained and adjusted under the Bonferroni and B&H multiple testing corrections [5]. Permutation test: the test statistic T(obs) is estimated from the original data set; the sample labels are shuffled B times and a permuted statistic T(obs)' is obtained from each shuffle; whenever T(obs) < T(obs)', a counter T(per) is increased by 1. The probability that T(obs) occurred by chance alone is T(per)/B.
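The two multiple testing corrections can be sketched as below. This is an illustrative Python version, not the tool's code: the function names are hypothetical, and the capping at 1 and the monotonicity step in the B&H adjustment (carrying the running minimum down from the largest rank) are the usual conventions, assumed here because the text does not state how the tool handles them.

```python
def bonferroni(p_values):
    """Multiply each raw P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """B&H adjusted P-values: p * m / rank, kept monotone from the largest rank down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices in ascending P-value order
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A feature is then reported as significant when its adjusted P-value is below the significance level S. Because B&H divides by the rank, its adjusted values are never larger than the Bonferroni ones, which is why it is the less stringent of the two.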
Software usage
Typical usage involves a user providing the following information: the name of the file containing the data to be analysed, the name of the new file where results are to be stored, the test statistic to be calculated, the significance level at which the null hypothesis is to be rejected, and the number of permuted data sets to be created for the estimation of the null-hypothesis distribution (Figure 2) [see Additional file 3]. Depending on the test statistic to be calculated, the user may need to provide additional information in a few steps. For example, if the t-test is selected, the user should indicate whether the samples (i.e. the groups being compared) are independent or paired. The user is also allowed to specify which type of distribution should be used: one-tailed or two-tailed. Once the required information is provided, the tool performs the analysis and displays those features whose raw P-values are below the significance level, together with their corrected P-values after Bonferroni correction, after B&H correction, and after performing the permutation test.
Figure 3 is a pseudocode representation of the multiple testing correction procedure implemented.
Results
To illustrate some of the advantages of using the permutation-based test for multiple hypotheses validation, this section summarises examples of analyses using publicly available data. This includes a comparison with results obtained when Bonferroni correction was applied (Table 2).
Testing data sets
Three data sets were used in the analysis:

1) A microarray data set (for numerical data analysis), generated by a study of dilated cardiomyopathy, was obtained from the GEO (Gene Expression Omnibus) [13], accession number GDS2205. It was composed of 12 samples: 5 from non-failing hearts and 7 from DCM patients.

2) A genotype data set (for categorical data analysis) was obtained from the Single Nucleotide Polymorphism database (SNPdb) [14]. This data set was composed of 34 samples: 10 from African-American people, 12 from European-American people and 11 from Han Chinese people.

3) A microarray data set (oligo array), generated by a study of heart failure, was obtained from the GEO, accession number GDS1362. It was composed of 37 samples: 7, 20 and 10 samples were obtained from non-failing hearts, DCM hearts and ischemic cardiomyopathy (ICM) patients, respectively.
Data preprocessing
Microarray data: probe sets with absent calls in more than 50% of their transcripts were discarded. Transcripts of probe sets corresponding to similar gene symbols were averaged. Data were normalised per chip and then per gene. Values were transformed using the mean and standard deviation of the row (per gene) or column (per chip). Genotype data did not require preprocessing.
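The normalisation step can be read as a z-score transformation applied first per chip and then per gene. The sketch below shows that reading in Python; it is an assumption-laden illustration, since the text does not specify the exact formula: it assumes a matrix whose rows are genes and whose columns are chips, the sample standard deviation, and no constant rows or columns. The function names are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Centre on the mean and scale by the sample standard deviation."""
    centre, spread = mean(values), stdev(values)
    return [(v - centre) / spread for v in values]


def normalise(matrix):
    """Standardise each column (chip), then each row (gene); rows are genes."""
    columns = [z_scores(list(col)) for col in zip(*matrix)]  # per-chip step
    rows = [list(row) for row in zip(*columns)]  # transpose back to genes x chips
    return [z_scores(row) for row in rows]  # per-gene step
```

After the second step every gene has mean 0 and unit standard deviation across chips, which puts all features on a common scale before the test statistics are calculated.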
Statistical analyses
The first analysis calculated Bartlett's test statistic to determine whether the variances of two experimental groups, from a microarray data set [see Additional file 4], were equal. The null hypothesis of this analysis was that there was no significant difference between the variances of the two groups, and the significance level to reject the null hypothesis was set to 0.05. The data set was composed of 12 samples: 5 from non-failing hearts and 7 from DCM patients. Out of 8068 genes, 526 genes were found to be statistically significant (P < 0.05, before correction for multiple testing), one gene was under the significance level after correcting with Bonferroni, and one gene was under the significance level after correcting with B&H. However, after performing the permutation test, 327 genes were found to differ significantly in variance (P < 0.05). That is, for the large majority of genes the two groups being compared exhibit equal variances, which is commonly expected in typical microarray data analyses.
The second analysis implemented the t-test (type: two-sample, equal variances; number of distribution tails: two-tailed) to estimate the potential statistically significant difference between the means of two (normally distributed) experimental groups, from the same microarray data set analysed above [see Additional file 5]. The null hypothesis of this analysis was that there was no difference between the means of the two groups, and the significance level to reject the null hypothesis was set to 0.05. The raw P-values of 1413 genes were under the significance level (P < 0.05), 39 genes were under the significance level after correcting with B&H, and only two genes were under the significance level after correcting with Bonferroni. These results were consistent with our expectations: B&H identified more genes than Bonferroni did, which shows that the former tends to be less stringent. After performing the permutation test, 1398 genes were found to be significantly differentially expressed (P < 0.05). In addition, we noted that the raw P-values of some of the genes filtered out by Bonferroni were well below the significance level, i.e. they were potentially significant under a less conservative correction approach. For example, the raw P-values of ACVR1 and CFHR1 were 0.0004 and 0.004, respectively, and their P-values after Bonferroni correction were above 0.9. However, based on the permutation-based test, these two genes fall below the significance threshold (corrected P-values: 0.0001 and 0.001 for ACVR1 and CFHR1, respectively). This, as expected, shows the statistical power of permutation-based procedures for multiple testing.
The third analysis implemented the Chi-square test on categorical data derived from a genetic variation data set (SNPs) [see Additional file 6]. The problem was to determine statistically significant genetic variations among the SNPs of three ethnic groups: African-American, European-American and Han Chinese. The data encode genotype values for each SNP under each group [15]. This data set was composed of 34 samples: 10 from African-Americans, 12 from European-Americans and 11 from Han Chinese people. The null hypothesis of this analysis was that there were no differences in genetic variation among the three groups, and the significance level to reject the null hypothesis was set to 0.05. In this case the raw P-values of 153 SNPs, out of 334, were under the significance level (P < 0.05). Bonferroni correction identified only eight SNPs whose P-values were below the significance level, and B&H correction identified 131 such SNPs. In contrast, the permutation test identified more features than B&H: 153 SNPs with significant P-values. These results are consistent with the results reported by Carlson et al. (2003) [16], who found that only 48% of the SNPs were shared by African-Americans and European-Americans. In our study, the permutation-based adjustment found that 55% of SNPs showed no significant differences among the three populations being analysed. These results again confirm the statistical power of permutation-based procedures for multiple testing.
A fourth analysis implemented the ANOVA test to estimate the potential statistically significant difference between the means of three (normally distributed) experimental groups. Samples in this data set were obtained from the heart tissue of healthy donors, as well as from donors suffering from either dilated or ischemic cardiomyopathy [see Additional file 7]. We used the ANOVA test to look for possible outstanding differences among the three populations evaluated, because the t-test is designed to perform pairwise comparisons only. The null hypothesis of this ANOVA analysis was that there were no differences between the means of the three groups, and the significance level to reject the null hypothesis was set to 0.05. In this case, the raw P-values of 6371 genes were under the significance level (P < 0.05), 3331 genes were under the significance level after correcting with B&H, and only nine genes were under the significance level after correcting with Bonferroni. After performing the permutation test, 6262 genes were found to be significantly differentially expressed (P < 0.05). The genes reported as significantly differentially expressed after correcting via Bonferroni were not included in the set of potentially significant genes detected by the permutation test. In addition, we compared our results against those previously reported by Kittleson et al. (2005) [17] and found that most genes reported by them as significantly differentially expressed were also below the significance level when our permutation test was performed, or when P-values were corrected via the B&H method. In contrast, only one of the genes reported by Kittleson et al. was also below the significance level after we corrected with Bonferroni. Perhaps this analysis showed the real strength of the permutation test in identifying potential biomarkers of disease.
Conclusion
The techniques for multiple testing offered here through a platform-independent tool are relevant to a variety of data analysis tasks in biology and medicine. The results also allowed us to illustrate the power of a permutation test for multiple hypotheses assessment procedures and for controlling the rate of Type I errors. We also demonstrated that even when P-values were corrected via B&H, which is considered a less stringent method than Bonferroni, a number of potentially significant features were dismissed. The software is easy to use and it offers the basis for future extensions. Another key contribution is the implementation of multiple hypotheses statistical testing techniques for both numerical and categorical data. The analytical options offered can be applied to support a significant spectrum of hypothesis testing tasks in functional genomics, e.g. fast detection of significantly differentially expressed genes and genotypes. Moreover, to the best of our knowledge, this is the first freely available open-source software tool for supporting less traditional genomic applications, such as the detection of between-group differences on the basis of SNPs. In this area multiple-testing procedures have traditionally relied on very stringent adjustment approaches (e.g. Bonferroni).
Despite its simplicity in terms of usability, this tool offers the following advantages over others such as GeneSpring and TIGR MeV: it is freely available (as TIGR MeV is), it has no installation cost, it is easy to use, and it is computationally inexpensive. Moreover, it allows the calculation of traditional statistical tests and multiple testing with categorical data, as well as test- and distribution-independent permutation-based tests.
We expect to continue expanding the tool with alternative statistical significance measures, such as Fisher's exact test, Z or Wald scores. We will welcome additional user feedback after the publication of this article.
Availability and requirements
Project name: Permutation-based statistical tests for multiple hypotheses
Project home page: http://rosalind.infj.ulst.ac.uk/CWB/Ptest.html
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.5.1 or higher
License: None
Any restrictions to use by nonacademics: None
Abbreviations
ANOVA: Analysis of variance
DCM: Dilated cardiomyopathy
SAM: Significance Analysis of Microarrays
NSC: Nearest Shrunken Centroid
SNP: Single nucleotide polymorphism
References
1. Barth AS, Kuner R, Buness A, Ruschhaupt M, Merk S, Zwermann L, Kääb S, Kreuzer E, Steinbeck G, Mansmann U, Poustka A, Nabauer M, Sültmann H: Identification of a common gene expression signature in dilated cardiomyopathy across independent microarray studies. J Am Coll Cardiol. 2006, 48: 1610-7. 10.1016/j.jacc.2006.07.026.
2. Mathur P, Kaga S, Zhan L, Das DK, Maulik N: Potential candidates for ischemic preconditioning-associated vascular growth pathways revealed by antibody array. Am J Physiol Heart Circ Physiol. 2005, 288 (6): H3006-10. 10.1152/ajpheart.01203.2004.
3. Dudoit S, Shaffer JP, Boldrick JC: Multiple hypotheses testing in microarray experiments. Statistical Science. 2003, 18 (1): 71-103. 10.1214/ss/1056397487.
4. Feilotter H: A Biologist's guide to analysis of DNA microarray data. Am J Hum Genet. 2002, 71 (6): 1483-1484. 10.1086/344458.
5. Multiple Testing Corrections. [http://www.chem.agilent.com/cag/bsp/sig/downloads/pdf/mtc.pdf]
6. Belmonte M, Yurgelun-Todd D: Permutation testing made practical for functional magnetic resonance image analysis. IEEE Trans Med Imaging. 2001, 20 (3): 243-8. 10.1109/42.918475.
7. Nakagawa S: A farewell to Bonferroni: the problems of low statistical power and publication bias. Behav Ecol. 2004, 15: 1044-1045. 10.1093/beheco/arh107.
8. Kimmel G, Jordan MI, Halperin E, Shamir R, Karp RM: A randomization test for controlling population stratification in whole-genome association studies. Am J Hum Genet. 2007, 81 (5): 895-905. 10.1086/521372.
9. Cheverud JM: A simple correction for multiple comparisons in interval mapping genome scans. Heredity. 2001, 87: 52-58. 10.1046/j.1365-2540.2001.00901.x.
10. Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, Sturn A, Snuffin M, Rezantsev A, Popov D, Ryltsov A, Kostukovich E, Borisovsky I, Liu Z, Vinsavich A, Trush V, Quackenbush J: TM4: a free, open-source system for microarray data management and analysis. Biotechniques. 2003, 34 (2): 374-8.
11. GeneSpring GX Software. [http://www.chem.agilent.com/Scripts/PDS.asp?lPage=27881]
12. NIST/SEMATECH e-Handbook of Statistical Methods. [http://www.itl.nist.gov/div898/handbook]
13. Gene Expression Omnibus (GEO). [http://www.ncbi.nlm.nih.gov/geo]
14. Single Nucleotide Polymorphism database (SNPdb). [http://www.ncbi.nlm.nih.gov/projects/SNP/index.html]
15. Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR: Whole-genome patterns of common DNA variation in three human populations. Science. 2005, 307 (5712): 1052-3. 10.1126/science.1105436.
16. Carlson CS, Eberle MA, Rieder MJ, Smith JD, Kruglyak L, Nickerson DA: Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat Genet. 2003, 33 (4): 518-21. 10.1038/ng1128.
17. Kittleson MM, Minhas KM, Irizarry RA, Ye SQ, Edness G, Breton E, Conte JV, Tomaselli G, Garcia JG, Hare JM: Gene expression analysis of ischemic and nonischemic cardiomyopathy: shared and distinct genes in the development of heart failure. Physiol Genomics. 2005, 21 (3): 299-307. 10.1152/physiolgenomics.00255.2004.
Acknowledgements
We thank the two anonymous reviewers for their comments, which allowed us to improve the quality of the manuscript and software. This work was supported in part by a grant from the EU-FP6 CARDIOWORKBENCH project (http://www.medinfo.dist.unige.it/CWB1/) to FA.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
AC co-designed the software, wrote and implemented all source code, co-evaluated its outcomes and co-wrote the manuscript. FA conceived the original study, contributed to the testing and evaluation phases, and co-wrote the manuscript. HW co-designed the software, co-evaluated its outcomes and co-wrote the manuscript. HZ co-designed the software, co-evaluated its outcomes and co-wrote the manuscript. All authors read and approved the final manuscript.
Keywords
 Permutation Test
 Multiple Hypothesis
 Multiple Testing Correction
 Near Shrunken Centroid
 Statistical Significance Measure