Permutation-based statistical tests for multiple hypotheses
© Camargo et al; licensee BioMed Central Ltd. 2008
Received: 30 July 2008
Accepted: 21 October 2008
Published: 21 October 2008
Genomics and proteomics analyses regularly involve the simultaneous test of hundreds of hypotheses, either on numerical or categorical data. To correct for the occurrence of false positives, validation tests based on multiple testing correction, such as Bonferroni and Benjamini and Hochberg, and re-sampling, such as permutation tests, are frequently used. Despite the known power of permutation-based tests, most available tools offer such tests for either t-test or ANOVA only. Less attention has been given to tests for categorical data, such as the Chi-square. This project takes a first step by developing an open-source software tool, Ptest, that addresses the need to offer public software tools incorporating these and other statistical tests with options for correcting for multiple hypotheses.
This study developed a public-domain, user-friendly software tool whose purpose was twofold: first, to estimate test statistics for categorical and numerical data; and second, to validate the significance of the test statistics via Bonferroni, Benjamini and Hochberg, and a permutation test of numerical and categorical data. The tool allows the calculation of the Chi-square test for categorical data, and the ANOVA test, Bartlett's test and t-test for paired and unpaired data. Once a test statistic is calculated, Bonferroni, Benjamini and Hochberg, and permutation tests are implemented, independently, to control for Type I errors. An evaluation of the software using different public data sets is reported, which illustrates the power of permutation tests for multiple hypotheses assessment and for controlling the rate of Type I errors.
The analytical options offered by the software can be applied to support a significant spectrum of hypothesis testing tasks in functional genomics, using both numerical and categorical data.
Current statistical inference problems in areas such as genomics and proteomics regularly involve the simultaneous test of hundreds of null hypotheses. This strategy has allowed scientists to unveil important cues on the mechanisms involved in the development of deadly diseases. For example, Barth et al. (2006) analysed gene expression patterns related to dilated cardiomyopathy (DCM) and identified specific gene regulatory relationships relevant to this disease condition. By means of Significance Analysis of Microarrays (SAM) and Nearest Shrunken Centroid (NSC), 27 genes, whose expression profiles were sufficient to differentiate between DCM and non-failing heart samples, were identified. Mathur et al. (2005) analysed antibody arrays and identified potential candidates for ischemic preconditioning-associated vascular growth pathways. Potential candidates were identified by applying a cut-off threshold value that filtered out non-significant probes. When dealing with these and related types of data, many hypotheses are tested and each test has a specified Type I (i.e. false positive) error probability, so the chance of committing at least one Type I error grows with the number of tests. Therefore, it is important to define an appropriate Type I error threshold, as well as to select an effective multiple testing procedure to control this error rate and account for the joint distribution of the test statistics.
To correct for the occurrence of false positives, validation tests based on multiple testing corrections and re-sampling techniques (i.e. permutation-based tests) are frequently used. Although both strategies aim to control Type I error, they implement different approaches to estimating errors and rejecting null hypotheses. Traditional multiple-testing corrections, such as Bonferroni and its variations, adjust P-values derived from multiple statistical tests to correct for the occurrence of false positives. The Benjamini and Hochberg (B&H) procedure ranks P-values in ascending order, multiplies them by the number of features, and divides them by their corresponding rank. The permutation test re-samples N times the total number of observations, in a population sample, to build an empirical estimate of the null distribution from which the test statistic has been drawn. In the end, the application of these methods leads to either the rejection or acceptance of the null hypothesis. The Bonferroni correction is known to be extremely conservative. It can lead to Type II (i.e. false negative) errors of unacceptable levels, which may contribute to publication bias and the exclusion of potentially relevant hypotheses (e.g. significant differential expression between patient groups or genotype-phenotype associations). In contrast, B&H is less stringent, which may lead to the selection of more false positives. Unlike Bonferroni and B&H, permutation tests do not use individual association scores based on family-wise corrections. Instead, permutation-based tests estimate statistical significance directly from the data being analysed. More importantly, irregularities of the observed data are maintained in the permuted data sets and are included in the estimation of the permutation probability.
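The two corrections described above can be sketched in a few lines. This is a minimal illustration, not the Ptest implementation: the function names are hypothetical, and the final monotonicity step in the B&H function is the conventional refinement of the "multiply by the number of features, divide by rank" rule rather than something stated in the text.

```python
def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Rank P-values in ascending order, multiply each by the number of
    tests and divide by its rank (assumed refinement: keep the adjusted
    values monotone in the rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that an adjusted value never
    # exceeds the adjusted value of the next-larger P-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

For the same raw P-values, the B&H-adjusted values are never larger than the Bonferroni-adjusted ones, which is the sense in which B&H is the less stringent correction.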
To date, permutation tests have become widely accepted and recommended in studies that involve multiple statistical testing [3, 6, 7]. Despite their power, currently available tools, such as TIGR MeV, offer permutation tests to estimate P-values for either t-test or ANOVA only. Another example is GeneSpring, which offers a permutation test for multiple testing for either t-test or ANOVA test statistics only. These and other published tools do not offer multiple-testing solutions for categorical data, such as the Chi-square test. This test is appropriate for the analysis of SNP (single nucleotide polymorphism) data to identify significant patterns of genetic variability, i.e. variation-phenotype associations. Another important statistical significance assessment technique not available in well-known open-source tools is the Bartlett test, which may be used for testing equality of variances or the significance of data dispersion differences across groups. Moreover, the Bartlett test should also be used before attempting the calculation of either ANOVA or t-test, as they assume that variances are equal across groups or samples.
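Statistical tests provided by the Ptest software.
Unpaired t-test (numerical data): to compare two unpaired groups
Paired t-test (numerical data): to compare two paired groups
Chi-square test (categorical data): to compare two or more unmatched groups
Bartlett's test (numerical data): to compare two groups
ANOVA test (numerical data): to compare three or more unmatched groups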
The software is a Java-based, command-line tool [see Additional files 1 and 2]. Input data are presented in a plain text file, where rows represent samples and columns represent features (Figure 1). The maximum number of groups to be compared is two, with two exceptions: the Chi-square test, for categorical data, and the ANOVA test, for numerical data, which permit the comparison of more than two groups. These requirements have been defined because they cover most of the typical multiple-testing applications in gene expression and SNP data analysis. New functionality (e.g. a Windows interface or other relevant tests) could be added based on future requirements and additional external user feedback.
The tool offers the following statistical tests: Student's t-test for numerical data, two classes; Bartlett's test for numerical data, two classes; ANOVA test for numerical data, three or more classes; and Chi-square test for categorical data, two or more classes. For detailed information about each test, please refer to the National Institute of Standards and Technology/Semiconductor Manufacturing Technology e-Handbook of Statistical Methods.
Multiple hypotheses testing procedures
Given (N) number of samples, (C) number of classes, (F) number of features, (S) significance level, (B) number of permutations, and (T(obs)) test statistic, the validation methods are as described below:
A. Multiple testing correction: P-values, according to the test statistic and degrees of freedom (N-2), were obtained and adjusted under the Bonferroni and B&H multiple testing corrections.
B. Permutation test: the test statistic T(obs) is estimated from the original data set; sample labels are shuffled B times and a permuted test statistic T(obs)' is obtained after each shuffle; if T(obs) < T(obs)', a counter T(per) is increased by 1. The probability that T(obs) occurred by chance alone is T(per)/B.
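The permutation procedure described above can be sketched as follows. This is an illustrative reading of the steps, not the Ptest source code: the function names, the `seed` parameter and the example statistic (absolute difference between two group means) are assumptions made for the sketch. The strict comparison T(obs) < T(obs)' follows the description in the text.

```python
import random


def permutation_p_value(values, labels, statistic, n_permutations, seed=0):
    """Estimate the probability that the observed statistic arose by chance.

    Labels are shuffled n_permutations (B) times; the counter T(per) is
    increased whenever the permuted statistic exceeds the observed one.
    """
    rng = random.Random(seed)  # fixed seed: an assumption, for repeatability
    t_obs = statistic(values, labels)
    shuffled = list(labels)
    t_per = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if t_obs < statistic(values, shuffled):
            t_per += 1
    return t_per / n_permutations


def abs_mean_difference(values, labels):
    """Hypothetical example statistic for two groups."""
    first, second = sorted(set(labels))
    a = [v for v, g in zip(values, labels) if g == first]
    b = [v for v, g in zip(values, labels) if g == second]
    return abs(sum(a) / len(a) - sum(b) / len(b))


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["control", "control", "control", "case", "case", "case"]
print(permutation_p_value(values, labels, abs_mean_difference, 1000))
```

Because the statistic is passed in as a function, the same loop applies unchanged to any of the test statistics offered by the tool, which is what makes the permutation approach test-independent.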
Results of analyses of statistical tests: feature selection according to multiple test correction for numerical and categorical (single nucleotide polymorphisms, SNP) data.
Testing data sets
A microarray data set (for numerical data analysis) generated by a study in dilated cardiomyopathy was obtained from the GEO (Gene Expression Omnibus), accession number GDS2205. It was composed of 12 samples: 5 from non-failing hearts and 7 from DCM patients.
A genotype data set (for categorical data analysis) was obtained from the Single Nucleotide Polymorphism database (SNPdb). This data set was composed of 34 samples: 10 from African-American people, 12 from European-American people and 11 from Han-Chinese people.
A microarray (oligo array) data set generated by a study in heart failure was obtained from the GEO, accession number GDS1362. It was composed of 37 samples: 7, 20 and 10 samples were obtained from non-failing hearts, DCM hearts and ischemic cardiomyopathy (ICM) patients, respectively.
Microarray data: probe sets with absent calls in more than 50% of their transcripts were discarded. Transcripts of probe sets corresponding to similar gene symbols were averaged. Data were normalised per chip and then per gene. Values were transformed using the mean and standard deviation of the row (per gene) or column (per chip). Genotype data did not require pre-processing.
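The text does not give the exact transformation, so the following sketch assumes the common z-score form, in which each value has the mean subtracted and is divided by the sample standard deviation of its row (per gene) or column (per chip). The function name is hypothetical.

```python
from statistics import mean, stdev


def standardise(values):
    """Centre on the mean and scale by the sample standard deviation
    (assumed z-score form; requires at least two non-identical values)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# One gene measured on three chips (made-up numbers).
print(standardise([1.0, 2.0, 3.0]))
```

Applying the function to each row standardises per gene; applying it to each column standardises per chip.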
The first analysis calculated Bartlett's test statistic to determine whether the variances of two experimental groups, from a microarray data set [see Additional file 4], were equal. The null hypothesis of this analysis was that there was no significant difference between the variances of the two groups, and the significance level to reject the null hypothesis was set to 0.05. The data set was composed of 12 samples: 5 and 7 samples were obtained from non-failing hearts and DCM patients, respectively. Out of 8068 genes, 526 genes were found to be statistically significant (P < 0.05, before correction for multiple testing), one gene was under the significance level after correcting with Bonferroni, and one gene was under the significance level after correcting with B&H. However, after performing the permutation test, 327 genes were found to differ significantly in variance (P < 0.05). That is, for most genes the two groups being compared exhibit equal variances, which is commonly expected in typical microarray data analyses.
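For reference, Bartlett's statistic for a single gene can be computed as in the sketch below, which follows the standard textbook formula (as given, for example, in the NIST/SEMATECH e-Handbook). It is not taken from the Ptest source code, the function name is hypothetical, and the example values are made up.

```python
from math import log
from statistics import variance


def bartlett_statistic(groups):
    """Bartlett's test statistic for equality of variances across groups.

    Under the null hypothesis it approximately follows a Chi-square
    distribution with (number of groups - 1) degrees of freedom.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [variance(g) for g in groups]
    # Pooled variance: group variances weighted by their degrees of freedom.
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * log(pooled) - sum(
        (n - 1) * log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (
        sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)
    ) / (3 * (k - 1))
    return numerator / correction


non_failing = [1.0, 2.0, 3.0]
dcm = [2.0, 4.0, 6.0]
print(bartlett_statistic([non_failing, dcm]))
```

The statistic is zero when all group variances are identical and grows as they diverge.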
The second analysis implemented the t-test (two-sample, equal variances, two-tailed) to estimate the potentially statistically significant difference between the means of two (normally distributed) experimental groups, from the same microarray data set analysed above [see Additional file 5]. The null hypothesis of this analysis was that there was no difference between the means of the two groups, and the significance level to reject the null hypothesis was set to 0.05. In this case, the raw P-values of 1413 genes were under the significance level (P < 0.05), 39 genes were under the significance level after correcting with B&H, and only two genes were under the significance level after correcting with Bonferroni. These results were consistent with our expectations: B&H identified more genes than Bonferroni did, which shows that the former tends to be less stringent. After performing the permutation test, 1398 genes were found significantly differentially expressed (P < 0.05). In addition, we noted that the raw P-values of some of the genes filtered out by Bonferroni were well below the significance level, i.e. they were potentially significant under a less conservative correction approach. For example, the raw P-values of ACVR1 and CFHR1 were 0.0004 and 0.004, respectively, and their P-values after Bonferroni correction were above 0.9. However, based on the permutation-based test, these two genes fall below the significance threshold (corrected P-values: 0.0001 and 0.001 for ACVR1 and CFHR1, respectively). This, as expected, shows the statistical power of permutation-based procedures for multiple testing.
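The two-sample, equal-variance t statistic used in this kind of analysis can be sketched as below. This is the standard pooled-variance formula with N-2 degrees of freedom, offered as an illustration rather than as the Ptest implementation; the function name and the example values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Two-sample t statistic assuming equal variances.

    Returns the statistic and its degrees of freedom (N - 2).
    """
    n_a, n_b = len(group_a), len(group_b)
    dof = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / dof
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, dof


t_stat, dof = pooled_t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(t_stat, dof)
```

For a two-tailed test, the absolute value of the statistic is compared against the t distribution with the returned degrees of freedom.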
The third analysis implemented the Chi-square test on categorical data derived from a genetic variation data set (SNPs) [see Additional file 6]. The problem was to determine statistically significant genetic variations among the SNPs of three ethnic groups: African-American, European-American and Chinese. The data encode genotype values for each SNP under each group. This data set was composed of 34 samples: 10 from African-Americans, 12 from European-Americans and 11 from Han-Chinese people. The null hypothesis of this analysis was that there were no differential genetic variations among the three groups, and the significance level to reject the null hypothesis was set to 0.05. In this case the raw P-values of 153 SNPs, out of 334, were under the significance level (P < 0.05). Bonferroni correction identified only eight SNPs whose P-values were below the significance level, and B&H correction identified 131 such SNPs. In contrast, the permutation test identified more features than B&H: 153 SNPs with significant P-values. These results are consistent with the results reported by Carlson et al. (2003), who found that only 48% of the SNPs were shared by African-Americans and European-Americans. In our study, the permutation-based adjustment found that 55% of SNPs showed no significant differences among the three populations being analysed. These results again confirm the statistical power of permutation-based procedures for multiple testing.
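For a single SNP, the Chi-square statistic can be obtained by cross-tabulating group labels against genotype values, as in the sketch below. It uses the standard Pearson formula and is not taken from the Ptest source code; the function name, the group labels and the genotype codes in the example are made up.

```python
from collections import Counter


def chi_square_statistic(genotypes, labels):
    """Pearson Chi-square statistic for a group-by-genotype table.

    Returns the statistic and its degrees of freedom,
    (groups - 1) * (genotype categories - 1).
    """
    observed = Counter(zip(labels, genotypes))
    row_totals = Counter(labels)
    col_totals = Counter(genotypes)
    n = len(labels)
    statistic = 0.0
    for group in row_totals:
        for genotype in col_totals:
            # Expected count under independence of group and genotype.
            expected = row_totals[group] * col_totals[genotype] / n
            statistic += (observed[(group, genotype)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


labels = ["x"] * 4 + ["y"] * 4
genotypes = ["AA"] * 4 + ["BB"] * 4
print(chi_square_statistic(genotypes, labels))
```

The same function handles two or more groups and any number of genotype categories, since the table dimensions are taken from the data.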
A fourth analysis implemented the ANOVA test to estimate the potentially statistically significant difference between the means of three (normally distributed) experimental groups. Samples in this data set were obtained from heart tissue of healthy donors, as well as from donors suffering from either dilated or ischemic cardiomyopathy [see Additional file 7]. We used the ANOVA test to look for possible differences among the three populations evaluated, because the t-test is designed to perform pair-wise comparisons only. The null hypothesis of this ANOVA analysis was that there were no differences between the means of the three groups, and the significance level to reject the null hypothesis was set to 0.05. In this case, the raw P-values of 6371 genes were under the significance level (P < 0.05), 3331 genes were under the significance level after correcting with B&H, and only nine genes were under the significance level after correcting with Bonferroni. After performing the permutation test, 6262 genes were found significantly differentially expressed (P < 0.05). The genes reported as significantly differentially expressed after correcting via Bonferroni were not included in the set of potentially significant genes detected by the permutation test. In addition, we compared our results against those previously reported by Kittleson et al. (2005) and found that most genes reported by them as significantly differentially expressed were also below the significance level when our permutation test was performed, or when P-values were corrected via the B&H method. In contrast, only one of the genes reported by Kittleson et al. was also below the significance level after we corrected with Bonferroni. This analysis perhaps best illustrates the strength of the permutation test for identifying potential biomarkers of disease.
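The one-way ANOVA F statistic for a single gene can be sketched as follows. This is the standard ratio of between-group to within-group mean squares, shown for illustration only; it is not the Ptest source code, and the function name and example values are hypothetical.

```python
def anova_f_statistic(groups):
    """One-way ANOVA F statistic for three or more unmatched groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


non_failing = [1.0, 2.0, 3.0]
dcm = [2.0, 3.0, 4.0]
icm = [3.0, 4.0, 5.0]
print(anova_f_statistic([non_failing, dcm, icm]))
```

The statistic is compared against the F distribution with (k - 1) and (N - k) degrees of freedom, where k is the number of groups and N the total number of samples.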
The techniques for multiple testing offered here through a platform-independent tool are relevant to a variety of data analysis tasks in biology and medicine. The results also allowed us to illustrate the power of a permutation test for multiple hypotheses assessment procedures and for controlling the rate of Type I errors. We also demonstrated that even when P-values were corrected via B&H, which is considered a less stringent method as opposed to Bonferroni, a number of potentially significant features were dismissed. The software is easy to use and it offers the basis for future extensions. Another key contribution is the implementation of multiple hypotheses statistical testing techniques for both numerical and categorical data. The analytical options offered can be applied to support a significant spectrum of hypothesis testing tasks in functional genomics, e.g. fast detection of significantly differentially expressed genes and genotypes. Moreover, to the best of our knowledge, this is the first open-source software tool freely available for supporting less traditional genomic applications, such as the detection of between-group differences on the basis of SNPs. In this area multiple-testing procedures have traditionally relied on very stringent adjustment approaches (e.g. Bonferroni).
Despite its simplicity, in terms of usability this tool offers the following advantages in comparison with others, such as GeneSpring and TIGR MeV: it is freely available (as is TIGR MeV), has no computational installation cost, is easy to use and is computationally inexpensive. Moreover, it allows the calculation of traditional statistical tests and multiple testing with categorical data, as well as test- and distribution-independent permutation-based tests.
We expect to continue expanding the tool with alternative statistical significance measures, such as Fisher's exact test, Z or Wald scores. We will welcome additional user feedback after the publication of this article.
Availability and requirements
Project name: Permutation-based statistical tests for multiple hypotheses
Project home page: http://rosalind.infj.ulst.ac.uk/CWB/Ptest.html
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.5.1 or higher
Any restrictions to use by non-academics: None
ANOVA: Analysis of variance
SAM: Significance Analysis of Microarrays
NSC: Nearest Shrunken Centroid
SNP: Single nucleotide polymorphism
We thank the two anonymous reviewers for their comments, which allowed us to improve the quality of the manuscript and software. This work was supported in part by a grant from EU-FP6, CARDIOWORKBENCH project http://www.medinfo.dist.unige.it/CWB1/, to FA.
- Barth AS, Kuner R, Buness A, Ruschhaupt M, Merk S, Zwermann L, Kääb S, Kreuzer E, Steinbeck G, Mansmann U, Poustka A, Nabauer M, Sültmann H: Identification of a common gene expression signature in dilated cardiomyopathy across independent microarray studies. J Am Coll Cardiol. 2006, 48: 1610-7. 10.1016/j.jacc.2006.07.026.
- Mathur P, Kaga S, Zhan L, Das DK, Maulik N: Potential candidates for ischemic preconditioning-associated vascular growth pathways revealed by antibody array. Am J Physiol Heart Circ Physiol. 2005, 288 (6): H3006-10. 10.1152/ajpheart.01203.2004.
- Dudoit S, Shaffer JP, Boldrick JC: Multiple hypotheses testing in microarray experiments. Statistical Science. 2003, 18 (1): 71-103. 10.1214/ss/1056397487.
- Feilotter H: A Biologist's guide to analysis of DNA microarray data. Am J Hum Genet. 2002, 71 (6): 1483-1484. 10.1086/344458.
- Multiple Testing Corrections. [http://www.chem.agilent.com/cag/bsp/sig/downloads/pdf/mtc.pdf]
- Belmonte M, Yurgelun-Todd D: Permutation testing made practical for functional magnetic resonance image analysis. IEEE Trans Med Imaging. 2001, 20 (3): 243-8. 10.1109/42.918475.
- Nakagawa S: A farewell to Bonferroni: the problems of low statistical power and publication bias. Behav Ecol. 2004, 15: 1044-1045. 10.1093/beheco/arh107.
- Kimmel G, Jordan MI, Halperin E, Shamir R, Karp RM: A randomization test for controlling population stratification in whole-genome association studies. Am J Hum Genet. 2007, 81 (5): 895-905. 10.1086/521372.
- Cheverud JM: A simple correction for multiple comparisons in interval mapping genome scans. Heredity. 2001, 87: 52-58. 10.1046/j.1365-2540.2001.00901.x.
- Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, Sturn A, Snuffin M, Rezantsev A, Popov D, Ryltsov A, Kostukovich E, Borisovsky I, Liu Z, Vinsavich A, Trush V, Quackenbush J: TM4: a free, open-source system for microarray data management and analysis. Biotechniques. 2003, 34 (2): 374-8.
- GeneSpring GX Software. [http://www.chem.agilent.com/Scripts/PDS.asp?lPage=27881]
- NIST/SEMATECH e-Handbook of Statistical Methods. [http://www.itl.nist.gov/div898/handbook]
- Gene Expression Omnibus (GEO). [http://www.ncbi.nlm.nih.gov/geo]
- Single Nucleotide Polymorphism database (SNPdb). [http://www.ncbi.nlm.nih.gov/projects/SNP/index.html]
- Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR: Whole-genome patterns of common DNA variation in three human populations. Science. 2005, 307 (5712): 1052-3. 10.1126/science.1105436.
- Carlson CS, Eberle MA, Rieder MJ, Smith JD, Kruglyak L, Nickerson DA: Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat Genet. 2003, 33 (4): 518-21. 10.1038/ng1128.
- Kittleson MM, Minhas KM, Irizarry RA, Ye SQ, Edness G, Breton E, Conte JV, Tomaselli G, Garcia JG, Hare JM: Gene expression analysis of ischemic and nonischemic cardiomyopathy: shared and distinct genes in the development of heart failure. Physiol Genomics. 2005, 21 (3): 299-307. 10.1152/physiolgenomics.00255.2004.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.