Permutation-based statistical tests for multiple hypotheses
 Anyela Camargo^{1},
 Francisco Azuaje^{2},
 Haiying Wang^{3} and
 Huiru Zheng^{3}
DOI: 10.1186/1751-0473-3-15
© Camargo et al; licensee BioMed Central Ltd. 2008
Received: 30 July 2008
Accepted: 21 October 2008
Published: 21 October 2008
Abstract
Background
Genomics and proteomics analyses regularly involve the simultaneous testing of hundreds of hypotheses, on either numerical or categorical data. To correct for the occurrence of false positives, validation tests based on multiple testing corrections, such as Bonferroni and Benjamini and Hochberg, and on resampling, such as permutation tests, are frequently used. Despite the known power of permutation-based tests, most available tools offer such tests for either the t-test or ANOVA only. Less attention has been given to tests for categorical data, such as the Chi-square test. This project takes a first step by developing an open-source software tool, Ptest, that addresses the need for public software tools incorporating these and other statistical tests with options for correcting for multiple hypotheses.
Results
This study developed a public-domain, user-friendly software tool whose purpose was twofold: first, to estimate test statistics for categorical and numerical data; and second, to validate the significance of the test statistics via Bonferroni, Benjamini and Hochberg, and a permutation test of numerical and categorical data. The tool allows the calculation of the Chi-square test for categorical data, and the ANOVA test, Bartlett's test and the t-test for paired and unpaired data. Once a test statistic is calculated, the Bonferroni, Benjamini and Hochberg, and permutation tests are applied, independently, to control for Type I errors. An evaluation of the software using different public data sets is reported, which illustrates the power of permutation tests for multiple hypotheses assessment and for controlling the rate of Type I errors.
Conclusion
The analytical options offered by the software can be applied to support a significant spectrum of hypothesis testing tasks in functional genomics, using both numerical and categorical data.
Background
Current statistical inference problems in areas such as genomics and proteomics regularly involve the simultaneous test of hundreds of null hypotheses. This strategy has allowed scientists to unveil important cues on the mechanisms involved in the development of deadly diseases. For example, Barth et al. (2006) [1] analysed gene expression patterns related to dilated cardiomyopathy (DCM) and identified specific gene regulatory relationships relevant to this disease condition. By means of Significance Analysis of Microarrays (SAM) and Nearest Shrunken Centroid (NSC), they identified 27 genes whose expression profiles were sufficient to differentiate between DCM and non-failing heart samples. Mathur et al. (2005) [2] analysed antibody arrays and identified potential candidates for ischemic preconditioning-associated vascular growth pathways. Potential candidates were identified by applying a cut-off threshold value that filtered out non-significant probes. When dealing with these and related types of data, many hypotheses are tested and each test has a specified Type I (i.e. false positive) error probability, so the chance of committing at least one Type I error grows with the number of tests [3]. Therefore, it is important to define an appropriate Type I error threshold, as well as to select an effective multiple testing procedure to control this error rate and account for the joint distribution of the test statistics.
To correct for the occurrence of false positives, validation tests based on multiple testing corrections and resampling techniques (i.e. permutation-based tests) are frequently used. Although both strategies aim to control Type I error, these techniques implement different approaches to estimating errors and rejecting null hypotheses. Traditional multiple-testing corrections, such as Bonferroni and its variations, adjust P-values derived from multiple statistical tests to correct for the occurrence of false positives [4]. The Benjamini and Hochberg (B&H) correction ranks P-values in ascending order, multiplies each by the number of features, and divides it by its corresponding rank [5]. The permutation test resamples N times the total number of observations, in a population sample, to build an empirical estimate of the null distribution from which the test statistic has been drawn [6]. In the end, the application of these methods leads to either the rejection or acceptance of the null hypothesis. The Bonferroni correction is known to be extremely conservative. It can lead to Type II (i.e. false negative) errors of unacceptable levels, which may contribute to publication bias and the exclusion of potentially relevant hypotheses (e.g. significant differential expression between patient groups or genotype-phenotype associations) [7]. In contrast, B&H is less stringent, which may lead to the selection of more false positives [5]. Unlike Bonferroni and B&H, permutation tests do not use individual association scores based on family-wise corrections [8]. Instead, permutation-based tests estimate statistical significance directly from the data being analysed. More importantly, irregularities of the observed data are maintained in the permuted data sets and are included in the estimation of the permutation probability [9].
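The two P-value adjustments described in this paragraph can be sketched in a few lines of Python. This is only an illustration, not the Ptest implementation: the function names are hypothetical, and the running-minimum step in `benjamini_hochberg`, which keeps the adjusted values monotone, is a common convention assumed here rather than something stated in the text.

```python
def bonferroni(p_values):
    """Multiply each raw P-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """B&H adjustment: P * m / rank, with ranks taken in ascending order of P.

    Assumption (not in the text): a running minimum from the largest rank
    downwards keeps the adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A feature would then be retained when its adjusted P-value falls below the chosen significance level, e.g. 0.05. Because Bonferroni multiplies every P-value by the full number of tests while B&H divides by the rank, the B&H values are never larger than the Bonferroni ones.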
To date, permutation tests have become widely accepted and recommended in studies that involve multiple statistical testing [3, 6, 7]. Despite their power, currently available tools, such as TIGR MeV [10], offer permutation tests to estimate P-values for either the t-test or ANOVA only. Another example is GeneSpring [11], which offers a permutation test for multiple testing for either t-test or ANOVA test statistics only. These and other published tools do not offer multiple-testing solutions for categorical data, such as the Chi-square test. This test is appropriate for the analysis of SNP (single nucleotide polymorphism) data to identify significant patterns of genetic variability, i.e. variation-phenotype associations. Another important statistical significance assessment technique not available in well-known open-source tools is the Bartlett test, which may be used for testing equality of variances or the significance of data dispersion differences across groups. Moreover, the Bartlett test should also be used before attempting the calculation of either ANOVA or the t-test, as they assume that variances are equal across groups or samples.
Statistical tests provided by the Ptest software.
Goal  Measure  Data type  Test

To compare two unpaired groups  Mean  Numerical  Unpaired t-test
To compare two paired groups  Mean  Numerical  Paired t-test
To compare two or more unmatched groups  Proportions  Categorical  Chi-square test
To compare two groups  Variance  Numerical  Bartlett's test
To compare three or more unmatched groups  Mean  Numerical  ANOVA
Implementation
The software is a Java-based, command-line tool [see Additional files 1 and 2]. Input data are presented in a plain text file, where rows represent samples and columns represent features (Figure 1). The maximum number of groups to be compared is two, with two exceptions: the Chi-square test, for categorical data, and the ANOVA test, for numerical data, which permit the comparison of more than two groups. These requirements have been defined because they cover most of the typical multiple-testing applications in gene expression and SNP data analysis. New functionality (e.g. a Windows interface or other relevant tests) could be added based on future requirements and additional external user feedback.
Statistical tests
The tool puts the following statistical tests at the user's disposal: Student's t-test for numerical data, two classes; Bartlett's test for numerical data, two classes; the ANOVA test for numerical data, three or more classes; and the Chi-square test for categorical data, two or more classes. For detailed information about each test, please refer to the National Institute of Standards and Technology/Semiconductor Manufacturing Technology e-Handbook of Statistical Methods [12].
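As one concrete example of the tests listed here, Bartlett's statistic can be computed from the group variances alone. The sketch below follows the textbook form of the statistic given in references such as the e-Handbook [12]; it is not taken from the Ptest source, and the function name is hypothetical.

```python
import math
from statistics import variance


def bartlett_statistic(groups):
    """Bartlett's test statistic for equality of variances across k groups.

    Each group needs at least two values and a non-zero sample variance.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    total = sum(sizes)
    variances = [variance(g) for g in groups]  # sample variances (n - 1)
    # Pooled variance, weighted by each group's degrees of freedom.
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (total - k)
    numerator = (total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (total - k)) / (3 * (k - 1))
    return numerator / correction
```

Under the null hypothesis of equal variances the statistic is approximately Chi-square distributed with k - 1 degrees of freedom, so larger values indicate stronger evidence that the variances differ.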
Multiple hypotheses testing procedures
Given the number of samples (N), the number of classes (C), the number of features (F), the significance level (S), the number of permutations (B), and the test statistic (T(obs)), the validation methods are as described below:
Multiple testing correction: P-values, according to the test statistic and degrees of freedom (N - 2), were obtained and adjusted under the Bonferroni and B&H multiple testing corrections [5].
Permutation test: the test statistic T(obs) is estimated from the original data set; the sample labels are shuffled B times and a permuted statistic T(obs)' is obtained each time; if T(obs) < T(obs)', a counter T(per) is increased by 1. The probability that T(obs) occurred by chance alone is T(per)/B.
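The permutation procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the Ptest code: the function names, the seeded random generator, and the example statistic (absolute difference between two group means, with labels 0 and 1) are all hypothetical choices; only the shuffle-count-divide logic comes from the description above.

```python
import random


def abs_mean_difference(values, labels):
    """Example statistic (assumed): absolute difference of the two group means."""
    group_a = [v for v, lab in zip(values, labels) if lab == 0]
    group_b = [v for v, lab in zip(values, labels) if lab == 1]
    return abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))


def permutation_p_value(values, labels, statistic, n_permutations, seed=0):
    """Estimate the probability that T(obs) arose by chance as T(per) / B."""
    rng = random.Random(seed)           # seeded only to make runs repeatable
    t_obs = statistic(values, labels)   # statistic on the original labelling
    shuffled = list(labels)
    t_per = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)           # reassign sample labels at random
        if t_obs < statistic(values, shuffled):
            t_per += 1                  # permuted statistic exceeded T(obs)
    return t_per / n_permutations
```

For one feature measured on six samples, a call might look like `permutation_p_value([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1], abs_mean_difference, 1000)`. In a multiple-testing setting the same call would be repeated for each of the F features, and a feature would be selected when the returned value is below the significance level S.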
Software usage
Results
Results of analyses of statistical tests. The last four columns give the number of features selected according to the raw P-value and to each multiple test correction (PT: permutation test).
Test  Data description  Groups  Features  Samples  Raw P-value  Bonferroni  B&H  PT

Bartlett  Microarray, numerical  2  8068  12  526  1  1  327
t-test  Microarray, numerical  2  8068  12  1413  2  39  1398
Chi-square  Single nucleotide polymorphisms (SNP), categorical  3  334  33  153  8  131  153
ANOVA  Microarray, numerical  3  14976  37  6371  9  3331  6262
Testing data sets
 1) A microarray data set (for numerical data analysis) generated by a study in dilated cardiomyopathy was obtained from the GEO (Gene Expression Omnibus) [13], accession number GDS2205. It was composed of 12 samples: 5 from non-failing hearts and 7 from DCM patients.
 2) A genotype data set (for categorical data analysis) was obtained from the Single Nucleotide Polymorphism database (SNPdb) [14]. This data set was composed of 33 samples: 10 from African-American people, 12 from European-American people and 11 from Han Chinese people.
 3) A microarray data set (oligo array) generated by a study in heart failure was obtained from the GEO, accession number GDS1362. It was composed of 37 samples: 7, 20 and 10 samples were obtained from non-failing hearts, DCM hearts, and ischemic cardiomyopathy (ICM) patients, respectively.
Data preprocessing
Microarray data: probe sets with absent calls in more than 50% of their transcripts were discarded. Transcripts of probe sets corresponding to similar gene symbols were averaged. Data were normalised per chip and then per gene. Values were transformed using the mean and standard deviation of the row (per gene) or column (per chip). Genotype data did not require preprocessing.
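The transformation "using the mean and standard deviation" is read here as a standard z-score, which is an assumption; the text does not say whether the sample or population standard deviation was used, and the sketch below uses the sample form. The function names are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Centre on the mean and scale by the sample standard deviation.

    Raises ZeroDivisionError for a constant row, which would need
    filtering beforehand.
    """
    centre = mean(values)
    spread = stdev(values)
    return [(v - centre) / spread for v in values]


def standardize_rows(matrix):
    """Apply the transformation to each row (e.g. each gene) of a matrix."""
    return [standardize(row) for row in matrix]
```

Per-chip normalisation would apply the same `standardize` function to columns instead, for example via `zip(*matrix)`.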
Statistical analyses
The first analysis calculated Bartlett's test statistic to determine whether the variances of two experimental groups, from a microarray data set [see Additional file 4], were equal. The null hypothesis of this analysis was that there was no significant difference between the variances of the two groups, and the significance level to reject the null hypothesis was set to 0.05. The data set was composed of 12 samples: 5 from non-failing hearts and 7 from DCM patients. Out of 8068 genes, 526 had raw P-values under the significance level (P < 0.05, before correction for multiple testing), one gene was under the significance level after correcting with Bonferroni, and one gene was under the significance level after correcting with B&H. After performing the permutation test, however, 327 genes were found to differ significantly in variance between the groups (P < 0.05). That is, for most genes the two groups being compared exhibit equal variances, which is commonly expected in typical microarray data analyses.
The second analysis implemented the t-test (two-sample, equal variances, two-tailed) to estimate the potential statistically significant difference between the means of two (normally distributed) experimental groups, from the same microarray data set analysed above [see Additional file 5]. The null hypothesis of this analysis was that there was no difference between the means of the two groups, and the significance level to reject the null hypothesis was set to 0.05. In this case, the raw P-values of 1413 genes were under the significance level (P < 0.05), 39 genes were under the significance level after correcting with B&H, and only two genes were under the significance level after correcting with Bonferroni. These results were consistent with our expectations: B&H identified more genes than Bonferroni did, which shows that the former tends to be less stringent. After performing the permutation test, 1398 genes were found to be significantly differentially expressed (P < 0.05). In addition, we noted that the raw P-values of some of the genes filtered out by Bonferroni were well below the significance level, i.e. they were potentially significant under a less conservative correction approach. For example, the raw P-values of ACVR1 and CFHR1 were 0.0004 and 0.004, respectively, and their P-values after Bonferroni correction were above 0.9. However, based on the permutation-based test, these two genes fall below the significance threshold (corrected P-values: 0.0001 and 0.001 for ACVR1 and CFHR1, respectively). This, as expected, shows the statistical power of permutation-based procedures for multiple testing.
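For reference, the two-sample, equal-variance t statistic used in this analysis can be written out directly. The sketch below is the textbook pooled-variance form, not the Ptest code, and the function name is hypothetical; the degrees of freedom it returns correspond to the N - 2 mentioned in the description of the multiple testing corrections.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Two-sample t statistic assuming equal variances, with its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting by degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df
```

With 5 non-failing and 7 DCM samples per gene, the statistic would have 10 degrees of freedom; its absolute value is what a two-tailed test compares against the reference distribution.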
The third analysis implemented the Chi-square test on categorical data derived from a genetic variation data set (SNPs) [see Additional file 6]. The problem was to determine statistically significant genetic variations among the SNPs of three ethnic groups: African-American, European-American and Han Chinese. The data encode genotype values for each SNP under each group [15]. This data set was composed of 33 samples: 10 from African-Americans, 12 from European-Americans and 11 from Han Chinese people. The null hypothesis of this analysis was that there were no differential genetic variations among the three groups, and the significance level to reject the null hypothesis was set to 0.05. In this case the raw P-values of 153 SNPs, out of 334, were under the significance level (P < 0.05). Bonferroni correction identified only eight SNPs whose P-values were below the significance level, and B&H correction identified 131 such SNPs. In contrast, the permutation test identified more features than B&H: 153 SNPs with significant P-values. These results are consistent with the results reported by Carlson et al. (2003) [16], who found that only 48% of the SNPs were shared by African-Americans and European-Americans. In our study, the permutation-based adjustment found that 55% of SNPs showed no significant differences among the three populations being analysed. These results again confirm the statistical power of permutation-based procedures for multiple testing.
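To make the per-SNP computation concrete, Pearson's Chi-square statistic for a groups-by-genotypes contingency table can be sketched as below. The function name and the example counts in the usage note are hypothetical, not data from the study, and the sketch assumes every row and column total is non-zero.

```python
def chi_square_statistic(table):
    """Pearson's Chi-square statistic and degrees of freedom for a count table.

    `table` holds one row per group and one column per category
    (e.g. genotype); all row and column totals are assumed non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and category.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom
```

For a single SNP, a made-up table such as `[[6, 3, 1], [2, 5, 5], [1, 4, 6]]` (three groups of 10, 12 and 11 samples across three genotypes) would give a statistic with 4 degrees of freedom.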
A fourth analysis implemented the ANOVA test to estimate the potential statistically significant difference between the means of three (normally distributed) experimental groups. Samples in this data set were obtained from heart tissue of healthy donors, as well as from donors suffering from either dilated or ischemic cardiomyopathy [see Additional file 7]. We used the ANOVA test to look for possible outstanding differences among the three populations evaluated, because the t-test is designed to perform pairwise comparisons only. The null hypothesis of this ANOVA analysis was that there were no differences between the means of the three groups, and the significance level to reject the null hypothesis was set to 0.05. In this case, the raw P-values of 6371 genes were under the significance level (P < 0.05), 3331 genes were under the significance level after correcting with B&H, and only nine genes were under the significance level after correcting with Bonferroni. After performing the permutation test, 6262 genes were found to be significantly differentially expressed (P < 0.05). The genes reported as significantly differentially expressed after correcting via Bonferroni were not included in the set of potentially significant genes detected by the permutation test. In addition, we compared our results against those previously reported by Kittleson et al. (2005) [17] and found that most genes reported by them as significantly differentially expressed were also below the significance level when our permutation test was performed, or when P-values were corrected via the B&H method. In contrast, only one of the genes reported by Kittleson et al. was also below the significance level after we corrected with Bonferroni. Perhaps this analysis showed the real strength that the permutation test has to identify potential biomarkers of disease.
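The one-way ANOVA statistic behind this analysis is the ratio of between-group to within-group mean squares. The sketch below is the textbook form, offered as an illustration rather than the Ptest code; the function name is hypothetical, and it assumes the groups are not all constant.

```python
from statistics import mean


def anova_f_statistic(groups):
    """One-way ANOVA F statistic with (between, within) degrees of freedom."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, (df_between, df_within)
```

For the three groups of 7, 20 and 10 samples analysed here, each gene's statistic would have 2 and 34 degrees of freedom.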
Conclusion
The techniques for multiple testing offered here through a platform-independent tool are relevant to a variety of data analysis tasks in biology and medicine. The results also allowed us to illustrate the power of a permutation test for multiple hypotheses assessment procedures and for controlling the rate of Type I errors. We also demonstrated that even when P-values were corrected via B&H, which is considered a less stringent method than Bonferroni, a number of potentially significant features were dismissed. The software is easy to use and it offers the basis for future extensions. Another key contribution is the implementation of multiple hypotheses statistical testing techniques for both numerical and categorical data. The analytical options offered can be applied to support a significant spectrum of hypothesis testing tasks in functional genomics, e.g. fast detection of significantly differentially expressed genes and genotypes. Moreover, to the best of our knowledge, this is the first open-source software tool freely available for supporting less traditional genomic applications, such as the detection of between-group differences on the basis of SNPs. In this area multiple-testing procedures have traditionally relied on very stringent adjustment approaches (e.g. Bonferroni).
Despite its simplicity in terms of usability, this tool offers the following advantages in comparison with others, such as GeneSpring and TIGR MeV: it is freely available (as is TIGR MeV), has no installation cost, is easy to use, and is computationally inexpensive. Moreover, it allows the calculation of traditional statistical tests and multiple testing with categorical data, as well as test- and distribution-independent permutation-based tests.
We expect to continue expanding the tool with alternative statistical significance measures, such as Fisher's exact test, Z or Wald scores. We welcome additional user feedback after the publication of this article.
Availability and requirements
Project name: Permutation-based statistical tests for multiple hypotheses
Project home page: http://rosalind.infj.ulst.ac.uk/CWB/Ptest.html
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.5.1 or higher
License: None
Any restrictions to use by nonacademics: None
Abbreviations
ANOVA: Analysis of variance
DCM: Dilated cardiomyopathy
SAM: Significance Analysis of Microarrays
NSC: Nearest Shrunken Centroid
SNP: Single nucleotide polymorphism
Declarations
Acknowledgements
We thank the two anonymous reviewers for their comments, which allowed us to improve the quality of the manuscript and software. This work was supported in part by a grant from EU FP6, CARDIOWORKBENCH project (http://www.medinfo.dist.unige.it/CWB1/), to FA.
References
1. Barth AS, Kuner R, Buness A, Ruschhaupt M, Merk S, Zwermann L, Kääb S, Kreuzer E, Steinbeck G, Mansmann U, Poustka A, Nabauer M, Sültmann H: Identification of a common gene expression signature in dilated cardiomyopathy across independent microarray studies. J Am Coll Cardiol. 2006, 48: 1610-7. 10.1016/j.jacc.2006.07.026.
2. Mathur P, Kaga S, Zhan L, Das DK, Maulik N: Potential candidates for ischemic preconditioning-associated vascular growth pathways revealed by antibody array. Am J Physiol Heart Circ Physiol. 2005, 288 (6): H3006-10. 10.1152/ajpheart.01203.2004.
3. Dudoit S, Shaffer JP, Boldrick JC: Multiple hypotheses testing in microarray experiments. Statistical Science. 2003, 18 (1): 71-103. 10.1214/ss/1056397487.
4. Feilotter H: A Biologist's guide to analysis of DNA microarray data. Am J Hum Genet. 2002, 71 (6): 1483-1484. 10.1086/344458.
5. Multiple Testing Corrections. [http://www.chem.agilent.com/cag/bsp/sig/downloads/pdf/mtc.pdf]
6. Belmonte M, Yurgelun-Todd D: Permutation testing made practical for functional magnetic resonance image analysis. IEEE Trans Med Imaging. 2001, 20 (3): 243-8. 10.1109/42.918475.
7. Nakagawa S: A farewell to Bonferroni: the problems of low statistical power and publication bias. Behav Ecol. 2004, 15: 1044-1045. 10.1093/beheco/arh107.
8. Kimmel G, Jordan MI, Halperin E, Shamir R, Karp RM: A randomization test for controlling population stratification in whole-genome association studies. Am J Hum Genet. 2007, 81 (5): 895-905. 10.1086/521372.
9. Cheverud JM: A simple correction for multiple comparisons in interval mapping genome scans. Heredity. 2001, 87: 52-58. 10.1046/j.1365-2540.2001.00901.x.
10. Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, Sturn A, Snuffin M, Rezantsev A, Popov D, Ryltsov A, Kostukovich E, Borisovsky I, Liu Z, Vinsavich A, Trush V, Quackenbush J: TM4: a free, open-source system for microarray data management and analysis. Biotechniques. 2003, 34 (2): 374-8.
11. GeneSpring GX Software. [http://www.chem.agilent.com/Scripts/PDS.asp?lPage=27881]
12. NIST/SEMATECH e-Handbook of Statistical Methods. [http://www.itl.nist.gov/div898/handbook]
13. Gene Expression Omnibus (GEO). [http://www.ncbi.nlm.nih.gov/geo]
14. Single Nucleotide Polymorphism database (SNPdb). [http://www.ncbi.nlm.nih.gov/projects/SNP/index.html]
15. Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR: Whole-genome patterns of common DNA variation in three human populations. Science. 2005, 307 (5712): 10523. 10.1126/science.1105436.
16. Carlson CS, Eberle MA, Rieder MJ, Smith JD, Kruglyak L, Nickerson DA: Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat Genet. 2003, 33 (4): 518-21. 10.1038/ng1128.
17. Kittleson MM, Minhas KM, Irizarry RA, Ye SQ, Edness G, Breton E, Conte JV, Tomaselli G, Garcia JG, Hare JM: Gene expression analysis of ischemic and nonischemic cardiomyopathy: shared and distinct genes in the development of heart failure. Physiol Genomics. 2005, 21 (3): 299-307. 10.1152/physiolgenomics.00255.2004.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.