ParaHaplo: A program package for haplotypebased wholegenome association study using parallel computing
 Kazuharu Misawa^{1}Email author and
 Naoyuki Kamatani^{2}
DOI: 10.1186/1751047347
© Misawa and Kamatani; licensee BioMed Central Ltd. 2009
Received: 26 August 2009
Accepted: 21 October 2009
Published: 21 October 2009
Abstract
Background
Since more than a million singlenucleotide polymorphisms (SNPs) are analyzed in any given genomewide association study (GWAS), performing multiple comparisons can be problematic. To cope with multiplecomparison problems in GWAS, haplotypebased algorithms were developed to correct for multiple comparisons at multiple SNP loci in linkage disequilibrium. A permutation test can also control problems inherent in multiple testing; however, both the calculation of exact probability and the execution of permutation tests are timeconsuming. Faster methods for calculating exact probabilities and executing permutation tests are required.
Methods
We developed a set of computer programs for the parallel computation of accurate Pvalues in haplotypebased GWAS. Our program, ParaHaplo, is intended for workstation clusters using the Intel Message Passing Interface (MPI). We compared the performance of our algorithm to that of the regular permutation test on JPT and CHB of HapMap.
Results
ParaHaplo can detect smaller differences between 2 populations than SNPbased GWAS. We also found that parallelcomputing techniques made ParaHaplo 100fold faster than a nonparallel version of the program.
Conclusion
ParaHaplo is a useful tool in conducting haplotypebased GWAS. Since the data sizes of such projects continue to increase, the use of fast computations with parallel computingsuch as that used in ParaHaplowill become increasingly important. The executable binaries and program sources of ParaHaplo are available at the following address: http://sourceforge.jp/projects/parallelgwas/?_sl=1
Background
Recent advances in highthroughput genotyping technologies have allowed us to test allele frequency differences between case and control populations on a genomewide scale [1]. Genomewide association studies (GWAS) are used to compare the frequency of alleles or genotypes of a particular variant between disease cases and controls, across a given genome. A common approach is to test for differences in the allele frequencies of every singlenucleotide polymorphism (SNP) between the case and the control populations, by using the chisquare test [2–4]. The chisquare test uses the Pearson score, which increases as the difference in allele frequency between 2 populations increase. The chisquare test evaluates the Pearson score by way of the chisquare distribution.
One crucial problem in conducting SNPbased GWAS is performing corrections for multiple comparisons. A Bonferroni correction for a Pvalue is usually used to account for multiple testing under the assumption that all SNPs are independent. When SNP loci are in linkage disequilibrium, Bonferroni corrections are known to be too conservative and SNPbased GWAS may exclude truly significant SNPs [5, 6].
To address the multiplecomparison problem in GWAS, Misawa et al. [5] have developed new algorithms to correct for multiple comparisons at multiple SNP loci in a linkage disequilibrium, by treating linked loci as one haplotype block. This approach can be referred to as haplotypebased GWAS. In the present study, a haplotype refers to a list of alleles at multiple linked polymorphic loci, while a haplotype copy denotes a list of alleles within a gamete. Misawa et al. [5] developed a method of calculating the exact probability of a typeI error of haplotypebased GWAS, under the conditions that the haplotype frequencies in the population are known and the number of haplotype copies in the sample follows a multinomial distribution. Since this algorithm calculates all possible terms, the complexity of the computational time of this exact test is O(2n_{1}! 2 n_{2}!), where n_{1} is the sample size of the case population and n_{2} is the sample size of the control population. When the numbers of cases and controls exceed 50, such exact probabilities cannot be calculated, since they require too much time. As an alternative method, Misawa et al. [5] developed algorithms to asymptotically calculate the typeI error rates using a Markovchain Monte Carlo (MCMC) sampler that provides a good approximation to values calculated by the exact method. The computational complexity of the MCMC algorithm is O(Nnm), where N is the number of generations, n is the total sample size, n = n_{1} + n_{2}, and m is the number of loci.
The permutation test can also mitigate haplotypebased GWAS [6]. In the standard permutation test (SPT) for SNPbased GWAS, the test proceeds as follows. First, the Pearson score is calculated from the allele frequencies of the 2 populations at an SNP site; this score is the observed value of the Pearson score, S. Next, the 2 populations are pooled. The Pearson score is then calculated from the allele frequencies and recorded by randomly dividing these pooled values into two groups of size, n_{1} and n_{2}. The onesided Pvalue of the test is calculated as the proportion of sampled permutations where the Pearson score was greater than or equal to S. When SPT is applied to haplotypebased GWAS, haplotype copies of 2 populations are permuted, and Pearson scores are calculated for each SNP. The time complexity of the algorithm is O(Nnm). To search for SNPs whose Pvalues are lower than p, at least 1/p permutations are needed; therefore, the time complexity can be written as O(nm/p). For instance, to reach a P value of 10^{6} in a study that contains 1,000 cases and 1,000 controls with 10,000 loci, 10^{13} basic computer operations are required. Obviously, scaling up to larger studies comprising 100,000 loci is completely unattainable [6]. Therefore, to obtain small Pvalue bounds, one must expend a great deal of computational effort. By using importance sampling, Kimmel and Shamir [6] developed the rapid association test (RAT); much less effort is needed to achieve accurate and very low Pvalues by RAT than by SPT. The complexity of the running time for RAT is O(nb + N nc), where N is the number of permutations drawn by RAT, b is a predefined sampling constant, and c is the upper bound on the distance in SNPs between linked loci.
When the penetrance of a disease is small, a large number of SNPs from a large number of individuals from both case and control populations must be genotyped, to detect diseaseassociated genes [7]. There are currently more than a million SNPs for which accurate and complete genotypes have been obtained [8, 9]; thus, neither the MCMC algorithm [5] nor the RAT algorithm [6] can obtain a haplotypeGWAS result.
To conduct haplotypeGWAS within a short time period, we developed ParaHaplo, a parallelcomputation program that performs precisely these 2 functions for GWAS. ParaHaplo is based on data parallelism, a programming technique used to split large datasets into smaller datasets that can be run in a parallel, concurrent fashion [10]. ParaHaplo was developed on the basis of the Intel Message Passing Interface (MPI) and runs on PC clusters. ParaHaplo is a set of computer programs for SPT, RAT, MCMC, and the exact test, based on parallel computation. ParaHaplo is intended for use on workstation clusters using the Intel MPI, as well as on singleprocessor machines.
Using ParaHaplo, we conducted haplotypebased GWAS and SNPbased GWAS, to determine differences between Japanese in Tokyo, Japan (JPT) and Han Chinese in Beijing, China (CHB) of the HapMap dataset [11], because there are known to be small differences between JPT and CHB [9, 12]. We compared the speed of calculation of our algorithm with that of the regular permutation test on chromosome 22 of JPT and CHB of HapMap.
Implementation
Software overview
ParaHaplo supports the HapMap data format [8], as well as those of the Dhaplo DB [13] and BioBank Japan [8]. ParaHaplo requires an input file of the haplotype block boundary, as well as 2 datasets of population data. ParaHaplo can conduct either haplotypebased GWAS or SNPbased GWAS; in the case of the former, the data must be phased. ParaHaplo tests differences in allele frequency between 2 populations, e.g., a case population and a control population. ParaHaplo outputs the Pearson score for a chisquare test; a user can create ParaHaplo output by using a commandline option.
The ParaHaplo package includes both the calculation of exact probabilities and the algorithm to calculate asymptotically typeI error rates using a MCMC sampler [2]. Permutation tests SPT and RAT are also included in the ParaHaplo package. RAT is a fast algorithm for computing Pvalues in association studies and is based on an importance sampling developed by Kimmel and Shamir (2006) [3]. We did not incorporate the original source programs of Kimmel and Shamir (2006) into the package. The global typeI error is then obtained from the local typeI error by using a Bonferroni correction, because different haplotype blocks are considered independent of each other.
Parallel computing using MPI methods
ParaHaplo is implemented in an MPIC multithreaded package. The MPI package allows us to construct parallel computing programs on multiprocessors. The genomewide polymorphism data are broken into userdefined haplotype blocks, and the MPI Bcast function is used to distribute a single set of haplotype block data into each processor. The haplotype frequency data of 1 haplotype block are analyzed by a single processor; in this step, the probability of a local typeI error is calculated, given the significance level at each SNP locus.
Once the analysis of each haplotype block is complete, the results are compiled into a single genomewide dataset by using the MPIGatherv function. ParaHaplo is compatible with OpenMPI version 1.2.5, as well as MPICH version 1.2.7p1. Users can compile the source with a GCC compiler or an Intel C compiler. For singleprocessor machines, both the C and Java programs are also available.
Method
Hardware
When computational time was measured, we used a CentOS PC cluster at RIKEN comprising 1,024 nodes, each of which had a 1.6GHz Core2duo processor. On this PC cluster, 2,048 threads can be processed in parallel. The program was compiled by an Intel C compiler. Numbers of processing unit(s) used were 1, 64, 128, 256, 512, 768, and 1536.
Example data
As an example of GWAS, we applied ParaHaplo to compare genomewide haplotype frequencies between JPT and CHB of HapMap [11]; the number of individuals therein were 44 and 45, respectively. In this study, we conducted 100,000 generations for RAT. Haplotype blocks were obtained as LD blocks, according to the method of Gabriel et al. [14] and using the Haploview program [15]. The entire genomes of JPT + CHB were divided into 106,149 haplotype blocks by Haploview [15].
Results
Haplotypebased GWAS between JPT and CHB
List of haplotype blocks whose haplotype frequencies are significantly different (P < 0.01) between CHB and JPT.
Chromosome  Position  Haplotype block  Number of SNPs  HighScore SNP  Score  Global Pvalue  Gene Name  Biological Function  

2  24871212  2486928924909832  26  rs41523444  33.0  0.00774  CENPO  intron  
2  205189112  205070386205212654  69  rs12621708  33.2  0.00906  PARD3B  intron  
5  18748116  1874054318750444  5  rs11959018  32.1  0.00277  
7  134114362  134099273134115148  10  rs3807337  29.5  0.007  CALD1  intron  
8  81952476  8191293481997651  35  rs10957985  33.9  *  0.0083  
9  103425694  103425657103427263  5  rs10115450  43.5  *  0.00001  GRIN3A  intron 
11  61080499  6104963361102485  18  rs4939526  32.1  0.00421  SYT7  intron  
11  115945221  115943365115950641  8  rs4938285  31.3  0.00819  
12  87163821  8714970787191176  12  rs11104775  30.4  0.00762  
13  72589237  7258914972599947  3  rs1333099  32.1  0.00186  
15  59355415  5930635459363106  22  rs7175875  32.7  0.00294  
18  74579176  7456264574583266  10  rs5022079  30.0  0.00648  
22  35929011  3592743635929568  3  rs229562  28.0  0.00711 
We found 5 genes on haplotype blocks whose haplotype frequencies were significantly different between JPT and CHB as shown in Table 1. According to OMIM [16], CENPO is a subunit of a CENPHCENPIassociated centromeric complex that targets CENPA to centromeres and is required for proper kinetochore function and mitotic progression [17]. In Drosophila, PARD3B regulates cell polarization and is precisely regulated by 2 apically localized multiprotein signaling complexes that are tethered by Inscuteable, which regulates the apical localization [18]. CALD1 is a potential actomyosin regulatory protein found in smooth muscle and nonmuscle cells [19]. GRIN3A encodes a subunit of the Nmethyldaspartate (NMDA) receptors; it functions in physiological and pathological processes in the central nervous system [20]. SYT7 is a brainspecific, calciumdependent phospholipidbinding protein that plays a role in synaptic exocytosis and neurotransmitter release [21]
Calculation time
Elapsed times and speedups obtained with ParaHaplo on the HapMap 3 JPT data and CHB of chromosome 22
Number of Processing Units  Calculation Time  Speed Ratio ^{a}  

1  1 h  19 m  58 s  1 
64  3 m  41 s  22  
128  2 m  1 s  40  
256  1 m  25 s  56  
512  53 s  91  
768  47 s  101  
1536  41 s  116 
Discussion
We developed ParaHaplo, a set of computer programs for the parallel computation of accurate Pvalues in haplotypebased GWAS. ParaHaplo is intended for use in workstation clusters using the Intel MPI. By using ParaHaplo, we conducted haplotypebased GWAS as well as SNPbased GWAS, to find differences between JPT and CHB of the HapMap dataset [11].
Differences between Japanese in Tokyo, Japan, and Han Chinese in Beijing, China
We compared the performance of our algorithm with that of the regular permutation test in comparing JPT and CHB of HapMap. By using haplotypebased GWAS, a total of 13 haplotype blocks were found to exhibit significant differences in haplotype frequency between JPT and CHB; meanwhile, by using SNPbased GWAS, only 2 SNPs were significantly different. The results suggest that ParaHaplo can detect smaller differences between 2 populations than SNPbased GWAS. Accounting for differences in substructure is necessary for improving error rates in association studies [22, 23].
Natural selection is considered to be one of the causes of change in allele frequency; however, these haplotype blocks did not overlap the regions suspected of being influenced by natural selection Table 1 lists the genes on the haplotype blocks that have SNPs in which the JPT and CHP haplotype frequencies were significantly different (P < 0.01). It is unclear that the biological functions of the genes are different among JPT and CHB people.
Differences could also be caused by genetic drift [24], which brings about a change in allele frequency over time in a population, as a result of random sampling and chance. Archeological data suggest that there were probably 2 migratory waves of incoming people to Japan, both from the Asian continent. The first migration took place about 38,00037,000 years ago, before the Pleistocene land bridges became submerged. The last ice age ended and sea levels increased around 12,000 years ago; at this point, the Japanese people became isolated from the people of mainland Asia [25]. In the 12,000 years since then, the allele frequencies of the haplotype blocks listed on Table 1 may have changed due to genetic drift. Further studies may be necessary to determine whether these differences were maintained by natural selection, genetic drift, or both.
It is generally difficult to assess how many steps are necessary for the convergence of the MCMC algorithm and RAT. In this study, we conducted 7 runs of RAT for 100,000 generations by using different sets of processing units (table 2). We found the results of these runs were essentially the same; therefore, we considered 100,000 generations to comprise a sufficiently large dataset. We recommend monitoring convergence by comparing several independent runs.
Parallel computation of haplotypebased GWAS
The results showed that the parallel computing of ParaHaplo was 100fold faster than nonparallel programs. In this paper, we used only 89 JPT + CHB individuals whose genotypes had been determined by the HapMap project [11]. When a single processor was used, RAT for chromosome 22 took more than 1 h; if 9,000 individuals were analyzed under the same conditions, it would take approximately 5 days. In contrast, when multiple processors were used, RAT for chromosome 22 took less than 1 min; an analysis of 9,000 individuals under the same conditions would take approximately 1 h.
There are 1,536 haplotype blocks in chromosome 22. The speedup ratio was only 116 because of variations in the LD block size. Since ParaHaplo is based on data parallelism, the computational times of each of the RAT, SPT, and MCMC methods was proportional to the number of SNPs within the LD block [5, 6]; as a result, a large LD block becomes a computational bottleneck. To archive faster parallel computing of haplotypebased GWAS, further studies into more finegrained parallelization is required.
Conclusion
The results showed that the parallel computing of ParaHaplo was 100fold faster than that of nonparallel programs when the number of processors is sufficient. There are more than a million SNPs for which accurate and complete genotypes have been obtained, and thousands of people are now being genotyped [8, 9]. Since the data sizes of such projects continue to increase, the use of fast computations with parallel computingsuch as that used in ParaHaplowill become increasingly important.
Availability and requirements

Project name: ParaHaplo

Project home page: http://sourceforge.jp/projects/parallelgwas/?_sl=1

Operating systems: Platform independent

Programming language: Java and C

Other requirements: OpenMPI version 1.2.5, or MPICH version 1.2.7p1

License: MIT license

Any restrictions to use by nonacademics: License required
List of abbreviations used
 RAT:

Rapid Association Test
 SPT:

Standard Permutation Test
 MCMC:

Markovchain Monte Carlo
 JPT:

Japanese Tokyo
 CHB:

Han Chinese Beijing.
Declarations
Acknowledgements
We thank Drs. Yumi YamaguchiKabata, Akihiro Fujimoto, and Tatsuhiko Tsunoda for their useful comments. The present study was supported in part by grants from the Research Project for Personalized Medicine (MEXT). This work was supported by the National Project "Nextgeneration Integrated Living Matter Simulation" of the Ministry of Education, Culture, Sports, Science, and Technology (MEXT).
Authors’ Affiliations
References
 Hirschhorn JN, Daly MJ: Genomewide association studies for common diseases and complex traits. Nat Rev Genet. 2005, 6: 95108. 10.1038/nrg1521.View ArticlePubMedGoogle Scholar
 Ozaki K, Ohnishi Y, Iida A, Sekine A, Yamada R, Tsunoda T, Sato H, Hori M, Nakamura Y, Tanaka T: Functional SNPs in the lymphotoxinalpha gene that are associated with susceptibility to myocardial infarction. Nat Genet. 2002, 32: 650654. 10.1038/ng1047.View ArticlePubMedGoogle Scholar
 Onouchi Y, Gunji T, Burns JC, Shimizu C, Newburger JW, Yashiro M, Nakamura Y, Yanagawa H, Wakui K, Fukushima Y, Kishi F, Hamamoto K, Terai M, Sato Y, Ouchi K, Saji T, Nariai A, Kaburagi Y, Yoshikawa T, Suzuki K, Tanaka T, Nagai T, Cho H, Fujino A, Sekine A, Nakamichi R, Tsunoda T, Kawasaki T, Hata A: ITPKC functional polymorphism associated with Kawasaki disease susceptibility and formation of coronary artery aneurysms. Nat Genet. 2008, 40: 3542. 10.1038/ng.2007.59.PubMed CentralView ArticlePubMedGoogle Scholar
 Tokuhiro S, Yamada R, Chang X, Suzuki A, Kochi Y, Sawada T, Suzuki M, Nagasaki M, Ohtsuki M, Ono M, Furukawa H, Nagashima M, Yoshino S, Mabuchi A, Sekine A, Saito S, Takahashi A, Tsunoda T, Nakamura Y, Yamamoto K: An intronic SNP in a RUNX1 binding site of SLC22A4, encoding an organic cation transporter, is associated with rheumatoid arthritis. Nat Genet. 2003, 35: 341348. 10.1038/ng1267.View ArticlePubMedGoogle Scholar
 Misawa K, Fujii S, Yamazaki T, Takahashi A, Takasaki J, Yanagisawa M, Ohnishi Y, Nakamura Y, Kamatani N: New correction algorithms for multiple comparisons in casecontrol multilocus association studies based on haplotypes and diplotype configurations. J Hum Genet. 2008, 53: 789801. 10.1007/s1003800803120.View ArticlePubMedGoogle Scholar
 Kimmel G, Shamir R: A fast method for computing highsignificance disease association in large populationbased studies. Am J Hum Genet. 2006, 79: 481492. 10.1086/507317.PubMed CentralView ArticlePubMedGoogle Scholar
 Ohashi J, Tokunaga K: The power of genomewide association studies of complex disease genes: statistical limitations of indirect approaches using SNP markers. J Hum Genet. 2001, 46: 478482. 10.1007/s100380170048.View ArticlePubMedGoogle Scholar
 Nakamura Y: The BioBank Japan Project. Clin Adv Hematol Oncol. 2007, 5: 696697.PubMedGoogle Scholar
 YamaguchiKabata Y, Nakazono K, Takahashi A, Saito S, Hosono N, Kubo M, Nakamura Y, Kamatani N: Japanese population structure, based on SNP genotypes from 7003 individuals compared to other ethnic groups: effects on populationbased association studies. Am J Hum Genet. 2008, 83: 445456. 10.1016/j.ajhg.2008.08.019.PubMed CentralView ArticlePubMedGoogle Scholar
 Culler DE, Gupta A, Singh JP: Parallel Computer Architecture: A Hardware/Software Approach. 1997, San Francisco, CA: Morgan Kaufmann PublishersGoogle Scholar
 The_International_HapMap_Consortium: The International HapMap Project. Nature. 2003, 426: 789796. 10.1038/nature02168.View ArticleGoogle Scholar
 Tian C, Kosoy R, Lee A, Ransom M, Belmont JW, Gregersen PK, Seldin MF: Analysis of East Asia genetic substructure using genomewide SNP arrays. PLoS ONE. 2008, 3: e386210.1371/journal.pone.0003862.PubMed CentralView ArticlePubMedGoogle Scholar
 Higasa K, Miyatake K, Kukita Y, Tahira T, Hayashi K: DHaploDB: a database of definitive haplotypes determined by genotyping complete hydatidiform mole samples. Nucleic Acids Res. 2007, 35: D685689. 10.1093/nar/gkl848.PubMed CentralView ArticlePubMedGoogle Scholar
 Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, LiuCordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science. 2002, 296: 22252229. 10.1126/science.1069424.View ArticlePubMedGoogle Scholar
 Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21: 263265. 10.1093/bioinformatics/bth457.View ArticlePubMedGoogle Scholar
 Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005, 33: D514517. 10.1093/nar/gki033.PubMed CentralView ArticlePubMedGoogle Scholar
 Okada M, Cheeseman IM, Hori T, Okawa K, McLeod IX, Yates JR, Desai A, Fukagawa T: The CENPHI complex is required for the efficient incorporation of newly synthesized CENPA into centromeres. Nat Cell Biol. 2006, 8: 446457. 10.1038/ncb1396.View ArticlePubMedGoogle Scholar
 Izaki T, Kamakura S, Kohjima M, Sumimoto H: Two forms of human Inscuteablerelated protein that links Par3 to the Pins homologues LGN and AGS3. Biochem Biophys Res Commun. 2006, 341: 10011006. 10.1016/j.bbrc.2006.01.050.View ArticlePubMedGoogle Scholar
 Humphrey MB, HerreraSosa H, Gonzalez G, Lee R, Bryan J: Cloning of cDNAs encoding human caldesmons. Gene. 1992, 112: 197204. 10.1016/03781119(92)90376Z.View ArticlePubMedGoogle Scholar
 Micu I, Jiang Q, Coderre E, Ridsdale A, Zhang L, Woulfe J, Yin X, Trapp BD, McRory JE, Rehak R, Zamponi GW, Wang W, Stys PK: NMDA receptors mediate calcium accumulation in myelin during chemical ischaemia. Nature. 2006, 439: 988992.PubMedGoogle Scholar
 Li C, Ullrich B, Zhang JZ, Anderson RG, Brose N, Sudhof TC: Ca(2+)dependent and independent activities of neural and nonneural synaptotagmins. Nature. 1995, 375: 594599. 10.1038/375594a0.View ArticlePubMedGoogle Scholar
 Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, Gabriel SB, Topol EJ, Smoller JW, Pato CN, Pato MT, Petryshen TL, Kolonel LN, Lander ES, Sklar P, Henderson B, Hirschhorn JN, Altshuler D: Assessing the impact of population stratification on genetic association studies. Nat Genet. 2004, 36: 388393. 10.1038/ng1333.View ArticlePubMedGoogle Scholar
 Tian C, Gregersen PK, Seldin MF: Accounting for ancestry: population substructure and genomewide association studies. Hum Mol Genet. 2008, 17: R143150. 10.1093/hmg/ddn268.PubMed CentralView ArticlePubMedGoogle Scholar
 Kimura M: Evolutionary rate at the molecular level. Nature. 1968, 217: 624626. 10.1038/217624a0.View ArticlePubMedGoogle Scholar
 Pope KO, Terrell JE: Environmental setting of human migrations in the circumPacific region. J Biogeogr. 2008, 35: 121.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.