Software review  Open  Published:
ParaHaplo: A program package for haplotypebased wholegenome association study using parallel computing
Source Code for Biology and Medicinevolume 4, Article number: 7 (2009)
Abstract
Background
Since more than a million singlenucleotide polymorphisms (SNPs) are analyzed in any given genomewide association study (GWAS), performing multiple comparisons can be problematic. To cope with multiplecomparison problems in GWAS, haplotypebased algorithms were developed to correct for multiple comparisons at multiple SNP loci in linkage disequilibrium. A permutation test can also control problems inherent in multiple testing; however, both the calculation of exact probability and the execution of permutation tests are timeconsuming. Faster methods for calculating exact probabilities and executing permutation tests are required.
Methods
We developed a set of computer programs for the parallel computation of accurate Pvalues in haplotypebased GWAS. Our program, ParaHaplo, is intended for workstation clusters using the Intel Message Passing Interface (MPI). We compared the performance of our algorithm to that of the regular permutation test on JPT and CHB of HapMap.
Results
ParaHaplo can detect smaller differences between 2 populations than SNPbased GWAS. We also found that parallelcomputing techniques made ParaHaplo 100fold faster than a nonparallel version of the program.
Conclusion
ParaHaplo is a useful tool in conducting haplotypebased GWAS. Since the data sizes of such projects continue to increase, the use of fast computations with parallel computingsuch as that used in ParaHaplowill become increasingly important. The executable binaries and program sources of ParaHaplo are available at the following address: http://sourceforge.jp/projects/parallelgwas/?_sl=1
Background
Recent advances in highthroughput genotyping technologies have allowed us to test allele frequency differences between case and control populations on a genomewide scale [1]. Genomewide association studies (GWAS) are used to compare the frequency of alleles or genotypes of a particular variant between disease cases and controls, across a given genome. A common approach is to test for differences in the allele frequencies of every singlenucleotide polymorphism (SNP) between the case and the control populations, by using the chisquare test [2–4]. The chisquare test uses the Pearson score, which increases as the difference in allele frequency between 2 populations increase. The chisquare test evaluates the Pearson score by way of the chisquare distribution.
One crucial problem in conducting SNPbased GWAS is performing corrections for multiple comparisons. A Bonferroni correction for a Pvalue is usually used to account for multiple testing under the assumption that all SNPs are independent. When SNP loci are in linkage disequilibrium, Bonferroni corrections are known to be too conservative and SNPbased GWAS may exclude truly significant SNPs [5, 6].
To address the multiplecomparison problem in GWAS, Misawa et al. [5] have developed new algorithms to correct for multiple comparisons at multiple SNP loci in a linkage disequilibrium, by treating linked loci as one haplotype block. This approach can be referred to as haplotypebased GWAS. In the present study, a haplotype refers to a list of alleles at multiple linked polymorphic loci, while a haplotype copy denotes a list of alleles within a gamete. Misawa et al. [5] developed a method of calculating the exact probability of a typeI error of haplotypebased GWAS, under the conditions that the haplotype frequencies in the population are known and the number of haplotype copies in the sample follows a multinomial distribution. Since this algorithm calculates all possible terms, the complexity of the computational time of this exact test is O(2n_{1}! 2 n_{2}!), where n_{1} is the sample size of the case population and n_{2} is the sample size of the control population. When the numbers of cases and controls exceed 50, such exact probabilities cannot be calculated, since they require too much time. As an alternative method, Misawa et al. [5] developed algorithms to asymptotically calculate the typeI error rates using a Markovchain Monte Carlo (MCMC) sampler that provides a good approximation to values calculated by the exact method. The computational complexity of the MCMC algorithm is O(Nnm), where N is the number of generations, n is the total sample size, n = n_{1} + n_{2}, and m is the number of loci.
The permutation test can also mitigate haplotypebased GWAS [6]. In the standard permutation test (SPT) for SNPbased GWAS, the test proceeds as follows. First, the Pearson score is calculated from the allele frequencies of the 2 populations at an SNP site; this score is the observed value of the Pearson score, S. Next, the 2 populations are pooled. The Pearson score is then calculated from the allele frequencies and recorded by randomly dividing these pooled values into two groups of size, n_{1} and n_{2}. The onesided Pvalue of the test is calculated as the proportion of sampled permutations where the Pearson score was greater than or equal to S. When SPT is applied to haplotypebased GWAS, haplotype copies of 2 populations are permuted, and Pearson scores are calculated for each SNP. The time complexity of the algorithm is O(Nnm). To search for SNPs whose Pvalues are lower than p, at least 1/p permutations are needed; therefore, the time complexity can be written as O(nm/p). For instance, to reach a P value of 10^{6} in a study that contains 1,000 cases and 1,000 controls with 10,000 loci, 10^{13} basic computer operations are required. Obviously, scaling up to larger studies comprising 100,000 loci is completely unattainable [6]. Therefore, to obtain small Pvalue bounds, one must expend a great deal of computational effort. By using importance sampling, Kimmel and Shamir [6] developed the rapid association test (RAT); much less effort is needed to achieve accurate and very low Pvalues by RAT than by SPT. The complexity of the running time for RAT is O(nb + N nc), where N is the number of permutations drawn by RAT, b is a predefined sampling constant, and c is the upper bound on the distance in SNPs between linked loci.
When the penetrance of a disease is small, a large number of SNPs from a large number of individuals from both case and control populations must be genotyped, to detect diseaseassociated genes [7]. There are currently more than a million SNPs for which accurate and complete genotypes have been obtained [8, 9]; thus, neither the MCMC algorithm [5] nor the RAT algorithm [6] can obtain a haplotypeGWAS result.
To conduct haplotypeGWAS within a short time period, we developed ParaHaplo, a parallelcomputation program that performs precisely these 2 functions for GWAS. ParaHaplo is based on data parallelism, a programming technique used to split large datasets into smaller datasets that can be run in a parallel, concurrent fashion [10]. ParaHaplo was developed on the basis of the Intel Message Passing Interface (MPI) and runs on PC clusters. ParaHaplo is a set of computer programs for SPT, RAT, MCMC, and the exact test, based on parallel computation. ParaHaplo is intended for use on workstation clusters using the Intel MPI, as well as on singleprocessor machines.
Using ParaHaplo, we conducted haplotypebased GWAS and SNPbased GWAS, to determine differences between Japanese in Tokyo, Japan (JPT) and Han Chinese in Beijing, China (CHB) of the HapMap dataset [11], because there are known to be small differences between JPT and CHB [9, 12]. We compared the speed of calculation of our algorithm with that of the regular permutation test on chromosome 22 of JPT and CHB of HapMap.
Implementation
Software overview
ParaHaplo supports the HapMap data format [8], as well as those of the Dhaplo DB [13] and BioBank Japan [8]. ParaHaplo requires an input file of the haplotype block boundary, as well as 2 datasets of population data. ParaHaplo can conduct either haplotypebased GWAS or SNPbased GWAS; in the case of the former, the data must be phased. ParaHaplo tests differences in allele frequency between 2 populations, e.g., a case population and a control population. ParaHaplo outputs the Pearson score for a chisquare test; a user can create ParaHaplo output by using a commandline option.
The ParaHaplo package includes both the calculation of exact probabilities and the algorithm to calculate asymptotically typeI error rates using a MCMC sampler [2]. Permutation tests SPT and RAT are also included in the ParaHaplo package. RAT is a fast algorithm for computing Pvalues in association studies and is based on an importance sampling developed by Kimmel and Shamir (2006) [3]. We did not incorporate the original source programs of Kimmel and Shamir (2006) into the package. The global typeI error is then obtained from the local typeI error by using a Bonferroni correction, because different haplotype blocks are considered independent of each other.
Parallel computing using MPI methods
ParaHaplo is implemented in an MPIC multithreaded package. The MPI package allows us to construct parallel computing programs on multiprocessors. The genomewide polymorphism data are broken into userdefined haplotype blocks, and the MPI Bcast function is used to distribute a single set of haplotype block data into each processor. The haplotype frequency data of 1 haplotype block are analyzed by a single processor; in this step, the probability of a local typeI error is calculated, given the significance level at each SNP locus.
Once the analysis of each haplotype block is complete, the results are compiled into a single genomewide dataset by using the MPIGatherv function. ParaHaplo is compatible with OpenMPI version 1.2.5, as well as MPICH version 1.2.7p1. Users can compile the source with a GCC compiler or an Intel C compiler. For singleprocessor machines, both the C and Java programs are also available.
Method
Hardware
When computational time was measured, we used a CentOS PC cluster at RIKEN comprising 1,024 nodes, each of which had a 1.6GHz Core2duo processor. On this PC cluster, 2,048 threads can be processed in parallel. The program was compiled by an Intel C compiler. Numbers of processing unit(s) used were 1, 64, 128, 256, 512, 768, and 1536.
Example data
As an example of GWAS, we applied ParaHaplo to compare genomewide haplotype frequencies between JPT and CHB of HapMap [11]; the number of individuals therein were 44 and 45, respectively. In this study, we conducted 100,000 generations for RAT. Haplotype blocks were obtained as LD blocks, according to the method of Gabriel et al. [14] and using the Haploview program [15]. The entire genomes of JPT + CHB were divided into 106,149 haplotype blocks by Haploview [15].
Results
Haplotypebased GWAS between JPT and CHB
Table 1 shows a list of haplotype blocks whose haplotype frequencies were significantly different between JPT and CHB, as detected by ParaHaplo. ParaHaplo detected 13 haplotype blocks whose haplotype frequencies were significantly different between JPT and CHB, when the significance level was set to 0.01. In contrast, when SNPbased GWAS was conducted on the same dataset, only 2 SNPs, rs10957985 and rs10115450which are denoted by an asterisk in Table 1were detected as being significantly different between JPT and CHB. Since there are 1,385,520 SNPs in this dataset, SNPbased GWAS considers SNPs whose Pearson scores are greater than 33.48 as significantly different between the 2 populations at the same significance level. This result suggests that ParaHaplo, as compared to SNPbased GWAS, can detect smaller differences between 2 populations.
We found 5 genes on haplotype blocks whose haplotype frequencies were significantly different between JPT and CHB as shown in Table 1. According to OMIM [16], CENPO is a subunit of a CENPHCENPIassociated centromeric complex that targets CENPA to centromeres and is required for proper kinetochore function and mitotic progression [17]. In Drosophila, PARD3B regulates cell polarization and is precisely regulated by 2 apically localized multiprotein signaling complexes that are tethered by Inscuteable, which regulates the apical localization [18]. CALD1 is a potential actomyosin regulatory protein found in smooth muscle and nonmuscle cells [19]. GRIN3A encodes a subunit of the Nmethyldaspartate (NMDA) receptors; it functions in physiological and pathological processes in the central nervous system [20]. SYT7 is a brainspecific, calciumdependent phospholipidbinding protein that plays a role in synaptic exocytosis and neurotransmitter release [21]
Calculation time
Speedup ratio is the ratio of the computational time of a single processor to that of multiple processors. Table 2 shows both the elapsed times and the speedups associated with the use of ParaHaplo, when chromosome 22 was analyzed. Numbers of processing unit(s) used were 1, 64, 128, 256, 512, 768, and 1536. As can be seen from table 2, calculation time decreased as the number of processors increased. When 1,536 processors were used, ParaHaplo was 100fold faster than the nonparallel program.
Discussion
We developed ParaHaplo, a set of computer programs for the parallel computation of accurate Pvalues in haplotypebased GWAS. ParaHaplo is intended for use in workstation clusters using the Intel MPI. By using ParaHaplo, we conducted haplotypebased GWAS as well as SNPbased GWAS, to find differences between JPT and CHB of the HapMap dataset [11].
Differences between Japanese in Tokyo, Japan, and Han Chinese in Beijing, China
We compared the performance of our algorithm with that of the regular permutation test in comparing JPT and CHB of HapMap. By using haplotypebased GWAS, a total of 13 haplotype blocks were found to exhibit significant differences in haplotype frequency between JPT and CHB; meanwhile, by using SNPbased GWAS, only 2 SNPs were significantly different. The results suggest that ParaHaplo can detect smaller differences between 2 populations than SNPbased GWAS. Accounting for differences in substructure is necessary for improving error rates in association studies [22, 23].
Natural selection is considered to be one of the causes of change in allele frequency; however, these haplotype blocks did not overlap the regions suspected of being influenced by natural selection Table 1 lists the genes on the haplotype blocks that have SNPs in which the JPT and CHP haplotype frequencies were significantly different (P < 0.01). It is unclear that the biological functions of the genes are different among JPT and CHB people.
Differences could also be caused by genetic drift [24], which brings about a change in allele frequency over time in a population, as a result of random sampling and chance. Archeological data suggest that there were probably 2 migratory waves of incoming people to Japan, both from the Asian continent. The first migration took place about 38,00037,000 years ago, before the Pleistocene land bridges became submerged. The last ice age ended and sea levels increased around 12,000 years ago; at this point, the Japanese people became isolated from the people of mainland Asia [25]. In the 12,000 years since then, the allele frequencies of the haplotype blocks listed on Table 1 may have changed due to genetic drift. Further studies may be necessary to determine whether these differences were maintained by natural selection, genetic drift, or both.
It is generally difficult to assess how many steps are necessary for the convergence of the MCMC algorithm and RAT. In this study, we conducted 7 runs of RAT for 100,000 generations by using different sets of processing units (table 2). We found the results of these runs were essentially the same; therefore, we considered 100,000 generations to comprise a sufficiently large dataset. We recommend monitoring convergence by comparing several independent runs.
Parallel computation of haplotypebased GWAS
The results showed that the parallel computing of ParaHaplo was 100fold faster than nonparallel programs. In this paper, we used only 89 JPT + CHB individuals whose genotypes had been determined by the HapMap project [11]. When a single processor was used, RAT for chromosome 22 took more than 1 h; if 9,000 individuals were analyzed under the same conditions, it would take approximately 5 days. In contrast, when multiple processors were used, RAT for chromosome 22 took less than 1 min; an analysis of 9,000 individuals under the same conditions would take approximately 1 h.
There are 1,536 haplotype blocks in chromosome 22. The speedup ratio was only 116 because of variations in the LD block size. Since ParaHaplo is based on data parallelism, the computational times of each of the RAT, SPT, and MCMC methods was proportional to the number of SNPs within the LD block [5, 6]; as a result, a large LD block becomes a computational bottleneck. To archive faster parallel computing of haplotypebased GWAS, further studies into more finegrained parallelization is required.
Conclusion
The results showed that the parallel computing of ParaHaplo was 100fold faster than that of nonparallel programs when the number of processors is sufficient. There are more than a million SNPs for which accurate and complete genotypes have been obtained, and thousands of people are now being genotyped [8, 9]. Since the data sizes of such projects continue to increase, the use of fast computations with parallel computingsuch as that used in ParaHaplowill become increasingly important.
Availability and requirements

Project name: ParaHaplo

Project home page: http://sourceforge.jp/projects/parallelgwas/?_sl=1

Operating systems: Platform independent

Programming language: Java and C

Other requirements: OpenMPI version 1.2.5, or MPICH version 1.2.7p1

License: MIT license

Any restrictions to use by nonacademics: License required
Abbreviations
 RAT:

Rapid Association Test
 SPT:

Standard Permutation Test
 MCMC:

Markovchain Monte Carlo
 JPT:

Japanese Tokyo
 CHB:

Han Chinese Beijing.
References
 1.
Hirschhorn JN, Daly MJ: Genomewide association studies for common diseases and complex traits. Nat Rev Genet. 2005, 6: 95108. 10.1038/nrg1521.
 2.
Ozaki K, Ohnishi Y, Iida A, Sekine A, Yamada R, Tsunoda T, Sato H, Hori M, Nakamura Y, Tanaka T: Functional SNPs in the lymphotoxinalpha gene that are associated with susceptibility to myocardial infarction. Nat Genet. 2002, 32: 650654. 10.1038/ng1047.
 3.
Onouchi Y, Gunji T, Burns JC, Shimizu C, Newburger JW, Yashiro M, Nakamura Y, Yanagawa H, Wakui K, Fukushima Y, Kishi F, Hamamoto K, Terai M, Sato Y, Ouchi K, Saji T, Nariai A, Kaburagi Y, Yoshikawa T, Suzuki K, Tanaka T, Nagai T, Cho H, Fujino A, Sekine A, Nakamichi R, Tsunoda T, Kawasaki T, Hata A: ITPKC functional polymorphism associated with Kawasaki disease susceptibility and formation of coronary artery aneurysms. Nat Genet. 2008, 40: 3542. 10.1038/ng.2007.59.
 4.
Tokuhiro S, Yamada R, Chang X, Suzuki A, Kochi Y, Sawada T, Suzuki M, Nagasaki M, Ohtsuki M, Ono M, Furukawa H, Nagashima M, Yoshino S, Mabuchi A, Sekine A, Saito S, Takahashi A, Tsunoda T, Nakamura Y, Yamamoto K: An intronic SNP in a RUNX1 binding site of SLC22A4, encoding an organic cation transporter, is associated with rheumatoid arthritis. Nat Genet. 2003, 35: 341348. 10.1038/ng1267.
 5.
Misawa K, Fujii S, Yamazaki T, Takahashi A, Takasaki J, Yanagisawa M, Ohnishi Y, Nakamura Y, Kamatani N: New correction algorithms for multiple comparisons in casecontrol multilocus association studies based on haplotypes and diplotype configurations. J Hum Genet. 2008, 53: 789801. 10.1007/s1003800803120.
 6.
Kimmel G, Shamir R: A fast method for computing highsignificance disease association in large populationbased studies. Am J Hum Genet. 2006, 79: 481492. 10.1086/507317.
 7.
Ohashi J, Tokunaga K: The power of genomewide association studies of complex disease genes: statistical limitations of indirect approaches using SNP markers. J Hum Genet. 2001, 46: 478482. 10.1007/s100380170048.
 8.
Nakamura Y: The BioBank Japan Project. Clin Adv Hematol Oncol. 2007, 5: 696697.
 9.
YamaguchiKabata Y, Nakazono K, Takahashi A, Saito S, Hosono N, Kubo M, Nakamura Y, Kamatani N: Japanese population structure, based on SNP genotypes from 7003 individuals compared to other ethnic groups: effects on populationbased association studies. Am J Hum Genet. 2008, 83: 445456. 10.1016/j.ajhg.2008.08.019.
 10.
Culler DE, Gupta A, Singh JP: Parallel Computer Architecture: A Hardware/Software Approach. 1997, San Francisco, CA: Morgan Kaufmann Publishers
 11.
The_International_HapMap_Consortium: The International HapMap Project. Nature. 2003, 426: 789796. 10.1038/nature02168.
 12.
Tian C, Kosoy R, Lee A, Ransom M, Belmont JW, Gregersen PK, Seldin MF: Analysis of East Asia genetic substructure using genomewide SNP arrays. PLoS ONE. 2008, 3: e386210.1371/journal.pone.0003862.
 13.
Higasa K, Miyatake K, Kukita Y, Tahira T, Hayashi K: DHaploDB: a database of definitive haplotypes determined by genotyping complete hydatidiform mole samples. Nucleic Acids Res. 2007, 35: D685689. 10.1093/nar/gkl848.
 14.
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, LiuCordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science. 2002, 296: 22252229. 10.1126/science.1069424.
 15.
Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21: 263265. 10.1093/bioinformatics/bth457.
 16.
Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005, 33: D514517. 10.1093/nar/gki033.
 17.
Okada M, Cheeseman IM, Hori T, Okawa K, McLeod IX, Yates JR, Desai A, Fukagawa T: The CENPHI complex is required for the efficient incorporation of newly synthesized CENPA into centromeres. Nat Cell Biol. 2006, 8: 446457. 10.1038/ncb1396.
 18.
Izaki T, Kamakura S, Kohjima M, Sumimoto H: Two forms of human Inscuteablerelated protein that links Par3 to the Pins homologues LGN and AGS3. Biochem Biophys Res Commun. 2006, 341: 10011006. 10.1016/j.bbrc.2006.01.050.
 19.
Humphrey MB, HerreraSosa H, Gonzalez G, Lee R, Bryan J: Cloning of cDNAs encoding human caldesmons. Gene. 1992, 112: 197204. 10.1016/03781119(92)90376Z.
 20.
Micu I, Jiang Q, Coderre E, Ridsdale A, Zhang L, Woulfe J, Yin X, Trapp BD, McRory JE, Rehak R, Zamponi GW, Wang W, Stys PK: NMDA receptors mediate calcium accumulation in myelin during chemical ischaemia. Nature. 2006, 439: 988992.
 21.
Li C, Ullrich B, Zhang JZ, Anderson RG, Brose N, Sudhof TC: Ca(2+)dependent and independent activities of neural and nonneural synaptotagmins. Nature. 1995, 375: 594599. 10.1038/375594a0.
 22.
Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, Gabriel SB, Topol EJ, Smoller JW, Pato CN, Pato MT, Petryshen TL, Kolonel LN, Lander ES, Sklar P, Henderson B, Hirschhorn JN, Altshuler D: Assessing the impact of population stratification on genetic association studies. Nat Genet. 2004, 36: 388393. 10.1038/ng1333.
 23.
Tian C, Gregersen PK, Seldin MF: Accounting for ancestry: population substructure and genomewide association studies. Hum Mol Genet. 2008, 17: R143150. 10.1093/hmg/ddn268.
 24.
Kimura M: Evolutionary rate at the molecular level. Nature. 1968, 217: 624626. 10.1038/217624a0.
 25.
Pope KO, Terrell JE: Environmental setting of human migrations in the circumPacific region. J Biogeogr. 2008, 35: 121.
Acknowledgements
We thank Drs. Yumi YamaguchiKabata, Akihiro Fujimoto, and Tatsuhiko Tsunoda for their useful comments. The present study was supported in part by grants from the Research Project for Personalized Medicine (MEXT). This work was supported by the National Project "Nextgeneration Integrated Living Matter Simulation" of the Ministry of Education, Culture, Sports, Science, and Technology (MEXT).
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Kazuharu Misawa wrote the software and the manuscript, and Naoyuki Kamatani supervised the project. Both authors read and approved the final manuscript.
Authors’ original submitted files for images
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Haplotype Frequency
 Message Passing Interface
 Haplotype Block
 MCMC Algorithm
 Workstation Cluster