popRange: a highly flexible spatially and temporally explicit Wright-Fisher simulator
- Kimberly F McManus^{1, 2}
https://doi.org/10.1186/s13029-015-0036-4
© McManus; licensee BioMed Central. 2015
Received: 9 October 2014
Accepted: 30 March 2015
Published: 11 April 2015
Abstract
Background
Sequencing and genotyping technology advancements have led to massive, growing repositories of spatially explicit genetic data and increasing quantities of temporal data (i.e., ancient DNA). These data will allow more complex and fine-scale inferences about population history than ever before; however, new methods are needed to test complex hypotheses.
Results
This article presents popRange, a forward genetic simulator, which incorporates large-scale genetic data with stochastic spatially and temporally explicit demographic and selective models. Features such as spatially and temporally variable selection coefficients and demography are incorporated in a highly flexible manner. popRange is implemented as an R package and presented with an example simulation exploring a selected allele’s trajectory in multiple subpopulations.
Conclusions
popRange allows researchers to evaluate and test complex scenarios by simulating large-scale data with complicated demographic and selective features. popRange is available for download at http://cran.r-project.org/web/packages/popRange/index.html.
Keywords
Population genetics, Python, R, Genetic simulators
Background
Recent advances in sequencing and genotyping technology have led to a dramatic reduction in the cost, and an increase in the accuracy, of DNA sequencing. These advances have led to the creation of large repositories of spatially explicit genetic data and increasing quantities of temporal data (i.e., ancient DNA). Furthermore, data continue to be generated at an unprecedented rate; 10X more sequences are generated every year [1-3].
Comparison of population genetic simulators
Feature | SPLATCHE2 [16] | SimAdapt [6] | quantiNemo [7] | SFS_CODE [4] | SLiM [5] | popRange |
---|---|---|---|---|---|---|
Simulation method | Coalescent | Forward-Time | Forward-Time | Forward-Time | Forward-Time | Forward-Time |
Data type | SNPs, STRs, DNA sequences, RFLPs | SNPs, STRs | SNPs, STRs | SNPs, DNA sequences | SNPs, DNA sequences | SNPs |
Interface | Command-line, GUI | Command-line, GUI, accessed via R package RNetlogo | GUI | Command-line | Command-line | R package |
Many SNPs (>100) | No | No | Allowed, but very slow [6] | Yes | Yes | Yes |
Population structure | Friction, migration rates (one rate per population) | Dispersal distance (one rate per population) | Migration rates, stochastic founding/extinction of populations | Migration rates, speciation, domestication & admixture events | Migration rates | Population grid based, migration rates, stochastic founding/extinction of populations |
Population dynamics | Logistic growth | Logistic growth | Logistic growth | Logistic & exp. growth, step size changes | Step size changes | Logistic growth, Allee effect, step size changes |
Natural selection | No | Fixed values | Many models, spatially & temporally varying | Fixed values, gamma, normal & 3-point mass models | Fixed values, gamma & exponential distributions | Fixed values, gamma distribution, spatially & temporally varying |
Linkage | Yes | No | Yes | Yes | Yes | No |
Software such as sfs_code [4] and SLiM [5] allows simulation of large segments of DNA and integrates a wide range of parameters, such as recombination, migration and selection. These simulators allow extraction of haplotypes at various time points to explore time-series genetic trends. However, they require specification of divergence times, founder population sizes and deterministically set migration rates, which limits their ability to model stochastic demographic events. For example, range expansions are difficult to simulate, as populations cannot stochastically populate the world.
Other simulators allow populations to form and diverge in a more stochastic manner than those described above. However, these simulators focus on a small number of independently segregating loci. One of the most flexible simulators, SimAdapt [6], allows, among many features, temporally variable gene flow barriers, differences in fitness between populations, and different carrying capacities. Another simulator, quantiNemo [7], allows the simulation of spatially and temporally explicit selection coefficients, but requires the user to set starting allele frequencies and runs very slowly on even mid-sized data [8]. Simulators in this category are typically unable to generate large quantities of single nucleotide polymorphisms (SNPs) and still lack flexibility with respect to spatially and temporally variable parameters.
A main use of a new generation of simulators is to allow researchers to evaluate and test hypotheses generated from the data under flexible scenarios [9]. Most modern population genetic analyses, including principal component analysis (PCA), large-scale inference of demography (e.g., ∂a∂i [10]) and ancestry analyses (e.g., ADMIXTURE [11]), require the generation of a large number of independently segregating SNPs.
popRange bridges this gap by simulating complex demographic scenarios with large-scale genetic data. Such simulators are necessary to interpret current genetic data under more realistic demographic scenarios. Though popRange does not simulate linkage, independently segregating loci are sufficient for many large-scale analyses.
This software provides a simulation framework for modeling highly probabilistic spatial and temporal population dynamics. To date, no existing simulator incorporates both stochastic spatially and temporally explicit scenarios and chromosome-scale data. However, incorporating both types of data in simulations is essential for gaining insight into realistic dynamic processes acting on the genome. The grid-based population structure model used here allows spatial and migration flexibility, such as in simulations of arbitrary landscape barriers.
Implementation
Technical details
popRange is implemented as an R package and requires R [12], Python 2.7.x or Python 3.2.x-3.4.x [13], the Python package NumPy [14], and the R package findpython [15]. As an R package, it can run on any operating system.
Simulation overview
popRange is a highly probabilistic Wright-Fisher forward population genetic simulator. Specifically, it incorporates 1) large scale data (many SNPs, populations, and individuals), 2) a grid-based population structure, 3) a wide variety of spatially and temporally explicit stochastic demographic parameters, and 4) a variety of output file formats.
1. Extinction: Each generation, each population can become extinct with a probability set by the user.
2. Migration: Migration rates are spatially and temporally explicit, allowing the simulation of a wide range of landscapes. Note that migration is highly probabilistic; rates are the probability that each individual migrates in each generation. The number of migrants from each population is determined by a binomial distribution and the migration probability. These probabilities may allow a random adjacent destination population to be chosen, or they may be specific with respect to initial and final populations.
3. Mutation: Mutations are based on the infinitely many sites model. The number of mutations introduced into each population in each generation is drawn from a Poisson distribution parameterized by:
   $$ \lambda = \mu \cdot g \cdot N, $$
   where μ is the mutation rate parameter, g is the number of base pairs in the genome, and N is the population size.
4. Selection: When a mutation is introduced, a selection coefficient may be placed on the new allele. Selection coefficients may be fixed values or may be drawn from a gamma distribution.
5. Population growth: Populations may grow logistically or may experience instantaneous population size changes. For logistic growth, the growth rate, r, is drawn from a normal distribution with a mean and variance provided by the user. This r is used in the logistic growth equation:
   $$ N_t = r \cdot N_{t-1} \cdot \frac{1-N_{t-1}}{K} \cdot \frac{N_{t-1}-A}{K}, $$
   where N is the population size, K is the carrying capacity and A is the Allee effect parameter.
6. Drift/Reproduction: Populations may be haploid or diploid, and random mating is assumed within each hermaphroditic population.
7. Output results: When all generations are complete, output may be written to a variety of file formats, including Geneland, PLINK, and GENEPOP (Additional file 1: Section 5).
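The stochastic draws named in the steps above can be sketched with NumPy. This is only an illustration of the distributions the text describes (binomial migrants, Poisson mutations, gamma selection coefficients, a normally distributed growth rate, binomial Wright-Fisher resampling), not popRange's own code. All function names are hypothetical, the four-neighbour definition of "adjacent" is an assumption, and `numpy.random.default_rng` assumes a NumPy release newer than those contemporary with the Python versions listed above.

```python
import numpy as np

def extinction_mask(n_pops, p_extinct, rng):
    """True where a population goes extinct this generation."""
    return rng.random(n_pops) < p_extinct

def migrant_counts(pop_sizes, m, rng):
    """Emigrants per population: Binomial(N, m), m = per-individual probability."""
    return rng.binomial(pop_sizes, m)

def adjacent_cells(row, col, n_rows, n_cols):
    """Grid cells sharing an edge with (row, col); 4-neighbour rule is assumed."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return [(row + dr, col + dc) for dr, dc in steps
            if 0 <= row + dr < n_rows and 0 <= col + dc < n_cols]

def new_mutation_counts(pop_sizes, mu, g, rng):
    """New mutations per population: Poisson(lambda = mu * g * N)."""
    return rng.poisson(mu * g * np.asarray(pop_sizes))

def selection_coefficients(n, shape, scale, rng):
    """Selection coefficients for n new alleles, drawn from a gamma distribution."""
    return rng.gamma(shape, scale, size=n)

def growth_rate(mean, variance, rng):
    """Growth rate r ~ Normal(mean, variance); NumPy expects a standard deviation."""
    return rng.normal(mean, np.sqrt(variance))

def drift_step(allele_count, n_chrom, rng):
    """Wright-Fisher resampling: next count ~ Binomial(n_chrom, current frequency)."""
    return rng.binomial(n_chrom, allele_count / n_chrom)

# Illustrative values only.
rng = np.random.default_rng(42)
sizes = np.array([100, 100, 50, 0])          # individuals per population
extinct = extinction_mask(len(sizes), 0.001, rng)
migrants = migrant_counts(sizes, 0.01, rng)
destinations = adjacent_cells(0, 0, 4, 4)    # candidate targets from a corner cell
mutations = new_mutation_counts(sizes, 1.1e-8, 100_000, rng)
s_values = selection_coefficients(int(mutations.sum()), 2.0, 0.01, rng)
r = growth_rate(1.1, 0.01, rng)
next_count = drift_step(20, 200, rng)        # 20 copies among 200 chromosomes
```

Passing the square root of the variance to `rng.normal` matters because the text specifies a mean and variance, while NumPy's second argument is a standard deviation.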
Results and discussion
This framework combines simulators commonly used in ecology, which incorporate stochastic demographic scenarios, such as population growth and contraction, and stochastic founding and extinction of populations, with simulators more commonly used in population genetics, which include many SNPs. It also incorporates more advanced features such as spatially and temporally explicit selection.
Runtime
Runtime scales linearly with the number of base pairs and the number of individuals (see Additional file 1). For reference, a simulation of a 100 kb sequence in a 4 × 4 population grid (16 populations), with 100 diploids per population, a migration rate of 0.01 between adjacent populations, a mutation rate of 1.1E-8 per generation, and 1000 generations, completes in just under 4 minutes.
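To give a sense of scale for this reference run, the mutation input can be worked out from λ = μ·g·N as defined in the mutation step. The short calculation below is a sketch under two assumptions that the text does not state: N is taken as the 100 individuals per population (rather than the number of chromosomes), and population sizes are held constant for all 1000 generations.

```python
# Parameters quoted in the runtime example.
mu = 1.1e-8           # mutation rate per generation
g = 100_000           # genome length in base pairs (100 kb)
n_per_pop = 100       # individuals per population (assumed to be N in lambda)
n_pops = 16           # 4 x 4 grid
generations = 1000

# Expected new mutations per population per generation.
lam = mu * g * n_per_pop

# Expected new mutations over the whole run, assuming constant sizes.
total_expected = lam * n_pops * generations

print(round(lam, 4), round(total_expected, 1))
```

Under these assumptions, each population receives about 0.11 new mutations per generation, or roughly 1,760 mutation events over the whole run.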
Accuracy
This software was evaluated for accuracy through comparisons with theoretical expectations of heterozygosity, fixation probabilities of new mutations, and Fst (Additional file 1: Section 6).
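A check of this kind can be illustrated with the standard expectation for heterozygosity under pure drift, H_t = H_0 (1 - 1/(2N))^t for N diploid individuals. The sketch below compares that expectation with a small Wright-Fisher simulation of independent loci. It is not the evaluation reported in Additional file 1, and the function names and parameter values are hypothetical.

```python
import numpy as np

def expected_heterozygosity(h0, n_chrom, generations):
    """Theoretical decay under drift: H_t = H_0 * (1 - 1/n_chrom)^t."""
    return h0 * (1.0 - 1.0 / n_chrom) ** generations

def simulated_heterozygosity(p0, n_chrom, generations, n_loci, seed=0):
    """Mean 2p(1-p) across independent loci after binomial resampling."""
    rng = np.random.default_rng(seed)
    counts = np.full(n_loci, int(round(p0 * n_chrom)))
    for _ in range(generations):
        # Each generation resamples n_chrom gene copies at the current frequency.
        counts = rng.binomial(n_chrom, counts / n_chrom)
    p = counts / n_chrom
    return float(np.mean(2.0 * p * (1.0 - p)))

# 50 diploids (100 chromosomes), starting frequency 0.5, 20 generations.
theory = expected_heterozygosity(0.5, 100, 20)
observed = simulated_heterozygosity(0.5, 100, 20, n_loci=5000)
print(theory, observed)
```

With enough loci the simulated mean should sit close to the theoretical value; a persistent gap would point to a problem in the resampling step.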
Example simulation
Conclusion
popRange allows users to simulate spatially and temporally explicit scenarios with chromosome-scale data efficiently for the first time. Features such as spatially and temporally variable selection coefficients are incorporated in a flexible manner. This software allows for large-scale analyses and comparisons of these complex, stochastic models and is implemented in R, facilitating ease of use. I expect that this software will fill a gap and help researchers make better use of the increasing volume of geographically explicit genomic data being accumulated for a diverse group of organisms.
Availability and requirements
Project name: popRange
Project homepage: http://cran.r-project.org/web/packages/popRange/index.html
Direct Download link: http://cran.r-project.org/src/contrib/popRange_1.1.2.tar.gz
Operating systems: Linux, Mac OS X, Windows
Programming languages: R, Python 2.7.x or Python 3.2.x-3.4.x
Other requirements: NumPy (Python package), findpython (R package)
License: MIT
Any restrictions to use by non-academic users: no licenses required.
Declarations
Acknowledgements
The author would like to thank Dr. Carlos Bustamante and Dr. Omar Cornejo for valuable feedback on the simulation framework and manuscript. This work was supported by the NIH training grant #5T32GM007276-38 and a Stanford Center for Computational, Evolutionary and Human Genomics (CEHG) fellowship.
References
1. Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. The next-generation sequencing revolution and its impact on genomics. Cell. 2013;155(1):27–38.
2. Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet. 2010;11:21–46.
3. Wall JD, Slatkin M. Paleopopulation genetics. Annu Rev Genet. 2012;46:635–49.
4. Hernandez RD. A flexible forward simulator for populations subject to selection and demography. Bioinformatics. 2008;24:2786–7.
5. Messer PW. SLiM: simulating evolution with selection and linkage. Genetics. 2013;194(4):1037–9.
6. Rebaudo F, Le Rouzic AL, Dupas S, Silvain JF, Harry M, Dangles O. SimAdapt: an individual-based genetic model for simulating landscape management impacts on populations. Methods Ecol Evol. 2013;4:595–600.
7. Neuenschwander S, Hospital F, Guillaume F, Goudet J. quantiNemo: an individual-based program to simulate quantitative traits with explicit genetic architecture in a dynamic metapopulation. Bioinformatics. 2008;24(13):1552–3.
8. Yuan X, Miller DJ, Zhang J, Herrington D, Wang Y. An overview of population genetic data simulation. J Comp Biol. 2012;19(1):42–54.
9. Hoban S, Bertorelle G, Gaggiotti OE. Computer simulations: tools for populations and evolutionary genetics. Nat Rev Genet. 2012;13:110–22.
10. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLOS Genet. 2009;5(10), e1000695.
11. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655–64.
12. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing. 2014. [http://R-project.org]
13. Python Software Foundation. Python Language Reference, version 2.7-3.4. [http://www.python.org]
14. Van der Walt S, Colbert CS, Varoquaux G. The NumPy array: a structure for efficient numerical computation. Comput Sci Eng. 2011;13:22–30.
15. Davis TL, Gilbert P. findpython: Python tools to find an acceptable python binary. R package version 1.0.1. [http://CRAN.R-project.org/package=findpython]
16. Ray N, Currat M, Excoffier L. SPLATCHE2: a spatially explicit simulation framework for complex demography, genetic admixture and recombination. Bioinformatics. 2010;26(23):2993–4.
17. Arenas M. Simulation of molecular data under diverse evolutionary scenarios. PLOS Comput Biol. 2012;8(5), e1002495. doi:10.1371/journal.pcbi.1002495.
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.