Simulating pedigrees ascertained for multiple disease-affected relatives

Background: Studies that ascertain families containing multiple relatives affected by disease can be useful for identification of causal, rare variants from next-generation sequencing data.
Results: We present the R package SimRVPedigree, which allows researchers to simulate pedigrees ascertained on the basis of multiple, affected relatives. By incorporating the ascertainment process in the simulation, SimRVPedigree allows researchers to better understand the within-family patterns of relationship amongst affected individuals and ages of disease onset.
Conclusions: Through simulation, we show that affected members of a family segregating a rare disease variant tend to be more numerous and cluster in relationships more closely than those for sporadic disease. We also show that the family ascertainment process can lead to apparent anticipation in the age of onset. Finally, we use simulation to gain insight into the limit on the proportion of ascertained families segregating a causal variant. SimRVPedigree should be useful to investigators seeking insight into the family-based study design through simulation.
Electronic supplementary material: The online version of this article (10.1186/s13029-018-0069-6) contains supplementary material, which is available to authorized users.


Algorithm to Simulate All Life Events Starting at Birth
To simulate all life events for an individual, starting at birth, we implement the following algorithm, which simulates life events until either the individual dies or a simulated event falls after the last year of the study.
• Set y to the individual's year of birth.
• Set y_S to the last year of the study.
• Set t_max = y_S − y.
• Set t = 0. In this context, t represents the individual's age, in years; hence, at birth the individual is 0 years old.
• Determine the individual's risk variant status, x, where x = 1 if the individual has the familial risk variant and x = 0 otherwise.
• Set δ, the disease status indicator, to 0 to indicate that disease onset has not occurred at birth.
• While t < t_max:
    • Simulate w_o|t,x, the waiting time to disease onset conditioned on the current age and rare-variant status.
    • Simulate w_d|t,δ, the waiting time to death conditioned on the current age and disease status.
    • Simulate w_r|t, the waiting time to reproduction conditioned on the current age.
    • Let w denote the smallest of the three simulated waiting times, i.e. the waiting time to the next life event.
    • If t + w < t_max and w = w_o|t,x:
        · store the individual's year of disease onset, y + t + w,
        · set δ = 1,
        · and set t = t + w.
    • If t + w < t_max and w = w_d|t,δ:
        · store the individual's year of death, y + t + w,
        · and set t = t_max to stop the simulation.
    • If t + w < t_max and w = w_r|t:
        · create offspring and store the offspring's year of birth, y + t + w,
        · simulate the offspring's gender uniformly between male and female,
        · simulate the offspring's rare-variant status according to Mendel's laws,
        · and set t = t + w.
    • If t + w ≥ t_max, set t = t + w (i.e. stop the simulation).
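The competing-events loop above can be sketched in Python as follows. This is a minimal illustration and not the SimRVPedigree implementation: the function name and all rate parameters are hypothetical, and every waiting time is drawn from an exponential distribution with a constant yearly hazard, whereas the package conditions on age-specific hazard rates. The sketch also assumes that disease onset occurs at most once and that a carrier parent transmits the variant with probability 1/2.

```python
import random


def simulate_life_events(birth_year, stop_year, carrier, rng,
                         onset_rate=0.002, grr=10.0,
                         death_rate=0.012, death_rate_affected=0.05,
                         birth_rate=0.03):
    """Return a list of tuples (event, year, ...) for one individual.

    All rates are hypothetical constant yearly hazards, so every waiting
    time is exponential. This is a simplification for illustration only.
    """
    t_max = stop_year - birth_year
    t = 0.0            # current age in years
    affected = False   # disease status indicator (delta in the text)
    events = []
    while t < t_max:
        # Waiting time to onset; assumed impossible once already affected.
        if affected:
            w_onset = float("inf")
        else:
            w_onset = rng.expovariate(onset_rate * (grr if carrier else 1.0))
        # Waiting time to death depends on the current disease status.
        w_death = rng.expovariate(death_rate_affected if affected else death_rate)
        # Waiting time to reproduction (constant rate is an assumption).
        w_repro = rng.expovariate(birth_rate)

        w = min(w_onset, w_death, w_repro)   # next life event
        if t + w >= t_max:
            break                            # event falls after the study ends
        t += w
        year = birth_year + t
        if w == w_onset:
            affected = True
            events.append(("onset", year))
        elif w == w_death:
            events.append(("death", year))
            break                            # no events after death
        else:
            sex = rng.choice(["male", "female"])
            # Assumed Mendelian transmission from a heterozygous carrier.
            child_carrier = carrier and rng.random() < 0.5
            events.append(("birth", year, sex, child_carrier))
    return events
```

For example, `simulate_life_events(1900, 2000, True, random.Random(1))` returns the events of one carrier born in 1900 in a study that ends in 2000. Each offspring record could then be passed back through the same function to grow a pedigree.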

Distribution of Average IBD Probability Among Affected Family Members
We measure familial disease clustering by the average of the pairwise identity-by-descent (IBD) probabilities among the affected relatives in the pedigree. We denote this measure by A_IBD. To formalize this measure, within a pedigree, we denote the k affected family members by m_1, m_2, ..., m_k, and let p_{i,j} denote the probability that m_i and m_j share a variant IBD. Using these definitions, A_IBD may be calculated as
$$A_{IBD} = \frac{2}{k(k-1)} \sum_{1 \le i < j \le k} p_{i,j}.$$
To investigate the relationship between familial clustering among affected relatives and κ, the relative risk of disease in genetic cases, we consider three genetic-relative-risk groups: κ = 1, κ = 10, and κ = 20. The simulated study samples are described in the main text in section Results: Familial Clustering. Tables 1 and 2 summarize the conditional distribution of A_IBD in families with two and three disease-affected relatives, respectively, for the three genetic-relative-risk groups considered.
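As a small worked illustration of this average, the following Python sketch computes A_IBD from a symmetric matrix of pairwise IBD probabilities. The function name and the matrix values in the usage note are hypothetical and chosen only to show the arithmetic.

```python
from itertools import combinations


def average_ibd(pairwise_ibd):
    """Average the pairwise IBD probabilities p[i][j] over all pairs i < j.

    `pairwise_ibd` is a k x k symmetric matrix (list of lists); the
    diagonal entries are ignored.
    """
    k = len(pairwise_ibd)
    if k < 2:
        raise ValueError("at least two affected relatives are required")
    pairs = list(combinations(range(k), 2))   # all k(k-1)/2 unordered pairs
    return sum(pairwise_ibd[i][j] for i, j in pairs) / len(pairs)
```

With three affected relatives and hypothetical pairwise probabilities of 0.5, 0.25, and 0.5, the function returns (0.5 + 0.25 + 0.5) / 3, approximately 0.417.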

Negative Control for Anticipation: Age at Death
As discussed in the main text, in section Results: Anticipation, it is possible to use the ages of death in unaffected relatives as a negative control to gain insight into the ascertainment bias that contributes to apparent anticipation signals in age of onset [1]. In this context, an individual's generation number is relative to the eldest pedigree founder. That is, the two eldest founders will have generation number one, their offspring generation number two, etc. Figure 1 displays box plots of age of death for three genetic-relative-risk groups: κ = 1, κ = 10, and κ = 20. In Figure 1, we see that, within each genetic-relative-risk group, the age of death tends to decrease in successive generations. This apparent anticipation arises from right truncation in younger generations: only deaths that occur before the end of the observation period are recorded, so the observed ages of death among later-born relatives are necessarily younger.
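The effect of right truncation can be reproduced with a short Python sketch. It is not the simulation used for Figure 1: the birth years, the normal distribution for age of death, and the function names are all assumptions made for illustration. Every generation is given the same distribution for age of death, so any difference between generations is due to truncation alone.

```python
import random
from statistics import mean


def observed_death_ages(birth_year, end_year, n, rng,
                        mean_age=75.0, sd=12.0):
    """Ages of death that are observable by `end_year` (right truncation)."""
    ages = []
    for _ in range(n):
        age = max(0.0, rng.gauss(mean_age, sd))   # hypothetical distribution
        if birth_year + age <= end_year:          # death seen before study end
            ages.append(age)
    return ages


def mean_death_age_by_generation(end_year=2015, n=20000, seed=1):
    """Mean observed age of death for three hypothetical generations."""
    rng = random.Random(seed)
    birth_years = {1: 1900, 2: 1930, 3: 1960}     # assumed birth years
    return {gen: mean(observed_death_ages(by, end_year, n, rng))
            for gen, by in birth_years.items()}
```

Calling `mean_death_age_by_generation()` gives a mean observed age of death that falls from generation one to generation three, even though the underlying distribution is identical in all three.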

Effect of Follow Up on Ascertainment Bias
To determine whether increasing the length of follow-up reduces the effect of ascertainment bias, we simulated three study samples, each containing 500 pedigrees, according to the following criteria.
6. Genetic cases, i.e. affected individuals that inherited the rare variant, experience disease onset at 1, 10, or 20 times the baseline, age-specific hazard rate of lymphoid cancer. That is, for the first sample of 500 pedigrees the genetic relative-risk was set to 1, for the second it was set to 10, and for the third it was set to 20.
7. Since death by lymphoid cancer accounts for a relatively small proportion of all causes of death, the age-specific hazard rate for death in the unaffected population was approximated by that of the general population. Individuals who developed lymphoid cancer experienced death according to the age-specific hazard rate of death in the affected population [2,5,6], whereas unaffected individuals experienced death according to the age-specific hazard rate of death in the general population [4].
8. The proband's probabilities for recalling relatives were set to recall_probs = (1), so that pedigrees were fully ascertained.
9. The stop year of the study was set to 2115.
We restrict attention to pedigree members who were alive at the time of ascertainment. Individuals born after 2015 were not considered. For the three genetic-relative-risk groups considered (1, 10, and 20), we compare the distribution of age of onset by assigned generation number for disease-affected relatives at various follow-up milestones. We consider the following milestones: the end of the ascertainment period, or 0-year milestone (2015), the 25-year milestone (2040), the 50-year milestone (2065), and the 100-year milestone (2115). Figure 2 displays box plots of the age of onset for the three groups and four milestones considered.
Figure 2: Box plots of age of onset for disease-affected relatives by assigned generation number (see main text) at 0, 25, 50, and 100 years of follow-up for the three relative-risk groups considered. From top to bottom, the first row provides results for the κ = 1 (fully sporadic) sample, the second row provides results for the κ = 10 sample, and the third row provides results for the κ = 20 sample.
From Figure 2 we see that, as the length of follow-up increases and additional relatives experience disease onset, the ages of onset in assigned generations three and four shift upward and more closely resemble those of generations one and two. Thus, increasing the length of follow-up by a considerable amount reduces the effect of ascertainment bias.
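The role of the follow-up milestones can be illustrated with a single hypothetical cohort whose onset ages are fixed in advance and then viewed at each milestone. In the Python sketch below, only the milestone years (2015, 2040, 2065, 2115) come from the text; the birth year, the normal distribution for age of onset, and the function name are assumptions, and death before onset is ignored for simplicity.

```python
import random
from statistics import mean


def mean_onset_age_by_milestone(birth_year=1975, n=20000, seed=1,
                                milestones=(2015, 2040, 2065, 2115)):
    """Mean age of onset among onsets that are observed by each milestone."""
    rng = random.Random(seed)
    # Hypothetical onset ages for one young generation (not from the paper).
    onset_ages = [max(0.0, rng.gauss(65.0, 12.0)) for _ in range(n)]
    result = {}
    for year in milestones:
        # An onset is observed only if it occurs on or before the milestone.
        seen = [a for a in onset_ages if birth_year + a <= year]
        result[year] = mean(seen)
    return result
```

Because later milestones admit progressively older onsets from the same cohort, the mean observed age of onset rises with the length of follow-up, which mirrors the upward shift described for the younger generations.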

Effect of Carrier Probability on Proportion of Ascertained Families with Genetic Cases
We illustrate the effect of varying the carrier probability on the proportion of ascertained pedigrees that are segregating a genetic variant. To accomplish this, in addition to the one thousand pedigrees considered in Results: Applications: Proportion of Ascertained Pedigrees Segregating a Causal Variant, we simulated an additional one thousand pedigrees, according to the same settings described in the main text, with carrier probabilities 0.01 and 0.005. The results of this investigation are displayed in Figure 3.
Figure 3: Scatter plots of the probability that a randomly selected pedigree from a sample of ascertained pedigrees is segregating a genetic variant with carrier probability p_c and relative-risk of disease κ against the relative-risk of disease κ. We consider restricting attention to the ascertained pedigrees with n_A or more disease-affected relatives. In the leftmost plot, we consider all one thousand pedigrees ascertained with two or more disease-affected relatives; in the rightmost plot, we consider the subset with three or more disease-affected relatives.
Figure 3 illustrates that, as the carrier probability increases, the proportion of ascertained pedigrees that segregate a causal variant increases for every genetic relative-risk value considered except κ = 1.
Our simulation procedure allows only the starting founder, and not any of the marry-ins, the opportunity to introduce a causal variant. Therefore, as the carrier probability increases, our procedure will introduce a causal variant less frequently than would be observed under the assumptions of random mating in the population. As a result, as p_c increases, this procedure will underestimate the proportion of ascertained families that are segregating a causal variant.
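A rough sense of the size of this effect can be obtained from a back-of-the-envelope calculation, sketched below in Python. This is not a calculation from the paper: it assumes, purely for illustration, that under random mating each of a hypothetical number of founders (the starting founder plus marry-ins) carries a causal variant independently with probability p_c, and it ignores ascertainment altogether.

```python
def prob_variant_introduced(p_c, n_founders):
    """P(at least one of n independent founders is a carrier)."""
    return 1.0 - (1.0 - p_c) ** n_founders


def shortfall(p_c, n_founders):
    """Gap between the random-mating probability and the single-founder
    probability p_c (only the starting founder may introduce a variant)."""
    return prob_variant_introduced(p_c, n_founders) - p_c
```

With four founders, for instance, the gap is larger for p_c = 0.01 than for p_c = 0.005, consistent with the statement that the underestimation grows as p_c increases.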

Comparison of Simulated and Observed Age-Specific Fertility Data
We demonstrate that the proposed method to simulate the waiting time to reproduction, described in Methods: Simulating Life Events: Reproduction, mimics observed fertility data. We simulated 10,000 lives, each starting at birth and ending with death, and recorded the ages at which each individual reproduced. From these data we calculated the percentage of first-born births by age group. Table 3 compares the percentage of first-born births by age group in the simulated data with that of the 1993 and 2013 Canadian populations [7].
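The tabulation step can be sketched in Python as follows. The function name and the five-year age groups are assumptions made for illustration and need not match the groups used in Table 3; the input is one list of reproduction ages per simulated life.

```python
def first_birth_percentages(reproduction_ages,
                            age_groups=((15, 19), (20, 24), (25, 29),
                                        (30, 34), (35, 39), (40, 44))):
    """Percentage of first births that fall in each age group.

    `reproduction_ages` holds one list of ages per individual; individuals
    with no offspring contribute an empty list and are skipped.
    """
    first_ages = [min(ages) for ages in reproduction_ages if ages]
    counts = {group: 0 for group in age_groups}
    for age in first_ages:
        for low, high in age_groups:
            if low <= age < high + 1:       # e.g. 15 <= age < 20
                counts[(low, high)] += 1
                break
    total = sum(counts.values())
    if total == 0:
        return {group: 0.0 for group in age_groups}
    # Percentages are taken over first births inside the listed groups.
    return {group: 100.0 * c / total for group, c in counts.items()}
```

For example, four hypothetical individuals with first births at ages 19.9, 22.5, 26.0, and 31.2 would each contribute 25% to a different age group.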