### IBD estimation

There are two common approaches to estimating IBD coefficients: maximum likelihood [10–12] and the method of moments [13–15]. Typically, maximum-likelihood estimators (MLEs) are more biased than method-of-moments estimators (MMEs), especially when the number of loci is small; they are also more computationally expensive [14]. However, MMEs are less precise than MLEs and can fall outside the biologically meaningful parameter space [11].

In this section, we review a popular method of moments approach to estimating IBD coefficients introduced by [15] and implemented in PLINK. This approach assumes that the individuals are from the same homogeneous, random-mating and non-inbred population. Alleles from two individuals are considered to be IBD if they are descended from a common ancestor in some base population that we take to be relatively recent. All alleles in this base population are defined to be non-IBD. Given a SNP with alleles *A* and *a*, a pair of individuals that are, say, *AA* and *aa*, respectively, will be denoted (*AA, aa*).

Identity-by-state (IBS) for a pair of subjects is denoted by the random variable

*I* and identity-by-descent by the random variable

*Z*, with possible states being 0, 1, and 2 for both random variables. The IBD coefficients to be estimated are the proportions of genome shared IBD, denoted by

*P*(

*Z* = 0),

*P*(

*Z* = 1), and

*P*(

*Z* = 2). For a given SNP

*m*, the procedure begins by expressing the prior probability of IBS sharing as

$\begin{array}{l}{P}_{m}(I=i)=\sum _{z=0}^{i}{P}_{m}(I=i|Z=z\left)P\right(Z=z).\end{array}$

(1)

*P*(

*Z* =

*z*) and

*P*_{
m
}(

*I* =

*i*) are specific to the pair of subjects being considered, while the conditional SNP-specific IBS probabilities

*P*_{
m
}(

*I* =

*i*|

*Z* =

*z*) apply to all pairs. For a given pair of individuals at a given SNP, the above equation specifies three identities for the IBS states 0, 1, and 2. These three identities are summed over SNPs and then rearranged to express

*P*(

*Z* = 0),

*P*(

*Z* = 1), and

*P*(

*Z* = 2) for the pair in terms of marginal and conditional IBS probabilities. For example, in the case of

*i* =

*z* = 0, we obtain

$P(Z=0)=\sum _{m}{P}_{m}(I=0)/\sum _{m}{P}_{m}(I=0|Z=0).$

The method-of-moments estimators of IBD coefficients for a given pair are obtained by substituting estimators of the conditional SNP-specific IBS probabilities, *P*_{
m
}(*I* = *i*|*Z* = *z*), pertaining to any pair and the pair’s marginal SNP-specific IBS probabilities, *P*_{
m
}(*I* = *i*), into the identities and then solving for *P*(*Z* = *i*).

The marginal SNP-specific IBS probabilities, *P*_{
m
}(*I* = *i*) for a pair of subjects may be estimated by the indicator function for whether the pair has *I* = *i* at the SNP. An unbiased estimator of $\sum _{m}{\widehat{P}}_{m}(I=i)$ is therefore the count of SNPs at which the pair shares *i* alleles IBS. Estimates of the SNP-specific conditional IBS probabilities, *P*_{
m
}(*I* = *i*|*Z* = *z*), are based on data from all subjects in the sample. Derivation of unbiased estimators of *P*_{
m
}(*I* = *i*|*Z* = *z*) is more involved. To simplify notation, we temporarily drop the SNP subscript *m*. If *p* and *q* = 1−*p* denote the frequencies of *A* and *a* in the base population, then *P*(*I* = *i*|*Z* = *z*) is a function of *p* and *q*. For example, two individuals share 0 alleles IBS if they are either (*AA, aa*) or (*AA, aa*). Given that *Z* = 0, the probabilities of these genotypes are *p*^{2}*q*^{2} and *q*^{2}*p*^{2}, respectively, leading to *P*(*I* = 0|*Z* = 0) = 2*p*^{2}*q*^{2}. The plug-in estimators of conditional IBS probabilities, such as *P*(*I* = 0|*Z*=0), obtained by inserting estimators $\widehat{p}$ and $\widehat{q}$ are biased [Additional file 1]. Unbiased estimators, expressed as the plug-in estimator multiplied by a correction factor, may be derived as described next.

Let *X* and *Y* be the counts of the alleles *A* and *a*, respectively, so that the allele frequency estimators are $\widehat{p}=X/T$ and $\widehat{q}=Y/T$, where *T* is twice the number of observed genotypes in the population random sample. The estimators of the conditional IBS probabilities *P*(*I* = *i*|*Z* = *z*) may be motivated by the following model. The genotype of each individual in the present population is obtained from two independent draws from an infinite base population of alleles. Consequently, the *T* alleles of a population random sample of study subjects can be viewed as a random sample from the base population. Moreover, conditional on IBD status, any pair of individuals in the present population can be viewed as independent allelic draws from the base population, with the number of draws determined by their IBD status.

For example, in the case of

*Z* = 0, a random pair of individuals results from randomly drawing two pairs of alleles from the base population. An indicator variable of whether this sampling process results in

*I* = 0 is an unbiased estimator of

*P*(

*I* = 0|

*Z* = 0). An unbiased estimator is therefore the average of these indicator variables over all possible draws from the

*T* alleles on which we have data; i.e., the proportion of pairs of allelic pairs with

*I* = 0. The proportion can be computed as follows. The number of ways of selecting four distinct alleles from a total of

*T* is

*T*(

*T* − 1)(

*T* − 2)(

*T* − 3). Without loss of generality, suppose the first two alleles are assigned to the first individual in a pair and the last two alleles to the second individual. Then the number of pairs that are (

*AA, aa*) and (

*AA, aa*) are

*X*(

*X* − 1)

*Y*(

*Y* − 1) and

*Y*(

*Y* − 1)

*X*(

*X* − 1), respectively. Hence,

$\begin{array}{l}\widehat{P}(I=0|Z=0)=\frac{2X(X-1)Y(Y-1)}{T(T-1)(T-2)(T-3)},\end{array}$

(2)

is an unbiased estimator of

*P*(

*I*=0|

*Z*=0) (see Additional file

1 for verification by direct computation). After algebra, the unbiased estimator may be expressed in terms of the allele frequency estimators and a correction factor as:

$\begin{array}{ll}\widehat{P}(I=0|Z=0)=2{\widehat{p}}^{2}{\widehat{q}}^{2}& \left(\frac{X-1}{X}\times \frac{Y-1}{Y}\times \frac{T}{T-1}\right.\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{0.3em}{0ex}}\times \left.\frac{T}{T-2}\times \frac{T}{T-3}\right).\phantom{\rule{2em}{0ex}}\end{array}$

For

*Z* = 1, we consider a pair of individuals to be the result of drawing three alleles from the base population, one of which is shared by the pair of individuals. The proportion of such pairs of individuals with IBS state

*I*=1 in our data is an unbiased estimator of

*P*(

*I* = 1|

*Z* = 1). The number of ways to select three distinct alleles from a total of

*T* is

*T*(

*T* − 1)(

*T* − 2). Among these, the genotype pairs that are

*I* = 1 are the

*X*(

*X* − 1)

*Y*,

*Y* *X*(

*X* − 1),

*Y*(

*Y* − 1)

*X*, and

*X* *Y*(

*Y* − 1) that are (

*AA, aa*), (

*AA, aa*), (

*AA, aa*), and (

*AA, aa*), respectively. Thus,

$\begin{array}{l}\widehat{P}(I=1|Z=1)=\frac{2X(X-1)Y+2\mathit{\text{XY}}(Y-1)}{T(T-1)(T-2)}\\ \phantom{\rule{7em}{0ex}}=\frac{2\mathit{\text{XYX}}}{T{T}^{2}}\left(\frac{X-1}{X}\times \frac{T}{T-1}\times \frac{T}{T-2}\right)\\ \phantom{\rule{8em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\frac{2X{Y}^{2}}{T{T}^{2}}\left(\frac{X-1}{X}\times \frac{T}{T-1}\times \frac{T}{T-2}\right)\\ \phantom{\rule{7em}{0ex}}=2{\widehat{p}}^{2}\widehat{q}\left(\frac{X-1}{X}\times \frac{T}{T-1}\times \frac{T}{T-2}\right)\\ \phantom{\rule{8em}{0ex}}+\phantom{\rule{0.3em}{0ex}}2\widehat{p}{\widehat{q}}^{2}\left(\frac{X-1}{X}\times \frac{T}{T-1}\times \frac{T}{T-2}\right).\end{array}$

The other conditional IBS probabilities are estimated in an analogous manner and their expressions are provided in Table one of [15].

With estimates

${\widehat{P}}_{m}(I=i)$ and

${\widehat{P}}_{m}(I=i|Z=z)$ for each SNP, we sum over SNPs to obtain estimates of the IBD coefficients for a given pair in the sample. Let

$\begin{array}{l}\widehat{N}(I=i|Z=z)=\sum _{m=1}^{L}{\widehat{P}}_{m}(I=i|Z=z)\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\text{and}\\ \phantom{\rule{3em}{0ex}}\widehat{N}(I=i)=\sum _{m=1}^{L}{\widehat{P}}_{m}(I=i),\end{array}$

(3)

where

*L* is the total number of SNPs with genotype data on both individuals. For any pair of subjects, summing equation (

1) over all the SNPs and using equation (

3) gives the following method-of-moment estimators of the IBD coefficients:

$\phantom{\rule{-16.0pt}{0ex}}\begin{array}{l}\widehat{P}(Z\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}0)\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}\frac{\widehat{N}(I=0)}{\widehat{N}(I=0|Z=0)}\\ \widehat{P}(Z\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}1)\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}\frac{\widehat{N}(I=1)-\widehat{P}(Z=0)\times \widehat{N}(I=1|Z=0)}{\widehat{N}(I=1|Z=1)}\\ \widehat{P}(Z\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}2)\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}\frac{\widehat{N}(I\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}2)\phantom{\rule{0.3em}{0ex}}-\phantom{\rule{0.3em}{0ex}}\widehat{P}(Z\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}0)\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}\widehat{N}(I\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}2|Z\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}0)\phantom{\rule{0.3em}{0ex}}-\phantom{\rule{0.3em}{0ex}}\widehat{P}(Z=1)\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}\widehat{N}(I\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}2|Z\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}1)}{\widehat{N}(I\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}2|Z\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}2)}\phantom{\rule{0.3em}{0ex}}.\end{array}$

Adjustments to bound these estimators to values consistent with their interpretation as IBD proportions were proposed in [15]. We have not made these adjustments in our graphical displays.

### Gene drop simulation with LD

The package provides a graphical display that can be used to identify related sample pairs by plotting the estimated IBD coefficients $\widehat{P}(Z=1)$ versus $\widehat{P}(Z=0)$. To assess the variability of these estimators the points of the IBD plot are superposed on reference clusters obtained from one of the following relationships: unrelated, monozygotic twins/duplicates, parent-offspring, full siblings, half siblings and first cousins. These reference clusters are obtained by gene drop simulation that accounts for LD [16]. A strength of this approach is that we do not need to assume independence of marker loci. In candidate-gene association studies, this feature is important because of the dependence among a relatively small number of SNPs. Ignoring the dependence among SNPs within genes produces reference clusters that are too tight relative to the true variability, and can lead to false-positive results. We return to this point in the examples.

A graphical model is an approach to modeling the joint distribution of a set of dependent random variables when many independences or conditional independences exist between subsets of the variables. In the case of LD, it is expected that the joint distribution of alleles allong haplotypes shows such a structure. A flexible graphical model of haplotype frequencies that captures LD between loci is described in [17]. The model is fit to data from subjects that can be regarded as a population random sample; e.g., the controls in a case-control study of a rare disease. Model parameters are estimated by use of a stochastic optimization algorithm [16].

Once the LD model is fit, it is used to sample haplotypes for the founders of a pedigree. Data on the remaining members of the pedigree are simulated by gene drop. Gene drop is a method for randomly generating the genotypes of related individuals in a pedigree. Alleles are “dropped” from the founders through the pedigree according to Mendel’s laws. Multi-locus gene drop incorporates the process of recombination. To illustrate the simulation procedure, consider a parent-offspring relationship. A pedigree that encompasses this relationship is one comprised of two parents and the offspring. The founders are the parents. Parental haplotypes are simulated from the fitted LD model and are then dropped to the offspring. To mimic real data with missing genotypes, selected genotypes for a simulated individual are set to missing according to the missing genotype pattern of a randomly-sampled study subject.

Programs for fitting LD models and performing gene drop simulations are available in the Java Programs for Statistical Genetics and Computational Statistics (JPSGSC) library developed by Alun Thomas (http://balance.med.utah.edu/wiki/index.php/JPSGCS). We use the **R** package **rJPSGCS**[18] to access these programs from R.