Data pre-processing
By default, we start with coverage data calculated from BAM files by either the PureCN calculateBamCoverageByInterval function or by the GATK DepthOfCoverage tool. Both calculate total and average coverages of all targeted genomic regions (Fig. 1a). While it is possible to extract coverage data from germline and somatic single nucleotide variant (SNV) data directly, calculation of coverage across the complete targeted genome utilizes all on-target data and makes the correction of assay-specific capture biases straightforward by utilizing a pool of normal samples. Other biases, most importantly GC bias, are library-specific and should be corrected separately. We thus first GC normalize the coverage data using standard methods [10, 12]. Additionally, SNV data in VCF format are obtained separately using standard third-party tools such as MuTect [22]. All BAM files, from tumor and normal samples, are processed with this pipeline depicted in Fig. 1a. If the tumor and normal samples are matched, then the SNV caller can be run in matched mode to obtain somatic status of variants.
Copy number normalization and segmentation
Next, for calculating target-level copy number log-ratios, a suitable normal sample is ideally selected from a pool of high quality process-matched normals via principal component analysis (PCA) of GC-normalized coverage data (Fig. 1b). By default, the normal sample with minimum Euclidean distance to the tumor on the first 3 principal components is used for normalization. This procedure selects a normal sample with sufficient coverage and similar library-specific coverage biases compared to the tumor sample. Since coverage is the major source of variance, we scale the coverage for this step to a maximum (defaults to 100×) for samples exceeding this maximum. It is also possible to use the n best normals and then provide the normalization function an (weighted) averaged coverage, which is useful when the normal samples are sequenced to significantly lower target coverage than the tumor samples.
Our implementation is modular and allows the incorporation of existing segmentation algorithms. In the default setting, log-ratios of coverage between the tumor sample and PCA-matched normal sample are smoothed and segmented using a weighted version of the circular binary segmentation algorithm implemented in the DNAcopy R package (CBS) (Venkatraman and Olshen, 2007). We set target weights proportional to the inverse of the coverage ratio standard deviations in the pool of normals using the function createTargetWeights. Thus, targets with highly variable coverage in normals, either due to technical artifacts (e.g. mappability issues) or common germline variants, are down-weighted in the segmentation. If no pool of normals is available, the standard, unweighted CBS is used. If a pool of normal samples is available, we further exclude target intervals with low median coverage (by default lower than 20% of the chromosome median).
While heterozygous germline SNPs are sparse, they do provide valuable information for improving segmentations obtained by coverage data only. The number of targets with heterozygous SNPs usually varies between 5 and 15%, depending whether SNPs in flanking regions of targets are included or removed. We find that including SNPs in 50 bp flanking regions add a significant number of high coverage SNPs and we therefore use 50 bp as default, but optimal parameters depend on coverage and assay and should be tuned. Breakpoints of borderline significance (P > 0.001) are removed in our default segmentation when mirrored allelic fractions (1- allelic fraction if allelic fraction >0.5) of known heterozygous germline variants (dbSNP) are not significantly different (P > 0.2, two-sided t-test) in the corresponding neighboring segments. Segments are further tested for copy number neutral LOH. To this end, we recursively identify within all segments the optimal breakpoints that minimize the standard deviations of germline allelic fractions; only if the difference in allelic fraction of the neighboring candidate segments reaches a given alpha threshold, the breakpoint is accepted (two-sided t-test). We note that if the assay includes copy number tiling probes highly enriched in heterozygous SNPs, an algorithm (e.g. FACETS or PSCBS [8, 23]) that jointly segments coverage and allelic fractions can sometimes provide better results and we provide a convenient wrapper function for using the PSCBS method over the default. Finally, we use Ward’s hierarchical clustering to find segments with similar copy number log-ratios and mirrored allelic fractions. These segments are normalized to have the same mean log-ratios.
Purity and ploidy estimation
We first use a 2D grid search to find tumor purity and ploidy combinations that fit the log-ratio profile well. The log-ratios r
i
in a segment i are assumed to be normally distributed with standard error σ
ri
, the latter we estimate from the segmentation (i.e., is set to the average standard deviation of log-ratios in a segment). The log-ratios are a function of copy number C and purity p:
$$ {r}_i\sim N\left({ \log}_2\frac{p{C}_i+\left(1-p\right)2}{p\left({\displaystyle {\sum}_j}{l}_j{C}_j\right)/{\displaystyle {\sum}_j}{l}_j+\left(1-p\right)2},{\sigma}_{ri}\right) $$
(1)
A difference between our algorithm and most others designed for whole genome data is that segment likelihoods are essentially weighted by the number of exons per segment, not by the base pair segment size l. This is an advantage when targets are not evenly distributed, for example in smaller gene panels. The segment size is used for calculating the tumor ploidy in the denominator, using the tumor copy numbers of all j segments. Equation (1) assumes that ploidy in normal is 2; PureCN thus detects sex and excludes sex chromosomes for males.
Our algorithm can also take as input already segmented data and σ
ri
, for example when matched SNP6 data is available. The algorithm will then generate simulated exon-level data given a specified interval file and will use the same likelihood model. Typically, multiple purity/ploidy combinations are equally likely, and we will later use a Bayesian framework to pick the combination that best fits allelic fractions of germline and somatic single nucleotide variants (SNVs), without necessarily requiring knowledge whether these variants are indeed germline or somatic. All local optima identified in the grid search are tested via this framework. This grid search is typically performed in less than 2 min on an average workstation and significantly reduces the search space for the more computationally intensive fitting of variant allelic fractions.
In the grid-search, we assume that all ploidy values are possible, although this is not necessarily true since copy numbers are not continuous. The assumption allows the calculation of the likelihood scores in (1) without knowing the exact integer copy numbers of all segments required in the denominator. Thus every local optimum is in a second step optimized by Simulated Annealing, in which integer copy numbers are assigned to all segments and the purity estimate is fine-tuned. More precisely, Eq. (1) is used to calculate integer copy number posterior probabilities for all segments, P(C
i
), and we use a heated Gibbs sampler to optimize the segment copy numbers until convergence, which is, in general, achieved after few iterations. Purity is similarly optimized via heated Gibbs sampling using a specified grid (default from 0.15 to 0.95 in steps of 0.01). We consider copy numbers from 0 to 7 and include a “sub-clonal” state based on a univariate distribution, used for all segments that do not fit integer values and for capturing high-level amplifications with copy number >7 (Carter, et al., [1]).
Mis-calibrated copy number log-ratios (slightly right or left-shifted) can cause shifts in maximum likelihood ploidy estimates when assigning integer copy numbers to segments. In our optimization, we thus re-calibrate the log-ratios by Gibbs sampling. By default, log-ratios are right or left-shifted by at most 0.25 times the mean segment log-ratio standard deviation. If the optimized ploidy is one chromosome higher or lower than the ploidy identified the grid search, additional optimizations are attempted with this re-calibration range increased to up to 1 times the log-ratio standard deviation. The purity/ploidy solution is finally discarded if the optimized ploidy is, after these extensive re-calibrations, still not similar to the grid search ploidy. Mis-calibrations happen when major copy number alterations are not captured and are thus much more frequent in targeted panels without dedicated copy number tiling probes than in whole exome data.
SNV likelihood model
The next step in our approach is to determine somatic status of SNVs. We fit the allelic fractions of SNVs, provided as VCF file for example generated by the MuTect algorithm [22], to the purity/ploidy combinations of all local optima. We first specify the necessary prior probabilities for SNVs being somatic (vs. germline), P(g). If a matched normal is available, we set it to 0.999 for somatic mutations and 0.0001 for germline variants (note that these do not need to add up to 1, since these priors are assigned to different variants). The reason for not setting these priors to 1 and 0 is to limit the impact of single variants, in particular avoiding rare artifacts dominating the likelihood scores. Without matched normals, we rely on the public databases dbSNP and COSMIC, namely we set the prior to 0.95 if the variant is found more than 2 times in COSMIC; to 0.0005 if the variant found in dbSNP; to 0.01 if found in both COSMIC and dbSNP; and otherwise to 0.5. Accurate calibration of these priors is challenging, since these correspond to error rates in the public databases and these errors are sequence specific, for example errors in COSMIC often cluster in segmental duplications with low coverage and are thus different for different assays. All priors used in the PureCN likelihood model can be tuned by the user. In practice, since the vast majority of variants are germline and present in dbSNP, final results of purity and ploidy are very robust to the choice of these priors.
The expected allelic fraction f of variant i is a function of tumor purity p, copy number C, germline status g (1 for germline, 0 for somatic) and multiplicity M, which is the number of chromosomes harboring the mutation:
$$ E\left[{f}_i\right] = \frac{p{M}_i + {g}_i\left(1 - p\right)}{p{C}_i + 2\ \left(1 - p\right)} $$
(2)
Note that this does not model homozygous germline variants (g is not allowed to be 2), since these are uninformative and are by default removed. Somatic mutations further by definition always have a multiplicity larger than 0 (1 or larger for mono-clonal mutations). We model the sampling variance of allelic fractions using a beta distribution with n being the number of covered reads. The likelihood of observing a particular allelic fraction given these parameters is defined as in Carter et al. (Carter, et al., [1]):
$$ L\left({f}_i\Big|p,{C}_i,{g}_i,{M}_i,{n}_i\right)= Beta\left(E\left[{f}_i\right]\Big|\ {n}_i{f}_i+1,{n}_i\left(1-{f}_i\right)+1\right) $$
(3)
Note that heterozygous germline SNPs with observed allelic fraction significantly different from 0.5 [using (3), P < 0.05] in the matched normal or in a sufficient number of samples in the pool of normals are also removed. These are often SNPs in segmental duplications or other low-quality genomic regions. Smaller non-reference biases in regions of high mappability cause only minor shifts in expected allelic fractions and are not explicitly modeled, but we provide functionality to adjust observed allelic fractions, for example by estimating position-specific scaling factors in a large pool of normal samples. With increasing coverage, these biases may lead to very small likelihoods for correct purity and copy number values if not adjusted correctly, causing a paradox where increasing coverage decreases accuracy, and we therefore define a maximum value for n (defaults to 300×).
Incorporating the uncertainty of copy number calculated via Eq. (1), (3) becomes:
$$ L\left({f}_i\Big|p,{g}_i,\ {M}_i,n\right)={\displaystyle \sum_{C_i\in \left\{0..7\right\}}}P\left({C}_i\right)L\left({f}_i\Big|p,{C}_i,\ g,{M}_i,n\right) $$
(4)
We finally integrate over the uncertainty of germline status and multiplicity to find for each variant the most likely state:
$$ P\left({g}_i,{M}_i\Big|{f}_i,\ {n}_i\right) = \frac{P\left({M}_i\right)P\left({g}_i\right)L\left({f}_i\Big|p,{g}_i,{M}_i,{n}_i\right)}{{\displaystyle {\sum}_{C_i\in \left\{0..7\right\}}}{\displaystyle {\sum}_{K_i\le {C}_i}}{\displaystyle {\sum}_{M_i\le {C}_i}}{\displaystyle {\sum}_{g_i\in \left\{0,1\right\}}}P\left({C}_{ij}\right)P\left({M}_{ij}\right)P\left({g}_{ij}\right)L\left({f}_i\Big|p,{C}_{ij},{g}_{ij},\ {M}_{ij},{n}_i\right)} $$
(5)
Possible values for M depend on the number of maternal and paternal chromosomes, with K denoting the smaller one of the two chromosome numbers. We assume that the multiplicity of germline variants in a segment correspond to the maternal and paternal chromosome numbers with probability PK, by default set to 0.999. By not setting this value to 1, we make the likelihood model more robust to segmentation errors. For somatic mutations, we further always allow the mutation of a single chromosome; this assumes that multiplicities larger than 1 are the result of copy number alterations, almost never of independent mutations resulting in identical base changes. The prior probabilities for M are thus:
$$ P\left({M}_i\Big|{K}_i,\ {C}_i,\ {g}_i,{P}_K\right)=\left\{\begin{array}{c}\hfill {P}_K\frac{1}{n_s}\kern1.25em if\ {M}_i = {K}_i \vee \kern0.5em {M}_i = {C}_i-{K}_i\kern0.5em \vee \left({M}_i\le 1 \wedge {g}_i=0\right)\hfill \\ {}\hfill 0\ if\ {K}_i > \kern1.25em \left\lfloor {C}_i/2\right\rfloor \hfill \\ {}\hfill \begin{array}{cc}\hfill \left(1-{P}_K\right)\frac{1}{C_i+1-{n}_s}\hfill & \hfill otherwise\hfill \end{array}\hfill \end{array}\right. $$
(6)
Where n
s
denotes the number of utilized “allowed” states covered in the first case of Eq. (6) for a given K and C combination. This value can range from 1 to 4; in germline SNPs n
s
it would be 1 when both maternal and paternal copy numbers are equal and 2 when these two numbers differ. Somatic mutations can have two additional states, the mutation of a single chromosome (when M = 1) and a sub-clonal state (when M < 1). This sub-clonal state is by default modelled in Eq. (5) by replacing the invalid M = 0 and g = 0 state (somatic mutations by definition have M > 0) with M = 1/3 and g = 0. This M value represents the expected average cellular fraction of sub-clonal mutations.
We assume flat priors for K, \( P\left({K}_i\right)=\frac{1}{C_i+1} \), but note that databases of samples could provide better priors (see [1] for a related karyotype likelihood model). For example LOH in the TP53 tumor suppressor is very common in various cancer types; we would thus find the corresponding copy number state K = 0 and C = 1 frequently in these cancer types. If low ploidy solutions can explain the data well, then this prior further results in favoring low over high ploidy solutions (which is why K is defined over the complete copy number range). We however noticed that haploid solutions are often ranked relatively high in low purity samples, because the lack of one tumor chromosome does not result in sufficiently unbalanced germline allelic fractions in those samples. We thus give haploid and diploid solutions the same prior probability when the tumor purity is below 35%. Regions of LOH are classified as LOH or not, using the most likely segment state as determined in Eq. (5); a segment is in LOH if C = 1 or K = 0.
To model possible contamination from other individuals’ DNA, we optionally include two additional SNV states. The first models homozygous germline SNPs that were not removed because reference alleles were sequenced from the contaminated DNA, resulting in allelic fractions lower than 1 (Eq. 7). The second state (Eq. 8) models SNPs where the non-reference allele is only present in the contamination. The expected allele frequency is now a function of purity, tumor copy number and contamination rate c:
$$ E\left[{f}_i\right] = \frac{p{C}_i + 2\left(1 - p-c\right)}{p{C}_i + 2\ \left(1 - p\right)} $$
(7)
$$ E\left[{f}_i\right] = \frac{c}{p{C}_i + 2\ \left(1 - p\right)} $$
(8)
The prior probabilities for these states are set to a non-zero value for germline variants (present in dbSNP) only, because the dimension (i.e. number of possible states) for novel variants is much higher, thus rarely resulting in likelihood scores low enough to impact purity/ploidy selection. We set contamination rate and prior probability for this state by default to a low 0.01. The main motivation for this functionality is to provide a bin for germline variants that do not fit any other state, and more specialized tools should be used to detect contamination. We note that if matched normal samples are available, then this step is not crucial, since contamination is identified as non-germline by variant calling algorithms, whereas without matched normal samples, the presence of variants in dbSNP results in high germline prior probabilities.
In samples of 100% purity, homozygous SNPs should not be removed a priori, since these could be heterozygous SNPs in mono-clonal LOH regions. For high purity samples without matched normal samples, we therefore optionally provide yet another germline state, the homozygous state. Any observed reference reads are assumed to be independent sequencing errors resulting in identical base pairs (by default occurring at rate ε = 10−3/3) and the state likelihoods are then modeled with a binomial distribution. Flat prior probabilities independent of ploidy are applied.
Finally, for variants most likely being somatic, we calculate the fraction h of tumor cells harboring the mutation:
$$ h = \left[\frac{f}{M}pC+2\left(1-p\right)\right]\frac{1}{p} $$
(9)
The SNV-fit likelihood is the sum of the log-likelihood scores of the most likely states for all variants. The tumor purity/ploidy combinations are finally sorted by sum of the log-likelihood scores of both copy number and SNVs.
Our implementation provides an additional post optimization, in which purity is optimized using both copy number and allelic fractions in the SNV fitting step. This is achieved by adding purity as additional dimension in the denominator of Eq. 5. In default mode, this is turned off, i.e., allelic fractions are only used to select the most likely purity/ploidy combination from the copy number fitting. The accuracy gain for copy number calling is typically marginal in high quality samples with sufficient coverage as used in this benchmarking (data not shown). For classification of variants by somatic status, we recommend turning this feature on as small inaccuracies in purity can decrease the performance significantly since the distributions of allelic fractions of the different SNV states often overlap. By default, we use flat priors for tumor purity, but users can provide priors for all tested purity values in the grid.
Automated calling
If the algorithm is applied to many samples, it is important to flag samples that likely need manual curation. We flag samples of potentially low quality (noisy segmentations, high AT- or GC-dropout, sample contamination), samples where the maximum likelihood solution has characteristics rarely seen in correct solutions (rare ploidy, excessive LOH, excessive homozygous losses), and samples that are difficult to call (non-aberrant, poly-genomic). We further provide functionality for automatically removing very unlikely optima via a bootstrapping procedure, in which variants are sampled with replacement and optima are then re-ranked. Optima which never rank high in any bootstrap replicate are removed. Bootstrap values may also flag samples for manual curation when PureCN identified multiple plausible solutions. Finally, we calculate for each sample a goodness-of-fit score of the SNV fitting, ranging from 0 to 100%, where 0% corresponds to the worst possible fit and 100% to a perfect fit. We defined the worst possible fit as a fit in which observed allelic fractions differ on average by 0.2 from their expected values. Both low purity and high ploidy solutions are biased towards higher scores; low purity allelic fractions have a low variance in general and high ploidy solutions are complex and usually find good fits. Compared to log-likelihood scores, however, this goodness-of-fit score is intuitive and allows a straightforward flagging of very poor fits.