Copy number calling and SNV classification using targeted short read sequencing

Abstract PureCN [1] is a purity and ploidy aware copy number caller for cancer samples inspired by the ABSOLUTE algorithm [2]. It was designed for hybrid capture sequencing data, especially with medium-sized targeted gene panels without matching normal samples in mind (matched whole-exome data is of course supported). It can be used to supplement existing normalization and segmentation algorithms, i.e. the software can start from BAM files, from target-level coverage data, from copy number log2ratios or from already segmented data. After the correct purity and ploidy solution was identified, PureCN will accurately classify variants as germline vs. somatic or clonal vs. sub-clonal. PureCN was further designed to integrate well with industry standard pipelines [3], but it is straightforward to generate input data from other pipelines. Package PureCN 1.13.20


Introduction
This tutorial will demonstrate on a toy example how we recommend running PureCN on targeted sequencing data. To estimate tumor purity, we jointly utilize both target-level 1 coverage data and allelic fractions of single nucleotide variants (SNVs), inside -and optionally outside -the targeted regions. Knowledge of purity will in turn allow us to accurately (i) infer integer copy number and (ii) classify variants (somatic vs. germline, mono-clonal vs. sub-clonal, heterozygous vs. homozygous etc.).
This requires 3 basic input files: 1. A VCF file containing germline SNPs and somatic mutations. Somatic status is not required in case the variant caller was run without matching normal sample.
3. At least one BAM file from a normal control sample, either matched or processmatched.
In addition, we need to know a little bit more about the assay. This is the annoying step since here the user needs to provide some information. Most importantly, we need to know the positions of all targets. Then we need to correct for GC-bias, for which we need GC-content for each target. Optionally, if gene-level calls are wanted, we also need for each target a gene symbol. We may also observe subtile differences in coverage of tumor compared to normal due to varying proliferation rates and we can provide replication timing data to check and correct for such a bias. To obtain best results, we can finally use a pool of normal samples to automatically learn more about the assay and its biases and common artifacts.
The next sections will show how to do all this with PureCN alone or with the help of GATK and/or existing copy number pipelines.
All the steps described in the following are available in easy to use command line scripts described in a separate vignette.

Basic input files 2.1 VCF
Germline SNPs and somatic mutations are expected in a single VCF file. At the bare minimum, this VCF should contain read depths of reference and alt alleles in an AD format field and a DB info flag for membership in germline databases 2 .
Without DB flag, variant ids starting with rs are assumed to be in dbSNP. Population allele frequencies are expected in a POP_AF info field. These frequencies are currently only used to infer membership in germline databases when the DB flag is missing; in future versions they will be used calculate somatic prior probabilities more accurately.
If a matched normal is available, then somatic status information is currently expected in a SOMATIC info flag in the VCF. The VariantAnnotation package provides examples how to add info fields to a VCF in case the used variant caller does not add this flag. If the VCF contains a BQ format field containing base quality scores, PureCN can remove low quality calls. 3 While PureCN can use a pool of normal samples to learn which intervals are reliable and which not, it is highly recommended to provide the correct intervals. Garbage in, garbage out.
VCF files generated by MuTect [4] should work well and in general require no post-processing. PureCN can handle MuTect VCF files generated in both tumor-only and matched normal mode. Experimental support for MuTect 2 and FreeBayes VCFs generated in tumor-only mode is available.

Target information
For the default segmentation function provided by PureCN, the algorithm first needs to calculate log2-ratios of tumor vs. normal control coverage. To do this, we need to know the locations of the captured genomic regions (targets). These are provided by the manufacturer of your capture kit 3 . Please double check that the genome version of the target file matches the reference. Usually the manufacturer provides two files: the baits file containing the coordinates of the actual capture baits, and the target file containing the coordinates of the actual regions we want to capture. We recommend to use the baits file (and recognize the confusing nomenclature that we follow due to convention in established tools).
Default parameters assume that these targets do NOT include a "padding" to include flanking regions. PureCN will automatically include variants in the 50bp flanking regions if the variant caller was either run without interval file or with interval padding (See section 12.2).
PureCN will attempt to optimize the targets for copy number calling (similar to [5]): • Large targets are split to obtain a higher resolution • Targets in regions of low mappability are dropped • Optionally, accessible regions in-between the target (off-target) regions are included so that available coverage information in on-and off-target reads can be used by the segmentation function. In the following, we will use intervals when something applies to both on-target and off-target regions and targets when it only applies to on-target.
It further annotates intervals by GC-content (how coverage is normalized is described later in Section 3).

Third-party coverage tools
Calculating coverage from BAM files is a common task and your pipeline might already provide this information. As alternative to calculateBamCoverageByInterval, PureCN currently supports coverage files generated by GATK3 DepthOfCoverage, GATK4 CollectFragmentCounts and by CNVkit. By providing files with standard file extension, PureCN will automatically detect the correct format and all following steps are the same for all tools. You will, however, still need the interval file generated in Section 2.2 and the third-party tool must use the exact same intervals. See also FAQ Section 12.2 for recommended settings for GATK3 DepthOfCoverage.

Third-party segmentation tools
PureCN integrates well with existing copy number pipelines. Instead of coverage data, the user then needs to provide either already segmented data or a wrapper function. This is described in Section 10.1.

Library-specific coverage bias
In coverage normalization, we distinguish between assay-and library-specific biases. Assayspecific biases, for example due to probe density, probe capture efficiency and read mappability, are best removed with a pool of normal samples (Section 4.1, [5]). In other words, by examining the coverage of particular intervals in a pool of normals, we can estimate how well this assay captures these intervals and will then adjust the tumor coverage accordingly.
Other biases are library-specific, meaning a patient sample captured in different libraries may display dramatically different coverage profiles across libraries. Data from great sequencing centers usually show relatively small technical variance nowadays, but some biases are not completely avoidable. The most important library-specific bias is due to GC-content, i.e. regions of high AT-or GC-content are not always captured with exactly the same efficiency in tumor and normals.
We usually also observe that early replicating regions have a slightly higher coverage than late replicating regions [6,7]. Since there is often a significant difference in proliferation rates of tumor and normal, the pool of normals might also not completely adjust for this small bias.
As first step, we thus correct the raw coverage of all samples, tumor and normal, for these two major sources of library-specific coverage biases ( Figure 1). For GC-normalization, we use a 2step loess normalization [8]. For the replication timing bias, a linear model of log-transformed coverage and provided replication timing score is used.
correctCoverageBias(normal.coverage.file, interval.file, output.file="example _ normal _ loess.txt", plot.bias=TRUE) All the following steps in this vignette assume that the coverage data are normalized.
The example coverage files are already GC-normalized. We provide a convenient command line script for generating normalized coverage data from BAM files or from GATK coverage files (see Quick vignette).  This plot shows coverage as a function of on-and off-target GC-content and replication timing before and after normalization. Each dot is an interval. The example files are already GC-normalized; real data will show more dramatic differences.

4
Pool of normals

Coverage normalization
For calculating copy number log2-ratios of tumor vs. normal, PureCN requires coverage from a process-matched normal sample. Using a normal that was sequenced using a similar, but not identical assay, rarely works. Differently covered genomic regions simply result in too many log2-ratio outliers. This section describes how to optimally normalize coverage against a pool of normals.
The createNormalDatabase function builds a database of coverage files (a command line script providing this functionality is described in a separate vignette): Again, please make sure that all coverage files were GC-normalized prior to building the database (Section 3). Internally, createNormalDatabase determines the sex of the samples and trains a PCA that is later used for denoising tumor coverage using Tangent normalization [9]: normalDB <-readRDS("normalDB.rds") pool <-calculateTangentNormal(tumor.coverage.file, normalDB)

Artifact filtering
It is important to remove as many artifacts as possible, because low ploidy solutions are typically punished more by artifacts than high ploidy solutions. High ploidy solutions are complex and usually find ways of explaining artifacts reasonably well. The following steps in this section are optional, but recommended since they will reduce the number of samples requiring manual curation, especially when matching normal samples are not available.

VCF
We recommend running MuTect with a pool of normal samples to filter common sequencing errors and alignment artifacts from the VCF. MuTect requires a single VCF containing all normal samples, for example generated by the GATK3 CombineVariants tool (see Section 12.2).
It is highly recommended to provide PureCN this combined VCF as well; it will help the software correcting non-reference read mapping biases. This is described in the setMapping BiasVcf documentation. To reduce memory usuage, the normal panel VCF can be reduced to contain only variants present in 4 or more samples (the VCF for MuTect should however contain variants present in 2-3 samples).
Because these VCFs can become huge with large pools of normals, we can optionally precompute the mapping bias, thus avoiding parsing these VCFs for every sample: This function calculates copy number log2-ratios using all normal samples provided in the nor mal.coverage.files argument. Assuming that all normal samples are in general diploid, a high variance in log2-ratio is indicative of an interval with either common germline alterations or frequent artifacts; high or low copy number log2-ratios in these intervals are unlikely measuring somatic copy number events.
This interval.weight.file is automatically generated by the NormalDB.R script described in the Quick vignette.

Artifact filtering without a pool of normals
By default, PureCN will exclude targets with coverage below 15X from segmentation (with a pool of normals, targets are filtered based on the coverage and variance in normal database only). For variants in the provided VCF, the same 15X cutoff is applied. MuTect applies more sophisticated artifact tests and flags suspicious variants. If MuTect was run in matched normal mode, then both potential artifacts and germline variants are rejected, that means we cannot just filter by the PASS/REJECT MuTect flags. The filterVcfMuTect function optionally reads the MuTect 1.1.7 stats file and will keep germline variants, while removing potential artifacts. Without the stats file, PureCN will use only the filters based on read depths as defined in filterVcfBasic. Both functions are automatically called by PureCN, but can be easily modified and replaced if necessary.
We can also use a BED file to blacklist regions expected to be problematic, for example the simple repeats track from the UCSC: The normal.coverage.file argument points to a coverage file obtained from either a matched or a process-matched normal sample, but can be also a small pool of best normals (Section 4.1).
The normalDB argument (Section 4.1) provides a pool of normal samples and for example allows the segmentation function to skip targets with low coverage or common germline deletions in the pool of normals. If available, a VCF containing all variants from the normal samples should be provided via args.setMappingBiasVcf to correct read mapping biases. The files specified in args.filterVcf help PureCN filtering SNVs more efficiently for artifacts as described in Sections 4.2 and 4.3. The snp.blacklist is only necessary if neither a matched normal nor a large pool of normals is available.
The post.optimize flag will increase the runtime by about a factor of 2-5, but might return slightly more accurate purity estimates. For high quality whole-exome data, this is typically not necessary for copy number calling (but might be for variant classification, see Section 6.2.1). For smaller targeted panels, the runtime increase is typically marginal and post.optimize should be always set to TRUE.
The plot.cnv argument allows the segmentation function to generate additional plots if set to TRUE. Finally, verbose outputs important and helpful information about all the steps performed and is therefore set to TRUE by default.

Plots
We now create a few output files: The RDS file now contains the serialized return object of the runAbsoluteCN call. The PDF contains helpful plots for all local minima, sorted by likelihood. The first plot in the generated PDF is displayed in Figure 2 and shows the purity and ploidy local optima, sorted by final likelihood score after fitting both copy number and allelic fractions. The colors visualize the copy number fitting score from low (blue) to high (red). The numbers indicate the ranks of the local optima. Yellow fonts indicate that the corresponding solutions were flagged, which does not necessarily mean the solutions are wrong. The correct solution (number 1) of this toy example was flagged due to large amount of LOH.
We now look at the main plots of the maximum likelihood solution in more detail.
plotAbs(ret, 1, type="hist") ## NULL Figure 3 displays a histogram of tumor vs. normal copy number log2-ratios for the maximum likelihood solution (number 1 in Figure 2). The height of a bar in this plot is proportional to the fraction of the genome falling into the particular log2-ratio copy number range. The vertical dotted lines and numbers visualize the, for the given purity/ploidy combination, Germline variant data are informative for calculating integer copy number because unbalanced maternal and paternal chromosome numbers in the tumor portion of the sample lead to unbalanced germline allelic fractions. Figure 4 shows the allelic fractions of predicted germline SNPs. The goodness of fit (GoF) is provided on an arbitrary scale in which 100% corresponds to a perfect fit and 0% to the worst possible fit. The latter is defined as a fit in which allelic fractions on average differ by 0.2 from their expected fractions. Note that this does not take purity into account and low purity samples are expected to have a better fit. In the middle panel, the corresponding copy number log2-ratios are shown. The lower panel displays the calculated integer copy numbers, corrected for purity and ploidy. We can zoom into particular chromosomes ( Figure 5).
plotAbs(ret, 1, type="BAF", chr="chr19") plotAbs(ret, 1, type="AF") Finally, Figure 6 provides more insight into how well the variants fit the expected values.  Figure 4: B-allele frequency plot Each dot is a (predicted) germline SNP. The first panel shows the allelic fractions as provided in the VCF file. The alternating grey and white background colors visualize odd and even chromosome numbers, respectively. The black lines visualize the expected (not the average!) allelic fractions in the segment. These are calculated using the estimated purity and the total and minor segment copy numbers. These are visualized in black and grey, respectively, in the second and third panel. The second panel shows the copy number log2-ratios, the third panel the integer copy numbers.

Data structures
The R data file (file.rds) contains gene-level copy number calls, SNV status and LOH calls. The purity/ploidy combinations are sorted by likelihood and stored in ret$results.
names(ret) ## [1] "candidates" "results" "input" We provide convenient functions to extract information from this data structure and show their usage in the next sections. We recommend using these functions instead of accessing the data directly since data structures might change in future versions.

Prediction of somatic status and cellular fraction
To understand allelic fractions of particular SNVs, we must know the (i) somatic status, the (ii) tumor purity, the (iii) local copy number, as well as the (iv) number of chromosomes harboring the mutations or SNPs. One of PureCN main functions is to find the most likely combination of these four values. We further assign posterior probabilities to all possible combinations or states. Availability of matched normals reduces the search space by already providing somatic status.
The predictSomatic function provides access to these probabilities. For predicted somatic mutations, this function also provides cellular fraction estimates [10], i.e. the fraction of tumor cells with mutation. The output columns are explained in Table 1.
To annotate the input VCF file with these values: For optimal classification results: • Set post.optimize=TRUE since small inaccuracies in purity can decrease the classification performance significantly • Provide args.setMappingBias a pool of normal VCF to obtain position-specific mapping bias information • Exclude variants in regions of low mappability • Use a somatic posterior probability cutoff of 0.8 and 0.2 for somatic and germline variants, respectively. This appears to be a good compromise of call rate and accuracy.
If the beta-binomial model was selected in the model argument of runAbsoluteCN, these cutoffs might need to be relaxed to get acceptable call rates.
• Add a Cosmic.CNT info field to the VCF or provide a COSMIC VCF in runAbsoluteCN (see Section 10.2).
Note that the posterior probabilities assume that the purity and ploidy combination is correct. Before classifying variants, it is thus recommended to manually curate flagged samples.

Amplifications and deletions
To call amplifications, we recommend using a cutoff of 6 for focal amplifications and a cutoff of 7 otherwise. For homozygous deletions, a cutoff of 0.5 is useful to allow some heterogeneity in copy number.
For samples that failed PureCN calling we recommended using common log2-ratio cutoffs to call amplifications, for example 0.9.
This strategy is implemented in the callAlterations function: gene.calls <-callAlterations(ret) head(gene.calls) Posterior probability for contamination. This state corresponds to homozygous germline SNPs that were not filtered out because reference alleles from another individual were sequenced, resulting in allelic fractions smaller than 1.

GERMLINE.CONTLOW
Posterior probability for contamination. This state corresponds to non-reference alleles only present in the contamination.

GERMLINE.HOMOZYGOUS
Posterior probability that SNP is homozygous in normal. Requires the model.homozygous option in runAbsoluteCN. See Section 8.  It is also often useful to filter the list further by known biology, for example to exclude nonfocal amplifications of tumor suppressor genes. The Sanger Cancer Gene Census [11] for example provides such a list.
The output columns of callAlterations are explained in Table 2.

Find genomic regions in LOH
The gene.calls data.frame described above provides gene-level LOH information. To find the corresponding genomic regions in LOH, we can use the callLOH function: The output columns are explained in Table 3. This will generate a CSV file in which the correct purity and ploidy values can be manually entered. It also contains a column "Curated", which should be set to TRUE, otherwise the file will be overwritten when re-run.
Then in R, the correct solution (closest to the combination in the CSV file) can be loaded with the readCurationFile function: This function has various handy features, but most importantly it will re-order the local optima so that the curated purity and ploidy combination is ranked first. This means plotAbs(ret,1,type="hist") would show the plot for the curated purity/ploidy combination, for example.  Table 4 for an explanation of all flags. Please note that in order to detect homozygous deletions in 100% pure samples, you will need to provide a normalDB in runAbsoluteCN to filter low quality targets efficiently (Section 5). 4 If the third-party tool provides target-level log2-ratios, then these can be provided via the log.ratio argument in addition to seg.file though. See also Section 10.

Maximizing the number of heterozygous SNPs
It is possible to use SNPs in off-target reads in the variant fitting step by running MuTect without interval file and then setting the filterVcfBasic argument remove.off.target.snvs to FALSE. We recommend a large pool of normals in this case and then generating SNP blacklists as described in Sections 4.2 and 4.3. Remember to also run all the normals in MuTect without interval file.
An often better alternative to including all off-target reads is only including variants in the flanking regions of targets (between 50-100bp). This will usually significantly increase the number of heterozygous SNPs (see Section 12.2). These SNPs are automatically added if the variant caller was run without interval file or with interval padding.
10 Advanced usage

Custom normalization and segmentation
Copy number normalization and segmentation are crucial for obtaining good purity and ploidy estimates. If you have a well-tested pipeline that produces clean results for your data, you might want to use PureCN as add-on to your pipeline. By default, we will use DNAcopy [12] to segment normalized target-level coverage log2-ratios. It is straightforward to replace the default with other methods and the segmentationCBS function can serve as an example.
The next section describes how to replace the default segmentation. For the probably more uncommon case that only the coverage normalization is performed by third-party tools, see Section 10.1.2.

Custom segmentation
It is possible to provide already segmented data, which is especially recommended when matched SNP6 data are available or when third-party segmentation tools are not written in R. Otherwise it is usually however better to customize the default segmentation function, since the algorithm then has access to the raw log2-ratio distribution 4 . The expected file format for already segmented copy number data is 5 : Since its likelihood model is exon-based, PureCN currently still requires an interval file to generate simulated target-level log2-ratios from a segmentation file. For simplicity, this interval file is expected either via the tumor.coverage.file or via the interval.file argument (see Figure 7). Note that PureCN will re-segment the simulated log2-ratios using the default seg mentationCBS function, in particular to identify regions of copy-number neutral LOH and to cluster segments with similar allelic imbalance and log2-ratio. The provided interval file should therefore cover all significant copy number alterations 6 . Please check that the log2-ratios are similar to the ones obtained by the default PureCN segmentation and normalization.  SNV Index Copy Number log−ratio q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q

Custom normalization
If third-party tools such as GATK4 are used to calculate target-level copy number log2-ratios, and PureCN should be used for segmentation and purity/ploidy inference only, it is possible to provide these log2-ratios: Again, the max.candidate.solutions and test.purity arguments are set to non-default values to reduce the runtime of this vignette. It is highly recommended to compare the log2-ratios obtained by PureCN and the third-party tool, since some pipelines automatically adjust log2-ratios for a default purity value. Note that this example uses a pool of normals to filter low quality targets. Interval coordinates are again expected in either a interval.file or a tumor.coverage.file. If a tumor coverage file is provided, then all targets below the coverage minimum are further excluded.

Multi-sample segmentation
When multiple biopsies from the same patient are available, it might be beneficial to use a multi-sample segmentation that attempts to find a single segmentation for all biopsies.
The idea is to share information, most importantly from higher quality biopsies, and align breakpoints. PureCN supports the multipcf segmentation from the copynumber package: tumor2.coverage.file <-system.file("extdata", "example _ tumor2.txt", Again, the min.ploidy, max.ploidy and test.purity arguments are set to reduce the runtime of this toy example and should not be used for real data. The segmentationHclust function clusters segments using B-allele frequencies and joins adjacent segments when they are in the same cluster. Providing the calculateTangentNormal output pool via normal.coverage.file gives runAbsoluteCN access to the copy number ratios of all intervals, not only the segmentlevel ones. This function also supports weighting samples. By default, when coverages were calculated using PureCN, samples are weighted by the inverse of the read duplication rates. This usually dramatically reduces the number of spurious segments.

COSMIC annotation
If a matched normal is not available, it is also helpful to provide runAbsoluteCN the COSMIC database [13] via cosmic.vcf.file (or via a Cosmic.CNT INFO field in the VCF). While this has limited effect on purity and ploidy estimation due the sparsity of hotspot mutations, it often helps in the manual curation to compare how well high confidence germline (dbSNP) vs. somatic (COSMIC) variants fit a particular purity/ploidy combination.
For variant classification (Section 6.2.1), providing COSMIC annotation also avoids that hotspot mutations with dbSNP id get assigned a very low prior probability of being somatic.

ExAC and gnomAD annotation
PureCN is not automatically annotating input VCFs with data from common germline databases such as ExAC. See Section 2.1 for ways to tell PureCN where to find either a summary binary flag (i.e. likely germline yes/no) or population allele frequencies.

Mutation burden
The predictSomatic function described in Section 6.2.1 can be used to efficiently remove private germline mutations. This in turn allows the calculation of mutation burden for unmatched tumor samples. A wrapper function for this specific task is included as callMuta tionBurden: callableBed <-import(system.file("extdata", "example _ callable.bed.gz", package = "PureCN")) The callableBed file should be a file parsable by rtracklayer . This file can specify genomic regions that are callable, for example as obtained by GATK3 CallableLoci. This is optional, but if provided can be used to accurately calculate mutation rates per megabase. Variants outside the callable regions are not counted. Private germline rates should be fairly constant across samples; outliers here should be manually inspected.
The output columns are explained in Table 5.

Chromosomal Instability
Chromosomal Instability (CIN) is usually defined as the fraction of the genome that is altered. The callCIN function can be used to estimate this fraction.
Parameters define regions that are altered. First, allele.specific defines whether only total vs. both minor and major copy number define a state. Copy number neutral LOH would count as altered only when this parameter is set to TRUE. Second, reference.state defines the unaltered copy number state. This can be either normal for 2/1, or dominant for the most common state. While technically potentially wrong, the latter is robust to errors in ploidy and is thus recommended without careful manual curation. Similarly, setting allele.specific to FALSE makes this metric more robust to purity and ploidy errors, but usually to a much lesser extend.
It is recommended to test for a relationship of tumor purity and CIN and if necessary exclude low purity samples with uncertain CIN.

Detect cross-sample contamination
It is important to correctly handle heterozygous common SNPs that do not have an expected allelic fraction of 0.5 in normal samples. These can be SNPs in poor quality regions (as already described, see Section 4.2.1), but also SNPs from cross-sample contaminated DNA. Without matched normals, detection of those problematic SNPs is not trivial.
For cross-sample contamination, PureCN by default always tests for a 1% contamination and assigns common SNPs to a contamination state when allelic fractions are either close to 0 or close to 1 and when this cannot be explained by CNAs. The main purpose of these states is to provide a bin for common SNPs that for artifactual reasons do not fit any other state.
This tool applies a simple heuristic to flag samples for cross-contamination: Given the coverage and putative contamination rate based on allelic fractions of potentially contaminated SNPs, how many SNPs do we expect to detect based on our power to detect variants at that contamination rate? If the expected number is much higher than observed, then significant contamination is unlikely; observed SNPs close to 0 or 1 are more likely artifacts or the contamination rate is much lower than the minimum tested. Otherwise PureCN will perform a post-optimization in which contamination rate is optimized in additional variant fitting steps.
Cross-sample contamination can also result in increased observed heterozygosity on chrX for males, which in turn often results in a PureCN warning that sex inferred from coverage and VCF are in conflict.
By default, cross-contamination is tested in the range from 1 to 7.5%. Catastrophic failures with higher contamination might not get flagged.

Power to detect somatic mutations
As final quality control step, we can test if coverage and tumor purity are sufficent to detect mono-clonal or even sub-clonal somatic mutations. We strictly follow the power calculation by Carter et al. [2].
The following Figure 9 shows the power to detect mono-clonal somatic mutations as a function of tumor purity and sequencing coverage (reproduced from [2]):

Limitations
PureCN currently assumes a completely diploid normal genome. For human samples, it tries to detect sex by calculating the coverage ratio of chromosomes X and Y and will then remove sex chromosomes in male samples 7 . For non-human samples, the user needs to manually remove all non-diploid chromosomes from the coverage data and specify sex="diploid" in the PureCN call.
While PureCN supports and models sub-clonal somatic copy number alterations, it currently assumes that the majority of alterations are mono-clonal. For most clinical samples, this is reasonable, but very heterogeneous samples are likely not possible to call without manual curation. Poly-genomic tumors are often called as high ploidy or low purity. The former usually happens when sub-clonal losses are called as 2 copies and mono-clonal losses correctly as 1 copy. The latter when sub-clonal losses are called mono-clonal, which only happens when there are far more sub-clonal than mono-clonal losses. Please note however that unless purities are very high, algorithms that model poly-genomic tumors do not necessarily have a higher call rate, since they tend to overfit noisy samples or similarly confuse true high-ploidy with poly-genomic tumors. Due to the lack of signal, manual curation is also recommended in low purity samples or very quiet genomes.

Support
If you encounter bugs or have problems running PureCN, please post them at If PureCN throws user errors, then there is likely a problem with the input files. If the error message is not self-explanatory, feel free to seek help at the support site.
In your report, please add the outputs of the runAbsoluteCN call (with verbose=TRUE, or the * .log file in PureCN.R) and sessionInfo(). Please also check that your problem is not already covered in the following sections.
For general feedback such as suggestions for improvements, feature requests, complaints, etc. please do not hesitate to send us an email.

Checklist
• Used the correct interval files provided by the manufacturer of the capture kit and the genome version of the interval file matches the reference. Ideally used the baits file, not the targets file (in Agilent data, the baits files are called "covered" and the targets are "regions").
• For hybrid capture data, included off-target reads in the coverage calculation • BAM files were generated following established best practices and tools finished successfully.
• Checked standard QC metrics such as AT/GC dropout and duplication rates.
• Tumor and normal data were obtained using the same capture kit and pipeline.
• Coverage data of tumor and normal were GC-normalized.
• The VCF file contains germline variants (i.e. not only somatic calls).
• Maximized the number of high coverage heterozygous SNPs, for example by running MuTect with a 50-75bp interval padding (Section 9). The runAbsoluteCN output lists the percentage of targets with variants and this should be around 10-15%. Ultradeep sequencing data can provide good SNP allelic fractions in the 100-200bp flanking regions.
• If a pool of normal samples is available, followed the steps in Section 4.2.
• Read the output of runAbsoluteCN with verbose=TRUE, fixed all warnings.
• If third-party segmentation tools are used, checked that normalized log2-ratios are not biased, i.e. very similar compared to PureCN log2-ratios (some pipelines already adjust for a default normal contamination).

FAQ
If the ploidy is frequently too high, please check: • Does the log2-ratio histogram ( Figure 3) look noisy? If yes, then • Is the coverage sufficient? Tumor coverages below 80X can be difficult, especially in low purity samples. Normal coverages below 50X might result in high variance of log2-ratios. See Section 4.1 for finding a good normal sample for log2-ratio calculation.
• Is the coverage data of both tumor and normal GC-normalized? If not, see cor rectCoverageBias.
• Is the quality of both tumor and normal sufficient? A high AT or GC-dropout might result in high variance of log2-ratios. Challenging FFPE samples also might need parameter tuning of the segmentation function. See segmentationCBS. A high expected tumor purity allows more aggressive segmentation parameters, such as prune.hclust.h=0.2 or higher.
• Was the correct target interval file used (genome version and capture kit, see Section 2.4)? If unsure, ask the help desk of your sequencing center.
• Were the normal samples run with the same assay and pipeline?
• Did you provide runAbsoluteCN all the recommended files as described in Section 5?
• For whole-genome data, you will get better results using a specialized third-party segmentation method as described in section 10.1, since our default is optimized for targeted sequencing. In general, you should probably start with tools optimized for WGS data, such as Battenberg [14], ABSOLUTE [2], ACEseq [7], or TitanCNA [8]. We are planning to incorporate proper support for WGS once high coverage diagnostic WGS becomes more common.
• Otherwise, if log2-ratio peaks are clean as in Figure 3: • Was MuTect run without a matched normal? If yes, then make sure to provide either a pool of normal VCF or a SNP blacklist (if no pool of normal samples is available) as described in Sections 4.2 and 4.3.
• A high fraction of sub-clonal copy-number alterations might also result in a low ranking of correct low ploidy solutions (see Section 11).
If the ploidy is frequently too low: • PureCN with default parameters is conservative in calling genome duplications.
• This should only affect low purity samples (< 35%), since in higher purity samples the duplication signal is usually strong enough to reliably detect it.
• In whole-exome data, it is usually safe to decrease the max.homozygous.loss default, since such large losses are rare.
Will PureCN work with my data?
• PureCN was designed for medium-sized (>2-3Mb) targeted panels. The more data, the better, best results are typically achieved in whole-exome data.
• The number of heterozygous SNPs is also important (>1000 per sample). Copy number baits enriched in SNPs are therefore very helpful (see Section 9).
• Some users got acceptable results with small (<1Mb) panels. Try to find a perfect offtarget bin width (average.off.target.width in preprocessIntervals) and maximize the number of heterozygous SNPs by including as much padding as possible. Keep in mind that without tiling baits, you will only have a poor resolution to detect LOH.
• Coverages below 80X are difficult unless purities are high and coverages are even.
• PureCN also needs process-matched normal samples, again, the more the better.
• Samples with tumor purities below 15-20% usually cannot be analyzed with this algorithm and PureCN might return very wrong purity estimates. In high coverage samples with low duplication rates, this limit can be close to 10%.
• Whole-genome data is not officially supported and specialized tools will likely provide better results. Third-party segmentation tools designed for this data type would be again required.
• Amplicon sequencing data is also not officially supported. If the assay contains tiling probes (at least with 1Mb spacing) and uses a barcode protocol that reduces PCR bias of measured allelic fractions, then this method might produce acceptable results. Setting the model argument of runAbsoluteCN to betabin is recommended. Specialized segmentation tools might be again better than our default. Note that this MuTect VCF will contain variants in off-target reads. By default, PureCN will remove variants outside the provided targets and their 50bp flanking regions. We highly recommend finding good values for each assay. A good cutoff will maximize the number of heterozygous SNPs and keep only an acceptable number of lower quality calls. This cutoff is set via interval.padding in args.filterVcf. See Section 9.
• Support for GATK4 and Mutect2 is still experimental. When matched normals are available, this will require version 4.0.3.0 and higher and specifying the -genotypegermline-sites flag.
• For VCFs generated by other callers, the required dbSNP annotation can be added for example with bcftools: • To generate a mappability file with the GEM library: • Calculate mappability, set kmer size to length of mapped reads.
• For position-specific mapping bias correction, the more normals are available, the more rare SNPs will have reliable mapping bias estimates. This requires again at least about 10 normals to be useful, 100 or more are ideal.
• With smaller pool of normals, we additionally recommend filtering SNPs from low quality regions (Section 4.3). Additionally, it is worth trying the beta-binomial function instead of the default in the model argument of runAbsoluteCN. This will incorporate uncertainty of observed variant allelic fractions in the variant fitting step.
• Do I really need a pool of normals? I only have tumor samples. Unfortunately, yes. If you used a commercial capture kit, you might be able to obtain control samples from the vendor or the public domain. This is not optimal, but usually better than nothing.
• It is safe to include multiple normals from the same individual. Fewer common germline CNVs in calculateIntervalWeights and fewer SNPs in calculateMappingBiasVcf will be detected. But especially when the controls were sequenced in multiple batches, these replicates will still provide useful information for coverage normalization.
Questions related to manual curation. PureCN, like most other related tools, essentially finds the most simple explanation of the data. There are three major problems with this approach: • First, hybrid capture data can be noisy and the algorithm must distinguish signal from noise; if the algorithm mistakes noise for signal, then this often results in wrong high ploidy calls (see Sections 4.2 and 4.3). If all steps in this vignette were followed, then PureCN should ignore common artifacts. Noisy samples thus often have outlier ploidy values and are often automatically flagged by PureCN. The correct solution is in most of these cases ranked second or third.
• The second problem is that signal can be sparse, i.e. when the tumor purity is very low or when there are only few somatic events. Manual curation is often easy in the latter case. For example when small losses are called as homozygous, but corresponding germline allele-frequencies are unbalanced (a complete loss would result in balanced germline allele frequencies, since only normal DNA is left). Future versions might improve calling in these cases by underweighting uninformative genomic regions.
• The third problem is that tumor evolution is fast and complex and very difficult to incorporate into general likelihood models. Sometimes multiple solutions explain the data equally well, but one solution is then often clearly more consistent with known biology, for example LOH in tumor suppressor genes such as TP53. A basic understanding of both the algorithm and the tumor biology of the particular cancer type are thus important for curation. Fortunately, in most cancer types, such ambiguity is rather rare. See also Section 11.

Questions related to matched normals.
Coverage normalization: Even when matched normals are available, we recommend building a normal database for coverage normalization. This usually produces cleaner coverage profiles than the matched normal ( [9]).
VCFs: When matched normals are available, simply provide this information to the variant caller and make sure that germline SNPs are not filtered out. PureCN should automatically find the matched information.
If all or most of the samples are flagged as: Noisy segmentation: The default of 300 for max.segments is calibrated for high quality and high coverage whole-exome data. For whole-genome data or lower coverage data, this value needs to be re-calibrated. In case the copy number data looks indeed noisy, please see the first FAQ. Please be aware that PureCN will apply more aggressive segmentation parameters when the number of segments exceeds this cutoff. If the high segment count is real, this might confound downstream analyses.
High AT/GC dropout: If the data is GC-normalized, then there might be issues with either the target intervals or the provided GC content. Please double check that all files are correct and that all the coverage files are GC-normalized (Section 3).

Sex mismatch of coverage and VCF:
If the panel contains baits for chromosome Y, then the interval file was probably generated without mappability file (Section 2.2). Similarly when third-party tools were used for coverage normalization and segmentation, this usually means probes on chromosome Y were not filtered for mappability. Cross-sample contamination (Section 10.6) can also cause sex mismatches.