Currently, three Python programs are included with TagDigger: tagdigger_interactive.py, which counts the occurrences of tags in FASTQ files and outputs a table of tag counts indexed by tag and barcode, as well as (optionally) a table of numeric diploid genotypes for biallelic markers; tag_manager.py, which consolidates marker names, tag sequences, alignment and other information across multiple projects; and barcode_splitter.py, which splits a FASTQ file, by barcode, into multiple FASTQ files with barcode, adapter, and potentially chimeric sequence removed, and optionally generates MD5 checksums for all output files to facilitate archiving data with NCBI (National Center for Biotechnology Information), EBI (European Bioinformatics Institute) or DDBJ (DNA Data Bank of Japan). All functions used by these three programs are contained in the file tagdigger_fun.py so that they can be used by Python programmers for tasks such as batch processing. Restriction enzyme cut sites and adapter sequences are also included at the top of the tagdigger_fun.py file so that new enzymes and adapters can be easily added.
Input
For tagdigger_interactive.py, a user-generated key file lists all barcodes to search for in each FASTQ file. Three headers are needed for this CSV file: “File” (indicating the name of the FASTQ file), “Barcode” (indicating the barcode sequence), and “Sample”. Other columns are ignored. If FASTQ file names end with “.gz”, they are assumed to be compressed with GZIP, and otherwise they are assumed to be uncompressed. Barcodes that the user does not wish to investigate can be omitted to speed processing time. If the same sample name appears multiple times, tag counts are summed across all instances of that sample. If barcodes have been removed from the FASTQ file, the user can simply list one sample per file and leave the Barcode column blank. A very similar file is needed for barcode_splitter.py, but with the headers “Input File”, “Barcode”, and “Output File”.
Tag sequences can be imported to tagdigger_interactive.py and tag_manager.py in any of seven different formats. Four of these are the direct output of other SNP-calling software: FASTA files from TASSEL-UNEAK, the SAM file used for generating markers in TASSEL-GBS, tab-delimited text output of the cstacks program in Stacks, and the.alleles file output from pyRAD. Three CSV formats are also readable: two tags per row, with column headers “Marker name”, “Tag sequence 0”, and “Tag sequence 1”; one tag per row (allowing for non-biallelic markers), with column headers “Marker name”, “Allele name”, and “Tag sequence”; and two merged tags per row, with column headers “Marker name” and “Tag sequence”. For the latter format, the variable portion of the tag is put between square brackets with a forward slash separating the two alleles, e.g., ACAGACTT[A/T]GTACCCA. This merged format is also used as the output of tag_manager.py, since it conserves hard drive space and RAM and makes it easy for the human eye to see the polymorphism. For any of the seven tag formats, the user can supply, as a separate file, a list of names of markers to include in the output. Supplying such a list conserves processing time and output file size by ignoring tags in which the user is not interested.
Search algorithm
Before processing a FASTQ file, tagdigger_interactive.py recursively builds indexing trees of expected barcode and tag sequences (Fig. 1). These indexing trees enable very rapid matching of sequencing reads to the appropriate barcode and tag. They also enable the software to quickly discard reads that do not match any expected barcode or tag. To save processing time, tagdigger_interactive.py ignores sequencing quality scores, with the assumption that a read with an error is unlikely to match a known tag. TASSEL also ignores quality scores, instead distinguishing alleles from errors based on how frequently they appear in the dataset [6, 14]. Because most SNP mining software does not use the entire sequencing read (for example, by default TASSEL uses only the first 64 nucleotides after the barcode, and in other software the reads may be truncated to only retain the highest-quality portion) the TagDigger search algorithm checks to see if the read begins with a barcode and tag, and ignores any nucleotides in the read after the end of the tag. For any given FASTQ read, the algorithm begins at the first nucleotide of the read and moves along the read one nucleotide at a time, navigating through the indexing tree of barcodes. If the read begins with an expected barcode, and also has the expected restriction cut site after the barcode, the algorithm begins navigating the indexing tree of tags, beginning with the first nucleotide in the read after the restriction site (or at the beginning of the restriction site in the case of enzymes such as ApeKI with variable cut sites). If the read contains an expected tag, the software increments the read count for that barcode*tag combination.
The barcode_splitter.py program uses the above algorithm to assign reads to barcodes. The same search algorithm is used for identifying adapter sequence (to be trimmed out of the read), but the search is performed from the end of the read rather than the beginning. Additionally, the barcode splitter uses more conventional text searching to detect full restriction cut sites present anywhere in the read. Since a full cut site may indicate a chimera of two genomic DNA fragments, sequence after a full cut site is trimmed from the read.
Because the tag_manager.py program only assigns markers the same name if tag sequences are an exact match (as opposed to one tag sequence being a truncated version of another), it does not use the same search algorithm as the other two programs. Instead, a binary search using the bisect module in Python is used for matching tags across two sets of markers.
Output
The tagdigger_interactive.py program generates a CSV file of read counts, with samples in rows and tags in columns. The first row contains tag names, and the first column contains sample names. Tag names consist of the marker name and allele name(s) separated by an underscore. For biallelic markers, the final allele name is ‘0’ or ‘1’, with the tag that comes first alphabetically being allele ‘0’ (except in formats where the user specifies which tag is ‘0’ and which is ‘1’). The nucleotides at variable sites are also included as an allele name. Because TASSEL-GBS may mine multiple SNPs from the same locus where a tag aligns to the genome, TagDigger does not use the original SNP names when naming tags imported from the TASSEL-GBS pipeline, but can output a table indicating how the TASSEL-GBS SNP names correspond to the marker names generated by TagDigger.
When all markers are biallelic, tagdigger_interactive.py can also export a numeric genotype table, with samples in rows and markers in columns. Homozygous genotypes are coded as ‘0’ or ‘2’, heterozygous genotypes are coded as ‘1’, and cells with missing data are left blank. Homozygotes for allele ‘0’ are coded as ‘0’ and homozygotes for allele ‘1’ are coded as ‘2’.
The barcode_splitter.py program generates uncompressed FASTQ files. Barcodes, adapter sequence, and potentially chimeric sequence are removed as described in the previous section. The original comment line is retained for each read, with the barcode added to the end of the comment in keeping with Illumina FASTQ format. Optionally, barcode_splitter.py can also generate a CSV file listing the MD5 checksum of each new FASTQ file that has been created.
The tag_manager.py program outputs a CSV file with the headers “Marker name” and “Tag sequence”. Tag sequences are in the merged format, e.g., CCGATTAG[C/T]AGGGGTT, and can be read back into the TagDigger program. Marker names consist of a prefix supplied by the user followed by a number, e.g., MyLabsTags000102. Optionally, additional columns can contain the original marker names or other data provided by the user in CSV format. A FASTA file of sequences for each marker, with IUPAC nucleotide codes for variable sites, can also be exported. If this FASTA file is used for alignment to a reference genome, tag_manager.py can import the resulting SAM file and add columns to the marker list containing chromosome, position, and alignment quality.