SAMMate: a GUI tool for processing short read alignments in SAM/BAM format
© Xu et al; licensee BioMed Central Ltd. 2011
Received: 17 November 2010
Accepted: 13 January 2011
Published: 13 January 2011
Next Generation Sequencing (NGS) technology generates tens of millions of short reads for each DNA/RNA sample. A key step in NGS data analysis is the short read alignment of the generated sequences to a reference genome. Although storing alignment information in the Sequence Alignment/Map (SAM) or Binary SAM (BAM) format is now standard, biomedical researchers still have difficulty accessing this information.
We have developed a Graphical User Interface (GUI) software tool named SAMMate. SAMMate allows biomedical researchers to quickly process SAM/BAM files and is compatible with both single-end and paired-end sequencing technologies. SAMMate also automates some standard procedures in DNA-seq and RNA-seq data analysis. Using either standard or customized annotation files, SAMMate allows users to accurately calculate the short read coverage of genomic intervals. In particular, for RNA-seq data SAMMate can accurately calculate the gene expression abundance scores for customized genomic intervals using short reads originating from both exons and exon-exon junctions. Furthermore, SAMMate can quickly calculate a whole-genome signal map at base-wise resolution allowing researchers to solve an array of bioinformatics problems. Finally, SAMMate can export both a wiggle file for alignment visualization in the UCSC genome browser and an alignment statistics report. The biological impact of these features is demonstrated via several case studies that predict miRNA targets using short read alignment information files.
With just a few mouse clicks, SAMMate will provide biomedical researchers easy access to important alignment information stored in SAM/BAM files. Our software is constantly updated and will greatly facilitate the downstream analysis of NGS data. Both the source code and the GUI executable are freely available under the GNU General Public License at http://sammate.sourceforge.net.
Next generation deep sequencing technology has recently emerged as a promising tool to simultaneously and accurately quantify DNA/RNA abundance on the genomic scale . The alignment of tens of millions of short reads to a reference genome is a central step for subsequent data analysis. A variety of short read alignment tools are currently available that implement fast, efficient and accurate short read alignments against larger reference genomes. Some commonly used alignment tools include MAQ , Novoalign http://www.novocraft.com/, Bowtie , rMap  and RMAP . Many of these tools output the alignment results in the Sequence Alignment/Map (SAM) and Binary SAM (BAM) formats , which are widely considered the de facto standards for storing and transferring short read alignment results. Correspondingly, there are a number of open source software programs that process the alignment results stored in SAM/BAM files. For example, SAMtools http://samtools.sourceforge.net/ provides various utilities for manipulating alignments in SAM/BAM files including sorting, merging, indexing and generating alignments in a base-wise format . Another program, DNAA http://dnaa.sourceforge.net/, calculates alignment statistics, detects structural variation and simulates short-read data. For genome alignment visualization software, one may use GenomeView http://genomeview.sourceforge.net/ or IGV http://www.broadinstitute.org/igv. These programs have been very useful in analyzing NGS data and visualizing alignment results.
Nevertheless, a plethora of frequently needed genomic information stored in SAM/BAM files remains hidden from biomedical researchers. For example, one can calculate from SAM/BAM files the mRNA abundance scores from RNA-seq data  that are used to detect differential expression. Furthermore, SAM/BAM files may contain a base-wise signal map from RNA-seq , ChIP-seq  or Methyl-seq data  that is used to detect gene expression abundance, transcription factor binding sites or DNA methylation sites. SAM/BAM files may also store data about the short read coverage depth visualization needed to examine the gene or transcriptome structure alteration. Finally, SAM/BAM files contain alignment statistics needed to evaluate the performance of different alignments. In the past it has been quite tedious for biomedical researchers to access this rich data stored in SAM/BAM files. These significant hurdles have prevented biomedical researchers from fully exploiting the benefits of NGS technology. Therefore, SAMMate with its user friendly interface fulfills a critical role by giving biomedical researchers easy access to essential data.
User Interface Overview
The central pillar of SAMMate's design philosophy is ease of use for biomedical researchers. A user friendly interface and portability are essential to this goal. To achieve portability in a cross-platform environment, we have implemented SAMMate in Java. As such, our GUI tool SAMMate runs almost identically on the Windows, Mac and UNIX/Linux operating systems. A key advantage of using Java for software development lies in its independence from processor architectures and operating systems. Java achieves this independence via its use of a common executing engine (also known as the virtual machine) that has been implemented across different platforms. A compiler with a set of standard libraries has also been implemented for various hardware and operating systems. The only extra software needed by SAMMate is the freely available Java Runtime Environment (JRE), which is already available on most operating systems.
SAMMate's reusability is very robust as developers can easily reuse existing classes by using the new method to create an instance of a class. SAMMate is also very extensible. If a user wishes to expand upon a component, the user may simply use the extends keyword to inherit the methods and properties of the desired class. Developers may also conveniently add classes to implement new features. For example, developers may add a new parser to process a gene annotation file in a new format by adding a class to the Alignment package. Furthermore, a developer may add a new tab in the UI package to display results in the result navigator window at their own preference. One other important feature of SAMMate is its configurability. For example, SAMMate allows users to customize the chromosome names in output files to their own preferred chromosome names by manipulating the configuration file "chromosomesMap.txt". In summary, the SAMMate software architecture implements a number of Applied Programming Interfaces (API's) so that other software developers may easily extend and build more utilities.
Using the standard reference genome annotation files, SAMMate allows users to accurately calculate the gene expression abundance scores for all annotated genes using RNA-seq data. SAMMate can also calculate the abundance scores of customized genomic intervals. In particular, SAMMate is able to use short reads originating from both exons and exon-exon junctions to accurately calculate gene expression scores. We used in vitro 3'-UTR analysis to illustrate the superior reporting capability of SAMMate's gene expression scores compared with other competing software like TopHat . 3'-UTR assays are a widely used low-throughput and accurate method to quantify microRNA dependent post-transcriptional regulation. Fold change is a frequently used metric, especially for biologists, to access the magnitude of gene regulation, either up-regulation or down-regulation. Therefore, we expect a better method to quantify gene expression abundance using RNA-seq data can be determined by comparing the resulting fold changes to those calculated from more accurate low-throughput experiment, such as 3'-UTR assay. SAMMate can also generate wiggle files for visualization in the UCSC genome browser as well as an alignment statistics report. It is also compatible with both single-end and paired-end sequencing technologies. SAMMate is freely available under the GNU General Public License. The key features provided by SAMMate are outlined in Figure 1 and are thoroughly illustrated via the case studies below. The SAMMate software manual was also provided as Additional file 1.
Key Feature: Calculating genomic feature abundance scores
Ideally, transcriptome characterization and quantification should be done on the isoform level. However, many existing approaches that quantify transcript abundance on the isoform level depend upon stringent assumptions such as a priori known isoform structures and suffer from identifiability problems. Moreover, the accuracy of these algorithms for high throughput studies is in doubt. It is also not known how sensitive these algorithms are to error-prone isoform annotation databases. Many existing approaches [6, 12, 13] have proved their merit as pilot studies. However, they were only validated using RT-PCR for a limited number of genes with simple and identifiable isoform transcript structures. With the aforementioned in mind, transcript abundance quantification on the gene level remains one of the most demanded outputs from high throughput molecular profiling experiments such as microarray and NGS. The latter platform is much more sensitive in detecting low level gene expression and provides a much broader dynamic range of expression quantification. Taking gene expression profiling using Illumina Genome Analyzer as an example, Read Per Kilobase of exon model per million Mapped reads (RPKM) was used to score gene expression abundance. The values obtained can be interpreted as the number of copies of each transcript in the living cell where the average length of transcripts is 1KB . The RPKM scores can range from < 0.01 to > 10,000. There are now unprecedented and unparalleled opportunities to detect novel transcripts with ultra-low or ultra-high abundance.
Performance-wise, aligners vary vastly in accuracy as well as the underlying algorithms used. It is highly desirable for RNA-seq data analysis to allow users the freedom to choose and combine a pair of their favorite exon aligner and junction mapper to estimate gene expression scores. SAMMate fulfills this role by calculating and exporting a gene expression score matrix using a user-defined combination of an exon aligner and a junction mapper (Figure 4b). SAMMate then calculates the gene expression RPKM or FPKM score for gene i, ℛ as where i represents the gene index. is the short read counts uniquely mapped to exons using an exon aligner (e.g. Novoalign), and is the IUM short read counts uniquely mapped to the exon-exon junctions using a junction mapper (e.g. TopHat). N represents all uniquely mapped read counts in a cell extract sample, and L i is the summation of the exon lengths. Thus, SAMMate combines short reads mapped to exons (e.g. available in SAM/BAM format) and to exon-exon junctions (e.g. available in BED format) to accurately estimate gene expression scores (Figure 4b). SAMMate can also take many pairs of SAM(BAM)/BED files simultaneously, one for each cell sample, to calculate a Microsoft EXCEL compatible gene expression matrix. In this matrix rows correspond to genes or the customized genome coordinate intervals, and columns correspond to different cell samples. It must be noted that SAMMate is more flexible and accurate than other software, such as TopHat , that also export the gene expression scores. We validate our claim using experimental data obtained from 3'-UTR assay as a case study shown below. SAMMate's reporting utility for gene expression abundance score is also quite versatile as this utility is not limited to the annotated genes. In fact, SAMMate calculates genomic feature abundance scores for any user-defined genomic intervals. This utility dramatically simplifies the technical barriers for discovering novel genes.
Algorithmic and computational contributions
Biological case study: Comparing gene expression scores generated using SAMMate, TopHat and Novoalign to predict miRNA targets
We have studied a pair of control and treatment transcriptomes. The control transcriptome was derived from the wild-type MutuI cell line while the treatment transcriptome was derived from the miRNA-155 retrovirally transduced MutuI cell line . MicroRNAs plays pivotal roles in controlling normal and pathology associated cellular processes. Moreover, the importance of miRNA dysregulation in cancer is well known and a number of tumor promoting miRNA's have been identified. As a member of this class of microRNAs, miR-155 is implicated in lymphomagenesis and a wide array of nonlymphoid tumors including breast, colon, and lung. Despite the strong evidence for miRNA-155 as an oncogene, the underlying pathological mechanisms remain unclear, possibly due to limited knowledge of miRNA-155 targets and how these targets are involved in tumorigenesis [15, 16]. Both transcriptomes were profiled using the Illumina Genome Analyzer II platform with a 50-mer in read length. For each transcriptome, two biological replicates were used. Each biological replicate has 2 to 4 technical replicates nested within it. Each technical replicate of the transcriptome (a single lane in each instrument run) contains around 6 to 12 million short reads. This NGS data set is available at the National Center for Biotechnology Information (NCBI) Short Read Achive (SRI) website http://www.ncbi.nlm.nih.gov/sra with access code SRA011001. We aligned the short reads generated from each transcriptome to the reference human genome (Build 37 version 1) using Novoalign, Bowtie and TopHat respectively allowing up to two mismatches. The alignment information of the exons and exon-exon junctions were stored in SAM/BAM and BED formats. Our biological goal is to predict a list of miRNA-155 direct targets on the genomic scale. Accurate calculation of the gene expression scores is a central problem for achieving this goal since down-regulated genes are likely to be potential miRNA-155 direct targets.
Key Feature: Generating signal map for peak detection
A signal map is a frequently demanded data format for NGS data analysis. In a signal map file, alignment results are represented in the per-base "pileup" format. In this format the single nucleotide short read coverage depth is calculated whereas the whole genome coverage is provided as a vector of integers with length 3.2 × 109. A signal map is a common input for a number of frequently performed sequential analyses to detect a wide range of genomic features. For ChIP-seq and Methyl-seq data, significant peaks in a signal map may indicate potential transcription factor binding sites and DNA methylation sites, respectively. For DNA-seq data, significant change points in the signal map might indicate a true copy number change, which is often a hallmark of cancer .
Comparison between SAMMate and SAMtools
Running Time Comparison of SAMMate and SAMtool (SAM File: 6 *106
SAM File Parsing Time
Intel core Duo CPU 2.26 GHz RAM 2GB
Intel core Duo CPU 2.26 GHz RAM 2GB
Intel core Duo CPU 9300 2.50GHz RAM 4GB
Intel core Duo CPU 9300 2.50GHz RAM 4GB
Algorithmic and computational contributions
In calculating a signal map, SAMMate uses an optimized merge sort algorithm to sort the list of the start position of short reads. It was implemented using the "Collections" class in the JRE system library. The optimized merge sort algorithm has nlog(n) time complexity. For nearly sorted lists, the optimized merge sort algorithm runs substantially faster. Although a highly optimized quicksort algorithm is generally considered to be faster than a merge sort algorithm, the quick sort algorithm is not stable and provides no guarantee of achieving a time complexity anywhere near nlog(n). The optimized merge sort algorithm also does not reorder equal elements yielding another significant advantage. A huge decrease in time needed is obtained as repeatedly sorting the same list (many short reads in one lane may have the same start position) is no longer done.
Biological case study: Genome-wide change-point analysis to identify potential miRNA targets
Other than down-regulation at the whole gene locus level, another characteristic response of potential miRNA-155 targets is only visible from the signal map: an abrupt drop-out of the base-wise coverage in the 3-prime end. For this case study, we have the same biological goal as the previous case study, i.e., to predict miRNA-155 targets. We used SAMMate to calculate a signal map for each biological replicate. We then applied a genome-wide change point analysis to all of the annotated 3'-UTR regions to identify potential miRNA targets. We also compared the predicted miRNA-155 targets generated using
RPKM gene expression scores (previous case study) with the ones generated using signal maps (this case study). We hope these two orthogonal and complementary case studies that share the same biological goal will be more than adequate to demonstrate the robustness and key features of SAMMate. For the sake of completeness, we briefly introduce the change point analysis method that we have applied . For each annotated 3'-UTR, we test the null hypothesis of the equal mean and variance parameters in the sequence of base-wise signal. The alternative hypothesis is the unknown number of change points' position exists. For statistical analysis we applied a Mean and Variance Change point Model (MVCM) based approach s. The input of the change point analysis algorithm is two sets of signal map files generated by SAMMate where each set is a series of replicates of a genome sample. In each signal map file, there are over 249; 000 bases. The output for each annotated 3'-UTR is a list of change point positions sorted by ascending order according to their Schwarz Information Criterion (SIC) values . The 3'-UTR coordinates were determined by the human genome annotation file (version hg19) with over 34; 000 annotated 3'-UTR's in total.
Key Feature: Generating wiggle files for visualization
Key Feature: Generating an alignment report
Short read alignment statistics provide indispensable resources to examine the alignment quality as well as compare alignment results. SAMMate calculates and exports a number of alignment statistics including the percentage of uniquely mapped short reads and the percentage of short reads mapped to intergenic, exonic and intronic regions.
We have implemented a GUI software to allow biomedical researchers to parse, process and integrate alignment information found in SAM files. With this tool biomedical researchers are able to calculate gene expression scores using either standard or customized annotations. They are also able to visualize and compare alignment results with great ease. These utilities and their biological impact are adequately demonstrated via the case studies of miRNA target prediction. The biological applications of SAMMate, however, are not limited to miRNA target prediction alone. In fact, SAMMate applies to any biological problem whose solution depends on the gene expression abundance score and base-wise short read coverage signal. SAMMate is also highly modular and extensible providing a programmer friendly interface for ease of updates and the incorporation of contributions from the community. Our tool will greatly facilitate the downstream analysis of genomic sequencing data.
Funding: This research was supported by the Tulane Cancer Center Post-doctoral Matching Funds award (ND), an NIH 1R21LM010137-01 grant (DZ), and an NIH ARRA award (R01CA130752-02S1, EKF).
- Mardis ER: Next-Generation DNA Sequencing Methods. Annual Review of Genomics and Human Genetics. 2008, 9: 387-402. 10.1146/annurev.genom.9.081307.164359. [http://dx.doi.org/10.1146/annurev.genom.9.081307.164359]View ArticlePubMedGoogle Scholar
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008, [http://dx.doi.org/10.1101/gr.078212.108]Google Scholar
- Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25. [http://genomebiology.com/2009/10/3/R25]PubMed CentralView ArticlePubMedGoogle Scholar
- Jiang H, Wong WH: SeqMap : mapping massive amount of oligonucleotides to the genome. Bioinformatics. 2008, 24 (20): btn429-2396. 10.1093/bioinformatics/btn429. [http://dx.doi.org/10.1093/bioinformatics/btn429]View ArticleGoogle Scholar
- Smith AD, Xuan Z, Zhang MQ: Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics. 2008, 28;9: 128-10.1186/1471-2105-9-128.View ArticleGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis GR, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352. [http://dblp.uni-trier.de/db/journals/bioinformatics/bioinformatics25.html#LiHWFRHMAD09]PubMed CentralView ArticlePubMedGoogle Scholar
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009, 10: 57-63. 10.1038/nrg2484. [http://dx.doi.org/10.1038/nrg2484]PubMed CentralView ArticlePubMedGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226. [http://dx.doi.org/10.1038/nmeth.1226]View ArticlePubMedGoogle Scholar
- Park PJ: ChIP-seq: advantages and challenges of a maturing technology. Nature Reviews Genetics. 2009, 10 (10): 669-680. 10.1038/nrg2641. [http://dx.doi.org/10.1038/nrg2641]PubMed CentralView ArticlePubMedGoogle Scholar
- Laird PW: Principles and challenges of genome-wide DNA methylation analysis. Nature Reviews Genetics. 2010, 11 (3): 191-203. 10.1038/nrg2732. [http://dx.doi.org/10.1038/nrg2732]View ArticlePubMedGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111. 10.1093/bioinformatics/btp120. [http://dx.doi.org/10.1093/bioinformatics/btp120]PubMed CentralView ArticlePubMedGoogle Scholar
- Jiang H, Wong WH: Statistical inferences for isoform expression in RNA-Seq. Bioinformatics. 2009, 25 (8): 1026-1032. 10.1093/bioinformatics/btp113. [http://dx.doi.org/10.1093/bioinformatics/btp113]PubMed CentralView ArticlePubMedGoogle Scholar
- Zheng S, Chen L: A hierarchical Bayesian model for comparing transcriptomes at the individual transcript isoform level. Nucl Acids Res. 2009, 37 (10): e75-10.1093/nar/gkp282. [http://dx.doi.org/10.1093/nar/gkp282]PubMed CentralView ArticlePubMedGoogle Scholar
- Xu G, Fewell C, Taylor C, Deng N, Hedges D, Wang X, Zhang K, Lacey M, Zhang H, Yin Q, Cameron J, Lin Z, Zhu D, Flemington EK: Transcriptome and targetome analysis in MIR155 expressing cells using RNA-seq. RNA. 2010, 16 (8): 1610-1622. 10.1261/rna.2194910. [http://dx.doi.org/10.1261/rna.2194910]PubMed CentralView ArticlePubMedGoogle Scholar
- Yin Q, Wang X, Fewell C, Cameron J, Zhu H, Baddoo M, Lin Z, Flemington EK: MiR-155 inhibits Bone Morphogenetic Protein (BMP) signaling and BMP mediated Epstein Barr virus reactivation. Journal of virology. 2010, [http://dx.doi.org/10.1128/JVI.00635-10]Google Scholar
- Lin Z, Xu G, Deng N, Taylor C, Zhu D, Flemington EK: Quantitative and Qualitative RNA-Seq-Based Evaluation of Epstein-Barr Virus Transcription in Type I Latency Burkitt's Lymphoma Cells. J Virol. 2010, 84 (24): 13053-13058. 10.1128/JVI.01521-10. [http://dx.doi.org/10.1128/JVI.01521-10]PubMed CentralView ArticlePubMedGoogle Scholar
- Chen J, Wang YP: A Statistical Change Point Model Approach for the Detection of DNA Copy Number Variations in Array CGH Data. IEEE/ACM Trans Comput Biol Bioinformatics. 2009, 6 (4): 529-541. 10.1109/TCBB.2008.129.View ArticleGoogle Scholar
- Bao H, Guo H, Wang J, Zhou R, Lu X, Shi S: MapView: visualization of short reads alignment on a desktop computer. Bioinformatics. 2009, 25 (12): 1554-1555. 10.1093/bioinformatics/btp255.View ArticlePubMedGoogle Scholar
- Arner E, Hayashizaki Y, Daub CO: NGSView: an extensible open source editor for next-generation sequencing data. Bioinformatics (Oxford, England). 2010, 26: 125-126. 10.1093/bioinformatics/btp611. [http://dx.doi.org/10.1093/bioinformatics/btp611]View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.