RECOT: a tool for the coordinate transformation of next-generation sequencing reads for comparative genomics and transcriptomics
© Izawa and Sese; licensee BioMed Central Ltd. 2013
Received: 17 January 2012
Accepted: 19 February 2013
Published: 26 February 2013
The whole-genome sequences of many non-model organisms have recently been determined. Using these genome sequences, next-generation sequencing based experiments such as RNA-seq and ChIP-seq have been performed and comparisons of the experiments between related species have provided new knowledge about evolution and biological processes. Although these comparisons require transformation of the genome coordinates of the reads between the species, current software tools are not suitable to convert the massive numbers of reads to the corresponding coordinates of other species’ genomes.
Here, we introduce a set of programs, called REad COordinate Transformer (RECOT), created to transform the coordinates of short reads obtained from the genome of a query species being studied to that of a comparison target species after aligning the query and target gene/genome sequences. RECOT generates output in SAM format that can be viewed using recent genome browsers capable of displaying next-generation sequencing data.
We demonstrate the usefulness of RECOT in comparing ChIP-seq results between two closely-related fruit flies. The results indicate position changes of a transcription factor binding site caused sequence polymorphisms at the binding site.
Recent and ongoing advances in DNA sequencing technologies have permitted the genome sequences of non-model animals and plants to be determined. Now that the genome sequences are available, various experiments using these sequences, such as RNA-seq and ChIP-seq, have been performed. Comparison of the results obtained from non-model organism with data from closely related model organism is an important analysis which enhances our understanding of how genetic polymorphisms contribute to differences in phenotypes such as gene expression and regulation.
There are two different approaches for these comparisons. One approach is to directly align the short reads determined from the query species to the genome of the target species for comparison. The other approach is to match the nucleotides in the genome of the observed species to those of the target species using an alignment program, and then transform the coordinates of the short reads according to the correspondence of the nucleotides between the genomes. The former approach may cause problems when corresponding query and target sequences are dissimilar, as the genetic polymorphism may cause inaccurate or prevent the alignment of the reads to the target genome. To perform the latter approach, a genome conversion tool called liftOver might be useful. However, the program converts an annotation format called BED and cannot handle the standard next-generation sequence formats, SAM and BAM. Therefore, liftOver is not suitable for studying comparative genomics and transcriptomics using next-generation sequencers.
In this paper, we introduce a set of programs, called REad COordinate Transformer (RECOT) created to convert the alignment or mapping coordinates of short reads obtained from the query species (for example a non-model organism) to a comparison target species (for example a model species). After the user inputs the short reads, the query species genome, its gene regions, and the target species genome, our scripts automatically calculate the corresponding regions between the genome sequences and transform the coordinates of the short reads from the query species to the target species. Because RECOT generates an output file in the SAM format, the results can be visualized in most genome browsers that can handle next-generation sequences.
In this section, we describe the features and usage of RECOT. Python is required to run RECOT. We call the species from which the short reads were obtained the query species and the species to which the coordinates of the short reads will be transformed for comparison the target species.
The script “recot_extract.py” extract the gene region sequences with their 500-bp upstream sequences from the transcription start sites and downstream sequences from the 3′-end. The upstream and downstream lengths are defined in the configuration file. The query sequences need to be aligned to the target species genome to identify their corresponding regions (Figure1B). Because the output sequences of recot_extract.py are given in FASTA format, you can select a long sequence alignment tool, such as GMAP, BLAST, or BLAT. If you use BLAST or BLAT for this step, you will need to convert the output to SAM format.
In the alignment results, a target genome region may associate with multiple query gene sequences due to the existence of paralogous genes. To address the problem of such one-to-multi or multi-to-multi results in the query-to-target sequence alignments, we have prepared an optional script that can select sequence relations (1) that are prioritized by user-specified relationships between query and target genes by describing the relations in a text file and (2) highest alignment score, when query and target alignments are not reciprocally unique.
Next, the SAM file of the short reads mapped to the query species genome, using a short read alignment tool such as BWA or bowtie, are transformed according to the alignment relationships between the query and target genomes (Figure1C). For the transformation, we map the differences in the nucleotides between the two species, as calculated in the previous step, onto each read. To accomplish this task, we have prepared two scripts. The first script, “recot_combine.py”, generates temporary files that contain associations between the short read query species mapping positions and the corresponding positions of the target species. The second script, “recot_convert.py”, converts the mapping coordinates of the short reads to the query genome to the corresponding positions of the target genome and generates the transformed SAM file. Reads on the non-corresponding positions are regarded unmapped in the SAM file. The transformation of billions of reads by “recot_convert.py” may take several hours. To accelerate the process, recot_convert.py can run the transformation in parallel, using the job queueing system of Oracle (SUN) Grid Engine or Torque.
These results can be visualized using genome browsers that can handle SAM format, such as IGV. Supplementary Figure1 is a screen shot of IGV in which ChIP-seq experiments in D. melanogaster and D. simulans are compared. The result shows the conservation of binding sites, even though the region contains short deletions and several single nucleotide polymorphisms.
RECOT can be used for species in which the genome sequence is unavailable. For these species, we first generate gene sequences using a transcriptome assembly tool such as Trinity, and we regard these sequences as genome sequences (we regard each gene/contig as a chromosome). The subsequent procedures are the same as the steps in Figure1.
We developed a set of programs to transform the coordinates of short reads between two species. Our tools can visualize the differences in the genome regions onto which the short reads are mapped between two genomes. Our tools are useful for comparative genomics and comparative transcriptomics between model and non-model organisms.
Availability and requirements
Project name: RECOT project
Project home page: http://sesejun.github.com/recot/
Operating systems: UNIX, Mac OS X and Windows
Programming language: Python
Other requirements: Python 2.5 or higher
Any restrictions to use by non-academics: no licenses are required
The source codes along with user documentation are available at http://sesejun.github.com/recot/
This work was supported by Grant-in-Aid for Scientific Research on Innovative Areas (23126504) from the Ministry of Education, Culture, Sports, Science and Technology, Japan.
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63. 10.1038/nrg2484.PubMed CentralView ArticlePubMedGoogle Scholar
- Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007, 316: 1497-1502. 10.1126/science.1141319.View ArticlePubMedGoogle Scholar
- Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F, Hillman-Jackson J, Kuhn RM, Pedersen JS, Pohl A, Raney BJ, Rosenbloom KR, Siepel A, Smith KE, Sugnet CW, Sultan-Qurraie A, Thomas DJ, Trumbower H, Weber RJ, Weirauch M, Zweig AS, Haussler D, Kent WJ: The UCSC genome browser database: update 2006. Nucleic Acids Res. 2006, 34: D590-8. 10.1093/nar/gkj144.PubMed CentralView ArticlePubMedGoogle Scholar
- Wu TD, Watanabe CK: GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005, 21: 1859-1875. 10.1093/bioinformatics/bti310.View ArticlePubMedGoogle Scholar
- Altschul SF GISHW, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.View ArticlePubMedGoogle Scholar
- Kent WJ: BLAT–the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics. 2009, 25: 1754-10.1093/bioinformatics/btp324.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-10.1186/gb-2009-10-3-r25.PubMed CentralView ArticlePubMedGoogle Scholar
- Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP: Integrative genomics viewer. Nat Biotechnol. 2011, 29: 24-26. 10.1038/nbt.1754.PubMed CentralView ArticlePubMedGoogle Scholar
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011, 29: 644-652. 10.1038/nbt.1883.PubMed CentralView ArticlePubMedGoogle Scholar
- He Q, Bardet AF, Patton B, Purvis J, Johnston J, Paulson A, Gogol M, Stark A, Zeitlinger J: High conservation of transcription factor binding and evidence for combinatorial regulation across six Drosophila species. Nat Genet. 2011, 43: 414-420. 10.1038/ng.808.View ArticlePubMedGoogle Scholar
- Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, George RA, Lewis SE, Richards S, Ashburner M, Henderson SN, Sutton GG, Wortman JR, Yandell MD, Zhang Q, Chen LX, Brandon RC, Rogers YH, Blazej RG, Champe M, Pfeiffer BD, Wan KH, Doyle C, Baxter EG, Helt G, Nelson CR: The genome sequence of Drosophila melanogaster. Science. 2000, 287: 2185-2195. 10.1126/science.287.5461.2185.View ArticlePubMedGoogle Scholar
- Tamura K, Subramanian S, Kumar S: Temporal patterns of fruit fly (Drosophila) evolution revealed by mutation clocks. Mol Biol Evol. 2004, 21: 36-44.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.