Shannon information content calculation
As in the method of Schneider and Stephens [5], the Shannon information content (IC), measured in bits, is calculated for each P linkage at each position i within the set of center-aligned sequences representing consensus matches and their flanking regions. IC is given by the following equation:
$$ I{C}_i={ \log}_2k-{H}_i $$
(1)
where k is the total number of possible symbols for each representation. For the nucleotide sequence, k = 4, accounting for the four nucleotides that can occur at each location. Since there are nine distinct TRX scores (see Table 1 in [15]), k = 9 when analyzing each matched region by its flexibility. $H_i$ in turn is defined as:
$$ {H}_i=-\sum_{s=1}^{k}{p}_{s,i}\,{\log}_2\left({p}_{s,i}\right) $$
(2)
where $p_{s,i}$ is the relative frequency of symbol s among all samples at position i. Because the single-nucleotide representation has a smaller maximum IC than the dinucleotide representation (log2(4) = 2.0 bits versus log2(9) ≈ 3.2 bits; two of the ten dinucleotide states share the same TRX score, leaving nine distinct values), all IC scores for P linkages were normalized to a maximum of 2.0 bits on the Y axis of the plots, as in standard sequence logo plots based on a four-letter alphabet.
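Equations 1 and 2 can be illustrated with a short Python sketch. It is independent of the R module and is not the authors' code: the function names are hypothetical, and the linear rescaling by 2.0 / log2(k) is only one plausible reading of "normalized to 2.0 bits".

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Entropy H_i in bits for one alignment column (Eq. 2)."""
    n = len(column)
    counts = Counter(column)
    # Unobserved symbols have p = 0 and contribute nothing to the sum.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_content(sequences, k, scale_to_bits=None):
    """IC_i = log2(k) - H_i at every position i (Eq. 1).

    sequences: equal-length, center-aligned rows of symbols
               (strings of nucleotides, or lists of TRX scores).
    k: number of possible symbols (4 for nucleotides, 9 for TRX scores).
    scale_to_bits: if given, rescale linearly so the maximum IC equals
                   this value (assumed form of the normalization).
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must all have the same length")
    max_ic = math.log2(k)
    result = []
    for i in range(length):
        column = [s[i] for s in sequences]
        ic = max_ic - shannon_entropy(column)
        if scale_to_bits is not None:
            ic *= scale_to_bits / max_ic
        result.append(ic)
    return result


# Nucleotide positions use k = 4; P linkages use k = 9 rescaled to 2.0 bits.
nucleotide_ic = information_content(["ACGT", "ACGA", "ACTA", "AGTA"], k=4)
linkage_ic = information_content([[43, 42], [43, 25]], k=9, scale_to_bits=2.0)
```

A fully conserved position reaches the maximum (2.0 bits in both cases after rescaling), while a position split evenly between two symbols loses exactly one bit before rescaling.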
The plotting module produces a traditional sequence logo plot with additional grayscale bars at the intervening phosphate linkage positions. The height of each bar represents the information content of the DNA backbone, and its shade represents the average intrinsic flexibility, from black (no flexibility, TRX = 0) to white (high flexibility, TRX = 43).
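The shading rule can be sketched in Python as follows. The text gives only the two endpoints (TRX = 0 is black, TRX = 43 is white), so the linear interpolation between them and the function name are assumptions, not the R module's actual implementation.

```python
TRX_MIN = 0.0   # black, per the text
TRX_MAX = 43.0  # white, per the text


def trx_to_gray(mean_trx):
    """Map an average TRX score to a gray level in [0, 1].

    0.0 is black and 1.0 is white; a linear ramp is assumed.
    """
    if not TRX_MIN <= mean_trx <= TRX_MAX:
        raise ValueError("TRX score outside the 0 to 43 range")
    return (mean_trx - TRX_MIN) / (TRX_MAX - TRX_MIN)


# Average flexibility of one P-linkage column, then its bar shade.
column_scores = [43, 42, 25, 0]
bar_shade = trx_to_gray(sum(column_scores) / len(column_scores))
```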
Components of the R graphics module and Supplementary downloadable files
We include three supplementary download files. Additional file 1 contains three folders (the TRX-logos R module, a Perl wrapper script for analyzing JASPAR database output, and a Perl script for general batch processing of position weight matrix (PWM) data files) as well as a README.txt file that introduces the package. The folder with the R module (“TRXlogosRmodule”) includes the following files.
README.txt and DESCRIPTION.txt - describe the implementation and usage
logos.R – main R source code for the graphics program
Working-Test.R – subroutine enforces conditions on input data and supplies warnings if format is not correct
get TRX.R and calcTRX.R – subroutines calculate DNA flexibility on DNA sequences
readPWM.R – subroutine returns position weight matrix regarding identity states of nucleobases
trxPWM.R – subroutine returns position weight matrix regarding dynamic states of phosphate linkage
calcTRXIC.R – subroutine returns Shannon information and average DNA flexibility at each position
The folder “PerlWrapperForJASPAR” contains two files.
trxLogos.init – a file where users enter full path of the R binary and the working directory
trxLogosWrapper.pl – the Perl code script for handling JASPAR output.
The folder “PerlScriptForBatchProcessing” contains only the Perl script BatchTRXlogos.pl. NOTE: full path to R binary and working directory are hard coded at the top of the script and must be modified to match your system.
Implementation of the R graphics code
For single plots, an R Shiny version of TRX-LOGOS is available at the following website (http://people.rit.edu/gabsbi/TRXlogos.html). TRX-LOGOS is also available as an R graphics module on the journal website (Additional file 1) and on SourceForge (http://sourceforge.net/projects/seqlogotrx/files/). A Perl script (BatchTRXlogos.pl) is also included for batch processing multiple files. This package expands on the seqLogo package from Bioconductor and was built with R version 2.15.2. To set up the package, place all R scripts in the same directory. Open the file “logos.R” in your RStudio console and source it. This will create a function in your workspace called “logos”. The function is called as follows:
logos(file, sourcefile, update = TRUE, adjust = TRUE)
file: The location of the sequence file. Sequence files have a very simple format; only the center aligned sequences are necessary. One sequence per line in the file.
sourcefile: The location of the directory where the helper scripts and logos.R were saved. This allows the program to find them and load them into your workspace. NOTE: Only needs to be supplied if update is TRUE.
update (default TRUE): Will attempt to install the seqLogo package and update the dependency tree. Will also try to source the helper R scripts to load them into the workspace. Not necessary unless the helper scripts are not in your workspace or you do not already have the seqLogo package. NOTE: If this is TRUE, sourcefile must be supplied; otherwise an error will be thrown.
adjust (default TRUE): After the graphic is produced, the program enters a small console in which the user can manually adjust some of the graphical parameters (bar start location, distance between bars, or both). The script then creates a new graphic with the custom parameters without recalculating, which makes adjustment faster. The default bar parameters work well with sequences of around 20 bp. If you are consistently plotting larger or smaller sequences, or are simply unhappy with the default parameters, they can be edited in the Working-Test.R file by changing the default values (start and increment only) in the function declaration. After changing these values, you will need to run with update = TRUE at least once so that the new source is inserted into the “seqLogo” package. Set adjust to FALSE when this console is not wanted, or if you intend to use this script inside a loop.
Implementation of the R module using the Perl wrapper script
A Perl wrapper script named trxLogosWrapper.pl is also provided in Additional file 1 and can be run by typing the following command at the command line:
perl trxLogosWrapper.pl [sequence file name] [filetype] [output file]
On a PC, this requires the free Community Edition of ActivePerl 5.16 or higher and a suitable text editor. We recommend ActiveState’s freeware version of the Komodo editor (Komodo Edit 8.0).
This script reads in a sequence file and constructs a TRX logo plot from all of the sequences in the file. It handles multiple file types.
[sequence file name] = name of the file to be read in. The full path is needed unless the file is in the same folder as this script.
[filetype] specifies what kind of format your sequence file is in.
Recognized values for [filetype]:
fasta = .fasta filetype. Will perform center alignment across entire sequence, so it assumes that any flanking sequences on each side of any consensus motif are of equal length
jaspr = .fasta files downloaded from the JASPAR database. In these files, the consensus motif is denoted by capital letters, and TRX-LOGOS performs center alignment on that capitalized motif.
Warning: any file name that contains a space must be passed in quotes: “[filename]”.
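The center alignment described for the jaspr file type can be sketched in Python as below. This is an illustration under stated assumptions, not the wrapper's actual logic: the function name is hypothetical, each sequence is assumed to contain one capitalized motif of the same length, and flanks are trimmed to the shortest flank on each side so that all aligned sequences end up the same length.

```python
import re


def center_align_on_motif(sequences):
    """Align sequences on their uppercase motif and trim flanks to a common length."""
    spans = []
    for seq in sequences:
        match = re.search(r"[A-Z]+", seq)  # first run of capital letters
        if match is None:
            raise ValueError(f"no uppercase motif found in {seq!r}")
        spans.append(match.span())

    motif_lengths = {end - start for start, end in spans}
    if len(motif_lengths) != 1:
        raise ValueError("motifs must all have the same length")

    # Shortest flank on each side decides how much context every sequence keeps.
    left = min(start for start, _ in spans)
    right = min(len(seq) - end for seq, (_, end) in zip(sequences, spans))

    return [seq[start - left:end + right]
            for seq, (start, end) in zip(sequences, spans)]


aligned = center_align_on_motif(["acGTCAtt", "tacGTCAt", "cGTCAtta"])
```

Trimming rather than padding keeps gap symbols out of the aligned columns, so every position has the same number of observations when frequencies are counted.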