Development of CombAlign
A new code, CombAlign, was developed using Python 2.6. CombAlign takes as input a set of pairwise structure-based sequence alignments and generates a one-to-many, gapped, multiple structure-based sequence alignment (MSSA, see Methods) whereby the user can readily identify regions on the reference structure that have residue-residue correspondences with each of the other proteins against which the reference was structurally aligned. Although the intent in developing CombAlign was to construct multiple-sequence alignments from structure data, the code is agnostic to the program that is used to generate pairwise alignments used as input. However, because structure-based alignments can reveal structural (and, hence, potential functional) differences between proteins that may not necessarily be revealed through sequence-based alignments, development of CombAlign was targeted toward facilitating construction of multiple alignments using formats produced by two common protein structure tools: TM-align [13] and DaliLite [14].
CombAlign comprises a script (combAlign.py) that reads in the fasta sequence of a reference protein followed by a series of pairwise alignments, then creates an alignment object (alignment.py), which is used to combine the alignments into an MSSA, and lastly prints the results to a file. The reference fasta is used as a framework for recording correspondences between residues of the reference structure and residues of each structure in the comparison set; a data structure captures each position/residue in the reference fasta and tags it with a list of corresponding residues, one residue from each aligned structure (or ‘null’ if the residue is absent or not aligned). Additionally, for each pairwise alignment, corresponding residues in the compared structure and positions of non-correspondence (gaps in the compared structure) are recorded; gaps that occur in the reference structure relative to the compared structure are inserted as null positions in a list attached to the preceding residue in the reference fasta sequence framework. Gap positions that occur in the reference structure relative to more than one compared structure are merged so as to avoid redundant gap insertion. The resulting one-to-many, gapped MSSA is formatted for output by dividing the reference fasta framework into segments corresponding to a user-provided or default line-size parameter and is printed to an output file. The correspondence data from the input pairwise alignments are reflected in the output MSSA. Symbols (‘-’, ‘:’, ‘.’, ‘|’, ‘ ’) used in CombAlign output have the same meaning as in the program used to generate the pairwise alignments, and generally indicate the degree to which the residues correspond. No other data provided by the pairwise alignment method (e.g., scoring, secondary structure prediction) are used by CombAlign.
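The combination step described above can be illustrated with a minimal Python sketch. This is not the CombAlign source: the function names (merge_pairwise, wrap_rows), the input representation (each pairwise alignment as a pair of equal-length gapped strings) and the default line size of 60 are assumptions made for illustration. In particular, gaps in the reference are merged here by taking the widest insertion at each position, which is one plausible reading of "merged so as to avoid redundant gap insertion".

```python
def merge_pairwise(reference, pairwise):
    """Combine pairwise alignments into one-to-many gapped rows.

    reference: ungapped reference sequence (hypothetical input format).
    pairwise:  list of (ref_row, other_row) gapped strings of equal length.
    Returns a list of strings: the gapped reference, then one row per structure.
    """
    n = len(reference)
    matched = []   # per structure: residue aligned to each reference position
    inserted = []  # per structure: residues that fall in a reference gap
    for ref_row, other_row in pairwise:
        if len(ref_row) != len(other_row):
            raise ValueError("alignment rows differ in length")
        if ref_row.replace("-", "") != reference:
            raise ValueError("alignment does not match the reference sequence")
        match = ["-"] * n      # '-' plays the role of 'null'
        ins = [""] * (n + 1)   # slot k: residues inserted after k reference residues
        pos = 0                # reference residues consumed so far
        for r, o in zip(ref_row, other_row):
            if r == "-":
                ins[pos] += o  # gap in the reference: attach to preceding residue
            else:
                match[pos] = o
                pos += 1
        matched.append(match)
        inserted.append(ins)

    # Assumed merge rule: one shared gap per slot, as wide as the longest insertion.
    widths = [max([len(ins[k]) for ins in inserted] or [0]) for k in range(n + 1)]

    rows = [[] for _ in range(len(pairwise) + 1)]
    for k in range(n + 1):
        if k > 0:
            rows[0].append(reference[k - 1])
            for j, match in enumerate(matched):
                rows[j + 1].append(match[k - 1])
        rows[0].append("-" * widths[k])
        for j, ins in enumerate(inserted):
            rows[j + 1].append(ins[k].ljust(widths[k], "-"))
    return ["".join(row) for row in rows]


def wrap_rows(rows, line_size=60):
    """Split the merged rows into blocks of at most line_size columns."""
    return [[row[start:start + line_size] for row in rows]
            for start in range(0, len(rows[0]), line_size)]


# Toy example: two structures aligned against the reference "ACDEF".
rows = merge_pairwise("ACDEF", [("AC-DEF", "ACGDE-"), ("AC--DEF", "ACXYD-F")])
for block in wrap_rows(rows, line_size=4):
    print("\n".join(block))
    print("")
```

In the toy example, both compared structures carry an insertion after the second reference residue; the sketch opens a single two-column gap in the reference at that position rather than one gap per structure, and pads the shorter insertion with ‘-’.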
Test case 1: One-to-many alignment of virus matrix proteins (VP40s)
The use and utility of CombAlign was demonstrated by generating a gapped MSSA using a structure model of the matrix protein (VP40) from Reston Ebolavirus (as the reference structure) and pairwise alignments between the reference and structure models of the VP40s from Bundibugyo, Sudan, Tai Forest, and Zaire Ebolaviruses and Marburg Marburgvirus (Fig. 1). The gapped MSSA revealed structure-based residue-residue correspondences between Reston Ebolavirus VP40 and each of the other VP40 proteins, which enabled identification of structurally similar versus differing regions in Reston compared to each of the closely related proteins.
In examining the MSSA (Fig. 1), it is apparent that the VP40 models are highly similar at the structure level. The most apparent differences are observed within the N- and C-terminal regions, and small interruptions in correspondence are seen between the Reston Ebolavirus protein and that of Marburg Marburgvirus. The PTAP/PPEY motifs (conserved in sequence among the Ebolaviruses but absent in the Marburgvirus protein) were disrupted in the pairwise structure alignments and were thereby also distributed among gaps in the CombAlign MSSA. A distinguishing feature of the Reston Ebolavirus protein in comparison to each of the other proteins was found to be the additional 5 residues at the extreme C-terminus (qnsyq), which are absent from all of the other VP40s. As this terminal region is believed to function in virus budding, the additional 5 residues in the Reston protein may have an adverse effect on VP40 function in this regard [9, 15, 16].
Test case 2: One-to-many alignment of ebolavirus Pre-small/secreted glycoproteins (sGPs)
A second test case involving structure-based comparison of Reston Ebolavirus sGP with the corresponding proteins from several other Ebolavirus species (Fig. 2) illustrates that combining structure-based alignments can reveal structural (and therefore potential functional) differences that might not be apparent using sequence-only methods (Fig. 3). The CombAlign alignment in Fig. 2 suggests that there may be considerable structural differences between sGP of Reston Ebolavirus and those of its pathogenic near neighbors in the N-terminal region, in the approximate center of the peptide chain, and in a large portion of the C-terminus, whereas the Clustal Omega [17] alignment depicted in Fig. 3 implies tight global and local correspondences between the residues of these proteins. Of particular note is the divergence seen at the C-terminus, which contains the delta peptide (Fig. 3, box). This region is perfectly aligned at the sequence level, yet displays poor structural homology when examined using structure tools. Corresponding MSSAs were constructed using CombAlign to determine whether any given Ebolavirus sGP (as the reference structure) displayed close structure homology to any other (data not shown), and none was found to align well to any other. This apparent poor structure homology may be due to disorder in this region of the protein. Nonetheless, the MSSA in Fig. 2 supports the use of CombAlign for detecting structural deviations in a protein of interest relative to its structural near neighbors. It has been postulated that the delta peptide may function either to prevent superinfection of producer cells during early stages of infection or to prevent trapping of budding progeny virus [11]. As the function of the delta peptide may be critical to pathogenicity or disease progression, it is interesting to note the apparent structural differences among the sGPs from the species depicted in Fig. 2, and based on this observation it would be reasonable to justify structure-function studies of these peptides in the context of their proposed functions.
Code availability and requirements
The CombAlign source code is available for download from the GitHub code archive. To access the code, one should first download and install the git client [18, 19]. The CombAlign project files can be cloned either using the GUI or, more simply, from the command line (once the software is installed, typing ‘git’ should display a help menu). CombAlign files can then be downloaded by entering “git clone https://github.com/carolzhou/Protein”. CombAlign was written in Python 2.6 and can be run on any desktop or server that supports Python. No specific processing requirements are indicated. A help menu is displayed by typing “python combAlign.py help”.