Pspace: a program that assesses protein space

Background: We describe a computer program named Pspace designed to a) obtain a reliable basis for the description of three-dimensional structures of a given protein family using homology modeling through selection of an optimal subset of the protein family whose structure would be determined experimentally; and b) aid in the search of orthologs by matching two sets of sequences in three different ways.
Methods: The prioritization is established dynamically as new sequences and new structures become available, by ranking proteins according to their value in providing structural information about the rest of the family set. The matching can give a list of potential orthologs or it can deduce an overall optimal matching of two sets of sequences.
Results: The various covering strategies and ortholog searches are tested on the bromodomain family.
Conclusion: The possibility of extending this approach to the space of all proteins is discussed.


Background
Recent advances in comparative structural modeling have demonstrated that reasonably reliable model structures can be built for proteins that share greater than 20-25% sequence similarity to their template proteins whose structures are determined experimentally [1]. As a consequence, Vitkup and colleagues [2] have discussed the possible approaches to obtain three-dimensional structural information of all known proteins based on structures of a judiciously chosen subset of proteins from all protein families that would be experimentally determined.
This approach raises the question of how one can identify and select structures that are representative of large protein families. In general, target selection for structural genomics is governed by the principle of trying to maximize the information from a selected target [3]. To this effect a graph can be constructed whose vertices are proteins and whose edges are placed between vertices whenever the sequence similarity between them is such that comparative modeling can provide a model of adequate accuracy. A set of structures from which models can be generated for all proteins forming the graph corresponds to a vertex set with the property that every other vertex in the graph is connected by an edge to a member of this vertex set. Such a set is called a dominating set; an example is shown in Figure 1a. While the determination of the smallest dominating set is considered among the so-called 'hard' problems in computer science [4], it has been suggested that the so-called greedy algorithm (that keeps picking the node with the most neighbors) is nearly optimal [5]. On the graph shown in Figure 1b, the greedy algorithm yields a dominating set that consists of a single vertex, clearly an optimal choice. For the problem of covering the whole protein space, it has been shown that the greedy algorithm is two to three-fold more efficient than selecting structures randomly and in an uncoordinated fashion [2].
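The greedy selection described above can be illustrated with a minimal sketch. It assumes the graph is given as a dictionary mapping each protein to the set of its neighbors, and it uses the common variant of the greedy rule that counts only neighbors not yet covered; the function name and the example graph are hypothetical and are not taken from Pspace.

```python
def greedy_dominating_set(adjacency):
    """Pick vertices greedily until every vertex is chosen or adjacent to a chosen one."""
    uncovered = set(adjacency)
    chosen = []
    while uncovered:
        # Gain of a vertex: how many still-uncovered vertices it would cover
        # (itself plus its neighbours). Ties are broken by sorted order.
        best = max(
            sorted(adjacency),
            key=lambda v: len(({v} | adjacency[v]) & uncovered),
        )
        chosen.append(best)
        uncovered -= {best} | adjacency[best]
    return chosen


# Hypothetical star-shaped family: "hub" is similar enough to every other member.
star = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub"},
}
dominating = greedy_dominating_set(star)
```

For the star-shaped example a single structure (the hub) covers the whole family, which is the situation in which the greedy choice is clearly optimal.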
This situation is, however, complicated by the fact that new proteins are continuously added to the set of unknowns, and that target selection is not guided exclusively by the aim of optimizing the coverage of a protein space. As such, there is no practical way to realize the greedy algorithm. However, if the proteins with no experimental structures are given a weight representing their importance for the process of covering the protein space, then the approach inherent in the greedy algorithm can be reformulated as selecting proteins with a probability proportional to their weight. Such an approach may yield an efficiency that is close to that of the greedy algorithm while providing additional flexibility in the target selection for experimental structure determination. This weighting is the conceptual basis for Pspace, a new computer program that we describe in this study.
In addition to providing a guide for the coverage of a protein space, Pspace also has a facility for the search of ortholog candidates.

Strategies for covering a protein space
As discussed above, there may be other algorithms besides the greedy algorithm that afford more freedom without significantly lowering the efficiency of coverage. We implemented in Pspace four different algorithms to select proteins for structure determination in order to be able to assess their relative efficiency. Upon selection of a protein for structure determination, each proceeds by updating the weights that represent the respective information content of the rest, and stops when only structures with zero weight remain:
1. Greedy and coordinated: Determine structures of proteins with the highest weights in the set U.
2. Stochastic and coordinated: Determine structures of proteins from the set U with a probability proportional to a weight associated with each protein.
3. Random and coordinated: Determine structures of proteins from the set U with uniform probability considering only proteins whose weight is positive.
4. Random and uncoordinated: Determine structures of proteins from the set U with uniform probability considering all proteins in the set U.
Vitkup et al [2] showed that for the entire protein space the random and uncoordinated approach requires 2-3 times more structure determination than the greedy algorithm.
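The four strategies can be compared with a small simulation. The sketch below is an illustration under stated assumptions, not the Pspace implementation: the weight of a protein is taken to be the number of residues in the not yet covered proteins it would cover (itself included), the neighbor relation is assumed symmetric, and the names `count_structures`, `neighbours`, `length` and `power` are hypothetical.

```python
import random


def count_structures(neighbours, length, strategy, power=1.0, seed=0):
    """Count the structure determinations needed until all weights are zero."""
    rng = random.Random(seed)
    covered = set()              # proteins solved or within reach of a solved one
    unsolved = set(neighbours)   # the set U
    solved = 0

    def weight(k):
        # Residues newly covered if the structure of k were determined now.
        return sum(length[i] for i in ({k} | neighbours[k]) - covered)

    while True:
        weights = {k: weight(k) for k in sorted(unsolved)}
        positive = [k for k, w in weights.items() if w > 0]
        if not positive:
            return solved
        if strategy == "greedy":
            pick = max(positive, key=weights.get)
        elif strategy == "stochastic":
            pick = rng.choices(positive, [weights[k] ** power for k in positive])[0]
        elif strategy == "random_coordinated":
            pick = rng.choice(positive)
        elif strategy == "random_uncoordinated":
            pick = rng.choice(sorted(unsolved))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        unsolved.discard(pick)
        covered |= {pick} | neighbours[pick]
        solved += 1


# Hypothetical family: one hub with three neighbours and one isolated protein.
neighbours = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "lone": set(),
}
length = {name: 110 for name in neighbours}

greedy_count = count_structures(neighbours, length, "greedy")
stochastic_counts = [
    count_structures(neighbours, length, "stochastic", power=2.0, seed=s)
    for s in range(10)
]
```

Raising the weights to a higher `power` makes the stochastic strategy behave more like the greedy one, at the cost of some of the flexibility it is meant to provide.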

General formalism for calculation of structural information
At any given time, let $P$ be the set of all proteins with known sequences, $D$ be the set of all proteins with known sequences and structures, and $U = P \setminus D$, the set of proteins with known sequence but unknown structure. For a protein $i$, let us define its 'sequence vicinity', $V_H(i)$, as the set of proteins with unknown structure that are close to $i$ in the protein space, i.e., their measure of similarity reaches a threshold:

$$V_H(i) = \{\, j \in U : h_{ij} \ge H \,\} \qquad (1)$$

where $h_{ij}$ is a similarity measure in the protein space (e.g., percent of sequence identity) and $H$ is the threshold value below which structure determination with comparative modeling is considered unreliable (see Figure 2). For the sequence identity as a measure, 30% has been suggested as a reasonable choice [2].
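As an illustration of the sequence vicinity, the following minimal sketch collects the proteins of unknown structure whose similarity to a given protein reaches the threshold. The storage of similarities in a dictionary keyed by unordered pairs, the exclusion of the protein itself, and all names and values are assumptions made for the example.

```python
def sequence_vicinity(i, similarity, unknown, threshold=30.0):
    """Return the members of `unknown` whose similarity to i reaches the threshold."""
    return {
        j
        for j in unknown
        if j != i and similarity.get(frozenset((i, j)), 0.0) >= threshold
    }


# Hypothetical percent identities between three proteins.
similarity = {
    frozenset(("A", "B")): 45.0,
    frozenset(("A", "C")): 28.0,
    frozenset(("B", "C")): 33.0,
}
unknown = {"B", "C"}  # proteins with known sequence but unknown structure
vicinity_of_a = sequence_vicinity("A", similarity, unknown)
```

With the 30% threshold only B falls in the vicinity of A; C, at 28% identity, is considered too distant for reliable comparative modeling.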
Assume further that there is a function $I(i, S)$ giving the amount of information one can gain about the structure of the protein $i$ from knowledge of the structures of the proteins in the set $S$. Then the utility of determining the structure of a protein $k$, $k \in U$, is the total amount of information gained when protein $k$ is added to the set $D$. This utility can be expressed as:

$$\Delta I(k) = \sum_{i \in U} \left[ I(i, D \cup \{k\}) - I(i, D) \right] \qquad (2)$$

Thus, given $P$, $D$, the vicinities $V_H(i)$ for each sequence $i$, and assuming a reasonable form for $I(i, S)$, $\Delta I(k)$ can be calculated and used as a measure of the importance of obtaining the structure of protein $k$.

Figure 1. Examples of dominating sets in a graph. Vertices covered by black discs form the dominating set.
Clearly, for $k \in D$, $\Delta I(k) = 0$. Furthermore, the paradigm put forth above suggests that the knowledge of a structure $i$ provides information about the structure of all proteins in its sequence vicinity $V_H(i)$. Also, the amount of information will be assumed to be proportional to the number of residues of the protein(s) whose structure can be obtained by homology modeling. Formally, this can be written as

$$I_0(k, S) = \begin{cases} nr_k & \text{if } h_{k,S} \ge H \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

where $nr_k$ is the number of residues in $k$ and

$$h_{k,S} = \max_{i \in S} h_{ik},$$

i.e., the similarity between $k$ and the protein in the set $S$ most similar to it.
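The utility of a candidate protein can be computed directly from these definitions. The sketch below assumes a step-function form of the information measure (the full residue count if the most similar known structure reaches the threshold, zero otherwise) and sums the gain over all proteins of unknown structure; the function names and the example data are hypothetical.

```python
def best_similarity(k, structures, similarity):
    """Similarity between k and the most similar protein in `structures` (0 if empty)."""
    return max(
        (100.0 if i == k else similarity.get(frozenset((i, k)), 0.0) for i in structures),
        default=0.0,
    )


def info_zeroth(k, structures, similarity, length, threshold):
    """Residues of k that can be modelled from `structures` (all or nothing)."""
    return length[k] if best_similarity(k, structures, similarity) >= threshold else 0


def utility(k, unknown, known, similarity, length, threshold=30.0):
    """Information gained over all unknown proteins when k is added to `known`."""
    extended = known | {k}
    return sum(
        info_zeroth(i, extended, similarity, length, threshold)
        - info_zeroth(i, known, similarity, length, threshold)
        for i in unknown
    )


# Hypothetical family of three proteins, none with a known structure yet.
similarity = {
    frozenset(("A", "B")): 45.0,
    frozenset(("A", "C")): 28.0,
    frozenset(("B", "C")): 33.0,
}
length = {"A": 100, "B": 120, "C": 80}
unknown = {"A", "B", "C"}
utilities = {k: utility(k, unknown, set(), similarity, length) for k in sorted(unknown)}
```

In this example B has the highest utility because its structure would allow modeling of both A and C, so a greedy strategy would select it first.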

Establishing the similarity measure h ij
The two most common measures of sequence similarity between two proteins are the percent of identical residues and the alignment score, the latter being a function of the similarity matrix and gap penalties used in the alignment. The alignment score can be normalized by the maximum possible score to make it commensurate with the percent identity. Pspace considers both measures in a combined manner:

$$I_0(k, S) = \min\left[ I_0^{p}(k, S),\; I_0^{s}(k, S) \right]$$

where the superscripts $p$ and $s$ refer to using the percent identity or the similarity score, respectively, in Equation 3. The percent identity between two sequences is calculated relative to the number of residues in the shorter sequence.
In this treatment, I 0 (k, S) will be nonzero only when both measures fall above their respective thresholds. Recent work quantified the accuracy of homology models [6] as a function of sequence similarity and their result can be used in selecting the threshold values used by Pspace.
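The two measures and their combined use can be sketched as follows. The example assumes a pairwise alignment given as two gapped strings of equal length with '-' as the gap character, takes the maximum possible score as an input rather than computing it, and uses thresholds of 25% identity and 40% score percentage purely as example values; all function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues relative to the shorter ungapped sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    shorter = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    return 100.0 * identical / shorter


def score_percentage(score, max_score):
    """Alignment score as a percentage of the maximum possible score."""
    return 100.0 * score / max_score


def passes_both(identity, score_pct, min_identity=25.0, min_score_pct=40.0):
    """True only when both measures reach their respective thresholds."""
    return identity >= min_identity and score_pct >= min_score_pct


# Hypothetical alignment: four identical columns, six residues in each sequence.
identity = percent_identity("ACDE-GH", "ACDQFG-")
usable = passes_both(identity, score_percentage(22.0, 40.0))
```

Requiring both measures to pass is the conservative choice: a pair with high identity over a short stretch but a poor overall score would not count as a usable template.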
The measures discussed above treat all residues equally. Proteins, however, usually have a selected set of residues that are directly involved in the protein's function. For any single protein, the residues forming the binding site may be known. For a family of proteins, conserved residues are usually assumed to have special roles. It is thus reasonable to assume that for such residues a higher level of similarity is required than for the rest. Pspace allows the specification of such a selected set with corresponding thresholds that can be different from the thresholds used for the rest of the residues.

Higher order approximations to the utility function I(k, S)
The zeroth order approximation to $I(i, S)$ described by Equation 3 above is based on a discretized representation of sequence similarity. In general, however, the amount and reliability of information regarding the structure of protein $k$ that can be obtained from the knowledge of a set of proteins $S$ is a continuous function of the extent of similarities between $k$ and the members of the set $S$, $\{h_{i,k} \mid i = 1, \ldots, |S|\}$ ($|S|$ is the number of elements in the set $S$).
Besides increased sequence similarity, the reliability of a homology model for k derived from the set S increases with the number of proteins with significant similarity to the protein k in question. Since all these effects are ignored in writing Equations 3 and 4 we present here two more general forms of I(i, S).
At the next level of approximation the step function used for each measure could be replaced by a sigmoidal function $p(h)$ multiplied by the number of residues $nr_k$:

$$I_1(k, S) = p(h)\, nr_k$$

where $h$ is the similarity between $k$ and the protein in $S$ most similar to it, and $p(h)$ is zero below the threshold value, gets close to one for $h > H$ and reaches one at the measure of perfect similarity (i.e., identity). Its actual form can be established by the study of a large set of models whose accuracy is reasonably well known; the work of Chakravarty et al. will be useful for this purpose as well [6].
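One possible shape for such a function is sketched below. The rescaled hyperbolic tangent, the steepness parameter and the default threshold are assumptions chosen only to satisfy the stated properties (zero below the threshold, rising toward one above it, exactly one at identity); the actual form would have to be fitted to data on model accuracy.

```python
import math


def reliability(h, threshold=30.0, steepness=15.0, perfect=100.0):
    """Illustrative p(h): 0 below the threshold, 1 at perfect similarity."""
    if h < threshold:
        return 0.0
    # Rescale so that the curve reaches exactly 1 at h == perfect.
    return math.tanh((h - threshold) / steepness) / math.tanh(
        (perfect - threshold) / steepness
    )


def info_first_order(h_best, n_residues, threshold=30.0):
    """Information about a protein given its similarity to the closest structure."""
    return reliability(h_best, threshold) * n_residues


partial = info_first_order(50.0, 120)   # moderately similar template
full = info_first_order(100.0, 120)     # identical template
```

Unlike the step function, this form credits a template at 50% similarity with only part of the residue count, so a closer template would still carry positive utility.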
Figure 2. Schematic of the relation between the sets $P$, $U$, $D$, and $V_H(i)$.
When multiple measures are used, as in Pspace, then $I_1(k, S)$ (as well as the further generalizations described below) can be obtained as a weighted average:

$$I_1(k, S) = \sum_{m} w_m\, I_1^{m}(k, S)$$

where the superscript $m$ indicates one of the measures and the weights $w_m$ sum to one.
The effect of using more than one structure for the homology-based estimation of the structure of the protein $k$ can be incorporated in the first approximation by multiplying $I_1(k, S)$ with a function $f_N(|S|, p)$ representing the additional information that the multiple reference structures provide, i.e., by using the product $f_N(|S|, p)\, I_1(k, S)$. Clearly, $f_N(1, p) = 1$. Furthermore, $f_N(|S|, p)$ should be monotonically increasing as a function of $|S|$ but level off at a value under $1/p$ when $|S|$ reaches the number of structures that was found to be sufficient to accurately determine an unknown protein's structure.
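At higher levels of approximation, instead of relying on just the homology to the nearest structure, a weighted sum of all $p(h_{i,k})\, nr_{i,k}$ could be used, i.e., a sum of the form $\sum_{i \in S} w_i\, p(h_{i,k})\, nr_{i,k}$, where the $w_i$ are the weights.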

Updating I(k, S) with new sequences and/or structures
When a sequence k is added to the set U (i.e., the structure is unknown) then we need to calculate the amount of information its structure would yield, ∆I(k). This calculation requires the determination of its neighborhood. When the structure corresponding to a known sequence is determined (i.e., moved from the set U to the set D) then all of the ∆I(k) values of sequences in its vicinity have to be updated. Since the set U is in general quite large, the algorithms for these updates have to be considered carefully.
Adding a new sequence to U requires the alignment of this sequence with all members of the sets U and D. Given the large number of sequences already determined and its nearly exponential growth this is in itself a major task. It has been addressed by several groups. Most recently, a database of similarity scores of all known proteins was made available that is now continually being updated as new sequences are being determined [7].
However, for our purpose the results should only be stored for those protein pairs that are within the threshold of utility. The alignment results will give directly $I(i, S)$ for the new protein $i$. We have to update the $I(j, S)$ values only for those proteins $j$ in the set $U$ whose sequence similarity to $i$ is high enough that adding $i$ to their neighborhood will increase their value. This results in a limited number of updates. Finally, the sum of the weights derived from the $I(i, S)$ values also has to be updated if the weights are to be turned into probabilities for the sampling algorithm.
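The bookkeeping described above can be sketched with a small class that stores only the pairs reaching the threshold and reports which weights need recomputation after a structure is determined. This is an illustration under assumptions, not the data structure used by Pspace: the class and method names are hypothetical, the weight is the residue count of the not yet covered proteins within reach, and the similarities of a new sequence to the existing ones are assumed to be supplied by an external alignment step.

```python
class CoverageIndex:
    """Keeps neighbour lists and coverage so that updates stay local."""

    def __init__(self, threshold=30.0):
        self.threshold = threshold
        self.length = {}
        self.neighbours = {}   # only pairs at or above the threshold are stored
        self.covered = set()   # solved, or within reach of a solved protein
        self.solved = set()    # the set D

    def add_sequence(self, name, n_residues, similarity_to):
        """Register a new sequence; `similarity_to` maps existing names to similarity."""
        self.length[name] = n_residues
        self.neighbours[name] = set()
        for other, h in similarity_to.items():
            if other != name and other in self.neighbours and h >= self.threshold:
                self.neighbours[name].add(other)
                self.neighbours[other].add(name)
                if other in self.solved:
                    self.covered.add(name)

    def weight(self, k):
        """Residues newly covered if the structure of k were determined."""
        if k in self.solved:
            return 0
        return sum(self.length[i] for i in ({k} | self.neighbours[k]) - self.covered)

    def determine_structure(self, k):
        """Move k to the solved set; return proteins whose weight may have changed."""
        newly = ({k} | self.neighbours[k]) - self.covered
        self.solved.add(k)
        self.covered |= newly
        affected = set()
        for i in newly:
            affected |= {i} | self.neighbours[i]
        return affected - self.solved


index = CoverageIndex(threshold=30.0)
index.add_sequence("A", 100, {})
index.add_sequence("B", 120, {"A": 45.0})
index.add_sequence("C", 80, {"A": 28.0, "B": 33.0})
index.add_sequence("E", 50, {"A": 12.0, "B": 10.0, "C": 15.0})
weight_before = index.weight("B")
to_update = index.determine_structure("B")
```

Only the proteins returned by `determine_structure` need their weights recomputed; the isolated protein E is untouched, which is what keeps the number of updates limited.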

Ortholog search
When proteins (or clusters of proteins) in two sets representing proteins in two organisms can be paired by mutual relation of maximum similarity, then the determination of orthologs is straightforward [8]. Pspace, however, is also prepared to treat cases where such a pairing is not possible: it establishes a match between the two sets with the best overall similarity. For this calculation, the so-called Hungarian method of graph theory [9] is used, which establishes the match between two sets that maximizes the overall similarity. In any event, the result of such matching needs further verification based on the biological roles of the proteins matched. The methods for detecting orthologs in distant families (where the sequence similarity of orthologs can be quite low) have recently been reviewed by Wan and Xu [10].
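The difference between pairing by mutual maximum similarity and an overall optimal matching can be shown on a toy example. Note that the sketch below finds the optimal matching by exhaustive enumeration, which is a stand-in that gives the same optimum as the Hungarian method only at a cost that is prohibitive beyond very small sets; the function names and scores are hypothetical.

```python
from itertools import permutations


def mutual_best_hits(score):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    pairs = []
    for a, row in score.items():
        b = max(row, key=row.get)
        if max(score, key=lambda x: score[x][b]) == a:
            pairs.append((a, b))
    return pairs


def optimal_matching(score):
    """Assignment of each row to a distinct column with maximal total score."""
    rows = list(score)
    cols = list(next(iter(score.values())))
    if len(rows) > len(cols):
        raise ValueError("the first set must not be larger than the second")
    best = max(
        permutations(cols, len(rows)),
        key=lambda p: sum(score[r][c] for r, c in zip(rows, p)),
    )
    return dict(zip(rows, best))


# Hypothetical similarity scores between two rat and two mouse sequences.
score = {
    "rat1": {"mouse1": 90, "mouse2": 85},
    "rat2": {"mouse1": 88, "mouse2": 40},
}
reciprocal = mutual_best_hits(score)
matching = optimal_matching(score)
```

Here mutual best hits pair only rat1 with mouse1 and leave rat2 unmatched, whereas the overall optimum assigns rat1 to mouse2 and rat2 to mouse1. For sets of realistic size a polynomial-time assignment solver, such as an implementation of the Hungarian method, would be needed.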

Comparison of coverage strategies
Pspace was tested on the sequences in the bromodomain family, as extracted from the SmartEMBL [11] database. Specifically, we selected the bromodomains from proteins in yeast, rat, mouse and humans. For protein alignments we used the PAM-120 scoring matrix, extracted from the database AAindex, Version 3.0 [12,13]. The initial gap penalty was set to 12 and the gap extension penalty was set to 1. The distributions of the percent identities and alignment score percentages in the human bromodomain set, as calculated by Pspace, are shown in Figure 3.
The input sequences were checked for redundancy by clustering at near-identity level, using a hierarchical clustering based on minimal cluster member distance [14]. This resulted in a significant reduction in the number of sequences. We also used the bromodomains of these four sets (after clustering at the 99% level) to compare the four strategies for the covering of the set. After having aligned the sequences, we determined the percent identity and alignment score percentage. For each protein we calculated $I_0(k, S)$, using uniformly 25% for the minimum percent identity and 40% for the minimum similarity score. Different powers of $I_0(k, S)$ were used for the selection weight. Also, the various random and stochastic strategies were executed 10 times with different random number seeds. Table 1 provides the results for the different strategies.
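The redundancy check mentioned at the start of this paragraph can be sketched as follows. Cutting a hierarchical clustering based on minimal cluster member distance (single linkage) at a fixed level is equivalent to finding the connected components of the graph that links all pairs at or above that level, which is what this union-find sketch does; the function name, the input layout and the identities are hypothetical.

```python
def cluster_near_identical(names, identity, cutoff=99.0):
    """Single-linkage clusters: sequences linked by any chain of pairs >= cutoff."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), value in identity.items():
        if value >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical percent identities between four sequences.
identity = {
    ("a", "b"): 99.5,
    ("b", "c"): 99.2,
    ("a", "c"): 97.0,
    ("c", "d"): 60.0,
}
clusters = cluster_near_identical(["a", "b", "c", "d"], identity)
```

Sequences a and c end up in the same cluster although their direct identity is below the cutoff, because both are near-identical to b; this chaining is characteristic of single linkage.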
These results clearly show that the stochastic and coordinated approach performs close to the greedy algorithm if the weights are 'strong' enough. It is also clear that the large gain in the number of structures needed comes from the coordination. This emphasizes the fact that the most important step in reducing the number of structure determinations is the coordination of efforts.

Test of the ortholog searches
The search for orthologs was run to match the rat and mouse bromodomains. The matched sequences are shown in Table 2. Most matches found are indeed orthologs. There are several sequences in the set of rat bromodomains whose function is unknown; thus the matches found may be considered a prediction of their function. Note also that for the matches that turned out not to be orthologs, no true orthologs were represented in this sequence database.
We also compared the yeast and human sets that remained after clustering at the 99% level. While the 14 yeast bromodomains were unequivocally matched to counterparts in the human bromodomain set, none of the matches found corresponded to actual orthologs. This is not surprising since yeast and humans are very distant in the evolutionary tree. This raises the question of how close the score between true orthologs is to the score of the best match. This can be tested by asking Pspace to list for each yeast bromodomain all human bromodomains whose matching score is within a certain percent of the best score. For example, the score of the human ortholog of GCN5 is within 5% of the best score, but so are 6 other bromodomains.
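Listing all candidates whose score lies within a given percent of the best score can be sketched in a few lines. The function name, the assumption of positive scores and the example values are hypothetical and do not reproduce the bromodomain results.

```python
def near_best_matches(scores, tolerance_percent=5.0):
    """Names whose score is within `tolerance_percent` of the best, best first."""
    best = max(scores.values())
    cutoff = best * (1.0 - tolerance_percent / 100.0)
    return sorted(
        (name for name, s in scores.items() if s >= cutoff),
        key=lambda name: -scores[name],
    )


# Hypothetical matching scores of one query against four candidates.
scores = {"h1": 200.0, "h2": 195.0, "h3": 191.0, "h4": 150.0}
candidates = near_best_matches(scores, tolerance_percent=5.0)
```

A long candidate list for a tight tolerance indicates that the best match alone is not a reliable ortholog prediction and that additional biological evidence is needed.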

Potential scope
Since structures of individual domains of complex multidomain proteins are often determined separately (as is the case for the bromodomains discussed here), the current implementation of Pspace is best suited for the treatment of a specific family or a limited set of families. However, the concept of dynamically assigning a weight to proteins with unknown structure is applicable to the space of all proteins and can be a valuable help in selecting proteins for structure determination. This can be achieved by implementing the calculation of these weights on a web-based server. Since the results of this study clearly showed that the major gain in the efficiency of covering a protein space comes from coordination, the effect of creating such a server would be a significant gain in the efficiency of covering the protein space. Note that an effort of this scale has already been undertaken: the SIMAP server [15] provides the dynamically updated similarity matrix of all known protein sequences.

Figure 3. Full line: distribution of the pair-wise percent identity for the sequences in the human bromodomain set; dotted line: distribution of the pair-wise score percentage (the score for a perfect match represents 100%) for the sequences in the human bromodomain set.

tion is only necessary for the proteins in the set D since the knowledge of the structure allows, reasonably, reliable establishment of domain boundaries.

Conclusion
Pspace is available at the URL http://inka.mssm.edu/ mezei/pspace. The distribution includes the source code, the matrices of the AAindex database and the (HTML) documentation. A list of its currently implemented functions is given in the Appendix.