One concern with this method is the accuracy of the network topology. We need to take into account the false positives and false negatives among the experimentally determined protein-protein interactions. Incorrect input data degrade prediction accuracy, and we expect better predictions as the quality of the input data improves.

Another determinant of the method's prediction accuracy is the granularity of the classification. The more coarse-grained the classification, the fewer degrees of freedom there are in the network, and therefore the more likely we are to make a correct prediction. On the other hand, to gain more insight from the predictions, we would like the method to predict protein functions at a more fine-grained level.

At this point, we use the first-level GO (Gene Ontology) classification for our predictions, based on the information available from the protein function catalogue. In the future, as more protein annotation and protein-protein interaction data become available, we will apply our prediction system at different levels of classification to find the level at which the predictions are both meaningfully accurate and most insightful for further biological studies.

Certainly, when we predict the functions of uncharacterized proteins, the ultimate way to validate the predictions is to perform biological experiments. However, the predictions produced by computational methods give us a good starting point for experimental exploration and can hence reduce the amount of bench work needed. Instead of making wild guesses about an uncharacterized protein's functions, we can form educated hypotheses and perform experiments to test them. The clustering information can also give us insight into biological pathways, because proteins functioning in the same pathway tend to interact with each other and fall into the same cluster. The global optimization clustering method can therefore help us either to better understand some pathways or to find the missing pieces in them.

The combinatorial optimization tools we use here can readily be applied to larger data sets. Given that the Concorde program has solved a 24,978-city TSP instance to optimality [4], we can expect it to solve the TSP instances obtained from the protein-protein interaction matrices of most organisms. When adequate protein-protein interaction information becomes available for other organisms, we can use the same methodology to predict protein functions and biological pathways for those organisms.

Our TSP-solver-based clustering performs better than traditional gene clustering algorithms, such as hierarchical clustering or nearest-neighbor tree clustering, because the latter are essentially greedy bottom-up algorithms: they progressively combine the most similar nodes or clusters at each step until a tree of clusters is built [12]. Greedy algorithms make the locally best decision at each step and are therefore likely to face very costly moves at later stages. For this reason, they tend to produce sub-optimal solutions, especially for larger problems. In contrast, our approach uses the TSP solver Concorde to find the globally optimal solution to the TSP equivalent of our clustering problem. Because it aims at global optimization, our method works better, especially when there are thousands of nodes (proteins) to be clustered.
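The contrast between locally best and globally best decisions can be illustrated on a toy ordering problem. The sketch below is only an illustration under stated assumptions: a nearest-neighbor ordering stands in for a greedy strategy, an exhaustive search stands in for an exact solver such as Concorde (which is not called here), and the four one-dimensional positions and all function names are hypothetical.

```python
from itertools import permutations

def path_cost(order, dist):
    """Sum of dissimilarities between consecutive items in an ordering."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))

def greedy_order(dist, start=0):
    """Nearest-neighbor heuristic: always step to the closest unvisited item."""
    order = [start]
    unvisited = set(range(len(dist))) - {start}
    while unvisited:
        last = order[-1]
        # Ties are broken by the smaller index so the result is deterministic.
        nxt = min(unvisited, key=lambda j: (dist[last][j], j))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

def optimal_order(dist):
    """Exhaustive search over all orderings; feasible only for tiny inputs."""
    return list(min(permutations(range(len(dist))),
                    key=lambda p: path_cost(p, dist)))

# Hypothetical toy data: four items placed on a line.
positions = [5, 4, 7, 0]
dist = [[abs(a - b) for b in positions] for a in positions]

greedy = greedy_order(dist)
best = optimal_order(dist)
print("greedy cost:", path_cost(greedy, dist))   # 11
print("optimal cost:", path_cost(best, dist))    # 7
```

Starting from the item at position 5, the greedy ordering takes two cheap steps and is then forced into a single expensive one, whereas the exhaustive search finds the ordering that simply sweeps along the line.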

Most recently, Climer and Zhang have proposed the TSPCluster algorithm [12], an improved TSP-based approach to optimal rearrangement clustering. Their algorithm produces optimal solutions when the number of clusters (k) is known in advance and the goal is to place the cluster borders optimally [12]. The rationale is that if we know the number of clusters beforehand, we can use that information, introduce dummy nodes, and modify the objective function to minimize the total intra-cluster dissimilarity while tolerating large inter-cluster dissimilarity [13]. Their algorithm works better in situations where we know in advance the range of k values of interest and can try a few values in that range. An example of such a situation is determining the locations of a few distribution centers based on population clustering [12]. In more exploratory settings such as ours, where we do not know how many clusters the proteins will form based on their interaction information, it is better to use our algorithm to cluster the proteins globally and to use that clustering information for protein function prediction. After we have performed the study with our algorithm and found a viable number of clusters k, we can further apply the TSPCluster algorithm with that k value and some nearby values to refine the clustering.
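The dummy-node idea can be sketched on a toy example. The code below is not a reproduction of TSPCluster; it is a minimal illustration assuming that each dummy node has zero dissimilarity to every real node, that a large penalty (the hypothetical constant `BIG`) keeps two dummy nodes from being adjacent, and that a brute-force search replaces the TSP solver. In the resulting tour, the dummy nodes mark the cluster borders, so only intra-cluster dissimilarities contribute to the cost.

```python
from itertools import permutations

BIG = 10**6  # assumed penalty that keeps two dummy nodes from being adjacent

def extend_with_dummies(dist, k):
    """Append k dummy nodes: zero cost to real nodes, BIG cost to each other."""
    n = len(dist)
    size = n + k
    ext = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue
            if i < n and j < n:
                ext[i][j] = dist[i][j]      # real-to-real dissimilarity
            elif i >= n and j >= n:
                ext[i][j] = BIG             # dummy-to-dummy penalty
            # real-to-dummy edges stay at 0
    return ext

def tour_cost(tour, ext):
    """Cost of the closed tour, including the edge back to the start."""
    return sum(ext[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

def cluster_by_tour(dist, k):
    """Find the cheapest tour by brute force and split it at the dummy nodes."""
    n = len(dist)
    ext = extend_with_dummies(dist, k)
    first = n  # fix one dummy as the start to skip rotated duplicates
    rest = [v for v in range(n + k) if v != first]
    best = min(([first] + list(p) for p in permutations(rest)),
               key=lambda t: tour_cost(t, ext))
    clusters, current = [], []
    for v in best[1:]:
        if v >= n:                # a dummy node closes the current cluster
            if current:
                clusters.append(current)
            current = []
        else:
            current.append(v)
    if current:
        clusters.append(current)
    return clusters

# Hypothetical toy data: two well-separated groups on a line.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
clusters = cluster_by_tour(dist, k=2)
print(sorted(sorted(c) for c in clusters))
```

Because the search is exhaustive, this sketch is practical only for a handful of nodes; it is meant to show how the border placement falls out of the tour, not how to solve realistic instances.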

The success of the method relies on the insight that we need to gather information not only from a protein's immediate neighbors but also from more remotely related components. Our method is still a simple one: we use the clustering information for proteins with few neighbors and direct voting for proteins with more neighbors. If we take a more integrated view of the protein interaction relationships, we see that a protein can have direct neighbors, indirect neighbors connected through a certain number of "bridge" proteins, non-neighbors in the same cluster, and non-neighbors in different clusters. If we assign different weights to those relationships according to how closely the proteins are related, and fine-tune the weights on our training sets, we hope to obtain a more sophisticated and more accurate prediction system.
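One possible form of such a weighted scheme is sketched below. It is a hypothetical illustration rather than an implemented part of our method: the weight values, the restriction to a single "bridge" protein, the function names, and the toy proteins and annotations are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical weights; in practice they would be tuned on training sets.
WEIGHTS = {"direct": 1.0, "indirect": 0.5, "same_cluster": 0.25, "other": 0.0}

def relationship(p, q, neighbors, cluster_of):
    """Classify how protein q is related to protein p."""
    if q in neighbors[p]:
        return "direct"
    if neighbors[p] & neighbors.get(q, set()):
        return "indirect"          # connected through one shared bridge protein
    if cluster_of[p] == cluster_of[q]:
        return "same_cluster"
    return "other"

def predict_function(p, neighbors, cluster_of, annotations, weights=WEIGHTS):
    """Return the function with the highest weighted vote, or None."""
    scores = defaultdict(float)
    for q, functions in annotations.items():
        if q == p:
            continue
        w = weights[relationship(p, q, neighbors, cluster_of)]
        for f in functions:
            scores[f] += w
    if not scores or max(scores.values()) == 0:
        return None
    # Sorting first makes tie-breaking deterministic (alphabetical).
    return max(sorted(scores), key=scores.get)

# Hypothetical toy network: P is the uncharacterized protein.
neighbors = {"P": {"A"}, "A": {"P", "B"}, "B": {"A"},
             "C": set(), "D": set(), "E": set()}
cluster_of = {"P": 0, "A": 1, "B": 1, "C": 0, "D": 0, "E": 0}
annotations = {"A": {"transport"}, "B": {"metabolism"},
               "C": {"metabolism"}, "D": {"metabolism"}, "E": {"metabolism"}}

direct_only = {"direct": 1.0, "indirect": 0.0, "same_cluster": 0.0, "other": 0.0}
print(predict_function("P", neighbors, cluster_of, annotations))
print(predict_function("P", neighbors, cluster_of, annotations, direct_only))
```

In this toy network, direct voting alone follows the single neighbor A, while the weighted scheme lets the indirect neighbor and the cluster mates of P outvote it, which is the kind of behavior the weights would have to be calibrated for on real training data.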