HAMSTER: visualizing microarray experiments as a set of minimum spanning trees
 Raymond Wan^{1, 2},
 Larisa Kiseleva^{2},
 Hajime Harada^{2},
 Hiroshi Mamitsuka^{1} and
 Paul Horton^{2}
DOI: 10.1186/1751-0473-4-8
© Wan et al; licensee BioMed Central Ltd. 2009
Received: 15 January 2009
Accepted: 20 November 2009
Published: 20 November 2009
Abstract
Background
Visualization tools allow researchers to obtain a global view of the interrelationships between the probes or experiments of a gene expression (e.g. microarray) data set. Some existing methods include hierarchical clustering and k-means. In recent years, others have proposed applying minimum spanning trees (MST) for microarray clustering. Although MST-based clustering is formally equivalent to the dendrograms produced by hierarchical clustering under certain conditions, visually they can be quite different.
Methods
HAMSTER (Helpful Abstraction using Minimum Spanning Trees for Expression Relations) is an open source system for generating a set of MSTs from the experiments of a microarray data set. While previous works have generated a single MST from a data set for data clustering, we recursively merge experiments and repeat this process to obtain a set of MSTs for data visualization. Depending on the parameters chosen, each tree is analogous to a snapshot of one step of the hierarchical clustering process. We score and rank these trees using one of three proposed schemes. HAMSTER is implemented in C++ and makes use of Graphviz for laying out each MST.
Results
We report on the running time of HAMSTER and demonstrate, using data sets from the NCBI Gene Expression Omnibus (GEO), that the images created by HAMSTER offer insights that differ from the dendrograms of hierarchical clustering. In addition to the C++ program, which is available as open source, we also provide a web-based version (HAMSTER^{+}) which allows users to apply our system through a web browser without any computer programming knowledge.
Conclusion
Researchers may find it helpful to include HAMSTER in their microarray analysis workflow as it can offer insights that differ from hierarchical clustering. We believe that HAMSTER would be useful for certain types of gradient data sets (e.g. time-series data) and data that indicate relationships between cells/tissues. Both the source and the web server variant of HAMSTER are available from http://hamster.cbrc.jp/.
Background
The high dimensionality and exploratory nature of microarray data analysis have led to the application of several unsupervised data clustering techniques to aid in the visualization of gene expression data. Three popular methods are hierarchical clustering (HC) [1], k-means [2], and self-organizing maps (SOM) [3] (others have previously compared these systems [4]). Implementations of these methods can be found in TreeView [1], Cluster [5], and GENECLUSTER [3]; in more general statistical tools such as R [6]; or online, as part of public microarray repositories such as NCBI's Gene Expression Omnibus (GEO) [7]. Among these, the most popular is hierarchical clustering, which builds a tree by recursively combining the two most similar objects.
Hierarchical Clustering (HC)
A microarray data set can be represented as a two-dimensional data table of n experiments and m probes. Even though we focus on visualizing experiments, the methods we describe are equally applicable to probes. Of course, issues such as computation time may be significantly different since the number of probes is usually many times larger than the number of available experiments.
Hierarchical clustering (HC) forms a tree called a dendrogram, in either a bottom-up (agglomerative) or top-down (divisive) fashion. In bottom-up construction, each experiment is initialized as being in its own cluster and these clusters form the leaves of the dendrogram. Recursive pairing of clusters grows the dendrogram upwards, until only a single cluster remains. Each merge step adds an internal node to the dendrogram.
Table 1: Sample microarray.
Experiment | Probe 1 | Probe 2
--- | --- | ---
A | 2 | 1
B | 2 | 2
C | 1 | 3
D | 4 | 4
One important property of dendrograms, which we will return to later, is that the two branches of each internal node can be flipped without any loss in the definition of hierarchical clustering. This gives flexibility in the order in which experiments are shown, but it can also cause problems since a dendrogram could imply an order which is not present. In our example, the positions of nodes A and B can be inverted and the dendrogram would still represent the data of Table 1.
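The flipping property can be made concrete with a small sketch. It assumes the dendrogram for Table 1 has the shape (((A, B), C), D), which is what single linkage with Euclidean distance would give; the nested-tuple encoding and the function name `leaf_orders` are illustrative choices, not part of HAMSTER.

```python
from itertools import product

# Assumed dendrogram for the Table 1 example, written as nested pairs:
# A and B merge first, then C joins, then D.
DENDROGRAM = ((("A", "B"), "C"), "D")


def leaf_orders(tree):
    """Return every left-to-right leaf order obtainable by flipping branches."""
    if isinstance(tree, str):
        return [[tree]]
    left, right = tree
    orders = []
    for l, r in product(leaf_orders(left), leaf_orders(right)):
        orders.append(l + r)  # branches as drawn
        orders.append(r + l)  # the two branches flipped
    return orders


orders = leaf_orders(DENDROGRAM)
print(len(orders))
for order in orders:
    print(" ".join(order))
```

With n leaves there are n - 1 internal nodes, each of which can be flipped independently, so this four-leaf tree has 2^3 = 8 equally valid leaf orders.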
Minimum Spanning Trees (MSTs)
A graph is a concept in computer science which represents information as a set of nodes and edges, such that each edge indicates some relationship between its two associated nodes (for our purposes, we disallow edges which connect a node to itself). An undirected graph G(V, E) is composed of a set of vertices V and a set of edges E, with no direction on any of the edges. A spanning tree for G is a connected graph with the same vertices but only |V| - 1 edges and no cycles. In a connected graph, each node can be reached from any other node by following a series of edges. If the edges in the original graph are weighted, a spanning tree whose edges have the minimal total weight is called a minimum spanning tree (MST). We denote a minimum spanning tree of G as G_{M}(V, E_{M}), such that E_{M} ⊆ E.
Several algorithms exist for calculating MSTs, including Prim's algorithm [9] and Kruskal's algorithm [10]. Prim's algorithm starts from an arbitrary node and extends the MST a node at a time in a greedy manner by selecting neighboring edges with the lowest weight. Instead of adding nodes, Kruskal's algorithm adds edges. It organizes the dissimilarity matrix d as a sorted list and adds edges, starting from the one with the lowest edge weight, if they connect two previously disconnected components. If all edge weights in G are unique, then the MST produced is unique, regardless of the algorithm employed. Both algorithms, along with a general treatment of MSTs, are covered in books on graph theory [11, 12].
MST construction refers to the selection of nodes or edges. If this procedure is coupled with the Euclidean distance measure for determining edge weights, then some authors have referred to the MSTs as EMSTs (Euclidean Minimum Spanning Trees) [13]. We do not make such a restriction since several other dissimilarity metrics are equally important to microarray data.
In this work, we have selected Kruskal's algorithm, which we illustrate by extending our earlier example.
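As a minimal sketch of Kruskal's algorithm on the sample data of Table 1, the following assumes Euclidean distance between experiments and a simple union-find structure. The names `SAMPLE` and `kruskal` are illustrative and are not taken from the HAMSTER source.

```python
from itertools import combinations
from math import dist

# Experiments of Table 1 as (Probe 1, Probe 2) coordinates.
SAMPLE = {"A": (2, 1), "B": (2, 2), "C": (1, 3), "D": (4, 4)}


def kruskal(points):
    """Return MST edges as (weight, u, v), in the order Kruskal adds them."""
    parent = {name: name for name in points}

    def find(x):
        # Follow parent links to the representative of x's component.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise Euclidean distances, sorted from smallest to largest.
    edges = sorted(
        (dist(points[u], points[v]), u, v) for u, v in combinations(points, 2)
    )
    mst = []
    for weight, u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            mst.append((weight, u, v))
    return mst


mst = kruskal(SAMPLE)
for weight, u, v in mst:
    print(f"{u}-{v}  {weight:.3f}")
```

For this data the three accepted edges are A-B (1.000), B-C (1.414) and B-D (2.828); the edge A-C (2.236) is rejected because A and C are already connected through B.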
Dendrograms and MSTs
In our example, the dendrogram produced from HC appears similar in structure to the MST. As observed by others [14], this property can be summarized as follows:
Property 1 Bottom-up hierarchical clustering using single linkage generates a dendrogram which is equivalent in structure to a minimum spanning tree generated using Kruskal's algorithm.
The reason for Property 1 is straightforward. Hierarchical clustering using single linkage recursively chooses the smallest dissimilarity between two clusters. On the other hand, Kruskal's algorithm sorts the edge weights in increasing order and adds edges to G only if they connect two separated clusters. So, if the dendrogram from hierarchical clustering is treated as a more general graph, the connections added to both graph representations are the same. If there were identical dissimilarities (edge weights), additional tie-breaking rules (based, for example, on node labels) would be required.
Basically, Property 1 implies that the same groups of nodes are connected, whether they are graph components in an MST or subtrees in a dendrogram. Despite Property 1, dendrograms are visually very different from MSTs and it is this difference that HAMSTER aims to leverage.
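Property 1 can be checked on the Table 1 sample with a short sketch. It assumes Euclidean distance and no tied distances, and compares the groups formed by single-linkage merging with the components formed as Kruskal's algorithm accepts edges. All function names are illustrative.

```python
from itertools import combinations
from math import dist

SAMPLE = {"A": (2, 1), "B": (2, 2), "C": (1, 3), "D": (4, 4)}


def linkage(points, a, b):
    """Single linkage: smallest pairwise distance between clusters a and b."""
    return min(dist(points[p], points[q]) for p in a for q in b)


def single_linkage_merges(points):
    """Bottom-up clustering; returns (merge height, merged cluster) per step."""
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage(points, *pair),
        )
        merges.append((linkage(points, a, b), a | b))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


def kruskal_components(points):
    """Kruskal's algorithm; returns (edge weight, new component) per edge."""
    component = {name: frozenset([name]) for name in points}
    steps = []
    edges = sorted(
        (dist(points[u], points[v]), u, v) for u, v in combinations(points, 2)
    )
    for weight, u, v in edges:
        if component[u] != component[v]:
            merged = component[u] | component[v]
            for name in merged:
                component[name] = merged
            steps.append((weight, merged))
    return steps


hc = single_linkage_merges(SAMPLE)
kr = kruskal_components(SAMPLE)
same = [(round(h, 9), c) for h, c in hc] == [(round(w, 9), c) for w, c in kr]
print(same)
```

Both procedures group {A, B} first, then {A, B, C}, then all four experiments, at the same dissimilarities, which is the sense in which the dendrogram and the MST connect the same groups of nodes.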
Some of the differences between dendrograms and MSTs are as follows. A dendrogram has an orientation, so that users can examine it from the root node down to the leaves. MSTs have no obvious starting point and need to be examined as a whole. The most important difference is that dendrograms introduce internal nodes to connect clusters together while MSTs do not. In an MST, experiments are connected directly to each other, allowing users to examine the MST for hubs and neighborhoods. For example, in Figure 5, it can be seen that experiment B is a central node, in the sense that it is the nearest neighbor of A, C, and D. This point cannot be easily seen from the corresponding dendrogram.
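The hub claim about experiment B can be verified directly from Table 1. The sketch below assumes Euclidean distance; `nearest_neighbor` is an illustrative helper and not a HAMSTER function.

```python
from math import dist

SAMPLE = {"A": (2, 1), "B": (2, 2), "C": (1, 3), "D": (4, 4)}


def nearest_neighbor(points, name):
    """Return the label of the experiment closest to `name`."""
    others = (other for other in points if other != name)
    return min(others, key=lambda other: dist(points[name], points[other]))


neighbors = {name: nearest_neighbor(SAMPLE, name) for name in SAMPLE}
for name, neighbor in neighbors.items():
    print(name, "->", neighbor)
```

A, C and D all report B as their nearest neighbor, so B ends up with three incident edges in the MST, while in a dendrogram B is just one leaf among four.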
Previous Applications of MSTs
In bioinformatics, MSTs have been used in areas ranging from depicting the sequence repetitions in part of the C. elegans genome [15] to showing the shapes of disease clusters (e.g., cases of the West Nile virus) on a map [13]. In the latter case, the authors showed that the disease clusters are allowed to form arbitrary shapes, as determined by distances between objects, instead of circles as implied by Euclidean distance-based methods.
As for microarray data, our application of MSTs is most closely related to the work on data clustering by researchers at the Oak Ridge National Laboratory/University of Georgia. Initially, they proved that MSTs have the desirable property that all "good" clusters must consist of nodes that form a connected subgraph of the MST [16]. The sufficient condition of "good" proposed is that if a cluster of nodes from a graph G is split into two non-empty halves C_{1} and C_{2}, the nearest neighbor of any node in G\C_{1} is in C_{2}. Their main idea is to construct an MST from the genes in a microarray data set and then apply clustering algorithms on the MST instead of the original data. In essence, the problem of clustering the original data is converted to a simpler tree-partitioning problem. They incorporated their ideas into a system dubbed EXCAVATOR (EXpression data Clustering Analysis and VisualizATiOn Resource) [17]. Since then, they have also built CUBIC for clustering regulatory binding sites using MSTs where the vertices are k-mers and the dissimilarity between k-mers is calculated using the Hamming distance [18, 19]. More recently they investigated parallelizing MST construction where the number of vertices is as high as one million [20].
Varma and Simon [2004] have used MSTs for feature (gene) selection for two-class microarray clustering [21]. In many microarray-based studies, the number of genes that are up/down-regulated is typically very small. Their aim is to use MSTs to select the genes which are most pertinent to the study. Instead of examining all combinations of genes, they only evaluated the subset of genes obtained by removing a single edge at a time from the MST. Afterwards, the experiments are clustered using hierarchical clustering and this gene subset.
Magwene et al. [2003] presented a method for identifying the time-index of biological samples by combining an MST with a data structure called a PQ-tree [22]. Assuming that changes in the transcriptome are "smooth and continuous", the samples should form a single path. Deviations from the path (branches) are checked recursively using the PQ-tree.
Results
Previous works used MSTs for data clustering since they afford more efficient clustering. Our primary aim, however, is to use MSTs for microarray visualization. Rather than breaking MSTs into components to form clusters, all of our MSTs remain connected.
Based on this underlying premise, there are three main results in this paper. Instead of a single MST, we build sets of MSTs for a single microarray data set and show how they are similar to dendrogram construction. This observation is an extension to our earlier discussion where we showed how dendrograms are similar to MSTs, with the internal nodes of dendrograms being the most notable difference. We then propose three schemes which score and then rank the MSTs in a set to help users select the important MSTs. Finally, we describe a publicly available system called HAMSTER that embodies these ideas and which we also demonstrate with real data. These three results are covered in the sections below.
Set of MSTs
A set of MSTs is constructed using HAMSTER by recursively merging the most similar experiments in the original microarray data set of n experiments and m genes. To facilitate the discussion below, we distinguish between the microarray and the MST views used by HAMSTER. Clusters of experiments are formed from the microarray with dissimilarities (or distances) and linkages between clusters, similar to hierarchical clustering. The abstract MST view of these clusters has nodes and edge weights between them. This separation emphasizes that experiment merging is done on the microarray data set and one possible interpretation of the set of MSTs is that it allows users to interpret the relationships between clusters as a connected graph.
The MSTs are constructed by buildmst as follows. Initially, each of the n experiments in the data set is itself a cluster. The first MST (MST 0) is obtained by directly applying Kruskal's algorithm. Then, the two most similar clusters in the microarray data set are combined and an MST of n-1 nodes is created from it (again, using Kruskal's algorithm) and designated as MST 1. This process continues until MST n-1 is formed (n MSTs in total), which would have a single node that encompasses all of the experiments and no edges. As with hierarchical clustering, edge weights between composite clusters are calculated through the user's selection of linkages.
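Because edge weights between composite clusters depend on the chosen linkage, a small sketch may help show how the four linkage types differ. The definitions below are the textbook ones (with Euclidean distance assumed throughout) and may differ in detail from HAMSTER's implementation; `linkage_distance` is a hypothetical helper.

```python
from math import dist
from statistics import mean

SAMPLE = {"A": (2, 1), "B": (2, 2), "C": (1, 3), "D": (4, 4)}


def linkage_distance(points, a, b, linkage="single"):
    """Dissimilarity between clusters a and b under the named linkage."""
    pairwise = [dist(points[p], points[q]) for p in a for q in b]
    if linkage == "single":      # closest pair of members
        return min(pairwise)
    if linkage == "complete":    # farthest pair of members
        return max(pairwise)
    if linkage == "average":     # mean over all cross-cluster pairs
        return mean(pairwise)
    if linkage == "centroid":    # distance between the cluster means
        centroid_a = [mean(axis) for axis in zip(*(points[p] for p in a))]
        centroid_b = [mean(axis) for axis in zip(*(points[q] for q in b))]
        return dist(centroid_a, centroid_b)
    raise ValueError(f"unknown linkage: {linkage}")


left, right = ["A", "B"], ["C", "D"]
results = {
    name: linkage_distance(SAMPLE, left, right, name)
    for name in ("single", "average", "complete", "centroid")
}
for name, value in results.items():
    print(f"{name:9s} {value:.3f}")
```

For the clusters {A, B} and {C, D}, single linkage gives about 1.414 (the B-C distance) and complete linkage about 3.606 (the A-D distance), so the same pair of clusters can receive quite different edge weights depending on the linkage selected.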
The main difference between our work and what others have done previously with MSTs is the application of Kruskal's algorithm n times. While each merge changes the number of clusters, Kruskal's algorithm does not make use of information such as the number of experiments within each cluster.
We describe buildmst in further detail using Algorithm 1. The buildmst system constructs a priority queue of potential clusters in order to efficiently locate the next merge. When a cluster is formed, its dissimilarity with all other clusters must be calculated and inserted into the queue. The priority queue of dissimilarities is implemented as a heap and Kruskal's algorithm sorts these dissimilarities to build the MSTs [11].
Algorithm 1: Pseudocode depicting the merging scheme of buildmst. Translating the description of the n MSTs into images is performed by layoutmst (not shown here).
Data: Microarray data set X (n experiments × m probes)
Result: n MSTs (MST 0, ..., MST n-1)
C ← initializeClusters(X)
D ← calculateDistances(X)
PQ ← buildPQueue(D)
MST 0 ← buildMST(C, PQ)
for i ← 1 to n-1 do
    C ← mergeClusters(C)
    PQ ← updatePQueue(PQ)
    MST i ← buildMST(C, PQ)
end
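The merging scheme of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration rather than the buildmst implementation: it assumes Euclidean distance and single linkage, and it recomputes cluster dissimilarities at every step instead of maintaining a priority queue. The function names are illustrative.

```python
from itertools import combinations
from math import dist

SAMPLE = {"A": (2, 1), "B": (2, 2), "C": (1, 3), "D": (4, 4)}


def cluster_distance(points, a, b):
    """Single linkage between clusters a and b (an assumed choice)."""
    return min(dist(points[p], points[q]) for p in a for q in b)


def build_mst(points, clusters):
    """Kruskal's algorithm where each node is a whole cluster."""
    parent = {c: c for c in clusters}

    def find(c):
        while parent[c] != c:
            c = parent[c]
        return c

    edges = sorted(
        ((cluster_distance(points, a, b), a, b)
         for a, b in combinations(clusters, 2)),
        key=lambda edge: edge[0],
    )
    mst = []
    for weight, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            mst.append((weight, a, b))
    return mst


def build_msts(points):
    """Return the n MSTs (MST 0 .. MST n-1) produced by repeated merging."""
    clusters = [frozenset([name]) for name in points]
    msts = [build_mst(points, clusters)]  # MST 0: one node per experiment
    while len(clusters) > 1:
        # Merge the two most similar clusters, then rebuild the MST.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(points, *pair),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        msts.append(build_mst(points, clusters))
    return msts


msts = build_msts(SAMPLE)
for i, mst in enumerate(msts):
    edges = [("".join(sorted(a)), "".join(sorted(b)), round(w, 3)) for w, a, b in mst]
    print(f"MST {i}: {edges}")
```

For the four experiments of Table 1 this yields four MSTs with 3, 2, 1 and 0 edges respectively, the last being a single node that contains every experiment.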
The output of buildmst is a set of MSTs such that each edge in the MST also has an edge weight and every node has an attribute (color and shape). Edge weights are normalized out of 1.0 for each MST and are used to indicate the relative distance between nodes when the MST is drawn. As for attributes, a user has the option of assigning them to the experiments of the data set, which are then passed along as merging progresses. If two clusters with the same attributes are merged, then the associated MST node obtains the same attributes; otherwise, it is assigned default attributes. Creating a graphical MST from this information is done by the layoutmst phase, which lays out the graph by compromising between the distances between nodes (the edge weights) and minimizing the number of node overlaps.
By comparing Figure 3 with Figure 7, we can see that the set of MSTs is analogous to taking snapshots at each iteration of the hierarchical clustering process. At each step, the corresponding images show the relationship between the clusters.
Scoring and Ranking MSTs
While some users may be satisfied with the set of images produced by our method, others may require further guidance by having each image scored individually. This is analogous to users having to decide where to cut a dendrogram to give meaningful subtrees. We investigated several scoring schemes for the purpose of MST ranking and implemented three: gap-based, ANOVA, and normalized association.
The gap-based and ANOVA schemes are based on the idea that the dissimilarities in the distance matrix can be separated into two disjoint groups: those that are within a cluster (intra-cluster) and those that are between clusters (inter-cluster). Starting from all of the distances in the inter-cluster group, the merging process used by HAMSTER moves distances from this group to the intra-cluster group until the former is empty and only a single cluster remains. Since distances are chosen in increasing order, these schemes determine the point at which many small distances are in the intra-cluster group and many large distances are in the inter-cluster group.
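The split into intra-cluster and inter-cluster distances can be illustrated on the Table 1 sample. The sketch below only shows how distances move between the two groups as merging proceeds; it does not reproduce HAMSTER's gap-based or ANOVA scores, whose formulas are not given here. Euclidean distance and the single-linkage merge order are assumptions, and `partition_distances` is a hypothetical helper.

```python
from itertools import combinations
from math import dist

SAMPLE = {"A": (2, 1), "B": (2, 2), "C": (1, 3), "D": (4, 4)}


def partition_distances(points, clusters):
    """Split all pairwise distances into (intra-cluster, inter-cluster) lists."""
    owner = {
        name: index
        for index, cluster in enumerate(clusters)
        for name in cluster
    }
    intra, inter = [], []
    for u, v in combinations(points, 2):
        target = intra if owner[u] == owner[v] else inter
        target.append(dist(points[u], points[v]))
    return sorted(intra), sorted(inter)


# Clusterings after 0, 1, 2 and 3 merges (single-linkage order assumed).
STEPS = [
    [{"A"}, {"B"}, {"C"}, {"D"}],
    [{"A", "B"}, {"C"}, {"D"}],
    [{"A", "B", "C"}, {"D"}],
    [{"A", "B", "C", "D"}],
]

counts = []
for step, clusters in enumerate(STEPS):
    intra, inter = partition_distances(SAMPLE, clusters)
    counts.append((len(intra), len(inter)))
    print(step, [round(d, 3) for d in intra], [round(d, 3) for d in inter])
```

After two merges, the three intra-cluster distances are all smaller than the three inter-cluster distances, which is the kind of separation the scoring schemes are intended to detect.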
The normalized association measures how well the experiments within a cluster are associated with each other, relative to other experiments. In the case of HAMSTER, because every experiment has a dissimilarity to every other cluster, the largest normalized association is the trivial case of one cluster containing all experiments. Furthermore, the normalized association monotonically increases with each iteration. To correct this, we multiply Equation (2) by the number of clusters (Z). This increases the score if there are more clusters.
These scoring schemes are used to evaluate MSTs but are actually calculated by examining the effect merging has on the underlying distance matrix. The scoring schemes are all normalized so that they are percentages of the highest score. As a starting point for users, we suggest the gap-based scheme as it is the most familiar to people who employ hierarchical clustering. As suggested earlier, in a dendrogram like Figure 3, clusters are formed by cutting it horizontally to form many trees. Examining the gaps is analogous to assessing the point where such a cut should be made.
Implementation
We describe our implementation of HAMSTER and, in less detail, its web server variant called HAMSTER^{+}. The HAMSTER system and access to HAMSTER^{+} are both available at http://hamster.cbrc.jp/.
HAMSTER is open source and distributed under the GNU General Public License (version 3 or later). The two parts of the HAMSTER system represent two separate executables (buildmst and layoutmst). The source code is written in C++ and documented inline such that Doxygen [24] could be used to produce documentation for users who would like to extend HAMSTER's features (See additional files 1 and 2: buildrefman.pdf and layoutrefman.pdf, respectively.). The software was successfully compiled using Autotools and v4.3.2 of the g++ compiler running under Linux.
Table 3: Summary of the features of HAMSTER.
Program | Feature | Options
--- | --- | ---
buildmst | Dissimilarities | Euclidean, Manhattan, Pearson correlation, and Spearman correlation
buildmst | Linkages | Single, Average, Complete, and Centroid
buildmst | Centroid linkage types | Euclidean, Manhattan, Pearson correlation, and Spearman correlation
layoutmst | Colors and shapes | Same as Graphviz
layoutmst | File formats | PNG, SVG, and Postscript; additional formats supported by Graphviz available by modifying the source code
Running buildmst
The filename of the input microarray data file is required without any option flags. The format of the microarray data for HAMSTER is a tab-separated file with row and column labels included. All other values in the data file must be either floating point values or the string NULL to indicate a missing expression level. While our description of HAMSTER focuses on the experiments, in practice, the system can also be applied to the probes by transposing the data file.
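A reader for a file in this format might look like the following sketch. It assumes that the first row holds column labels with a (possibly empty) corner cell, that the first field of every later row is the row label, and that NULL marks a missing value. Which axis holds experiments is not assumed here, and `read_microarray` is a hypothetical helper rather than part of HAMSTER.

```python
import csv
import io


def read_microarray(handle):
    """Parse a tab-separated table with row and column labels.

    Returns (column_labels, rows), where rows maps each row label to a list
    of floats, with None standing in for the string NULL.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    column_labels = header[1:]  # skip the corner cell
    rows = {}
    for record in reader:
        if not record:  # tolerate blank lines
            continue
        label, *cells = record
        rows[label] = [None if cell == "NULL" else float(cell) for cell in cells]
    return column_labels, rows


# A tiny in-memory example with one missing value.
text = "\tProbe 1\tProbe 2\nA\t2\t1\nB\t2\tNULL\n"
columns, rows = read_microarray(io.StringIO(text))
print(columns)
print(rows)
```

Keeping missing values as None leaves the decision of how to handle them (skipping the probe, imputing a value) to the distance calculation, which is a design choice of this sketch only.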
An optional tab-separated file can be provided with the --attr option, which describes each experiment's attributes. Every experiment and every node in each MST has two attributes: a shape and a color. The set of acceptable shapes and colors is defined by Graphviz, with a few options listed in the README file. If a cluster consists of a mix of attributes, then the MST node which represents it obtains the default attribute of a "gray ellipse".
Four dissimilarity measures and four types of linkages are provided by buildmst, as summarized in Table 3. The four dissimilarity metrics are: Euclidean distance, Manhattan distance, Pearson correlation coefficient, and Spearman rank correlation coefficient. The latter two are converted to dissimilarity measures by subtracting them from 1. The four available linkages are: single, average, complete, and centroid. The last linkage calculates the centroids of the two clusters and then the dissimilarity between them. The type of dissimilarity between centroids is typically the Euclidean distance, but the user can specify others using --centroid. Additional distance and linkage measures can be added by modifying vect_dist.cpp and cluster_link.cpp, respectively.
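The four dissimilarity measures can be written out as a short sketch. These are the standard definitions, with the two correlation coefficients subtracted from 1 as described; they are not copied from vect_dist.cpp, and details such as the handling of ties or of missing values are assumptions.

```python
from math import sqrt


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks; tied values share their average rank (an assumption)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson_dissimilarity(x, y):
    return 1 - pearson(x, y)


def spearman_dissimilarity(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return 1 - pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4]
y = [2, 4, 6, 9]
print(euclidean(x, y), manhattan(x, y))
print(pearson_dissimilarity(x, y), spearman_dissimilarity(x, y))
```

In this example the two vectors rise together but not perfectly linearly, so the Spearman dissimilarity is essentially zero while the Pearson dissimilarity is small but positive. This illustrates why the choice of measure can change which experiments appear most similar.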
The software archive includes a sample configuration file, the sample data set of Table 1 as sample.data, and its corresponding attribute file called sample.attr. A sample application of buildmst would be:
buildmst sample.data --attr sample.attr
A summary of the merging process (summary.txt) is produced which shows information about each merge step.
Running layoutmst
After buildmst has calculated the MSTs, the layoutmst system generates the images. The only required argument is the summary created by buildmst. All other files are assumed to be in the same directory. Additional options are available which can be used to change the size or resolution of the images. The following command would use layoutmst with the sample data and default options:
layoutmst summary.txt
The layout of the images is performed externally by Graphviz. The result from this example is a series of images similar to the MSTs of Figure 7. Actual images may differ since the exact placement of nodes is determined by Graphviz, which is executed independently for each MST. An option (--fixedpos) is available which lays out each MST using the previous MST's node positions as starting points in order to minimize the visual differences between them.
Since these images are generated independently, layoutmst is also able to make use of MPI (Message Passing Interface) to distribute the workload to multiple CPUs. Details on how to do this are given in the README file. The system has been tested with Open MPI v1.2.7 [26], but other libraries that follow the MPI standard should work.
Either --fixedpos or MPI can be used, but not both. This is because enabling MPI distributes the workload across multiple processors, while fixing the positions of nodes requires layoutmst to process the MSTs in sequential order.
A final time-saving measure that can be used with MPI is the --percent option, which indicates the percentage of images to generate, starting with the ones corresponding to the MSTs with the highest scores.
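To give a feel for what a Graphviz description of an MST could look like, the sketch below writes one MST in the DOT language. It is not the format that layoutmst actually emits, which is not shown in this paper. Scaling edge lengths by the longest edge is one possible reading of "normalized out of 1.0", and the function name `mst_to_dot` and the attribute choices are illustrative. The `len` edge attribute is a hint honored by layout engines such as neato.

```python
def mst_to_dot(edges, attributes, default=("ellipse", "gray")):
    """Render MST edges (weight, u, v) as an undirected DOT graph.

    `attributes` maps a node label to a (shape, color) pair; nodes without
    an entry receive the default gray ellipse.
    """
    longest = max((weight for weight, _, _ in edges), default=1.0)
    lines = ["graph mst {"]
    nodes = sorted({node for _, u, v in edges for node in (u, v)})
    for node in nodes:
        shape, color = attributes.get(node, default)
        lines.append(
            f'  "{node}" [shape={shape}, style=filled, fillcolor={color}];'
        )
    for weight, u, v in edges:
        # Preferred edge length, relative to the longest edge.
        lines.append(f'  "{u}" -- "{v}" [len={weight / longest:.3f}];')
    lines.append("}")
    return "\n".join(lines)


# MST of the Table 1 sample under Euclidean distance.
EDGES = [(1.0, "A", "B"), (2 ** 0.5, "B", "C"), (8 ** 0.5, "B", "D")]
ATTRIBUTES = {"A": ("box", "red"), "B": ("box", "red"), "C": ("ellipse", "blue")}

dot = mst_to_dot(EDGES, ATTRIBUTES)
print(dot)
```

If the output is saved as mst.dot, a command such as `neato -Tpng mst.dot -o mst.png` would render it, with D drawn as a gray ellipse because it has no attributes assigned.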
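Selecting the highest-scoring fraction of MSTs is simple to sketch. The rounding rule (rounding up) and the function name `select_top` are assumptions made for illustration, and the scores are made-up numbers rather than HAMSTER output.

```python
from math import ceil


def select_top(scores, percent):
    """Return the indices of the MSTs whose scores fall in the top `percent`."""
    count = ceil(len(scores) * percent / 100)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:count])


# Hypothetical normalized scores for MST 0 .. MST 5.
scores = [10, 55, 100, 80, 20, 0]
chosen = select_top(scores, 50)
print(chosen)
```

Here half of the six MSTs are kept, namely MST 1, MST 2 and MST 3, so only three images would need to be laid out.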
Web Server: HAMSTER^{+}
HAMSTER^{+} adds a wrapper around the local version of HAMSTER and is also available from http://hamster.cbrc.jp/. Further details about HAMSTER^{+} are provided in an online tutorial. Its main features are:

No user login or local software installation is required.

Support of microarray data from NCBI's Gene Expression Omnibus (GEO) in their Simple Omnibus Format in Text (SOFT) [7]. GEO data sets (GDS) can be referred to by their unique accession number and downloaded from NCBI's ftp server.

The web interface allows experiments to be selected from the microarray data set.

559 experiments have been manually classified into 87 categories for the purpose of assigning initial attributes to them.

Seven GEO platforms that encompass 19.3% of the 275,665 experiments in GEO (as of January 6, 2009) have been mapped to Gene Ontology categories [27]. This allows the probes of the microarray data to be selected based on gene functions.

The MST images and their respective Graphviz sources can be browsed and downloaded.

Each data set to be processed is assigned a unique URL that users can send to collaborators or use later for viewing. We recommend that users concerned about privacy use the local version instead.
Testing
We report on the expected running time of HAMSTER on various GEO data sets and then examine the output from both hierarchical clustering and HAMSTER for three of these data sets. For the first data set we consider all aspects of HAMSTER, including its similarity (and dissimilarity) with hierarchical clustering and the results from MST scoring. As for the remaining two data sets, we direct our attention to data sets which involve a gradient of sample categories, which we believe makes them suitable for visualization using MSTs. Since hierarchical clustering is not part of HAMSTER, we have chosen to use the agnes function of the cluster library for R to create the dendrograms [6, 28]. We employ various types of linkages in our examples to illustrate the possibilities with HAMSTER.
Execution time of HAMSTER
Table 4: Dimensions of GEO test data and running time and memory usage of buildmst.
Data set | Experiments | Probes | Elapsed time (s) | Memory usage (MB)
--- | --- | --- | --- | ---
GDS596 | 158 | 22283 | 23.28 | 201.391
GDS1962 | 180 | 54681 | 66.65 | 383.980
GDS2765 | 13 | 45101 | 1.03 | 28.305
GDS2771 | 192 | 22283 | 33.03 | 276.352
GDS3069 | 12 | 22283 | 0.52 | 25.000
GDS3216 | 12 | 22810 | 0.56 | 25.758
All of our experiments were run on an otherwise idle 2.4 GHz Intel Core 2 Quad CPU (Q6600) with 8 GB RAM. Running times are reported as seconds and averaged across 5 trials.
The running time and memory usage of buildmst are shown as the last two columns of Table 4. The longest time is associated with GDS1962, which takes just over 1 minute. As expected, the running time is more dependent on the number of experiments than the number of probes, as shown by comparing the results of GDS596 with GDS2765. As for memory, the data set that gave the peak memory usage was GDS1962 at almost 400 MB.
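One way to see why the number of experiments dominates is to count the work in the initial distance matrix: n experiments give n(n-1)/2 pairs, each compared over m probes. The sketch below applies this rough cost model to the figures of Table 4. The model is an assumption for illustration only; it ignores file reading, merging, and MST construction.

```python
# (experiments, probes, elapsed seconds) taken from Table 4.
DATASETS = {
    "GDS596": (158, 22283, 23.28),
    "GDS1962": (180, 54681, 66.65),
    "GDS2765": (13, 45101, 1.03),
    "GDS2771": (192, 22283, 33.03),
    "GDS3069": (12, 22283, 0.52),
    "GDS3216": (12, 22810, 0.56),
}


def pair_count(n):
    """Number of distinct experiment pairs among n experiments."""
    return n * (n - 1) // 2


work = {}
for name, (n, m, seconds) in DATASETS.items():
    pairs = pair_count(n)
    work[name] = pairs * m  # probe comparisons for the initial matrix
    print(f"{name:8s} pairs={pairs:6d} comparisons={work[name]:12d} time={seconds}s")
```

GDS2765 has roughly twice as many probes as GDS596 but only 78 experiment pairs against 12,403, which is consistent with its much shorter running time.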
GDS2765: Effect of Creatine on Mice
As an example, we apply both HAMSTER and hierarchical clustering to the data set GDS2765, where researchers investigated the effect creatine has on the expression level of brain tissue in mice [29]. There are 13 samples in total and only two classes: untreated/control (7 samples) and creatine-treated (6 samples) mice. In this example, we have chosen Euclidean distance and single linkage for both so that hierarchical clustering is directly comparable to the corresponding MSTs. We should emphasize that in both cases, the methods are not provided with information about the experiment type (untreated or treated), except in the sense that we chose the color associated with each experiment based on its type (untreated in red; treated in blue).
The different peaks in this graph show that there are multiple ways of evaluating how well the experiments of a data set cluster, regardless of the clustering method. We suggest that users try the scheme which is most suitable for their needs, based on the definitions given earlier.
GDS3069: Various brain tumors
Our next sample data set is GDS3069, which was used to analyze 12 primary brain tumors based on their histological diagnoses [30]. Unlike the previous data set of two distinct categories, this one has a gradient of 5 categories with overlaps between them. Also, the number of samples per category varies greatly; for example, one category has only one sample. The categories and the coloring that we have chosen are: high grade glioblastoma (red), high grade gliosarcoma (yellow), high grade glioblastoma/gliosarcoma (violet), low grade oligodendroglioma (green), and low grade anaplastic mixed glioma (blue). Euclidean distance and average linkage have been selected for this analysis.
GDS3216: Whole seedling roots response to salinity stress: time-course
The images show that the replicates are generally adjacent to each other in both the dendrogram at the top and MST 0. It seems that the dendrogram gives a better view of the time-series data set than the MST. The colors indicate that the experiments in the dendrogram are ordered by time-index, with the exception of GSM184931 at the far right. In the MST, we have a gradient split into two parts. The first six experiments appear at the top and the last six at the bottom, with GSM184932 (green) lying away from this latter group. Rather than being connected end-to-end, a red experiment is connected to a violet one.
However, as suggested earlier, there are many possible permutations of a dendrogram that would be valid. In the dendrogram at the top of Figure 16, we can swap the branches at the three internal nodes indicated in red to form the dendrogram below it. These swaps have moved GSM184931 (green) to the far left and GSM184925 (red) to the far right. GSM184933 (blue) has also shifted to the right side of its subtree. While three pairs of replicates remain together (yellow, orange, and violet) since they are part of their own subtree, these swaps have shown that a different dendrogram could easily have been generated. It appears that the agnes function for R uses the order of the experiments in the original data sets to determine the final dendrogram leaf order. The reason why the top dendrogram is produced is that the experiments appear in time-index order in the original GEO data file. This would be a useful feature if the natural ordering of the data is both known a priori and is valid for visualization.
Conclusion
In this paper, we have described a system called HAMSTER which allows users to visualize the experiments of a gene expression data set as a set of minimum spanning trees (MSTs). In addition, we also describe three scoring schemes which help users judge the quality of these MSTs.
Our results show that MSTs offer a view of microarray data that is related to, but still different from, the dendrograms that have been used for data visualization and clustering by others. The creation of a set of MSTs in this manner is absent from previous works with MSTs. This feature allows users to visualize microarray data in terms of dendrograms by presenting relationships between subtrees. Through examples, we show that MSTs are particularly useful for microarray studies with gradient-based data (such as time-course studies).
The HAMSTER system implements the above procedure as open-source, GPL-licensed software that makes use of other tools, including Graphviz and (optionally) Open MPI. In addition, we also introduced a web server called HAMSTER^{+}, which adds a wrapper around HAMSTER that is tailored toward NCBI GEO data. HAMSTER^{+} allows users to evaluate HAMSTER without any local software installation. Both HAMSTER and HAMSTER^{+} are available from http://hamster.cbrc.jp/.
While HAMSTER's original intention was to depict microarray experiments as a set of MSTs, the system is general enough that it could be used directly for probes if the data set is transposed. Evaluation of this potential purpose of HAMSTER is left as future work. Our survey of GEO data has shown that the number of samples in a microarray data set is typically less than 200 (see Table 4). While HAMSTER's running time of 67 seconds for the largest data set seems acceptable, if data sets were many times larger than this, then parallelization of buildmst using MPI is another possible avenue for future work [20]. It would also be interesting to explore other aspects of HAMSTER unrelated to running time, such as scoring schemes that better reflect the needs of users.
Abbreviations
MST: minimum spanning tree
HC: hierarchical clustering
HAMSTER: Helpful Abstraction using Minimum Spanning Trees for Expression Relations
MPI: Message Passing Interface
GEO: Gene Expression Omnibus
SOFT: Simple Omnibus Format in Text
PNG: Portable Network Graphics
DPI: dots per inch.
Declarations
Acknowledgements
We thank the reviewers whose comments greatly improved this paper. This work was supported by the Japan Society for the Promotion of Science [Postdoctoral fellowship for R.W.]; INTEC Systems Institute, Inc. [R.W.]; BIRD of Japan Science and Technology Agency (JST) [R.W. and H.M.]; and the Japanese Ministry of Education, Culture, Sport, Science and Technology, Grant-in-Aid for Scientific Research (B) [L.K. and P.H.].
References
1. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc National Academy of Sciences USA. 1998, 95 (25): 14863-14868. 10.1073/pnas.95.25.14863.
2. Herwig R, Poustka AJ, Müller C, Bull C, Lehrach H, O'Brien J: Large-scale clustering of cDNA-fingerprinting data. Genome Research. 1999, 9 (11): 1093-1105. 10.1101/gr.9.11.1093.
3. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc National Academy of Sciences USA. 1999, 96 (6): 2907-2912. 10.1073/pnas.96.6.2907.
4. Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC: Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics. 2006, 22 (19): 2405-2412. 10.1093/bioinformatics/btl406.
5. de Hoon MJL, Imoto S, Nolan J, Miyano S: Open source clustering software. Bioinformatics. 2004, 20 (9): 1453-1454. 10.1093/bioinformatics/bth078.
6. R. [http://www.r-project.org/]
7. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Research. 2007, 35: D760-D765. 10.1093/nar/gkl887. [http://www.ncbi.nlm.nih.gov/geo/]
8. HAMSTER and HAMSTER^{+}. [http://hamster.cbrc.jp/]
9. Prim RC: Shortest connection networks and some generalizations. Bell System Technical Journal. 1957, 36: 1389-1401.
10. Kruskal JB: On the shortest spanning subtree of a graph and the traveling salesman problem. Proc American Mathematical Society. 1956, 7: 48-50. 10.2307/2033241.
11. Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms. 2009, Cambridge, USA: The MIT Press, third edition.
12. Sedgewick R: Algorithms in C, Part 5: Graph Algorithms. 2002, New York, USA: Addison-Wesley Publishing Co, third edition.
13. Wieland SC, Brownstein JS, Berger B, Mandl KD: Density-equalizing Euclidean minimum spanning trees for the detection of all disease cluster shapes. Proc National Academy of Sciences USA. 2007, 104 (22): 9404-9409. 10.1073/pnas.0609457104.
14. Eppstein D: Fast hierarchical clustering and other applications of dynamic closest pairs. Proc 9th ACM-SIAM Symposium on Discrete Algorithms. 1998, 619-628.
15. Agarwal P, States DJ: The Repeat Pattern Toolkit (RPT): analyzing the structure and evolution of the C. elegans genome. Proc 2nd International Conference on Intelligent Systems for Molecular Biology (ISMB 1994). 1994, 2: 1-9.
16. Xu Y, Olman V, Xu D: Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning trees. Bioinformatics. 2002, 18: 526-535.
17. Xu D, Olman V, Wang L, Xu Y: EXCAVATOR: a computer program for efficiently mining gene expression data. Nucleic Acids Research. 2003, 31 (19): 5582-5589. 10.1093/nar/gkg783. [http://csbl.bmb.uga.edu/downloads/]
18. Olman V, Xu D, Xu Y: Identification of regulatory binding sites using minimum spanning trees. Pacific Symposium on Biocomputing. 2003, 327-338.
19. Olman V, Xu D, Xu Y: CUBIC: identification of regulatory binding sites through data clustering. J Bioinform Comput Biol. 2003, 1: 21-40. 10.1142/S0219720003000162. [http://csbl.bmb.uga.edu/downloads/cubic]
20. Olman V, Mao F, Wu H, Xu Y: Parallel clustering algorithm for large data sets with applications in bioinformatics. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2009, 6 (2): 344-352. 10.1109/TCBB.2007.70272.
21. Varma S, Simon R: Iterative class discovery and feature selection using minimal spanning trees. BMC Bioinformatics. 2004, 5: 126.
22. Magwene PM, Lizardi P, Kim J: Reconstructing the temporal ordering of biological samples using microarray data. Bioinformatics. 2003, 19 (7): 842-850. 10.1093/bioinformatics/btg081.
23. Shi J, Malik J: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000, 22 (8): 888-905. 10.1109/34.868688.
24. Doxygen. [http://www.doxygen.org/]
25. Gansner ER, North SC: An open graph visualization system and its applications to software engineering. Software - Practice and Experience. 2000, 30 (11): 1203-1233. 10.1002/1097-024X(200009)30:11<1203::AID-SPE338>3.0.CO;2-N. [http://www.graphviz.org/]
26. Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, Sahay V, Kambadur P, Barrett B, Lumsdaine A, Castain RH, Daniel DJ, Graham RL, Woodall TS: Open MPI: goals, concept, and design of a next generation MPI implementation. Proc 11th European PVM/MPI Users' Group Meeting, Lecture Notes in Computer Science. 2004, 3241: 97-104. [http://www.open-mpi.org/]
27. The Gene Ontology Consortium: Gene Ontology: tool for the unification of biology. Nature Genetics. 2000, 25: 25-29. 10.1038/75556. [http://www.geneontology.org/]
28. Maechler M, Rousseeuw P, Struyf A, Hubert M: Cluster Analysis Basics and Extensions. 2005, [http://cran.r-project.org/web/packages/cluster/]
29. Bender A, Beckers J, Schneider I, Hölter SM, Haack T, Ruthsatz T, Vogt-Weisenhorn DM, Becker L, Genius J, Rujescu D, Irmler M, Mijalski T, Mader M, Quintanilla-Martinez L, Fuchs H, Gailus-Durner V, Hrabé de Angelis M, Wurst W, Schmidth J, Klopstock T: Creatine improves health and survival of mice. Neurobiology of Aging. 2008, 29 (9): 1404-1411. 10.1016/j.neurobiolaging.2007.03.001. [http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2765]
30. Liu T, Papagiannakopoulos T, Puskar K, Qi S, Santiago F, Clay W, Lao K, Lee Y, Nelson SF, Kornblum HI, Doyle F, Petzold L, Shraiman B, Kosik KS: Detection of a microRNA signal in an in vivo expression set of mRNAs. PLoS ONE. 2007, 2 (8): e804. 10.1371/journal.pone.0000804. [http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS3069]
31. Dinneny JR, Long TA, Wang JY, Jung JW, Mace D, Pointer S, Barron C, Brady SM, Schiefelbein J, Benfey PN: Cell identity mediates the response of Arabidopsis roots to abiotic stress. Science. 2008, 320 (5878): 942-945. 10.1126/science.1153795. [http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS3216]
32. Siek JG, Lee LQ, Lumsdaine A: The Boost Graph Library: User Guide and Reference Manual. 2002, C++ In-Depth Series, New York, USA: Addison-Wesley Publishing Co.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.