We compare the performance of the original [9] and new PFClust implementations by measuring execution time on the following configuration:
- Hardware: Intel(R) Core(TM) i5-3470S CPU @ 2.90 GHz, 8.00 GB RAM
- Operating system: Scientific Linux release 6.3 (Carbon)
- JVM: 1.6.0_45-b06
The running times of each step of the PFClust algorithm (threshold estimation, clustering, and the main iteration that combines these steps) were evaluated separately. Each step and the main iteration were executed 10 times, and the average run times were recorded. The first step of the algorithm, which involves random number generation, was initialized with the same seed in both implementations to keep the number of calculations approximately constant. The clustering was executed with the same set of threshold values for each dataset in both programs. The main iteration carried out the randomization step with the same seed and the clustering procedure with the same thresholds.
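As an illustration of this measurement protocol, the sketch below shows one way such a timing harness could be set up in Java. The `Step` interface and the placeholder workload are hypothetical and stand in for calls into the PFClust threshold estimation, clustering, or main iteration; they do not reproduce either implementation's actual API.

```java
import java.util.Random;

public class TimingHarness {

    /** Hypothetical interface standing in for one PFClust step. */
    interface Step {
        void execute();
    }

    /** Runs a step a fixed number of times and returns the mean wall-clock time in milliseconds. */
    static double averageMillis(Step step, int repetitions) {
        long totalNanos = 0L;
        for (int i = 0; i < repetitions; i++) {
            long start = System.nanoTime();
            step.execute();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (repetitions * 1e6);
    }

    public static void main(String[] args) {
        final long seed = 42L;               // the same seed would be passed to both implementations
        final Random rng = new Random(seed); // a fixed seed keeps the randomized workload comparable

        // Placeholder workload; in the actual benchmark this would invoke the
        // threshold estimation, clustering, or main iteration of PFClust.
        Step step = new Step() {
            public void execute() {
                double sum = 0.0;
                for (int i = 0; i < 1000000; i++) {
                    sum += rng.nextDouble();
                }
                if (sum < 0) {
                    System.out.println(sum); // guards against dead-code elimination
                }
            }
        };

        System.out.printf("mean time over 10 runs: %.2f ms%n", averageMillis(step, 10));
    }
}
```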
The performance improvement of the new implementation is primarily due to the representation of the similarity matrix and cluster objects. The old implementation used string objects as row and column names and looked up values in the similarity matrix by these names. The names were stored in a Vector, and searching for an element in a Vector is O(n), where n is the number of elements. Many operations used two nested loops to search for the corresponding row and column names, resulting in O(n²) behaviour. The cluster objects in the old implementation were also backed by Vectors of strings, making operations on them computationally expensive, and the synchronization built into the Vector class added further overhead, producing an overall performance bottleneck. The new implementation uses a two-dimensional array of primitives to represent the similarity matrix and an ArrayList to represent cluster objects, so values are retrieved by index in constant time. Unlike the old implementation, the new code keeps bookkeeping structures based on HashSet and ArrayList, where applicable, to reduce the number of operations inside the loops. In the threshold estimation step, the data are now sorted before the required values are retrieved from the array, whereas the old implementation selected them in a brute-force fashion.
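The contrast between the two lookup strategies can be sketched as follows. The class, method, and variable names are illustrative only and do not reproduce the actual PFClust source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

public class SimilarityLookup {

    // Old-style representation: names stored in a Vector, values looked up by name.
    static double lookupByName(Vector<String> names, double[][] matrix,
                               String rowName, String colName) {
        int row = names.indexOf(rowName); // linear scan, O(n)
        int col = names.indexOf(colName); // linear scan, O(n)
        return matrix[row][col];          // nested loops over such lookups give O(n²) behaviour
    }

    // New-style representation: a two-dimensional array of primitives addressed by index.
    static double lookupByIndex(double[][] matrix, int row, int col) {
        return matrix[row][col];          // constant-time access
    }

    public static void main(String[] args) {
        double[][] similarity = {
                {1.0, 0.8, 0.3},
                {0.8, 1.0, 0.5},
                {0.3, 0.5, 1.0}
        };

        Vector<String> names = new Vector<String>();
        names.add("A");
        names.add("B");
        names.add("C");

        System.out.println(lookupByName(names, similarity, "A", "C")); // 0.3, via two linear scans
        System.out.println(lookupByIndex(similarity, 0, 2));           // 0.3, via direct indexing

        // Clusters represented as index lists rather than vectors of strings.
        List<Integer> cluster = new ArrayList<Integer>();
        cluster.add(0);
        cluster.add(2);
        System.out.println("cluster members: " + cluster);
    }
}
```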
The evaluation results (Figure 1) show that the execution times are greatly improved. The clusterings produced by the two implementations agree closely, with a very high average Rand index [10] of 0.985 over the seven datasets from [9].
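For reference, the Rand index is the fraction of element pairs on which two clusterings agree (grouped together in both, or separated in both). The minimal sketch below shows the standard calculation; it is not taken from the PFClust code.

```java
public class RandIndex {

    /**
     * Computes the Rand index between two clusterings of the same n elements,
     * each given as an array of cluster labels.
     */
    static double randIndex(int[] labelsA, int[] labelsB) {
        int n = labelsA.length;
        long agreements = 0;
        long pairs = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                boolean togetherInA = labelsA[i] == labelsA[j];
                boolean togetherInB = labelsB[i] == labelsB[j];
                if (togetherInA == togetherInB) {
                    agreements++; // the pair is treated consistently by both clusterings
                }
                pairs++;
            }
        }
        return (double) agreements / pairs;
    }

    public static void main(String[] args) {
        int[] a = {0, 0, 1, 1, 2};
        int[] b = {0, 0, 1, 2, 2};
        System.out.printf("Rand index: %.3f%n", randIndex(a, b)); // prints 0.800
    }
}
```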