FlexDM: Simple, parallel and fault-tolerant data mining using WEKA
© Flannery et al. 2015
Received: 20 January 2015
Accepted: 9 November 2015
Published: 17 November 2015
With the continued exponential growth in data volume, large-scale data mining and machine learning experiments have become a necessity for many researchers without programming or statistics backgrounds. WEKA (Waikato Environment for Knowledge Analysis) is a gold standard framework that facilitates and simplifies this task by allowing specification of algorithms, hyper-parameters and test strategies from a streamlined Experimenter GUI. Despite its popularity, the WEKA Experimenter exhibits several limitations that we address in our new FlexDM software.
FlexDM addresses four fundamental limitations with the WEKA Experimenter: reliance on a verbose and difficult-to-modify XML schema; inability to meta-optimise experiments over a large number of algorithm hyper-parameters; inability to recover from software or hardware failure during a large experiment; and failing to leverage modern multicore processor architectures. Direct comparisons between the FlexDM and default WEKA XML schemas demonstrate a 10-fold improvement in brevity for a specification that allows finer control of experimental procedures. The stability of FlexDM has been tested on a large biological dataset (approximately 450 k attributes by 150 samples), and automatic parallelisation of tasks yields a quasi-linear reduction in execution time when distributed across multiple processor cores.
FlexDM is a powerful and easy-to-use extension to the WEKA package, which better handles the increased volume and complexity of data that has emerged during the 20 years since WEKA’s original development. FlexDM has been tested on Windows, OSX and Linux operating systems and is provided as a pre-configured virtual reference environment for trivial usage and extensibility. This software can substantially improve the productivity of any research group conducting large-scale data mining or machine learning tasks, in addition to providing non-programmers with improved control over specific aspects of their data analysis pipeline via a succinct and simplified XML schema.
Large-scale analysis is an integral component of modern life sciences research, and is becoming a necessary day-to-day task for many researchers lacking a strong programming or statistics background. This is epitomised by genomics and epigenetics studies (e.g. those leveraging microarray and next generation sequencing technologies), which require data for tens-of-thousands of genes to be quantified and simultaneously analysed [1–3]. Sophisticated machine learning and data mining tools have been introduced to address these requirements in the life sciences and other areas of quantitative research . Of these frameworks, WEKA (Waikato Environment for Knowledge Analysis) has been widely-adopted as a gold standard, having been cited more than 8000 times in academic literature .
Despite its proven success and widespread application, the WEKA Experimenter pipeline exhibits a number of limitations that make it both a) difficult to apply to non-trivial data mining challenges in modern research, and b) remain robust and reliable against the exponential growth of data volume in the two decades since its original development . Although other third-party extensions and plug-ins have been developed to address these limitations, these have been fragmentary solutions to a subset of the underlying issues [7, 8]. The following Section expands upon four major and inter-related limitations in WEKA and how these have been addressed in our new FlexDM software.
FlexDM addresses four fundamental limitations identified for machine learning/data mining experiments using the WEKA Experimenter and GUI (the most common approach for non-experts, as more advanced features are unintuitive and overall poorly documented). These limitations and their FlexDM solutions are described below. It is important to note that FlexDM-generated result summaries are fully compatible with advanced WEKA features including the Analyse tab, which allows for statistical post-processing and visualisation of results.
Reliance on WEKA experimenter GUI
Specification and optimisation of hyper-parameters
It is often unclear what combination of method hyper-parameters will yield the most insight from an experimental data-set, posing a need to evaluate many combinations of potential value assignments. WEKA provides only limited support for these feature through its CVParameterSelection and GridSearch functions, which are able to meta-optimise over 1-or-2 parameters respectively. FlexDM allows for numeric ranges to be specified for any number of combination of method hyper-parameters. Ranges may be represented in two forms: a set of categorical or numerical values; or a numeric interval with user-specified start, step and end values. FlexDM will automatically evaluate all combinations of parameter values and save the results as individual output files. The user can also specify their desired evaluation procedure (e.g. leave-one-out, k-fold cross-validation or test/training split).
Optimising over several hyper-parameters inherently worsens the computational time complexity of the experiment, necessitating improved fault-tolerance and the ability to parallelise tasks to leverage modern processor architectures. These features are addressed below.
Robustness and fault-tolerance
The WEKA Experimenter runs all experiments in series and does not generate a results file until the final experiment has concluded. Many large machine learning experiments (e.g. in the life sciences [1–3]) can take hours-or-days to conclude, particularly when meta-optimising over many combinations of hyper-parameters. Any software or hardware fault during this time will cause all intermediate results to be lost. By contrast, FlexDM automatically generates individual results files upon conclusion of each experiment, and provides functionality to allow a series of experiments to automatically resume from the most recent to successfully complete.
Asynchronous parallel processing
WEKA does not provide an intuitive means of utilising modern multi-core processing architectures, instead running a series of experiments on a single process and thread. FlexDM leverages the independent nature of individual experiments by introducing asynchronous and parallel processing. Similar features have been introduced in third-party packages including Weka-Parallel  and Grid-enabled Weka , but without the aforementioned FlexDM features necessary to maximise the value of this functionality.
This results section is separated into two parts: a practical example of the improved FlexDM XML schema when compared to its WEKA Experimenter equivalent; and an empirical analysis of the time taken to perform a large data mining task when distributed across multiple CPUs.
Figure 1 illustrates an example FlexDM XML input file for the execution of 20 experiments, which simultaneously (and in parallel) meta-optimises the ‘confidence factor’ (c) hyper-parameter for two classification algorithms (J48 and PART). Leave-one-out cross-validation was employed as the evaluation strategy, although k-fold cross-validation or percentage split could have been selected as appropriate for larger data-sets. FlexDM will load the XML file and specified data-set, asynchronously execute each experiment and summarise the results for each in individual files. FlexDM also creates a summary file reporting the overall experimental outcomes.
This easy-to-read XML specification takes only 11 lines to define a non-trivial series of experiments and hyper-parameter meta-optimisation tasks. By contrast, the equivalent WEKA Experimenter-interpreted XML specification requires 10-fold as many lines and is difficult to modify without reliance upon the GUI. These XML specifications are compared in Additional file 1.
FlexDM enables flexible, fault tolerant and computationally efficient data mining using WEKA. This framework has addressed four fundamental and interrelated limitations inherent to the standard WEKA Experimenter and GUI pipeline, while providing a novel XML-based specification of data mining experiments that are expressive, succinct and easily understood/extended by non-programmers. These improvements are necessary in the context of modern life sciences research, where the volume of data continues to increase at an exponential rate, and researchers without programming or statistics backgrounds are increasingly required to perform non-trivial data mining and machine learning experiments.
As we encourage other researchers to explore and adopt our software, FlexDM is implemented using exclusively open-source software and made available as a pre-configured virtual reference environment, using the approach recently described by Hurley et al. . A comprehensive FlexDM user guide, including reference environment instructions and examples/templates of XML experiment specifications, are available online at http://madiflannery.github.io/FlexDM/
Availability and requirements
Project Name: FlexDM
Project home page: http://madiflannery.github.io/FlexDM/
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.7 or higher
AM wishes to acknowledge the support of the School of Electrical Engineering and Computer Science, The University of Newcastle, Australia and its Work Integrated Learning program which made this study possible. The authors wish to thank Daniel G. Hurley, of the University of Melbourne’s Systems Biology Laboratory, for his assistance preparing and testing a virtual reference environment implementation of FlexDM.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Budden DM, Hurley DG, Cursons J, Markham JF, Davis MJ, Crampin EJ. Predicting expression: the complementary power of histone modification and transcription factor binding data. Epigenetics Chromatin. 2014;7(1):1–12.Google Scholar
- Budden, D.M., D.G. Hurley, and E.J. Crampin, Predictive modelling of gene expression from transcriptional regulatory elements. Briefings in bioinformatics. 2014;16(4):616–28.Google Scholar
- Budden DM, Hurley DG, Crampin EJ. Modelling the conditional regulatory activity of methylated and bivalent promoters. Epigenetics Chromatin. 2015;8(1):1–10.View ArticleGoogle Scholar
- Marsden J et al. Language individuation and marker words: Shakespeare and his maxwell’s demon. PLoS One. 2013;8(6):e66813.PubMed CentralView ArticlePubMedGoogle Scholar
- Hall M et al. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsl. 2009;11(1):10–8.View ArticleGoogle Scholar
- Holmes, G., A. Donkin, and I.H. Witten. Weka: A machine learning workbench. Proceedings of the IEEE 2nd Australian and New Zealand Conference on Intelligent Information Systems. Brisbane, Australia; 1994. pp. 357-61.Google Scholar
- Celis, S. and D.R. Musicant. Weka-parallel: machine learning in parallel. Northfield, USA: Technical Report, Carleton College, CS-TR; 2002.Google Scholar
- Khoussainov R, Zuo X, Kushmerick N. Grid-enabled Weka: A toolkit for machine learning on the grid. ERCIM News. 2004;59:47–8.Google Scholar
- Hurley, D.G., D.M. Budden, and E.J. Crampin, Virtual Reference Environments: a simple way to make research reproducible. Briefings in bioinformatics. 2014;16(5):901-3.Google Scholar