BioFlow: a web based workflow management software for design and execution of genomics pipelines
© Garner and Puthige; licensee BioMed Central Ltd. 2014
Received: 18 November 2013
Accepted: 5 September 2014
Published: 18 September 2014
Bioinformatics data analysis is usually done sequentially by chaining together multiple tools. These pipelines are created by writing scripts and tracking the inputs and outputs of all stages. Writing such scripts requires programming skills. Executing multiple pipelines in parallel and keeping track of all the generated files is difficult and error prone. Checking results and task completion requires users to log in remotely to their servers and run commands to identify process status. Users would benefit from a web-based tool that allows creation and execution of pipelines remotely. The tool should also keep track of all the files generated and maintain a history of user activities.
A software tool for building and executing workflows is described here. The individual tools in the workflows can be any command line executable or script. The software has an intuitive mechanism for adding new tools to be used in workflows. It contains a workflow designer where workflows can be created by visually connecting various components. Workflows are executed by job runners. The outputs and the job history are saved. The tool is web based and all actions can be performed remotely.
Users without scripting knowledge can utilize the tool to build pipelines for executing tasks. Pipelines can be modeled as workflows that are reusable. BioFlow enables users to easily add new tools to the database. The workflows can be created and executed remotely. A number of parallel jobs can be easily controlled. Distributed execution is possible by running multiple instances of the application. Any number of tasks can be executed and the output will be stored, making it easy to correlate the outputs with the jobs executed.
High throughput Next Generation Sequencing techniques are producing data at a very rapid pace. The large data scale has resulted in the creation of several tools for faster processing and analysis. Bioinformatics datasets are often processed in stages. Pipelines are created so that at each stage a software package (usually a command line tool) is executed and the output produced is passed as input to the next stage.
There are multiple tools available for use at any stage in the pipeline and these tools support their own command formats. Such sequence analysis pipelines require researchers to write scripts to control the pipeline execution. Writing these scripts requires knowledge of a programming or scripting language such as Perl, Python or bash. When multiple such pipelines have to be executed, users resort to writing more scripts to control the execution order of other scripts. Users should also be able to differentiate the output files generated by various tools and isolate any failed tasks so that they can be re-executed.
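To make the scripting burden concrete, the following is a minimal sketch of the kind of control script a researcher would otherwise write by hand. It is not taken from BioFlow. The function name run_pipeline and the two placeholder stages are hypothetical; the stages are small Python one-liners standing in for real command line tools, each called with an input path and an output path.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_pipeline(stages, first_input, workdir):
    """Run stages in order, passing each stage's output file to the next stage.

    `stages` is a list of (name, argv_prefix) pairs. Every command is invoked
    as argv_prefix + [input_path, output_path]. Execution stops at the first
    stage that exits with a non-zero status.
    """
    current = Path(first_input)
    for index, (name, argv_prefix) in enumerate(stages):
        # One output file per stage, so results can be traced back to a stage.
        output = Path(workdir) / f"{index:02d}_{name}.out"
        result = subprocess.run(
            argv_prefix + [str(current), str(output)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            raise RuntimeError(f"stage {name!r} failed: {result.stderr.strip()}")
        current = output
    return current


# Placeholder "tools" (assumptions for illustration, not real bioinformatics tools).
UPPER = [sys.executable, "-c",
         "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"]
REVERSE = [sys.executable, "-c",
           "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read()[::-1])"]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "reads.txt"
        source.write_text("acgt")
        final = run_pipeline([("upper", UPPER), ("reverse", REVERSE)], source, tmp)
        print(final.name, final.read_text())
```

Even in this small form, the script author has to decide on file naming, failure handling and stage ordering, which is the bookkeeping a workflow manager is meant to take over.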
Bioinformatics pipelines can be modeled as workflows where each work item is a stage (executable) in the pipeline. Workflow management software allows for the creation and execution of workflows. Such software is available either as command line controlled tools that enable users to program and build custom workflows, or as tools with a user interface for predefined use cases. Web based workflow managers provide great flexibility and enable users to access them from any remote location through a browser. These allow researchers to monitor all executing tasks or create new tasks with minimal programming requirements.
Taverna and Galaxy are commonly used for workflow automation in bioinformatics. Both support web based workflow execution through the browser. Creating workflows in Taverna is not supported through the web interface but can be performed by installing a standalone workflow designer. Galaxy allows the addition of new tools for executing locally installed command line executables and scripts by writing a tool configuration XML file. Writing this file requires knowledge of XML. Bpipe and Ruffus are other software packages for working with workflows. Bpipe is a Java tool and Ruffus is available as a Python module. Bpipe allows users to define various execution blocks that can be joined together to create data pipelines. The Ruffus module uses decorators to tag functions and create an ordering of tasks. However, using either of them requires programming and scripting skills.
To overcome these shortcomings we have created a software program called BioFlow to greatly expand the ability of non-programmers to build sophisticated data analysis systems. BioFlow is a web-based tool for creating and managing pipelines. It has been created to be a simple and easy to use workflow management tool. The aim is to reduce the effort required in writing scripts for creating pipelines and to enable remote execution of tasks from any browser connected to the network. It has been designed so that tools can be added quickly, and hence the interface has been kept simple. Once tools are added, they can be reused and users do not need to remember command line requirements. Creating workflows is made easy by using a workflow designer. The output and status of each command are saved, which greatly simplifies the task of identifying errors and rerunning pipelines.
To store the data in the backend, we use a MySQL database. A developer can easily modify the configuration files to use any database of their choice. BioFlow can be deployed on any operating system supported by the Ruby on Rails runtime library, including Windows, Linux and Mac. It is deployed on an Apache server using Phusion Passenger.
Model view controller
BioFlow follows a model view controller pattern for development. The three layers are kept separate and each layer can be independently modified without significantly affecting other layers.
The model layer mimics the database tables and it consists of different models for representing the tools and workflows. Each workflow consists of multiple job models and each job contains a result model. The view layer is built using HTML, CSS and jQuery. There are different views for adding tools, creating workflows and displaying outputs. Each view communicates with a different controller. This makes BioFlow modular and enables the layers to function independently of each other.
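The relationships between the models described above can be sketched as plain data classes. This is an illustrative approximation in Python, not BioFlow's actual Rails models: the class and field names (Tool, Job, Result, Workflow, exit_status, output_path, failed_jobs) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Tool:
    """A registered command line executable or script (hypothetical fields)."""
    name: str
    executable: str


@dataclass
class Result:
    """Outcome stored for one executed job (hypothetical fields)."""
    exit_status: int
    output_path: str


@dataclass
class Job:
    """One stage of a workflow: a tool plus the parameters it is run with."""
    tool: Tool
    parameters: Dict[str, str] = field(default_factory=dict)
    result: Optional[Result] = None  # filled in once the job has run


@dataclass
class Workflow:
    """A workflow consists of multiple jobs; each job holds its own result."""
    name: str
    jobs: List[Job] = field(default_factory=list)

    def failed_jobs(self) -> List[Job]:
        # Jobs that ran and exited with a non-zero status.
        return [
            job for job in self.jobs
            if job.result is not None and job.result.exit_status != 0
        ]
```

Keeping the result attached to the job is what makes it straightforward to correlate an output with the step that produced it.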
The data exchanged between the client and controller is in JSON format. Browsers are optimized for processing JSON and therefore it was a natural choice for the data format. When users create workflows in the browser, a duplicate workflow is created on the server side which is serialized and saved in the database. The order of passing outputs from one tool to another down the pipeline is stored along with the workflows.
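As a rough illustration of how a workflow and the order in which outputs are passed between tools could be represented in JSON, consider the sketch below. The paper does not publish BioFlow's actual schema, so the keys used here ("tools", "connections", "from", "to") and the tool names are hypothetical.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow document as a browser client might send it.
workflow = {
    "name": "example-workflow",
    "tools": {
        "1": {"tool": "aligner", "params": {"threads": 4}},
        "2": {"tool": "sorter", "params": {}},
        "3": {"tool": "variant_caller", "params": {"min_quality": 20}},
    },
    # Each connection says: the output of "from" becomes the input of "to".
    "connections": [
        {"from": "1", "to": "2"},
        {"from": "2", "to": "3"},
    ],
}


def execution_order(document):
    """Return tool ids in an order that respects every connection."""
    predecessors = {tool_id: set() for tool_id in document["tools"]}
    for link in document["connections"]:
        predecessors[link["to"]].add(link["from"])
    return list(TopologicalSorter(predecessors).static_order())


# Serialize for storage or transfer, then restore without loss.
payload = json.dumps(workflow)
restored = json.loads(payload)

if __name__ == "__main__":
    print(execution_order(restored))
```

Storing the connections alongside the tools means the execution order never has to be saved separately; it can be recomputed from the document whenever the workflow is loaded.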
Bioinformatics tasks are by nature long running, so there are separate views for creating workflows and viewing outputs. Web applications are not tolerant of delays, so whenever a request to execute a workflow is received, BioFlow only stores it in the job execution queue and returns. The job is later picked up by the background execution engine and executed. BioFlow uses a gem called delayed_job, which executes tasks using job runners. Each job runner executes a single workflow at a time, so the number of job runners started determines how many workflows run in parallel on the server. Controlling this number lets users match the load to the server’s computing capabilities, so that the server is neither underutilized nor overly strained.
BioFlow supports execution on multiple machines simultaneously by sharing the database among various instances. By configuring all instances to point to the same database, the workflow queue can be shared. If job runners are started on different machines, they will be able to pick up tasks from the database. However, the instances must also share the files on which the workflows operate. This is easily achieved by using shared or network-mounted drives. BioFlow is built for small environments and hence it allows users to manually select the server on which a task should be executed. This is useful when one of the servers is running computation-intensive tasks while another server is idle.
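The queue-and-runner pattern described above can be sketched in a few lines. This is a generic Python illustration using threads and an in-memory queue, not the delayed_job implementation; the class name JobQueue and its methods are hypothetical.

```python
import queue
import threading


class JobQueue:
    """Accept jobs immediately and execute them on a fixed number of runners."""

    def __init__(self, runner_count):
        self._pending = queue.Queue()
        self._lock = threading.Lock()
        self.results = {}
        # The number of runner threads caps how many jobs execute in parallel.
        for _ in range(runner_count):
            threading.Thread(target=self._run, daemon=True).start()

    def enqueue(self, job_id, task):
        """Store the job and return at once, as a web request handler would."""
        self._pending.put((job_id, task))

    def _run(self):
        while True:
            job_id, task = self._pending.get()
            try:
                outcome = ("done", task())
            except Exception as error:
                # Record the failure so it can be inspected later.
                outcome = ("failed", str(error))
            with self._lock:
                self.results[job_id] = outcome
            self._pending.task_done()

    def wait(self):
        """Block until every queued job has finished."""
        self._pending.join()


if __name__ == "__main__":
    jobs = JobQueue(runner_count=2)
    jobs.enqueue("wf-1", lambda: "aligned")
    jobs.enqueue("wf-2", lambda: 1 / 0)
    jobs.wait()
    print(jobs.results)
```

Because enqueue only places the job on the queue, the caller is never blocked by a long-running task, and raising or lowering runner_count is the single knob for controlling server load.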
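One way a database-backed queue can be shared between instances is sketched below. SQLite is used here only as a self-contained stand-in for the MySQL database named in the paper, and the table layout and function names (open_queue, submit, claim_next) are assumptions for illustration rather than BioFlow's or delayed_job's actual schema. The optional server column models the manual server selection described above.

```python
import sqlite3


def open_queue(path):
    """Open a connection to the shared queue database (autocommit mode)."""
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " workflow TEXT NOT NULL,"
        " server TEXT,"  # NULL means any server may run the job
        " status TEXT NOT NULL DEFAULT 'queued',"
        " claimed_by TEXT)"
    )
    return conn


def submit(conn, workflow, server=None):
    """Queue a workflow, optionally pinned to a named server."""
    cursor = conn.execute(
        "INSERT INTO jobs (workflow, server) VALUES (?, ?)", (workflow, server)
    )
    return cursor.lastrowid


def claim_next(conn, runner_server):
    """Claim the oldest eligible job, or return None if there is none."""
    while True:
        row = conn.execute(
            "SELECT id, workflow FROM jobs"
            " WHERE status = 'queued' AND (server IS NULL OR server = ?)"
            " ORDER BY id LIMIT 1",
            (runner_server,),
        ).fetchone()
        if row is None:
            return None
        # The status check in the WHERE clause makes the claim atomic:
        # only one runner's UPDATE can change a given row from 'queued'.
        claimed = conn.execute(
            "UPDATE jobs SET status = 'running', claimed_by = ?"
            " WHERE id = ? AND status = 'queued'",
            (runner_server, row[0]),
        )
        if claimed.rowcount == 1:
            return row
        # Another runner claimed it first; look for the next job.
```

Each instance would call claim_next with its own server name, so a job pinned to one server is never picked up by a runner on another.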
Results and discussion
Adding command line tools to the database
The workflows can be saved along with the parameters provided to enable reusing workflows for performing the same analysis on different input datasets.
BioFlow has a notifications panel which automatically displays the current status of the workflow, giving the user constant feedback. Updates are displayed when the parameters are changed or the job name is modified. Since a workflow consists of multiple tools executed in succession, a notification is also displayed whenever a tool starts or completes. This gives the user an indication of the progress of the workflow.
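The start and completion notifications can be pictured as events emitted around each tool in the sequence. The sketch below is a simplified Python illustration and not BioFlow's code; the function name run_with_notifications is hypothetical, and the "failed" event is an added assumption, since the text only mentions start and completion.

```python
from typing import Callable, List, Tuple

Notification = Tuple[str, str]  # (tool name, event)


def run_with_notifications(
    tools: List[Tuple[str, Callable[[], None]]],
    notify: Callable[[Notification], None],
) -> bool:
    """Run tools in succession, reporting when each one starts and completes.

    Returns True if every tool finished, False if one raised an exception.
    """
    for name, action in tools:
        notify((name, "started"))
        try:
            action()
        except Exception:
            # Assumed behavior: report the failure and stop the workflow.
            notify((name, "failed"))
            return False
        notify((name, "completed"))
    return True


if __name__ == "__main__":
    events: List[Notification] = []
    run_with_notifications(
        [("aligner", lambda: None), ("sorter", lambda: None)],
        events.append,
    )
    for tool_name, event in events:
        print(f"{tool_name}: {event}")
```

In a web application the notify callback would append to a store that the notifications panel polls, rather than to an in-memory list.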
Installing BioFlow requires the Ruby on Rails runtime library, a MySQL database and an Internet connection to download the required gems. After downloading BioFlow, users need to run "bundle install", which automatically downloads all the required gems. Next, the config/database.yml file needs to be edited to specify the database host name. Once the database is seeded, the application can be started using "rails server". By default it runs on port 3000 and can be accessed by pointing a browser at that port on the server.
BioFlow has been designed to simplify the entire process of creating and executing workflows. The simple user interface for adding tools enables users to quickly add executables and scripts to BioFlow. The mechanism of visually connecting tools to build workflows allows users without programming skills to create pipelines. The functionality otherwise provided by scripting can be achieved using the workflow designer. The stored outputs and history help users debug errors and rerun only the failed pipelines. The number of parallel jobs can be controlled using job runners. This eliminates the need to write parent scripts that control other child scripts. BioFlow also provides basic distributed execution capabilities, allowing users to utilize multiple servers in their environment. Hence, it greatly simplifies the task of creating, executing and tracking the outputs of bioinformatics data processing pipelines.
Availability and requirements
HG is a Professor at Virginia Bioinformatics Institute, Virginia Tech and Virginia Tech Carilion School of Medicine; AP is a Master's student in the Department of Computer Science, Virginia Tech.
The project was funded by HG under Medical Informatics and Systems Division Directors Funds.
- Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, Li P: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics. 2004, 20 (17): 3045-3054. 10.1093/bioinformatics/bth361.
- Goecks J, Nekrutenko A, Taylor J, The Galaxy Team: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010, 11 (8): R86. 10.1186/gb-2010-11-8-r86.
- Galaxy Wiki – Adding tools to Galaxy. http://Wiki.galaxyproject.org/Admin/Tools.
- Sadedin SP, Pope B, Oshlack A: Bpipe: a tool for running and managing bioinformatics pipelines. Bioinformatics. 2012, 28 (11): 1525-1526. 10.1093/bioinformatics/bts167.
- Goodstadt L: Ruffus: a lightweight Python library for computational pipelines. Bioinformatics. 2010, 26 (21): 2778-2779. 10.1093/bioinformatics/btq524.
- Phusion Passenger – App server for ruby. https://www.phusionpassenger.com/.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.