In this section we describe two computational experiments performed in the BCJ environment, highlighting many of the features described above. The two examples were chosen from significantly different domains in order to demonstrate that the BCJ environment is sufficiently flexible to support a wide variety of applications. A user manual for the BCJ can be found in [12].
Sequence-based investigation
To demonstrate sequence analysis in the BCJ environment, a simple example experiment was implemented that tries to identify a protein that may be a potential target for drug therapy in treating malaria. If there is sufficient divergence between some protein of the parasite and the host, then inhibition of the parasite protein is a potential target for drug therapy. In this experiment, we are interested in the divergence between the mosquito Plasmodium falciparum casein kinase and its vertebrate hosts. There are three applications used by the experiment: BLAST, CLUSTALW, and HMMER. The experiment involves the following computations: (1) identify similar proteins to the human casein kinase enzyme using BLAST, (2) create a global alignment using the sequences of the similar proteins that were extracted from the BLAST report using CLUSTALW, (3) construct a hidden Markov model (HMM) based on this multiple alignment, using components of the HMMER program suite, (4) run the model against the proteins in the Plasmodium falciparum genomic database (PlasmoDB) to identify proteins that may be functionally equivalent to human casein kinase.
In this experiment, we build two workflows. The first workflow contains the HMMER components. The second workflow, which will contain the complete experiment, will incorporate the HMMER workflow and the BLAST and CLUSTALW components.
The HMMER [9] program suite for biosequence analysis using profile HMM contains several components that are frequently used in a fixed order. In this workflow, we define a reusable component that may find value in other studies involving HMM. This workflow will construct an HMM based on the multiple alignment input, calibrate the model, and finally search a sequence database using this model.
To increase the flexibility of this computational component, the database is provided as a second input. As a design alternative, we could have chosen to specifically include PlasmoDB in this workflow (as a data source block) and have a single input to this workflow. A picture of the final workflow can be seen in Figure 3.
This workflow could be useful to a wide variety of researchers and does not contain any sensitive data or novel computation techniques, thus we set the group access control for this entry to ALL, so any user in the BCJ environment will have access to this workflow. To make the entry easier for others to find we could place the entry in a journal designated for shared workflows, and/or add an annotation briefly describing its functionality, so that other users can locate it with a search query.
In the complete workflow, we use the previously defined HMMER workflow as a resource. It also includes BLAST, to find proteins similar to the input protein, and CLUSTALW, to perform a multiple alignment of these similar proteins. The CLUSTALW program does not accept input in the format generated by the BLAST program, so the block, Extract Sequence, was inserted between them to perform the appropriate data conversion. Resource definitions include type information for each of their ports; the Workflow Editor does not permit connections between ports with incompatible types. These constraints help users construct only meaningful and valid workflows. Resources such as Extract Sequences serve the necessary role of automatic resolution of data reformatting. In addition, users can define new resource definitions that perform data format conversions to meet their specific requirements. The completed workflow is shown in Figure 2.
To define an experiment based on this workflow, the user provides a protein sequence input and a database to be searched with HMMER. For this example, Amino Acid Sequence was used, for the input is human casein kinase. The Search DB input is PlasmoDB. The output file, Related Proteins, will contain the sequences from PlasmoDB that are most similar to the profile HMM built from the human casein kinase related proteins. Figure 5 shows the completed experiment definition.
When defining the experiment, the user is queried to specify access control for the experiment definition. For example, the user may want to restrict any other access to the experiment and results, by setting the access control to NONE, pending validation of results. Later, the owner can modify access control to allow access by a wider range of users. At the completion of the experiment definition, a collection of jobs to perform the tasks defined by the workflow are submitted to the computational cluster for execution. The user tracks the progress of the experiment using the Queue Manager view.
The output entry generated by the experiment contains the top sequences from PlasmoDB that match the profile created by HMMER. From the results (Figure 6), we see that the top result has the gene ID of PF11_0377. This gene happens to encode P. falciparum's casein kinase 1. Based on these results, we see that this kinase would not be a good candidate for drug therapy.
The user can summarize what has been learned and annotate the experiment as shown in Figure 7. These annotations communicate reviews to other users and provide a log for the author. Annotations also enable search capabilities, since neither the experiment definition nor the output entry directly contain the lessons learned from the experiment.
If the researcher has other candidate proteins to study, another experiment can reuse this workflow definition. Here, the user simply repeats the experiment definition step, and supplies another protein for the Amino Acid Sequence input. The BCJ environment emphasizes that providing a different sequence input is a new experiment definition, rather than a modification of the original experiment. It will be useful to keep the original experiment, even though it was not successful in locating a promising drug target. These results may be helpful months later if someone asks, "Have you considered casein kinase as a possible target for malaria?" The BCJ environment provides three alternatives to find an answer for this question:
-
1.
The dependent viewer in the BCJ environment is particularly helpful for answering such questions. The researcher could select the casein kinase sequence entry, and find all experiments in which it is used as an input;
-
2.
Alternatively, the researcher could use the dependency viewer, which is effectively the inverse of the dependent viewer. With this viewer, the user can select the workflow definition and immediately find all experiments, based on the workflow, that have been executed;
-
3.
Perform a content-based search to locate an annotation that summarizes the result of a previous experiment.
Molecular dynamics experiment
Typically, molecular dynamics (MD) simulations involve a large number of simulations, where each varies from the previous only in the input data. The organization of the computations is often tailored to some specific application such as drug design or methodology development. Analysis of the simulation results will vary widely between simulation goal and investigators' styles. In addition, although the simulations are repetitive, there are numerous places during the study where interactions with the process and output are needed. Here, we demonstrate the use of the BCJ on such a representative MD simulation. For this experiment, we use the GROMACS package [13] and an example based on a tutorial [14].
There are five major steps in this experiment: (1) converting the protein structure from a common database format, PDB (Protein Data Bank), into a format suitable for GROMACS, (2) minimizing the energy of the system, (3) solvating the protein in a simulation box with water, (4) performing an MD simulation from the initial state, and (5) analyzing the simulation results. (Often, proteins are solvated then minimized, but for purposes of demonstration, the protein is minimized first.)
Each step can be run as an independent experiment with the outputs of each step appearing as entries in the user's journal. This stepwise procedure is useful for reviewing intermediate results. Alternatively, the steps can be incorporated into a single workflow, which is useful for a tested simulation procedure. The current example represents an intermediate approach in order to demonstrate aspects of the BCJ environment; two sub-workflows are combined into another workflow to perform the end-to-end experiment.
The workflow "Energy Min" shown in Figure 1 combines the first two steps of the procedure, since these two steps are standard initialization steps. This workflow has been designed with reuse in mind. The precise parameter settings for minimization may depend on the simulated molecule (one of the inputs); thus, a second input is defined for this workflow "MD params," which will translate the simulation parameters into GROMACS format. In addition, several output ports are defined for this workflow that are not specifically required in the example here. However, these additional outputs may be of use in another application that chooses to reuse this workflow.
The workflow "MD Sim" shown in Figure 8 combines steps three through five of the procedure outlined above. As with the previous workflow, care was taken so that the workflow could be reused. The "Solvent" input for the resource "genbox" is provided by the data source labeled "spc216.gro." An entry for this data in the BCJ environment was created by using the "Create New Entry" interface to import this information. It contains a GROMACS-specific description of "Simple Point Charge water" that is often used as the solvent in GROMACS experiments.
The workflow "Tutorial Example" shown in Figure 9 represents the complete process from the GROMACS tutorial. It makes use of the two supporting workflows described in the previous sections, demonstrating the hierarchical workflow definition integral to the BCJ environment. As described above, the supporting workflows were not designed only for this example tutorial. Each was designed generally, so that it could be used as a component in other molecular dynamics studies. Outputs not needed can be discarded by connecting each of them to a "Sink" block. The Workflow Editor requires these explicit connections since the ports to which they are connected were required. This consistency check by the Workflow Editor helps to ensure workflow correctness.
The workflow "Tutorial Example" defines only the process abstracted from the GROMACS tutorial. To begin a new project (or experiment), the user must import the original input data from an external source such as a public database. When data is imported to the BCJ, an entry is created and its content is initialized to that of a file provided by the user. The BCJ physically copies the data into the environment with subsequent references to the data in the BCJ through the entry. Thus, this experiment began by having the user import a number of files into the BCJ. Consequently, the workflow, Tutorial Example, allows one to perform this experiment for any protein from the PDB. Figure 10 displays the specific experiment definition corresponding to the example in the tutorial. The protein supplied as input is lysozyme, a serine protease. The MD parameters for the energy minimization are those supplied in the tutorial.
Numerous specialized programs have been developed for molecular visualization and interaction with molecular dynamics simulation. Rather than attempt to develop another molecular visualization subsystem directly in the BCJ environment (and then require BCJ users to become familiar with a new user interface), the BCJ environment invokes an existing tool, integrated as an external viewer, to display these results. For example, VMD, Visual Molecular Dynamics [15], can be used to display the output structure.