Genome-wide expression profiling is achieved by high-throughput sequencing (RNA-Seq) or by expression microarrays. An initial analysis of the resulting data is typically achieved by sorting the entirety of all genes by statistical evidence for differential expression compared to a control condition. For downstream analysis, statistics about the functional properties enriched in a set of differentially expressed genes are needed as well as information on mutual dependencies between genes or larger molecular networks. The nematode worm Caenorhabditis elegans is a widely used model organism in developmental biology and genetics [1] and subject to multiple studies based on high-throughput gene expression data (e.g. [2]). In this paper, we describe Wormpath, a software enabling the discovery of complex networks among a set of differentially expressed genes which is specifically optimized for use in C. elegans studies. Recently, we used Wormpath as part of a larger study on the DNA damage response [3].
Since its inception in 1998, the Wormbase [4] has emerged as a widely used repository for genomic information on C. elegans and other nematodes, including comprehensive knowledge on previously reported molecular interactions. These data comprise genetic as well as physical interactions and are based on a large variety of experimental methods such as yeast one-, two-, and three-hybrid assays as well as RNAi in mutant backgrounds. The interactions are both manually and automatically curated with an emphasis on including large-scale data sets [5]. Wormpath takes advantage of this resource to search for networks between members of a user-provided list of differentially expressed genes. For any of these networks, the software reports statistical measures, graphical plots and functional annotation by GO terms.
Other tools for molecular network discovery from high-throughput data include Cytoscape [6] or web-based software like the Pathway Projector [7] – a selection of related tools was reviewed by Thomas and Bonchev [8]. As the aforementioned software programs, Wormpath is freely available but, in contrast to the others, it is specifically tailored for analyses in C. elegans and closely interacts with the comprehensive knowledge in the Wormbase. By adopting our software, researchers from the worm field can now use extensive information on molecular interactions of C. elegans without additional efforts like local software installation or database maintenance.
Implementation
Our software ignores interactions which are only computationally predicted, e.g. from orthologous interactions, but not experimentally confirmed in C. elegans. Furthermore, we call an indirect interaction a pair of interactions that connect two nodes representing genes from the list of differentially expressed genes to an additional node that acts as a linker gene which is not differentially expressed. This reflects the fact that a key gene might have been missed, for instance due to technical reasons. Given a list \( \mathbb{L} \) of differentially expressed genes, our approach relies on a large basic graph G = (V, E) with a set V of vertices and a set E of edges, representing the topology of the network between members of \( \mathbb{L} \). Genes are represented by nodes in V and interactions between genes or their proteins are represented by edges in E labeled as either direct or indirect.
Searching the graph G for networks is closely related to the problem of finding the connected components of G defined as all connected subgraphs which are not subgraphs of any larger connected graph, where a graph is called connected if each pair of its nodes is connected by a path. Searching the connected components is a classical problem for which a typical solution is formed e.g., by a depth-first search algorithm [9]. Members of larger candidate gene sets often form highly interconnected networks because it is likely that many of them are players in the same biological processes. Thus, the connected components of the basic graph easily become very large themselves. To avoid this, our approach is controlled by an iteration depth D defining the number of allowed steps from a start node. The software localizes the first gene of \( \mathbb{L} \) on G and searches all d-neighbours, i.e. all nodes that are reachable in at most d∈{1, …, D} steps from that position, with both direct and indirect interactions taken into account. After replacing the indirect interactions by the two respective interactions plus a new node for the linker gene, the software builds a new graph N = (V
N
, E
N
) composed of only the d-neighbours and the edges between them. This procedure is repeated with the remaining genes of the list. Statistical significance of the networks is assessed and the results are sorted according to their network scores (see below). The algorithm is then repeated with D replaced by D-1, …, 1 to generate additional networks which are of smaller complexity.
To assess the statistical features of a network N in the results list, N is assigned a score defined as the average number of citations supporting its edges, providing a measure of the strength of experimental evidence for N. To test whether this network score is larger than expected by chance alone, the score is re-calculated for a given number of randomly simulated networks and a p-value is calculated as the fraction of those simulated scores which are at least as large as the original score.
Furthermore, the induced sub-network composed of all 1-neighbours of N is given by N
1 = {(v, e) ∈ G | v ∈ V; ∃ ṽ∈V
N
: d(v, ṽ) = 1}. We consider the null hypothesis H
0
that the proportion of genes that are elements of N
1 is equal in \( \mathbb{L} \) and in the entirety of all C. elegans genes. If H
0
can be rejected, we conclude that the members of N
1 are significantly over-represented in \( \mathbb{L} \) and the presence of N in \( \mathbb{L} \) is unlikely to have occurred by chance alone.
If the database contains a total number of N genes, the number X of members of N
1 contained in \( \mathbb{L} \) is a hypergeometrically distributed random variable with parameters N, |N
1|, and |\( \mathbb{L} \)| and therefore has got the probability density function fX with
$$ {f}^X\left(\mathrm{k}\right):=\mathrm{P}\left(\mathrm{X}=\mathrm{x}\right)=\frac{\left(\begin{array}{c}\hfill N-\left|{\boldsymbol{N}}^1\right|\hfill \\ {}\hfill \left|\mathbb{L}\right|-k\hfill \end{array}\right)\left(\begin{array}{c}\hfill \left|{\boldsymbol{N}}^1\right|\hfill \\ {}\hfill k\hfill \end{array}\right)}{\left(\begin{array}{c}\hfill N\hfill \\ {}\hfill \left|\mathbb{L}\right|\hfill \end{array}\right)} $$
From this, the p-value for N can be calculated by \( p={\displaystyle {\sum}_{\tilde{k}=k}^n{f}^X\left(\tilde{k}\right)} \). If this p-value is sufficiently small, H
0
can be rejected at the predefined significance level.
Our approach to statistical assessment of the networks resulting from a Wormpath analysis covers two distinct scientific aspects. The network score and its corresponding score-based p-value provide a measure of the strength of experimental evidence for the interactions involved in the network. On the other hand, the result from the second significance test termed the list-based p-value tells us to what extent differentially expressed genes are over-represented in the network compared to the entire genome. In some cases, these two approaches can lead to opposite results for the same network (see Results and Discussion); however, their combined information draws a broader picture of the statistical features of the respective molecular network than one of them alone.
The Wormpath software itself has been implemented using Perl/CGI and takes advantage of GraphViz [10] to layout the networks and generate the graphical output. Users can browse their results online or save a compressed version of the output to their local computer. The computation is based on a local database synchronized with the Wormbase using the AcePerl API. The source code of Wormpath as of version 1.0.4 is provided as a supplement to this manuscript (Additional file 1).