A proof of the DBRF-MEGN method, an algorithm for deducing minimum equivalent gene networks

Background: We previously developed the DBRF-MEGN (difference-based regulation finding-minimum equivalent gene network) method, which deduces the most parsimonious signed directed graphs (SDGs) consistent with expression profiles of single-gene deletion mutants. However, until the present study, we had not presented the details of the method's algorithm or a proof of the algorithm.
Results: We describe in detail the algorithm of the DBRF-MEGN method and prove that the algorithm deduces all of the exact solutions of the most parsimonious SDGs consistent with expression profiles of gene deletion mutants.
Conclusions: The DBRF-MEGN method provides all of the exact solutions of the most parsimonious SDGs consistent with expression profiles of gene deletion mutants.


Background
Identification of gene regulatory networks (hereafter called gene networks) is essential for understanding cellular functions. Large-scale gene deletion projects [1][2][3][4] and DNA microarrays [5,6] have enabled the creation of large-scale gene expression profiles of gene deletion mutants [7,8]; these large-scale profiles comprise the expression levels of thousands of genes measured in deletion mutants of those genes. Such profiles are invaluable sources for identifying gene networks. Many procedures have been developed for inferring gene networks from such profiles [9][10][11][12][13][14][15][16][17][18]. Kyoda et al. developed the DBRF-MEGN (difference-based regulation finding-minimum equivalent gene network) method, an algorithm for inferring gene networks from large-scale gene expression profiles of gene deletion mutants [14]. In this algorithm, gene networks are modeled as signed directed graphs (SDGs) in which a regulation between two genes is represented as a signed directed edge. The sign of the edge (positive or negative) represents whether the effect of the regulation is activation or inhibition, and its direction represents which gene regulates which other gene. The algorithm deduces the most parsimonious SDGs consistent with the expression profiles. Kyoda et al. showed that the method is applicable to large-scale gene expression profiles of gene deletion mutants and that networks deduced by the method are valid and useful for predicting functions of genes [14]. However, details of the method's algorithm and a proof of the algorithm have not previously been published.
Here we describe in detail the algorithm of the DBRF-MEGN method and prove that the algorithm provides all of the exact solutions of the most parsimonious gene networks consistent with expression profiles of gene deletion mutants.

Implementation
The software of the DBRF-MEGN method was written in C++ under Linux. The complete source code files, a binary Linux executable file, and the software manual are available [see Additional File 1].

Difference-based deduction of initially deduced edges and the minimum equivalent gene networks
The DBRF-MEGN method consists of five processes, namely (1) difference-based deduction of initially deduced edges, (2) removal of non-essential edges from the initially deduced edges, (3) selection of the uncovered edges in main components from the non-essential edges, (4) separation of the uncovered edges in main components into independent groups, and (5) restoration of the minimum number of edges from each independent group [14]. First, we define a gene network modeled as an SDG:
Definition 1: A signed directed graph (SDG) is given by a tuple $G = (V, E, f)$ with a set $V$ of nodes (genes), a set $E \subseteq V \times V$ of directed edges, and an edge sign function $f: E \to \{\pm 1\}$, which is an integral part of an SDG.
* Correspondence: sonami@riken.jp. 1 Laboratory for Developmental Dynamics, RIKEN Quantitative Biology Center, and Advanced Computational Sciences Department, RIKEN Advanced Science Institute, 1-7-22 Suehirocho, Tsurumi, Yokohama 230-0045, Japan. Full list of author information is available at the end of the article.
The first process of the DBRF-MEGN method is "difference-based deduction of initially deduced edges" (Figure 1b), which uses an assumption that is commonly made in genetics and cell biology [14], i.e., there exists a positive (negative) regulation from gene A to gene B when the expression level of gene B in the deletion mutant of gene A is significantly lower (higher) than in the wild-type (Figure 1a). For each possible pair of genes in the profiles, the process determines whether positive (negative) regulations between those genes exist and deduces all edges consistent with both the assumption and the profiles by detecting the difference in expression levels between the wild type and deletion mutants; we call these edges initially deduced edges.
Definition 2: Let us assume that intervention experiments have been performed for the gene set $J$, $J \subseteq V$. Let $D = (d_{jk}) \in \mathbb{R}^{J \times V}$ be a matrix such that $d_{jk}$ represents the expression of gene $k$ after an intervention in gene $j$ (relative to wild-type expression). From this, we deduce the graph of initially deduced edges, $G_{ide} = (V, E_{ide}, f)$. We assume a negative regulation of $k$ by $j$ if $d_{jk} > \alpha$ for some suitably chosen constant $\alpha$. Analogously, a positive regulation of $k$ by $j$ is postulated whenever $d_{jk} < \beta$ for some $\beta$ (sensibly, we require $\beta < 0 < \alpha$). Formally, $E_{ide} = \{(j, k) \in J \times V \mid d_{jk} > \alpha \text{ or } d_{jk} < \beta\}$, and $f: E_{ide} \to \{\pm 1\}$ is given by $f((j,k)) = 1$ if there is a positive regulation of $k$ by $j$, and otherwise $f((j,k)) = -1$.
The thresholds α and β determine the significance of the difference in expression levels between the wild type and deletion mutants. These thresholds can be specified by various procedures such as by using fold-change or the statistical significance of the expression level [7,8,14,19,20].
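As an illustration of Definition 2, the following is a minimal C++ sketch of the thresholding step, not the code distributed with the method. The names `Edge`, `SignedEdges`, and `deduce_initial_edges` are hypothetical. The sketch assumes that rows and columns of the matrix index the same genes (J = V) and that the entry for the deleted gene itself is skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;         // (deleted gene j, affected gene k)
using SignedEdges = std::map<Edge, int>;  // edge -> sign: +1 (positive) or -1 (negative)

// d[j][k]: expression of gene k in the deletion mutant of gene j, relative to the
// wild type. Assumption: rows and columns index the same genes (J = V).
SignedEdges deduce_initial_edges(const std::vector<std::vector<double>>& d,
                                 double alpha, double beta) {
    assert(beta < 0.0 && 0.0 < alpha);  // Definition 2 requires beta < 0 < alpha
    SignedEdges edges;
    for (std::size_t j = 0; j < d.size(); ++j) {
        for (std::size_t k = 0; k < d[j].size(); ++k) {
            if (j == k) continue;  // assumption: skip the deleted gene itself
            const Edge e(static_cast<int>(j), static_cast<int>(k));
            if (d[j][k] > alpha) {
                edges[e] = -1;  // higher than wild type: negative regulation of k by j
            } else if (d[j][k] < beta) {
                edges[e] = +1;  // lower than wild type: positive regulation of k by j
            }
        }
    }
    return edges;
}
```

With log-ratio values and thresholds such as α = 1 and β = -1 (illustrative values only), an entry of -1.5 yields a positive edge and an entry of 2.0 yields a negative edge; entries between the thresholds yield no edge.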
The DBRF-MEGN method deduces the most parsimonious SDGs consistent with the SDG that consists of the initially deduced edges. Before defining the most parsimonious SDGs, we need to introduce the function exp and the concept of cover (Figure 2).
Definition 3: $\exp(i, j, k) = 1$ if and only if the edges $(i, j)$, $(j, k)$, and $(i, k)$ exist and $f(i, j) \times f(j, k) = f(i, k)$; otherwise, $\exp(i, j, k) = 0$.
The family of edge sets on $V$ is partially ordered by set inclusion. If $E_1 \subseteq E_2$, note that by a trivial induction on $r$, $E_1^{(r)} \subseteq E_2^{(r)}$. This means that the mapping $\cdot^{cov}: E \mapsto E^{cov}$ is monotonic. Let $E \subseteq E_{ide}$. By construction, an edge $(j, k)$ from $(E^{cov})^{cov}$ is an element of $(E^{(r)})^{(s)} = E^{(r+s)}$ for suitable $r, s \in \mathbb{N}$. Thus $(E^{cov})^{cov} = E^{cov}$, and the mapping $\cdot^{cov}: E \mapsto E^{cov}$ is a so-called closure operation.
The remark proves lemma 1.
[...] by monotonicity and closure of the mapping $\cdot^{cov}$.
[...] $= E_2^{cov}$ by monotonicity and closure of the mapping $\cdot^{cov}$.
Now, we define the most parsimonious SDGs consistent with the expression profiles of gene deletion mutants. A most parsimonious SDG consists of the minimum number of edges that "cover" all initially deduced edges. By this definition, an edge can be redundant only when it is "explained" by two other initially deduced edges. Importantly, an edge is not redundant when it can be "explained" only by three or more initially deduced edges (Figure 3a). We call the most parsimonious SDGs minimum equivalent gene networks (MEGNs).
Definition 5: $G_0 = (V, E_0, f|_{E_0})$ (where $f|_{E_0}$ is the restriction of $f$ to $E_0$) is a most parsimonious SDG, named a MEGN, of $G = (V, E_{ide}, f)$ if and only if it satisfies the following conditions: (1) $E_0 \subseteq E_{ide}$; (2) $E_0^{cov} = E_{ide}$; and (3) the number of edges in $E_0$ is the minimum among the edge sets satisfying (1) and (2).
Since we keep $G = (V, E_{ide}, f)$ fixed for the rest of the paper, we often call $G_0$ simply a MEGN, without explicit reference to $G$.

Removal of non-essential edges from the initially deduced edges
The second process of the DBRF-MEGN method removes all non-essential edges from the initially deduced edges. The process removes all edges that are explained by two other initially deduced edges (Figure 1c). The resulting edges are called essential edges and the removed edges are called non-essential edges.
Definition 6: If there exist $(i, j), (j, k), (i, k) \in E_{ide}$ such that $\exp(i, j, k) = 1$, then $(i, k)$ is called a non-essential edge. Let $E_{nes}$ be the set of non-essential edges. The set $E_{es}$ of essential edges is the complement of $E_{nes}$ in $E_{ide}$: $E_{es} = E_{ide} \setminus E_{nes}$.
Essential edges and non-essential edges have the following properties.
Lemma 4: If $E_p \subseteq E_{ide}$ and $E_p^{cov} \supseteq E_{es}$, then $E_p \supseteq E_{es}$.
Proof: Assume that there exists $(i, j) \in E_{es}$ such that $(i, j) \in E_p^{cov}$ and $(i, j) \notin E_p$. Because $(i, j) \in E_p^{cov}$ and $(i, j) \notin E_p$, there exist $(i, k), (k, j) \in E_p^{cov}$ such that $\exp(i, k, j) = 1$. Hence $(i, j)$ is a non-essential edge, which contradicts our assumption $(i, j) \in E_{es}$.
Lemma 5: If $G_0 = (V, E_0, f|_{E_0})$ is a MEGN, then $E_0 \supseteq E_{es}$.
Proof: $E_0^{cov} = E_{ide} \supseteq E_{es}$, hence $E_0 \supseteq E_{es}$ by lemma 4.
When the essential edges cover all initially deduced edges, the SDG consisting of the essential edges is the only MEGN consistent with the profiles.
Proof: By hypothesis, conditions (1) $E_{es} \subseteq E_{ide}$ and (2) $E_{es}^{cov} = E_{ide}$ of a MEGN are met. It remains to show the uniqueness and minimality of $E_{es}$. (3) Let $G_0 = (V, E_0, f|_{E_0})$ be an arbitrary MEGN. Then by lemma 5, $E_{es} \subseteq E_0$, and by minimality of $E_0$, it follows that $E_{es} = E_0$. The theorem is proved.
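The split into essential and non-essential edges described in this section can be sketched in C++ as follows. This is an illustrative reading of Definitions 3 and 6, not the published implementation; `EdgeSplit`, `is_explained`, and `split_edges` are hypothetical names, and intermediate genes equal to either endpoint are skipped as an assumption.

```cpp
#include <cassert>
#include <map>
#include <utility>

using Edge = std::pair<int, int>;         // (regulator, target)
using SignedEdges = std::map<Edge, int>;  // edge -> sign: +1 or -1

struct EdgeSplit {
    SignedEdges essential;
    SignedEdges non_essential;
};

// True if some gene j gives exp(i, j, k) = 1 using only edges of `ide`.
bool is_explained(const SignedEdges& ide, const Edge& ik, int sign_ik) {
    const int i = ik.first;
    const int k = ik.second;
    for (const auto& ij : ide) {
        if (ij.first.first != i) continue;  // only edges leaving gene i
        const int j = ij.first.second;
        if (j == i || j == k) continue;     // assumption: j differs from i and k
        const auto jk = ide.find(Edge(j, k));
        if (jk != ide.end() && ij.second * jk->second == sign_ik) return true;
    }
    return false;
}

// Partition the initially deduced edges into essential and non-essential edges.
EdgeSplit split_edges(const SignedEdges& ide) {
    EdgeSplit out;
    for (const auto& entry : ide) {
        assert(entry.second == 1 || entry.second == -1);
        if (is_explained(ide, entry.first, entry.second)) {
            out.non_essential.insert(entry);
        } else {
            out.essential.insert(entry);
        }
    }
    return out;
}
```

For a positive chain 0→1→2 with an additional positive edge 0→2, the sketch classifies 0→2 as non-essential; if the sign of 1→2 is flipped, the product of signs no longer matches and 0→2 becomes essential.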

Selection of the uncovered edges in main components from the non-essential edges
The essential edges sometimes fail to cover all initially deduced edges because some of the initially deduced edges represent direct gene regulations even when they are explained by two other edges (Figure 1d). In this case, the method restores the minimum number of non-essential edges so that the resulting edges (the essential edges and the restored non-essential edges) cover all initially deduced edges. The SDG consisting of the essential edges and the restored non-essential edges is a MEGN. Before selecting the sets of non-essential edges to be restored, the method distinguishes non-essential edges that have a chance to be included in the MEGNs from those that do not. This reduces the number of non-essential edges to be considered for restoration and thus the computational cost of finding the edges to be restored. This third process of the DBRF-MEGN method consists of two sub-processes, namely (a) selection of uncovered edges and (b) selection of uncovered edges in main components. The resulting non-essential edges are called uncovered edges in main components, and from these edges the later processes of the DBRF-MEGN method select edges that are included in the MEGNs.

a) Selection of uncovered edges
The first sub-process distinguishes the non-essential edges that are covered by the essential edges from those that are not (Figure 1d). Those edges are called covered edges and uncovered edges, respectively.
Definition 7: Let $E_{cv} = (E_{es})^{cov} \setminus E_{es}$ be the set of covered edges. Let $E_{ucv} = E_{ide} \setminus (E_{es} \cup E_{cv})$ be the set of uncovered edges. The set of initially deduced edges is thereby partitioned into three disjoint edge sets: $E_{ide} = E_{es} \cup E_{cv} \cup E_{ucv}$.
Here, we prove that the MEGNs do not include covered edges.
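The selection of uncovered edges can be sketched in C++ as below. The sketch rests on an assumption about the cover operation: starting from a set of edges, any initially deduced edge explained by two edges already in the set is added, and this is repeated until nothing changes. The names `explained_by`, `cover`, and `uncovered_edges` are hypothetical and do not come from the published software.

```cpp
#include <cassert>
#include <map>
#include <utility>

using Edge = std::pair<int, int>;         // (regulator, target)
using SignedEdges = std::map<Edge, int>;  // edge -> sign: +1 or -1

// True if `target` is explained by two edges (i, j) and (j, k) that are both in `covered`.
bool explained_by(const SignedEdges& covered, const Edge& target, int sign) {
    for (const auto& ij : covered) {
        if (ij.first.first != target.first) continue;
        const int j = ij.first.second;
        if (j == target.first || j == target.second) continue;
        const auto jk = covered.find(Edge(j, target.second));
        if (jk != covered.end() && ij.second * jk->second == sign) return true;
    }
    return false;
}

// Assumed cover closure: grow `start` with edges of `ide` until a fixed point is reached.
SignedEdges cover(const SignedEdges& start, const SignedEdges& ide) {
    SignedEdges covered = start;
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& cand : ide) {
            if (covered.count(cand.first) == 0 &&
                explained_by(covered, cand.first, cand.second)) {
                covered.insert(cand);
                changed = true;
            }
        }
    }
    return covered;
}

// E_ucv = E_ide \ (E_es)^cov, which equals E_ide \ (E_es U E_cv) of Definition 7.
SignedEdges uncovered_edges(const SignedEdges& essential, const SignedEdges& ide) {
    const SignedEdges covered = cover(essential, ide);
    SignedEdges uncovered;
    for (const auto& e : ide) {
        assert(e.second == 1 || e.second == -1);
        if (covered.count(e.first) == 0) uncovered.insert(e);
    }
    return uncovered;
}
```

In a small example where genes 1 and 2 positively regulate each other and gene 0 positively regulates both, the edges 1→2 and 2→1 are essential, while 0→1 and 0→2 explain each other and neither is covered by the essential edges alone, so both are uncovered.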

b) Selection of uncovered edges in main components
The second sub-process distinguishes uncovered edges that have a chance to be included in the MEGNs from those that do not (Figure 1e; Figure 4). Those edges are called uncovered edges in main components and uncovered edges in peripheral components, respectively. The uncovered edges in peripheral components are defined as follows:
Definition 8: Define $E_{ucv}^{(0)}$ to be the set of uncovered edges $(i, j) \in E_{ucv}$ which cannot be used to directly explain another uncovered edge in $E_{ucv}$ with the other [...].
Lemma 7: [...]
Proof: By definition 8, the edges in $E_{ucv}^{(0)}$ cannot explain other uncovered edges in $E_{ucv}$. Therefore, the edges in $E_{ucv}^{(0)}$ can be explained by the edges in $E_{ide} \setminus E_{ucv}^{(0)}$. The lemma is proved.
Definition 9: Following definition 8, define [...] which cannot be used to directly explain another uncovered edge in $E_{ucv} \setminus E_{ucv}^{(r)}$ with the other edges $(k, i)$ [...]. Let $E_{ucv}^{pc}$ [...] be the set of uncovered edges in peripheral components. Let $E_{ucv}^{mc} = E_{ucv} \setminus E_{ucv}^{pc}$ be the set of uncovered edges in main components. The set of initially deduced edges is thereby partitioned into four disjoint edge sets: $E_{ide} = E_{es} \cup E_{cv} \cup E_{ucv}^{pc} \cup E_{ucv}^{mc}$.
In the following, we prove that the MEGNs do not include uncovered edges in peripheral components. First, we prove that uncovered edges in peripheral components have the following properties.
Lemma 8: [...]
Proof: We prove lemma 8 by mathematical induction.
(1) By lemma 7, lemma 8 is true when $r = 0$. By [...]. By lemmas 2 and 7, [...]. Thus, lemma 8 is true when $r = 1$.
(2) Assume that lemma 8 is true when $r = m$. This means that we assume that [...]. Thus, lemma 8 is true when $r = m + 1$, if it is true when $r = m$.
By (1) and (2), lemma 8 is true.
[...] $E_{ucv}^{pc}$, lemma 9 is true.
Now we prove that the MEGNs do not include uncovered edges in peripheral components.
Lemma 10: [...] Because of lemma 5 and definition 7, [...]. This contradicts our assumption that $G_0 = (V, E_0, f|_{E_0})$ is a MEGN. Therefore, $E_0 \cap E_{ucv}^{pc} = \emptyset$. By definition 9 and lemma 6, this implies $E_0 \subseteq E_{es} \cup E_{ucv}^{mc}$, completing the proof.

Separation of the uncovered edges in main components into independent groups and restoration of the minimum number of edges from each independent group
The fourth process of the DBRF-MEGN method separates uncovered edges in main components into "independent groups" so that edges to be restored can be deduced independently for each group (Figure 1f; Figure 5). For each group, the fifth process of the DBRF-MEGN method deduces the minimum number of edges with which essential edges can cover all edges in the group. All sets of such edges are deduced for each group. The essential edges and any possible combination of these sets from each group generate a MEGN of the profiles (Figure 1g).
The independent groups are generated so that the edges in one group do not cover those in other groups.
Definition 10: Define $E_{ucv}^{mc(0)}$ to be a set consisting of an edge $(i, j) \in E_{ucv}^{mc}$, and by induction [...].
We prove that the essential edges and a combination of sets of the minimum number of edges for each independent group generate a MEGN of the profiles as follows: [...]
Remark: When there exists more than one solution of the minimum number of edges for an independent group, each SDG that consists of the essential edges and a possible combination of the solutions for each independent group is a MEGN, because these SDGs satisfy the conditions in definition 5.
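The restoration step for a single independent group can be illustrated by the exhaustive C++ sketch below. It is a brute-force reading of the fifth process, not the authors' algorithm. It assumes that the cover closure repeatedly adds any initially deduced edge explained by two edges already in the set, and the names `cover` and `minimum_restorations` are hypothetical. Subsets of the group are tried in order of increasing size, and all subsets of the first size that succeeds are returned.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;         // (regulator, target)
using SignedEdges = std::map<Edge, int>;  // edge -> sign: +1 or -1

// Assumed cover closure: grow `start` with edges of `ide` until a fixed point is reached.
SignedEdges cover(const SignedEdges& start, const SignedEdges& ide) {
    SignedEdges covered = start;
    for (bool changed = true; changed;) {
        changed = false;
        for (const auto& cand : ide) {
            if (covered.count(cand.first) != 0) continue;
            const int i = cand.first.first;
            const int k = cand.first.second;
            bool explained = false;
            for (const auto& ij : covered) {
                const int j = ij.first.second;
                if (ij.first.first != i || j == i || j == k) continue;
                const auto jk = covered.find(Edge(j, k));
                if (jk != covered.end() && ij.second * jk->second == cand.second) {
                    explained = true;
                    break;
                }
            }
            if (explained) {
                covered.insert(cand);
                changed = true;
            }
        }
    }
    return covered;
}

// All smallest subsets of `group` that, added to `essential`, cover every edge of `group`.
std::vector<std::vector<Edge>> minimum_restorations(const SignedEdges& essential,
                                                    const SignedEdges& group,
                                                    const SignedEdges& ide) {
    std::vector<Edge> cand;
    for (const auto& g : group) cand.push_back(g.first);
    const std::size_t n = cand.size();
    assert(n < 20);  // exhaustive bitmask search: only sensible for small groups
    std::vector<std::vector<Edge>> best;
    for (std::size_t size = 0; size <= n && best.empty(); ++size) {
        for (unsigned long mask = 0; mask < (1UL << n); ++mask) {
            SignedEdges trial = essential;
            std::vector<Edge> chosen;
            for (std::size_t b = 0; b < n; ++b) {
                if (mask & (1UL << b)) {
                    chosen.push_back(cand[b]);
                    trial[cand[b]] = group.at(cand[b]);
                }
            }
            if (chosen.size() != size) continue;  // only subsets of the current size
            const SignedEdges covered = cover(trial, ide);
            bool ok = true;
            for (const auto& g : group) ok = ok && covered.count(g.first) != 0;
            if (ok) best.push_back(chosen);
        }
    }
    return best;
}
```

When genes 1 and 2 positively regulate each other and gene 0 positively regulates both, restoring either 0→1 or 0→2 suffices, so the sketch returns two solutions of one edge each, corresponding to two MEGNs. The search is exponential in the group size, which is why keeping the groups small matters.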

Algorithms of the DBRF-MEGN method
We are concerned with algorithms that are computationally efficient for deducing MEGNs from expression profiles of single-gene deletion mutants. We list these in a form easily translatable into a computer program.
[...] [n] that differ from the non-zero entries of the resulting matrix e[n][n] represent uncovered edges. This algorithm iterates over the while loop to find edges in $E_{nes}$ that can be covered by the essential edges. Thus, the number of iterations is bounded by $|E_{nes}| \cdot n^3$.
[...]
    indgrp.init();
    indgrp.set_el(el); // store edge list el in indgrp
    igl.append(indgrp); // indgrp : an independent group
    void append_group(int i, int j)
    int x;
[...]
The number of complete iterations is bounded by [...], where $G$ is the number of independent groups, $R_j$ is the number of edges in the $j$th independent group, $n_j$ is the number of genes in the $j$th independent group, and $m_j$ is the number of edges to be restored in the $j$th independent group. [...] $S_j$, where $S_j$ is the number of sets of the minimum number of edges to be restored for the $j$th independent group.

Discussion
We have described in detail the algorithm of the DBRF-MEGN method and have proved that the algorithm provides all of the exact solutions of the most parsimonious gene networks consistent with expression profiles of gene deletion mutants. The resulting gene networks, called MEGNs, are the most parsimonious SDGs consistent with an SDG that consists of the initially deduced edges. In graph theory, many algorithms have been developed for deducing the most parsimonious unsigned directed graphs consistent with a given unsigned directed graph; these graphs are called minimum equivalent graphs (MEGs) [22][23][24][25]. MEGN is not just an "SDG version" of MEG, as is explained below. Although both MEGN and MEG are the most parsimonious graphs of a given graph, the parsimoniousness of the graph is defined differently between these graphs. MEGN consists of the minimum number of edges that cover all edges of a given graph (initially deduced edges), whereas MEG consists of the minimum number of edges that retain the reachability of a given graph [22]. MEGNs use the cover instead of the reachability because a MEGN is a prediction of a gene network consisting only of direct gene regulations [14]. When positive regulations from gene A to gene B, from gene B to gene C, from gene C to gene D, and from gene A to gene D are detected and regulation from gene A to gene C is not detected, the regulation from gene A to gene D is likely to be a direct regulation instead of an indirect regulation as a result of the other three regulations (Figure 3a). The use of cover makes MEGNs include edges representing such likely direct regulations (Figure 3a). In contrast, the MEGs, using reachability, do not include those edges (Figure 3b). Therefore, the DBRF-MEGN method, which deduces MEGNs, is fundamentally different from algorithms that deduce MEGs or algorithms for transitive reduction of SDG [16][17][18].
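The four-gene example above can be checked directly against Definition 3. The following worked derivation is a sketch that assumes the four detected positive regulations are the only initially deduced edges among genes A to D.

```latex
\begin{align*}
E_{ide} &= \{(A,B),\ (B,C),\ (C,D),\ (A,D)\}, \qquad f \equiv +1 \\
j = B &:\quad (A,B) \in E_{ide},\ (B,D) \notin E_{ide}
        \;\Rightarrow\; \exp(A,B,D) = 0 \\
j = C &:\quad (A,C) \notin E_{ide}
        \;\Rightarrow\; \exp(A,C,D) = 0
\end{align*}
```

No single intermediate gene explains $(A, D)$, so under this assumption $(A, D)$ is an essential edge and is kept in the MEGN, even though D remains reachable from A through the three-edge path via B and C.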
The selection of uncovered edges in main components (the third process) and the generation of independent groups (the fourth process) make the DBRF-MEGN method applicable to large-scale gene expression profiles. Without these processes, the computational cost for finding all sets of non-essential edges to be included in the [...], where $n$ is the number of genes and $m$ is the number of non-essential edges to be included in a MEGN. This computation is impractical for large-scale gene expression profiles because ${}_{|E_{nes}|}C_m$ increases rapidly as $|E_{nes}|$ or $m$ increases. The selection of uncovered edges in main components reduces the computational cost to $n^3 \sum_{i=1}^{m} {}_{|E_{ucv}^{mc}|}C_i \cdot (|E_{ucv}^{mc}| - i)$, and the generation of independent groups further reduces it to [...], where $t$ is the number of independent groups, $n_j$ is the number of genes in the $j$th independent group, and $m_j$ is the number of edges in the $j$th independent group to be included in a MEGN. $|E_{ucv}^{ig_j}|$ and $m_j$ are usually far smaller than $|E_{nes}|$ and $m$. Because of these reductions of the computational cost, the DBRF-MEGN method successfully deduced MEGNs from sets of large-scale gene expression profiles [14] [see Additional file 2, Table S1; Additional file 3]. Although there is no guarantee that the method will deduce MEGNs from any given expression profiles in an acceptable time, it will most likely do so for most sets of expression profiles.
Because MEGNs are deduced from initially deduced edges, the accuracy of MEGNs depends on that of the initially deduced edges. The primary source of inaccuracy in the initially deduced edges is noise in the expression profiles. Importantly, the number of false-positive edges in a MEGN depends more on the number of falsely detected edges than on the number of falsely missed edges in the initially deduced edges; the number of false-negative edges in a MEGN depends more on the number of falsely missed edges than on the number of falsely detected edges in the initially deduced edges [see Additional file 2, Table S2; Additional file 2, Figure S1]. These dependencies suggest the following guideline for the thresholds α and β (Definition 2): when the number of false-positive edges is more important than that of false-negative edges in a MEGN, α (β) should be a little higher (lower) than the optimal value; in contrast, when the number of false-negative edges is more important than that of false-positive edges in a MEGN, α (β) should be a little lower (higher) than the optimal value.
The DBRF-MEGN method is applicable not only to gene expression profiles of deletion mutants but also to those of gene overexpressions and conditional knockdowns/knockouts [26][27][28]. Gene expression profiles of deletion mutants cannot be obtained for essential genes. Thus, the method cannot deduce gene networks including essential genes when gene expression profiles of deletion mutants are used. A possible solution to this problem is to use the expression profiles of gene overexpressions or conditional knockdowns/knockouts. Applications of the DBRF-MEGN method to those profiles will deduce gene regulations that cannot be deduced from gene expression profiles of gene deletion mutants.
A limitation of the DBRF-MEGN method is its inability to deduce (1) self-regulation of genes, and (2) combinatorial gene regulations such as regulation in which the expression of gene A is down-regulated only when both gene B and gene C are inactive. Self-regulation could be deduced by using chromatin immunoprecipitation [29]. Combinatorial gene regulations could be deduced by using the expression profiles of multiple gene deletion mutants [30]. Synthetic genetic arrays can systematically construct a collection of double-gene deletion mutants [31]. A combination of the DBRF-MEGN method and the above techniques would provide more accurate information about gene networks.
When the DBRF-MEGN method is applied to gene expression profiles measured by using DNA microarrays, each of the deduced edges represents regulation of one gene's mRNA level by another gene's activity. Therefore, the deduced MEGNs do not include edges that represent post-transcriptional gene regulations, although such regulations play major roles in the cell. However, because the algorithm of the DBRF-MEGN method is based on logic that is most commonly used in genetics and cell biology to infer gene networks from small-scale experiments, we can predict post-transcriptional modulators of transcriptional activity from those MEGNs. We predicted a total of 72 transcriptional regulators and 232 post-transcriptional modulators of 18 transcriptional regulators from the MEGNs deduced from a set of gene expression profiles for 265 Saccharomyces cerevisiae genes [14]. The DBRF-MEGN method is applicable not only to gene expression profiles measured by using DNA microarrays but also to those measured by using other technologies such as 2D-PAGE-MS [32] and protein chips [33]. MEGNs deduced from those non-DNA microarray expression profiles will include edges that represent post-transcriptional gene regulations in the cell.

Conclusions
We described in detail the processes of the DBRF-MEGN method and proved that these processes provide all of the exact solutions of the most parsimonious gene networks consistent with the expression profiles of gene deletion mutants, which are called MEGNs. The DBRF-MEGN method provides invaluable information for understanding cellular functions.