# The non-negative matrix factorization toolbox for biological data mining

- Yifeng Li^{1} (corresponding author)
- Alioune Ngom^{1}

**8**:10

https://doi.org/10.1186/1751-0473-8-10

© Li and Ngom; licensee BioMed Central Ltd. 2013

**Received:** 30 November 2012 **Accepted:** 10 April 2013 **Published:** 16 April 2013

## Abstract

### Background

Non-negative matrix factorization (NMF) has been introduced as an important method for mining biological data. Although packages implemented in R and other programming languages currently exist, they either provide only a few optimization algorithms or focus on a specific application field. No complete NMF package exists that allows the bioinformatics community to perform various data mining tasks on biological data.

### Results

We provide a convenient MATLAB toolbox containing both the implementations of various NMF techniques and a variety of NMF-based data mining approaches for analyzing biological data. Data mining approaches implemented within the toolbox include data clustering and bi-clustering, feature extraction and selection, sample classification, missing values imputation, data visualization, and statistical comparison.

### Conclusions

A series of analyses such as molecular pattern discovery, biological process identification, dimension reduction, disease prediction, visualization, and statistical comparison can be performed using this toolbox.

## Keywords

- Non-negative matrix factorization
- Clustering
- Bi-clustering
- Feature extraction
- Feature selection
- Classification
- Missing values

## Background

*Non-negative matrix factorization* (NMF) is a matrix decomposition approach which decomposes a non-negative matrix into two low-rank non-negative matrices [1]. It has been successfully applied in the mining of biological data.

For example, Refs. [2, 3] used NMF as a clustering method to discover metagenes (i.e., groups of similarly behaving genes) and interesting molecular patterns. Ref. [4] applied *non-smooth NMF* (NS-NMF) to the biclustering of gene expression data. *Least-squares NMF* (LS-NMF) was proposed to take into account the uncertainty of the information present in gene expression data [5]. Ref. [6] proposed kernel NMF for reducing the dimensions of gene expression data.

Many authors provide NMF implementations along with their publications so that the interested community can perform the data mining tasks discussed in those publications. However, at least three issues prevent NMF methods from being used by the much larger community of researchers and practitioners in the data mining, biological, health, medical, and bioinformatics areas. First, these NMF packages are implemented in diverse programming languages, such as R, MATLAB, C++, and Java, and usually provide only one optimization algorithm. This is inconvenient for researchers who want to choose a suitable NMF method or mining task for their data among many implementations that differ in language, mining task, control parameters, and criteria. Second, some papers provide only basic NMF optimization algorithms rather than higher-level data mining implementations. For instance, a biologist who wants to cluster or bi-cluster data and then visualize the results should not have to implement these three data mining methods on top of a basic NMF. Third, the existing NMF implementations are application-specific, and thus there is no systematic NMF package for performing data mining tasks on biological data.

Several NMF toolboxes currently exist (we discuss them in the following paragraphs); however, none of them addresses all three of the above issues.

*NMFLAB* [7] is a MATLAB toolbox for signal and image processing which provides a user-friendly interface to load and process input data, and then save the results. It includes a variety of optimization algorithms such as multiplicative rules, exponentiated gradient, projected gradient, conjugate gradient, and quasi-Newton methods. It also provides methods for visualizing the data signals and their components, but does not provide any data mining functionality. Other NMF approaches such as semi-NMF and kernel NMF are not implemented within this package.

*NMF:DTU Toolbox* [8] is a MATLAB toolbox with no data mining functionality. It includes only five NMF optimization algorithms: multiplicative rules, projected gradient, probabilistic NMF, alternating least squares, and alternating least squares with the *optimal brain surgery* (OBS) method.

*NMFN: Non-negative Matrix Factorization* [9] is an R package similar to *NMF:DTU* but with a few more algorithms.

*NMF: Algorithms and framework for Nonnegative Matrix Factorization* [10] is another R package which implements several algorithms and allows parallel computations, but provides no data mining functionality.

*Text to Matrix Generator (TMG)* is a MATLAB toolbox for text mining only.

Ref. [11] provides an NMF plug-in for BRB-ArrayTools. This plug-in implements only the standard NMF and semi-NMF, and only for clustering gene expression profiles.

*Coordinated Gene Activity in Pattern Sets* (CoGAPS) [12] is a new package implemented in C++ with an R interface. In this package, the *Bayesian decomposition* (BD) algorithm is implemented and used in place of the NMF method for factorizing a matrix. Statistical methods are also provided for the inference of biological processes. CoGAPS can give more precise results than NMF methods [13]. However, CoGAPS uses a Markov chain Monte Carlo (MCMC) scheme for estimating the BD model parameters, which is slower than the NMF optimization algorithms implemented with the block-coordinate gradient descent scheme.

The toolbox presented in this paper has the following features:

1. The NMF algorithms are relatively complete and implemented in MATLAB. Since it is impossible and unnecessary to implement all NMF algorithms, we focus only on well-known NMF representatives. This repository of NMFs allows users to select the most suitable one in specific scenarios.
2. Our NMF toolbox includes many functionalities for mining biological data, such as clustering, bi-clustering, feature extraction, feature selection, and classification.
3. The toolbox also provides additional functions for biological data visualization, such as heat-maps and other visualization tools. They are helpful for interpreting results. Statistical methods are also included for comparing the performances of multiple methods.

The rest of this paper is organized as follows. The implementations at the basic level are discussed in the next section. After that, examples of the data mining tasks implemented at the high level are described. Finally, we conclude the paper and give possible directions for future research.

## Implementation

**Algorithms of NMF variants**

Function | Description |
---|---|
nmfrule | The standard NMF optimized by gradient-descent-based multiplicative rules. |
nmfnnls | The standard NMF optimized by NNLS active-set algorithm. |
seminmfrule | Semi-NMF optimized by multiplicative rules. |
seminmfnnls | Semi-NMF optimized by NNLS. |
sparsenmfnnls | Sparse-NMF optimized by NNLS. |
sparsenmfNNQP | Sparse-NMF optimized by NNQP. |
sparseseminmfnnls | Sparse semi-NMF optimized by NNLS. |
kernelnmfdecom | Kernel NMF through decomposing the kernel matrix of input data. |
kernelseminmfrule | Kernel semi-NMF optimized by multiplicative rule. |
kernelseminmfnnls | Kernel semi-NMF optimized by NNLS. |
kernelsparseseminmfnnls | Kernel sparse semi-NMF optimized by NNLS. |
kernelSparseNMFNNQP | Kernel sparse semi-NMF optimized by NNQP. |
convexnmfrule | Convex-NMF optimized by multiplicative rules. |
kernelconvexnmf | Kernel convex-NMF optimized by multiplicative rules. |
orthnmfrule | Orth-NMF optimized by multiplicative rules. |
wnmfrule | Weighted-NMF optimized by multiplicative rules. |
sparsenmf2rule | Sparse-NMF on both factors optimized by multiplicative rules. |
sparsenmf2nnqp | Sparse-NMF on both factors optimized by NNQP. |
vsmf | Versatile sparse matrix factorization optimized by NNQP and |
nmf | The omnibus of the above algorithms. |
computeKernelMatrix | Compute the kernel matrix k(A,B) given a kernel function. |

### Standard-NMF

*Standard-NMF* decomposes a non-negative matrix $\mathit{X}\in {\mathbb{R}}^{m\times n}$ into two non-negative factors $\mathit{A}\in {\mathbb{R}}^{m\times k}$ and $\mathit{Y}\in {\mathbb{R}}^{k\times n}$ (where $k<\min\{m,n\}$), that is

$$\mathit{X}_{+}=\mathit{A}_{+}\mathit{Y}_{+}+\mathit{E}, \qquad (1)$$

where E is the error (or residual) and M_{+} indicates the matrix M is non-negative. Its optimization in the Euclidean space is formulated as

$$\min_{\mathit{A},\mathit{Y}}\ \frac{1}{2}\|\mathit{X}-\mathit{A}\mathit{Y}\|_{F}^{2} \quad \text{s.t.}\ \mathit{A},\mathit{Y}\ge 0. \qquad (2)$$

Statistically speaking, this formulation is obtained from the log-likelihood function under the assumption of a Gaussian error. If multivariate data points are arranged in the columns of X, then A is called the *basis matrix* and Y is called the *coefficient matrix*; each column of A is thus a *basis vector*. The interpretation is that each data point is a (sparse) non-negative linear combination of the basis vectors. It is well known that this is a non-convex optimization problem, and thus *block-coordinate descent* is the main prescribed optimization technique for such a problem. Multiplicative update rules were introduced in [14] for solving Equation (2). Though simple to implement, this algorithm is not guaranteed to converge to a stationary point [15]. Essentially, the optimizations above, with respect to A and Y, are *non-negative least squares* (NNLS) problems. Therefore we implemented the alternating NNLS algorithm proposed in [15]. It can be proven that this algorithm converges to a stationary point. In our toolbox, functions nmfrule and nmfnnls are the implementations of the two algorithms above.
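The multiplicative rules mentioned above can be illustrated with a small sketch. It is written in pure Python (no third-party libraries) for portability and is not the toolbox's nmfrule; the function names `nmf_multiplicative`, `matmul`, `transpose`, and `frob_error` are hypothetical, and the updates are the Euclidean-loss rules of [14] with a small `eps` added to the denominators as an assumption to avoid division by zero.

```python
import random

def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(M):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def frob_error(X, A, Y):
    """Objective value 0.5 * ||X - A*Y||_F^2."""
    R = matmul(A, Y)
    return 0.5 * sum((x - r) ** 2 for xr, rr in zip(X, R) for x, r in zip(xr, rr))

def nmf_multiplicative(X, k, iters=200, seed=0, eps=1e-9):
    """Standard NMF via multiplicative updates; returns A, Y, and the error history."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    # Strictly positive random initialization (an assumption of this sketch).
    A = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    Y = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    errors = [frob_error(X, A, Y)]
    for _ in range(iters):
        # Y <- Y * (A'X) / (A'AY), element-wise
        At = transpose(A)
        num, den = matmul(At, X), matmul(matmul(At, A), Y)
        Y = [[y * a / (b + eps) for y, a, b in zip(yr, nr, dr)]
             for yr, nr, dr in zip(Y, num, den)]
        # A <- A * (XY') / (AYY'), element-wise
        Yt = transpose(Y)
        num, den = matmul(X, Yt), matmul(A, matmul(Y, Yt))
        A = [[a * p / (q + eps) for a, p, q in zip(ar, nr, dr)]
             for ar, nr, dr in zip(A, num, den)]
        errors.append(frob_error(X, A, Y))
    return A, Y, errors

# Toy non-negative data: 3 features (rows) by 4 samples (columns).
X = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 8.0],
     [1.0, 0.0, 1.0, 0.0]]
A, Y, errors = nmf_multiplicative(X, k=2)
```

Because the updates only multiply by non-negative ratios, A and Y stay non-negative without any explicit projection step, which is the main appeal of this scheme despite its weaker convergence guarantees.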

### Semi-NMF

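*Semi-NMF* removes the non-negativity constraints on the data X and basis matrix A. It can be expressed in the following equation:

$$\mathit{X}_{\pm}=\mathit{A}_{\pm}\mathit{Y}_{+}+\mathit{E}, \qquad (3)$$

where M_{±} indicates that matrix M is of mixed signs. Keeping Y fixed, updating A is an unconstrained least squares problem with the analytical solution

$$\mathit{A}=\mathit{X}\mathit{Y}^{T}(\mathit{Y}\mathit{Y}^{T})^{-1}=\mathit{X}\mathit{Y}^{\ddagger}, \qquad (4)$$

where Y^{‡}=Y^{T}(Y Y^{T})^{−1} is the Moore-Penrose pseudoinverse. Updating Y while fixing A is essentially a NNLS problem, as above. Therefore we implemented a fast NNLS-based algorithm to optimize semi-NMF in function seminmfnnls.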

### Sparse-NMF

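The *sparse-NMF* proposed in [3] is expressed in the following equation

$$\min_{\mathit{A},\mathit{Y}}\ \frac{1}{2}\|\mathit{X}-\mathit{A}\mathit{Y}\|_{F}^{2}+\frac{\eta}{2}\|\mathit{A}\|_{F}^{2}+\frac{\lambda}{2}\sum_{i=1}^{n}\|\mathit{y}_{i}\|_{1}^{2} \quad \text{s.t.}\ \mathit{A},\mathit{Y}\ge 0, \qquad (5)$$

where y_{i} is the *i*-th column of Y. From the Bayesian perspective, this formulation is obtained from the log-posterior probability under the assumptions of Gaussian error, Gaussian-distributed basis vectors, and Laplace-distributed coefficient vectors. Keeping one matrix fixed and updating the other matrix can be formulated as a NNLS problem. In order to improve the interpretability of the basis vectors and speed up the algorithm, we implemented the following model instead:

$$\min_{\mathit{A},\mathit{Y}}\ \frac{1}{2}\|\mathit{X}-\mathit{A}\mathit{Y}\|_{F}^{2}+\lambda\sum_{i=1}^{n}\|\mathit{y}_{i}\|_{1} \quad \text{s.t.}\ \mathit{A},\mathit{Y}\ge 0. \qquad (6)$$

We optimize this model using three alternating steps in each iteration. First, Y is updated with A fixed:

$$\min_{\mathit{Y}}\ \frac{1}{2}\|\mathit{X}-\mathit{A}\mathit{Y}\|_{F}^{2}+\lambda\sum_{i=1}^{n}\|\mathit{y}_{i}\|_{1} \quad \text{s.t.}\ \mathit{Y}\ge 0. \qquad (7)$$

Second, A is updated with Y fixed:

$$\min_{\mathit{A}}\ \frac{1}{2}\|\mathit{X}-\mathit{A}\mathit{Y}\|_{F}^{2} \quad \text{s.t.}\ \mathit{A}\ge 0. \qquad (8)$$

Third, the columns of A are normalized to have unit *l*_{2} norm. The first and second steps can be solved using *non-negative quadratic programming* (NNQP), whose general formulation is

$$\min_{\mathit{Z}}\ \sum_{i=1}^{n}\left(\frac{1}{2}\mathit{z}_{i}^{T}\mathit{H}\mathit{z}_{i}+\mathit{g}_{i}^{T}\mathit{z}_{i}\right) \quad \text{s.t.}\ \mathit{Z}\ge 0, \qquad (9)$$

where z_{i} is the *i*-th column of the variable matrix Z. It is easy to prove that NNLS is a special case of NNQP. For example, Equation (7) can be rewritten as

$$\min_{\mathit{Y}}\ \sum_{i=1}^{n}\left(\frac{1}{2}\mathit{y}_{i}^{T}(\mathit{A}^{T}\mathit{A})\mathit{y}_{i}+(\lambda\mathbf{1}-\mathit{A}^{T}\mathit{x}_{i})^{T}\mathit{y}_{i}\right) \quad \text{s.t.}\ \mathit{Y}\ge 0. \qquad (10)$$

The implementations of the method in [3] and our method are given in functions sparsenmfnnls and sparseNMFNNQP, respectively. We also implemented the sparse semi-NMF in function sparseseminmfnnls.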
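To see why an *l*_{1}-penalized NNLS subproblem fits the NNQP template, one can expand the objective for a single column. The short derivation below is a sketch that assumes the subproblem for column $\mathit{y}_i$ is the *l*_{1}-penalized least-squares problem with non-negativity; $\mathbf{1}$ denotes the all-ones vector (notation introduced here), and the identity $\|\mathit{y}_i\|_1=\mathbf{1}^T\mathit{y}_i$ holds because $\mathit{y}_i\ge 0$.

$$
\begin{aligned}
\frac{1}{2}\|\mathit{x}_i-\mathit{A}\mathit{y}_i\|_2^2+\lambda\|\mathit{y}_i\|_1
&=\frac{1}{2}\mathit{y}_i^T\mathit{A}^T\mathit{A}\,\mathit{y}_i-(\mathit{A}^T\mathit{x}_i)^T\mathit{y}_i+\lambda\mathbf{1}^T\mathit{y}_i+\frac{1}{2}\mathit{x}_i^T\mathit{x}_i\\
&=\frac{1}{2}\mathit{y}_i^T\mathit{H}\,\mathit{y}_i+\mathit{g}_i^T\mathit{y}_i+\text{const},
\qquad \mathit{H}=\mathit{A}^T\mathit{A},\quad \mathit{g}_i=\lambda\mathbf{1}-\mathit{A}^T\mathit{x}_i .
\end{aligned}
$$

Setting $\lambda=0$ recovers plain NNLS, and the data enter only through the inner products $\mathit{A}^T\mathit{A}$ and $\mathit{A}^T\mathit{x}_i$.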

### Versatile sparse matrix factorization

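We can use the *l*_{1}-norm on A to induce sparsity. The drawback of the *l*_{1}-norm is that correlated variables may not be simultaneously non-zero in the *l*_{1}-induced sparse result. This is because the *l*_{1}-norm produces sparse but non-smooth results. It is known that the *l*_{2}-norm produces smooth but non-sparse results. When both norms are used together, correlated variables can be selected or removed simultaneously [18]. When smoothness is required on Y, we may also use the *l*_{2}-norm on it in some scenarios. We thus generalize the aforementioned NMF models into a versatile form as expressed below

$$\min_{\mathit{A},\mathit{Y}}\ \frac{1}{2}\|\mathit{X}-\mathit{A}\mathit{Y}\|_{F}^{2}+\sum_{i=1}^{k}\left(\frac{\alpha_{2}}{2}\|\mathit{a}_{i}\|_{2}^{2}+\alpha_{1}\|\mathit{a}_{i}\|_{1}\right)+\sum_{i=1}^{n}\left(\frac{\lambda_{2}}{2}\|\mathit{y}_{i}\|_{2}^{2}+\lambda_{1}\|\mathit{y}_{i}\|_{1}\right) \quad \text{s.t.}\ \mathit{A}\ge 0\ \text{if}\ t_{1}=1,\ \mathit{Y}\ge 0\ \text{if}\ t_{2}=1, \qquad (11)$$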

where, parameters: *α*_{1}≥0 controls the sparsity of the basis vectors; *α*_{2}≥0 controls the smoothness and the scale of the basis vectors; *λ*_{1}≥0 controls the sparsity of the coefficient vectors; *λ*_{2}≥0 controls the smoothness of the coefficient vectors; and, parameters *t*_{1} and *t*_{2} are boolean variables (0: false, 1: true) which indicate if non-negativity needs to be enforced on A or Y, respectively. We can call this model *versatile sparse matrix factorization* (VSMF). It can be easily seen that the standard NMF, semi-NMF, and the sparse-NMFs are special cases of VSMF.
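As a quick check of the special-case claim, the settings below show one way the earlier models could be recovered from the VSMF parameters. The mapping is a sketch inferred from the parameter roles described above, not taken from the toolbox documentation.

$$
\begin{array}{lll}
\text{standard NMF:} & \alpha_1=\alpha_2=\lambda_1=\lambda_2=0, & t_1=t_2=1,\\
\text{semi-NMF:} & \alpha_1=\alpha_2=\lambda_1=\lambda_2=0, & t_1=0,\ t_2=1,\\
\text{sparse-NMF } (l_1 \text{ on } \mathit{Y})\text{:} & \alpha_1=\alpha_2=\lambda_2=0,\ \lambda_1>0, & t_1=t_2=1,\\
\text{sparse semi-NMF } (l_1 \text{ on } \mathit{Y})\text{:} & \alpha_1=\alpha_2=\lambda_2=0,\ \lambda_1>0, & t_1=0,\ t_2=1.
\end{array}
$$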

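Multiplicative update rules can be derived for VSMF when *t*_{1}=*t*_{2}=1 (implemented in function sparsenmf2rule):

$$\mathit{A}=\mathit{A}\ast\frac{\mathit{X}\mathit{Y}^{T}}{\mathit{A}\mathit{Y}\mathit{Y}^{T}+\alpha_{2}\mathit{A}+\alpha_{1}},\qquad \mathit{Y}=\mathit{Y}\ast\frac{\mathit{A}^{T}\mathit{X}}{\mathit{A}^{T}\mathit{A}\mathit{Y}+\lambda_{2}\mathit{Y}+\lambda_{1}}, \qquad (12)$$

where A∗B and $\frac{\mathit{A}}{\mathit{B}}$ are the element-wise multiplication and division operators of matrices A and B, respectively. Alternatively, we also devise an active-set algorithm for VSMF (implemented in function vsmf). When *t*_{1} (or *t*_{2}) = 1, A (or Y) can be updated by NNQP (this case is also implemented in sparsenmf2nnqp). When *t*_{1} (or *t*_{2}) = 0, A (or Y) can be updated using *l*_{1}QP.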

### Kernel-NMF

Two features of a kernel approach are that i) it can represent complex patterns, and ii) the optimization of the model is dimension-free. We now show that NMF can also be kernelized.

The basis matrix is dependent on the dimension of the data, and it is difficult to represent it in a very high (even infinite) dimensional space. We notice that in the NNLS optimization, updating Y in Equation (10) needs only the inner products A^{T}A, A^{T}X, and X^{T}X. From Equation (4), we obtain A^{T}A=(Y^{‡})^{T}X^{T}X Y^{‡} and A^{T}X=(Y^{‡})^{T}X^{T}X. Therefore, we can see that only the inner product X^{T}X is needed in the optimization of NMF. Hence, we can obtain the kernel version, *kernel-NMF*, by replacing the inner product X^{T}X with a kernel matrix *K*(X,X). Interested readers can refer to our recent paper [6] for further details. Based on the above derivations, we implemented the kernel semi-NMF using multiplicative update rule (in kernelseminmfrule) and NNLS (in kernelseminmfnnls). The sparse kernel semi-NMFs are implemented in functions kernelsparseseminmfnnls and kernelSparseNMFNNQP which are equivalent to each other. The kernel method of decomposing a kernel matrix proposed in [19] is implemented in kernelnmfdecom.
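The replacement of X^{T}X by a kernel matrix can be made concrete with a small sketch. The code below is illustrative only and is not the toolbox's computeKernelMatrix; the function names and the RBF parameterization with a bandwidth `sigma` are assumptions. Samples are stored in columns, as in the rest of the paper, and the linear kernel reproduces X^{T}X exactly.

```python
import math

def linear_kernel(a, b):
    """Plain inner product; gives K = X'X when applied to all column pairs."""
    return sum(p * q for p, q in zip(a, b))

def rbf_kernel(a, b, sigma=1.0):
    """Radial basis function kernel exp(-||a - b||^2 / (2 * sigma^2))."""
    d2 = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def kernel_matrix(X, Z, kernel):
    """K[i][j] = kernel(i-th column of X, j-th column of Z)."""
    cols_x = list(zip(*X))
    cols_z = list(zip(*Z))
    return [[kernel(a, b) for b in cols_z] for a in cols_x]

# Two features (rows), three samples (columns).
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 2.0]]
K_lin = kernel_matrix(X, X, linear_kernel)  # same as X'X
K_rbf = kernel_matrix(X, X, rbf_kernel)     # n-by-n, independent of the feature dimension
```

Whatever the number of features, the optimization then works with an n-by-n matrix, which is the sense in which the kernel approach is dimension-free.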

### Other variants

Ref. [16] proposed the *Convex-NMF*, in which the columns of A are constrained to be the convex combinations of data points in X. It is formulated as X_{±}=X_{±}W_{+}Y_{+}+E, where M_{±} indicates that matrix M is of mixed signs. X W=A and each column of W contains the convex coefficients of all the data points to get the corresponding column of A. It has been demonstrated that the columns of A obtained with the convex-NMF are close to the real centroids of clusters. Convex-NMF can be kernelized as well [16]. We implemented the convex-NMF and its kernel version in convexnmfrule and kernelconvexnmf, respectively.

*Orthogonal NMF* (ortho-NMF) imposes the orthogonality constraint in order to enhance sparsity [20]. Its formulation is

$$\mathit{X}=\mathit{A}\mathit{S}\mathit{Y}+\mathit{E} \quad \text{s.t.}\ \mathit{A}^{T}\mathit{A}=\mathit{I},\ \mathit{Y}\mathit{Y}^{T}=\mathit{I},\ \mathit{A},\mathit{S},\mathit{Y}\ge 0, \qquad (13)$$

where the input X is non-negative and S absorbs the magnitude due to the normalization of A and Y. Function orthnmfrule is its implementation in our toolbox. Ortho-NMF is very similar to the *non-negative sparse PCA* (NSPCA) proposed in [21]. The disjoint property of ortho-NMF may be too restrictive for many applications; therefore this property is relaxed in NSPCA. Ortho-NMF does not guarantee the maximum-variance property, which is also relaxed in NSPCA. However, NSPCA only enforces non-negativity on the basis vectors, even when the training data have negative values. We plan to devise a model in which the disjoint property, the maximum-variance property, and the non-negativity and sparsity constraints can be controlled on both basis vectors and coefficient vectors.

There are two efficient ways of applying NMF on data containing missing values. First, the missing values can be estimated prior to running NMF. Alternatively, *weighted-NMF*[22] can be directly applied to decompose the data. Weighted-NMF puts a zero weight on the missing elements and hence only the non-missing data contribute to the final result. An expectation-maximization (EM) based missing value estimation during the execution of NMF may not be efficient. The weighted-NMF is given in our toolbox in function wnmfrule.
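The zero-weight idea can be sketched as follows. This is not the toolbox's wnmfrule: it is a pure-Python illustration that assumes a binary weight matrix W (1 for observed, 0 for missing), marks missing entries with `None`, and uses the commonly cited weighted multiplicative updates A ← A∗((W∗X)Y^{T})/((W∗(AY))Y^{T}) and Y ← Y∗(A^{T}(W∗X))/(A^{T}(W∗(AY))). All function names are hypothetical.

```python
import random

def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(M):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def masked(W, M):
    """Element-wise product W * M, which zeroes the missing positions."""
    return [[w * v for w, v in zip(wr, mr)] for wr, mr in zip(W, M)]

def ratio_update(F, num, den, eps=1e-9):
    """Element-wise F * num / den (eps guards against division by zero)."""
    return [[f * a / (b + eps) for f, a, b in zip(fr, nr, dr)]
            for fr, nr, dr in zip(F, num, den)]

def weighted_nmf(X, k, iters=300, seed=0):
    """Weighted NMF where entries equal to None receive zero weight."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    W = [[0.0 if v is None else 1.0 for v in row] for row in X]
    Xz = [[0.0 if v is None else v for v in row] for row in X]  # equals W * X
    A = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    Y = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        At = transpose(A)
        Y = ratio_update(Y, matmul(At, Xz), matmul(At, masked(W, matmul(A, Y))))
        Yt = transpose(Y)
        A = ratio_update(A, matmul(Xz, Yt), matmul(masked(W, matmul(A, Y)), Yt))
    return A, Y

# Rank-1 toy data with one missing entry (its true value would be 6).
X = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, None, 8.0],
     [3.0, 6.0, 9.0, 12.0]]
A, Y = weighted_nmf(X, k=1)
imputed = matmul(A, Y)[1][2]  # reconstruction at the missing position
```

Only observed entries contribute to the numerators and denominators, so the missing position never influences the fit; its value can be read from the product AY afterwards.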

## Results and discussion

**NMF-based data mining approaches**

Function | Description |
---|---|
NMFCluster | Take the coefficient matrix produced by a NMF algorithm, and output the clustering result. |
chooseBestk | Search the best number of clusters based on dispersion coefficients. |
biCluster | The biclustering method using one of the NMF algorithms. |
featureExtractionTrain | General interface. Using training data, generate the bases of the NMF feature space. |
featureExtractionTest | General interface. Map the test/unknown data into the feature space. |
featureFilterNMF | On training data, select features by various NMFs. |
featSel | Feature selection methods. |
nnlsClassifier | The NNLS classifier. |
perform | Evaluate the classifier performance. |
changeClassLabels01 | Change the class labels to be in {0,1,2,⋯, |
gridSearchUniverse | A framework to do line or grid search. |
classificationTrain | Train a classifier; many classifiers are included. |
classificationPredict | Predict the class labels of unknown samples via the model learned by classificationTrain. |
multiClassifiers | Run multiple classifiers on the same training data. |
cvExperiment | Conduct experiment of k-fold cross-validation on a data set. |
significantAcc | Check if the given data size can obtain significant accuracy. |
learnCurve | Fit the learning curve. |
FriedmanTest | Friedman test with post-hoc Nemenyi test to compare multiple classifiers on multiple data sets. |
plotNemenyiTest | Plot the CD diagram of Nemenyi test. |
NMFHeatMap | Draw and save the heat maps of NMF clustering. |
NMFBicHeatMap | Draw and save the heat maps of NMF biclustering. |
plotBarError | Plot bars with STD. |
writeGeneList | Write the gene list into a .txt file. |
normmean0std1 | Normalization to have mean 0 and STD 1. |
sparsity | Calculate the sparsity of a matrix. |
MAT2DAT | Write a data set from MATLAB into .dat format in order to be readable by other languages. |

### Clustering and bi-clustering

NMF has been applied for clustering. Given data X with multivariate data points in the columns, the idea is that, after applying NMF on X, a multivariate data point, say x_{i}, is a non-negative linear combination of the columns of A; that is, x_{i}≈A y_{i}=*y*_{1i}a_{1}+⋯+*y*_{ki}a_{k}. The largest coefficient in the *i*-th column of Y indicates the cluster this data point belongs to. The reason is that data points mainly composed of the same basis vectors should be in the same group. A basis vector is usually viewed as a cluster centroid or prototype. This approach has been used in [2] for clustering microarray data in order to discover tumor subtypes. We implemented function NMFCluster, through which various NMF algorithms can be selected. An example is provided in the exampleCluster file in the folder of our toolbox.
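The assignment rule described above amounts to an arg-max over each column of the coefficient matrix. The following sketch shows that step only; `nmf_cluster` is a hypothetical name (not the toolbox's NMFCluster), Y is assumed to be a k-by-n list of rows, and labels are zero-based.

```python
def nmf_cluster(Y):
    """Assign each sample (column of Y) to the index of its largest coefficient."""
    k, n = len(Y), len(Y[0])
    return [max(range(k), key=lambda r: Y[r][j]) for j in range(n)]

# Coefficient matrix with k = 3 factors (rows) and n = 4 samples (columns).
Y = [[0.9, 0.1, 0.4, 0.0],
     [0.1, 0.7, 0.5, 0.2],
     [0.0, 0.2, 0.1, 0.8]]
labels = nmf_cluster(Y)  # -> [0, 1, 1, 2]
```

Ties are resolved in favor of the lowest factor index here, which is a design choice of this sketch rather than something the paper specifies.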

### Basis vector analysis for biological process discovery

We can obtain interesting and detailed interpretations via an appropriate analysis of the basis vectors. When applying NMF to microarray data, the basis vectors are interpreted as potential biological processes [3, 13, 24]. In the following, we give one example for finding biological factors on gene-sample data, and two examples on time-series data. Please note that they serve only as simple examples. Fine tuning of the parameters of NMF is necessary for accurate results.

#### First example

The VSMF parameters are set to *k*=3, *α*_{1}=0.01, *α*_{2}=0.01, *λ*_{1}=0, *λ*_{2}=0.01, *t*_{1}=1, and *t*_{2}=1. Next, we obtain 81, 37, and 448 genes for the three factors, respectively. As in [3], we then performed gene set enrichment analysis (GSEA) by applying Onto-Express [25] on each of these sets of genes. Part of the result is shown in Table 3. We can see that the factor-specific genes selected by NMF correspond significantly to some biological processes. Please see file exampleBioProcessGS in the toolbox for details. GSEA can also be done using other tools, such as MIPS [26], GOTermFinder [27], and DAVID [28, 29].

**Gene set enrichment analysis using Onto-Express for the factor specific genes identified by NMF**

Factor 1 | | Factor 2 | | Factor 3 | |
---|---|---|---|---|---|
biological process | p-value | biological process | p-value | biological process | p-value |
reproduction (5) | 0 | response to stimulus (15) | 0.035 | regulation of bio. proc. (226) | 0.009 |
metabolic process (41) | 0 | biological regulation (14) | 0.048 | multi-organism proc. (39) | 0.005 |
cellular process (58) | 0 | | | biological regulation (237) | 0.026 |
death (5) | 0 | | | | |
developmental process (19) | 0 | | | | |
regulation of biological process (19) | 0 | | | | |

#### Second example

#### Third example

We set *k*=2. Because this data set has negative values, we set *t*_{1}=0 and *t*_{2}=1. We set *α*_{1}=0.01, *α*_{2}=0, *λ*_{1}=0, and *λ*_{2}=0.01. The basis vectors of both wild-type and mutant data are compared in Figure 4. From the wild-type time-series data, we can successfully identify two patterns. The rising pattern corresponds to the induced signature and the falling pattern corresponds to the repressed signature in [31]. It is reported in [31] that the MYC target genes contribute to both patterns. From the mutant time-series, we obtain two flat processes, which is reasonable. The source code of this example can be found in exampleBioProcessMYC. We also recommend that readers see the matrix-decomposition-based methods proposed in [13, 32], which are devised for identifying signaling pathways.

### Basis vector analysis for gene selection

The columns of A for a gene expression data set are called *metasamples* in [2]. They can be interpreted as biological processes, because their values imply the activation or inhibition of some of the genes. Gene selection aims to find marker genes for disease prediction and to understand the pathways they contribute to. Rather than selecting genes on the original data, the novel idea is to conduct gene selection on the metasamples. The reason is that the biological processes discovered via NMF are biologically meaningful for class discrimination in disease prediction, and the genes expressed differentially across these processes contribute to better classification accuracy. In Figure 1 for example, three biological processes are discovered and only the selected genes are shown. We have implemented the information-entropy-based gene selection approach proposed in [3] in function featureFilterNMF. We give an example of how to call this function in file exampleFeatureSelection. It has been reported that this approach can select meaningful genes, which has been verified with gene ontology analysis. Feature selection based on supervised NMF will also be implemented.
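One way to picture an entropy-based gene score is the sketch below. It assumes a score of the form 1 + (1/log2 k) Σ_q p(i,q) log2 p(i,q), where p(i,q) is the share of gene i's total loading that falls on metasample q; whether featureFilterNMF uses exactly this form, and how it thresholds the scores, are assumptions. The function name `gene_scores` is hypothetical, and A is a non-negative m-by-k basis matrix with k > 1.

```python
import math

def gene_scores(A):
    """Score each gene (row of A) in [0, 1]; 1 means specific to a single metasample."""
    k = len(A[0])
    scores = []
    for row in A:
        total = sum(row)
        if total == 0:
            scores.append(0.0)  # gene absent from every metasample
            continue
        # Shannon entropy (base 2) of the gene's loading distribution.
        entropy = -sum((v / total) * math.log2(v / total) for v in row if v > 0)
        scores.append(1.0 - entropy / math.log2(k))
    return scores

# Three genes, three metasamples.
A = [[5.0, 0.0, 0.0],   # loaded on one metasample only
     [1.0, 1.0, 1.0],   # spread evenly
     [4.0, 1.0, 0.0]]   # mostly one metasample
scores = gene_scores(A)
```

Genes with scores near 1 are concentrated on one metasample and are natural candidates for factor-specific marker genes, while evenly spread genes score near 0.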

### Feature extraction

High-dimensional biological data are subject to the *curse of dimensionality*. For example, it is impossible to estimate the parameters of some statistical models since the number of their parameters grows exponentially as the dimension increases. Another issue is that biological data are usually noisy, which crucially affects the performance of classifiers applied to the data. In cancer studies, a common hypothesis is that only a few biological factors (such as the oncogenes) play a crucial role in the development of a given cancer. When we generate data from control and sick patients, the high-dimensional data will contain a large amount of irrelevant or redundant information. Orthogonal factors obtained with *principal component analysis* (PCA) or *independent component analysis* (ICA) are not appropriate in most cases. Since NMF generates non-orthogonal (and non-negative) factors, it is much more reasonable to extract important and interesting features from such data using NMF. As mentioned above, training data X_{m×n}, with *m* features and *n* samples, can be decomposed into *k* metasamples A_{m×k} and Y_{k×n}, that is

$$\mathit{X}\approx \mathit{A}\mathit{Y}_{tr} \quad \text{s.t.}\ \mathit{A},\mathit{Y}_{tr}\ge 0, \qquad (14)$$

where Y_{tr} means that Y is obtained from the training data. The *k* columns of A span the *k*-dimensional *feature space* and each column of Y_{tr} is the representation of the corresponding original training sample in the feature space. In order to project the *p* unknown samples S_{m×p} into this feature space, we have to solve the following non-negative least squares problem:

$$\min_{\mathit{Y}_{uk}}\ \frac{1}{2}\|\mathit{S}-\mathit{A}\mathit{Y}_{uk}\|_{F}^{2} \quad \text{s.t.}\ \mathit{Y}_{uk}\ge 0, \qquad (15)$$

where Y_{uk} means that Y is obtained from the unknown samples. After obtaining Y_{tr} and Y_{uk}, the learning and prediction steps can be done quickly in the *k*-dimensional feature space instead of the *m*-dimensional original space. A classifier can learn over Y_{tr}, and then predict the class labels of the representations of the unknown samples, that is, Y_{uk}.
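The projection of unknown samples can be sketched with a simple NNLS solver. The code below is not the toolbox's featureExtractionTest: for brevity it swaps in a fixed-basis multiplicative update, y ← y∗(A^{T}s)/(A^{T}A y), instead of an active-set NNLS algorithm, and it assumes that both A and the samples are non-negative. The names `nnls_multiplicative` and `project` are hypothetical.

```python
def nnls_multiplicative(A, s, iters=500, eps=1e-12):
    """Approximately solve min_y 0.5 * ||s - A y||^2 s.t. y >= 0 (A and s non-negative)."""
    k = len(A[0])
    Ats = [sum(A[i][q] * s[i] for i in range(len(A))) for q in range(k)]
    AtA = [[sum(row[p] * row[q] for row in A) for q in range(k)] for p in range(k)]
    y = [1.0] * k  # strictly positive start
    for _ in range(iters):
        AtAy = [sum(AtA[p][q] * y[q] for q in range(k)) for p in range(k)]
        y = [y[p] * Ats[p] / (AtAy[p] + eps) for p in range(k)]
    return y

def project(A, S):
    """Map each column of S (m x p) to its k-dimensional representation (k x p)."""
    cols = [nnls_multiplicative(A, list(col)) for col in zip(*S)]
    return [list(row) for row in zip(*cols)]

# Basis with k = 2 metasamples over m = 3 features.
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
# Two unknown samples, built here as A*[2,3]' and A*[1,2]'.
S = [[2.0, 1.0],
     [3.0, 2.0],
     [5.0, 3.0]]
Y_uk = project(A, S)  # close to [[2, 1], [3, 2]]
```

Note that only A^{T}A and A^{T}s are needed, which is what makes the kernel variants of the feature extraction functions possible.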

From the aspect of interpretation, the advantage of NMF over PCA and ICA is that the metasamples are very useful in the understanding of the underlying biological processes, as mentioned above.

We have implemented a pair of functions featureExtractionTrain and featureExtractionTest including many linear and kernel NMF algorithms. The basis matrix (or, the inner product of basis matrices in the kernel case) is learned from the training data via the function featureExtractionTrain, and the unknown samples can be projected onto the feature space via the function featureExtractionTest. We give examples of how to use these functions in files exampleFeatureExtraction and exampleFeatureExtractionKernel.

We compared SVM *without* dimension reduction and SVM *with* dimension reduction using linear NMF, kernel NMF with a *radial basis function* (RBF) kernel, and PCA on two data sets, SRBCT [33] and Breast [34]. Since ICA is computationally costly, we did not include it in the comparisons. The bars represent the averaged 4-fold cross-validation accuracies using a *support vector machine* (SVM) as classifier over 20 runs. We can see that NMF is comparable to PCA on SRBCT, and is slightly better than PCA on the Breast data. Also, with only a few factors, the performance after dimension reduction using NMF is at least comparable to that without any dimension reduction. As future work, supervised NMF will be investigated and implemented in order to extract discriminative features.

### Classification

If we make the assumption that every unknown sample is a sparse non-negative linear combination of the training samples, then we can directly derive a classifier from NMF. Indeed, this is a specific case of NMF in which the training samples are the basis vectors. Since the optimization process within NMF is a NNLS problem, we call this classification approach the *NNLS classifier* [35]. A NNLS problem is essentially a quadratic programming problem as formulated in Equation (9); therefore, only inner products are needed for the optimization. We can thus naturally extend the NNLS classifier to a kernel version. Two features of this approach are that: i) the sparsity regularization helps avoid overfitting the model; and ii) the kernelization allows a dimension-free optimization and also linearizes the non-linear complex patterns within the data. The implementation of the NNLS classifier is in file nnlsClassifier. Our toolbox also provides many other classification approaches, including the SVM classifier. Please see file exampleClassification for a demonstration. In our 4-fold cross-validation experiment, accuracies of 0.7865 and 0.7804 were obtained with the linear and kernel (RBF) NNLS classifiers, respectively, on the Breast data set. They achieved accuracies of 0.9762 and 0.9785, respectively, on the SRBCT data.

Biological data are usually noisy and sometimes contain missing values. A strength of the NNLS classifier is that it is robust to noise and to missing values, making it quite suitable for classifying biological data [35].

We compared the NNLS classifier with SVM and *1-nearest neighbor* (1-NN) classifiers using noisy data. It can be seen that as the noise increases, NNLS outperforms SVM and 1-NN significantly.

Given two samples x_{i} and x_{j} with missing values, we normalize them to have unit *l*_{2}-norm using only the features present in both samples, and then we take their inner product. As an example, we randomly removed between 10% and 70% of the data values in the SRBCT data. Using such incomplete data, we compared our method with the zero-imputation method (that is, estimating all missing values as 0). In Figure 7, we can see that the NNLS classifier using our missing value approach outperforms the zero-imputation method in the case of a large missing rate. Also, the more sophisticated *k*-nearest neighbor imputation (KNNimpute) method [36] will fail on data with a high percentage of missing values.
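The pairwise treatment of missing values described above can be sketched in a few lines. The function name `masked_inner_product` is hypothetical, missing values are marked with `None` as an assumption of this sketch, and returning 0 when no features are shared is a choice made here rather than something stated in the paper.

```python
import math

def masked_inner_product(xi, xj):
    """Inner product of two samples after l2-normalizing over their shared features."""
    pairs = [(a, b) for a, b in zip(xi, xj) if a is not None and b is not None]
    if not pairs:
        return 0.0  # no feature observed in both samples
    norm_i = math.sqrt(sum(a * a for a, _ in pairs))
    norm_j = math.sqrt(sum(b * b for _, b in pairs))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return sum(a * b for a, b in pairs) / (norm_i * norm_j)

# Features 0 and 3 are observed in both samples: (1, 2) and (2, 4).
xi = [1.0, None, 2.0, 2.0]
xj = [2.0, 5.0, None, 4.0]
value = masked_inner_product(xi, xj)
```

Since the NNLS classifier needs only inner products between samples, such pairwise values are enough to run it on incomplete data without imputing anything.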

### Statistical comparison

The result of the Nemenyi test can be presented in a *crucial difference* (CD) diagram, as implemented in function plotNemenyiTest. CD is determined by the significance level *α*. Figure 9 is an example of the result of the Nemenyi test for comparing 8 classifiers over 13 high dimensional biological data sets. This example can be found in file exampleFriedmanTest. If the distance between two methods is greater than the CD, then we conclude that they differ significantly.
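For orientation, the CD of the Nemenyi test is commonly computed as q_α·sqrt(k(k+1)/(6N)) for k methods compared over N data sets. The sketch below applies that standard formula to the 8-classifier, 13-data-set setting mentioned above; it assumes (without confirmation from the toolbox source) that plotNemenyiTest follows the same formula, uses the usual tabulated critical values for α = 0.05, and the function name `crucial_difference` is hypothetical.

```python
import math

# Critical values q_alpha of the two-tailed Nemenyi test at alpha = 0.05,
# indexed by the number of compared methods (standard tabulated values).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def crucial_difference(num_methods, num_datasets, q_table=Q_ALPHA_005):
    """CD = q_alpha * sqrt(k (k + 1) / (6 N)), in units of average rank."""
    q = q_table[num_methods]
    return q * math.sqrt(num_methods * (num_methods + 1) / (6.0 * num_datasets))

cd = crucial_difference(8, 13)  # roughly 2.91 average-rank units
```

Under these assumptions, two of the eight classifiers would be declared significantly different when their average ranks over the 13 data sets differ by more than about 2.91.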

## Conclusions

In order to address the issues of the existing NMF implementations, we propose an NMF Toolbox written in MATLAB, which includes a basic NMF optimization level and an advanced data mining level. It enables users to analyze biological data via NMF-based data mining approaches, such as clustering, bi-clustering, feature extraction, feature selection, and classification.

The following future work is planned in order to improve and augment the toolbox. First, we will include more NMF algorithms such as nsNMF, LS-NMF, and supervised NMF. Second, we are very interested in implementing and speeding up the Bayesian decomposition method, which is actually a probabilistic NMF introduced independently in the same period as the standard NMF. Third, we will include more statistical comparison and evaluation methods. Furthermore, we will investigate the performance of NMF for denoising and for data compression.

## Availability and requirements

- **Project name:** The NMF Toolbox in MATLAB
- **Project home page:** https://sites.google.com/site/nmftool and http://cs.uwindsor.ca/~li11112c/nmf
- **Operating system(s):** Platform independent
- **Programming language:** MATLAB
- **Other requirements:** MATLAB 7.11 or higher
- **License:** GNU GPL Version 3
- **Any restrictions to use by non-academics:** Licence needed

## Declarations

### Acknowledgements

This research has been partially supported by IEEE CIS Walter Karplus Summer Research Grant 2010, Ontario Graduate Scholarship 2011–2013, and Canadian NSERC Grants #RGPIN228117–2011.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.