- Open Access
MmPalateMiRNA, an R package compendium illustrating analysis of miRNA microarray data
Source Code for Biology and Medicine volume 8, Article number: 1 (2013)
MicroRNAs (miRNAs) constitute the largest family of noncoding RNAs involved in gene silencing and represent critical regulators of cell and tissue differentiation. Microarray expression profiling of miRNAs is an effective means of acquiring genome-level information of miRNA activation and inhibition, as well as the potential regulatory role that these genes play within a biological system. As with mRNA expression profiling arrays, miRNA microarrays come in a variety of platforms from numerous manufacturers, and there are a multitude of techniques available for reducing and analyzing these data.
In this paper, we present an analysis of a typical two-color miRNA microarray experiment using publicly available packages from R and Bioconductor, the open-source software project for the analysis of genomic data. Covered topics include visualization, normalization, quality checking, differential expression, cluster analysis, miRNA target identification, and gene set enrichment analysis. Many of these tools carry-over from the analysis of mRNA microarrays, but with some notable differences that require special attention. The paper is presented as a “compendium” which, along with the accompanying R package MmPalateMiRNA, contains all of the experimental data and source code to reproduce the analyses contained in the paper.
The compendium presented in this paper will provide investigators with an access point for applying the methods available in R and Bioconductor for analysis of their own miRNA array data.
Much of the recent bioinformatics literature has focused on the role small RNA molecules, termed microRNAs (miRNAs), play in regulating gene expression within plant and animal systems . Mature miRNAs are typically 18-25 bases in length and have been found to execute key functions in silencing expression of specific target genes . MicroRNAs regulate expression of genes post-transcriptionally, by binding the target mRNA molecule and either directly inhibiting translation or destabilizing the target mRNA . MicroRNA microarray technology has been successfully exploited to generate microRNA gene expression profiles of the cell cycle , cell differentiation , cell death , embryonic development , stem cell differentiation , different types of cancers [9, 10], the diseased heart  and diseased neural tissue . Thus, microRNA gene expression profiling offers an effective means of acquiring novel and valuable information regarding the expression and regulation of genes, under the control of miRNAs, in a variety of biological systems.
The R software programming language  has gained wide popularity among the scientific research community, along with its extension to the realm of genomics applications via the Bioconductor [14, 15] software for bioinformatics project. The Bioconductor project contains a variety of R packages for application to high-throughput “omics” data, including array preprocessing and normalization, identification of differentially expressed genes, clustering, classification, gene-set enrichment analysis, and other down-stream analysis methods. Hence, the R packages available at Bioconductor can provide a complete suite of tools for analyzing array data from the initial preprocessing steps through the final determination of interesting genes and gene sets. Several publications have addressed how to perform and reproduce an analysis of mRNA expression array data using software from R and Bioconductor [16–18]. An integrated way to present the analysis from these experiments is in the form of a compendium [17, 18], which encapsulates the primary data, supporting software, statistical analysis, and document text in a manner that allows other investigators to completely reproduce the results of the experiment.
While many of the same tools for analyzing mRNA expression arrays can be applied to the analysis of miRNA data, there are distinct differences between the two platforms which necessitate special use of some methods (see overviews by Sarver  and Thomson et al.). In particular, miRNA arrays typically have far fewer genes that are spotted on the array compared to mRNA arrays and require careful consideration of the assumptions behind array pre-processing methods prior to their application. Several recent publications have compared various normalization methods for microRNA microarray data [21–23], while others have developed novel methods specifically for miRNA data [24–27]. Though certain methods were found to outperform others in each case, in general there is still no consensus on the best normalization method. Therefore, investigators are encouraged to perform their own assessments to determine an appropriate normalization method for their data. A second unique aspect of miRNA analysis relative to mRNA analysis is that differentially expressed miRNAs are subsequently evaluated for potential gene targets that are regulated by the miRNAs. A number of databases can be used for this purpopse, and many of these have been ported to R in the form of Bioconductor packages. It is these putative regulatory targets that are typically evaluated for biological and molecular functionality, e.g. by gene set enrichment analysis.
In this article, we illustrate how to analyze a two-color miRNA experiment using available packages from Bioconductor and the Comprehensive R Archive Network (CRAN). Example code is provided for the complete analysis including preprocessing of arrays, normalization, identification of differentially expressed miRNAs, clustering, miRNA target identification, and gene set enrichment analysis. The analysis presented here follows closely to what was presented by Mukhopadhyay et al.. Aspects of miRNA analysis which require special attention are highlighted, as are particular advantages of using specific R and Bioconductor packages. Although the analysis is specific to the Miltenyi Biotech miRXplore platform , the general steps outlined here can easily be extended to other platforms as well. To ensure reproducibility of the results, the entire analysis is presented as a compendium [17, 18], in the form of an accompanying R package called MmPalateMiRNA , which has been made freely available on Bioconductor. The package also includes several functions to produce diagnostic plots for evaluating probe intensity distributions on miRNA microarrays, as discussed in Sarkar et al.. The experimental data used in this manuscript are freely available as part of the compendium package (GEO DataSets , accession number GPL10179).
In the following subsections, we discuss the methodologies used for the analysis of the miRNA data in this compendium. We refer the reader to the original papers for detailed methods, here just providing an overview.
An important first step in the analysis of microarray data is to check the array quality by inspecting for outliers, spatial artifacts, and for differences in array intensity distributions which may require normalization. Several software packages exist for this purpose; in particular, the arrayQualityMetrics package  available from Bioconductor provides a comprehensive report for both one and two-color microarray data. However, the diagnostic plots in that package for two-color arrays are constructed from ratios of the two channels (M values), and for miRNA data plots focused solely on the control / reference channel may be more relevant. Specifically, Sarkar et al. introduced novel diagnostic plots for miRNA data for the purpose of evaluating and comparing different normalization methods, which serve as useful indicators for array quality and outlyling arrays. In addition to evaluating array quality, other important pre-processing steps include identifying outlying values for specific probes, performing non-specific filtering of probes, and imputing probes that are missing or are extreme outliers.
Several recent publications have drawn attention to the normalization of miRNA data as distinct from that of mRNA data. In particular, methods that assume some level of symmetry in differential expression, such as loess and quantile normalization, may be inappropriate when global changes associated with phenotypes are present . As such, normalization methods that use a set of invariant probes [23, 26], or use single-channel normalization methods  may outperform so-called “global” normalization methods. Recent comparisons of normalization methods for miRNA microarray data have resulted in differing conclusions [21–23], with top performing methods ranging from quantile normalization for single-channel array data  to print-tip loess for two-channel data . However, Sarkar et al. evaluated several different normalization methods, incluuding variance stabilizing normalization (VSN) , spike-in VSN, and print-tip loess, and found no statistically significant differences between them based on correlation with qRT-PCR measurements. As is typical with array data in general, investigators are encouraged to try several different normalization methods and evaluate the differences betweeen them on the basis of diagnostic plots .
A variety of methods exist to determine differential expression between two or more groups of expression data, including the classic t-test and the more recent ‘moderated’ variants. Members of the latter category include the Significant Analysis of Microarrays (SAM) , and empirical Bayes methods [35, 36]. In particular, the methodology developed by Smyth  extends these concepts to apply to general microarray experiments with arbitrary numbers of treatments and samples, in the context of a hierarchical linear model. A model is fitted to the expression values for each gene/transcript, and used to evaluate differential expression for contrasts (comparisons between treatment groups) of interest. A ‘shrinkage’ estimate of the variability is obtained by a weighted average of the a pooled estimate of variation and the per-gene estimate of variation. This lessens the occurrence of large t-statistics due to exceptionally small variance estimates, and effectively introduces a “fold-change” criterion into the statistic. The methods are available in the Bioconductor package limma .
Clustering of array profiles is helpful for determining underlying structure in the changes of gene expression, especially for time course data. Common methods include hierarichical clustering, divisive hierarchical clustering (DIANA), K-means, self-organizing maps (SOM), the self-organizing tree algorithm (SOTA), partitioning around medoids (PAM), and model-based clustering [38–40]. With the diversity of methods available for the investigator to try, a commonly encountered difficulty is determining which clustering algorithm to use for a particular data set. This problem can be partially overcome using clustering validation measures, as found in the clValid package . The clValid package allows the user to select from among ten different clustering algorithms and uses three different sets of validation measures (internal, stability, and biological) to evaluate the performance of each algorithm for a range of cluster numbers.
Identification of miRNA target genes
After a subset of miRNAs of interest has been determined, e.g by differential expression or clustering, the next step is to determine the potential regulatory targets of the miRNA molecules. Algorithms for predicting miRNA target molecules are fundamentally based on sequence complimentarity (between the mature miRNA transcript and the 3’-untranslated regions of potential target mRNAs), species conservation, thermodynamic stability, and site accessibility (see Alex et al. for an overview). The Bioconductor package RmiR.Hs.miRNA  contains six databases for human miRNA targets, while the database of targets in miRBase  is available through the Bioconductor packages mirbase.db  and microRNA . The TargetScan database of miRNA targets  is also available in targetscan.Hs.eg.db  for humans and targetscan.Mm.eg.db  for mouse.
Gene set analysis
Once putative regulatory targets of the differentially expressed miRNAs have been identified, a logical next step is to identify what biological or functional pathways the targets have in common with each other. This can be accomplished by gene set analysis, or gene set enrichment analysis . The regulatory targets are compared with predefined gene sets such as GO classifications , KEGG pathways , chromosome bands, and protein complexes. Gene set analysis is based on the hypergeometric test and identifies which biological gene sets have an under- and over-representation of the identified miRNA targets. Bioconductor packages which provide gene set analysis include GOstats  and Category .
Results and discussion
R packages that are needed for running the example code in this manuscript are MmPalateMiRNA  and its dependencies, and the additional packages latticeExtra , clValid , targetscan.Mm.eg.db , microRNA , org.Mm.eg.db , and GOstats . The full list of dependencies is given in the Availability and requirements. To begin, we download and install all of the needed packages for running the code in this compendium. In the following, text after the R> prompt denotes an R command, and a “+” denotes a continuation in code. The R code from this compendium is available as Additional file 1 (“MmPalateMiRNA_SCBM.R”).
Next, we load the MmPalateMiRNA package, which additionally loads the required packages Biobase , limma , vsn , statmod , lattice , and xtable . Further, we load the remaining needed packages for running the code in this compendium.
The microRNA microarray data in this compendium were obtained as previously described in Mukhopadhyay et al., and the data are publicly available from GEO  (accession number GPL10179). Briefly, mouse embryonic tissue was obtained on gestational days (GD) 12, 13, and 14, which represents the critical period of palate development in the mouse. Total RNA (containing miRNAs) was isolated using standard RNA extraction protocols. RNA samples (1 μ g) isolated from mouse embryonic orofacial tissues (GD-12 - GD-14) as well as the miRXplore Universal Reference (UR, control channel) were fluorescently labeled with Hy5 (red) or Hy3 (green), respectively, and hybridized to Miltenyi Biotec miRXplore Microarrays using the a-Hyb Hybridization Station . For each gestational day, three distinct pools of RNA were independently processed and applied to microarray chips. Probes for a total of 1336 mature miRNAs (from human, mouse, rat and virus), including positive control and calibration probes, were spotted in quadruplicate on each microarray. Each array included probes for 588 murine miRNAs. The miRXplore Universal Reference (UR) controls, provided by Miltenyi, represent a defined pool of synthetic miRNAs for comparison of multiple samples. Fluorescence signals of the hybridized miRXplore Microarrays were detected using a laser scanner from Agilent Technologies. Mean and median signal and local background intensities for the Hy3 and Hy5 channels were obtained for each probe on each of the nine microarray images using the ImaGene software . The experimental data is included in the MmPalateMiRNA package in a compiled format, as an RGList object (a class in package limma  for two-color microarray data) called PalateData. The data is loaded into the R session using the code below. To see how PalateData was created from the source data files, see Additional file 2 (“ReadingTwoColorData.pdf”) and the corresponding R code in Additional file 3 (“ReadingTwoColorData.R”). For more information on the data in PalateData, use ?PalateData or see Additional file 2.
Sarkar et al. described several diagnostic plots for miRNA data that can be used to evaluate the need and effectiveness of normalization procedures. These plots can also serve as aids to determine outlying arrays and batch effects. One such plot is the kernel density estimate for each array, for different types of probes. Figure 1 plots the density estimates of the log2 intensity values in the control channel for the unnormalized data, separated into “MMU miRNAs” (MMU = Mus musculus, i.e. mouse), “Other miRNAs”, and “Control” probes (other probes were non-informative). The plot requires use of the lattice package, and the MmPalateMiRNA package contains methods to produce plots for RGList objects based on the generic functions in lattice. The code below illustrates the use of the function densityplot to produce Figure 1. To access the documentation file for this function, use ?densityplot (in general, the documentation file for function fun is accessed through ?fun, and the documentation file for S4 class obj is accessed through class?obj).
Figure 1 indicates three possible outlying arrays, GD 12-1, 13-2, and 14-3. A second figure (Figure 2) can be constructed based on the pairwise “distance” between arrays, as measured by the median of the absolute differences in log2 intensity values for miRNAs in the green channel . The plot is created using the levelplot method for RGList objects, which is included in the package. Here we separate the plots according to the type of probe, and the arrays are reordered so that the outlying arrays are grouped together. The three arrays are clearly outliers based on the control probes, but to a lesser extent based on the other types of probes.
Figures 1 and 2 demonstrate the potential need for normalization or removal of several of the arrays. In the Normalization subsection, we will evaluate the effectiveness of several normalization methods in correcting these systematic differences between the arrays.
In addition to checking for outlying arrays, it is important to check for outlying values on individual probes. To accomplish this, we evaluated for each probe whether there were any extreme values (greater than 2.665 standard deviations above the mean). The checkOutliers function checks this for each of the red and green channels in an RGList object and returns the indices of array probes with extreme values.
The probes with outlying arrays can be visualized using boxplots with the code below.
The figure is omitted but clearly shows that the identified outlying values are nearly two orders of magnitude above the rest of the intensity values. Rather than omitting these values, we exploit the replicated design of the arrays and substitute the mean of the other replicates on the array for the extreme values using the fixOutliers function.
In addition to checking for outlying values, we also check for any missing values in the two channels using the checkMVs function. Here, we only find two probes on the array with missing values in the background channels, so we again impute these values using the means of the backgrounds from the other three replicates on the chip using the fixMVs function.
Prior to running the normalization methods, we filter the probes and keep only those which correspond to miRNAs and calibration probes. Additionally, probes that are not sufficiently above the background intensity level may be unreliable and represent noise that can interfere with subsequent analysis, including normalization . Prefiltering also reduces the number of statistical comparisons being performed and improves overall power . Here, we filter probes whose foreground intensity values are below 1.1 times their background intensity level. To allow for probes which may be expressed for a particular experimental condition (here, gestational day), we keep all probes which have at least 3 samples above the filtering threshold. Lastly, only those genes with all four replicates passing the filtering step are retained. After all pre-processing steps, a total of 956 probes, corresponding to 175 mouse miRNAs, 42 other miRNAs, and 22 calibration probes each replicated 4 times, remain.
Based on the literature [21–23, 26], we evaluated several normalization procedures on the filtered data, including none, median, loess, quantile, VSN, and spike-in VSN. The limma package  includes various options for both within (normalizeWithinArrays) and between (normalizeBetweenArrays) array normalization, and the vsn package  has functions for performing VSN and spike-in VSN. In all cases, a simple background correction was performed by subtracting background from the foreground intensities.
Several diagnostic plots can be used to contrast the effectiveness of each normalization procedure. The MmPalateMiRNA package contains several methods to produce these plots for lists of class MAList or NChannelSet objects, based on functions in the lattice package. Figure 3, rows one through five, plots the intensity distribution for the reference channels after each of the normalization procedures (use of the useOuterStrips function requires the latticeExtra package). Note that the order of panels in lattice plots is from the bottom left panel to the right and up, according to the rules used for graphs. The quantile normalization procedure is clearly the most successful in removing the intensity bias that was apparent for three of the arrays (12-1, 13-2, and 14-3), while loess and median normalization appear to be the least successful. Notably, normalization based on the spike-in probes was unsuccessful, perhaps since these probes were shifted differently in the reference channel relative to the other probe types.
An additional plot based on the median absolute difference between probes in the reference channel can be used to compare relative success of the normalization procedures in removing the array effect (Figure 4). Here again, quantile normalization appears to be the best, while loess and median normalization are the least effective.
To investigate the effect of the normalization procedure on the experimental channel, plots of the spread (median absolute deviation) versus the location (median) of all probes can be used. Plots of this type can be produced using the MADvsMedianPlot function in the MmPalateMiRNA package. Probes of different types are highlighted, with particular focus on the spike-in probes, which should have low variability across all the arrays. In Figure 5, spike-in VSN has the lowest variability among the spike-in probes, compared to the other normalization methods. However, spike-in VSN has also dramatically decreased the variation among all the probes in the experimental channel, making the normalization procedure questionable in this case. Quantile normalization has resulted in large variations for some of the probes with lower intensity values.
Plots of the log2 intensity ratios (M values) versus the mean log2 intensity values (A values) for each probe can be used to evaluate whether there is a bias associated with overall intenstity level for each array. This so-called “MA” plot is illustrated in Figure 6 for quantile normalization. MA plots for the other normalization methods are not shown, though code to produce the plots is available in the accompanying R script “MmPalateMiRNA_SCBM.R”. Quantile normalization has removed any association between the M and A values, while for VSN normalization there is still a trend which is similar to the unnormalized data. The MA plot for spike-in VSN shows a dramatic effect on the intensity ratios.
As a final evaluation, we inspected heatmaps along with hierarchical clustering of the arrays. Figure 7 displays the heatmap after quantile normalization and reveals that the previously identified outlying arrays (samples 12-1, 13-2, and 14-3) still do not cluster with the other replicates for that day.
Table 1 gives the correlations between each pair of arrays, based on the log2 intensity ratios. Since the other two replicates for each day were highly correlated (r≥0.95), we decided to use only those two replicates from each day for subsequent statistical analysis. Normalization was redone omitting the arrays 12-1, 13-2, and 14-3, using quantile normalization.
Sixteen probes from the six arrays exhibited negative intensities after the background procedure and resulted in missing values for subsequent calculation of the log2 intensity ratios. A significant percentage of missing values can have a substantial impact on downstream analysis of array data , and in such cases choice of a imputation procedure should be carefully considered. Here, with a relatlively small percentage of missing values, the impact on data analysis will be relatively minimal. Hence we use the K-nearest neighbor imputation scheme  as a fast and effective approach, implemented in the imputeKNN function included in package MmPalateMiRNA.
Determining differentially expressed miRNAs
To test for differential expression of miRNAs between different gestational days (GD-12, 13, and 14), the limma package [36, 37] was used. Use of the limma package requires the user to create a design matrix, which defines the possible levels for each experimental factor, and is used to construct a model matrix and contrasts to test for differential expression between factor levels. The model matrix consists of indicator variables for the levels of each experimental factor in our design, which in our case corresponds to each of the gestational days.
Estimates of gene expression are based on the log2 Red/Green intensity ratios, hereafter referred to as ‘expression values’. Contrasts defined here estimate the differences in mean expression between each gestational day. The makeContrasts function in limma will generate these for you.
Some advantages of using limma over other methods include the ability to incorporate probe quality weights and to handle duplicate probes for each miRNA on the chip via the duplicateCorrelation function . These advantages are particularly evident in small sample sizes, as in this experiment. To make use of the duplicated probes, we first order the normalized data so that replicated probes are adjacent to each other. The probe quality weights are incorporated in the calculation of the correlation matrix for the duplicated probes.
Next, the lmFit function is used to fit the hierarchical linear model, and the contrasts.fit function used to get contrast estimates. The eBayes function generates the moderated (empirical Bayesian) t-statistics corresponding to each of the contrast estimates.
The topTable function calculates and reports fold change, moderated t-statistics, unadjusted and adjusted p-values for the comparison of interest. P-values are adjusted by the method of Benjamini & Hochberg , which controls the expected false discovery rate. Code below shows the calculation for the comparison between gestational days 13 and 12, and the results are given in Table 2. Results for comparisons between the other gestational days are omitted but code to calculate them is included in the R script “MmPalateMiRNA_SCBM.R”.
A nice summary of the results for the comparisons between gestational days is a Venn diagram, which gives the number of up- and down-regulated genes for each comparison, along with the number in the intersection of these sets (Figure 8).
Although we have focused on the calculation of test statistics corresponding to pairwise comparisons between gestational days, it is easy to obtain estimates for other contrasts of interests between the experimental conditions. For example, the the contr.poly function will provide contrasts to test for linear and quadratic trends, and the contr.helmert function gives the Helmert contrasts. To illustrate, we calculate analysis of variance (ANOVA) F-statistics for testing for differential expression between all three gestational days by combining two orthogonal contrasts, here using the contr.helmert function.
Next, the miRNAs with significant F-statistics (adjusted p<0.05) are identified for follow up examination, e.g. by clustering. The duplicates are averaged prior to further analysis.
Clustering expression profiles
After identifying the differentially expressed miRNAs, clustering analysis can be performed to group genes with similar trends over time. A common difficultly is deciding which clustering algorithm to use and how many clusters to create. Cluster validation measures, as contained in the R package clValid , can help in this regard. Below, the clValid function is used to evaluate hierarchical clustering, SOTA, DIANA, and K-means clustering algorithms, for a range of one to six clusters in each case. The expression values for each day are averaged over the two replicates prior to clustering (object aveExpr). The internal validation measures (connectivity, Dunn Index, and Silhouette Width) are used with a correlation metric. A summary of the result indicates that hierarchical clustering with six clusters provides the optimal connectivity and Dunn Index measures, while DIANA with six clusters gives the optimal Silhouette Width.
The results from hierarchical clustering with six clusters was subsequently selected for visually displaying the data, using the clustPlot function available in package MmPalateMiRNA. The expression values for each miRNA are scaled to mean zero and standard deviation one for ease of visualization. The display is given in Figure 9. The two predominant clusters are cluster one and cluster two, which correspond to those miRNAs which exhibit a linear upward and downward trend over the time course, respectively.
Determining miRNA target genes
To follow-up the results from the differentially expression and clustering analysis, the next step is to determine putative regulatory targets of the differentially expressed miRNAs. To illustrate, we identify the putative targets of the miRNAs contained in the first cluster in Figure 9. The miRNAs in the first cluster are evaluated for putative targets using the databases TargetScan  (package targetscan.Mm.eg.db ) and miRBase  (package microRNA ). The mouse specific miRNA names are first extracted and then converted to the standard nomenclature using the function miRNames, which is included in the accompanying R script.
Targetscan targets are obtained using the code below. The objects in the targetscan.Mm.eg.db package are Bimap objects, which are mappings from one set of keys (the left keys or Lkeys) to another (the right keys or Rkeys). We start by mapping the miRBase identifiers to their miRNA family names, then map the miRNA families to Entrez Gene identifiers of the targets in the TargetScan database. Several of the miRNAs of interest required slight modifications to their names prior to their mapping. The TargetScan database identifies 4,640 unique Entrez Gene identifiers as putative targets.
Mouse miRNA targets in the miRBase database are in the data frame mmTargets within the microRNA package and can be obtained using the code below. The targets are stored as Ensembl transcript identifiers. A total of 13,126 Ensembl transcripts are identified as putative targets.
Lastly, we take the intersection of the targets from TargetScan and miRBase as our set of putative targets. Ensembl transcript identifiers are firstly converted to Entrez Gene identifiers using the org.Mm.eg.db  Bioconductor package. The final list contains 2,080 Entrez Gene identifiers.
Gene set analysis
As a final step in our analysis, we take the putative miRNA targets from the intersection of the TargetScan and miRBase databases and perform gene set enrichment analysis on them, using the hypergeometric test from the GOstats package . Terms in the GO hierarchy are analyzed for over-representation of genes from our miRNA target list, relative to the total number from the mouse genome having that annotation. A GOHyperGParams object is created which contains the list of targets (selectedEntrezIds), the gene “universe” (entrezUniverse), the annotation database to use, the GO ontology, and direction and significance level of the test.
After the GOHyperGParams object has been created, the test can be conducted using the hyperGTest function. An html file summarizing the results can be created using the htmlReport function, which is available as Additional file 4 (“hgResult.pdf”). Particular categories of interest include GO:0060021 (palate development), GO:0048008 (platelet-derived growth factor receptor signaling pathway), GO:0060429 (epithelium development), GO:0030855 (epithelial cell differentiation), GO:0016331 (morphogenesis of embryonic epithelium), GO:0016055 (Wnt receptor signaling pathway), GO:0060828 (regulation of canonical Wnt receptor signaling pathway), GO:0008277 (regulation of G-protein coupled receptor protein signaling pathway), and GO:0007179 (transforming growth factor beta receptor signaling pathway).
As a final step, we evaluate the mature miRNA sequences and seed regions of the miRNAs which target the genes in a particular GO category. To illustrate, the GO category 0007179, transforming growth factor beta receptor signaling pathway, is used. Entrez Gene IDs belonging to this category are identified and intersected with the selected Entrez Gene IDs corresponding to cluster one of Figure 9. This results in 21 identified Entrez Gene IDs.
Next, these Entrez Gene IDs are reverse mapped back to the set of miRNAs which putatively target these genes. This produces a total of 19 identified miRNAs.
Lastly, the mature sequences and seed regions of these miRNAs are determined, using the mmSeqs database and seedRegions function in package microRNA. These sequences can be evaluated for any commonalities, to be used in determining potential targets for follow-up luciferase assays and other functional experiments . In this case, the sequences are rather heterogeneous, although the seed region “GAGGUA” does show up in four of the nineteen identified miRNAs.
It is important to note that some of the presented results may depend on the versions of the software packages that were used to produce them. The following gives the complete information of the R session under which the presented results were obtained.
In this paper, we present a complete analysis of miRNA data using R and Bioconductor, including preprocessing, normalization, differential expression, clustering, identification of target genes, and gene set enrichment analysis of putative miRNA gene targets. Though there are several papers in the literature which give an overview of the analysis of miRNA data, the MmPalateMiRNA package is unique in presenting a comprehensive analysis of miRNA data which is completely reproducible. Further, while the number of packages for analyzing miRNA array data in Bioconductor is continuing to expand (see, e.g., packages LVSmiRNA , miRNApath , RmiR , and ExiMiR ), the distinguishing characteristic of this package is that it integrates many of these recent advances into one central document. Thus, this article can serve as a template for other investigators to conduct their own analysis. Important aspects of selecting a normalization algorithm for miRNA data are illustrated, along with code for producing useful diagnostic plots to select an appropriate procedure . These functions are not readily accessible to users other than through the MmPalateMiRNA package. Advantages of using the limma package to fit advanced hierarchical models for testing differential expression are documented, along with code for testing comparisons between experimental groups of interest. Lastly, we illustrate the use of miRNA target databases which have been recently ported to Bioconductor for identifying putative gene targets of selected miRNAs, as well as how to test for enrichment in biological and functional categories among the putative miRNA targets. While the analysis we present here is fairly comprehensive, it is straightforward to use other software, such as Ingenuity Pathway Analysis  to build from the results presented in this article (see  as an example). The complete analysis in this article is freely available as a compendium in the form of an R package (MmPalateMiRNA, downloadable from Bioconductor ), along with accompanying documentation, code, and functions to perform all of the analysis.
Availability and requirements
Project name: MmPalateMiRNA: An R package compendium for murine palate miRNA expression analysisProject home page: http://www.bioconductor.org/packages/release/bioc/html/MmPalateMiRNA.htmlOperating system(s): Platform independentProgramming language: ROther requirements: R version 2.13.1 or higher , R packages lattice, latticeExtra, xtable, cluster, RSQLite, DBI, class, statmod, RColorBrewer, and clValid (available from CRAN ), and Bioconductor packages Biobase, limma, vsn, GOstats, Category, org.Mm.eg.db, microRNA, targetscan.Mm.eg.db, graph, AnnotationDbi, and multtest (available from Bioconductor )License: GNU GPL-3
Comprehensive R Archive Network
Gene Expression Omnibus
Kyoto Encyclopedia of Genes and Genomes
Mus Musculus (Mouse)
Partitioning Around Medoids
Significant Analysis of Microarrays
Self-Organizing Tree Algorithm
Zhang B, Wang Q, Pan X: MicroRNAs and their regulatory roles in animals and plants. J Cell Physiol. 2007, 210 (2): 279-289. 10.1002/jcp.20869.
Wang Y, Stricker HM, Gou D, Liu L: MicroRNA: past and present. Front Biosci. 2007, 12: 2316-2329. 10.2741/2234.
Bartel DP: MicroRNAs: target recognition and regulatory functions. Cell. 2009, 136 (2): 215-233. 10.1016/j.cell.2009.01.002.
Corney DC, Flesken-Nikitin A, Godwin AK, Wang W, Nikitin AY: MicroRNA-34b and microRNA-34c are targets of P53 and cooperate in control of cell proliferation and adhesion-independent growth. Cancer Res. 2007, 67 (18): 8433-8438. 10.1158/0008-5472.CAN-07-1585.
Zhan M, Miller CP, Papayannopoulou T, Stamatoyannopoulos G, Song CZ: MicroRNA expression dynamics during murine and human erythroid differentiation. Exp Hematol. 2007, 35 (7): 1015-1025. 10.1016/j.exphem.2007.03.014.
Kren BT, Wong PY, Sarver A, Zhang X, Zeng Y, Steer CJ: MicroRNAs identified in highly purified liver-derived Mitochondria may play a role in Apoptosis. RNA Biol. 2009, 6: 65-72. 10.4161/rna.6.1.7534.
Hicks JA, Tembhurne P, Liu HC: MicroRNA expression in chicken embryos. Poult Sci. 2008, 87 (11): 2335-2343. 10.3382/ps.2008-00114.
Lakshmipathy U, Love B, Goff LA, Jornsten R, Graichen R, Hart RP, Chesnut JD: MicroRNA expression pattern of undifferentiated and differentiated human embryonic stem cells. Stem Cells Dev. 2007, 16 (6): 1003-1016. 10.1089/scd.2007.0026.
Gottardo F, Liu CG, Ferracin M, Calin GA, Fassan M, Bassi P, Sevignani C, Byrne D, Negrini M, Pagano F, Gomella LG, Croce CM, Baffa R: Micro-RNA profiling in kidney and bladder cancers. Urol Oncol. 2007, 25 (5): 387-392. 10.1016/j.urolonc.2007.01.019.
Wang V, Wu W: MicroRNA-Based therapeutics for cancer. BioDrugs. 2009, 23: 15-23. 10.2165/00063030-200923010-00002.
Tatsuguchi M, Seok HY, Callis TE, Thomson JM, Chen JF, Newman M, Rojas M, Hammond SM, Wang DZ: Expression of microRNAs is dynamically regulated during Cardiomyocyte Hypertrophy. J Mol Cell Cardiol. 2007, 42 (6): 1137-1141. 10.1016/j.yjmcc.2007.04.004.
Ferretti E, De Smaele, Po A, Di Marcotullio, Tosi E, Espinola MS, Di Rocco, Riccardi R, Giangaspero F, Farcomeni A, Nofroni I, Laneve P, Gioia U, Caffarelli E, Bozzoni I, Screpanti I, Gulino A: MicroRNA profiling in human medulloblastoma. Int J Cancer. 2009, 124 (3): 568-577. 10.1002/ijc.23948.
R Development Core Team: R: A Language and Environment for Statistical Computing. 2011, Vienna: R Foundation for Statistical Computing, [http://www.R-project.org]
Gentleman RC, Carey VJ, Bates DM: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80-10.1186/gb-2004-5-10-r80. [http://www.bioconductor.org]
Bioconductor - Home. [http://www.bioconductor.org/]
Gregory Alvord, Roayaei JA, Quinones OA, Schneider KT: A microarray analysis for differential gene expression in the soybean genome using bioconductor and R. Brief Bioinform. 2007, 8 (6): 415-431. 10.1093/bib/bbm043.
Gentleman R: Reproducible research: a bioinformatics case study. Stat Appl Genet Mol Biol. 2005, 4: Article2.
Ruschhaupt M, Huber W, Poustka A, Mansmann U: A compendium to ensure computational reproducibility in high-dimensional classification tasks. Stat Appl Genet Mol Biol. 2004, 3: Article37.
Sarver AL: Toward understanding the informatics and statistical aspects of Micro-RNA profiling. J Cardiovasc Transl Res. 2010, 3 (3): 204-211. 10.1007/s12265-010-9180-z.
Thomson JM, Parker JS, Hammond SM: Microarray analysis of miRNA gene expression. Methods Enzymol. 2007, 427: 107-122.
Hua YJ, Tu K, Tang ZY, Li YX, Xiao HS: Comparison of normalization methods with MicroRNA microarray. Genomics. 2008, 92 (2): 122-128. 10.1016/j.ygeno.2008.04.002.
Rao Y, Lee Y, Jarjoura D, Ruppert AS, Liu CG, Hsu JC, Hagan JP: A comparison of normalization techniques for microRNA microarray data. Stat Appl Genet Mol Biol. 2008, 7: Article22
Pradervand S, Weber J, Thomas J, Bueno M, Wirapati P, Lefort K, Dotto GP, Harshman K: Impact of normalization on miRNA microarray expression profiling. RNA. 2009, 15 (3): 493-501. 10.1261/rna.1295509.
Risso D, Massa MS, Chiogna M, Romualdi C: A modified LOESS normalization applied to microRNA arrays: a comparative evaluation. Bioinformatics. 2009, 25 (20): 2685-2691. 10.1093/bioinformatics/btp443.
Wang B, Wang XF, Howell P, Qian X, Huang K, Riker AI, Ju J, Xi Y: A personalized microRNA microarray normalization method using a logistic regression model. Bioinformatics. 2010, 26 (2): 228-234. 10.1093/bioinformatics/btp655.
Sarkar D, Parkin R, Wyman S, Bendoraite A, Sather C, Delrow J, Godwin AK, Drescher C, Huber W, Gentleman R, Tewari M: Quality assessment and data analysis for microRNA expression arrays. Nucleic Acids Res. 2009, 37 (2): e17.
Bargaje R, Hariharan M, Scaria V, Pillai B: Consensus MiRNA expression profiles derived from interplatform normalization of microarray data. RNA. 2010, 16: 16-25. 10.1261/rna.1688110.
Mukhopadhyay P, Brock G, Pihur V, Webb C, Pisano MM, Greene RM: Developmental microRNA expression profiling of murine embryonic orofacial tissue. Birth Defects Res Part A. 2010, 88 (7): 511-534. 10.1002/bdra.20684.
Miltenyi Biotec GmbH: Miltenyi Biotec: Products & Services for Biomedical Research. 2011, Bergisch Gladbach, [http://www.miltenyibiotec.com]
Brock G, Mukhopadhyay P, Pihur V, Greene RM, Pisano MM: MmPalateMiRNA: Murine Palate miRNA Expression Analysis. 2012, [http://www.bioconductor.org/packages/release/bioc/html/MmPalateMiRNA.html]. [R package version 1.8.0].
GEO DataSets home. [http://www.ncbi.nlm.nih.gov/gds/]
Kauffmann A, Gentleman R, Huber W: arrayQualityMetrics–a bioconductor package for quality assessment of microarray data. Bioinformatics. 2009, 25 (3): 415-6. 10.1093/bioinformatics/btn647.
Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002, 18 (Suppl 1): S96—S104.
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98 (9): 5116-5121. 10.1073/pnas.091062498.
Efron B, Tibshirani R, Storey JD, V T: Empirical bayes analysis of a microarray experiment. J Am Stat Assoc. 2001, 96: 1151-1160. 10.1198/016214501753382129.
Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3: Article 3.
Smyth GK: Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, New York: Springer-Verlag, 397-420.
Kaufman L, Rousseeuw PJ: Finding Groups in Data. An Introduction to Cluster Analysis. 1990, New York: John Wiley & Sons
Fraley C, Raftery AE: Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2001, 17: 126-136.
Dopazo J, Carazo JM: Phylogenetic reconstruction using a growing neural network that adopts the topology of a phylogenetic tree. J Mol Evol. 1997, 44: 226-233. 10.1007/PL00006139.
Brock G, Pihur V, Datta S, Datta S: ,clValid, an R package for cluster validation. J Stat Softw. 2008, 25 (4):
Alexiou P, Maragkakis M, Papadopoulos GL, Reczko M, Hatzigeorgiou AG: Lost in translation: an assessment and perspective for computational microRNA target identification. Bioinformatics. 2009, 25 (23): 3049-3055. 10.1093/bioinformatics/btp565.
Faverom F: RmiR.Hs.miRNA: Various Databases of microRNA Targets. 2012, [http://www.bioconductor.org/packages/release/data/annotation/html/RmiR.Hs.miRNA.html]. [R package version 1.0.6]
Griffiths-Jones S, Saini HK, van Dongen, Enright AJ: miRBase: Tools for microRNA genomics. Nucleic Acids Res. 2008, 36 (Database issue): D154-D1558. [http://www.ebi.ac.uk/enright-srv/microcosm/htdocs/targets/v5/]
Reid J: miRBase: The microRNA Database. 2012, [http://www.bioconductor.org/packages/release/data/annotation/html/mirbase.db.html]. [R package version 1.1.0]
Gentleman R, Falcon S: , microRNA: Data and Functions for Dealing with microRNAs. 2012, [http://www.bioconductor.org/packages/release/bioc/html/microRNA.html]. [R package version 1.16.0]
Lewis BP, Burge CB, Bartel DP: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005, 120: 15-20. 10.1016/j.cell.2004.12.035. [http://www.targetscan.org/]
Csardi G: targetscan.Hs.eg.db: TargetScan miRNA Target Predictions for Human. 2012, [http://www.bioconductor.org/packages/release/data/annotation/html/targetscan.Hs.eg.db.html]. [R package version 0.6.0]
Csardi G: , targetscan.Mm.eg.db: TargetScan miRNA Target Predictions for Mouse. 2012, [http://www.bioconductor.org/packages/release/data/annotation/html/targetscan.Mm.eg.db.html]. [R package version 0.6.0]
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102 (43): 15545-15550. 10.1073/pnas.0506580102.
The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nature Genet. 2000, 25: 25-29. 10.1038/75556. [http://www.geneontology.org/]
Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28: 27-30. 10.1093/nar/28.1.27. [http://www.genome.jp/kegg/.]
Falcon S, Gentleman R: Using GOstats to test gene lists for GO term association. Bioinformatics. 2007, 23 (2): 257-258. 10.1093/bioinformatics/btl567.
Gentleman R, with contributions from Falcon S, Sarkar D: Category: Category Analysis. 2012, [http://www.bioconductor.org/packages/release/bioc/html/Category.html]. [R package version 2.24.0]
Sarkar D, Andrews F: latticeExtra: Extra Graphical Utilities Based on Lattice. 2012, [http://CRAN.R-project.org/package=latticeExtra]. [R package version 0.6-24]
Carlson M, Falcon S, Pages H, Li N: org.Mm.eg.db: Genome Wide Annotation for Mouse. 2012, http://www.bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html. [R package version 2.8.0].
Smyth G, with contributions from Hu Y, Dunn P, Phipson B: statmod: Statistical Modeling. 2012, [http://CRAN.R-project.org/package=statmod]. [R package version 1.4.16].
Sarkar D: Lattice: Multivariate Data Visualization with R. 2008, New York: Springer, [http://lmdvr.r-forge.r-project.org]. [ISBN 978-0-387-75968-5].
Dahl DB: , xtable: Export tables to LaTeX or HTML. 2012, [http://CRAN.R-project.org/package=xtable]. [R package version 1.7-0]
BioDiscovery Inc: ImaGene: Leading-Edge Microarray Analysis Software. 2011, El Segundo, CA, [http://www.biodiscovery.com/software/imagene/]. [Version 9.0].
Gentleman R, Carey V, Hahne F, Huber W: Genefilter: Methods for Filtering Genes from Microarray Experiments. 2012, [http://www.bioconductor.org/packages/release/bioc/html/genefilter.html]. [R package version 1.40.0].
Smyth GK, Michaud J, Scott HS: Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics. 2005, 21 (9): 2067-2075. 10.1093/bioinformatics/bti270.
Oh S, Kang DD, Brock GN, Tseng GC: Biological impact of missing-value imputation on downstream analyses of gene expression profiles. Bioinformatics. 2010, 27: 78-86.
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics. 2001, 17 (6): 520-205. 10.1093/bioinformatics/17.6.520.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995, 57: 289-300.
Gentleman R, Falcon S: GOstats: Tools for Manipulating GO and Microarrays. 2012, [http://www.bioconductor.org/packages/release/bioc/html/GOstats.html]. [R package version 2.24.0].
Zhang Z, Florez S, Gutierrez-Hartmann A, Martin JF, Amendt BA: MicroRNAs regulate pituitary development, and MicroRNA 26b specifically targets lymphoid enhancer factor 1 (Lef-1), which modulates pituitary transcription factor 1 (Pit-1) expression. J Biol Chem. 2010, 285 (45): 34718-28. 10.1074/jbc.M110.126441.
Calza S, Chen S, Pawitam Y: LVSmiRNA: LVS Normalization for Agilent MiRNA Data. 2012, [http://www.bioconductor.org/packages/release/bioc/html/LVSmiRNA.html]. [R package version 1.8.0].
Cogswell JP, Ward JM, Taylor IA, Waters M, Shi Y, Cannon B, Kelnar K, Kemppainen J, Brown D, Chen C, Prinjha RK, Richardson JC, Saunders AM, Roses AD, Richards CA: Identification of miRNA changes in Alzheimer’s Disease Brain and CSF yields putative biomarkers and insights into disease pathways. J Alzheimer’s Dis. 2008, 14: 27-41.
Favero F: , RmiR: Package to Work with miRNAs and miRNA Targets with R. 2012, [http://www.bioconductor.org/packages/release/bioc/html/RmiR.html]. [R package version 1.14.0]
Gubian S, SA P, Sewer A: ExiMiR: R Functions for the Normalization of Exiqon miRNA Array Data. 2012, [http://www.bioconductor.org/packages/release/bioc/html/ExiMiR.html]. [R package version 2.0.0].
Ingenuity Systems: Ingenuity Pathway Analysis. 2011, Redwood City, [http://www.ingenuity.com]
This research was supported in part by NIH grants DE018215, HD053509, and P20 RR017702 from the COBRE program of the National Center for Research Resources, and DOE grant 10EM00542.
The authors declare that they have no competing interests.
GB produced the compendium MmPalateMiRNA, drafted the manuscript, and guided the statistical analysis. VP conducted the original statistical analysis. CW assisted with the biological experiments. MMP and RMG oversaw the project, and helped to draft the manuscript. PM performed the biological experiments to obtain the miRNA data, wrote the original paper on which the compendium is based, and helped draft the manuscript. All authors proofread and approved the final version of the manuscript.
Electronic supplementary material
Additional file 4: “hgResult.pdf”. Significantly enriched GO biological process (BP) categories, based on the putative set of targets of differentially expressed miRNAs. P-value was based on the hypergeometric test, with all murine Entrez Gene ID entries used as the gene “universe” for comparison. For more details on how to obtain the results, see the subsection Gene Set Analysis under Results and Discussion. (PDF 310 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Brock, G.N., Mukhopadhyay, P., Pihur, V. et al. MmPalateMiRNA, an R package compendium illustrating analysis of miRNA microarray data. Source Code Biol Med 8, 1 (2013). https://doi.org/10.1186/1751-0473-8-1
- miRNA Target
- Quantile Normalization
- Diagnostic Plot
- Bioconductor Package
- Transform Growth Factor Beta Receptor